Why the service Command Is Lying to Me

Prologue

Cost reduction has been an important goal in our company. After some investigation, I found that some EBS data volumes holding PostgreSQL data had been allocated 300 GB of space, while only a few megabytes were actually used.

So in 2018 I developed an Ansible tool to reduce the size of the EBS data volumes on these instances. The scripts ran smoothly, but in December 2018 I encountered an unexpected failure related to the Linux service command.

In this article I will write down the troubleshooting process, in the hope that it provides some clues if you ever run into a similar problem.

Background

EBS Shrink Scripts

The scripts basically do the following things (a rough shell sketch of steps 5 to 7 follows the list):

  1. Stop the PostgreSQL service.

  2. Create snapshots of the old volumes.

  3. Create a new data volume from the snapshot.

  4. Create a new empty volume of the target size.

  5. Copy the data from the old data volume to the new empty volume using rsync.

  6. Verify that the contents of the old and new volumes are identical using diff.

  7. Unmount and detach the old data volume, then attach the new one.

  8. Restart the PostgreSQL service.
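
For steps 5 to 7, a minimal shell sketch looks roughly like the following. The device name and mount points are hypothetical placeholders for illustration; the real playbook derives them from the instance configuration, and the detach/attach operations themselves go through the EC2 API rather than the shell:

# 5. Copy the data from the old volume to the new, smaller one
sudo rsync -aHAX /mnt/old-data/ /mnt/new-data/

# 6. Verify that the two trees have the same contents
sudo diff -r /mnt/old-data /mnt/new-data

# 7. Unmount both volumes, then mount the new one at the original mount point
sudo umount /mnt/old-data /mnt/new-data
sudo mount /dev/xvdg /postgresql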

Problem

One day, an operator reported that the script had failed on a commercial environment after a long execution time. The old and newly created data volumes were still mounted, and the service command gave the following output:

~ $ sudo service postgresql status
postgresql stop/waiting

~ $ sudo service postgresql start
postgresql start/running, process 6102

~ $ sudo service postgresql status
postgresql stop/waiting

The status of the PostgreSQL service became stop/waiting again immediately after the start command was executed.

Troubleshooting Process

First Try to Recover the PostgreSQL Service

From the description, it could easily be inferred that the EBS shrink script had failed at step 7, as the original data volume had not yet been unmounted or detached.

I thought it would be easy to fix: simply restart the PostgreSQL service and remove the extra volumes to restore the environment.

However, sudo service postgresql status kept reporting that the PostgreSQL service was not running, and even restarting the instance did not help.

Targeting the Problem Source

At first I wanted to find out whether this was related to a specific product version, in order to get the PostgreSQL service up and running as soon as possible. So I selected a testing environment with the same version, but everything there worked properly, as shown below:

~ $ sudo service postgresql status
pg_ctl: server is running (PID: 5822)
/usr/local/pgsql/bin/postgres "-D" "/postgresql/data"

Because our script did not touch the PostgreSQL configuration at all, we assumed that the problem was bound to that specific environment.

When Linux boots, the kernel starts first. Afterwards, an init system is invoked to initialize the system processes needed for operation. The init process runs with PID 1, and all other processes are started underneath it, forming a process tree.
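
If you want to see this on a running machine, PID 1 and the init flavor can be inspected directly. These are generic checks whose exact output varies between systems, not commands taken from the affected environment:

~ $ ps -p 1 -o pid,comm        # PID 1 is the init process
~ $ /sbin/init --version       # on an upstart system this prints the upstart version
~ $ pstree -p | head           # the process tree rooted at init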

We confirmed with the uname -a command that the environment was running Ubuntu 14.04. Ubuntu 14.04 uses the upstart init system, which also remains backward compatible with SysV-style scripts under the /etc/init.d/ directory. In either case, the user can use the service command to manipulate running services.

So we asked the operator to view the content of the postgresql file under that directory.

~ $ sudo cat /etc/init.d/postgresql
cat: /etc/init.d/postgresql: No such file or directory

We were surprised that the postgresql script did not exist! This should never happen in a normal Ubuntu 14.04 environment. We checked the corresponding testing environment, and the file did exist there. At this point I was sure that the commercial environment was broken.

Finding the Ghost Process of PostgreSQL

We then investigated the log of the upstart init system and got the following output:

pg_ctl: another server might be running; trying to start server anyway
server starting
pg_ctl: another server might be running; trying to start server anyway
server starting
[2018-12-13 10:54:03 JST][][][][1867] FATAL:  lock file "postmaster.pid" already exists
[2018-12-13 10:54:03 JST][][][][1867] HINT:  Is another postmaster (PID 1027) running in data directory "/postgresql/data"?

This partly explained why restarting PostgreSQL kept failing: PostgreSQL could not start because the lock file postmaster.pid already existed. We then tried the following command:

/var/run# ps aux | grep postgres
postgres       1027  0.0  0.8 560320 73300 ?        S    10:47   0:00 /usr/local/pgsql/bin/postgres
postgres       1093  0.0  0.0  26752  2132 ?        Ss   10:47   0:00 postgres: logger process
postgres       1099  0.0  0.0 560320  2964 ?        Ss   10:47   0:00 postgres: checkpointer process
postgres       1100  0.0  0.0 560320  6648 ?        Ss   10:47   0:00 postgres: writer process
postgres       1101  0.0  0.0 560320  2964 ?        Ss   10:47   0:00 postgres: wal writer process
postgres       1102  0.0  0.0 560752  5308 ?        Ss   10:47   0:00 postgres: autovacuum launcher process
postgres       1103  0.0  0.0  28972  3020 ?        Ss   10:47   0:00 postgres: stats collector process
root      2295  0.0  0.0  11424  2196 pts/0    S+   11:11   0:00 grep --color=auto postgres

Aha! PostgreSQL was actually running, but for some unknown reason it was not being tracked by the upstart system.

Then we tried to kill the running process and restart it using the service command:

/var/run# kill -9 1027

/var/run# ps aux | grep post
root      2310  0.0  0.0  11420  2172 pts/0    S+   11:12   0:00 grep --color=auto post

/var/run# service postgresql status
postgresql stop/waiting

/var/run# service postgresql start
postgresql start/running, process 2337

/var/run# service postgresql status
postgresql stop/waiting

I suspected that the service command was using the wrong arguments when starting PostgreSQL. So we killed PostgreSQL process 2337 and started it manually, using the correct command copied from another healthy environment:

~ $ sudo -u postgres /usr/local/pgsql/bin/pg_ctl -D /postgresql/data start
pg_ctl: another server might be running; trying to start server anyway
server starting
[2018-12-13 11:18:55 JST][][][][2478] LOG:  redirecting log output to logging collector process
[2018-12-13 11:18:55 JST][][][][2478] HINT:  Future log output will appear in directory "/postgresql/log".

It seemed to work, but the service command still insisted that the PostgreSQL process was dead:

~ $ service postgresql status
postgresql stop/waiting

~ $ sudo -u postgres /usr/local/pgsql/bin/pg_ctl -D /postgresql/data status
pg_ctl: server is running (PID: 2478)
/usr/local/pgsql/bin/postgres "-D" "/postgresql/data"

At this point I was sure that the service command was lying and the PostgreSQL service was running properly.

Why the service Command Is Lying

Our script was written as an Ansible playbook, and the official documentation of Ansible's service module says that the supported init systems include BSD init, OpenRC, SysV, Solaris SMF, systemd and upstart. Why was even the great Ansible service module not working? After some investigation, I finally found the root cause of the problem.

In environments created before product version 17.06, the instances used upstart to manage the PostgreSQL service: the upstart way is to put a configuration file under the /etc/init/ directory and write the corresponding commands in that file. Newer instances used the SysV way, i.e. a script file under the /etc/init.d/ directory. Both ways are supported on Ubuntu 14.04.

However, there used to be a bug in the upgrade scripts that prevented the conversion between these two mechanisms from being done properly. The bug was fixed in a later commit, but in environments created before that commit, the file /etc/init.d/postgresql did not exist while /etc/init/postgresql.conf still did.

The service command is compatible with both mechanisms. It first checks whether /etc/init/postgresql.conf exists and, if so, executes the commands defined in that upstart job. Otherwise, it checks whether /etc/init.d/postgresql exists and runs that script instead. Because of the leftover upstart configuration, the service command could not manage the state of PostgreSQL correctly.
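
A quick way to see which of the two mechanisms the service command will pick on a given machine is to check for the two files directly and to ask upstart itself. These are generic diagnostic commands, not part of our tooling:

~ $ ls -l /etc/init/postgresql.conf    # upstart job: service dispatches here first if it exists
~ $ ls -l /etc/init.d/postgresql       # SysV script: used only when no upstart job is found
~ $ sudo initctl status postgresql     # what upstart itself thinks about the job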

The End of the Story

Once the root cause was found, I wrote another script to fix the problem by renaming the /etc/init/postgresql.conf file and creating the /etc/init.d/postgresql script from our existing templates. A check was also added to the EBS shrink script so that affected environments are fixed automatically.
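
The fix essentially boils down to a few shell steps. The sketch below only illustrates the idea; the template path and the rename target are hypothetical placeholders, not the actual files used in our product:

# Disable the stale upstart job: upstart only reads files ending in .conf under /etc/init/
sudo mv /etc/init/postgresql.conf /etc/init/postgresql.conf.disabled

# Install the SysV-style script from a template (hypothetical path) and register it
sudo cp /path/to/templates/postgresql.init /etc/init.d/postgresql
sudo chmod +x /etc/init.d/postgresql
sudo update-rc.d postgresql defaults

# With the upstart job out of the way, service now dispatches to the SysV script
sudo service postgresql status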

Conclusion

From this event I learned some lessons and got some inspiration:

  1. Most importantly, a production (commercial) environment can be much more complicated than you think. You can never be prepared enough.

  2. Old environments might not work as expected for historical reasons. It is important to get the service up and running first, and only then investigate and fix the root cause.

  3. There are many hidden details in the tools we use every day. Personally, I never knew that /etc/init/xxx.conf is actually the upstart way of managing services.

I hope this little story brings you something useful.

Thanks for reading.


2019-07-10
