Why the service Command Is Lying to Me
Prologue
Cost reduction has been an important goal in our company. After some investigation, I found that some EBS data volumes holding PostgreSQL data were allocated 300 GB of space while only a few megabytes were actually used.
So in 2018 I developed an Ansible tool to shrink the EBS data volumes of these instances. The scripts usually applied smoothly, but in December 2018 I ran into an unexpected failure related to the Linux service command.
In this article I will write down the troubleshooting process, hoping to provide some clues for readers who hit similar problems.
Background
EBS Shrink Scripts
The scripts basically do the following things (a rough shell sketch follows the list):
1. Stop the PostgreSQL service.
2. Create snapshots of the old volumes.
3. Create a new data volume from the snapshot.
4. Create a new empty volume of the target size.
5. Copy the data from the old data volume to the new empty volume using rsync.
6. Make sure the contents of the old and new volumes are identical using diff.
7. Unmount and detach the old data volume and attach the new one.
8. Restart the PostgreSQL service.
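For illustration only, the procedure corresponds roughly to the shell sketch below. This is not the actual Ansible playbook: the volume and device IDs, the instance ID, the availability zone, the 50 GB target size and the /mnt/newdata mount point are all placeholders, and the AWS CLI calls are simplified (the real script also waits for snapshots and volumes to become available before moving on).
# 1: stop PostgreSQL so the data directory is quiescent
sudo service postgresql stop
# 2-4: snapshot the old volume, restore a backup copy from it, create the smaller target volume
aws ec2 create-snapshot --volume-id vol-OLD
aws ec2 create-volume --snapshot-id snap-BACKUP --availability-zone ap-northeast-1a
aws ec2 create-volume --size 50 --availability-zone ap-northeast-1a
aws ec2 attach-volume --volume-id vol-NEW --instance-id i-TARGET --device /dev/xvdf
# 5-6: copy the data onto the new volume and verify that the contents match
sudo mkfs.ext4 /dev/xvdf && sudo mount /dev/xvdf /mnt/newdata
sudo rsync -aHAX /postgresql/ /mnt/newdata/
sudo diff -r /postgresql /mnt/newdata
# 7: swap the volumes so the new one sits at the original mount point
sudo umount /mnt/newdata /postgresql
aws ec2 detach-volume --volume-id vol-OLD
sudo mount /dev/xvdf /postgresql
# 8: bring PostgreSQL back up
sudo service postgresql start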
Problem
One day, an operator reported that the script had failed on a commercial environment after a long execution time. Both the old and the newly created data volumes were still mounted, and the service command gave the following output:
~ $ sudo service postgresql status
postgresql stop/waiting
~ $ sudo service postgresql start
postgresql start/running, process 6102
~ $ sudo service postgresql status
postgresql stop/waiting
The status of the PostgreSQL service became stop/waiting immediately after executing the start command.
Troubleshooting Process
First Try to Recover the PostgreSQL Service
From the description, it could easily be inferred that the EBS shrink script had failed at step 7, since the original data volume had not yet been unmounted or removed.
I thought it would be easy to fix: simply restart the PostgreSQL service and remove the extra volumes to restore the environment.
However, sudo service postgresql status kept reporting that the PostgreSQL service was not running; even restarting the instance did not help.
Locating the Problem Source
At first I wanted to find out whether this was related to a specific product version, in order to get the PostgreSQL service up and running as soon as possible. So I selected a testing environment with the same version, but there everything worked properly, as shown below:
~ $ sudo service postgresql status
pg_ctl: server is running (PID: 5822)
/usr/local/pgsql/bin/postgres "-D" "/postgresql/data"
Because our script did not touch the PostgreSQL configuration at all, we then assumed that the problem was bound to that specific environment.
When Linux boots, the kernel comes up first. Afterwards, an init system is invoked to initialize the system processes that are necessary for use. The init system runs as the init process with PID 1 and spawns all other processes beneath it, constructing a process tree.
We confirmed with the uname -a command that the environment was running Ubuntu 14.04. Ubuntu 14.04 uses the upstart init system, and it also stays compatible with traditional SysV init scripts placed under the /etc/init.d/ directory. The user may use the service command to manipulate running services.
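A few commands can help confirm which init system is actually in charge. These are suggestions rather than part of the original session, and the exact output varies by environment:
~ $ /sbin/init --version         # upstart prints something like "init (upstart 1.12.1)"
~ $ lsb_release -ds              # prints the distribution, e.g. "Ubuntu 14.04 LTS"
~ $ initctl list | grep postgres # lists upstart jobs and their current state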
So we asked the operator to view the content of the postgresql script under the /etc/init.d/ directory:
~ $ sudo cat /etc/init.d/postgresql
cat: /etc/init.d/postgresql: No such file or directory
We were surprised that the postgresql script did not exist! This should never happen in a normal Ubuntu 14.04 environment. We checked the corresponding testing environment, and there the file did exist. At this point, I was sure that the commercial environment was broken.
Finding the Ghost Process of PostgreSQL
We then investigated the log of the upstart init system and got the following output:
pg_ctl: another server might be running; trying to start server anyway
server starting
pg_ctl: another server might be running; trying to start server anyway
server starting
[2018-12-13 10:54:03 JST][][][][1867] FATAL: lock file "postmaster.pid" already exists
[2018-12-13 10:54:03 JST][][][][1867] HINT: Is another postmaster (PID 1027) running in data directory "/postgresql/data"?
This partly explained why restarting PostgreSQL kept failing: PostgreSQL could not start because the postmaster.pid lock file was already held by another postmaster. We then tried the following command:
/var/run# ps aux | grep postgres
postgres 1027 0.0 0.8 560320 73300 ? S 10:47 0:00 /usr/local/pgsql/bin/postgres
postgres 1093 0.0 0.0 26752 2132 ? Ss 10:47 0:00 postgres: logger process
postgres 1099 0.0 0.0 560320 2964 ? Ss 10:47 0:00 postgres: checkpointer process
postgres 1100 0.0 0.0 560320 6648 ? Ss 10:47 0:00 postgres: writer process
postgres 1101 0.0 0.0 560320 2964 ? Ss 10:47 0:00 postgres: wal writer process
postgres 1102 0.0 0.0 560752 5308 ? Ss 10:47 0:00 postgres: autovacuum launcher process
postgres 1103 0.0 0.0 28972 3020 ? Ss 10:47 0:00 postgres: stats collector process
root 2295 0.0 0.0 11424 2196 pts/0 S+ 11:11 0:00 grep --color=auto postgres
Aha! PostgreSQL was actually running, but for some unknown reason it was not monitored by the upstart system.
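As a quick cross-check (a suggestion, not a step we recorded at the time), the first line of PostgreSQL's lock file holds the PID of the postmaster that owns it, so it can be compared with the PID shown by ps:
~ $ sudo head -1 /postgresql/data/postmaster.pid # should print 1027, the PID from the HINT above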
Then we tried to kill the running process and restart it using the service command:
/var/run# kill -9 1027
/var/run# ps aux | grep post
root 2310 0.0 0.0 11420 2172 pts/0 S+ 11:12 0:00 grep --color=auto post
/var/run# service postgresql status
postgresql stop/waiting
/var/run# service postgresql start
postgresql start/running, process 2337
/var/run# service postgresql status
postgresql stop/waiting
I suspected that the service command was using the wrong arguments to start PostgreSQL. So we killed the PostgreSQL process 2337 and started it manually, using the correct command copied from another, healthy environment:
~ $ sudo -u postgres /usr/local/pgsql/bin/pg_ctl -D /postgresql/data start
pg_ctl: another server might be running; trying to start server anyway
server starting
[2018-12-13 11:18:55 JST][][][][2478] LOG: redirecting log output to logging collector process
[2018-12-13 11:18:55 JST][][][][2478] HINT: Future log output will appear in directory "/postgresql/log".
It seemed to work, but the service command still insisted that the PostgreSQL process was dead:
~ $ service postgresql status
postgresql stop/waiting
~ $ sudo -u postgres /usr/local/pgsql/bin/pg_ctl -D /postgresql/data status
pg_ctl: server is running (PID: 2478)
/usr/local/pgsql/bin/postgres "-D" "/postgresql/data"
At this point I was sure that the service command was lying and the PostgreSQL service was running properly.
Why the service Command Is Lying
Our script was written as an Ansible playbook, and the official documentation of Ansible's service module says that the supported init systems include BSD init, OpenRC, SysV, Solaris SMF, systemd and upstart. Why was even the great Ansible service module not working? After some investigation, I finally found the root cause of the problem.
In environments created before product version 17.06, the instances used upstart to manage the PostgreSQL service. The upstart way is to put a configuration file under the /etc/init/ directory and write the corresponding commands in that file. Newer instances use the SysV way to manage the PostgreSQL service, i.e. a script file under the /etc/init.d/ directory. Both ways are supported by Ubuntu 14.04.
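To see which of the two definitions exists for a service on a given machine, it is enough to check both locations (a simple suggestion, not part of the original investigation):
~ $ ls -l /etc/init/postgresql.conf # upstart job definition
~ $ ls -l /etc/init.d/postgresql    # SysV init script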
However, there used to be a bug in the upgrade scripts that left the conversion between the two ways mentioned above half done. The bug was fixed in a later commit, but in environments created before that commit the file /etc/init.d/postgresql did not exist, while /etc/init/postgresql.conf still existed.
The service command is compatible with both ways. It first checks whether /etc/init/postgresql.conf exists and, if so, drives the service through the upstart commands defined there. Otherwise, the service command checks whether /etc/init.d/postgresql exists and executes that script instead. Because of the leftover upstart configuration, the service command could not manage the state of PostgreSQL correctly.
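The lookup order can be illustrated with a small shell sketch. It is a simplification for the status action only, not the actual source of /usr/sbin/service:
SERVICE=postgresql
if [ -e "/etc/init/${SERVICE}.conf" ]; then
    # an upstart job definition exists, so upstart is asked for the status
    # (this is where "postgresql stop/waiting" came from)
    status "${SERVICE}"
elif [ -x "/etc/init.d/${SERVICE}" ]; then
    # otherwise fall back to the SysV init script
    "/etc/init.d/${SERVICE}" status
fi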
The End of the Story
Now that the root cause was found, I wrote another script that fixes the problem by renaming the /etc/init/postgresql.conf file and creating the /etc/init.d/postgresql file from our existing templates. A check was also added to the EBS shrink script so that affected environments are fixed automatically.
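The fix essentially boils down to the following steps, shown here as a minimal shell sketch. The template path is a placeholder, and in reality the work was done by an Ansible task:
# disable the stale upstart job so the service command stops picking it up
sudo mv /etc/init/postgresql.conf /etc/init/postgresql.conf.bak
# install the SysV init script from the product template (placeholder path)
sudo install -m 755 /path/to/templates/postgresql.init /etc/init.d/postgresql
sudo update-rc.d postgresql defaults # register it for the default runlevels
sudo service postgresql status       # now resolved through /etc/init.d/postgresql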
Conclusion
From this event I learned some lessons and got some inspiration:
- The most important thing: a production (commercial) environment may be much more complicated than you think. You can never be prepared enough.
- Old environments may not behave as expected for historical reasons. It is important to get the service up and running first, and only then investigate and fix the root cause.
- There are many hidden details in the tools we use every day. Personally, I never knew that /etc/init/xxx.conf is actually the upstart way to manage services.
I hope this little story brings you something useful.
Thanks for reading.