Note: This article applies to Python 2 environments.
Background
In Unix/Linux programming, we use the fork + exec mechanism to create a child process and replace it with the program we want to run. We then call wait() or waitpid() to wait for the child process to return.
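As a minimal sketch of the fork + exec + waitpid() sequence described above (the helper name run_and_wait is hypothetical, and the code is written to run on both Python 2.7 and Python 3):

```python
import os
import sys


def run_and_wait(argv):
    """Fork, exec argv in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the target program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Reached only if exec failed; exit without running parent cleanup.
            os._exit(127)
    # Parent: block until the child terminates, then decode its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    # Terminated by a signal: report the signal number as a negative value.
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    code = run_and_wait([sys.executable, "-c", "import sys; sys.exit(3)"])
    print("child exit code: %d" % code)
```

The blocking os.waitpid(pid, 0) call is exactly where the parent gets stuck if the child never returns.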
But in some scenarios, the child process blocks and hangs for some unexpected reason (e.g. the rsync program is likely to block because of network issues). The child process then never returns, leaving the parent process (probably the main process) hanging too.
Problem
It’s natural to have the parent process manage the child process and control its life cycle. More specifically, we can set a maximum time for the child process to return. If the child process doesn’t return within the specified time, the parent process can simply kill it.
We can verify our idea using Python:
#!/usr/bin/env python2.7
# coding=utf-8
# vim: ts=4 sw=4

import os
import errno
import time
import signal


def kill_operation(pid, signal):
    # Returns False if the signal could not be delivered, True otherwise.
    try:
        os.kill(pid, signal)
    except OSError as err:
        if err.errno == errno.EPERM or err.errno == errno.ESRCH:
            return False
    return True


pid = os.fork()
if pid == 0:
    print "in child process ..."
    time.sleep(5*60)
    os._exit(0)
else:
    print "parent process %d, sub process id %d" % (os.getpid(), pid)

    max_sleep_time = 20
    current_sleep_time = 0
    sleep_interval = 5
    existence = True
    while existence:
        if (current_sleep_time < max_sleep_time):
            try:
                os.waitpid(pid, os.WNOHANG)
            except OSError as err:
                if err.errno == errno.ECHILD:
                    break
            time.sleep(sleep_interval)
            current_sleep_time += sleep_interval
        else:
            # The label says "kill 9", but the signal sent is 15 (SIGTERM).
            print("kill 9\t\t" + str(kill_operation(pid, 15)))

        # Signal 0 only checks whether the process still exists.
        existence = kill_operation(pid, 0)
        print("kill 0\t\t" + str(existence))

exit(0)
However, the code above doesn’t work as expected: the child process doesn’t go away at all. Here is the output of the program and of the ps command:
The program output:
% ./fork_exec.bug.py
parent process 1680, sub process id 1681
in child process ...
kill 0 True
kill 0 True
kill 0 True
kill 0 True
kill 9 True
kill 0 True
kill 9 True
ps command output before the child process was killed:
% ps -efww | grep -v grep | grep python2.7
nostalg+ 1680 2442 0 22:41 pts/55 00:00:00 python2.7 ./fork_exec.bug.py
nostalg+ 1681 1680 0 22:41 pts/55 00:00:00 python2.7 ./fork_exec.bug.py
ps command output after the child process was killed:
% ps -efww | grep -v grep | grep python2.7
nostalg+ 1680 2442 0 22:41 pts/55 00:00:00 python2.7 ./fork_exec.bug.py
nostalg+ 1681 1680 0 22:41 pts/55 00:00:00 [python2.7] <defunct>
Workaround
We can see that the killed child process became a zombie (defunct) process.
About zombie processes, from Wikipedia:
On Unix and Unix-like computer operating systems, a zombie process or defunct process is a process that has completed execution (via the exit system call) but still has an entry in the process table: it is a process in the "Terminated state". This occurs for child processes, where the entry is still needed to allow the parent process to read its child's exit status: once the exit status is read via the wait system call, the zombie's entry is removed from the process table and it is said to be "reaped". A child process always first becomes a zombie before being removed from the resource table. In most cases, under normal system operation zombies are immediately waited on by their parent and then reaped by the system – processes that stay zombies for a long time are generally an error and cause a resource leak.
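This is why kill(pid, 0) kept returning True in the output above: a zombie still has a process table entry, so signal 0 can still be "delivered" to it. A small sketch of that behaviour follows; the helper names is_alive and zombie_demo are hypothetical, and the 0.5 second pause is an assumption that the kernel has terminated the child by then.

```python
import errno
import os
import signal
import time


def is_alive(pid):
    """Return True if pid still has a process table entry (signal 0 probe)."""
    try:
        os.kill(pid, 0)
    except OSError as err:
        if err.errno == errno.ESRCH:
            return False
        raise
    return True


def zombie_demo():
    """Kill a child, then probe it before and after reaping."""
    pid = os.fork()
    if pid == 0:
        time.sleep(60)  # child: pretend to hang
        os._exit(0)
    os.kill(pid, signal.SIGTERM)
    time.sleep(0.5)          # assumed long enough for the child to terminate
    before = is_alive(pid)   # terminated but unreaped: the entry is still there
    os.waitpid(pid, 0)       # reap the child, removing the entry
    after = is_alive(pid)
    return before, after


if __name__ == "__main__":
    print("before reaping: %s, after reaping: %s" % zombie_demo())
```

The probe should succeed before the waitpid() call and fail with ESRCH after it, which matches the endless "kill 0 True" lines in the buggy program.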
Here, the parent process kills the child process but leaves it unreaped. A child process can be reaped in one of the following two ways:
- The parent process calls wait()/waitpid() immediately after it kills the child process.
#!/usr/bin/env python2.7
# coding=utf-8
# vim: ts=4 sw=4

import os
import errno
import time
import signal


def kill_operation(pid, signal):
    # Returns False if the signal could not be delivered, True otherwise.
    try:
        os.kill(pid, signal)
    except OSError as err:
        if err.errno == errno.EPERM or err.errno == errno.ESRCH:
            return False
    return True


pid = os.fork()
if pid == 0:
    print "in child process ..."
    time.sleep(5*60)
    os._exit(0)
else:
    print "parent process %d, sub process id %d" % (os.getpid(), pid)

    max_sleep_time = 20
    current_sleep_time = 0
    sleep_interval = 5
    existence = True
    while existence:
        if (current_sleep_time < max_sleep_time):
            try:
                os.waitpid(pid, os.WNOHANG)
            except OSError as err:
                if err.errno == errno.ECHILD:
                    break
            time.sleep(sleep_interval)
            current_sleep_time += sleep_interval
        else:
            # The label says "kill 9", but the signal sent is 15 (SIGTERM).
            print("kill 9\t\t" + str(kill_operation(pid, 15)))
            # Reap the killed child so that it does not linger as a zombie.
            os.waitpid(pid, 0)

        existence = kill_operation(pid, 0)
        print("kill 0\t\t" + str(existence))

exit(0)
- Handle the SIGCHLD signal.
#!/usr/bin/env python2.7
# coding=utf-8
# vim: ts=4 sw=4

import os
import errno
import time
import signal


def kill_operation(pid, signal):
    # Returns False if the signal could not be delivered, True otherwise.
    try:
        os.kill(pid, signal)
    except OSError as err:
        if err.errno == errno.EPERM or err.errno == errno.ESRCH:
            return False
    return True


# The parent reaps the child asynchronously when it exits
def on_child_exit(signum, frame):
    pid, status = os.wait()
    print("on_child_exit(): parent %d, child %d, exit status %d" % (os.getpid(), pid, status))


pid = os.fork()
if pid == 0:
    print "in child process ..."
    time.sleep(5*60)
    os._exit(0)
else:
    print "parent process %d, sub process id %d" % (os.getpid(), pid)
    signal.signal(signal.SIGCHLD, on_child_exit)

    max_sleep_time = 20
    current_sleep_time = 0
    sleep_interval = 5
    existence = True
    while existence:
        if (current_sleep_time < max_sleep_time):
            try:
                os.waitpid(pid, os.WNOHANG)
            except OSError as err:
                if err.errno == errno.ECHILD:
                    break
            time.sleep(sleep_interval)
            current_sleep_time += sleep_interval
        else:
            # The label says "kill 9", but the signal sent is 15 (SIGTERM).
            print("kill 9\t\t" + str(kill_operation(pid, 15)))

        existence = kill_operation(pid, 0)
        print("kill 0\t\t" + str(existence))

exit(0)
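The handler above calls os.wait() once per signal. Standard signals are not queued, so if several children exit close together the parent may see only one SIGCHLD. A common variant, sketched here under that assumption, loops over waitpid(-1, WNOHANG) until nothing is left to reap. The names reap_children, reaped and describe are hypothetical and not part of the original program.

```python
import errno
import os
import signal

reaped = {}  # pid -> raw wait status, filled in by the handler


def reap_children(signum, frame):
    """Reap every child that has already terminated, without blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as err:
            if err.errno == errno.ECHILD:
                return  # no children left at all
            raise
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped[pid] = status


def describe(status):
    """Turn a raw wait status into a readable description."""
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    return "raw status %d" % status
```

It would be installed the same way as on_child_exit, with signal.signal(signal.SIGCHLD, reap_children). Note that the raw status printed by the original handler is an encoded value, which is why a decoding helper such as describe can be useful.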
Now everything works as expected. 🙂
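For reuse, the first approach (poll, kill on timeout, then reap) can be folded into one helper. This is only a sketch: the name wait_or_kill and its parameters are hypothetical, and it assumes the child does not ignore the signal that is sent, since otherwise the final blocking waitpid() would hang.

```python
import os
import signal
import time


def wait_or_kill(pid, timeout, interval=0.1, sig=signal.SIGTERM):
    """Wait up to `timeout` seconds for child `pid`; kill and reap it on expiry.

    Returns (status, timed_out), where status is the raw value from waitpid().
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        reaped_pid, status = os.waitpid(pid, os.WNOHANG)
        if reaped_pid == pid:
            return status, False  # child finished on its own and is now reaped
        time.sleep(interval)
    os.kill(pid, sig)
    # Blocking wait: this is the step that removes the zombie entry.
    _, status = os.waitpid(pid, 0)
    return status, True
```

A caller would fork and exec as before, then call something like wait_or_kill(pid, timeout=20, interval=5). If the child might ignore SIGTERM, passing sig=signal.SIGKILL is one option.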
PS: All above code can be downloaded from here.
References: