Note: This article applies to Python 2 environments.

Background

In Unix/Linux programming, we use the fork + exec mechanism to create a child process and replace it with the program we want to run. We then call wait() or waitpid() to wait for the child process to return.
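As a quick illustration of that mechanism, here is a minimal sketch of fork + exec + waitpid in Python. The helper name run_and_wait and the exit code 127 for a failed exec are my own choices for this example (127 mirrors the shell convention), and the code is written so that it should run under both Python 2.7 and Python 3.

```python
import os
import sys


def run_and_wait(argv):
    """Fork, exec argv in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the target program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; never fall back into parent code.
            os._exit(127)
    # Parent: block until the child terminates, then decode its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # negative value: killed by a signal


if __name__ == "__main__":
    code = run_and_wait([sys.executable, "-c", "raise SystemExit(3)"])
    print("exit code: %d" % code)
```

Note that waitpid(pid, 0) blocks for as long as the child runs, which is exactly where the trouble described next comes from.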

In some scenarios, however, the child process blocks and hangs for an unexpected reason (e.g. rsync is likely to block because of network issues). The child process then never returns, leaving the parent process (probably the main process) hanging too.

Problem

It’s natural to have the parent process manage the child process and control its life cycle. More specifically, we can set a maximum time for the child process to return. If the child process doesn’t return within that time, the parent process can simply kill it.

We can verify our idea using Python:

#!/usr/bin/env python2.7
# coding=utf-8
# vim: ts=4 sw=4

import os
import errno
import time
import signal


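def kill_operation(pid, sig):
    # Send sig to pid. Return False if the process does not exist (ESRCH)
    # or we may not signal it (EPERM). Signal 0 only checks for existence.
    try:
        os.kill(pid, sig)
    except OSError as err:
        if err.errno == errno.EPERM or err.errno == errno.ESRCH:
            return False
    return True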


pid = os.fork()
if pid == 0:
    print "in child process ..."
    time.sleep(5*60)
    os._exit(0)
else:
    print "parent process %d, sub process id %d" % (os.getpid(), pid)


max_sleep_time = 20
current_sleep_time = 0
sleep_interval = 5
existence = True
while existence:
    if (current_sleep_time < max_sleep_time):
        try:
            os.waitpid(pid, os.WNOHANG)
        except OSError as err:
            if err.errno == errno.ECHILD:
                break
        time.sleep(sleep_interval)
        current_sleep_time += sleep_interval
    else:
        # Timed out: 15 is SIGTERM, although the printed label says 9.
        print("kill 9\t\t" + str(kill_operation(pid, 15)))
    existence = kill_operation(pid, 0)
    print("kill 0\t\t" + str(existence))


exit(0)


However, the code above doesn’t work as expected: the child process never goes away. Here is the output of the program and of the ps command:

The program output:

% ./fork_exec.bug.py
parent process 1680, sub process id 1681
in child process ...
kill 0          True
kill 0          True
kill 0          True
kill 0          True
kill 9          True
kill 0          True
kill 9          True

ps command output before the child process was killed:

% ps -efww | grep -v grep | grep python2.7
nostalg+  1680  2442  0 22:41 pts/55   00:00:00 python2.7 ./fork_exec.bug.py
nostalg+  1681  1680  0 22:41 pts/55   00:00:00 python2.7 ./fork_exec.bug.py

ps command output after the child process was killed:

% ps -efww | grep -v grep | grep python2.7
nostalg+  1680  2442  0 22:41 pts/55   00:00:00 python2.7 ./fork_exec.bug.py
nostalg+  1681  1680  0 22:41 pts/55   00:00:00 [python2.7] <defunct>

Workaround

We can see that the killed child process became a zombie (defunct) process.

Wikipedia describes zombie processes as follows:

On Unix and Unix-like computer operating systems, a zombie process or defunct process is a process that has completed execution (via the exit system call) but still has an entry in the process table: it is a process in the "Terminated state". This occurs for child processes, where the entry is still needed to allow the parent process to read its child's exit status: once the exit status is read via the wait system call, the zombie's entry is removed from the process table and it is said to be "reaped". A child process always first becomes a zombie before being removed from the resource table. In most cases, under normal system operation zombies are immediately waited on by their parent and then reaped by the system – processes that stay zombies for a long time are generally an error and cause a resource leak.
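The zombie state described above can be observed directly. The following is a small sketch, not part of the original scripts: the names process_exists and zombie_demo are made up for this example, and the 0.2 second pause is an assumption about how long the child needs to die after SIGTERM. It should run under both Python 2.7 and Python 3.

```python
import errno
import os
import signal
import time


def process_exists(pid):
    """Return True if pid still has a process table entry (zombies included)."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is sent
    except OSError as err:
        if err.errno == errno.ESRCH:
            return False
    return True


def zombie_demo():
    """Kill a child and probe it before and after reaping."""
    pid = os.fork()
    if pid == 0:
        time.sleep(60)  # child: pretend to hang
        os._exit(0)
    os.kill(pid, signal.SIGTERM)
    time.sleep(0.2)  # assumption: long enough for the child to die
    before_reap = process_exists(pid)  # dead but unreaped: still "exists"
    os.waitpid(pid, 0)  # read the exit status, removing the zombie
    after_reap = process_exists(pid)
    return before_reap, after_reap


if __name__ == "__main__":
    print("before reap: %s, after reap: %s" % zombie_demo())
```

On Linux we would expect the first probe to succeed and the second to fail, because the process table entry only disappears once the parent has called waitpid().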

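Here, the parent process kills the child process but leaves it unreaped. Because a zombie still has an entry in the process table, kill(pid, 0) keeps succeeding and the loop never ends. A child process can be reaped in one of the following two ways: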

    • The parent process calls wait()/waitpid() immediately after it kills the child process.
#!/usr/bin/env python2.7
# coding=utf-8
# vim: ts=4 sw=4

import os
import errno
import time
import signal


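def kill_operation(pid, sig):
    # Send sig to pid. Return False if the process does not exist (ESRCH)
    # or we may not signal it (EPERM). Signal 0 only checks for existence.
    try:
        os.kill(pid, sig)
    except OSError as err:
        if err.errno == errno.EPERM or err.errno == errno.ESRCH:
            return False
    return True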


pid = os.fork()
if pid == 0:
    print "in child process ..."
    time.sleep(5*60)
    os._exit(0)
else:
    print "parent process %d, sub process id %d" % (os.getpid(), pid)


max_sleep_time = 20
current_sleep_time = 0
sleep_interval = 5
existence = True
while existence:
    if (current_sleep_time < max_sleep_time):
        try:
            os.waitpid(pid, os.WNOHANG)
        except OSError as err:
            if err.errno == errno.ECHILD:
                break
        time.sleep(sleep_interval)
        current_sleep_time += sleep_interval
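    else:
        # Timed out: 15 is SIGTERM, although the printed label says 9.
        print("kill 9\t\t" + str(kill_operation(pid, 15)))
        # Reap the child right away so it does not linger as a zombie.
        os.waitpid(pid, 0)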
    existence = kill_operation(pid, 0)
    print("kill 0\t\t" + str(existence))


exit(0)


    • Handle the SIGCHLD signal.
#!/usr/bin/env python2.7
# coding=utf-8
# vim: ts=4 sw=4

import os
import errno
import time
import signal


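def kill_operation(pid, sig):
    # Send sig to pid. Return False if the process does not exist (ESRCH)
    # or we may not signal it (EPERM). Signal 0 only checks for existence.
    try:
        os.kill(pid, sig)
    except OSError as err:
        if err.errno == errno.EPERM or err.errno == errno.ESRCH:
            return False
    return True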


# The parent process reaps the child asynchronously when it exits
def on_child_exit(signum, frame):
    pid, status = os.wait()
    print("on_child_exit(): parent %d, child %d, exit status %d" % (os.getpid(), pid, status))


pid = os.fork()
if pid == 0:
    print "in child process ..."
    time.sleep(5*60)
    os._exit(0)
else:
    print "parent process %d, sub process id %d" % (os.getpid(), pid)


signal.signal(signal.SIGCHLD, on_child_exit)

max_sleep_time = 20
current_sleep_time = 0
sleep_interval = 5
existence = True
while existence:
    if (current_sleep_time < max_sleep_time):
        try:
            os.waitpid(pid, os.WNOHANG)
        except OSError as err:
            if err.errno == errno.ECHILD:
                break
        time.sleep(sleep_interval)
        current_sleep_time += sleep_interval
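    else:
        # Timed out: 15 is SIGTERM, although the printed label says 9.
        # The SIGCHLD handler reaps the child once it has died.
        print("kill 9\t\t" + str(kill_operation(pid, 15)))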
    existence = kill_operation(pid, 0)
    print("kill 0\t\t" + str(existence))


exit(0)


 

Now everything works as expected. 🙂
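For reuse, the first approach (kill, then reap immediately) can be condensed into one helper. This is only a sketch under a few assumptions: the name wait_or_kill, its parameters and its return value are invented for this example, no SIGCHLD handler is installed, and the child does not ignore the signal we send (otherwise the final waitpid() would block). It should run under both Python 2.7 and Python 3.

```python
import os
import signal
import time


def wait_or_kill(pid, timeout, interval=0.1, sig=signal.SIGTERM):
    """Wait up to timeout seconds for child pid; kill and reap it on timeout.

    Returns (status, timed_out), where status is the raw waitpid() status.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        # WNOHANG: return (0, 0) at once if the child is still running.
        reaped, status = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return status, False
        time.sleep(interval)
    os.kill(pid, sig)
    # Reap right after killing so no zombie is left behind.
    _, status = os.waitpid(pid, 0)
    return status, True


if __name__ == "__main__":
    child = os.fork()
    if child == 0:
        time.sleep(300)  # child: pretend to hang
        os._exit(0)
    status, timed_out = wait_or_kill(child, timeout=1)
    print("timed out: %s, killed by signal: %s"
          % (timed_out, os.WIFSIGNALED(status)))
```

Passing sig=signal.SIGKILL is an option for children that might ignore SIGTERM, at the cost of giving them no chance to clean up.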

PS: All of the code above can be downloaded from here.

 
