Process States in Linux

1. Process states
2. TASK_RUNNING
3. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE
4. TASK_KILLABLE
5. __TASK_STOPPED and __TASK_TRACED
6. EXIT_ZOMBIE and EXIT_DEAD

1 Process states

On Linux (this article uses Linux 4.8.4), a process can be in roughly seven states.

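State	Description
TASK_RUNNING	Runnable. Not necessarily on a CPU; the process may be waiting to be scheduled
TASK_INTERRUPTIBLE	Interruptible sleep. Waiting for some condition to be met
TASK_UNINTERRUPTIBLE	Uninterruptible sleep. Cannot be interrupted by signals
__TASK_STOPPED	Stopped. Execution was halted on receipt of a stop signal
__TASK_TRACED	Traced. The process is stopped and is being traced by another process
EXIT_ZOMBIE	Zombie. The process has exited but has not yet been reaped by its parent or by init
EXIT_DEAD	Truly dead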

The definitions in include/linux/sched.h, however, cover more than these seven:

/*
 * Task state bitmask. NOTE! These bits are also
 * encoded in fs/proc/array.c: get_task_state().
 *
 * We have two separate sets of flags: task->state
 * is about runnability, while task->exit_state are
 * about the task exiting. Confusing, but this way
 * modifying one set can't modify the other one by
 * mistake.
 */
#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define __TASK_STOPPED          4
#define __TASK_TRACED           8
/* in tsk->exit_state */
#define EXIT_DEAD               16
#define EXIT_ZOMBIE             32
#define EXIT_TRACE              (EXIT_ZOMBIE | EXIT_DEAD)
/* in tsk->state again */
#define TASK_DEAD               64
#define TASK_WAKEKILL           128
#define TASK_WAKING             256
#define TASK_PARKED             512
#define TASK_NOLOAD             1024
#define TASK_NEW                2048
#define TASK_STATE_MAX          4096

#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPNn"

extern char ___assert_task_state[1 - 2*!!(
                sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];

/* Convenience macros for the sake of set_task_state */
#define TASK_KILLABLE           (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_STOPPED            (TASK_WAKEKILL | __TASK_STOPPED)
#define TASK_TRACED             (TASK_WAKEKILL | __TASK_TRACED)

#define TASK_IDLE               (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* Convenience macros for the sake of wake_up */
#define TASK_NORMAL             (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
#define TASK_ALL                (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED)

/* get_task_state() */
#define TASK_REPORT             (TASK_RUNNING | TASK_INTERRUPTIBLE | \
                                 TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \
                                 __TASK_TRACED | EXIT_ZOMBIE | EXIT_DEAD)
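
The relationship between the bit values and TASK_STATE_TO_CHAR_STR can be sketched in user space. The helper below, state_to_char, is a hypothetical name and not a kernel function; it assumes the letter is chosen from the lowest set bit of the state, which is how the string and the bit order in the header line up (state 0 maps to index 0, bit n maps to index n + 1).

```c
/* Copied from the 4.8 header quoted above. */
#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPNn"

/*
 * Hypothetical user-space helper (not kernel code): map a state bitmask
 * to its one-letter code. TASK_RUNNING (0) selects index 0; otherwise
 * the lowest set bit n selects index n + 1.
 */
char state_to_char(unsigned int state)
{
        static const char names[] = TASK_STATE_TO_CHAR_STR;
        unsigned int idx = 0;

        if (state != 0) {
                idx = 1;
                /* Shift right until the lowest set bit reaches bit 0. */
                while (!(state & 1u)) {
                        state >>= 1;
                        idx++;
                }
        }
        /* Fall back to '?' for bits beyond the end of the string. */
        return idx < sizeof(names) - 1 ? names[idx] : '?';
}
```

Under this assumption a composite state such as TASK_KILLABLE (128 | 2) is reported with the letter of its lowest bit, which is D.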

2 TASK_RUNNING

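TASK_RUNNING merges two textbook states. One is RUNNING, in which the process is actually consuming CPU time. The other is READY, which a running process enters when its time slice expires, when it voluntarily yields the CPU, or when a higher-priority process preempts it. A process in TASK_RUNNING is therefore either executing on a CPU or ready to execute at any moment; CPU resources are limited, and the scheduler simply has not picked it yet.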

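Processes in TASK_RUNNING are what the scheduler works on. In Linux, each CPU has its own set of run queues. A real-time process is placed on the queue that matches its priority; a normal process is placed in a red-black tree at a position determined by its virtual runtime.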

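Linux provides the time command to measure the CPU time a process consumes in user mode and in kernel mode. It reports three times: real (elapsed) time, user CPU time, and system (kernel) CPU time. The output below shows that \( real \neq user + sys \): real also includes the time the process spends sleeping or waiting. On a multi-core machine either side may be the larger one, since a multi-threaded program can accumulate CPU time on several cores at once.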

[root@localhost ~]# time ntpdate pool.ntp.org
xxx xxxxxx outputs of ntpdate xxx xxxxxx

real    0m8.710s
user    0m0.002s
sys     0m0.013s

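To obtain a program's CPU time while it is still running, read it from procfs. In /proc/<PID>/stat, field 14 is the user-mode CPU time (utime) and field 15 is the kernel-mode CPU time (stime), counting from 1 (13 and 14 if counting from 0). Both are expressed in clock ticks. These are the ticks exposed to user space, so convert them to seconds by dividing by sysconf(_SC_CLK_TCK) (the value printed by getconf CLK_TCK, typically 100) rather than by the kernel's internal HZ. The internal tick rate is chosen when the kernel is configured, with four options: 100 HZ, 250 HZ, 300 HZ, and 1000 HZ. The configured value can be checked with the following command: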

[root@localhost ~]# grep CONFIG_HZ /boot/config-*
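
Reading those two fields programmatically can be sketched as follows. The function name parse_stat_line is made up for this illustration. It assumes the field layout documented in proc(5) and skips past the last ')' first, because the command name in field 2 may itself contain spaces or parentheses.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the state letter (field 3), utime
 * (field 14) and stime (field 15) from one line of /proc/<pid>/stat.
 * Returns 0 on success, -1 if the line cannot be parsed.
 */
int parse_stat_line(const char *line, char *state,
                    unsigned long *utime, unsigned long *stime)
{
        /* Field 2 is "(comm)"; everything after the last ')' is regular. */
        const char *p = strrchr(line, ')');

        if (p == NULL)
                return -1;

        /* Fields 4-8 are signed, 9-13 unsigned; all are skipped with '*'. */
        if (sscanf(p + 1,
                   " %c %*d %*d %*d %*d %*d %*lu %*lu %*lu %*lu %*lu %lu %lu",
                   state, utime, stime) != 3)
                return -1;

        return 0;
}
```

A caller would read the line with fgets from /proc/<pid>/stat and divide utime and stime by sysconf(_SC_CLK_TCK) to get seconds.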

The pidstat command can also report per-process CPU usage. To get the elapsed (wall-clock) time since a process started, use ps:

[] ~ ps -p 20590 -o etime,cmd,pid
    ELAPSED CMD                            PID
   01:21:57 emacs taskstatus.org         20590

3 TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE

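When a process deals with a slow device or has to wait for some condition to be met, the length of the wait cannot be predicted. In that case the kernel removes the process from the CPU's run queue and the process goes to sleep.

Linux has two sleep states, TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE. They differ in whether the process responds to incoming signals. A process in TASK_INTERRUPTIBLE returns to TASK_RUNNING in either of two cases:

the awaited condition is met;
it receives a signal that is not blocked.

When the sleep is cut short by a signal, the blocked system call typically fails with EINTR, so the caller must check the return value and handle it properly. A process in TASK_UNINTERRUPTIBLE can return to the runnable state only when the awaited condition is met; no signal can interrupt it. If such a process gets stuck, it cannot be killed, and rebooting is the only remedy.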
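
Checking for EINTR usually means retrying the call. A minimal sketch, assuming the caller simply wants to restart an interrupted read; read_retry is a hypothetical wrapper name, not a standard function:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: call read(2) again whenever it was interrupted
 * by a signal before any data was transferred (errno == EINTR).
 * Any other result, including other errors, is returned unchanged.
 */
ssize_t read_retry(int fd, void *buf, size_t count)
{
        ssize_t n;

        do {
                n = read(fd, buf, count);
        } while (n == -1 && errno == EINTR);

        return n;
}
```

Whether retrying is the right reaction depends on the program; some callers use the interruption as a chance to check a flag set by their signal handler instead.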

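TASK_UNINTERRUPTIBLE exists because some kernel operations must not be interrupted. For example, while a read system call is operating on the disk, TASK_UNINTERRUPTIBLE protects it from being disturbed and ending up in an uncontrolled state.

The khungtaskd kernel thread (source in kernel/hung_task.c) wakes up periodically (every 120 seconds by default) and checks all TASK_UNINTERRUPTIBLE processes. If a process has not been scheduled for more than 120 seconds, the kernel prints its stack trace. The following command shows the khungtaskd timeout: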

[root@localhost ~]# sysctl kernel.hung_task_timeout_secs
kernel.hung_task_timeout_secs = 120

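/proc/<pid>/wchan (wchan is short for "wait channel"), /proc/<pid>/stack, and /proc/<pid>/status show what state a process is in and where in the kernel it is waiting.

Sleeping processes are kept on wait queues, which are defined in include/linux/wait.h.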

typedef struct __wait_queue wait_queue_t;
typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int flags, void *key);
int default_wake_function(wait_queue_t *wait, unsigned mode, int flags, void *key);

/* __wait_queue::flags */
#define WQ_FLAG_EXCLUSIVE       0x01
#define WQ_FLAG_WOKEN           0x02

struct __wait_queue {
        unsigned int            flags;
        void                    *private;
        wait_queue_func_t       func; // wake-up callback
        struct list_head        task_list;
};

struct wait_bit_key {
        void                    *flags;
        int                     bit_nr;
#define WAIT_ATOMIC_T_BIT_NR    -1
        unsigned long           timeout;
};

struct wait_bit_queue {
        struct wait_bit_key     key;
        wait_queue_t            wait;
};

struct __wait_queue_head {
        spinlock_t              lock;
        struct list_head        task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;

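__WAITQUEUE_INITIALIZER sets the private field of a wait queue entry to point at the process descriptor (task_struct), which ties the entry to a process. add_wait_queue or add_wait_queue_exclusive then adds the entry to a queue. The two functions differ in two ways:

add_wait_queue_exclusive sets the WQ_FLAG_EXCLUSIVE flag on the entry, while add_wait_queue clears it;
add_wait_queue_exclusive adds the entry at the tail of the queue, while add_wait_queue adds it at the head.

The reason is that in some cases every process on the queue may be woken once the condition is met, while in others the wake-up is exclusive (EXCLUSIVE) and only one waiter should be woken. Keeping exclusive waiters at the tail lets the non-exclusive ones be woken first.

The kernel uses the wait_event family of macros and functions to wait for a condition to become true.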

#define ___wait_is_interruptible(state)                                 \
        (!__builtin_constant_p(state) ||                                \
                state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)  \

/*
 * The below macro ___wait_event() has an explicit shadow of the __ret
 * variable when used from the wait_event_*() macros.
 *
 * This is so that both can use the ___wait_cond_timeout() construct
 * to wrap the condition.
 *
 * The type inconsistency of the wait_event_*() __ret variable is also
 * on purpose; we use long where we can return timeout values and int
 * otherwise.
 */

#define ___wait_event(wq, condition, state, exclusive, ret, cmd)        \
({                                                                      \
        __label__ __out;                                                \
        wait_queue_t __wait;                                            \
        long __ret = ret;       /* explicit shadow */                   \
                                                                        \
        INIT_LIST_HEAD(&__wait.task_list);                              \
        if (exclusive)                                                  \
                __wait.flags = WQ_FLAG_EXCLUSIVE;                       \
        else                                                            \
                __wait.flags = 0;                                       \
                                                                        \
        for (;;) {                                                      \
                long __int = prepare_to_wait_event(&wq, &__wait, state);\
                                                                        \
                if (condition)                                          \
                        break;                                          \
                                                                        \
                if (___wait_is_interruptible(state) && __int) {         \
                        __ret = __int;                                  \
                        if (exclusive) {                                \
                                abort_exclusive_wait(&wq, &__wait,      \
                                                     state, NULL);      \
                                goto __out;                             \
                        }                                               \
                        break;                                          \
                }                                                       \
                                                                        \
                cmd;                                                    \
        }                                                               \
        finish_wait(&wq, &__wait);                                      \
__out:  __ret;                                                          \
})

#define __wait_event(wq, condition)                                     \
        (void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,  \
                            schedule())

/**
 * wait_event - sleep until a condition gets true
 * @wq: the waitqueue to wait on
 * @condition: a C expression for the event to wait for
 *
 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
 * @condition evaluates to true. The @condition is checked each time
 * the waitqueue @wq is woken up.
 *
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.
 */
#define wait_event(wq, condition)                                       \
do {                                                                    \
        might_sleep();                                                  \
        if (condition)                                                  \
                break;                                                  \
        __wait_event(wq, condition);                                    \
} while (0)

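prepare_to_wait_event adds the entry to the wait queue and sets the process state to the requested one, which is TASK_UNINTERRUPTIBLE for wait_event. After that the condition is checked again; if it still does not hold, schedule() is called to give up the CPU voluntarily. prepare_to_wait_event is defined in kernel/sched/wait.c.

Wake-ups go through the wake_up family of macros, which end up calling __wake_up in kernel/sched/wait.c. From there the entry's callback (default_wake_function unless another one was installed) ultimately calls try_to_wake_up.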

/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, int wake_flags, void *key)
{
        wait_queue_t *curr, *next;

        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                unsigned flags = curr->flags;

                if (curr->func(curr, mode, wake_flags, key) &&
                                (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;
        }
}

/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @q: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 *
 * It may be assumed that this function implies a write memory barrier before
 * changing the task state if and only if any tasks are woken up.
 */
void __wake_up(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, void *key)
{
        unsigned long flags;

        spin_lock_irqsave(&q->lock, flags);
        __wake_up_common(q, mode, nr_exclusive, 0, key);
        spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(__wake_up);
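
The loop in __wake_up_common is compact, so a user-space model may make the exclusive-wake-up rule easier to follow. This is a simplified sketch, not kernel code: struct waiter and wake_up_model are invented for the illustration, the queue is a plain array, and every wake-up callback is assumed to succeed.

```c
#include <stddef.h>

#define WQ_FLAG_EXCLUSIVE 0x01

/* Simplified stand-in for a wait queue entry (hypothetical type). */
struct waiter {
        unsigned int flags;
        int woken;
};

/*
 * Model of the scan in __wake_up_common(): wake entries in queue order
 * and stop once nr_exclusive exclusive waiters have been woken.
 * With nr_exclusive == 0 the counter never reaches zero, so every
 * waiter is woken. Returns the number of waiters woken.
 */
int wake_up_model(struct waiter *q, size_t n, int nr_exclusive)
{
        int count = 0;

        for (size_t i = 0; i < n; i++) {
                q[i].woken = 1;         /* assume the callback succeeded */
                count++;
                if ((q[i].flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;
        }
        return count;
}
```

With two non-exclusive waiters followed by two exclusive ones, a call with nr_exclusive set to 1 wakes the first three entries and leaves the last one asleep.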

4 TASK_KILLABLE

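Some people believe that after vfork, the parent stays in TASK_UNINTERRUPTIBLE until the child calls exec or exits. That is not the case, because the parent can easily be killed with the kill command, even though ps does show it in the D+ state. Since 2.6.25 the kernel has had TASK_KILLABLE, which sits between TASK_UNINTERRUPTIBLE and TASK_INTERRUPTIBLE: the process ignores ordinary signals but is woken when it receives a fatal signal such as SIGKILL. TASK_KILLABLE is defined as TASK_WAKEKILL | TASK_UNINTERRUPTIBLE, which is why such a process is still displayed as D.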

5 __TASK_STOPPED and __TASK_TRACED

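Signals such as SIGSTOP, SIGTSTP, SIGTTIN, and SIGTTOU suspend a process temporarily and put it in the __TASK_STOPPED state. Of these four, only SIGSTOP cannot be ignored, blocked, or given a handler; the other three stop the process by default but can be caught or ignored. The process resumes execution when it receives SIGCONT.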
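
The stop and continue cycle can be observed from a parent process. The following is a minimal sketch for Linux; the function name stop_and_continue_child is made up for the illustration, and it relies on waitpid with WUNTRACED and WCONTINUED to report the child's state changes.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: fork a child, stop it with SIGSTOP, resume it with
 * SIGCONT, then kill and reap it. Returns 0 if both state changes were
 * reported by waitpid(), -1 otherwise.
 */
int stop_and_continue_child(void)
{
        int status;
        pid_t pid = fork();

        if (pid < 0)
                return -1;
        if (pid == 0) {
                for (;;)
                        pause();        /* child: sleep until signalled */
        }

        if (kill(pid, SIGSTOP) != 0)
                goto fail;
        /* WUNTRACED makes waitpid() report a stopped child. */
        if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
                goto fail;

        if (kill(pid, SIGCONT) != 0)
                goto fail;
        /* WCONTINUED makes waitpid() report the resumption. */
        if (waitpid(pid, &status, WCONTINUED) != pid || !WIFCONTINUED(status))
                goto fail;

        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);       /* reap, so no zombie is left */
        return 0;

fail:
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
}
```

Between the first waitpid and the SIGCONT, ps would show the child in state T.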

Tracing a process with gdb puts it in the __TASK_TRACED state. The debugging process resumes it by issuing requests such as PTRACE_CONT or PTRACE_DETACH.

6 EXIT_ZOMBIE and EXIT_DEAD

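In both states the process is already dead. A process stays in EXIT_ZOMBIE when its parent has not yet reaped it and has neither set the SIGCHLD disposition to SIG_IGN nor set the SA_NOCLDWAIT flag for SIGCHLD.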
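
A zombie can be produced and observed on purpose. The sketch below is an illustration under a few assumptions: /proc is mounted, the parent has not set SIGCHLD to SIG_IGN, and observe_zombie_state is a hypothetical name. It uses waitid with WNOWAIT to wait until the child has exited without reaping it, then reads the state letter from /proc/<pid>/stat.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns the state letter of an exited but unreaped
 * child as shown in /proc/<pid>/stat, or '?' if anything goes wrong.
 */
char observe_zombie_state(void)
{
        char path[64], line[512], state = '?';
        siginfo_t info;
        pid_t pid = fork();

        if (pid < 0)
                return '?';
        if (pid == 0)
                _exit(0);               /* child exits immediately */

        /* Block until the child has exited, but leave it unreaped. */
        if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == 0) {
                FILE *f;

                snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
                f = fopen(path, "r");
                if (f != NULL) {
                        if (fgets(line, sizeof(line), f) != NULL) {
                                /* The state letter follows "(comm) ". */
                                const char *p = strrchr(line, ')');

                                if (p != NULL && p[1] == ' ' && p[2] != '\0')
                                        state = p[2];
                        }
                        fclose(f);
                }
        }

        waitpid(pid, NULL, 0);          /* now reap the zombie */
        return state;
}
```

After the final waitpid the child's entry disappears from /proc, because reaping releases its process descriptor.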

The state of a process can be seen in /proc/<pid>/status. The mapping is as follows.

procfs	Process state
R (running)	TASK_RUNNING
S (sleeping)	TASK_INTERRUPTIBLE
D (disk sleep)	TASK_UNINTERRUPTIBLE
T (stopped)	__TASK_STOPPED
t (tracing stop)	__TASK_TRACED
Z (zombie)	EXIT_ZOMBIE
X (dead)	EXIT_DEAD