linux 中进程的状态
Table of Contents
1 进程的状态
linux (本文使用linux4.8.4) 下,进程状态大致有7种。
进程状态 | 说明 |
---|---|
TASK_RUNNING | 可运行状态。未必正在使用CPU,也许是在等待调度 |
TASK_INTERRUPTIBLE | 可中断的睡眠状态。正在等待某个条件满足 |
TASK_UNINTERRUPTIBLE | 不可中断的睡眠状态。不会被信号中断 |
__TASK_STOPPED | 暂停状态。收到某种信号,运行被停止 |
__TASK_TRACED | 被跟踪状态。进程停止,被另一个进程跟踪 |
EXIT_ZOMBIE | 僵尸状态。进程已经退出,但尚未被父进程或者init进程收尸 |
EXIT_DEAD | 真正的死亡状态 |
在include/linux/sched.h中,进程状态的定义并没有那么少:
/* * Task state bitmask. NOTE! These bits are also * encoded in fs/proc/array.c: get_task_state(). * * We have two separate sets of flags: task->state * is about runnability, while task->exit_state are * about the task exiting. Confusing, but this way * modifying one set can't modify the other one by * mistake. */ #define TASK_RUNNING 0 #define TASK_INTERRUPTIBLE 1 #define TASK_UNINTERRUPTIBLE 2 #define __TASK_STOPPED 4 #define __TASK_TRACED 8 /* in tsk->exit_state */ #define EXIT_DEAD 16 #define EXIT_ZOMBIE 32 #define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD) /* in tsk->state again */ #define TASK_DEAD 64 #define TASK_WAKEKILL 128 #define TASK_WAKING 256 #define TASK_PARKED 512 #define TASK_NOLOAD 1024 #define TASK_NEW 2048 #define TASK_STATE_MAX 4096 #define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPNn" extern char ___assert_task_state[1 - 2*!!( sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)]; /* Convenience macros for the sake of set_task_state */ #define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE) #define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED) #define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED) #define TASK_IDLE (TASK_UNINTERRUPTIBLE | TASK_NOLOAD) /* Convenience macros for the sake of wake_up */ #define TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE) #define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED) /* get_task_state() */ #define TASK_REPORT (TASK_RUNNING | TASK_INTERRUPTIBLE | \ TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \ __TASK_TRACED | EXIT_ZOMBIE | EXIT_DEAD)
2 TASK_RUNNING
TASK_RUNNING是教科书中两种状态的结合,一种是正在占用CPU事件的RUNNING状态,一种是RUNNING状态的进程时间片耗尽或者主动让出CPU,或者被更高优先级进程抢占后,进入的READY状态。处于TASK_RUNNING状态的进程要么正在CPU上运行,要么随时都可以投入运行,只不过CPU资源有限,调度器暂时没有选中他们。
处于TASK_RUNNING状态的进程是调度器的调度对象。在linux中,每个CPU都有自己的运行队列集合。如果是实时进程,则根据优先级的情况落在相应的优先级的队列上;如果是普通进程,则根据虚拟运行时间,落在红黑树相应位置上。
Linux提供了time命令可以统计进程在用户态和内核态消耗的CPU时间。time命令提供了三种事件:实际时间,用户CPU时间和内核CPU时间。下面的输出可以看出 \( real \not= user+sys \) 。在多核处理器上,两边的大小是不确定的。
[root@localhost ~]# time ntpdate pool.ntp.org xxx xxxxxx outputs of ntpdate xxx xxxxxx real 0m8.710s user 0m0.002s sys 0m0.013s
如果想在进程尚未结束时获得程序的执行时间,可以空过procfs中的信息,/proc/<PID>/stat中字段13是用户态CPU时间,14是内核态CPU时间,两者单位是始终嘀嗒。在配置内核的时候,有100HZ,250HZ,300HZ和1000HZ这4个选项。一个始终嘀嗒的事件可以通过下面的命令获得:
[root@localhost ~]# grep CONFIG_HZ /boot/config-*
pidstat命令也可以获取各个进程的CPU使用情况。如果想获取进程的实际运行时间,可以使用ps命令:
[] ~ ps -p 20590 -o etime,cmd,pid ELAPSED CMD PID 01:21:57 emacs taskstatus.org 20590
3 TASK_INTERRUPTIBLE 和 TASK_UNINTERRUPTIBLE
当进程和慢速设备打交道,或者需要等待条件满足时,这种等待时间是不可预估的,这种情况下,内核会将该进程从CPU的运行队列中移除,从而进程进入睡眠状态。
Linux的进程有两种睡眠状态:TASK_INTERRUPTIBLE和TASK_UNINTERRUPTIBLE,这两种状态的区别是能否响应收到的信号。处于TASK_INTERRUPTIBLE状态的进程遇到下面两种情况会返回到TASK_RUNNING状态:
- 等待条件满足;
- 收到未被屏蔽的信号。
收到信号时,会返回EINTR,需要检测返回值以作出正确处理。对于TASK_UNINTERRUPTIBLE,只有等待条件满足才有可能返回运行状态,任何信号都无法打断它。如果这种状态的进程出错,无法杀死,只能重启。
TASK_UNINTERRUPTIBLE的存在是因为内核中某些处理是不能被打断的,比如read系统调用正在操作磁盘,就要用TASK_UNINTERRUPTIBLE将其保护起来以免受到打扰而陷入不可控的状态。
khungtaskd内核线程(源码在kernel/hung_task.c)会定期唤醒(120秒)检查所有 TASK_UNINTERRUPTIBLE进程,如果有进程超过120秒没有被调度,那么内核就会打印进程的堆栈信息。通过下面的命令可以查看kungtaskd周期:
[root@localhost ~]# sysctl kernel.hung_task_timeout_secs kernel.hung_task_timeout_secs = 120
通过/proc/<pid>/wchan (what channel的缩写) 或者 proc/<pic>/stack,或者 /proc/<pid>/status 可以知道进程处于什么状态。
睡眠状态的进程都保存在等待队列中。队列在include/linux/wait.h中定义。
typedef struct __wait_queue wait_queue_t; typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int flags, void *key); int default_wake_function(wait_queue_t *wait, unsigned mode, int flags, void *key); /* __wait_queue::flags */ #define WQ_FLAG_EXCLUSIVE 0x01 #define WQ_FLAG_WOKEN 0x02 struct __wait_queue { unsigned int flags; void *private; wait_queue_func_t func; // 唤醒回调函数 struct list_head task_list; }; struct wait_bit_key { void *flags; int bit_nr; #define WAIT_ATOMIC_T_BIT_NR -1 unsigned long timeout; }; struct wait_bit_queue { struct wait_bit_key key; wait_queue_t wait; }; struct __wait_queue_head { spinlock_t lock; struct list_head task_list; }; typedef struct __wait_queue_head wait_queue_head_t;
等待队列元素private在__WAITQUEUE_INITIALIZER中指向了进程描述符task_struct,这就可以将进程加入到对应的队列上了。使用add_wait_queue或者 add_wait_queue_exclusive将队列元素加到相应队列。这两个函数的区别在于:
- 一个将队列元素设置WQ_FLAG_EXCLUSIVE标志位,另一个没有;
- 一个将元素放到队列尾部,另一个放到队列头部。
这是因为有时候当等待条件满足,有时可以将队列中的所有进程唤醒,有时唤醒操作是排他的(EXCLUSIVE)则只能唤醒一个。
内核使用wait_event系列宏和函数等待条件是否满足。
#define ___wait_is_interruptible(state) \ (!__builtin_constant_p(state) || \ state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE) \ /* * The below macro ___wait_event() has an explicit shadow of the __ret * variable when used from the wait_event_*() macros. * * This is so that both can use the ___wait_cond_timeout() construct * to wrap the condition. * * The type inconsistency of the wait_event_*() __ret variable is also * on purpose; we use long where we can return timeout values and int * otherwise. */ #define ___wait_event(wq, condition, state, exclusive, ret, cmd) \ ({ \ __label__ __out; \ wait_queue_t __wait; \ long __ret = ret; /* explicit shadow */ \ \ INIT_LIST_HEAD(&__wait.task_list); \ if (exclusive) \ __wait.flags = WQ_FLAG_EXCLUSIVE; \ else \ __wait.flags = 0; \ \ for (;;) { \ long __int = prepare_to_wait_event(&wq, &__wait, state);\ \ if (condition) \ break; \ \ if (___wait_is_interruptible(state) && __int) { \ __ret = __int; \ if (exclusive) { \ abort_exclusive_wait(&wq, &__wait, \ state, NULL); \ goto __out; \ } \ break; \ } \ \ cmd; \ } \ finish_wait(&wq, &__wait); \ __out: __ret; \ }) #define __wait_event(wq, condition) \ (void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0, \ schedule()) /** * wait_event - sleep until a condition gets true * @wq: the waitqueue to wait on * @condition: a C expression for the event to wait for * * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the * @condition evaluates to true. The @condition is checked each time * the waitqueue @wq is woken up. * * wake_up() has to be called after changing any variable that could * change the result of the wait condition. */ #define wait_event(wq, condition) \ do { \ might_sleep(); \ if (condition) \ break; \ __wait_event(wq, condition); \ } while (0)
prepare_to_wait函数将队列元素添加到对应的等待队列,同时将进程状态设置成 TASK_UNINTERRUPTIBLE,完成prepare_to_wait后,检查条件是否满足,如果不满足则调用schedule()主动让出CPU使用权。prepare_to_wait在/kernel/sched/wait.c中。
内核是通过wake_up系列宏实现唤醒操作的。这些宏最终调用__wake_up函数。这个函数在kernel/sched/wait.c中wait_up最终调用try_to_wake_up。
/* * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve * number) then we wake all the non-exclusive tasks and one exclusive task. * * There are circumstances in which we can try to wake a task which has already * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns * zero in this (rare) case, and we handle it by continuing to scan the queue. */ static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int wake_flags, void *key) { wait_queue_t *curr, *next; list_for_each_entry_safe(curr, next, &q->task_list, task_list) { unsigned flags = curr->flags; if (curr->func(curr, mode, wake_flags, key) && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) break; } } /** * __wake_up - wake up threads blocked on a waitqueue. * @q: the waitqueue * @mode: which threads * @nr_exclusive: how many wake-one or wake-many threads to wake up * @key: is directly passed to the wakeup function * * It may be assumed that this function implies a write memory barrier before * changing the task state if and only if any tasks are woken up. */ void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, void *key) { unsigned long flags; spin_lock_irqsave(&q->lock, flags); __wake_up_common(q, mode, nr_exclusive, 0, key); spin_unlock_irqrestore(&q->lock, flags); } EXPORT_SYMBOL(__wake_up);
4 TASK_KILLABLE
有人认为使用vfork函数子进程在调用exec或者退出之前,父进程处于 TASK_UNINTERRUPTIBLE 状态,事实并非如此,因为进程可以轻易被Kill命令杀死。但是此时ps命令显示这个进程确实是D+状态。内核自2.6.25开始,引入了TASK_KILLABLE,处于TASK_UNINTERRUPTIBLE和TASK_INTERRUPTIBLE之间,进程收到致命信号SIGKILL时会被唤醒。
5 __TASK_STOPPED和__TASK_TRACED
SIGSTOP、SIGTSTP、SIGTTIN、SIGTTOUT等信号会将进程暂时停止,进入__TASK_STOPPED 状态。这4种状态不可被忽略,不可被屏蔽,不能安装新的处理函数。在收到SIGCONT 后进程可以恢复执行。
使用gdb跟踪进程可以进入__TASK_TRACED状态。调试进程下达PTRACE_COUT或者 PTRACE_DETACH等可将其重新执行。
6 EXIT_ZOMBIE 和 EXIT_DEAD
这两种状态下面,进程已经死掉了,只是TASK_ZOMBIE状态中的进程没有被收尸,或者父进程没有设置SIGCHLD处理函数为SIG_IGN,或者为SIGCHLD设置SA_NOCLDWAIT标志位。
进程的状态可以在/proc/<pid>/status中看到。对应关系如下。
procfs | 进程状态 |
---|---|
R(runnng) | TASK_RUNNING |
S(sleeping) | TASK_INTERRUPTIBLE |
D(disk sleeping) | TASK_UNINTERRUPTIBLE |
T(stopped) | __TASK_STOPPED |
t(tracing stop) | __TASK_TRACED |
Z(zombie) | EXIT_ZOMBIE |
X(dead) | EXIT_DEAD |