Linux Process Scheduling
1 Scheduling Classes
The Linux kernel implements four scheduling classes. From highest to lowest priority they are:
Scheduling class | Name | Priority range
---|---|---
stop_sched_class | stop class | -
rt_sched_class | real-time class | 0-99
fair_sched_class | completely fair scheduling class (CFS) | 100-139
idle_sched_class | idle class | -
The scheduler first tries to pick a task from the stop class; if the stop class has no runnable task, it tries the real-time class, and so on. This can be seen in the pick_next_task function in kernel/sched/core.c.
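Schematically, pick_next_task just walks the scheduling classes from highest to lowest priority and returns the first runnable task it finds. A simplified sketch (not the verbatim kernel source; the fair-class fast path and RETRY_TASK handling are omitted):

```c
/* Simplified sketch of the core loop in pick_next_task(), kernel/sched/core.c.
 * for_each_class() iterates stop -> rt -> fair -> idle; the idle class always
 * has a runnable task, so the loop always returns something. */
for_each_class(class) {
	p = class->pick_next_task(rq, prev);
	if (p)
		return p;
}
```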
The other scheduling classes are fairly simple; here we only discuss the completely fair scheduling class (CFS).
2 Scheduling Latency and Minimum Granularity
The completely fair scheduling class uses a dynamic time-slice algorithm to hand out CPU time to each process. The scheduling latency is the interval between two consecutive runs of any runnable process, i.e. the period within which every runnable process should run once. For example, with a scheduling latency of 20 ms and two runnable processes, each one runs for 10 ms; with four processes, each one runs for 5 ms. The scheduling latency is called sysctl_sched_latency and is exposed in /proc/sys/kernel/sched_latency_ns, in nanoseconds.
If there are many processes, each one may run only very briefly each time, and a lot of time is wasted on scheduling itself. The concept of a minimum scheduling granularity is therefore introduced: unless a process blocks or voluntarily gives up the CPU, it runs for at least this long. The minimum granularity is called sysctl_sched_min_granularity and is exposed in /proc/sys/kernel/sched_min_granularity_ns, in nanoseconds.
\[ sched\_nr\_latency=\frac{sysctl\_sched\_latency}{sysctl\_sched\_min\_granularity} \]
This ratio is the maximum number of tasks allowed to run within one scheduling latency. If the number of runnable processes does not exceed sched_nr_latency, the scheduling period equals the scheduling latency. If it does exceed sched_nr_latency, the latency target is abandoned in favour of the minimum granularity, and the scheduling period becomes the minimum granularity multiplied by the number of runnable processes. This can be seen in the function that computes the scheduling period in kernel/sched/fair.c.
```c
/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
	if (unlikely(nr_running > sched_nr_latency))
		return nr_running * sysctl_sched_min_granularity;
	else
		return sysctl_sched_latency;
}
```
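As a quick sanity check of the formula: with sysctl_sched_latency = 20 ms and sysctl_sched_min_granularity = 4 ms (illustrative values, not the kernel defaults), sched_nr_latency is 5. With 3 runnable processes the period stays at 20 ms, but with 8 runnable processes it stretches to 8 × 4 ms = 32 ms, so every process still gets at least the minimum granularity.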
3 Process Weights
By assigning each process a weight, the running time of each process can be computed: \[ runtime = period \times \frac{weight}{\sum weight} \]
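For example, if a 20 ms period is shared by a nice-0 process (weight 1024) and a nice-5 process (weight 335, both values taken from the table below):

\[ runtime_{0} = 20\,ms \times \frac{1024}{1024+335} \approx 15.1\,ms, \qquad runtime_{5} = 20\,ms \times \frac{335}{1024+335} \approx 4.9\,ms \]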
Every process in Linux has a nice value in the range [-20, 19]; the higher the nice value, the "nicer" (more yielding) the process and the lower its priority. Because the kernel cannot do floating-point arithmetic, kernel/sched/core.c defines a precomputed mapping between nice values and weights. The rule this mapping follows is given in the comment.
```c
/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

/*
 * Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
 *
 * In cases where the weight does not change often, we can use the
 * precalculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
```
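To see the "10% per nice level" rule in these numbers, here is a small stand-alone user-space check (purely illustrative, not kernel code) that computes the CPU share of two CPU-bound tasks one nice level apart, using a few entries copied from sched_prio_to_weight[]:

```c
#include <stdio.h>

int main(void)
{
	int weight[] = { 1024, 820, 655, 526 };	/* sched_prio_to_weight[] for nice 0..3 */

	for (int i = 0; i < 3; i++) {
		/* share of each task when only these two tasks are runnable */
		double share_a = (double)weight[i]     / (weight[i] + weight[i + 1]);
		double share_b = (double)weight[i + 1] / (weight[i] + weight[i + 1]);
		printf("nice %d vs nice %d: %.1f%% vs %.1f%%\n",
		       i, i + 1, share_a * 100.0, share_b * 100.0);
	}
	return 0;
}
```

Every adjacent pair splits roughly 55.5% / 44.5%, i.e. a ratio of about 1.25, which is exactly the multiplier mentioned in the comment above.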
Linux provides the getpriority and setpriority functions to read and modify the nice value.
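A minimal user-space example (error handling kept short) that reads the current nice value and then lowers the process's priority to nice 10:

```c
#include <sys/resource.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
	/* getpriority() can legitimately return -1, so clear errno first */
	errno = 0;
	int nice_val = getpriority(PRIO_PROCESS, 0);	/* 0 = calling process */
	if (nice_val == -1 && errno != 0) {
		perror("getpriority");
		return 1;
	}
	printf("current nice: %d\n", nice_val);

	if (setpriority(PRIO_PROCESS, 0, 10) == -1) {	/* lower priority to nice 10 */
		perror("setpriority");
		return 1;
	}
	printf("new nice: %d\n", getpriority(PRIO_PROCESS, 0));
	return 0;
}
```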
4 Time Slices and Virtual Runtime
In Linux every CPU has its own run queue. If the queue contains several runnable processes, how is the one that gets the CPU chosen?
The idea of completely fair scheduling is to give all processes as close to the same amount of running time as possible: each time, the process that has so far run the least is picked. Since priorities exist, Linux uses a weighted running time as the yardstick. This weighted running time is called the virtual runtime (vruntime), while the real running time is called sum_exec_runtime.
\[ vruntime = sum\_exec\_runtime \times \frac{NICE\_0\_LOAD}{weight} \]
NICE_0_LOAD is the weight of a nice-0 process, i.e. 1024. At every scheduling decision, the process with the smallest vruntime is chosen.
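For example, if a nice-0 process (weight 1024) and a nice -5 process (weight 3121) each run for 10 ms of real time, the nice-0 process's vruntime grows by 10 ms while the nice -5 process's grows by only \(10\,ms \times 1024/3121 \approx 3.3\,ms\); the higher-priority process therefore reaches the front of the queue again much sooner.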
The scheduling entity is defined in include/linux/sched.h. It also contains fields for group scheduling, which is covered later.
```c
struct sched_entity {
	struct load_weight	load;		/* for load-balancing */
	struct rb_node		run_node;
	struct list_head	group_node;
	unsigned int		on_rq;

	u64			exec_start;
	u64			sum_exec_runtime;
	u64			vruntime;
	u64			prev_sum_exec_runtime;

	u64			nr_migrations;

#ifdef CONFIG_SCHEDSTATS
	struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
	int			depth;
	struct sched_entity	*parent;
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq		*cfs_rq;
	/* rq "owned" by this entity/group: */
	struct cfs_rq		*my_q;
#endif

#ifdef CONFIG_SMP
	/*
	 * Per entity load average tracking.
	 *
	 * Put into separate cache line so it does not
	 * collide with read-mostly values above.
	 */
	struct sched_avg	avg ____cacheline_aligned_in_smp;
#endif
};
```
The functions of the completely fair scheduling class are defined in kernel/sched/fair.c. Among them, sched_slice computes the real running time a process should get in the current scheduling period.
```c
/*
 * delta_exec * weight / lw.weight
 *   OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);
	int shift = WMULT_SHIFT;

	__update_inv_weight(lw);

	if (unlikely(fact >> 32)) {
		while (fact >> 32) {
			fact >>= 1;
			shift--;
		}
	}

	/* hint to use a 32x32->64 mul */
	fact = (u64)(u32)fact * lw->inv_weight;

	while (fact >> 32) {
		fact >>= 1;
		shift--;
	}

	return mul_u64_u32_shr(delta_exec, fact, shift);
}

/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
	if (unlikely(nr_running > sched_nr_latency))
		return nr_running * sysctl_sched_min_granularity;
	else
		return sysctl_sched_latency;
}

/*
 * We calculate the wall-time slice from the period by taking a part
 * proportional to the weight.
 *
 * s = p*P[w/rw]
 */
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

	for_each_sched_entity(se) {
		struct load_weight *load;
		struct load_weight lw;

		cfs_rq = cfs_rq_of(se);
		load = &cfs_rq->load;

		if (unlikely(!se->on_rq)) {
			lw = cfs_rq->load;

			update_load_add(&lw, se->load.weight);
			load = &lw;
		}
		slice = __calc_delta(slice, se->load.weight, load);
	}
	return slice;
}
```
The kernel periodically checks, against the value computed by sched_slice, whether the process has used up its time slice. If it has, a preemption should take place.
Each scheduling class implements an update_curr function to update its runtime statistics. These statistics include the run queue's minimum virtual runtime (min_vruntime); its role is discussed below. The kernel keeps the scheduling entities in a red-black tree, and the entity with the smallest virtual runtime is always the leftmost node of that tree.
```c
/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	curr->exec_start = now;

	schedstat_set(curr->statistics.exec_max,
		      max(delta_exec, curr->statistics.exec_max));

	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq, exec_clock, delta_exec);

	curr->vruntime += calc_delta_fair(delta_exec, curr);
	update_min_vruntime(cfs_rq);

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
		cpuacct_charge(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}

	account_cfs_rq_runtime(cfs_rq, delta_exec);
}
```
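The effect of always picking the smallest vruntime can be reproduced with a tiny user-space simulation (purely illustrative, none of this is kernel code): three CPU-bound tasks, two at nice 0 and one at nice 5, run in 1 ms quanta and accrue vruntime as delta × NICE_0_LOAD / weight, just like update_curr does via calc_delta_fair.

```c
#include <stdio.h>

#define NICE_0_LOAD 1024
#define NTASKS      3

int main(void)
{
	unsigned long weight[NTASKS] = { 1024, 1024, 335 };	/* nice 0, 0, 5 */
	unsigned long long vrun[NTASKS] = { 0, 0, 0 };		/* vruntime, in us */
	unsigned long runtime[NTASKS] = { 0, 0, 0 };		/* real run time, in ms */

	for (int tick = 0; tick < 3000; tick++) {		/* simulate 3 s of CPU time */
		int next = 0;

		for (int i = 1; i < NTASKS; i++)		/* pick the smallest vruntime */
			if (vrun[i] < vrun[next])
				next = i;

		runtime[next] += 1;				/* "run" it for 1 ms */
		vrun[next] += 1000ULL * NICE_0_LOAD / weight[next];
	}

	for (int i = 0; i < NTASKS; i++)
		printf("task %d (weight %4lu): %4lu ms\n", i, weight[i], runtime[i]);
	return 0;
}
```

The two nice-0 tasks end up with roughly 1290 ms each and the nice-5 task with about 420 ms, i.e. CPU time proportional to weight, which is exactly what the vruntime bookkeeping achieves.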
5 Periodic Scheduling
The kernel uses a periodic tick to check whether the current process has used up its time slice and whether a preemption should therefore be triggered. Every scheduling class implements a task_tick function; for the completely fair scheduling class it is task_tick_fair, also defined in kernel/sched/fair.c. On each timer interrupt, tick_handle_periodic is called first and eventually invokes the scheduling class's task_tick. task_tick only checks whether preemption is needed; if so, it sets the need_resched flag to tell the kernel to call schedule as soon as possible. No actual process switch happens inside task_tick. When the interrupt has been handled, the kernel checks need_resched and, if it is set, calls schedule to perform the switch.
```c
/*
 * scheduler tick hitting a task of our scheduling class:
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		entity_tick(cfs_rq, se, queued);
	}

	if (static_branch_unlikely(&sched_numa_balancing))
		task_tick_numa(rq, curr);
}

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);

	/*
	 * Ensure that runnable average is periodically updated.
	 */
	update_load_avg(curr, 1);
	update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
	/*
	 * queued ticks are scheduled to match the slice, so don't bother
	 * validating it and just reschedule.
	 */
	if (queued) {
		resched_curr(rq_of(cfs_rq));
		return;
	}
	/*
	 * don't let the period tick interfere with the hrtick preemption
	 */
	if (!sched_feat(DOUBLE_TICK) &&
			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
		return;
#endif

	/*
	 * If there is more than one runnable task, check whether the
	 * current task should be preempted.
	 */
	if (cfs_rq->nr_running > 1)
		check_preempt_tick(cfs_rq, curr);
}
```
The check_preempt_tick function checks whether preemption should happen; if so, it sets the need_resched flag so that the current process gets preempted.
```c
/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;
	struct sched_entity *se;
	s64 delta;

	/* the time slice the task is entitled to in this period */
	ideal_runtime = sched_slice(cfs_rq, curr);
	/* how long the task has already run in this period */
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	/* if it has run longer than its slice, a reschedule is needed */
	if (delta_exec > ideal_runtime) {
		/* resched_curr sets the need_resched flag */
		resched_curr(rq_of(cfs_rq));
		/*
		 * The current task ran long enough, ensure it doesn't get
		 * re-elected due to buddy favours.
		 */
		clear_buddies(cfs_rq, curr);
		return;
	}

	/*
	 * Ensure that a task that missed wakeup preemption by a
	 * narrow margin doesn't have to wait for a full slice.
	 * This also mitigates buddy induced latencies under load.
	 */
	if (delta_exec < sysctl_sched_min_granularity)
		return;

	se = __pick_first_entity(cfs_rq);
	delta = curr->vruntime - se->vruntime;

	if (delta < 0)
		return;

	if (delta > ideal_runtime)
		resched_curr(rq_of(cfs_rq));
}
```
6 New Processes and Wakeups
When a new process is created, how should it be scheduled? If the new process started with a vruntime of 0, it would keep a scheduling advantage for a long time, which is clearly unreasonable. kernel/sched/core.c defines sched_fork to handle the creation of a new process.
```c
/*
 * fork()/clone()-time setup:
 */
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
	unsigned long flags;
	int cpu = get_cpu();

	__sched_fork(clone_flags, p);
	/*
	 * We mark the process as NEW here. This guarantees that
	 * nobody will actually run it, and a signal or other external
	 * event cannot wake it up and insert it on the runqueue either.
	 */
	p->state = TASK_NEW;

	/*
	 * Make sure we do not leak PI boosting priority to the child.
	 */
	p->prio = current->normal_prio;

	/*
	 * Revert to default priority/policy on fork if requested.
	 */
	if (unlikely(p->sched_reset_on_fork)) {
		if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
			p->policy = SCHED_NORMAL;
			p->static_prio = NICE_TO_PRIO(0);
			p->rt_priority = 0;
		} else if (PRIO_TO_NICE(p->static_prio) < 0)
			p->static_prio = NICE_TO_PRIO(0);

		p->prio = p->normal_prio = __normal_prio(p);
		set_load_weight(p);

		/*
		 * We don't need the reset flag anymore after the fork. It has
		 * fulfilled its duty:
		 */
		p->sched_reset_on_fork = 0;
	}

	if (dl_prio(p->prio)) {
		put_cpu();
		return -EAGAIN;
	} else if (rt_prio(p->prio)) {
		p->sched_class = &rt_sched_class;
	} else {
		p->sched_class = &fair_sched_class;
	}

	init_entity_runnable_average(&p->se);

	/*
	 * The child is not yet in the pid-hash so no cgroup attach races,
	 * and the cgroup is pinned to this child due to cgroup_fork()
	 * is ran before sched_fork().
	 *
	 * Silence PROVE_RCU.
	 */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	/*
	 * We're setting the cpu for the first time, we don't migrate,
	 * so use __set_task_cpu().
	 */
	__set_task_cpu(p, cpu);
	if (p->sched_class->task_fork)
		p->sched_class->task_fork(p);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#ifdef CONFIG_SCHED_INFO
	if (likely(sched_info_on()))
		memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
	p->on_cpu = 0;
#endif
	init_task_preempt_count(p);
#ifdef CONFIG_SMP
	plist_node_init(&p->pushable_tasks, MAX_PRIO);
	RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif

	put_cpu();
	return 0;
}
```
Here, the task_fork hook of the completely fair scheduling class is task_fork_fair in kernel/sched/fair.c.
```c
/*
 * called on fork with the child task as argument from the parent's context
 *  - child not yet on the tasklist
 *  - preemption disabled
 */
static void task_fork_fair(struct task_struct *p)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se, *curr;
	struct rq *rq = this_rq();

	raw_spin_lock(&rq->lock);
	update_rq_clock(rq);

	cfs_rq = task_cfs_rq(current);
	curr = cfs_rq->curr;
	if (curr) {
		update_curr(cfs_rq);
		se->vruntime = curr->vruntime;
	}
	/* adjust the child's virtual runtime */
	place_entity(cfs_rq, se, 1);

	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
		/*
		 * Upon rescheduling, sched_class::put_prev_task() will place
		 * 'current' within the tree based on its new key value.
		 */
		swap(curr->vruntime, se->vruntime);
		resched_curr(rq);
	}

	se->vruntime -= cfs_rq->min_vruntime;
	raw_spin_unlock(&rq->lock);
}

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	/*
	 * The 'current' period is already promised to the current tasks,
	 * however the extra weight of the new task will slow them down a
	 * little, place the new task so that it fits in the slot that
	 * stays open at the end.
	 */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);

	/* sleeps up to a single latency don't count. */
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		/*
		 * Halve their sleep time's effect, to allow
		 * for a gentler effect of sleepers:
		 */
		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;
	}

	/* ensure we never gain time by being placed backwards. */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}
```
If START_DEBIT is not enabled, the child's vruntime becomes the larger of the parent's vruntime and the CFS run queue's min_vruntime (place_entity never moves an entity backwards in virtual time). If START_DEBIT is set, the baseline is additionally increased by one virtual time slice (sched_vslice), penalising the newly created process.
Note the line that tests sysctl_sched_child_runs_first: setting /proc/sys/kernel/sched_child_runs_first to 1 makes the child get scheduled before the parent; with 0, the parent goes first. This is only a preference, not a guarantee.
Now look at this line:
se->vruntime -= cfs_rq->min_vruntime;
On multiprocessor systems the newly created process does not necessarily end up on the same CPU as its parent, and the min_vruntime values of the two run queues may differ considerably. To narrow that gap, the source CPU run queue's min_vruntime is subtracted before the migration, and the destination CPU run queue's min_vruntime is added back afterwards. The value is added back when the task is enqueued: enqueue_task corresponds to enqueue_task_fair in the completely fair scheduling class.
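The place where it is added back is enqueue_entity(), called from enqueue_task_fair(). Abridged, and noting that the exact flag check varies between kernel versions, the relevant part looks roughly like this:

```c
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * Re-normalize: add this runqueue's min_vruntime back to the
	 * vruntime that was made queue-relative when the task left its
	 * old runqueue (or in task_fork_fair() above).
	 */
	if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
		se->vruntime += cfs_rq->min_vruntime;

	update_curr(cfs_rq);
	/* ... statistics, place_entity() for woken tasks, rbtree insertion ... */
}
```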
try_to_wake_up, defined in kernel/sched/core.c, is responsible for waking up sleeping processes; this path also goes through enqueue_task_fair. In place_entity you can see that when initial is 0, i.e. the entity is being woken up, its vruntime is placed at the queue's min_vruntime minus one scheduling latency (or half of it when GENTLE_FAIR_SLEEPERS is enabled), subject to the max_vruntime clamp so that a task never gains time by being placed backwards.
In both cases, wakeup via try_to_wake_up and fork, check_preempt_wakeup is eventually called to check whether the woken or newly created process should preempt the current one.
```c
/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
	struct task_struct *curr = rq->curr;
	struct sched_entity *se = &curr->se, *pse = &p->se;
	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
	int scale = cfs_rq->nr_running >= sched_nr_latency;
	int next_buddy_marked = 0;

	if (unlikely(se == pse))
		return;

	/*
	 * This is possible from callers such as attach_tasks(), in which we
	 * unconditionally check_prempt_curr() after an enqueue (which may have
	 * lead to a throttle).  This both saves work and prevents false
	 * next-buddy nomination below.
	 */
	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
		return;

	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
		set_next_buddy(pse);
		next_buddy_marked = 1;
	}

	/*
	 * We can come here with TIF_NEED_RESCHED already set from new task
	 * wake up path.
	 *
	 * Note: this also catches the edge-case of curr being in a throttled
	 * group (e.g. via set_curr_task), since update_curr() (in the
	 * enqueue of curr) will have resulted in resched being set.  This
	 * prevents us from potentially nominating it as a false LAST_BUDDY
	 * below.
	 */
	if (test_tsk_need_resched(curr))
		return;

	/* Idle tasks are by definition preempted by non-idle tasks. */
	if (unlikely(curr->policy == SCHED_IDLE) &&
	    likely(p->policy != SCHED_IDLE))
		goto preempt;

	/*
	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
	 * is driven by the tick):
	 */
	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
		return;

	find_matching_se(&se, &pse);
	update_curr(cfs_rq_of(se));
	BUG_ON(!pse);
	if (wakeup_preempt_entity(se, pse) == 1) {
		/*
		 * Bias pick_next to pick the sched entity that is
		 * triggering this preemption.
		 */
		if (!next_buddy_marked)
			set_next_buddy(pse);
		goto preempt;
	}

	return;

preempt:
	resched_curr(rq);
	/*
	 * Only set the backward buddy when the current task is still
	 * on the rq. This can happen when a wakeup gets interleaved
	 * with schedule on the ->pre_schedule() or idle_balance()
	 * point, either of which can * drop the rq lock.
	 *
	 * Also, during early boot the idle thread is in the fair class,
	 * for obvious reasons its a bad idea to schedule back to it.
	 */
	if (unlikely(!se->on_rq || curr == rq->idle))
		return;

	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
		set_last_buddy(se);
}
```
7 Group Scheduling in the Completely Fair Scheduling Class
Linux implements cgroups as a file system that can be mounted. Most distributions already have it mounted; the following command shows the mounts:
```
[root@localhost ~]# mount -t cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
```
If it is not mounted, you can mount it yourself:
```
mkdir cgroup
mount -t tmpfs cgroup_root ./cgroup
mkdir cgroup/cpuset
mount -t cgroup -ocpuset cpuset ./cgroup/cpuset/
mkdir cgroup/cpu
mount -t cgroup -ocpu cpu ./cgroup/cpu/
mkdir cgroup/memory
mount -t cgroup -omemory memory ./cgroup/memory/
```
For detailed usage, refer to the cgroup documentation.
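As a minimal sketch (using the ./cgroup/cpu mount point from the commands above; the group name mygroup is made up for the example), creating a group, halving its CPU weight and moving the current shell into it looks like this:

```
mkdir cgroup/cpu/mygroup
echo 512 > cgroup/cpu/mygroup/cpu.shares   # default is 1024, so this group gets half the weight
echo $$ > cgroup/cpu/mygroup/tasks         # move the current shell into the group
```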
8 Scheduling Policies for Real-Time Processes
Real-time processes use the first-in-first-out SCHED_FIFO policy or the round-robin SCHED_RR policy. Normal processes use the SCHED_OTHER policy, i.e. the processes with nice values in the range -20 to 19 discussed above. A process can also be set to SCHED_BATCH, but this is not a real-time policy; since the \(O(1)\) scheduler it behaves almost the same as SCHED_OTHER. The SCHED_IDLE policy uses a very low weight of 3, even lower than the weight 15 of a nice-19 process.
The scheduling policy and priority can be set with sched_setscheduler.
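A minimal example (it needs root or CAP_SYS_NICE, otherwise it fails with EPERM) that switches the calling process to SCHED_FIFO with real-time priority 10:

```c
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };

	/* pid 0 means the calling process */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("policy is now %d (SCHED_FIFO = %d)\n",
	       sched_getscheduler(0), SCHED_FIFO);
	return 0;
}
```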
If you want normal processes to still get a little CPU time while real-time processes are runnable, instead of waiting until all real-time work has finished, you can tune two sysctls: kernel.sched_rt_period_us and kernel.sched_rt_runtime_us. By default they are 1000000 and 950000 respectively, i.e. real-time tasks may consume at most 95% of each one-second period, leaving 5% for normal tasks.