Linux Process Scheduling

Table of Contents

1 Scheduling Classes
2 Scheduling Latency and Minimum Scheduling Granularity
3 Process Weights
4 Time Slices and Virtual Runtime
5 Periodic Scheduling
6 New Processes and Wakeups
7 Group Scheduling in CFS
8 Scheduling Policies for Real-Time Processes

1 Scheduling Classes

The Linux kernel implements five scheduling classes. From highest to lowest priority they are:

Scheduling class    Name                           Priority
stop_sched_class    stop class                     -
dl_sched_class      deadline class                 -
rt_sched_class      real-time class                0-99
fair_sched_class    completely fair class (CFS)    100-139
idle_sched_class    idle class                     -

The scheduler first tries to pick a process from the stop class; if it finds no runnable process there, it moves on to the next class, and so on down the list. This can be seen in the pick_next_task function in kernel/sched/core.c.
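
The loop inside pick_next_task is short. Here is a simplified sketch of its class iteration, with the caveat that the real function also contains a fast path that jumps straight to CFS when only fair tasks are runnable:

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
        const struct sched_class *class;
        struct task_struct *p;

        /* try each class in priority order, highest first */
        for_each_class(class) {
                p = class->pick_next_task(rq, prev);
                if (p)
                        return p;
        }

        BUG(); /* unreachable: the idle class always has a task */
}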

The other scheduling classes are fairly simple; this article covers only the completely fair scheduling class (CFS).

2 Scheduling Latency and Minimum Scheduling Granularity

CFS uses a dynamic time-slice algorithm to hand out CPU time. The scheduling latency is the interval within which every runnable process should run once, i.e. the time between two consecutive runs of any runnable process. For example, with a scheduling latency of 20 milliseconds and two runnable processes, each may run for 10 milliseconds; with four processes, each gets 5 milliseconds. The scheduling latency is called sysctl_sched_latency and is exposed, in nanoseconds, in /proc/sys/kernel/sched_latency_ns.

With many runnable processes, each run may become so short that a large share of the time is wasted on scheduling itself. The concept of a minimum scheduling granularity was therefore introduced: unless a process blocks or voluntarily yields the CPU, it runs for at least the minimum granularity. The minimum granularity is called sysctl_sched_min_granularity and is exposed, in nanoseconds, in /proc/sys/kernel/sched_min_granularity_ns.

\[ sched\_nr\_latency=\frac{sysctl\_sched\_latency}{sysctl\_sched\_min\_granularity} \]

This ratio is the maximum number of runnable processes allowed within one scheduling latency (with the common defaults of 6 ms latency and 0.75 ms granularity, sched_nr_latency is 8). If there are fewer runnable processes than sched_nr_latency, the scheduling period equals the scheduling latency. If there are more, the system disregards the scheduling latency and guarantees the minimum granularity instead: the period becomes the minimum granularity multiplied by the number of runnable processes. This is visible in the function that computes the scheduling period in kernel/sched/fair.c.

/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
        if (unlikely(nr_running > sched_nr_latency))
                return nr_running * sysctl_sched_min_granularity;
        else
                return sysctl_sched_latency;
}

3 Process Weights

By assigning each process a weight, its runtime within a period can be computed: \[ runtime = period \times \frac{weight}{\sum weight} \]

Every Linux process has a nice value in the range [-20, 19]. A higher nice value means the process is "nicer", i.e. more willing to yield, and hence lower in priority. Because the kernel cannot do floating-point arithmetic, kernel/sched/core.c defines a precomputed mapping from nice values to weights; the rule the mapping follows is given in the comment. For example, a nice-0 task (weight 1024) running alongside a nice-1 task (weight 820) gets 1024/(1024+820) ≈ 55% of the CPU versus 45%, the roughly 10% relative difference the comment describes.

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

/*
 * Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
 *
 * In cases where the weight does not change often, we can use the
 * precalculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

Linux provides the getpriority and setpriority calls for reading and modifying the nice value.
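
A minimal user-space sketch of the two calls (PRIO_PROCESS with who == 0 targets the calling process; getpriority may legitimately return -1, so errno must be cleared first):

#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        errno = 0;
        int old = getpriority(PRIO_PROCESS, 0);   /* current nice value */
        if (old == -1 && errno != 0) {
                perror("getpriority");
                return 1;
        }

        /* raise nice to 10: per the table above, the weight drops to 110 */
        if (setpriority(PRIO_PROCESS, 0, 10) == -1)
                perror("setpriority");

        printf("nice: %d -> %d\n", old, getpriority(PRIO_PROCESS, 0));
        return 0;
}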

4 Time Slices and Virtual Runtime

In Linux, every CPU has its own run queue. If the queue holds several runnable processes, how is the one to get the CPU chosen?

The idea of completely fair scheduling is to give all processes as close to equal runtime as possible, so each decision picks the process in the queue that has run for the least time so far. Because priorities exist, Linux uses a weighted runtime as the yardstick. This weighted runtime is called the virtual runtime (vruntime), while the real accumulated runtime is sum_exec_runtime.

\[ vruntime = sum\_exec\_runtime \times \frac{NICE\_0\_LOAD}{weight} \]

NICE_0_LOAD is the weight of a nice-0 process, i.e. 1024. Each scheduling decision picks the process with the smallest vruntime. For a nice-0 task, vruntime advances at wall-clock speed; for a nice-5 task (weight 335) it advances roughly three times as fast, so that task is picked correspondingly less often.

include/linux/sched.h defines the scheduling entity. It also contains group-scheduling fields, which are discussed later.

struct sched_entity {
        struct load_weight      load;           /* for load-balancing */
        struct rb_node          run_node;
        struct list_head        group_node;
        unsigned int            on_rq;

        u64                     exec_start;
        u64                     sum_exec_runtime;
        u64                     vruntime;
        u64                     prev_sum_exec_runtime;

        u64                     nr_migrations;

#ifdef CONFIG_SCHEDSTATS
        struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
        int                     depth;
        struct sched_entity     *parent;
        /* rq on which this entity is (to be) queued: */
        struct cfs_rq           *cfs_rq;
        /* rq "owned" by this entity/group: */
        struct cfs_rq           *my_q;
#endif

#ifdef CONFIG_SMP
        /*
         * Per entity load average tracking.
         *
         * Put into separate cache line so it does not
         * collide with read-mostly values above.
         */
        struct sched_avg        avg ____cacheline_aligned_in_smp;
#endif
};

kernel/sched/fair.c defines the CFS functions. Among them, sched_slice computes the real runtime a process should be granted within the current scheduling period.

/*
 * delta_exec * weight / lw.weight
 *   OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
        u64 fact = scale_load_down(weight);
        int shift = WMULT_SHIFT;

        __update_inv_weight(lw);

        if (unlikely(fact >> 32)) {
                while (fact >> 32) {
                        fact >>= 1;
                        shift--;
                }
        }

        /* hint to use a 32x32->64 mul */
        fact = (u64)(u32)fact * lw->inv_weight;

        while (fact >> 32) {
                fact >>= 1;
                shift--;
        }

        return mul_u64_u32_shr(delta_exec, fact, shift);
}

/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
        if (unlikely(nr_running > sched_nr_latency))
                return nr_running * sysctl_sched_min_granularity;
        else
                return sysctl_sched_latency;
}

/*
 * We calculate the wall-time slice from the period by taking a part
 * proportional to the weight.
 *
 * s = p*P[w/rw]
 */
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

        for_each_sched_entity(se) {
                struct load_weight *load;
                struct load_weight lw;

                cfs_rq = cfs_rq_of(se);
                load = &cfs_rq->load;

                if (unlikely(!se->on_rq)) {
                        lw = cfs_rq->load;

                        update_load_add(&lw, se->load.weight);
                        load = &lw;
                }
                slice = __calc_delta(slice, se->load.weight, load);
        }
        return slice;
}

The kernel periodically checks, against the value computed by sched_slice, whether the current process has used up its time slice. If it has, a preemption should take place.

Every scheduling class implements an update_curr function that updates the runtime statistics. These statistics include the queue's minimum virtual runtime (min_vruntime), whose role is discussed later. The kernel keeps the runnable entities in a red-black tree ordered by vruntime, so the entity with the smallest vruntime is always the leftmost node; a lookup sketch follows, and then update_curr itself.
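
The leftmost lookup is cheap because CFS caches the leftmost node. In kernels of this vintage, __pick_first_entity in kernel/sched/fair.c reads as follows (newer kernels keep the same idea behind an rb_root_cached):

static struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
        struct rb_node *left = cfs_rq->rb_leftmost;  /* cached leftmost node */

        if (!left)
                return NULL;

        return rb_entry(left, struct sched_entity, run_node);
}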

/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;
        u64 now = rq_clock_task(rq_of(cfs_rq));
        u64 delta_exec;

        if (unlikely(!curr))
                return;

        delta_exec = now - curr->exec_start;
        if (unlikely((s64)delta_exec <= 0))
                return;

        curr->exec_start = now;

        schedstat_set(curr->statistics.exec_max,
                      max(delta_exec, curr->statistics.exec_max));

        curr->sum_exec_runtime += delta_exec;
        schedstat_add(cfs_rq, exec_clock, delta_exec);

        curr->vruntime += calc_delta_fair(delta_exec, curr);
        update_min_vruntime(cfs_rq);

        if (entity_is_task(curr)) {
                struct task_struct *curtask = task_of(curr);

                trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
                cpuacct_charge(curtask, delta_exec);
                account_group_exec_runtime(curtask, delta_exec);
        }

        account_cfs_rq_runtime(cfs_rq, delta_exec);
}
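
The calc_delta_fair call above is what applies the vruntime formula from this section: it scales delta_exec by NICE_0_LOAD/weight through the __calc_delta shown earlier, with a fast path for nice-0 tasks. In the same file it is essentially:

/*
 * delta /= w
 */
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
        if (unlikely(se->load.weight != NICE_0_LOAD))
                delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

        return delta;
}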

5 Periodic Scheduling

A periodic task checks whether the current process has exhausted its time slice, to decide whether a preemption should be initiated. Every scheduling class implements a task_tick function; for the completely fair scheduling class it is task_tick_fair, also defined in kernel/sched/fair.c. On each timer interrupt, tick_handle_periodic runs first and eventually calls the scheduling class's task_tick. task_tick checks whether a preemption is due and, if so, sets the need_resched flag to tell the kernel to call schedule as soon as possible. task_tick never performs the actual process switch; it only sets the flag. Once interrupt handling finishes, the kernel checks need_resched and, if it is set, calls schedule to perform the switch.

/*
 * scheduler tick hitting a task of our scheduling class:
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);
        }

        if (static_branch_unlikely(&sched_numa_balancing))
                task_tick_numa(rq, curr);
}

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);

        /*
         * Ensure that runnable average is periodically updated.
         */
        update_load_avg(curr, 1);
        update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
        /*
         * queued ticks are scheduled to match the slice, so don't bother
         * validating it and just reschedule.
         */
        if (queued) {
                resched_curr(rq_of(cfs_rq));
                return;
        }
        /*
         * don't let the period tick interfere with the hrtick preemption
         */
        if (!sched_feat(DOUBLE_TICK) &&
                        hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
                return;
#endif
        /*
         * If more than one process is runnable, check whether
         * the current one should be preempted.
         */
        if (cfs_rq->nr_running > 1)
                check_preempt_tick(cfs_rq, curr);
}

The check_preempt_tick function decides whether a preemption is due; if so, it sets the need_resched flag so that the current process will be preempted.

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;
        struct sched_entity *se;
        s64 delta;

        /* ideal_runtime is the time slice for this scheduling period */
        ideal_runtime = sched_slice(cfs_rq, curr);
        /* delta_exec is how long the task has run since it was last picked */
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        /* if the time already run exceeds the slice, a reschedule is due */
        if (delta_exec > ideal_runtime) {
                /* resched_curr sets the need_resched flag */
                resched_curr(rq_of(cfs_rq));
                /*
                 * The current task ran long enough, ensure it doesn't get
                 * re-elected due to buddy favours.
                 */
                clear_buddies(cfs_rq, curr);
                return;
        }

        /*
         * Ensure that a task that missed wakeup preemption by a
         * narrow margin doesn't have to wait for a full slice.
         * This also mitigates buddy induced latencies under load.
         */
        if (delta_exec < sysctl_sched_min_granularity)
                return;

        se = __pick_first_entity(cfs_rq);
        delta = curr->vruntime - se->vruntime;

        if (delta < 0)
                return;

        if (delta > ideal_runtime)
                resched_curr(rq_of(cfs_rq));
}

6 New Processes and Wakeups

When a new process is created, how should it be scheduled? If it started with a vruntime of 0, it would enjoy a scheduling advantage for a long time, which is clearly unreasonable. kernel/sched/core.c defines sched_fork to handle newly created processes.

/*
 * fork()/clone()-time setup:
 */
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
        unsigned long flags;
        int cpu = get_cpu();

        __sched_fork(clone_flags, p);
        /*
         * We mark the process as NEW here. This guarantees that
         * nobody will actually run it, and a signal or other external
         * event cannot wake it up and insert it on the runqueue either.
         */
        p->state = TASK_NEW;

        /*
         * Make sure we do not leak PI boosting priority to the child.
         */
        p->prio = current->normal_prio;

        /*
         * Revert to default priority/policy on fork if requested.
         */
        if (unlikely(p->sched_reset_on_fork)) {
                if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
                        p->policy = SCHED_NORMAL;
                        p->static_prio = NICE_TO_PRIO(0);
                        p->rt_priority = 0;
                } else if (PRIO_TO_NICE(p->static_prio) < 0)
                        p->static_prio = NICE_TO_PRIO(0);

                p->prio = p->normal_prio = __normal_prio(p);
                set_load_weight(p);

                /*
                 * We don't need the reset flag anymore after the fork. It has
                 * fulfilled its duty:
                 */
                p->sched_reset_on_fork = 0;
        }

        if (dl_prio(p->prio)) {
                put_cpu();
                return -EAGAIN;
        } else if (rt_prio(p->prio)) {
                p->sched_class = &rt_sched_class;
        } else {
                p->sched_class = &fair_sched_class;
        }

        init_entity_runnable_average(&p->se);

        /*
         * The child is not yet in the pid-hash so no cgroup attach races,
         * and the cgroup is pinned to this child due to cgroup_fork()
         * is ran before sched_fork().
         *
         * Silence PROVE_RCU.
         */
        raw_spin_lock_irqsave(&p->pi_lock, flags);
        /*
         * We're setting the cpu for the first time, we don't migrate,
         * so use __set_task_cpu().
         */
        __set_task_cpu(p, cpu);
        if (p->sched_class->task_fork)
                p->sched_class->task_fork(p);
        raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#ifdef CONFIG_SCHED_INFO
        if (likely(sched_info_on()))
                memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
        p->on_cpu = 0;
#endif
        init_task_preempt_count(p);
#ifdef CONFIG_SMP
        plist_node_init(&p->pushable_tasks, MAX_PRIO);
        RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif

        put_cpu();
        return 0;
}

The task_fork hook of the completely fair scheduling class is task_fork_fair in kernel/sched/fair.c.

/*
 * called on fork with the child task as argument from the parent's context
 *  - child not yet on the tasklist
 *  - preemption disabled
 */
static void task_fork_fair(struct task_struct *p)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se, *curr;
        struct rq *rq = this_rq();

        raw_spin_lock(&rq->lock);
        update_rq_clock(rq);

        cfs_rq = task_cfs_rq(current);
        curr = cfs_rq->curr;
        if (curr) {
                update_curr(cfs_rq);
                se->vruntime = curr->vruntime;
        }
        /* place the entity: adjust the child's vruntime */
        place_entity(cfs_rq, se, 1);

        if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
                /*
                 * Upon rescheduling, sched_class::put_prev_task() will place
                 * 'current' within the tree based on its new key value.
                 */
                swap(curr->vruntime, se->vruntime);
                resched_curr(rq);
        }

        se->vruntime -= cfs_rq->min_vruntime;
        raw_spin_unlock(&rq->lock);
}

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
        u64 vruntime = cfs_rq->min_vruntime;

        /*
         * The 'current' period is already promised to the current tasks,
         * however the extra weight of the new task will slow them down a
         * little, place the new task so that it fits in the slot that
         * stays open at the end.
         */
        if (initial && sched_feat(START_DEBIT))
                vruntime += sched_vslice(cfs_rq, se);

        /* sleeps up to a single latency don't count. */
        if (!initial) {
                unsigned long thresh = sysctl_sched_latency;

                /*
                 * Halve their sleep time's effect, to allow
                 * for a gentler effect of sleepers:
                 */
                if (sched_feat(GENTLE_FAIR_SLEEPERS))
                        thresh >>= 1;

                vruntime -= thresh;
        }

        /* ensure we never gain time by being placed backwards. */
        se->vruntime = max_vruntime(se->vruntime, vruntime);
}

If START_DEBIT is not enabled, the child's vruntime becomes the larger of the parent's vruntime and the CFS run queue's min_vruntime, because place_entity finishes with max_vruntime. If START_DEBIT is set, the newly created process is additionally penalized by pushing its vruntime further forward by one virtual time slice (sched_vslice).

Note the sysctl_sched_child_runs_first line: setting /proc/sys/kernel/sched_child_runs_first to 1 lets the child be scheduled before the parent; with 0 the parent goes first. This is only a preference, not a guarantee.

Now look at this line:

se->vruntime -= cfs_rq->min_vruntime;

On multiprocessor systems, the new process does not necessarily run on the same CPU as its parent, and min_vruntime may differ considerably between run queues. To bridge that gap, the source queue's min_vruntime is subtracted before migration, and the destination queue's min_vruntime is added back after migration. The adding back happens on enqueue: the enqueue_task hook of the completely fair scheduling class is enqueue_task_fair, which enqueues the entity through enqueue_entity.
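
A trimmed excerpt of enqueue_entity in kernel/sched/fair.c (details vary across kernel versions) showing where the normalization is undone:

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        /*
         * Update the normalized vruntime before updating min_vruntime
         * through calling update_curr().
         */
        if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
                se->vruntime += cfs_rq->min_vruntime;

        /* ... statistics updates, place_entity for wakeups, rbtree insertion ... */
}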


try_to_wake_up, in kernel/sched/core.c, wakes a sleeping process and also goes through enqueue_task_fair. In place_entity you can see that when initial is 0, i.e. the entity is being woken rather than forked, the target vruntime is min_vruntime minus a whole or half scheduling latency (half when GENTLE_FAIR_SLEEPERS is enabled), a bonus for sleepers that the final max_vruntime keeps from turning into free runtime.

Whether a process was just created or woken by try_to_wake_up, the kernel eventually calls check_preempt_wakeup to check whether the woken or newly created process may preempt the current one.

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
        struct task_struct *curr = rq->curr;
        struct sched_entity *se = &curr->se, *pse = &p->se;
        struct cfs_rq *cfs_rq = task_cfs_rq(curr);
        int scale = cfs_rq->nr_running >= sched_nr_latency;
        int next_buddy_marked = 0;

        if (unlikely(se == pse))
                return;

        /*
         * This is possible from callers such as attach_tasks(), in which we
         * unconditionally check_prempt_curr() after an enqueue (which may have
         * lead to a throttle).  This both saves work and prevents false
         * next-buddy nomination below.
         */
        if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
                return;

        if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
                set_next_buddy(pse);
                next_buddy_marked = 1;
        }

        /*
         * We can come here with TIF_NEED_RESCHED already set from new task
         * wake up path.
         *
         * Note: this also catches the edge-case of curr being in a throttled
         * group (e.g. via set_curr_task), since update_curr() (in the
         * enqueue of curr) will have resulted in resched being set.  This
         * prevents us from potentially nominating it as a false LAST_BUDDY
         * below.
         */
        if (test_tsk_need_resched(curr))
                return;

        /* Idle tasks are by definition preempted by non-idle tasks. */
        if (unlikely(curr->policy == SCHED_IDLE) &&
            likely(p->policy != SCHED_IDLE))
                goto preempt;

        /*
         * Batch and idle tasks do not preempt non-idle tasks (their preemption
         * is driven by the tick):
         */
        if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
                return;

        find_matching_se(&se, &pse);
        update_curr(cfs_rq_of(se));
        BUG_ON(!pse);
        if (wakeup_preempt_entity(se, pse) == 1) {
                /*
                 * Bias pick_next to pick the sched entity that is
                 * triggering this preemption.
                 */
                if (!next_buddy_marked)
                        set_next_buddy(pse);
                goto preempt;
        }

        return;

preempt:
        resched_curr(rq);
        /*
         * Only set the backward buddy when the current task is still
         * on the rq. This can happen when a wakeup gets interleaved
         * with schedule on the ->pre_schedule() or idle_balance()
         * point, either of which can * drop the rq lock.
         *
         * Also, during early boot the idle thread is in the fair class,
         * for obvious reasons its a bad idea to schedule back to it.
         */
        if (unlikely(!se->on_rq || curr == rq->idle))
                return;

        if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
                set_last_buddy(se);
}
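
The wakeup_preempt_entity call above returns 1 only when the current entity's vruntime exceeds the waking entity's by more than a wakeup granularity (derived from sysctl_sched_wakeup_granularity and scaled by weight), which keeps wakeups from preempting too eagerly. In this kernel it is essentially:

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
        s64 gran, vdiff = curr->vruntime - se->vruntime;

        if (vdiff <= 0)                 /* current is not ahead: no preemption */
                return -1;

        gran = wakeup_gran(curr, se);   /* weight-scaled wakeup granularity */
        if (vdiff > gran)
                return 1;

        return 0;
}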

7 Group Scheduling in CFS

Linux implements cgroups as a file system that can be mounted. Most distributions mount it already; the following command shows the result:

[root@localhost ~]# mount -t cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)

If it is not mounted, you can mount it yourself:

mkdir cgroup
mount -t tmpfs cgroup_root ./cgroup
mkdir cgroup/cpuset
mount -t cgroup -ocpuset cpuset ./cgroup/cpuset/
mkdir cgroup/cpu
mount -t cgroup -ocpu cpu ./cgroup/cpu/
mkdir cgroup/memory
mount -t cgroup -omemory memory ./cgroup/memory/

For detailed usage, please refer to the documentation.
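
For group scheduling specifically, the cpu controller's cpu.shares file holds the group's weight (default 1024), and CFS divides CPU time between groups by these weights just as it divides it between tasks by their nice-derived weights. A hypothetical sketch, assuming the cgroup-v1 layout mounted above and using a group name "demo" invented for illustration:

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fputs(val, f);
        return fclose(f);
}

int main(void)
{
        char pid[32];

        /* create a group with half the default weight */
        mkdir("/sys/fs/cgroup/cpu/demo", 0755);
        write_str("/sys/fs/cgroup/cpu/demo/cpu.shares", "512\n");

        /* move the calling process into the group */
        snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
        write_str("/sys/fs/cgroup/cpu/demo/tasks", pid);

        return 0;
}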

8 Scheduling Policies for Real-Time Processes

Real-time processes use either the first-in-first-out SCHED_FIFO policy or the round-robin SCHED_RR policy. Normal processes use the SCHED_OTHER policy, i.e. the nice range [-20, 19] discussed earlier. A process may also be set to SCHED_BATCH, though it is not a real-time policy; since the \(O(1)\) scheduler it behaves almost identically to SCHED_OTHER. The SCHED_IDLE policy has a very low weight of 3, lower even than the weight 15 of a nice-19 process.

The scheduling policy and priority can be set with sched_setscheduler.
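
A minimal sketch (priority 50 is an arbitrary value in SCHED_FIFO's 1-99 range; this normally requires root or CAP_SYS_NICE):

#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 50 };

        if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {  /* 0 = this process */
                perror("sched_setscheduler");
                return 1;
        }

        printf("policy is now %d (SCHED_FIFO)\n", sched_getscheduler(0));
        return 0;
}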

If you want normal processes to receive some CPU time even while real-time processes are runnable, rather than waiting for all real-time work to finish, adjust the two sysctls kernel.sched_rt_period_us and kernel.sched_rt_runtime_us. With the usual defaults of 1,000,000 and 950,000 microseconds, real-time tasks may occupy at most 95% of each period, leaving 5% for everything else.

