Linux Process Scheduling

1. Scheduling Classes

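This article discusses four of the Linux kernel's scheduling classes (kernels since 3.14 additionally have a deadline class, dl_sched_class, which ranks between the stop class and the real-time class). From highest to lowest priority they are: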

Scheduling class    Name                          Priority
stop_sched_class    Stop class                    -
rt_sched_class      Real-time class               0-99
fair_sched_class    Completely fair class (CFS)   100-139
idle_sched_class    Idle class                    -

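The scheduler first tries to pick a task from the stop class. If that class has no runnable task, it tries the real-time class, and so on down the list. This order is visible in the pick_next_task function in kernel/sched/core.c.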
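The "first class with something runnable wins" rule can be illustrated with a small user-space sketch. This is not the kernel's pick_next_task; struct toy_class and toy_pick_class are hypothetical names introduced only for illustration, and each class is reduced to a count of runnable tasks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduling class: a name and a runnable count. */
struct toy_class {
    const char *name;
    int nr_runnable;
};

/*
 * Walk the classes from highest to lowest priority and return the first
 * one that has something to run, or NULL if every class is empty.
 */
const struct toy_class *toy_pick_class(const struct toy_class *classes, size_t n)
{
    assert(classes != NULL);
    for (size_t i = 0; i < n; i++) {
        if (classes[i].nr_runnable > 0)
            return &classes[i];
    }
    return NULL;
}
```

With the array ordered stop, rt, fair, idle, a runnable real-time task always hides the fair class, which is the behaviour the paragraph above describes.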

The other scheduling classes are comparatively simple, so this article covers only the Completely Fair Scheduler (CFS) class.

2. Scheduling Latency and Minimum Granularity

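CFS uses a dynamic time-slice algorithm to give each process its share of CPU time. The scheduling latency is the interval between two consecutive runs of any runnable process, that is, the period within which every runnable process gets to run once. For example, with a scheduling latency of 20 ms and two runnable processes of equal weight, each runs for 10 ms; with four processes, each runs for 5 ms. The kernel variable is sysctl_sched_latency, exposed in nanoseconds through /proc/sys/kernel/sched_latency_ns.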

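With many processes, each one would run only briefly each time, and a large share of CPU time would be wasted on scheduling itself. The minimum granularity addresses this: unless a process blocks or voluntarily yields the CPU, it runs for at least the minimum granularity. The kernel variable is sysctl_sched_min_granularity, exposed in nanoseconds through /proc/sys/kernel/sched_min_granularity_ns.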

\[ sched\_nr\_latency=\frac{sysctl\_sched\_latency}{sysctl\_sched\_min\_granularity} \]

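This ratio is the maximum number of runnable processes that fit within one scheduling latency. If the number of runnable processes does not exceed sched_nr_latency, the scheduling period equals the scheduling latency. If it exceeds sched_nr_latency, the kernel gives up on the scheduling latency and guarantees the minimum granularity instead; the scheduling period is then the minimum granularity multiplied by the number of runnable processes. The function in kernel/sched/fair.c that computes the scheduling period shows this: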

/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
  if (unlikely(nr_running > sched_nr_latency))
    return nr_running * sysctl_sched_min_granularity;
  else
    return sysctl_sched_latency;
}
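The same rule can be tried outside the kernel. The following is a minimal user-space sketch, not kernel code: toy_sched_period is a hypothetical name, and the latency and granularity are passed in as parameters rather than read from the sysctls.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scheduling period in nanoseconds: the latency while the runnable tasks
 * fit into it, otherwise one minimum granularity per runnable task.
 */
uint64_t toy_sched_period(unsigned long nr_running,
                          uint64_t latency_ns, uint64_t min_gran_ns)
{
    assert(min_gran_ns > 0);            /* avoid dividing by zero */
    uint64_t nr_latency = latency_ns / min_gran_ns;

    if (nr_running > nr_latency)
        return nr_running * min_gran_ns;
    return latency_ns;
}
```

For illustration, with a latency of 6 ms and a granularity of 0.75 ms the ratio is 8, so up to eight tasks share a 6 ms period, while ten tasks stretch the period to 7.5 ms.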

3. Process Weights

Giving each process a weight makes it possible to compute its run time within one scheduling period: \[ runtime = period \times \frac{weight}{\sum weight} \]
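As a worked example of this proportion, here is a small sketch in integer arithmetic. toy_slice is a hypothetical helper, and the weights used in the note below are arbitrary illustrative values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* slice_i = period * weight_i / sum(weights), in integer nanoseconds. */
uint64_t toy_slice(uint64_t period_ns, const unsigned long *weights,
                   size_t n, size_t i)
{
    uint64_t total = 0;

    assert(i < n);
    for (size_t k = 0; k < n; k++)
        total += weights[k];
    return period_ns * weights[i] / total;
}
```

With a 20 ms period and two tasks weighted 3072 and 1024, the first receives 15 ms and the second 5 ms.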

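Every Linux process has a nice value in the range [-20, 19]. The higher the nice value, the "nicer" the process is to others: it yields more readily and has lower priority. Because the kernel does not use floating-point arithmetic, kernel/sched/core.c defines a precomputed table that maps nice values to weights. The rule behind the table is given in the accompanying comment: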

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */  88761,  71755,  56483,  46273,  36291,
 /* -15 */  29154,  23254,  18705,  14949,  11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*  0 */  1024,  820,  655,  526,  423,
 /*  5 */  335,  272,  215,  172,  137,
 /*  10 */  110,  87,  70,  56,  45,
 /*  15 */  36,  29,  23,  18,  15,
};

/*
 * Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
 *
 * In cases where the weight does not change often, we can use the
 * precalculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */  48388,  59856,  76040,  92818,  118348,
 /* -15 */  147320,  184698,  229616,  287308,  360437,
 /* -10 */  449829,  563644,  704093,  875809,  1099582,
 /*  -5 */  1376151,  1717300,  2157191,  2708050,  3363326,
 /*  0 */  4194304,  5237765,  6557202,  8165337,  10153587,
 /*  5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
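The point of the second table is to replace a division by a multiplication and a shift. The sketch below illustrates the idea in user space; toy_inverse and toy_div_by_weight are hypothetical helpers, the shift of 32 follows from the "2^32/x" comment above, and the 32-bit delta is a simplification to keep the product within 64 bits.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WMULT_SHIFT 32

/* Precompute 2^32 / weight, truncated (some table entries are rounded). */
uint32_t toy_inverse(uint32_t weight)
{
    assert(weight > 1);     /* 2^32 / 1 would not fit in 32 bits */
    return (uint32_t)((1ULL << TOY_WMULT_SHIFT) / weight);
}

/* delta / weight, computed as (delta * inverse) >> 32 with no division. */
uint64_t toy_div_by_weight(uint32_t delta, uint32_t inv_weight)
{
    return ((uint64_t)delta * inv_weight) >> TOY_WMULT_SHIFT;
}
```

For weight 1024 the inverse is 4194304, which matches the nice 0 entry of sched_prio_to_wmult, and multiplying by it and shifting gives the same result as dividing by 1024.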

Linux provides the getpriority and setpriority functions to read and change the nice value.
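A minimal sketch of both calls follows. toy_get_nice and toy_be_nicer are hypothetical wrappers; the example only raises the nice value, because lowering it normally requires privilege.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/resource.h>

/* Read the calling process's nice value; returns 0 on success, -1 on error. */
int toy_get_nice(int *out)
{
    assert(out != NULL);
    errno = 0;                      /* -1 is also a legal nice value */
    int v = getpriority(PRIO_PROCESS, 0);
    if (v == -1 && errno != 0)
        return -1;
    *out = v;
    return 0;
}

/* Make the calling process one step "nicer", capped at 19. */
int toy_be_nicer(void)
{
    int cur;

    if (toy_get_nice(&cur) != 0)
        return -1;
    int target = cur < 19 ? cur + 1 : 19;
    return setpriority(PRIO_PROCESS, 0, target);
}
```

Clearing errno before getpriority matters because a return value of -1 can be either an error or a valid nice value.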

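4. Time Slices and Virtual Runtime

In Linux, each CPU has its own run queue. When the queue holds several runnable processes, which one gets the CPU?

The idea behind CFS is to give all processes, as far as possible, the same amount of run time, so it always picks the process in the queue that has run the least so far. Because processes have priorities, Linux uses a weighted run time as the yardstick. This weighted run time is called the virtual runtime (vruntime), while the actual run time is called sum_exec_runtime.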

\[ vruntime = sum\_exec\_runtime \times \frac{NICE\_0\_LOAD}{weight} \]

NICE_0_LOAD is the weight of a process with nice value 0, namely 1024. At each scheduling decision, the process with the smallest vruntime is picked.
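A small sketch makes the scaling concrete. toy_vruntime_delta is a hypothetical helper that applies the formula with plain integer division; the kernel itself uses the multiply-and-shift form shown later.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024ULL

/* Virtual time charged for delta_exec_ns of real run time at a given weight. */
uint64_t toy_vruntime_delta(uint64_t delta_exec_ns, unsigned long weight)
{
    assert(weight > 0);
    return delta_exec_ns * TOY_NICE_0_LOAD / weight;
}
```

A nice 0 task (weight 1024) is charged exactly its real run time, while a nice 5 task (weight 335) is charged roughly three times as much, so its vruntime grows faster and it is picked less often.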

include/linux/sched.h defines the scheduling entity. It includes fields for group scheduling, which is covered later.

struct sched_entity {
  struct load_weight  load;  /* for load-balancing */
  struct rb_node  run_node;
  struct list_head  group_node;
  unsigned int  on_rq;

  u64  exec_start;
  u64  sum_exec_runtime;
  u64  vruntime;
  u64  prev_sum_exec_runtime;

  u64  nr_migrations;

#ifdef CONFIG_SCHEDSTATS
  struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
  int  depth;
  struct sched_entity  *parent;
  /* rq on which this entity is (to be) queued: */
  struct cfs_rq  *cfs_rq;
  /* rq "owned" by this entity/group: */
  struct cfs_rq  *my_q;
#endif

#ifdef CONFIG_SMP
  /*
  * Per entity load average tracking.
  *
  * Put into separate cache line so it does not
  * collide with read-mostly values above.
  */
  struct sched_avg  avg ____cacheline_aligned_in_smp;
#endif
};

kernel/sched/fair.c defines the functions that implement CFS. Among them, sched_slice computes the actual run time a process should receive in the current scheduling period.

/*
 * delta_exec * weight / lw.weight
 *  OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
  u64 fact = scale_load_down(weight);
  int shift = WMULT_SHIFT;

  __update_inv_weight(lw);

  if (unlikely(fact >> 32)) {
    while (fact >> 32) {
      fact >>= 1;
      shift--;
    }
  }

  /* hint to use a 32x32->64 mul */
  fact = (u64)(u32)fact * lw->inv_weight;

  while (fact >> 32) {
    fact >>= 1;
    shift--;
  }

  return mul_u64_u32_shr(delta_exec, fact, shift);
}

/*
 * We calculate the wall-time slice from the period by taking a part
 * proportional to the weight.
 *
 * s = p*P[w/rw]
 */
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
  u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

  for_each_sched_entity(se) {
    struct load_weight *load;
    struct load_weight lw;

    cfs_rq = cfs_rq_of(se);
    load = &cfs_rq->load;

    if (unlikely(!se->on_rq)) {
      lw = cfs_rq->load;

      update_load_add(&lw, se->load.weight);
      load = &lw;
    }
    slice = __calc_delta(slice, se->load.weight, load);
  }
  return slice;
}

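The kernel periodically compares a process's run time against the value computed by sched_slice to check whether the process has used up its time slice. If it has, a preemption should take place.

A scheduling class implements an update_curr function to update run-time statistics. These statistics include the run queue's minimum virtual runtime (min_vruntime), whose role is explained later. The kernel stores scheduling entities in a red-black tree keyed by vruntime, so the process with the smallest vruntime is always the leftmost node.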

/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
  struct sched_entity *curr = cfs_rq->curr;
  u64 now = rq_clock_task(rq_of(cfs_rq));
  u64 delta_exec;

  if (unlikely(!curr))
    return;

  delta_exec = now - curr->exec_start;
  if (unlikely((s64)delta_exec <= 0))
    return;

  curr->exec_start = now;

  schedstat_set(curr->statistics.exec_max,
                max(delta_exec, curr->statistics.exec_max));

  curr->sum_exec_runtime += delta_exec;
  schedstat_add(cfs_rq, exec_clock, delta_exec);

  curr->vruntime += calc_delta_fair(delta_exec, curr);
  update_min_vruntime(cfs_rq);

  if (entity_is_task(curr)) {
    struct task_struct *curtask = task_of(curr);

    trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
    cpuacct_charge(curtask, delta_exec);
    account_group_exec_runtime(curtask, delta_exec);
  }

  account_cfs_rq_runtime(cfs_rq, delta_exec);
}
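Picking "the smallest vruntime" can be sketched without a red-black tree. The code below is an illustration only: toy_before and toy_pick_min are hypothetical names, a linear scan stands in for taking the leftmost tree node, and the signed-difference comparison is used so that the ordering survives counter wraparound (the conversion to a signed type behaves as two's complement on common compilers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe "a comes before b": look at the sign of the difference. */
int toy_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Linear scan standing in for "take the leftmost node of the tree". */
size_t toy_pick_min(const uint64_t *vruntime, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (toy_before(vruntime[i], vruntime[best]))
            best = i;
    }
    return best;
}
```

A tree makes the same choice in O(log n) for insertion and O(1) for the cached leftmost node, which is why the kernel does not scan.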

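5. Periodic Scheduling

The system uses a periodic tick to check whether the current process has used up its time slice and therefore whether a preemption should be requested. Every scheduling class implements a task_tick function; for CFS it is task_tick_fair, also defined in kernel/sched/fair.c. On each timer interrupt, tick_handle_periodic is called first, and the call chain eventually reaches the scheduling class's task_tick. task_tick checks whether a preemption is due and, if so, sets the need_resched flag to tell the kernel to call schedule as soon as possible. task_tick does not switch processes itself; it only sets the flag. When interrupt handling finishes, the kernel checks need_resched and, if it is set, calls schedule to perform the switch.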

/*
 * scheduler tick hitting a task of our scheduling class:
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
  struct cfs_rq *cfs_rq;
  struct sched_entity *se = &curr->se;

  for_each_sched_entity(se) {
    cfs_rq = cfs_rq_of(se);
    entity_tick(cfs_rq, se, queued);
  }

  if (static_branch_unlikely(&sched_numa_balancing))
    task_tick_numa(rq, curr);
}

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
  /*
   * Update run-time statistics of the 'current'.
   */
  update_curr(cfs_rq);

  /*
   * Ensure that runnable average is periodically updated.
   */
  update_load_avg(curr, 1);
  update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
  /*
   * queued ticks are scheduled to match the slice, so don't bother
   * validating it and just reschedule.
   */
  if (queued) {
    resched_curr(rq_of(cfs_rq));
    return;
  }
  /*
   * don't let the period tick interfere with the hrtick preemption
   */
  if (!sched_feat(DOUBLE_TICK) &&
      hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
    return;
#endif
  /*
   * If more than one task is runnable, check whether the
   * current task should be preempted.
   */
  if (cfs_rq->nr_running > 1)
    check_preempt_tick(cfs_rq, curr);
}

The check_preempt_tick function decides whether a preemption should happen. If so, it sets the need_resched flag so that the current process is preempted.

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
  unsigned long ideal_runtime, delta_exec;
  struct sched_entity *se;
  s64 delta;

  /* The ideal time slice for this period */
  ideal_runtime = sched_slice(cfs_rq, curr);
  /* Time the task has already run in this slice */
  delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
  /* If the task has run longer than its slice, request a reschedule */
  if (delta_exec > ideal_runtime) {
    /* resched_curr sets the need_resched flag */
    resched_curr(rq_of(cfs_rq));
    /*
     * The current task ran long enough, ensure it doesn't get
     * re-elected due to buddy favours.
     */
    clear_buddies(cfs_rq, curr);
    return;
  }

  /*
   * Ensure that a task that missed wakeup preemption by a
   * narrow margin doesn't have to wait for a full slice.
   * This also mitigates buddy induced latencies under load.
   */
  if (delta_exec < sysctl_sched_min_granularity)
    return;

  se = __pick_first_entity(cfs_rq);
  delta = curr->vruntime - se->vruntime;

  if (delta < 0)
    return;

  if (delta > ideal_runtime)
    resched_curr(rq_of(cfs_rq));
}
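The decision made by check_preempt_tick can be restated as a pure function, which makes the three cases easier to test in isolation. This is a simplified sketch: toy_should_preempt is a hypothetical name, all inputs are passed explicitly, and the buddy handling is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if the current task should be rescheduled at this tick:
 *  - it has run longer than its ideal slice, or
 *  - it has run at least the minimum granularity and is ahead of the
 *    leftmost entity by more than one ideal slice of virtual time.
 */
int toy_should_preempt(uint64_t ran_ns, uint64_t ideal_ns, uint64_t min_gran_ns,
                       uint64_t curr_vruntime, uint64_t leftmost_vruntime)
{
    if (ran_ns > ideal_ns)
        return 1;
    if (ran_ns < min_gran_ns)
        return 0;

    int64_t delta = (int64_t)(curr_vruntime - leftmost_vruntime);
    if (delta < 0)
        return 0;
    return (uint64_t)delta > ideal_ns;
}
```

The second condition lets a task that has fallen far behind in virtual time get the CPU without waiting for the current task's whole slice to expire.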

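6. Process Creation and Wake-up

How should a newly created process be scheduled? If its vruntime started at 0, it would keep a scheduling advantage for a long time, which is clearly unreasonable. kernel/sched/core.c defines sched_fork to handle newly created processes.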

/*
 * fork()/clone()-time setup:
 */
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
  unsigned long flags;
  int cpu = get_cpu();

  __sched_fork(clone_flags, p);
  /*
  * We mark the process as NEW here. This guarantees that
  * nobody will actually run it, and a signal or other external
  * event cannot wake it up and insert it on the runqueue either.
  */
  p->state = TASK_NEW;

  /*
  * Make sure we do not leak PI boosting priority to the child.
  */
  p->prio = current->normal_prio;

  /*
  * Revert to default priority/policy on fork if requested.
  */
  if (unlikely(p->sched_reset_on_fork)) {
    if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
      p->policy = SCHED_NORMAL;
      p->static_prio = NICE_TO_PRIO(0);
      p->rt_priority = 0;
    } else if (PRIO_TO_NICE(p->static_prio) < 0)
      p->static_prio = NICE_TO_PRIO(0);

    p->prio = p->normal_prio = __normal_prio(p);
    set_load_weight(p);

    /*
     * We don't need the reset flag anymore after the fork. It has
     * fulfilled its duty:
     */
    p->sched_reset_on_fork = 0;
  }

  if (dl_prio(p->prio)) {
    put_cpu();
    return -EAGAIN;
  } else if (rt_prio(p->prio)) {
    p->sched_class = &rt_sched_class;
  } else {
    p->sched_class = &fair_sched_class;
  }

  init_entity_runnable_average(&p->se);

  /*
  * The child is not yet in the pid-hash so no cgroup attach races,
  * and the cgroup is pinned to this child due to cgroup_fork()
  * is ran before sched_fork().
  *
  * Silence PROVE_RCU.
  */
  raw_spin_lock_irqsave(&p->pi_lock, flags);
  /*
  * We're setting the cpu for the first time, we don't migrate,
  * so use __set_task_cpu().
  */
  __set_task_cpu(p, cpu);
  if (p->sched_class->task_fork)
    p->sched_class->task_fork(p);
  raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#ifdef CONFIG_SCHED_INFO
  if (likely(sched_info_on()))
    memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
  p->on_cpu = 0;
#endif
  init_task_preempt_count(p);
#ifdef CONFIG_SMP
  plist_node_init(&p->pushable_tasks, MAX_PRIO);
  RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif

  put_cpu();
  return 0;
}

Here, the task_fork hook of the CFS class is task_fork_fair in kernel/sched/fair.c.

/*
 * called on fork with the child task as argument from the parent's context
 *  - child not yet on the tasklist
 *  - preemption disabled
 */
static void task_fork_fair(struct task_struct *p)
{
  struct cfs_rq *cfs_rq;
  struct sched_entity *se = &p->se, *curr;
  struct rq *rq = this_rq();

  raw_spin_lock(&rq->lock);
  update_rq_clock(rq);

  cfs_rq = task_cfs_rq(current);
  curr = cfs_rq->curr;
  if (curr) {
    update_curr(cfs_rq);
    se->vruntime = curr->vruntime;
  }
  /* Adjust the child's virtual runtime */
  place_entity(cfs_rq, se, 1);

  if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
    /*
     * Upon rescheduling, sched_class::put_prev_task() will place
     * 'current' within the tree based on its new key value.
     */
    swap(curr->vruntime, se->vruntime);
    resched_curr(rq);
  }

  se->vruntime -= cfs_rq->min_vruntime;
  raw_spin_unlock(&rq->lock);
}

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
  u64 vruntime = cfs_rq->min_vruntime;

  /*
   * The 'current' period is already promised to the current tasks,
   * however the extra weight of the new task will slow them down a
   * little, place the new task so that it fits in the slot that
   * stays open at the end.
   */
  if (initial && sched_feat(START_DEBIT))
    vruntime += sched_vslice(cfs_rq, se);

  /* sleeps up to a single latency don't count. */
  if (!initial) {
    unsigned long thresh = sysctl_sched_latency;

    /*
     * Halve their sleep time's effect, to allow
     * for a gentler effect of sleepers:
     */
    if (sched_feat(GENTLE_FAIR_SLEEPERS))
      thresh >>= 1;

    vruntime -= thresh;
  }

  /* ensure we never gain time by being placed backwards. */
  se->vruntime = max_vruntime(se->vruntime, vruntime);
}

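Without START_DEBIT, the child's virtual runtime is the larger of the parent's virtual runtime and the CFS run queue's minimum virtual runtime; place_entity takes the maximum so that a task never gains time by being placed backwards. With START_DEBIT set, the new process is penalized with a larger virtual runtime: it is placed at least one virtual time slice beyond min_vruntime.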
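The placement rule in place_entity can be restated as a user-space sketch. toy_place and toy_max_vruntime are hypothetical names; the scheduler features are passed in as plain flags, and the virtual slice and latency are supplied by the caller rather than computed.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe maximum of two virtual runtimes. */
uint64_t toy_max_vruntime(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) > 0 ? b : a;
}

/*
 * initial != 0: fork placement, optionally debited by one virtual slice.
 * initial == 0: wake-up placement, credited with a full latency or half
 *               of it when "gentle" is set.
 */
uint64_t toy_place(uint64_t se_vruntime, uint64_t min_vruntime, int initial,
                   int start_debit, uint64_t vslice,
                   uint64_t latency, int gentle)
{
    uint64_t v = min_vruntime;

    if (initial && start_debit)
        v += vslice;
    if (!initial)
        v -= gentle ? latency / 2 : latency;

    /* Never move an entity backwards in virtual time. */
    return toy_max_vruntime(se_vruntime, v);
}
```

The final maximum is what prevents a task from gaining virtual time simply by forking or by sleeping briefly.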

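Note the line that tests sysctl_sched_child_runs_first: setting /proc/sys/kernel/sched_child_runs_first to 1 lets the child be scheduled first, while 0 prefers the parent. This is only a preference, not a guarantee.

Now consider this line: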

se->vruntime -= cfs_rq->min_vruntime;

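On a multiprocessor system, a new process does not necessarily run on the same CPU as its parent, and the min_vruntime values of different run queues can differ considerably. To bridge this gap, the minimum virtual runtime of the current CPU's run queue is subtracted before migration, and the minimum virtual runtime of the destination CPU's run queue is added after migration. The vruntime is added back in enqueue_task, which for CFS is enqueue_task_fair.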
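The subtract-then-add step amounts to carrying a relative offset between run queues. A minimal sketch, with hypothetical helper names and made-up min_vruntime values:

```c
#include <assert.h>
#include <stdint.h>

/* Leaving a run queue: keep only the distance from that queue's minimum. */
uint64_t toy_detach(uint64_t vruntime, uint64_t src_min_vruntime)
{
    return vruntime - src_min_vruntime;
}

/* Joining a run queue: re-base the distance on the new queue's minimum. */
uint64_t toy_attach(uint64_t relative, uint64_t dst_min_vruntime)
{
    return relative + dst_min_vruntime;
}
```

A task that was 50 ns ahead of the minimum on the source queue ends up 50 ns ahead of the minimum on the destination queue, however far apart the two minimums are.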

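try_to_wake_up wakes a sleeping process; its code is in kernel/sched/core.c, and it also goes through enqueue_task_fair. In place_entity, when initial is 0 (the wake-up case), the virtual runtime is set to the minimum virtual runtime minus half a scheduling latency (with GENTLE_FAIR_SLEEPERS) or one full scheduling latency, unless the task's own vruntime is already larger.

Both the wake-up path (try_to_wake_up) and the path that starts a newly created process end up calling check_preempt_wakeup to check whether the woken or newly created process may preempt the current one.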

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
  struct task_struct *curr = rq->curr;
  struct sched_entity *se = &curr->se, *pse = &p->se;
  struct cfs_rq *cfs_rq = task_cfs_rq(curr);
  int scale = cfs_rq->nr_running >= sched_nr_latency;
  int next_buddy_marked = 0;

  if (unlikely(se == pse))
    return;

  /*
   * This is possible from callers such as attach_tasks(), in which we
   * unconditionally check_prempt_curr() after an enqueue (which may have
   * lead to a throttle).  This both saves work and prevents false
   * next-buddy nomination below.
   */
  if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
    return;

  if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
    set_next_buddy(pse);
    next_buddy_marked = 1;
  }

  /*
   * We can come here with TIF_NEED_RESCHED already set from new task
   * wake up path.
   *
   * Note: this also catches the edge-case of curr being in a throttled
   * group (e.g. via set_curr_task), since update_curr() (in the
   * enqueue of curr) will have resulted in resched being set.  This
   * prevents us from potentially nominating it as a false LAST_BUDDY
   * below.
   */
  if (test_tsk_need_resched(curr))
    return;

  /* Idle tasks are by definition preempted by non-idle tasks. */
  if (unlikely(curr->policy == SCHED_IDLE) &&
      likely(p->policy != SCHED_IDLE))
    goto preempt;

  /*
   * Batch and idle tasks do not preempt non-idle tasks (their preemption
   * is driven by the tick):
   */
  if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
    return;

  find_matching_se(&se, &pse);
  update_curr(cfs_rq_of(se));
  BUG_ON(!pse);
  if (wakeup_preempt_entity(se, pse) == 1) {
    /*
     * Bias pick_next to pick the sched entity that is
     * triggering this preemption.
     */
    if (!next_buddy_marked)
      set_next_buddy(pse);
    goto preempt;
  }

  return;

preempt:
  resched_curr(rq);
  /*
   * Only set the backward buddy when the current task is still
   * on the rq. This can happen when a wakeup gets interleaved
   * with schedule on the ->pre_schedule() or idle_balance()
   * point, either of which can * drop the rq lock.
   *
   * Also, during early boot the idle thread is in the fair class,
   * for obvious reasons its a bad idea to schedule back to it.
   */
  if (unlikely(!se->on_rq || curr == rq->idle))
    return;

  if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
    set_last_buddy(se);
}
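The listing relies on wakeup_preempt_entity, which is not shown in this article. The sketch below is an assumption about its general shape rather than a copy of the kernel function: it compares the vruntime gap between the current and the woken entity with a wake-up granularity, which is passed in here as a plain parameter. toy_wakeup_preempt is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns  1 if the woken entity is behind the current one by more than
 *            the granularity (preempt),
 *          0 if it is behind, but by no more than the granularity,
 *         -1 if it is not behind at all.
 */
int toy_wakeup_preempt(uint64_t curr_vruntime, uint64_t woken_vruntime,
                       int64_t gran)
{
    int64_t vdiff = (int64_t)(curr_vruntime - woken_vruntime);

    if (vdiff <= 0)
        return -1;
    if (vdiff > gran)
        return 1;
    return 0;
}
```

The granularity acts as a damper: a woken task preempts only when it is clearly owed CPU time, which limits how often a wake-up forces a context switch.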

7. Group Scheduling in CFS

Linux implements cgroups as a file system that can be mounted. Most distributions mount it already, which the following command shows:

[root@localhost ~]# mount -t cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)

If it is not mounted, you can mount it yourself:

mkdir cgroup
mount -t tmpfs cgroup_root ./cgroup
mkdir cgroup/cpuset
mount -t cgroup -ocpuset cpuset ./cgroup/cpuset/
mkdir cgroup/cpu
mount -t cgroup -ocpu cpu ./cgroup/cpu/
mkdir cgroup/memory
mount -t cgroup -omemory memory ./cgroup/memory/

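See the documentation for the specific operations.

8. Scheduling Policies for Real-time Processes

Real-time processes use either the first-in, first-out SCHED_FIFO policy or the round-robin SCHED_RR policy. Ordinary processes use the SCHED_OTHER policy; these are the processes with nice values in the range -20 to 19 described earlier. A process can also be set to SCHED_BATCH, but that is not a real-time policy; since the \(O(1)\) scheduler was replaced, it behaves almost the same as SCHED_OTHER. The SCHED_IDLE policy has a very low weight of 3, lower even than the weight of 15 used for nice value 19.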

The sched_setscheduler function sets the scheduling policy and priority.
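A minimal sketch of the call follows. toy_try_fifo is a hypothetical wrapper; it asks for SCHED_FIFO at the lowest real-time priority and reports the errno value on failure, which is typically EPERM for an unprivileged process.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Try to switch the calling process to SCHED_FIFO at the lowest
 * real-time priority. Returns 0 on success, otherwise the errno value.
 */
int toy_try_fifo(void)
{
    struct sched_param sp = {
        .sched_priority = sched_get_priority_min(SCHED_FIFO)
    };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        return errno;
    return 0;
}
```

To return to the default policy, call sched_setscheduler again with SCHED_OTHER and a sched_priority of 0.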

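To let ordinary processes consume a small amount of CPU time while real-time processes are running, rather than waiting until all real-time processes have finished, adjust two sysctl settings: kernel.sched_rt_period_us and kernel.sched_rt_runtime_us.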
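The two settings together define what share of each period real-time tasks may consume. The helper below is a hypothetical illustration of that arithmetic, assuming that a runtime of -1 means "no limit".

```c
#include <assert.h>

/*
 * Percentage of each period that real-time tasks may use.
 * Returns 100 when the runtime is -1 (no limit) or covers the whole
 * period, and -1 for an invalid period.
 */
int toy_rt_share_percent(long period_us, long runtime_us)
{
    if (runtime_us < 0)
        return 100;
    if (period_us <= 0)
        return -1;
    if (runtime_us >= period_us)
        return 100;
    return (int)(runtime_us * 100 / period_us);
}
```

With a period of 1000000 us and a runtime of 950000 us, real-time tasks get at most 95% of the CPU, leaving 5% for everything else.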

