Linux Asynchronous I/O: epoll vs. io_uring

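When you write a network service, how you wait for I/O often sets the performance ceiling earlier than how you handle the protocol. Linux offers two common paths: epoll (readiness notification) and io_uring (completion notification). This article compares the two in terms of mechanism, system call model, and code shape, and explains the conditions under which epoll is still worth using.

For a more systematic introduction to io_uring and a guide to the series, see the io_uring Series Index. If you have already decided to adopt io_uring and care about production pitfalls, see io_uring in Production: A Record of Failures.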

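1. Where the Problem Comes From

A typical reverse proxy or HTTP server has to maintain a large number of connections at once. If each connection blocks in read() / write(), the number of threads or processes grows linearly with the number of connections, and context-switch and scheduling overhead quickly become the bottleneck.

The common answer is an event-driven design: a single thread (or a few workers) loops, waiting to learn which fds can be read or written right now, and then issues the actual I/O. On Linux this approach has long been dominated by epoll. Since kernel 5.1 (2019), io_uring has offered another asynchronous I/O interface built around shared ring queues.

Both address the same class of problem (how to reduce waiting and system call overhead under high concurrency), but at different levels of abstraction:

Dimension	epoll	io_uring
Notification semantics	fd is ready (readable/writable)	operation has completed (with a result code)
Actual I/O	application calls `read()` / `write()` itself	described in the submission queue, e.g. `IORING_OP_READ`
Kernel interaction	each ready event is usually followed by one I/O system call	a batch of submissions/completions can be folded into a few `io_uring_enter()` calls
Kernel requirement	2.6+ (`epoll_create` and friends)	5.1+ (`io_uring_setup`)

Mature products such as nginx and haproxy still rely mainly on epoll (or an equivalent readiness mechanism). For a new project whose target kernels are recent enough and whose team is willing to pay the learning cost of a new API, io_uring is worth evaluating. Still, "once you have io_uring you never need epoll" is too absolute; the exceptions are covered later.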

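2. epoll: The Readiness Notification Model

2.1 Basic Flow

epoll does not perform reads or writes for you. It only answers one question: can I/O on this fd likely proceed right now without blocking?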

sequenceDiagram
    participant App as Application
    participant Kernel as Kernel
    App->>Kernel: epoll_ctl(ADD, fd)
    App->>Kernel: epoll_wait(...)
    Note over Kernel: wait for fd readiness
    Kernel-->>App: return list of ready events
    loop for each ready fd
        App->>Kernel: read() / write()
        Kernel-->>App: return data or error
    end

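Typical steps:

epoll_create1() creates the epoll instance;
epoll_ctl() registers the fds of interest together with their event types (such as EPOLLIN);
epoll_wait() blocks, or waits with a timeout, and returns an array of ready events;
for each ready fd, the application then calls read() / write() / accept() and so on to perform the actual I/O.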

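2.2 System Call Overhead

For one successful read, the steady-state path usually involves at least:

one epoll_wait() (wait for and fetch the ready events);
one read() (copy the data into the user buffer).

Registration also costs an epoll_ctl(), but over the life of a connection that cost is usually amortized across many I/O operations. When there are many connections and each readable event carries little data, the two-syscall "wait for readiness, then read" pattern crosses the user/kernel boundary over and over. This is also where epoll-based designs are most often optimized (batched reads, edge-triggered EPOLLET, fewer wakeups, and so on).

One nuance: a syscall is a user/kernel mode transition, which is not the same thing as a full context switch, and some calls are served from the vDSO without entering the kernel at all. Even so, in high-QPS scenarios the syscall count remains one of the first counters to watch.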
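The edge-triggered optimization mentioned above can be sketched as follows. This is a minimal illustration, not code from any particular server: the helper names set_nonblocking, watch_edge_triggered, and drain_fd are made up for this example. The idea is that with EPOLLET one readiness event must be followed by reads until EAGAIN, so a single epoll_wait() wakeup is amortized over everything currently buffered.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode (required for EPOLLET). */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: register fd for edge-triggered read readiness. */
int watch_edge_triggered(int epfd, int fd) {
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET,
        .data.fd = fd,
    };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: after one readiness event, read until the kernel
 * reports EAGAIN (nothing left for now) or EOF.
 * Returns the total number of bytes read, or -1 on a real error. */
ssize_t drain_fd(int fd) {
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            total += n;          /* a real server would process buf here */
            continue;
        }
        if (n == 0)
            break;               /* EOF: peer closed */
        if (errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;               /* drained; wait for the next edge */
        return -1;
    }
    return total;
}
```

Forgetting to drain is the classic edge-triggered bug: the remaining bytes produce no new event, so the connection appears to stall.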

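2.3 Minimal Example

The following program watches standard input and, once it is readable, reads the available data (up to 256 bytes) and prints the byte count: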

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <stdlib.h>

#define MAX_EVENTS 8

int main(void) {
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        perror("epoll_create1");
        return 1;
    }

    struct epoll_event ev = {
        .events = EPOLLIN,
        .data.fd = STDIN_FILENO,
    };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) == -1) {
        perror("epoll_ctl");
        return 1;
    }

    struct epoll_event events[MAX_EVENTS];
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
    if (n == -1) {
        perror("epoll_wait");
        return 1;
    }

    for (int i = 0; i < n; i++) {
        if (events[i].data.fd == STDIN_FILENO) {
            char buf[256];
            ssize_t count = read(STDIN_FILENO, buf, sizeof(buf));
            if (count < 0) {
                perror("read");
                return 1;
            }
            printf("read %zd bytes\n", count);
        }
    }

    close(epfd);
    return 0;
}

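Build: cc -O2 -Wall -Wextra -o epoll_stdin epoll_stdin.c

Handling the first readable event on stdin takes epoll_ctl (registration) + epoll_wait + read. The code is straightforward, and errors mostly surface synchronously in syscall return values.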

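3. io_uring: Completion Notification and Ring Queues

3.1 Architecture

io_uring maps a submission queue (SQ) and a completion queue (CQ) into memory shared between user space and the kernel. The application writes what I/O it wants into an SQE (submission queue entry); after executing it, the kernel writes the result into a CQE (completion queue entry).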

flowchart LR
    subgraph userspace [User space]
        App[Application]
        SQ[(Submission Queue)]
        CQ[(Completion Queue)]
    end
    subgraph kernel [Kernel]
        Worker[io_uring execution path]
    end
    App -->|fill SQE| SQ
    App -->|io_uring_enter submit/reap| Worker
    Worker -->|read SQE| SQ
    Worker -->|write CQE| CQ
    App -->|read CQE| CQ

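Key differences from epoll:

you submit operations (read, write, accept, open, and so on), not just "watch for readable";
on completion, cqe->res carries the result (a byte count or a negative errno) instead of it arriving synchronously as the return value of read();
a single io_uring_enter() can submit several SQEs and reap several CQEs at once, turning a per-I/O syscall cost into a per-batch cost.

By default io_uring_enter() is still needed to drive the kernel's processing of the queues (liburing's io_uring_submit() / io_uring_wait_cqe() call it internally when needed). With IORING_SETUP_SQPOLL the kernel starts an SQ polling thread that polls the submission queue, so the steady state can approach "zero syscalls"; the price is CPU that may be consumed even when idle. Parameters such as sq_thread_idle let that thread go to sleep after a period of inactivity. For the concrete production pitfalls of SQPOLL, see io_uring in Production: A Record of Failures.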

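3.2 Minimal Example (liburing)

Requires the development package: liburing-dev (Debian/Ubuntu) or liburing-devel (Fedora/RHEL). Note that io_uring_prep_read() maps to IORING_OP_READ, which needs kernel 5.6 or newer.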

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    struct io_uring ring;
    char buf[256];

    /* liburing calls return -errno and do not set errno,
     * so report failures with strerror(-ret) rather than perror(). */
    int ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "submission queue full\n");
        io_uring_queue_exit(&ring);
        return 1;
    }
    io_uring_prep_read(sqe, STDIN_FILENO, buf, sizeof(buf), 0);

    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        io_uring_queue_exit(&ring);
        return 1;
    }

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        io_uring_queue_exit(&ring);
        return 1;
    }

    if (cqe->res < 0) {
        fprintf(stderr, "read failed: %s\n", strerror(-cqe->res));
    } else {
        printf("read %d bytes\n", cqe->res);
    }

    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}

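Build: cc -O2 -Wall -Wextra -o io_uring_stdin io_uring_stdin.c -luring

Compared with the epoll example:

there is no separate epoll_ctl-style step of registering the fd with a multiplexer (the operation is bound directly to the SQE);
the completion path has no extra read() syscall, because the data is already in the buffer named when the read request was submitted;
errors arrive asynchronously in the res field of the CQE and must be handled per completion event.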

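3.3 Advanced Capabilities (Brief)

Fixed buffers: io_uring_register_buffers() registers buffers up front, reducing the per-I/O cost of pinning and unpinning memory.
Zero-copy network send: IORING_OP_SEND_ZC (kernel 6.0+) avoids copying the send buffer into the kernel on paths that support it.
Batching and linking: io_uring_submit_and_wait(), SQE chains (IOSQE_IO_LINK), and similar features express dependencies and reduce the number of enter calls.

These features make io_uring a better fit for high-throughput, syscall-sensitive server paths, but they also raise the complexity of program structure and debugging. For fuller API coverage and an Echo Server walkthrough, see the io_uring series.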

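4. Side-by-Side Comparison

4.1 Semantics and Programming Model

epoll path:
  epoll_wait → "fd readable" → read() → data in the user buffer

io_uring path:
  prep_read + submit → kernel performs the read → wait_cqe → cqe->res holds the byte count

epoll is the classic Reactor shape: multiplexing plus non-blocking I/O. io_uring is closer to a Proactor: submit asynchronous operations and resume the logic on completion. Moving from epoll to io_uring is usually not a matter of swapping APIs but of rewriting the event loop: state machines, buffer lifetimes, and backpressure policy all have to be rethought in terms of completion semantics.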

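4.2 System Call Counts (Qualitative)

Consider a single read operation:

Phase	epoll	io_uring (default)
Registration / description	`epoll_ctl` (once per fd, amortizable)	`get_sqe` + fill in the SQE for every operation (user space only, no syscall)
Waiting	`epoll_wait`	`io_uring_enter` (submit + wait, can be batched)
Actual I/O	`read`	usually no extra read syscall
Single connection, single read (steady state)	about 2 syscalls	1 to 2 `io_uring_enter` calls (1 when submit and wait are combined; fewer per operation when batched)

With \(N\) connections and batch size \(B\), the ideal case for io_uring is on the order of \(O(N/B)\) enter calls rather than \(O(N)\) wait+read pairs. The actual gain depends on batching strategy, kernel version, and workload, so this article gives no unmeasured numbers. For a selection decision, load-test on the target hardware with representative traffic, or see io_uring vs epoll: Performance and Architecture Comparison.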
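To make the \(O(N/B)\) argument concrete, here is a purely arithmetic count of system calls, not a performance measurement. The values \(N = 10000\) and \(B = 32\) are made up for illustration, and the count assumes the best case for both sides: every batch is full, and every epoll_wait() also returns \(B\) ready fds.

```latex
\begin{aligned}
\text{epoll:}\quad
  & \underbrace{\lceil N/B \rceil}_{\texttt{epoll\_wait}}
    + \underbrace{N}_{\texttt{read}}
    = \lceil 10000/32 \rceil + 10000 = 313 + 10000 = 10313 \\
\text{io\_uring:}\quad
  & \underbrace{\lceil N/B \rceil}_{\texttt{io\_uring\_enter}}
    = \lceil 10000/32 \rceil = 313
\end{aligned}
```

Real workloads rarely fill every batch, so the gap in practice is smaller and has to be measured.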

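4.3 When to Still Use epoll

Kernel version: the target environment is older than 5.1, or container/cloud image policy does not allow the new syscalls.
Portability: when the code must sit behind a cross-platform abstraction alongside BSD kqueue, Windows IOCP, and the like, the epoll backend is more mature.
Ecosystem and operations: existing toolchains, documentation, and colleagues' experience are concentrated on epoll; a large body of production code such as nginx and Redis is still built on it.
Semantic needs: sometimes the business logic really does want to "decide how much to read only once the fd is readable"; the readiness model is more intuitive and easier to combine with existing non-blocking fd code.
Complexity budget: for small tools and modest connection counts, epoll code is shorter and more direct to debug.

io_uring is still evolving (new ops, security hardening, performance fixes), and keeping up with kernel versions is a cost in itself.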
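The kernel-version and policy points above can be checked at runtime rather than guessed. The sketch below is one possible probe, assuming kernel headers new enough to define __NR_io_uring_setup; the names io_uring_available and pick_backend are invented for this example. It tries to create a tiny ring with the raw syscall and falls back to epoll when that fails. ENOSYS usually means the kernel predates io_uring, while EPERM usually points at policy such as a seccomp filter.

```c
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical probe: returns 1 if an io_uring instance can be created
 * in this process, 0 otherwise. If err_out is non-NULL it receives 0 on
 * success or the errno from the failed io_uring_setup call. */
int io_uring_available(int *err_out) {
    struct io_uring_params params;

    memset(&params, 0, sizeof(params));
    long fd = syscall(__NR_io_uring_setup, 4, &params);
    if (fd < 0) {
        if (err_out)
            *err_out = errno;
        return 0;
    }
    close((int)fd);              /* the probe ring is not needed further */
    if (err_out)
        *err_out = 0;
    return 1;
}

/* Hypothetical selector: name of the event backend to use. */
const char *pick_backend(void) {
    return io_uring_available(NULL) ? "io_uring" : "epoll";
}
```

A successful probe only shows that a ring can be created; individual opcodes such as IORING_OP_READ may still be missing on older kernels and need their own checks.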

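5. Practical Notes

Production-grade loops: the examples above are minimal teaching versions. Real code must handle EINTR, partial reads/writes, connection close, io_uring_get_sqe returning NULL (SQ full), CQ overflow, and so on.
SQPOLL CPU cost: do not enable it by default; evaluate it only when syscalls are still the bottleneck and a dedicated core or an idle policy is acceptable.
Memory and lifetimes: a buffer must stay valid until the read operation completes, and the ordering between completion notification and freeing the business object must be strict.
Security and permissions: io_uring has historically drawn attention for sandbox bypasses and its attack surface; newer kernels add a number of restrictions and IORING_SETUP_* options. Check your distribution's kernel version and security advisories when deploying.
Dependencies: the examples use liburing (the user-space helper library maintained by Jens Axboe); you can also call the io_uring_setup / io_uring_enter syscalls directly, which removes the dependency at the cost of more code.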
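As one example of the EINTR and partial-write handling listed above, here is a minimal sketch for the epoll-style path. The function name write_some is invented for this example. It retries on EINTR, continues after short writes, and stops early when a non-blocking fd would block, so the caller can wait for EPOLLOUT and resume with the remaining bytes.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write up to len bytes from data to fd.
 * Returns the number of bytes actually written, which is less than len
 * only if a non-blocking fd would block, or -1 on a real error. */
ssize_t write_some(int fd, const void *data, size_t len) {
    const char *p = data;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n > 0) {
            done += (size_t)n;   /* short write: loop for the rest */
            continue;
        }
        if (n == 0)
            break;               /* nothing written; let the caller decide */
        if (errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;               /* would block: resume on EPOLLOUT */
        return -1;
    }
    return (ssize_t)done;
}
```

With io_uring the same concern moves to completion time: a CQE whose res is smaller than the requested length means the remainder has to be resubmitted.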

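6. Summary

epoll tells you that I/O can be done now, and you still need one more syscall to actually do it; the model is simple and widely compatible, and it has long been the foundation of high-concurrency Linux services.
io_uring uses shared ring queues to batch "describe the operation" and "fetch the result", pushing more of the work into the kernel path; it suits syscall-sensitive servers built on recent kernels.
The choice should depend on the target kernel, the team's maintenance cost, and the traffic pattern, not on a dogmatic belief that one fully replaces the other.

If you migrate an epoll service to io_uring, budget time for an architecture-level rewrite and compare latency, P99, and CPU utilization under real load. Whether the mechanism's advantages translate into business metrics can only be answered by measurement.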

References

Linux io_uring manual pages: io_uring_setup(2), io_uring_enter(2) (man7.org)
Jens Axboe, io_uring design articles and the liburing source code
Linux kernel documentation: Io_uring
epoll(7), epoll_ctl(2), epoll_wait(2) manual pages

2025-11-30 · linux / io_uring