【系统架构设计】Twitter 架构演进：从 Ruby 到分布式时间线

2006 年，Jack Dorsey 发出第一条推文时，Twitter 只是一个运行在单台服务器上的 Ruby on Rails 应用。到 2012 年，它已经变成一个由数百个 JVM 微服务组成的分布式系统，每秒处理超过 40 万条推文的写入和数十亿次时间线读取。这段演进历程浓缩了互联网架构从单体到微服务、从通用数据库到自研存储引擎的典型路径，值得每一位架构师深入研究。

本文将从技术栈迁移、时间线服务（Timeline Service）、扇出策略（Fanout Strategy）、Manhattan 存储引擎等核心维度，完整拆解 Twitter 架构的演进逻辑与工程实践。

一、Twitter 技术栈演进时间线（2006-2023）

Twitter 的技术演进可以分为四个阶段，每个阶段都伴随着流量规模的数量级增长和架构的根本性变化。

1.1 第一阶段：Ruby on Rails 单体（2006-2008）

最初的 Twitter 是一个标准的 Rails 应用，使用 MySQL 作为数据库，Memcached 作为缓存。所有功能（发推、关注、时间线）都在同一个代码库中。

# 早期 Twitter 的时间线查询（简化示意）
class TimelineController < ApplicationController
  def home
    following_ids = current_user.following.pluck(:id)
    @tweets = Tweet.where(user_id: following_ids)
                   .order(created_at: :desc)
                   .limit(200)
  end
end

这种查询方式在用户量较小时可以工作，但随着用户数增长到百万级别，一个关注了 2000 人的用户请求时间线时，需要在包含数亿条推文的表中执行一个庞大的 WHERE IN 查询，数据库很快成为瓶颈。

1.2 第二阶段：服务化拆分（2008-2010）

2008 年开始，Twitter 遭遇了著名的”失败鲸”（Fail Whale）问题，频繁宕机。团队开始将单体应用拆分为多个服务：

时间	事件	技术决策
2008	频繁宕机，出现”失败鲸”	开始服务拆分
2009	推文存储从 MySQL 迁移到 Cassandra	后因运维问题回退
2010	引入 Scala 编写 Kestrel 消息队列	JVM 生态初步引入
2010	开发 Snowflake 分布式 ID 生成器	解决分布式唯一 ID 问题

1.3 第三阶段：JVM 全面迁移（2010-2014）

这是 Twitter 架构演进最关键的阶段。团队决定将核心服务从 Ruby 迁移到 JVM 平台（主要是 Scala 和 Java），并构建了一系列自研基础设施：

Finagle：异步 RPC 框架
Manhattan：分布式键值存储
Zipkin：分布式追踪系统
Summingbird：实时和批处理混合计算框架

1.4 第四阶段：优化与整合（2014-2023）

在完成核心迁移后，Twitter 的架构进入优化阶段：

推出 Twitter Lite（渐进式 Web 应用）
引入机器学习（Machine Learning）驱动的算法时间线
GraphQL 网关统一 API 层
持续优化 Manhattan 存储引擎的性能

Twitter 技术栈演进概览

2006 ──── Ruby on Rails + MySQL + Memcached
  │
2008 ──── 服务拆分开始，Starling 消息队列
  │
2010 ──── Scala/Finagle 引入，Snowflake ID 生成器
  │
2011 ──── Realtime Timeline（实时时间线）上线
  │
2012 ──── Manhattan 存储引擎开发
  │
2013 ──── 核心路径全面迁移至 JVM
  │
2014 ──── Zipkin 开源，分布式追踪
  │
2016 ──── 算法时间线上线
  │
2018 ──── GraphQL 网关层
  │
2020 ──── 缓存架构统一
  │
2023 ──── 架构精简与团队重组

二、从 Ruby on Rails 到 JVM 的迁移

2.1 为什么离开 Ruby

Twitter 放弃 Ruby on Rails 并非因为 Ruby 语言本身的缺陷，而是基于以下工程现实：

性能瓶颈。 Ruby 的全局解释器锁（Global Interpreter Lock，GIL）使得单个进程无法充分利用多核 CPU。Twitter 的每台服务器需要运行数十个 Ruby 进程来处理并发请求，内存开销巨大。

垃圾回收问题。 Ruby 1.8 的垃圾回收器（Garbage Collector）采用标记-清除（Mark-Sweep）算法，在处理大量短生命周期对象时会产生明显的停顿（Stop-the-World Pause），这对实时性要求极高的时间线服务是不可接受的。

生态成熟度。 2010 年前后，JVM 平台在并发编程、性能调优、监控工具等方面远比 Ruby 生态成熟。Scala 的函数式编程特性和与 Java 生态的互操作性使其成为理想选择。

2.2 迁移策略：绞杀者模式

Twitter 采用了绞杀者模式（Strangler Pattern）进行迁移，而不是一次性重写。核心思路是在旧系统外围逐步构建新服务，将流量从旧系统切换到新系统，直到旧系统被完全替代。

graph TB
    subgraph "第一阶段：Ruby 单体"
        Client1[客户端] --> Rails[Ruby on Rails 单体]
        Rails --> MySQL1[(MySQL)]
        Rails --> MC1[Memcached]
    end

    subgraph "第二阶段：绞杀者模式迁移中"
        Client2[客户端] --> API[API 网关]
        API --> TweetSvc[Tweet Service<br/>Scala]
        API --> UserSvc[User Service<br/>Scala]
        API --> RailsLegacy[Rails 遗留服务]
        TweetSvc --> MySQL2[(MySQL)]
        UserSvc --> Manhattan1[(Manhattan)]
        RailsLegacy --> MySQL3[(MySQL)]
    end

    subgraph "第三阶段：JVM 微服务"
        Client3[客户端] --> GW[API 网关<br/>Finagle]
        GW --> TS[Tweet Service]
        GW --> US[User Service]
        GW --> TLS[Timeline Service]
        GW --> SS[Search Service]
        TS --> MH1[(Manhattan)]
        US --> MH2[(Manhattan)]
        TLS --> Redis1[(Redis)]
        SS --> Earlybird[Earlybird<br/>搜索索引]
    end

2.3 Finagle：异步 RPC 框架

Finagle 是 Twitter 迁移到 JVM 后构建的核心基础设施之一。它是一个基于 Netty 的异步 RPC 框架，支持多种协议（Thrift、HTTP、Memcached 等），内置了服务发现、负载均衡、熔断器（Circuit Breaker）和超时控制等功能。

// Finagle Thrift 服务端示例
import com.twitter.finagle.Thrift
import com.twitter.util.{Await, Future}

object TweetServiceServer {
  def main(args: Array[String]): Unit = {
    val service = new TweetService.MethodPerEndpoint {
      override def getTweet(tweetId: Long): Future[Tweet] = {
        // 从 Manhattan 读取推文
        manhattanClient.get(s"tweets/$tweetId").map { result =>
          Tweet(
            id = tweetId,
            text = result.getString("text"),
            userId = result.getLong("user_id"),
            createdAt = result.getLong("created_at")
          )
        }
      }

      override def postTweet(userId: Long, text: String): Future[Tweet] = {
        val tweetId = snowflakeIdGenerator.nextId()
        val tweet = Tweet(tweetId, text, userId, System.currentTimeMillis())
        for {
          _ <- manhattanClient.put(s"tweets/$tweetId", tweet.toBytes)
          _ <- fanoutService.fanout(tweet)
        } yield tweet
      }
    }

    val server = Thrift.server
      .withLabel("tweet-service")
      .withRequestTimeout(500.milliseconds)
      .serveIface("0.0.0.0:9090", service)

    Await.ready(server)
  }
}

// Finagle 客户端示例，内置重试与熔断
import com.twitter.finagle.Thrift
import com.twitter.conversions.DurationOps._

val client = Thrift.client
  .withLabel("tweet-client")
  .withRequestTimeout(200.milliseconds)
  .withRetryBudget(RetryBudget(
    ttl = 10.seconds,
    minRetriesPerSec = 5,
    percentCanRetry = 0.1
  ))
  .build[TweetService.MethodPerEndpoint]("zk!myzkhost:2181!/twitter/services/tweet")

Finagle 的一个关键设计决策是使用 Future 作为核心抽象。与 Java 的 CompletableFuture 不同，Twitter 的 Future 支持中断传播（Interrupt Propagation），当下游服务超时时，可以自动取消正在进行的上游请求，避免资源浪费。

2.4 迁移过程中的关键挑战

迁移过程持续了近三年，期间遇到了多个技术挑战：

双写一致性。 在迁移存储层时，需要同时向旧系统（MySQL）和新系统（Manhattan）写入数据。Twitter 采用了影子写入（Shadow Write）加异步校验的方式来确保数据一致性。

// 双写与校验伪代码
public class DualWriteService {
    private final MysqlDao mysqlDao;
    private final ManhattanClient manhattanClient;
    private final ConsistencyChecker checker;

    public void writeTweet(Tweet tweet) {
        // 主写入：MySQL（当前主存储）
        mysqlDao.insert(tweet);

        // 影子写入：Manhattan（新存储）
        try {
            manhattanClient.put(tweet.getId(), tweet.serialize());
        } catch (Exception e) {
            // 影子写入失败不影响主流程
            metrics.increment("shadow_write_failure");
            repairQueue.enqueue(tweet.getId());
        }

        // 异步校验
        checker.scheduleVerification(tweet.getId(), Duration.ofMinutes(5));
    }
}

流量灰度。 Twitter 使用了基于用户 ID 哈希的灰度策略，逐步将流量从旧服务切换到新服务。每次灰度切换后，会持续监控错误率、延迟和业务指标至少 24 小时。

协议兼容性。 迁移期间，新旧服务需要使用相同的序列化格式。Twitter 选择了 Apache Thrift 作为统一的接口定义语言（Interface Definition Language，IDL），确保跨语言的兼容性。

三、Timeline Service 与 Fanout 架构

时间线服务（Timeline Service）是 Twitter 的核心系统，直接决定了用户打开应用后看到的内容。它的设计是 Twitter 架构中最复杂也最有创新性的部分。

3.1 时间线的本质

Twitter 的时间线本质上是一个按时间排序的推文聚合视图。每个用户的主页时间线（Home Timeline）是其关注的所有用户的推文的合集。这看似简单的需求，在 Twitter 的规模下变成了一个极具挑战性的分布式系统问题。

核心数据： - 平均每个用户关注约 200 个账号 - 热门账号（如明星、媒体）有数千万粉丝 - 用户期望的时间线加载延迟小于 200 毫秒 - 每秒有数十万条新推文产生

3.2 时间线服务架构

graph LR
    subgraph "写入路径"
        User[用户发推] --> TweetSvc[Tweet Service]
        TweetSvc --> TweetStore[(Tweet Store<br/>Manhattan)]
        TweetSvc --> FanoutSvc[Fanout Service]
        FanoutSvc --> SocialGraph[Social Graph<br/>Service]
        SocialGraph --> |返回粉丝列表| FanoutSvc
        FanoutSvc --> TimelineCache[(Timeline Cache<br/>Redis Cluster)]
    end

    subgraph "读取路径"
        Reader[用户刷新时间线] --> TimelineSvc[Timeline Service]
        TimelineSvc --> TimelineCache
        TimelineSvc --> TweetHydrator[Tweet Hydrator]
        TweetHydrator --> TweetStore
        TweetHydrator --> UserSvc[User Service]
        TweetHydrator --> MediaSvc[Media Service]
        TweetHydrator --> |水合后的完整推文| TimelineSvc
        TimelineSvc --> |返回时间线| Reader
    end

时间线服务的核心流程分为两个路径：

写入路径（Write Path）： 用户发推时，Tweet Service 首先将推文持久化到 Manhattan，然后通知 Fanout Service。Fanout Service 查询 Social Graph Service 获取发推用户的所有粉丝列表，然后将推文 ID 写入每个粉丝的时间线缓存（Redis）中。

读取路径（Read Path）： 用户请求时间线时，Timeline Service 从 Redis 中读取该用户的时间线（仅包含推文 ID 列表），然后通过 Tweet Hydrator 批量获取推文的完整信息（文本、用户头像、媒体链接等），最终返回给客户端。

3.3 时间线缓存结构

每个用户的时间线在 Redis 中存储为一个有序集合（Sorted Set），推文 ID 作为成员，Snowflake ID 本身包含时间戳信息，因此可以同时作为排序分数。

# 时间线缓存结构示意（Python 伪代码）
import redis

class TimelineCache:
    def __init__(self, redis_cluster):
        self.redis = redis_cluster
        self.MAX_TIMELINE_SIZE = 800  # 每个用户最多缓存 800 条推文 ID

    def add_tweet(self, user_id: int, tweet_id: int) -> None:
        """将推文 ID 添加到用户时间线"""
        key = f"timeline:{user_id}"
        pipeline = self.redis.pipeline()
        # 使用 Snowflake ID 作为分数（包含时间信息）
        pipeline.zadd(key, {str(tweet_id): float(tweet_id)})
        # 裁剪，只保留最新的 800 条
        pipeline.zremrangebyrank(key, 0, -(self.MAX_TIMELINE_SIZE + 1))
        pipeline.execute()

    def get_timeline(self, user_id: int, count: int = 200,
                     max_id: int = None) -> list:
        """读取用户时间线"""
        key = f"timeline:{user_id}"
        if max_id:
            # 分页：获取 max_id 之前的推文
            return self.redis.zrevrangebyscore(
                key, max_id - 1, "-inf", start=0, num=count
            )
        return self.redis.zrevrange(key, 0, count - 1)

    def remove_tweet(self, user_id: int, tweet_id: int) -> None:
        """用户删除推文时，从所有粉丝时间线移除"""
        key = f"timeline:{user_id}"
        self.redis.zrem(key, str(tweet_id))

3.4 Snowflake ID 生成器

Twitter 的 Snowflake 是一个分布式唯一 ID 生成器，生成的 64 位 ID 中嵌入了时间戳信息，使得 ID 本身具备时间排序特性，这对时间线的排序至关重要。

Snowflake ID 结构（64 位）

┌──────────────────────┬──────────┬──────────────┐
│   时间戳（41 位）     │机器ID    │  序列号      │
│   毫秒级，可用69年    │（10位）  │ （12位）     │
└──────────────────────┴──────────┴──────────────┘
 63                   22  21    12  11           0

// Snowflake ID 生成器核心逻辑
public class SnowflakeIdGenerator {
    // 起始时间戳：2010-11-04T01:42:54.657Z
    private static final long EPOCH = 1288834974657L;
    private static final int WORKER_ID_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long sequence = 0L;
    private long lastTimestamp = -1L;

    public SnowflakeIdGenerator(long workerId) {
        if (workerId < 0 || workerId >= (1L << WORKER_ID_BITS)) {
            throw new IllegalArgumentException(
                "Worker ID must be between 0 and " + ((1L << WORKER_ID_BITS) - 1)
            );
        }
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long timestamp = System.currentTimeMillis();

        if (timestamp < lastTimestamp) {
            throw new RuntimeException(
                "Clock moved backwards. Refusing to generate id for "
                + (lastTimestamp - timestamp) + " milliseconds"
            );
        }

        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // 当前毫秒的序列号用完，等待下一毫秒
                timestamp = waitNextMillis(lastTimestamp);
            }
        } else {
            sequence = 0;
        }

        lastTimestamp = timestamp;

        return ((timestamp - EPOCH) << (WORKER_ID_BITS + SEQUENCE_BITS))
             | (workerId << SEQUENCE_BITS)
             | sequence;
    }

    private long waitNextMillis(long lastTimestamp) {
        long timestamp = System.currentTimeMillis();
        while (timestamp <= lastTimestamp) {
            timestamp = System.currentTimeMillis();
        }
        return timestamp;
    }
}

四、Fan-out on Write vs Fan-out on Read 深度对比

扇出策略（Fanout Strategy）是时间线系统设计中最关键的架构决策。Twitter 在不同阶段尝试了两种策略，最终采用了混合方案。

4.1 两种策略的原理

扇出写入（Fan-out on Write），也称推模式（Push Model）： 当用户发推时，立即将推文 ID 写入所有粉丝的时间线缓存。读取时直接从缓存取出，速度极快。

扇出读取（Fan-out on Read），也称拉模式（Pull Model）： 发推时只写入推文存储。用户读取时间线时，实时查询其关注的所有用户的最新推文并合并排序。

graph TB
    subgraph "Fan-out on Write（推模式）"
        W_User[用户 A 发推] --> W_Write[写入推文存储]
        W_Write --> W_Fanout[扇出到所有粉丝的时间线缓存]
        W_Fanout --> W_F1[粉丝 1 的缓存]
        W_Fanout --> W_F2[粉丝 2 的缓存]
        W_Fanout --> W_F3[粉丝 3 的缓存]
        W_Fanout --> W_FN[粉丝 N 的缓存]
    end

    subgraph "Fan-out on Read（拉模式）"
        R_User[用户 B 请求时间线] --> R_Query[查询所有关注者的推文]
        R_Query --> R_U1[关注者 1 的推文]
        R_Query --> R_U2[关注者 2 的推文]
        R_Query --> R_U3[关注者 3 的推文]
        R_U1 --> R_Merge[合并排序]
        R_U2 --> R_Merge
        R_U3 --> R_Merge
        R_Merge --> R_Result[返回时间线]
    end

4.2 详细对比

维度	扇出写入（Fan-out on Write）	扇出读取（Fan-out on Read）
写入延迟	高（需扇出到所有粉丝）	低（只写一份）
读取延迟	极低（直接读缓存）	高（需实时聚合）
存储成本	高（每个粉丝一份副本）	低（只存原始数据）
写放大比	粉丝数 : 1	1 : 1
读放大比	1 : 1	关注数 : 1
名人问题	严重（百万级扇出）	无影响
数据新鲜度	准实时（扇出延迟）	实时
删除推文	复杂（需从所有粉丝缓存删除）	简单（删除原始数据即可）
不活跃用户	浪费（写入了不会被读取的缓存）	无浪费
适用场景	粉丝数有限的普通用户	超大粉丝数的名人账号

4.3 名人问题（Celebrity Problem）

扇出写入的最大挑战是名人问题。当 Lady Gaga（拥有超过 8000 万粉丝）发一条推文时，如果使用纯扇出写入，需要向 8000 万个时间线缓存中各写入一条记录。按每次写入 1 毫秒计算，单线程完成需要 22 个小时。即使使用 1000 个并发线程，也需要 80 秒。

这意味着一些粉丝在发推后数十秒甚至数分钟才能看到这条推文，这在新闻事件等实时性要求高的场景下是不可接受的。

4.4 混合扇出策略

Twitter 最终采用了混合扇出策略（Hybrid Fanout）：

普通用户（粉丝数低于阈值，如 10 万）：使用扇出写入
名人用户（粉丝数超过阈值）：不进行扇出写入

当用户读取时间线时，Timeline Service 执行以下步骤：

从 Redis 读取该用户的时间线缓存（已包含普通关注者的推文）
查询该用户关注的所有名人的最新推文
将两部分结果合并排序
返回最终时间线

// 混合扇出策略的时间线读取（Scala 伪代码）
class HybridTimelineService(
  timelineCache: TimelineCache,
  tweetStore: TweetStore,
  socialGraph: SocialGraphService,
  celebrityThreshold: Int = 100000
) {

  def getHomeTimeline(userId: Long, count: Int = 200): Future[Seq[Tweet]] = {
    // 并行执行两个查询
    val cachedTweetIdsFuture = timelineCache.get(userId, count * 2)
    val celebrityFollowingsFuture = socialGraph.getCelebrityFollowings(
      userId, celebrityThreshold
    )

    for {
      cachedTweetIds <- cachedTweetIdsFuture
      celebrityIds <- celebrityFollowingsFuture

      // 查询名人的最新推文
      celebrityTweets <- Future.collect(
        celebrityIds.map { celebId =>
          tweetStore.getLatestTweets(celebId, count = 10)
        }
      ).map(_.flatten)

      // 合并并去重
      allTweetIds = mergeTweetIds(cachedTweetIds, celebrityTweets.map(_.id))
                      .distinct
                      .sorted(Ordering[Long].reverse)
                      .take(count)

      // 水合推文（获取完整信息）
      hydratedTweets <- tweetHydrator.hydrate(allTweetIds)
    } yield hydratedTweets
  }

  private def mergeTweetIds(
    cached: Seq[Long], celebrity: Seq[Long]
  ): Seq[Long] = {
    // 基于 Snowflake ID 的时间戳进行归并排序
    val merged = new ArrayBuffer[Long](cached.length + celebrity.length)
    var i = 0
    var j = 0
    while (i < cached.length && j < celebrity.length) {
      if (cached(i) >= celebrity(j)) {
        merged += cached(i)
        i += 1
      } else {
        merged += celebrity(j)
        j += 1
      }
    }
    while (i < cached.length) { merged += cached(i); i += 1 }
    while (j < celebrity.length) { merged += celebrity(j); j += 1 }
    merged.toSeq
  }
}

4.5 扇出服务的内部架构

Fanout Service 是写入路径中最复杂的组件，需要处理高吞吐量的写入并保证最终一致性。

// Fanout Service 核心逻辑
public class FanoutService {
    private static final int CELEBRITY_THRESHOLD = 100_000;
    private static final int BATCH_SIZE = 1000;

    private final SocialGraphClient socialGraph;
    private final TimelineCacheClient timelineCache;
    private final ExecutorService executor;
    private final Metrics metrics;

    public void fanout(Tweet tweet) {
        long userId = tweet.getUserId();
        int followerCount = socialGraph.getFollowerCount(userId);

        if (followerCount >= CELEBRITY_THRESHOLD) {
            // 名人用户：不扇出，记录指标
            metrics.increment("fanout.celebrity.skipped");
            return;
        }

        // 普通用户：分批扇出
        metrics.increment("fanout.normal.started");
        long startTime = System.currentTimeMillis();

        Cursor cursor = socialGraph.getFollowersCursor(userId);
        List<Future<?>> futures = new ArrayList<>();

        while (cursor.hasNext()) {
            List<Long> batch = cursor.next(BATCH_SIZE);
            futures.add(executor.submit(() -> {
                for (long followerId : batch) {
                    try {
                        timelineCache.addTweet(followerId, tweet.getId());
                    } catch (Exception e) {
                        metrics.increment("fanout.write.failure");
                        // 写入失败的条目进入重试队列
                        retryQueue.enqueue(followerId, tweet.getId());
                    }
                }
            }));
        }

        // 等待所有批次完成
        for (Future<?> future : futures) {
            try {
                future.get(30, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                metrics.increment("fanout.batch.timeout");
            }
        }

        long elapsed = System.currentTimeMillis() - startTime;
        metrics.recordTime("fanout.duration", elapsed);
        metrics.increment("fanout.normal.completed");
    }
}

五、Manhattan 存储引擎

Manhattan 是 Twitter 自研的分布式键值存储引擎，用于替代 MySQL 和早期尝试的 Cassandra。它是 Twitter 数据基础设施的核心。

5.1 为什么自研存储

Twitter 选择自研存储引擎而非使用开源方案，原因如下：

Cassandra 的教训。 2009 年，Twitter 尝试将推文存储迁移到 Cassandra，但在生产环境中遇到了严重的运维问题，包括压缩风暴（Compaction Storm）导致的延迟尖峰和数据一致性问题。最终不得不回退到 MySQL。

定制化需求。 Twitter 的工作负载具有独特特征：极高的读写比（约 100:1）、热点数据集中（最新推文被频繁访问）、需要跨数据中心复制等。通用存储引擎难以在所有维度同时满足这些需求。

运维可控性。 自研存储可以深度集成 Twitter 的监控、告警和运维工具链，减少故障排查时间。

5.2 Manhattan 架构

graph TB
    subgraph "Manhattan 集群"
        Client[Manhattan 客户端] --> Coordinator[协调层<br/>Coordinator]

        Coordinator --> SN1[存储节点 1]
        Coordinator --> SN2[存储节点 2]
        Coordinator --> SN3[存储节点 3]

        SN1 --> LSM1[LSM-Tree<br/>存储引擎]
        SN2 --> LSM2[LSM-Tree<br/>存储引擎]
        SN3 --> LSM3[LSM-Tree<br/>存储引擎]

        LSM1 --> SST1[(SSTable 文件)]
        LSM2 --> SST2[(SSTable 文件)]
        LSM3 --> SST3[(SSTable 文件)]
    end

    subgraph "跨数据中心复制"
        DC1[数据中心 1] <--> RepLog[复制日志<br/>Replication Log]
        DC2[数据中心 2] <--> RepLog
        DC3[数据中心 3] <--> RepLog
    end

Manhattan 的核心架构包含以下组件：

协调层（Coordinator Layer）。 接收客户端请求，根据键的哈希值确定数据所在的分片（Shard），将请求路由到正确的存储节点。支持可配置的一致性级别（Consistency Level）。

存储节点（Storage Node）。 每个存储节点使用基于日志结构合并树（Log-Structured Merge Tree，LSM-Tree）的存储引擎，针对写入密集型工作负载进行了优化。

复制层（Replication Layer）。 支持跨数据中心的异步复制，使用最终一致性（Eventual Consistency）模型。每条写入都会生成一个带有向量时钟（Vector Clock）的复制日志条目。

5.3 Manhattan 的数据模型

Manhattan 使用层次化的键空间（Key Space）组织数据：

# Manhattan 数据模型示例
dataset: tweets
  key: tweet_id (Long)
  columns:
    - name: text
      type: String
    - name: user_id
      type: Long
    - name: created_at
      type: Long
    - name: retweet_count
      type: Counter
    - name: favorite_count
      type: Counter
    - name: media_ids
      type: List<Long>

dataset: user_tweets
  key: [user_id, tweet_id] (Composite Key)
  columns:
    - name: created_at
      type: Long

dataset: timelines
  key: user_id (Long)
  columns:
    - name: tweet_ids
      type: SortedSet<Long>

// Manhattan 客户端使用示例
public class TweetRepository {
    private final ManhattanClient manhattan;
    private static final String DATASET = "tweets";

    public Optional<Tweet> getTweet(long tweetId) {
        ManhattanKey key = ManhattanKey.of(DATASET, Bytes.fromLong(tweetId));
        ManhattanResult result = manhattan.get(key, ConsistencyLevel.ONE);

        if (result.isEmpty()) {
            return Optional.empty();
        }

        return Optional.of(Tweet.builder()
            .id(tweetId)
            .text(result.getString("text"))
            .userId(result.getLong("user_id"))
            .createdAt(result.getLong("created_at"))
            .retweetCount(result.getCounter("retweet_count"))
            .favoriteCount(result.getCounter("favorite_count"))
            .build());
    }

    public void saveTweet(Tweet tweet) {
        ManhattanKey key = ManhattanKey.of(DATASET, Bytes.fromLong(tweet.getId()));

        ManhattanMutation mutation = ManhattanMutation.builder()
            .put("text", tweet.getText())
            .put("user_id", tweet.getUserId())
            .put("created_at", tweet.getCreatedAt())
            .putCounter("retweet_count", 0)
            .putCounter("favorite_count", 0)
            .build();

        manhattan.mutate(key, mutation, ConsistencyLevel.QUORUM);
    }

    public void incrementRetweetCount(long tweetId) {
        ManhattanKey key = ManhattanKey.of(DATASET, Bytes.fromLong(tweetId));
        manhattan.incrementCounter(key, "retweet_count", 1);
    }
}

5.4 LSM-Tree 存储引擎细节

Manhattan 的 LSM-Tree 实现针对 Twitter 的工作负载做了多项优化：

LSM-Tree 写入与压缩流程

写入请求
    │
    ▼
┌──────────────┐
│  WAL（预写日志）│  ──── 顺序写入磁盘，保证持久性
└──────────────┘
    │
    ▼
┌──────────────┐
│  MemTable    │  ──── 内存中的有序跳表
│  （活跃）     │       达到阈值后冻结
└──────────────┘
    │ 冻结
    ▼
┌──────────────┐
│  Immutable   │  ──── 等待刷盘
│  MemTable    │
└──────────────┘
    │ 刷盘
    ▼
┌──────────────┐
│  Level 0     │  ──── SSTable 文件，可能重叠
│  SSTable     │
└──────────────┘
    │ 压缩合并
    ▼
┌──────────────┐
│  Level 1     │  ──── SSTable 文件，键范围不重叠
│  SSTable     │
└──────────────┘
    │ 压缩合并
    ▼
┌──────────────┐
│  Level N     │  ──── 更大的 SSTable 文件
│  SSTable     │
└──────────────┘

布隆过滤器优化（Bloom Filter Optimization）。 每个 SSTable 文件附带一个布隆过滤器，用于快速判断某个键是否存在于该文件中。这可以避免大量无效的磁盘读取。

// 布隆过滤器在 SSTable 查找中的应用
public class SSTableReader {
    private final BloomFilter bloomFilter;
    private final IndexBlock indexBlock;
    private final RandomAccessFile dataFile;

    public Optional<byte[]> get(byte[] key) {
        // 先检查布隆过滤器，O(1) 时间复杂度
        if (!bloomFilter.mightContain(key)) {
            // 一定不存在，避免磁盘 IO
            return Optional.empty();
        }

        // 布隆过滤器返回 true，可能存在
        // 在索引块中二分查找
        long offset = indexBlock.findOffset(key);
        if (offset < 0) {
            return Optional.empty();
        }

        // 读取数据块
        byte[] value = readDataBlock(offset, key);
        return Optional.ofNullable(value);
    }
}

5.5 多租户与资源隔离

Manhattan 支持多租户（Multi-Tenancy），不同的数据集（Dataset）共享同一个集群，但通过资源隔离机制避免相互干扰：

配额管理（Quota Management）： 每个数据集有独立的读写 QPS 配额和存储容量配额
优先级调度（Priority Scheduling）： 关键路径请求（如时间线读取）优先于后台任务（如分析查询）
流量控制（Rate Limiting）： 防止单个数据集的突发流量影响整个集群

六、从 MySQL 到自研存储的迁移路径

6.1 迁移前的 MySQL 架构

在迁移前，Twitter 的 MySQL 架构已经相当复杂：

Twitter MySQL 分片架构（迁移前）

                    ┌─────────────┐
                    │  应用层      │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  Gizzard    │  ──── Twitter 自研的分片中间件
                    │  分片路由    │
                    └──────┬──────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
   ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
   │  分片 0-999  │ │ 分片 1000-  │ │ 分片 2000-  │
   │  MySQL主库   │ │ 1999       │ │ 2999       │
   │             │ │ MySQL主库   │ │ MySQL主库   │
   └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
          │                │                │
   ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
   │  MySQL从库   │ │  MySQL从库   │ │  MySQL从库   │
   │  （读副本）   │ │  （读副本）   │ │  （读副本）   │
   └─────────────┘ └─────────────┘ └─────────────┘

Twitter 使用自研的 Gizzard 框架进行分片管理。Gizzard 负责将数据按用户 ID 哈希到不同的 MySQL 实例上，但这个架构存在几个问题：

分片重平衡困难。 添加新分片需要大量的数据迁移
跨分片查询低效。 需要在多个分片上执行查询并合并结果
运维成本高。 数千个 MySQL 实例的备份、升级、故障恢复工作量巨大

6.2 迁移策略：四步走

Twitter 从 MySQL 迁移到 Manhattan 采用了四步渐进策略：

第一步：暗读取（Dark Read）。 在不影响线上流量的前提下，将 MySQL 的数据异步同步到 Manhattan。同时，对一小部分读请求同时查询 MySQL 和 Manhattan，对比结果，验证数据一致性。

// 暗读取（Dark Read）实现
public class DarkReadService {
    private final MysqlDao mysql;
    private final ManhattanClient manhattan;
    private final double darkReadPercentage;

    public Tweet getTweet(long tweetId) {
        // 始终从 MySQL 读取（主路径）
        Tweet mysqlResult = mysql.getTweet(tweetId);

        // 按比例执行暗读取
        if (ThreadLocalRandom.current().nextDouble() < darkReadPercentage) {
            CompletableFuture.runAsync(() -> {
                try {
                    Tweet manhattanResult = manhattan.getTweet(tweetId);
                    if (!mysqlResult.equals(manhattanResult)) {
                        metrics.increment("dark_read.mismatch");
                        alertService.log(
                            "Data mismatch for tweet " + tweetId
                        );
                    } else {
                        metrics.increment("dark_read.match");
                    }
                } catch (Exception e) {
                    metrics.increment("dark_read.error");
                }
            });
        }

        return mysqlResult;
    }
}

第二步：影子写入（Shadow Write）。 所有写入同时发送到 MySQL 和 Manhattan。MySQL 仍然是主存储，Manhattan 是影子存储。

第三步：主备切换（Primary Switch）。 将 Manhattan 提升为主存储，MySQL 降级为影子存储。此时读取从 Manhattan 获取，写入同时发送到两者。

第四步：下线 MySQL（Decommission）。 在确认 Manhattan 稳定运行一段时间后，停止向 MySQL 写入，最终下线 MySQL 实例。

graph LR
    subgraph "第一步：暗读取"
        A1[应用] -->|读/写| M1[(MySQL<br/>主)]
        A1 -.->|暗读取| MH1[(Manhattan<br/>影子)]
        M1 -->|异步同步| MH1
    end

    subgraph "第二步：影子写入"
        A2[应用] -->|读/写| M2[(MySQL<br/>主)]
        A2 -->|影子写入| MH2[(Manhattan<br/>影子)]
    end

    subgraph "第三步：主备切换"
        A3[应用] -->|读/写| MH3[(Manhattan<br/>主)]
        A3 -->|影子写入| M3[(MySQL<br/>影子)]
    end

    subgraph "第四步：下线"
        A4[应用] -->|读/写| MH4[(Manhattan<br/>主)]
        M4[(MySQL)] -.->|下线| X[停止服务]
    end

6.3 数据校验与修复

在迁移过程中，数据一致性校验是最关键的环节。Twitter 构建了专门的校验管道（Verification Pipeline）来持续比对两个存储系统中的数据：

# 数据校验管道（Python 伪代码）
class DataVerificationPipeline:
    def __init__(self, mysql_client, manhattan_client, sample_rate=0.01):
        self.mysql = mysql_client
        self.manhattan = manhattan_client
        self.sample_rate = sample_rate
        self.mismatch_count = 0
        self.total_checked = 0

    def verify_batch(self, tweet_ids: list) -> dict:
        """批量校验推文数据一致性"""
        results = {
            "matched": 0,
            "mismatched": 0,
            "mysql_only": 0,
            "manhattan_only": 0,
            "errors": 0
        }

        for tweet_id in tweet_ids:
            try:
                mysql_tweet = self.mysql.get_tweet(tweet_id)
                manhattan_tweet = self.manhattan.get_tweet(tweet_id)

                if mysql_tweet is None and manhattan_tweet is None:
                    continue
                elif mysql_tweet is None:
                    results["manhattan_only"] += 1
                    self.repair_to_mysql(tweet_id, manhattan_tweet)
                elif manhattan_tweet is None:
                    results["mysql_only"] += 1
                    self.repair_to_manhattan(tweet_id, mysql_tweet)
                elif self.compare_tweets(mysql_tweet, manhattan_tweet):
                    results["matched"] += 1
                else:
                    results["mismatched"] += 1
                    self.resolve_conflict(tweet_id, mysql_tweet, manhattan_tweet)

            except Exception as e:
                results["errors"] += 1
                logging.error(f"Verification error for tweet {tweet_id}: {e}")

        self.total_checked += len(tweet_ids)
        return results

    def compare_tweets(self, mysql_tweet, manhattan_tweet) -> bool:
        """比较两个存储中的推文数据"""
        fields_to_compare = ["text", "user_id", "created_at"]
        return all(
            getattr(mysql_tweet, f) == getattr(manhattan_tweet, f)
            for f in fields_to_compare
        )

    def resolve_conflict(self, tweet_id, mysql_tweet, manhattan_tweet):
        """冲突解决：以 MySQL（主存储）为准"""
        self.manhattan.put_tweet(tweet_id, mysql_tweet)
        logging.warning(
            f"Conflict resolved for tweet {tweet_id}, "
            f"MySQL version applied to Manhattan"
        )

七、消息队列与异步处理

7.1 从 Starling 到 Kestrel 再到 Kafka

Twitter 的消息队列经历了三代演进：

阶段	消息队列	语言	特点	问题
第一代	Starling	Ruby	兼容 Memcached 协议	性能不足，不支持持久化
第二代	Kestrel	Scala	支持持久化，高吞吐	不支持消费者组，缺乏回放能力
第三代	EventBus/Kafka	Java/Scala	支持回放、消费者组、跨 DC 复制	运维复杂度较高

7.2 Kestrel 的设计

Kestrel 是 Twitter 用 Scala 编写的轻量级消息队列，设计理念是”做好一件事”。它兼容 Memcached 协议，使得从 Starling 迁移几乎不需要修改客户端代码。

// Kestrel 消息队列的使用示例
import com.twitter.finagle.kestrel.Client

object FanoutWorker {
  val kestrelClient = Client("kestrel-cluster:22133")

  def processFanoutQueue(): Unit = {
    val queue = kestrelClient.from("fanout_tasks")

    queue.foreach { message =>
      try {
        val task = FanoutTask.deserialize(message.bytes)
        performFanout(task)
        message.acknowledge()
      } catch {
        case e: Exception =>
          // 处理失败，消息将被重新投递
          message.nack()
          metrics.increment("fanout.worker.failure")
      }
    }
  }

  private def performFanout(task: FanoutTask): Unit = {
    val followers = socialGraph.getFollowers(task.userId)
    followers.grouped(1000).foreach { batch =>
      batch.foreach { followerId =>
        timelineCache.addTweet(followerId, task.tweetId)
      }
    }
  }
}

7.3 EventBus：Twitter 的统一事件总线

随着微服务数量的增长，Twitter 开发了 EventBus 作为统一的事件发布-订阅系统。EventBus 基于 Apache Kafka 构建，增加了以下特性：

模式注册（Schema Registry）： 使用 Apache Thrift 定义事件格式，确保生产者和消费者之间的契约
死信队列（Dead Letter Queue，DLQ）： 处理失败的消息自动进入死信队列，便于后续分析和重试
跨数据中心镜像（Cross-DC Mirroring）： 事件在多个数据中心之间自动复制

// EventBus 生产者示例
public class TweetEventProducer {
    private final EventBusProducer<TweetEvent> producer;

    public TweetEventProducer(EventBusConfig config) {
        this.producer = EventBus.newProducer(
            config,
            "tweet_events",
            TweetEvent.class
        );
    }

    public void publishTweetCreated(Tweet tweet) {
        TweetEvent event = TweetEvent.builder()
            .eventType(EventType.TWEET_CREATED)
            .tweetId(tweet.getId())
            .userId(tweet.getUserId())
            .text(tweet.getText())
            .timestamp(System.currentTimeMillis())
            .build();

        producer.publish(event).onSuccess(offset -> {
            metrics.increment("tweet.event.published");
        }).onFailure(ex -> {
            metrics.increment("tweet.event.publish_failed");
            deadLetterQueue.enqueue(event);
        });
    }
}

// EventBus 消费者示例
public class FanoutEventConsumer {
    private final EventBusConsumer<TweetEvent> consumer;
    private final FanoutService fanoutService;

    public void start() {
        consumer = EventBus.newConsumer(
            config,
            "tweet_events",
            "fanout-consumer-group",
            TweetEvent.class
        );

        consumer.subscribe(event -> {
            if (event.getEventType() == EventType.TWEET_CREATED) {
                try {
                    fanoutService.fanout(event.toTweet());
                    return ConsumerResult.SUCCESS;
                } catch (Exception e) {
                    return ConsumerResult.RETRY;
                }
            }
            return ConsumerResult.SUCCESS;
        });
    }
}

八、缓存架构

8.1 多层缓存策略

Twitter 的缓存架构分为多个层次，每一层服务于不同的目的：

graph TB
    Client[客户端] --> CDN[CDN<br/>静态资源与媒体]
    CDN --> LB[负载均衡器]
    LB --> AppServer[应用服务器]

    AppServer --> L1[L1 缓存<br/>进程内缓存<br/>Caffeine/Guava]
    AppServer --> L2[L2 缓存<br/>Redis Cluster<br/>时间线缓存]
    AppServer --> L3[L3 缓存<br/>Memcached 集群<br/>推文与用户缓存]
    AppServer --> Backend[(后端存储<br/>Manhattan)]

    L1 -.->|未命中| L2
    L2 -.->|未命中| L3
    L3 -.->|未命中| Backend

缓存层	技术	用途	容量	延迟
L1	Caffeine	热点数据本地缓存	数百 MB/实例	微秒级
L2	Redis Cluster	时间线缓存	数十 TB	亚毫秒级
L3	Twemcache	推文、用户数据缓存	数百 TB	毫秒级
后端	Manhattan	持久化存储	PB 级	毫秒级

8.2 Twemcache：Twitter 的 Memcached 改进版

Twitter 发现原版 Memcached 在其规模下存在若干问题，因此开发了 Twemcache 进行了以下改进：

Slab 分配器优化。 原版 Memcached 的 slab 分配器在长时间运行后会出现内存碎片（Memory Fragmentation），导致有效内存利用率下降。Twemcache 引入了 slab 自动重平衡（Slab Automove）机制。

连接管理优化。 Twitter 的应用服务器数量巨大，每台服务器都需要与 Memcached 集群建立连接。Twemcache 使用了连接池（Connection Pooling）和多路复用（Multiplexing）来减少连接数量。

// Twemcache 客户端使用示例
public class TweetCacheService {
    private final TwemcacheClient cache;
    private static final int TWEET_TTL_SECONDS = 86400; // 24 小时
    private static final int HOT_TWEET_TTL_SECONDS = 3600; // 热点推文 1 小时

    public Optional<Tweet> getCachedTweet(long tweetId) {
        String key = "tweet:" + tweetId;
        byte[] cached = cache.get(key);
        if (cached != null) {
            metrics.increment("tweet.cache.hit");
            return Optional.of(Tweet.deserialize(cached));
        }
        metrics.increment("tweet.cache.miss");
        return Optional.empty();
    }

    public void cacheTweet(Tweet tweet) {
        String key = "tweet:" + tweet.getId();
        int ttl = isHotTweet(tweet) ? HOT_TWEET_TTL_SECONDS : TWEET_TTL_SECONDS;
        cache.set(key, tweet.serialize(), ttl);
    }

    public Map<Long, Tweet> multiGetTweets(List<Long> tweetIds) {
        // 批量获取，减少网络往返
        List<String> keys = tweetIds.stream()
            .map(id -> "tweet:" + id)
            .collect(Collectors.toList());

        Map<String, byte[]> results = cache.multiGet(keys);
        Map<Long, Tweet> tweets = new HashMap<>();

        for (int i = 0; i < tweetIds.size(); i++) {
            byte[] data = results.get(keys.get(i));
            if (data != null) {
                tweets.put(tweetIds.get(i), Tweet.deserialize(data));
            }
        }
        return tweets;
    }

    private boolean isHotTweet(Tweet tweet) {
        return tweet.getRetweetCount() > 10000
            || tweet.getFavoriteCount() > 50000;
    }
}

8.3 缓存一致性问题

在分布式缓存环境下，缓存一致性是一个持续性挑战。Twitter 采用了以下策略：

基于事件的缓存失效（Event-Driven Cache Invalidation）。 当推文被更新或删除时，通过 EventBus 广播缓存失效事件，所有缓存节点异步清除对应的缓存条目。

租约机制（Lease Mechanism）。 为了防止缓存击穿（Cache Stampede），当缓存未命中时，只允许一个请求去后端加载数据并更新缓存，其他请求等待或使用短暂的陈旧数据。

// 防止缓存击穿的租约机制
public class LeaseBasedCache<K, V> {
    private final TwemcacheClient cache;
    private final ConcurrentMap<K, CompletableFuture<V>> pendingLoads;

    public V getOrLoad(K key, Function<K, V> loader) {
        // 先尝试缓存
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }

        // 尝试获取租约
        CompletableFuture<V> future = new CompletableFuture<>();
        CompletableFuture<V> existing = pendingLoads.putIfAbsent(key, future);

        if (existing != null) {
            // 已有其他线程在加载，等待结果
            try {
                return existing.get(500, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // 等待超时，直接从后端加载
                return loader.apply(key);
            }
        }

        // 获得租约，负责加载数据
        try {
            V value = loader.apply(key);
            cache.set(key, value);
            future.complete(value);
            return value;
        } catch (Exception e) {
            future.completeExceptionally(e);
            throw e;
        } finally {
            pendingLoads.remove(key);
        }
    }
}

九、工程案例：名人推文的混合 Fanout

本节通过一个完整的工程案例，展示 Twitter 如何处理名人推文的写入和读取流程。

9.1 场景设定

假设 Taylor Swift（粉丝数约 9200 万）发了一条推文：“New album out now!”。我们追踪这条推文从写入到出现在粉丝时间线上的完整路径。

9.2 写入路径

时间线：名人推文写入流程

T+0ms     Taylor Swift 点击"发推"按钮
          │
T+5ms     请求到达 API 网关（Finagle）
          │
T+8ms     API 网关将请求路由到 Tweet Service
          │
T+15ms    Tweet Service 调用 Snowflake 生成推文 ID
          ID = 1625847293847293952
          │
T+20ms    Tweet Service 将推文写入 Manhattan
          键: tweets/1625847293847293952
          值: {text: "New album out now!", user_id: 17766918, ...}
          │
T+25ms    Tweet Service 发布 TweetCreated 事件到 EventBus
          │
T+30ms    Fanout Service 消费事件
          │
T+35ms    Fanout Service 查询 Social Graph：
          Taylor Swift 的粉丝数 = 92,000,000
          超过名人阈值（100,000），标记为名人推文
          │
T+40ms    Fanout Service 跳过扇出写入
          记录指标：fanout.celebrity.skipped
          │
T+45ms    搜索索引服务（Earlybird）消费事件
          将推文索引到倒排索引中
          │
T+50ms    推送通知服务消费事件
          向启用通知的粉丝发送推送

9.3 读取路径

时间线：粉丝读取名人推文

T+0ms     粉丝 Alice 打开 Twitter 应用，请求主页时间线
          │
T+5ms     请求到达 Timeline Service
          │
T+10ms    Timeline Service 并行执行两个查询：
          │
          ├── 查询 1：从 Redis 读取 Alice 的时间线缓存
          │   返回 200 个推文 ID（来自普通用户的扇出写入）
          │   延迟：2ms
          │
          └── 查询 2：查询 Alice 关注的名人列表
              Alice 关注了 15 个名人账号
              │
              对每个名人查询最近 10 条推文
              │
              从 Manhattan 或 L3 缓存读取
              返回约 150 个推文 ID
              延迟：8ms
          │
T+18ms    合并两部分推文 ID，按 Snowflake ID 排序
          去重后取前 200 条
          │
T+20ms    Tweet Hydrator 批量获取 200 条推文的完整信息
          │
          ├── 从 L1 缓存命中 30 条（15%）
          ├── 从 L3 缓存命中 150 条（75%）
          └── 从 Manhattan 读取 20 条（10%）
          │
T+35ms    并行获取推文关联的用户信息、媒体链接
          │
T+50ms    Timeline Service 返回完整的时间线响应
          │
T+55ms    客户端开始渲染时间线

9.4 关键性能指标

指标	普通用户推文	名人推文
写入延迟（P50）	50ms	25ms
写入延迟（P99）	200ms	50ms
读取延迟（P50）	30ms	55ms
读取延迟（P99）	100ms	150ms
扇出写入量	粉丝数条	0 条
读取时额外查询	0 次	关注名人数次

从表中可以看出，混合扇出策略实质上是用读取延迟的少量增加来换取写入延迟和系统负载的大幅降低。对于名人推文，写入路径的延迟从可能的数秒降低到 25 毫秒，代价是读取路径多了约 25 毫秒的延迟，这在用户体验上几乎不可感知。

9.5 异常处理与降级策略

在实际生产中，混合 Fanout 系统需要处理多种异常情况：

// 混合时间线读取的降级策略
public class ResilientTimelineService {
    private final TimelineCache timelineCache;
    private final CelebrityTweetFetcher celebrityFetcher;
    private final TweetHydrator hydrator;
    private final CircuitBreaker celebrityCircuitBreaker;

    public TimelineResponse getTimeline(long userId, int count) {
        // 从缓存读取基础时间线（始终执行）
        List<Long> baseTweetIds;
        try {
            baseTweetIds = timelineCache.get(userId, count);
        } catch (Exception e) {
            // 缓存不可用时的降级：直接从存储读取
            metrics.increment("timeline.cache.fallback");
            baseTweetIds = fallbackToStorage(userId, count);
        }

        // 获取名人推文（可降级）
        List<Long> celebrityTweetIds = Collections.emptyList();
        if (celebrityCircuitBreaker.isAllowed()) {
            try {
                celebrityTweetIds = celebrityFetcher.fetch(userId, 10);
            } catch (Exception e) {
                // 名人推文获取失败：降级为只显示已缓存的推文
                metrics.increment("timeline.celebrity.degraded");
                celebrityCircuitBreaker.recordFailure();
            }
        }

        // 合并并水合
        List<Long> mergedIds = merge(baseTweetIds, celebrityTweetIds, count);

        List<Tweet> hydratedTweets;
        try {
            hydratedTweets = hydrator.hydrate(mergedIds);
        } catch (Exception e) {
            // 水合失败：返回部分结果
            metrics.increment("timeline.hydration.partial");
            hydratedTweets = hydrator.hydratePartial(mergedIds);
        }

        return new TimelineResponse(hydratedTweets);
    }

    private List<Long> fallbackToStorage(long userId, int count) {
        // 降级方案：查询用户关注列表的最新推文
        List<Long> followingIds = socialGraph.getFollowing(userId);
        return tweetStore.getLatestTweetsByUsers(followingIds, count);
    }
}

9.6 监控与告警

# 时间线服务的关键监控指标
monitoring:
  metrics:
    - name: timeline.read.latency.p50
      threshold: 50ms
      alert: warning
    - name: timeline.read.latency.p99
      threshold: 200ms
      alert: critical
    - name: timeline.cache.hit_rate
      threshold_below: 0.95
      alert: warning
    - name: timeline.celebrity.degraded.rate
      threshold_above: 0.05
      alert: critical
    - name: fanout.lag.seconds
      threshold_above: 30
      alert: critical
    - name: fanout.failure.rate
      threshold_above: 0.01
      alert: warning

  dashboards:
    timeline_health:
      panels:
        - title: "时间线读取延迟分布"
          type: histogram
          metrics: [timeline.read.latency]
        - title: "缓存命中率"
          type: gauge
          metrics: [timeline.cache.hit_rate]
        - title: "扇出延迟"
          type: timeseries
          metrics: [fanout.lag.seconds]
        - title: "名人推文降级率"
          type: timeseries
          metrics: [timeline.celebrity.degraded.rate]

十、架构演进的经验与教训

10.1 技术选型的核心原则

Twitter 的架构演进揭示了几个重要的技术选型原则：

第一，不要过早优化，但要为扩展做好准备。 早期使用 Ruby on Rails 是正确的选择，它让团队在产品尚未验证时快速迭代。但当流量增长到一定规模时，必须果断进行技术栈升级。

第二，渐进式迁移优于一次性重写。 绞杀者模式使 Twitter 能够在不停机的前提下完成从 Ruby 到 JVM 的迁移。每个阶段都有明确的回退方案，降低了风险。

第三，自研基础设施要谨慎。 Manhattan、Finagle、Snowflake 等自研组件解决了 Twitter 特有的问题，但也带来了巨大的维护成本。只有在开源方案确实无法满足需求时，才应该考虑自研。

10.2 架构权衡矩阵

决策点	选项 A	选项 B	Twitter 的选择	理由
编程语言	继续用 Ruby	迁移到 JVM	JVM（Scala/Java）	GIL 限制、GC 性能、生态成熟度
扇出策略	纯推模式	纯拉模式	混合模式	平衡名人问题与读取延迟
存储引擎	开源方案	自研 Manhattan	自研 Manhattan	Cassandra 运维问题、定制化需求
ID 生成	UUID	自增 ID	Snowflake	全局有序、嵌入时间戳、高性能
RPC 框架	gRPC	自研 Finagle	自研 Finagle	2010 年 gRPC 尚未发布，需要定制能力
消息队列	RabbitMQ	自研 Kestrel	先 Kestrel 后 Kafka	初期需要轻量级方案，后期需要回放能力
缓存	原版 Memcached	改进版 Twemcache	Twemcache	内存碎片问题、连接管理问题

10.3 关键教训

教训一：数据库不是万能的。 早期 Twitter 试图让 MySQL 承担所有的数据存储和查询职责，包括实时时间线查询。这种做法在小规模下可行，但在大规模下必然崩溃。正确的做法是将不同类型的数据放在最适合的存储系统中：推文元数据放在 Manhattan，时间线放在 Redis，搜索索引放在 Earlybird。

教训二：关注写放大与读放大。 扇出写入的核心问题是写放大——一次写入被放大为数百万次写入。扇出读取的核心问题是读放大——一次读取需要聚合数百个源。理解放大效应是设计高吞吐系统的关键。

教训三：可观测性（Observability）是生命线。 Twitter 开发的 Zipkin 分布式追踪系统后来成为 OpenZipkin 开源项目，被广泛采用。在数百个微服务组成的系统中，没有分布式追踪，排查延迟问题几乎不可能。

教训四：不要忽视运维。 Cassandra 在技术上满足了 Twitter 的存储需求，但在运维层面带来了太多问题。最终迫使 Twitter 自研 Manhattan。技术选型不仅要看功能和性能，还要看团队的运维能力。

10.4 对架构师的启示

Twitter 的架构演进提供了以下几个值得深思的启示：

分层思考。 将系统分为存储层、缓存层、服务层、网关层，每一层独立演进，通过清晰的接口解耦。这使得 Twitter 能够在不影响整体系统的前提下替换某一层的技术实现。

极端场景驱动设计。 名人推文的处理是一个经典的极端场景。架构设计必须考虑最坏情况，而不是平均情况。混合扇出策略正是为了应对这种极端场景而设计的。

拥抱最终一致性。 Twitter 的时间线不需要强一致性。一个用户在发推后的几秒内看不到这条推文出现在部分粉丝的时间线上是完全可以接受的。放弃强一致性换取可用性和性能，这是分布式系统设计中的常见权衡。

技术债务的管理。 Twitter 从 Ruby 到 JVM 的迁移持续了近三年，这本身就是一次大规模的技术债务偿还。关键在于将大的迁移分解为小的、可控的步骤，每一步都有回退方案和衡量标准。

graph TB
    subgraph "Twitter 架构全景（2014 年后）"
        Mobile[移动客户端] --> CDN[CDN]
        Web[Web 客户端] --> CDN
        CDN --> GW[API 网关<br/>Finagle]

        GW --> TweetSvc[Tweet Service]
        GW --> UserSvc[User Service]
        GW --> TimelineSvc[Timeline Service]
        GW --> SearchSvc[Search Service<br/>Earlybird]
        GW --> AdsSvc[Ads Service]

        TweetSvc --> MH_T[(Manhattan<br/>推文存储)]
        UserSvc --> MH_U[(Manhattan<br/>用户存储)]
        TimelineSvc --> Redis[(Redis Cluster<br/>时间线缓存)]
        SearchSvc --> EBIdx[(Earlybird<br/>搜索索引)]

        TweetSvc --> EB[EventBus<br/>Kafka]
        EB --> FanoutSvc[Fanout Service]
        EB --> IndexSvc[Index Service]
        EB --> NotifSvc[Notification Service]
        EB --> AnalyticsSvc[Analytics Service]

        FanoutSvc --> SG[Social Graph<br/>Service]
        FanoutSvc --> Redis

        TweetSvc --> Cache[Twemcache<br/>Cluster]
        UserSvc --> Cache

        subgraph "基础设施"
            ZK[ZooKeeper<br/>服务发现]
            Zipkin[Zipkin<br/>分布式追踪]
            Aurora[Aurora<br/>作业调度]
        end
    end

参考资料

Krikorian, R. “Timelines at Scale.” QCon San Francisco, 2012.
Twitter Engineering Blog. “The Infrastructure Behind Twitter: Scale.” 2013.
Aniszczyk, C. “Evolution of the Twitter Stack.” Twitter Engineering, 2013.
Twitter Engineering Blog. “Manhattan, our real-time, multi-tenant distributed database for Twitter scale.” 2014.
Twitter Engineering Blog. “How we built Twitter Lite.” 2017.
Slee, M., Agarwal, A., Kwiatkowski, M. “Thrift: Scalable Cross-Language Services Implementation.” Facebook Technical Report, 2007.
Twitter Open Source. “Finagle: A Protocol-Agnostic RPC System.” https://twitter.github.io/finagle/
Twitter Open Source. “Snowflake: Network Service for Generating Unique ID Numbers at High Scale.” GitHub Repository.
Eriksen, M. “Your Server as a Function.” PLOS, 2013.
Twitter Engineering Blog. “Caching at Twitter.” 2014.
Sigelman, B., et al. “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure.” Google Technical Report, 2010.
Twitter Engineering Blog. “The Unified Logging Infrastructure for Data Analytics at Twitter.” VLDB, 2012.

上一篇：Conway 定律下一篇：Netflix 架构全景

同主题继续阅读

把当前热点继续串成多页阅读，而不是停在单篇消费。

2026-06-14 · architecture / security

【身份与访问控制工程】SCIM 与账号生命周期：开通、变更、离职自动化

SSO 只解决认证，SCIM 解决账号的生命周期管理。但 SCIM 2.0 的实现远不是调几个 REST API 那么简单：User/Group schema 的映射、Delta vs Full sync 的同步策略、Patch 操作语义，每个环节都有坑。本文从账号生命周期的四个关键事件出发，拆解 SCIM 2.0 的核心协议、同步模式、Schema 扩展，以及与企业 IdP（Azure AD、Okta）对接的实际工程经验。

2026-06-13 · architecture / security

【身份与访问控制工程】IAM 全景：为什么这是高价值赛道

从 2020 年 SolarWinds 到 2024 年 Okta 支持系统泄露，身份基础设施的安全失败反复证明一件事：IAM 不是 IT 支撑系统，而是安全架构的承重墙。本文建立现代 IAM 的全景地图——从认证协议、令牌体系、权限模型到身份治理与平台选型，给出 5 个贯穿全系列的核心问题。

2026-06-13 · architecture / security

【身份与访问控制工程】企业单点登录：OIDC 与现代 SSO

OIDC 是当下企业 SSO 的事实标准，但大多数实现只用了它 20% 的规范。本文从 OIDC 核心规范出发，拆解 Authorization Code Flow + PKCE 的完整交互、ID Token 的验证规则、Discovery 与 Dynamic Registration 的互操作性机制，以及 RP-Initiated Logout 和 Session Management 的工程实现细节。

2026-06-13 · architecture / security

【身份与访问控制工程】OAuth 2.1 与 PKCE：现代授权主路径

OAuth 2.1 不是新协议，而是对 OAuth 2.0 的安全加固：废除 Implicit Grant 和 Resource Owner Password Grant，强制 PKCE 用于所有使用授权码模式的客户端，要求精确 redirect_uri 比对。本文从 PKCE 的密码学动机出发，拆解 OAuth 2.1 的授权码流程完整交互、Refresh Token 轮换与发送者约束、DPoP 令牌绑定，以及 DCR (Dynamic Client Registration) 和 RAR (Rich Authorization Requests) 的实际应用。

文章导航

目录

一、Twitter 技术栈演进时间线（2006-2023）

1.1 第一阶段：Ruby on Rails 单体（2006-2008）

1.2 第二阶段：服务化拆分（2008-2010）

1.3 第三阶段：JVM 全面迁移（2010-2014）

1.4 第四阶段：优化与整合（2014-2023）

二、从 Ruby on Rails 到 JVM 的迁移

2.1 为什么离开 Ruby

2.2 迁移策略：绞杀者模式

2.3 Finagle：异步 RPC 框架

2.4 迁移过程中的关键挑战

三、Timeline Service 与 Fanout 架构

3.1 时间线的本质

3.2 时间线服务架构

3.3 时间线缓存结构

3.4 Snowflake ID 生成器

四、Fan-out on Write vs Fan-out on Read 深度对比

4.1 两种策略的原理

4.2 详细对比

4.3 名人问题（Celebrity Problem）

4.4 混合扇出策略

4.5 扇出服务的内部架构

五、Manhattan 存储引擎

5.1 为什么自研存储

5.2 Manhattan 架构

5.3 Manhattan 的数据模型

5.4 LSM-Tree 存储引擎细节

5.5 多租户与资源隔离

六、从 MySQL 到自研存储的迁移路径

6.1 迁移前的 MySQL 架构

6.2 迁移策略：四步走

6.3 数据校验与修复

七、消息队列与异步处理

7.1 从 Starling 到 Kestrel 再到 Kafka

7.2 Kestrel 的设计

7.3 EventBus：Twitter 的统一事件总线

八、缓存架构

8.1 多层缓存策略

8.2 Twemcache：Twitter 的 Memcached 改进版

8.3 缓存一致性问题

九、工程案例：名人推文的混合 Fanout

9.1 场景设定

9.2 写入路径

9.3 读取路径

9.4 关键性能指标

9.5 异常处理与降级策略

9.6 监控与告警

十、架构演进的经验与教训

10.1 技术选型的核心原则

10.2 架构权衡矩阵

10.3 关键教训

10.4 对架构师的启示

参考资料

同主题继续阅读

【身份与访问控制工程】SCIM 与账号生命周期：开通、变更、离职自动化

【身份与访问控制工程】IAM 全景：为什么这是高价值赛道

【身份与访问控制工程】企业单点登录：OIDC 与现代 SSO

【身份与访问控制工程】OAuth 2.1 与 PKCE：现代授权主路径