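Storage and Cost: Sampling, Downsampling, Hot/Cold Tiering, Object Storage

The first reaction to an observability bill is often "too many logs." That is only half true. In many mid-sized teams a common breakdown is Logs 40%–60%, Traces 20%–35%, Metrics 10%–20%, and Profiles < 5% (an engineering rule-of-thumb range, not a vendor quote). The real question is: how many of those bytes are cold data that nobody has ever queried?

This article provides a cost-estimation worksheet built on explicit assumptions (no cloud vendor prices are made up), explains the order in which the levers apply (sampling, retention, downsampling, hot/cold tiering), and gives actionable retention configuration strategies for Prometheus/Mimir, Loki, and Tempo. For sampling theory see Instrumentation Philosophy; for trace sampling in practice see The Traces Stack; for data models see Data Model and TSDB Internals.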

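1. Breaking Down Observability Cost

1.1 Five Cost Dimensions

Dimension	Includes	Typical share (assumed model)
Compute	Collector, Querier, Compactor CPU/RAM	25%–35%
Storage	Hot SSD + warm HDD + cold object storage	40%–55%
Network	Cross-AZ replication, public egress, S3 API	10%–20%
Index	ES/Loki index, Tempo metadata	5%–15%
People	Pipeline maintenance, capacity-planning SRE time	10%–25% (often overlooked)

The ranges describe different deployments and are not meant to sum to 100%.

1.2 Cost Structure of the Four Pillars

Pillar	Share of cost	Main driver	Biggest lever
Logs	40%–60%	Volume + indexing strategy	Sampling + retention
Traces	20%–35%	Span count × span size	Head/tail sampling
Metrics	10%–20%	Series count × retention	Cardinality + recording rules
Profiles	< 5%	Sampling frequency × symbol tables	On-demand profiling

1.3 The 80/20 Rule: High-Cardinality Services

Typically 20% of services produce 80% of the observability data. To locate them: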

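# Metrics: series count per job (run against Prometheus)
topk(20, count by (job) ({__name__=~".+"}))

# Logs: bytes ingested over the last day. Loki metric and label names vary by deployment;
# the `service` label is an assumption (the distributor metric is usually labelled by tenant).
# increase() returns bytes per day directly, whereas rate() would return bytes per second.
# sum by (service) (increase(loki_distributor_bytes_received_total[1d]))

Give each of the top 5 heavy producers its own retention and sampling policy. A single global policy either wastes money or deletes data that is still needed.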

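1.4 Intersection with SLOs and Alerting

SLO Engineering: metrics that SLIs depend on must not be deleted by an overly short retention.
Alerting: alert history does not need a year of full-resolution metrics, so it can be downsampled.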

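2. Cost Estimation Worksheet (Explicit Assumptions)

Important: the numbers below come from an assumed model and are meant for relative comparison and capacity planning. They are not a quote from any cloud vendor. When deploying, fill in your environment's actual $ / GB-month and $ / million samples.

2.1 Scenario Assumptions (Scenario A)

Assumption	Value	Notes
Microservices	200	Including batch jobs
Pods	5000	K8s
Total QPS	50000	Peak 80000
Metrics scrape interval	15s	Per target
Average series/target	800	Including histograms
Log lines/request	3	Structured JSON, 512 B/line on average
Trace sampling (current)	100% head	Assumes no governance yet
Spans/request	8	Microservice chain
Span size	1 KB	After protobuf encoding
Profile frequency	1/min/pod	Usually lower in production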

2.2 Daily Growth Formulas

Metric samples per day (simplified to a single Prometheus shard):

\[N_{samples/day} \approx N_{series} \times \frac{86400}{scrape\_interval}\]

Scenario A: assume $N_{series} = 4 \times 10^6$ (5000 pods × 800) with a 15s scrape interval:

\[N \approx 4 \times 10^6 \times 5760 \approx 2.3 \times 10^{10} \text{ samples/day}\]

Log volume per day:

\[V_{logs} = QPS \times 86400 \times lines/request \times bytes/line\]

\[V \approx 50000 \times 86400 \times 3 \times 512 \approx 6.6 \times 10^{12} \text{ B} \approx 6.0 \text{ TiB/day}\]

(Full-volume INFO, deliberately exaggerated to show why sampling is mandatory.)

Trace volume per day (100% sampling):

\[V_{traces} = QPS \times 86400 \times spans/request \times bytes/span\]

\[\approx 50000 \times 86400 \times 8 \times 1024 \approx 3.5 \times 10^{13} \text{ B} \approx 32 \text{ TiB/day}\]
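
The three formulas above can be checked with a few lines of Python. This is a minimal sketch that only reproduces the Scenario A arithmetic; the function names are made up for illustration, and every input is one of the worksheet's assumed values rather than a measurement.

```python
"""Scenario A daily-volume arithmetic (inputs are the worksheet's assumptions)."""

SECONDS_PER_DAY = 86_400
TIB = 2**40


def metric_samples_per_day(n_series: int, scrape_interval_s: int) -> float:
    """Samples/day = active series x scrapes per day."""
    return n_series * SECONDS_PER_DAY / scrape_interval_s


def bytes_per_day(qps: float, items_per_request: float, bytes_per_item: float) -> float:
    """Bytes/day for logs (lines) or traces (spans) at a constant average QPS."""
    return qps * SECONDS_PER_DAY * items_per_request * bytes_per_item


if __name__ == "__main__":
    samples = metric_samples_per_day(n_series=5000 * 800, scrape_interval_s=15)
    logs = bytes_per_day(qps=50_000, items_per_request=3, bytes_per_item=512)
    traces = bytes_per_day(qps=50_000, items_per_request=8, bytes_per_item=1024)
    print(f"metrics: {samples:.2e} samples/day")
    print(f"logs:    {logs / TIB:.1f} TiB/day")
    print(f"traces:  {traces / TIB:.1f} TiB/day")
```

Swapping in your own QPS, lines per request, and span size gives the baseline row of the worksheet; the peak QPS of 80000 is deliberately not used because the formulas assume an average rate.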

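2.3 Plugging in Cost (Placeholders)

Let:

$C_m$ = storage cost per million samples ($/M samples-month), filled in by the reader
$C_l$ = log storage cost per GiB ($/GiB-month), filled in by the reader
$C_t$ = trace storage cost per GiB ($/GiB-month), filled in by the reader

Pillar	Monthly stored volume (Scenario A, rough)	Monthly cost formula
Metrics	Accumulated over 30d retention	$\approx N_{samples/month} \times C_m$
Logs	$V_{logs} \times retention\_days$	$V_{logs} \times 30 \times C_l$
Traces	$V_{traces} \times retention\_days$	$V_{traces} \times retention\_days \times C_t$

Relative conclusion (independent of absolute prices): with 100% trace sampling and full-volume INFO logs, Trace + Log >> Metrics, which matches the share ranges in §1.2.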

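2.4 Scenario B: After Governance (Assumed)

Lever	Adjustment	Change in stored volume
Trace: 1% baseline + 100% of errors (tail)	See §3	Trace ≈ ×0.05–0.15
Log INFO at 10%	See §3	Log ≈ ×0.3–0.4
Metrics recording rules + lower cardinality	See §5	Series ≈ ×0.5
Retention 30d→14d (non-SLO logs)	See §4	×0.5 on that portion

Using the worksheet: fill in Scenario A with your own unit prices to get a baseline, then apply the Scenario B multipliers to get a target. The difference between the two sets the ROI priority.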
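
One way to turn the multiplier table into numbers is sketched below. It assumes steady-state storage (daily ingest × retention days) and uses midpoints of the assumed ranges; `stored_tib` is a hypothetical helper, and none of the multipliers are measured values.

```python
"""Apply Scenario B multipliers to Scenario A steady-state storage (assumed values)."""


def stored_tib(daily_tib: float, retention_days: int, multipliers: list[float]) -> float:
    """Steady-state stored volume after applying every lever that affects it."""
    volume = daily_tib * retention_days
    for factor in multipliers:
        volume *= factor
    return volume


# Scenario A baselines from section 2.2, in TiB/day.
LOGS_A, TRACES_A = 6.0, 32.2

# Scenario A: no levers applied.
logs_before = stored_tib(LOGS_A, 30, [])
traces_before = stored_tib(TRACES_A, 7, [])

# Scenario B: INFO at 10% (x0.35 midpoint), retention halved on the
# non-SLO portion (x0.5, applied to all logs here for simplicity),
# trace sampling (x0.10 midpoint).
logs_after = stored_tib(LOGS_A, 30, [0.35, 0.5])
traces_after = stored_tib(TRACES_A, 7, [0.10])

if __name__ == "__main__":
    print(f"logs:   {logs_before:.0f} -> {logs_after:.1f} TiB")
    print(f"traces: {traces_before:.0f} -> {traces_after:.1f} TiB")
```

Multiplying either column by your own $/GiB-month gives the baseline and target costs; because the multipliers are dimensionless, the ranking of levers does not depend on the unit price.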

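3. Sampling: The Strongest Cost Lever

Cost levers in order of impact: sampling > retention > compression > hot/cold tiering > SSD→S3.

3.1 Relationship to Instrumentation Philosophy

Instrumentation Philosophy, §Sampling, uses a four-quadrant decision: keep every ERROR and slow request; normal traffic can be sampled.

3.2 Log Sampling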

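# OpenTelemetry Collector (synthetic example)
# Two pipelines: INFO and below is sampled at 10%; WARN and above is kept in full.
# A single pipeline that first keeps only INFO would silently drop ERROR/WARN.
processors:
  # The filter processor drops records that match, so only INFO-and-below remains.
  filter/keep_info:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number >= SEVERITY_NUMBER_WARN'
  # Drops INFO-and-below, so only WARN and above remains.
  filter/drop_info:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'
  probabilistic_sampler/logs:
    sampling_percentage: 10
    hash_seed: 42

service:
  pipelines:
    logs/info_sampled:
      receivers: [otlp]
      processors: [filter/keep_info, probabilistic_sampler/logs, batch]
      exporters: [loki]   # recent contrib releases deprecate this exporter in favour of OTLP to Loki
    logs/warn_and_above:
      receivers: [otlp]
      processors: [filter/drop_info, batch]
      exporters: [loki]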

Policy:

Level	Sampling rate	Rationale
ERROR	100%	Troubleshooting asset
WARN	100% or 50%	Depends on volume
INFO	1%–10%	Main source of volume
DEBUG	0% in prod	Enable only temporarily
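
The table above can be expressed as a small decision function. This is a sketch rather than what any Collector processor does internally: it assumes each record carries a stable key such as a trace_id, and it hashes that key so that all lines of one request share the same keep/drop decision. The names `KEEP_RATE` and `keep_log` are hypothetical.

```python
import zlib

# Keep-rates per level, taken from the table above (INFO uses the 10% upper bound).
KEEP_RATE = {"ERROR": 1.0, "WARN": 1.0, "INFO": 0.10, "DEBUG": 0.0}


def keep_log(level: str, sampling_key: str) -> bool:
    """Deterministically decide whether to keep a log record.

    Hashing a stable key (for example a trace_id) gives every line of the
    same request the same decision, so sampled requests stay complete.
    """
    rate = KEEP_RATE.get(level.upper(), 1.0)  # unknown levels are kept
    bucket = zlib.crc32(sampling_key.encode("utf-8")) % 10_000
    return bucket < rate * 10_000


if __name__ == "__main__":
    kept = sum(keep_log("INFO", f"req-{i}") for i in range(100_000))
    print(f"INFO kept: {kept / 100_000:.1%}")
```

Deterministic hashing is chosen over `random.random()` so that replicas of a Collector make the same decision for the same request without coordination.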

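3.3 Trace Sampling

See The Traces Stack. Keep a 1% baseline and retain every error or slow trace. In the configuration here the 1% baseline is a probabilistic policy inside the tail sampler, so the Collector must receive all spans. If the SDKs head-sample at 1% before the Collector, most error traces are already gone by the time the tail sampler runs: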

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

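Trade-off: requests that are not sampled have no trace. Troubleshooting them relies on metrics, sampled logs, and the trace_id correlation that remains.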
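
The three policies are combined with OR semantics: a trace is kept if any policy says keep. The sketch below mirrors that logic in plain Python to make the behaviour explicit. `TraceSummary` and `keep_trace` are illustrative names, and the `>=` boundary on latency is an assumption, not a statement about the Collector's exact comparison.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class TraceSummary:
    has_error: bool      # any span with status ERROR
    duration_ms: float   # end-to-end latency of the trace


def keep_trace(
    trace: TraceSummary,
    slow_threshold_ms: float = 500.0,
    baseline_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep errors and slow traces always; keep the rest at the baseline rate."""
    if trace.has_error:
        return True
    if trace.duration_ms >= slow_threshold_ms:
        return True
    return rng() < baseline_rate


if __name__ == "__main__":
    print(keep_trace(TraceSummary(has_error=True, duration_ms=12)))
    print(keep_trace(TraceSummary(has_error=False, duration_ms=900)))
```

Because the decision needs the whole trace, the sampler has to buffer spans for `decision_wait` before deciding, which is the memory cost of tail sampling.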

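3.4 Metrics "Sampling" = Pre-aggregation

Recording rules aggregate high-cardinality raw metrics into low-cardinality SLIs:

- record: service:http_requests:rate5m
  expr: sum by (service, status) (rate(http_requests_total[5m]))

A recording rule by itself adds series. The saving comes from then dropping the raw endpoint-label series (for example via metric relabeling) or giving them a short retention, which helps both storage and query cost.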

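3.5 Profile Sampling

Continuous profilers such as Pyroscope and Parca sample at a low frequency by default (sampling intervals on the order of 10–100 ms, depending on the implementation), so Profiles are usually not the largest item on the bill.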

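4. Retention Engineering Strategy

4.1 Baseline Recommendations

Pillar	Hot (raw resolution)	Warm (reduced resolution)	Cold (archive)
Metrics	7–14d @ 15s	30–90d @ 5m	1y @ 1h
Logs	7–30d	n/a	Compliance archive
Traces	3–7d	n/a	Usually not kept long-term
Profiles	14d	n/a	Optional S3

Floor: retention ≥ average incident-investigation window + a safety margin. For example, 30d for payment-path logs and 7d for internal tools.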

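4.2 Prometheus Local Retention

# Prometheus startup flags
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=500GB

Both limits are active at the same time; whichever is reached first triggers block deletion. See TSDB Internals for compaction timing.

Pitfall: short retention combined with slow compaction can coincide with WAL build-up and, on restart, OOM during WAL replay. The WAL is only truncated after head compaction, so monitor compaction health rather than relying on retention settings alone.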

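4.3 Mimir / Thanos Long-Term Retention

Mimir retention and compactor configuration (synthetic example). Retention is a per-tenant limit, and block_ranges sits directly under compactor:

limits:
  compactor_blocks_retention_period: 365d

compactor:
  block_ranges: [2h, 12h, 24h]

Mimir's compactor enforces retention but does not downsample, so blocks keep raw resolution for the whole retention period. Thanos does downsample and sets retention per resolution: --retention.resolution-raw=15d --retention.resolution-5m=90d --retention.resolution-1h=365d.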

4.4 Loki Retention

limits_config:
  retention_period: 744h  # 31d

compactor:
  retention_enabled: true
  delete_request_store: s3

To differentiate by tenant or stream, use per-tenant overrides (or table_manager on legacy versions).

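4.5 Tempo Retention

compactor:
  compaction:
    block_retention: 168h  # 7d

Tempo's retention is controlled by block_retention; there is no separate retention block to set. The query value of traces older than 7d drops sharply, so prefer sampling over a longer retention.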

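4.6 Hard Delete vs Archive

Deletion is irreversible. Compliance logs can go to S3 Glacier instead: restore latency is on the order of hours and cost drops by roughly an order of magnitude (fill in your own unit price).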

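5. Downsampling and Aggregation

5.1 Metrics Resolution Ladder

Raw 15s ──7d──► 5m aggregates ──30d──► 1h aggregates ──365d──► delete, or keep metadata in Glacier

The Thanos compactor produces multi-resolution blocks automatically. Mimir's compactor does not downsample, so on Mimir the coarser rungs of the ladder come from recording rules.

5.2 Impact on Queries

Exact p99 values older than 30d are no longer available at raw resolution; only 5m/1h aggregates remain. SLO reviews should use SLIs pre-aggregated by recording rules instead of depending on raw 15s data.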

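5.3 Log Pattern Aggregation

Drain/Spell-style algorithms store a template plus a count instead of every raw line, and troubleshooting then looks at the pattern distribution. This suits INFO floods outside of security auditing.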
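
The "template + count" idea can be illustrated without implementing Drain or Spell. The sketch below swaps in a much simpler technique, regex masking of variable tokens, which is only a stand-in for the real parse-tree algorithms; the mask list and function names are assumptions made for the example.

```python
import re
from collections import Counter
from typing import Iterable

# Variable tokens to mask, applied in order (UUIDs before plain numbers).
_MASKS = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable tokens with placeholders to obtain a log template."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line


def aggregate(lines: Iterable[str]) -> Counter:
    """Count occurrences per template instead of storing every raw line."""
    return Counter(to_template(line) for line in lines)


if __name__ == "__main__":
    logs = [
        "user 42 logged in in 3.5 ms",
        "user 7 logged in in 12 ms",
        "cache miss for key 9f1c",
    ]
    for template, count in aggregate(logs).most_common():
        print(count, template)
```

Storing one row per template with a count is what makes the approach cheap; the price is that individual raw values (which user, which latency) are no longer recoverable.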

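5.4 Cross-Reference with the Data Model

Data Model: Loki's chunk structure determines when S3 objects are actually released by compaction after retention marks them for deletion, so the space may be freed with a delay.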

6. Hot/Cold Tiering and Object Storage

flowchart TB
  subgraph hot ["Hot tier: SSD/NVMe, 1-3d"]
    PH[Prometheus Head]
    LI[Loki Index]
    TI[Tempo Ingester]
  end
  subgraph warm ["Warm tier: HDD / S3 Standard, 3-30d"]
    PM[Mimir/Thanos Blocks]
    LC[Loki Chunks]
    TT[Tempo Blocks]
  end
  subgraph cold ["Cold tier: Glacier, 30d+"]
    AR[Log Archive]
    MR[Long-term metrics blocks]
  end
  PH -->|remote write| PM
  LI --> LC
  LC --> AR
  PM --> MR
  TI --> TT

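6.1 Loki boltdb-shipper + S3

Index: boltdb on local disk/SSD (newer deployments use the TSDB index, as in Appendix F)
Chunks: S3 Standard
Old chunks: S3 IA / Glacier via a lifecycle policy. Objects in Glacier must be restored before they can be read, so only move chunks there once they are past the queryable window.

6.2 Tempo

Everything lives in object storage by default. "Hot" means the querier cache, not a full copy on SSD.

6.3 Prometheus

The local TSDB is the hot tier only; Thanos sidecar → S3 provides the warm/cold tiers.

6.4 Automated Migration

S3 Lifecycle: transition to STANDARD_IA at 30d and to GLACIER at 90d. No manual data movement is needed.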

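7. Compression and Storage Amplification

7.1 Metrics: Gorilla + ZSTD

See TSDB Internals: 10:1–20:1 is typical for Gorilla-style encoding. The further 2:1–3:1 attributed here to ZSTD on Mimir blocks should be verified against your Mimir version before it goes into a worksheet.

7.2 Logs: ZSTD Blocks

Loki chunks compressed with ZSTD come out at roughly 5:1–10:1 smaller than an ES inverted index (under the same-workload assumption; see the discussion in 05-data-model).

7.3 Traces: ProtoBuf + ZSTD

One motivation for migrating from Jaeger to Tempo is removing the ES indexing cost.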

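8. Cost Modeling and Forecasting

8.1 Linear Growth Model

\[Data/month \propto QPS \times (log\_lines + spans \times sample\_rate + series)\]

If QPS doubles while the observability configuration stays the same, the bill rises by about 2×. If series grow 10× because of a label leak, the Metrics bill rises 10× even though QPS only rose 20%. The 80/20 heavy producers are often a label problem.

8.2 When to Consider Self-Hosted vs SaaS

See Self-Hosted vs Managed. Rough rule: above 10 TB/day of log ingest or 50M series, the marginal cost of SaaS climbs steeply. Work it out with the Scenario A/B worksheet.

8.3 Rough Prometheus Capacity Estimate

\[Disk \approx N_{series} \times retention\_sec / scrape \times bytes\_per\_sample\]

Use 1–2 B for $bytes\_per\_sample$ (an empirical value after compression). This is an order-of-magnitude estimate only.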
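
As a quick sanity check of the formula, the sketch below evaluates it for a hypothetical single Prometheus with 1M active series, 15d retention, and a 15s scrape interval. The 1.5 B/sample default is simply the midpoint of the 1–2 B range quoted above.

```python
def prometheus_disk_bytes(
    n_series: int,
    retention_days: float,
    scrape_interval_s: float,
    bytes_per_sample: float = 1.5,
) -> float:
    """Order-of-magnitude disk estimate: series x samples per series x bytes per sample."""
    samples_per_series = retention_days * 86_400 / scrape_interval_s
    return n_series * samples_per_series * bytes_per_sample


if __name__ == "__main__":
    size = prometheus_disk_bytes(1_000_000, retention_days=15, scrape_interval_s=15)
    print(f"{size / 2**30:.0f} GiB")
```

The estimate covers persisted blocks only; head chunks and the WAL need additional headroom on the same disk.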

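9. Engineering Pitfalls

9.1 Large Loki Queries Scanning Object Storage

An oversized max_entries_limit_per_query combined with wide label queries → S3 GET cost and latency explode. Note that max_entries_limit_per_query caps the lines returned; the amount scanned is bounded by limits such as max_query_length.

9.2 Prometheus Retention vs Compaction Race

Deleting blocks faster than compaction keeps up can lead to query gaps and WAL pressure.

9.3 Misconfigured Sampling Rate

sampling_percentage: 0 → zero traces. Verify the ingest rate after every change.

9.4 Retention Squeezed to 3d

By the time an incident is investigated at T+5 the logs are already gone. This is the retention-floor principle.

9.5 Deleting SLO Metrics to Save Money

Apply short retention to DEBUG logs, not to the raw metrics that SLI recording rules depend on.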

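10. Cost-Reduction Roadmap

Measure: GB/day per pillar, series count, top 20 services (§1.3)
Mark hot vs cold: review the last 30d of query logs; streams with no queries get shorter retention first
Sample: INFO logs at 10%, traces at a 1% baseline plus errors at the tail (§3)
Retention: non-core 30d→14d (§4)
Downsample: 1h long-term blocks on Thanos; recording rules on Mimir (§5)
Hot/cold tiering: S3 lifecycle (§6)
Quarterly review: data growth rate vs business growth rate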

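11. Joint Governance with Alerting and SLOs

Data type	SLO requirement	Cost strategy
SLI raw metrics	Available for 30d+	Separate retention policy, never deleted together with other data
Burn Rate recording	90d	Thanos 5m blocks, or the recorded series kept on Mimir
Debug logs	7d	Aggressive sampling
Trace	7d	Tail sampling

Alert ticket history can rely on Grafana annotations and does not need 365d of full logs.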

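12. Key Concepts Recap

Lever order: sampling > retention > compression > tiering > media
Worksheet: fill in your own unit prices for the Scenario assumptions and compute relative ROI
80/20: separate policies for the top services
Floor: retention longer than the incident window
SLO data: never the first cut when reducing cost

13. Next Steps

Once cost is under control, multi-tenant isolation and cost allocation are the necessary next step toward a platform. The next article is Multi-Tenancy and Security.

Previous: Alerting

Next: Multi-Tenancy and Security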

References

Grafana Mimir, Compactor, https://grafana.com/docs/mimir/latest/operators-guide/architecture/components/compactor/
Grafana Loki, Storage, https://grafana.com/docs/loki/latest/operations/storage/
Grafana Tempo, Retention, https://grafana.com/docs/tempo/latest/operations/retention/
Thanos, Compaction, https://thanos.io/tip/components/compact.md/
OpenTelemetry, Sampling, https://opentelemetry.io/docs/concepts/sampling/

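Appendix A: Blank Worksheet Template

Assumption	Value in your environment
QPS
Series count
Log GB/day
Trace GB/day
$/GiB-month (logs)
$/M samples-month
Total monthly cost

Appendix B: Retention Change Checklist

Notify on-call that the query window is changing
Back up old compacted blocks (if needed)
Roll out in stages: staging → 10% prod → full
Monitor the ingest drop and the query error rate

Appendix C: Governing High-Cardinality Services

See Instrumentation Philosophy: user_id and trace_id are forbidden as metric labels.

Appendix D: Per-Service Retention Policy Template

Service tier	Logs	Traces	Metrics raw	Approver
Tier0 payments	30d	7d, errors 100%	30d	SRE Lead
Tier1 core	14d	7d, 1% baseline + tail	15d	Team TL
Tier2 internal	7d	3d, 1% baseline	7d	Team
Tier3 batch	3d	1d, errors only	7d	Team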

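Appendix E: Mimir Tiered Retention Configuration in Detail

# mimir.yaml fragment (synthetic example)
limits:
  compactor_blocks_retention_period: 365d

compactor:
  block_ranges: [2h, 12h, 24h]
  sharding_ring:
    kvstore:
      store: memberlist

blocks_storage:
  backend: s3
  s3:
    bucket_name: mimir-blocks
    endpoint: s3.example.com

  tsdb:
    block_ranges_period: [2h]
    # How long ingesters keep uploaded blocks on local disk
    retention_period: 24h

Resolution ladder (a Thanos concept; Mimir stores raw resolution only, so the coarser rungs map to recording rules there):

Block type	Typical retention	Query use
raw 15s	7–15d	Troubleshooting, exact SLO windows
5m downsampled	30–90d	Trends, capacity
1h downsampled	365d	Annual review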

Appendix F: Loki Retention and Compactor

# loki.yaml (synthetic example)
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    # shared_store is no longer accepted with Loki 3.x / schema v13; the object store comes from schema_config
  aws:
    s3: s3://loki-chunks

limits_config:
  retention_period: 744h
  max_query_series: 500
  max_entries_limit_per_query: 5000

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: s3
  retention_delete_delay: 2h

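Before/After assumed model (not measured):

State	Assumed ingest	Assumed stored volume
Before: full INFO, 31d	6 TiB/day	On the order of 180 TiB
After: INFO 10% + 14d for non-core	0.6 TiB/day effective	About ×0.07 relative to Before (blend of core 31d and non-core 14d)

The multipliers are only inputs for the Scenario worksheet. Validate them against loki_distributor_bytes_received_total in your environment.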

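Appendix G: Tempo Retention and Compactor

# tempo.yaml (synthetic example)
storage:
  trace:
    backend: s3
    blocklist_poll: 5m
    s3:
      bucket: tempo-traces
      endpoint: s3.example.com

compactor:
  compaction:
    block_retention: 168h

100% sampling vs 1% baseline + tail errors (assuming Scenario A QPS):

Strategy	Assumed span storage/day	Relative volume
100% head	~32 TiB/day	1.0×
1% baseline + errors 100%	~0.5–1.5 TiB/day	~0.02–0.05×

The relative figure here assumes a low error rate and therefore sits at or below the ×0.05–0.15 range of §2.4. See The Traces Stack for sampling configuration.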

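Appendix H: Prometheus + Thanos Hot/Cold Path

# Thanos sidecar and store gateway (conceptual configuration, two separate files)
# File 1: prometheus.yml
global:
  external_labels:
    cluster: prod-eu-1
    replica: a

# File 2: Thanos object store configuration
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.example.com

Data path: Prometheus local 15d (hot) → sidecar uploads to S3 (warm) → compactor downsamples (cold).

TSDB Internals explains block structure and compaction timing. Expiring data too aggressively can race compaction against deletion (main text §9.2).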

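Appendix I: Retention Change Before/After Workflow

Baseline for 7 days: record prometheus_tsdb_storage_blocks_bytes, Loki bytes ingested (loki_distributor_bytes_received_total), and the Tempo backend object count (metric names vary by version)
Staging pilot: shorten 30d→14d for a single namespace
Observe: query error rate, on-call feedback, whether SLO investigations are blocked
Rollout: in batches by tier (Appendix D)
Document: update the retention table in the internal wiki

Do not: cut payment-path logs from 30d to 7d without notifying on-call.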

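Appendix J: Query Cost (Object Storage GET)

Loki/Tempo query cost ∝ the number of chunks/blocks scanned, not just the GB stored.

Query type	Scan volume	Cost risk
`{service="x", level="ERROR"}` over 1h	Low	Low
`{service="x"}` over 7d with no filter	High	S3 GET explosion
Trace by trace_id	Single block	Low
Wide trace search `{ span.service.name="x" }`	High	Disable or rate-limit

Treat Loki's query limits (max_entries_limit_per_query together with the query-length limits) as cost guardrails, not as user-experience parameters.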

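Appendix K: Network and Cross-AZ Cost

Remote write, cross-AZ S3 replication, and Grafana Cloud egress can account for 10%–20% of the bill.

Assumed model:

\[Cost_{network} \approx Data_{replicated} \times Price_{egress}\]

To reduce cost: deploy queriers in the same AZ as the object store, and avoid dual remote write across regions.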

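Appendix L: People Cost (Often Overlooked)

Activity	Assumed hours/month	Notes
Loki index incidents	4–16 h	Related to label governance
Mimir compactor tuning	2–8 h
Retention policy review	2 h	Quarterly

TCO of self-hosted observability = storage + compute + SRE hours. See Self-Hosted vs Managed.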

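Appendix M: Case Narrative: A Log Label Leak

Symptom: the monthly Loki bill is assumed to grow 5× within 3 months (confirm with the real ingest curve; this is a narrative framework).

Investigation:

topk(10, sum by (service) (rate(loki_distributor_bytes_received_total[1d])))
Finding: debug-service sets request_id as a label
Every request creates a new stream → the chunk count explodes

Fix (per Instrumentation Philosophy): put request_id in the log line, not in a label.

Assumed effect: that service's ingest drops to roughly 1/10 of the original (must be verified locally).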

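Appendix N: Metrics Cardinality and 80/20 Governance

# Series count per job
topk(20, count by (job) ({__name__=~".+"}))

# Series count per metric name (the heaviest metrics are where to look for high-churn labels)
topk(20, count by (__name__) ({__name__=~".+"}))

For the top 5 jobs:

Review whether histogram buckets are too fine-grained
Check whether pod/instance-level series can be dropped once an aggregating recording rule covers the need
Compare with VictoriaMetrics TSID in Data Model §1.3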

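Appendix O: Worked Worksheet Example (Assumed Unit Prices)

Explicit assumptions (fictional unit prices, only to demonstrate the arithmetic):

$C_l = 0.02\ \$/GiB-month$
$C_t = 0.015\ \$/GiB-month$
Scenario A log ingest = 6 TiB/day, 30d retention

\[Cost_{logs} \approx 6 \times 1024 \times 30 \times 0.02 \approx 3686\ \$/month\]

Traces at 100% follow the same formula and come out larger. After governance (INFO 10% + 14d): about ×0.07 → ~258 $/month at the same assumed unit price.

Always substitute your contract price. This article does not quote vendor prices.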
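
The same arithmetic as a reusable function, so that the fictional price can be swapped for a contract price in one place. This is a sketch of the steady-state formula used above (daily ingest × retention × unit price); the function name is made up, and the ×0.07 factor is the assumed blend from Appendix F.

```python
GIB_PER_TIB = 1024


def monthly_storage_cost(
    ingest_tib_per_day: float, retention_days: int, price_per_gib_month: float
) -> float:
    """Steady-state monthly cost: stored GiB x unit price."""
    stored_gib = ingest_tib_per_day * GIB_PER_TIB * retention_days
    return stored_gib * price_per_gib_month


if __name__ == "__main__":
    # Fictional price, for arithmetic only.
    before = monthly_storage_cost(6, 30, price_per_gib_month=0.02)
    after = before * 0.07  # assumed governance multiplier
    print(f"before: {before:.0f} $/month, after: {after:.0f} $/month")
```

The function ignores request, egress, and compute charges, which Appendices J and K treat separately.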

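Appendix P: Compression Ratio Reference (Graded by Source)

Data	Compression mechanism	Typical ratio	Source
Metrics	Gorilla	10:1–20:1	Facebook, VLDB 2015 (A)
Metrics block	ZSTD	+2:1–3:1	Mimir documentation (A)
Logs	ZSTD chunk	5:1–10:1 vs ES	05-data-model discussion (B)
Traces	Protobuf + ZSTD	Varies with span size	Tempo documentation (A)

Compression is the third lever, after sampling and retention.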

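Appendix Q: Intersection with SLO / Alerting Data Retention

Data	Minimum retention	Reason
SLI raw metrics	≥ SLO window + 7d	18-slo recording
Burn Rate recording	90d	Monthly review
Page history	90d	19-alerting attribution
DEBUG logs	3–7d	Can be deleted aggressively

Alertmanager itself does not store alerts long-term. Grafana annotations or an Alertmanager export into Loki are optional ways to keep the history.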

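Appendix R: Cost-Reduction Decision Tree

flowchart TD
  START[Bill over budget] --> Q1{Top 20 services<br/>produce 80% of volume?}
  Q1 -->|Yes| SAM[Sampling/retention for the heavy producers]
  Q1 -->|No| Q2{Traces at 100%?}
  Q2 -->|Yes| TS[Enable tail sampling]
  Q2 -->|No| Q3{Full-volume INFO logs?}
  Q3 -->|Yes| LS[Sample INFO at 10%]
  Q3 -->|No| Q4{Metrics series leak?}
  Q4 -->|Yes| CAR[Cardinality governance]
  Q4 -->|No| RET[Shorten non-Tier0 retention]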

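Appendix S: FAQ

S.1 Reduce retention first, or sample first?

Sample first. Shortening retention only reduces stored bytes; it does not reduce ingest cost (network, Collector and ingester CPU).

S.2 Can SLO data be downsampled?

Burn Rate recordings can live at 5m resolution; raw SLI counters stay at raw resolution for the length of the SLO window.

S.3 Is object storage always cheap?

Storage is cheap but GET requests are not. Wide queries can still cost more than hot SSD storage.

S.4 How do you prove cost reduction did not hurt troubleshooting?

Track MTTR and query success rate for 30d each, and review them together with the bill.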

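Appendix T: Verification Status in This Environment

Item	Status
Loki/Tempo/Mimir	Not deployed on this machine
Bill figures	Assumed model, not measured
Before/After	Arithmetic derivation, labelled in Appendices F/G/O

Recommendation: on minikube or staging, use loki-canary or a synthetic load to verify ingest metrics before and after a retention change.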

Appendix U: Configuration Index by Component

Component	Key parameter	See
Prometheus	`--storage.tsdb.retention.time`	§4.2
Mimir	`compactor_blocks_retention_period`	§4.3, Appendix E
Loki	`retention_period`, compactor	§4.4, Appendix F
Tempo	`block_retention`	§4.5, Appendix G
OTel Collector	`tail_sampling`	§3.3

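Appendix V: Cost Notes on Profiles and Events

Profiles: continuous profiling samples at a low frequency by default and is usually < 5% of the bill, so govern Logs and Traces first.

Events: often merged into Logs or an external CMDB. They can be ignored when they are not counted among the four pillars, but deployment events are essential for SLO correlation, so do not stop collecting Events to save storage.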

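Appendix W: Metrics Stack Cost Comparison

Requirement	Single Prometheus	Thanos	Mimir	VictoriaMetrics
Long-term retention	Not recommended	S3 + compact	Native	Cluster on local disk (object storage for backups)
Downsampling	Recording rules	Compactor	None built in (recording rules)	Built-in downsampling (check edition and version)
Multi-tenancy	None	Limited	Native	Native
Cost lever	Reduce series	Object storage	Tenant limits + block compaction	Higher series density

See the Prometheus article for selection details. From a cost perspective, hot data on SSD plus cold blocks on S3 is the standard tiering path for Metrics.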

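Appendix X: Elasticsearch vs Loki Cost Model (Assumed)

Factor	Elasticsearch	Loki
Index	Inverted index on all fields	Labels only
Storage amplification	1.5–3× raw	~0.1–0.2× raw (ZSTD chunks)
Query	Fast on any field	Narrow by label + scan the body
Best for	Ad-hoc full-text search	Known K8s labels

The motivation to migrate is usually index RAM + storage rather than query capability. See Data Model §2.

Assumption: for the same workload of 6 TiB/day raw logs, ES cluster disk grows on the order of 9–18 TiB/day and Loki chunks on the order of 0.6–1.2 TiB/day. The multiplier depends on compression and label design and must be validated with a POC.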

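Appendix Y: VictoriaMetrics vs Prometheus Storage Density

VictoriaMetrics' TSID approach reduces the overhead of indexing label text. For the same series, disk usage is roughly 1/3 to 1/2 of Prometheus (a range from community benchmarks that depends on version and workload; treat it as a grade-B lead, not something measured for this article).

Cost implication: during a cardinality leak VM gives you a longer buffer, but it does not replace label governance. You still pay for the storage in the end.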

Appendix Z: Example S3 Lifecycle Policy

{
  "Rules": [
    {
      "ID": "loki-chunks-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": "loki/chunks/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 365}
    },
    {
      "ID": "tempo-traces-short",
      "Status": "Enabled",
      "Filter": {"Prefix": "tempo/"},
      "Expiration": {"Days": 7}
    }
  ]
}

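A lifecycle rule applies to objects already in the bucket as well as to new ones, so enabling it can transition or expire a large backlog at once. Check the per-request transition charges and the minimum storage duration of the target class before turning it on.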

Appendix AA: Recording Rules Cost-Reduction Example

Aggregate http_requests_total from 5000 pods down to 200 services:

groups:
  - name: cost_reduction
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service, status) (rate(http_requests_total[5m]))
      - record: service:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

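Assumption: pod-level series 5000 × 20 = 100k; service-level 200 × 20 = 4k, a 25× reduction in series. The raw pod-level series can then get a shorter retention or stop being scraped (after confirming nobody needs pod-level troubleshooting).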
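
The reduction factor follows from how series counts multiply across labels. The sketch below makes that explicit; it assumes labels are independent (the worst case), and the label names and the "20 other combinations" figure are just the assumption stated above.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


pod_level = series_count({"pod": 5000, "other_label_combinations": 20})
service_level = series_count({"service": 200, "other_label_combinations": 20})

if __name__ == "__main__":
    print(pod_level, service_level, pod_level / service_level)
```

The same function shows why a single unbounded label is so expensive: adding a label with N distinct values multiplies the bound by N.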

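Appendix AB: OTel Collector Pipeline Cost

Collector CPU ∝ the number of spans/log records. Sampling in the Collector saves more ingest and network than deleting data at the backend:

App → Collector [sample/filter] → Tempo/Loki
         ↑ cost lever point

See OpenTelemetry pipeline and The Traces Stack.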

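Appendix AC: Compactor Failures and Storage Leaks

If the Loki/Mimir compactor keeps failing, retention deletes are never executed and S3 objects accumulate. Monitor:

# Loki compactor failures (metric names vary by version; verify at deployment time)
rate(loki_compactor_apply_retention_failed_total[1h]) > 0

Cost impact: the retention policy says 30d on paper, but objects never actually expire and the bill grows linearly.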

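Appendix AD: Query Logs and "Who Is Using Cold Data"

Enable query audit logging on the Loki/Tempo query frontend. Tenants/streams with no queries for 30d are marked as cold candidates and get shorter retention first.

Signal	Action
Zero queries in 30d + Tier2	Pilot retention 7d→3d
Queried daily + Tier0	Keep 30d
Frequent ad-hoc wide queries	Educate + rate-limit, do not extend retention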
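
The signal table can be applied mechanically once query counts per stream are available. The sketch below assumes you have already extracted, from the query logs, a per-stream count of queries over the last 30 days; the `StreamUsage` record and the action strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamUsage:
    name: str
    tier: int              # 0 = most critical
    queries_last_30d: int


def retention_action(usage: StreamUsage) -> str:
    """Map the signal table to an action; Tier0 is never shortened automatically."""
    if usage.tier == 0:
        return "keep"
    if usage.queries_last_30d == 0 and usage.tier >= 2:
        return "pilot-shorter-retention"
    return "keep"


if __name__ == "__main__":
    for s in [
        StreamUsage("payments", tier=0, queries_last_30d=0),
        StreamUsage("internal-tool", tier=2, queries_last_30d=0),
        StreamUsage("search", tier=1, queries_last_30d=120),
    ]:
        print(s.name, retention_action(s))
```

Zero queries on a Tier0 stream is deliberately ignored: a payment path that was not queried last month may simply not have had an incident.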

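Appendix AE: Storage Links to 19-alerting

Alert ticket history does not need 365d of full logs; Grafana annotations plus 90d of metrics are enough. When cutting cost, do not delete the metrics that SLI/Burn Rate depend on (18-slo).

The meta-alert AlertmanagerNotificationFailing itself adds a negligible number of series, but the Alertmanager metrics it depends on should be kept for 30d.

Appendix AF: Quarterly Cost Review Agenda

GB/day trend per pillar vs QPS trend
Change in ingest share of the top 10 services
Whether new-service onboarding includes a label review
List of exceptions to the retention/sampling policy
Update the assumption parameters in the Scenario worksheet
One cost-reduction experiment for next quarter (single lever)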

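Appendix AG: Validating Retention with Synthetic Load (Suggested Steps)

Not executed in this environment; intended for staging:

Deploy loki-canary or an equivalent log generator
Record a 7d bytes_received baseline
Change retention_period 15d→7d
Wait for 2 compaction cycles
Compare the total size of objects in the S3 bucket (AWS ListObjectsV2 or equivalent)
Confirm that queries still cover the 7d window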

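Appendix AH: Index of Terms and Formulas

Symbol	Meaning
$N_{series}$	Number of active time series
$V_{logs}$	Daily log volume
$C_l, C_t, C_m$	Unit prices you fill in
Scenario A/B	Assumptions before/after governance

The worksheet in §2 is the numerical starting point for every cost discussion in this article. Replace its values with real meter readings from your environment.

Appendix AI: Cost-Reduction Experiment Log Template

Date	Lever	Scope	Ingest change	Query impact	Rolled back?
	sampling	service X
	retention	Tier2 logs

Note: experiment rows must be filled with real meter data. The blank table is only a process template.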

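Appendix AJ: Kafka Pipeline Cost Boundary

Logs/Traces often pass through Kafka before reaching object storage. Kafka's 24–72h retention is decoupled from Loki's 30d retention, and Kafka disk counts toward the buffering layer. When cutting cost, do not shrink Kafka retention to the point of losing the Collector's burst buffer.

Appendix AK: Histogram vs Summary Storage

A Histogram produces one series per bucket; Summary quantiles cannot be aggregated. New services should standardize on Histogram, with platform-side recording rules aggregating the SLIs. See Metrics System.

Appendix AL: Dual Remote Write

Full remote write to a DR site roughly doubles ingest and storage. Unless compliance requires it, prefer query federation over dual-writing every sample.

Appendix AM: Compliance Retention

When compliance requires 7y of logs: 30d hot and queryable + cold WORM archive. Budget compliance streams separately and keep them out of the debug-log sampling discussion.

Appendix AN: Loki Index Cost

A high stream count drives up both the TSDB index cost and the chunk cost. Label governance is a prerequisite for reducing log cost. See Instrumentation Philosophy.

Appendix AO: Tempo Span Metrics

Generating RED metrics from spans increases metrics volume and reduces the need for wide trace queries. Aggressive trace sampling combined with span metrics is a common compromise. See The Traces Stack.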

Appendix AP: Series Budget Policy

Tier	Series budget	When exceeded
Tier0	50k	Exception approval
Tier1	10k	Review
Tier2	2k	Reject the label
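
A budget table like this is easy to enforce in CI or in a periodic job. The following is a sketch under the assumption that you can obtain the active series count per service (for example from the per-job query in §1.3); the dictionary and function names are illustrative only.

```python
# Budgets and over-budget actions copied from the table above.
SERIES_BUDGET = {"Tier0": 50_000, "Tier1": 10_000, "Tier2": 2_000}
OVER_BUDGET_ACTION = {
    "Tier0": "exception approval",
    "Tier1": "review",
    "Tier2": "reject label",
}


def check_budget(tier: str, active_series: int) -> str:
    """Return 'ok' when within budget, otherwise the action the policy prescribes."""
    if tier not in SERIES_BUDGET:
        raise ValueError(f"unknown tier: {tier}")
    if active_series <= SERIES_BUDGET[tier]:
        return "ok"
    return OVER_BUDGET_ACTION[tier]


if __name__ == "__main__":
    print(check_budget("Tier1", 8_500))
    print(check_budget("Tier2", 2_400))
```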

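Appendix AQ: Prometheus WAL Peak

When compaction lags, the WAL can occupy tens of GB. Disk planning must cover head + WAL + blocks. See TSDB Internals.

Appendix AR: Cross-AZ Replication

Replicating object storage across 3 AZs can make the storage fee approximately ×3 (varies by cloud). Treat HA versus cost as an explicit trade-off.

Appendix AS: Profile Volume Estimate

\[V_{profile/day} \approx N_{pods} \times 1440 \times size/profile\]

Profiles are usually the last thing to optimize. See Profiling.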
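
For a feel of the magnitude, the formula can be evaluated with the Scenario A pod count and one profile per minute. The 50 KiB profile size below is purely hypothetical, since the article does not state one; measure your own before using the result.

```python
MINUTES_PER_DAY = 1440


def profile_bytes_per_day(
    n_pods: int, profiles_per_minute: float, bytes_per_profile: int
) -> float:
    """Daily profile volume: pods x profiles per day x size per profile."""
    return n_pods * profiles_per_minute * MINUTES_PER_DAY * bytes_per_profile


if __name__ == "__main__":
    # 5000 pods and 1 profile/min come from Scenario A; 50 KiB is an assumed size.
    volume = profile_bytes_per_day(5000, 1, 50 * 1024)
    print(f"{volume / 2**30:.0f} GiB/day")
```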

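Appendix AT: Cost OKR Template

OKR	Measurement
$/req −20% YoY	Bill / QPS
Tier2 logs at 14d	Loki policy
Trace 1% + tail	Tempo ingest

The numbers are filled in by each organization. This article does not provide benchmarks.

Appendix AU: Link to 25-self-hosted-vs-saas

The total TCO comparison between self-hosted and managed uses the Scenario worksheet in §2 of this article as its per-pillar input.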

Appendix AV: Quick Reference

Cost levers: sampling > retention > compression > tiering > media
Never delete: SLI raw, Burn Rate recording, Tier0 ERROR
80/20: topk(20, ingest by service)
Guardrail: max_entries_limit_per_query

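Appendix AW: Evidence Grading

Claim	Grade
Gorilla compression ratio	A (VLDB 2015)
Lever ordering	B (engineering consensus)
Scenario arithmetic	Derived by method
Cloud unit prices	Filled in by the reader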

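Appendix AX: Mimir Block Upload Delay

After remote write into Mimir, blocks are uploaded to S3 with a delay. Hot queries are served by ingesters and cold queries by the store-gateway. Retention deletes blocks after compaction, which differs from the behaviour of a local Prometheus TSDB.

Appendix AY: VictoriaMetrics Cluster Cost

A single VM cluster node can carry a higher series density, which suits the transition period after cardinality has already leaked. Label governance is still required in the long run; VM is not a substitute for sampling.

Appendix AZ: Log JSON Field Bloat

A 10 KB stack trace embedded in message lowers the chunk compression ratio. Truncate stack traces on the application side, and monitor line_too_long discards on the platform side.

Appendix BA: Cost of Indexing Trace Attributes

Tempo does not index arbitrary span attributes, so wide queries scan blocks. When Jaeger indexes attributes in ES, storage rises by an order of magnitude, which is one motivation for moving to Tempo (see 05-data-model).

Appendix BB: Grafana Variables and Query Cost

High-cardinality dashboard variables (such as $pod with 5000 values) trigger wide-ranging PromQL. Teach users to use $service. This has nothing to do with storage but is directly tied to query cost.

Appendix BC: Thanos Store Gateway Cache

The store gateway index cache reduces S3 GETs, trading EC2 memory for S3 API fees. Large deployments should use the bucket index / sparse index (check the official documentation for availability in your Thanos version).

Appendix BD: OpenTelemetry Logs Bridge

OTel logs reach Loki through the Collector. At the Collector layer, batch + memory_limiter prevent OOM; a larger batch slightly reduces the number of S3 PUTs, trading latency for cost.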

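Appendix BE: Annual Capacity Planning Calendar

Quarter	Action
Q1	Update the Scenario worksheet
Q2	Label audit of the top 20 services
Q3	Retention pilot
Q4	Budget forecast for next year

Appendix BF: Storage Intersection with 19-alerting

Ticket history does not need 365d of full logs; 90d of metrics + annotations are enough. When cutting cost, do not delete Alertmanager's self-monitoring metrics (alertmanager_notifications_*).

Appendix BG: Object Storage Request Fees

During wide Loki queries, S3 LIST/GET fees can exceed the storage fee. max_query_parallelism and query education matter as much as retention.

Appendix BH: Synthetic Load Disclaimer

The Scenario A/B figures and the multipliers in Appendices F/G are all an assumed model. Any before/after conclusion must be measured in staging before it goes into the internal wiki, and assumed numbers must never be written into an external SLA.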

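Appendix BI: Shard Planning for a Single Prometheus Cluster

A single Prometheus should stay below 1–2M active series (depending on memory). Beyond that, shard horizontally by team/region instead of simply extending retention. Each shard has its own retention and recording rules, which avoids global queries scanning cold data on every shard.

Signal	Action
`prometheus_tsdb_head_series` persistently > 1.5M	Plan to shard
Compactor falling behind	Fix compaction first, then shard
Query timeouts	Reduce cardinality or federate recordings

Appendix BJ: Loki Per-Tenant Rate Limits

# limits_config fragment
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
per_stream_rate_limit: 3MB
per_stream_rate_limit_burst: 15MB

Ingest above the limit is rejected or throttled, which protects the platform from a single tenant's log storm blowing up the bill. The link to cost is direct: unbounded ingest → unbounded S3.

Appendix BK: Tempo Ingester and Block Size

The Tempo ingester batches traces into blocks and uploads them to S3. Blocks that are too large force queries to download the whole block; blocks that are too small mean many PUTs. The compactor merges blocks, so monitor compactor lag just as for Loki/Mimir.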

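Appendix BL: Linking Metrics and Logs via trace_id

A trace_id field in logs (not a label) lets you jump to Tempo, which reduces the number of days of logs needed for troubleshooting. Investing in trace_id injection is cheaper than extending retention for all logs. See Data Model §2.1.

Appendix BM: Recording Rules and Long-Term SLO Storage

The slo:*:burnrate* recordings from 18-slo should live in a long-retention Mimir tenant, separate from the short-retention shard for debug metrics. Deleting debug metrics then never touches SLIs.

# Concept: remote_write routes to different tenants by label
# debug metrics → tenant-short (7d)
# slo recording → tenant-slo (400d)

Appendix BN: Cloud Egress Assumption

Add a $C_{egress}$ row (self-filled) to the Scenario worksheet: copying Grafana dashboards across clouds costs nothing, but remote read across regions may be charged. Thanos queries that pull blocks across regions count as egress.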

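Appendix BO: Cost-Reduction Anti-Patterns

Anti-pattern	Consequence
3d retention on every pillar	Incidents cannot be reviewed
Trace sampling rate set to zero	Latency cannot be localized
Deleting SLO recordings	Burn Rate alerts stop working
Wide Loki queries used as monitoring	S3 request fees explode
user_id as a metric label	Series count grows without bound
No compaction monitoring	Retention cannot delete objects

Appendix BP: Mimir Multi-Tenant Cost Allocation

Mimir isolates tenants via X-Scope-OrgID. Billing can be based on per-tenant ingest, using the per-tenant series and received-samples/bytes metrics (exact metric names vary by version). The platform team can show back observability cost to each business unit and drive 80/20 governance.

Appendix BQ: ZSTD Dictionaries and Loki Versions

Newer Loki versions improve chunk compression, and an upgrade may reduce storage by 10%–20% (requires a POC, not guaranteed). Before upgrading, back up the index and compare bytes_received against chunk size in staging.

Appendix BR: Document Revisions

v1.1 (2026-06-18): expanded the Scenario worksheet, retention configuration for the four components, hot/cold tiering Mermaid/SVG, 80/20 and cross-references; all cost figures are an assumed model.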

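Appendix BS: ClickHouse and Logs (Boundary)

Some teams store observability logs in ClickHouse. Columnar compression is excellent, but the cost of full table scans differs from Loki's label-first model. This article focuses on the mainstream Loki/ES stacks; a CH cost model needs its own worksheet rows (storage + merge + query CPU).

Appendix BT: Legacy InfluxDB / M3 Stacks

Time-series stacks such as M3DB and Influx still have existing deployments. During a migration to Prometheus/Mimir, the dual-write period temporarily doubles cost. List it as a separate migration line item in the budget; the period is typically 1–3 months.

Appendix BU: Grafana Dashboard JSON Size

Dashboard JSON does not count toward the observability data bill, but high-cardinality variables trigger expensive queries (Appendix BB). Governing dashboards is as important as governing ingest.

Appendix BV: Synthetic Monitoring Storage

Blackbox exporter metrics have few series and high value, so do not remove synthetic probes to save cost. Probe failures should go through the alert Page path (alongside Burn Rate; see 19, Appendix AE).

Appendix BW: Container Logs via stdout

Sending all K8s container stdout to Loki is one of the main causes of log volume. Filtering DEBUG in a sidecar and trimming fields in structured application JSON cut cost at the node, which beats deleting on the S3 side.

# Fluent Bit synthetic fragment: drop DEBUG
[FILTER]
    Name    grep
    Match   kube.*
    Exclude log DEBUG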

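Appendix BX: Parquet / Iceberg Archives (Trend)

Some vendors promote Parquet cold archives queried through Athena/Trino. This suits long-term compliance retention; hot troubleshooting still relies on 7–30d in Loki. Fill in the archive-tier unit price yourself.

Appendix BY: Roles in the Cost Review

Role	Contribution
SRE	Ingest trends, top services
Platform	Compaction/retention health
Business TL	Confirms tier classification
FinOps	Unit prices and showback

Without FinOps, SRE must fill in $ / GiB themselves. The Scenario worksheet must not be left blank.

Appendix BZ: Reading Order within the Index Series

Cost governance path: 04 Sampling → 05 Data Model → 07 TSDB → 18 SLO → 19 Alerting → 20 this article → 21 Multi-Tenancy → 25 Self-Hosted/SaaS. Read them in the batches of the governance layer in the observability index.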

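Appendix CA: Full Scenario B Calculation Steps (Assumed Unit Prices)

Step 1: fill in the Scenario A assumptions (§2.1) and your own $C_l, C_t, C_m$
Step 2: compute $V_{logs/day}$ and $V_{traces/day}$ (formulas in §2.2)
Step 3: obtain $Cost_A = V \times retention \times C$
Step 4: apply the sampling multiplier 0.05 to Traces and the INFO multiplier 0.35 to Logs (table in §2.4)
Step 5: obtain $Cost_B$; the difference $Cost_A - Cost_B$ is the theoretical upper bound on savings
Step 6: validate the multiplier assumptions against the measured ingest curve in staging
Step 7: write an internal RFC: leave Tier0 unchanged and start with Tier2

Label everything as "assumed" throughout. Until Step 6 is done, do not commit to a savings percentage externally.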

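Appendix CB: Aligning Retention with the Query Window

A default Grafana dashboard range of 30d on top of a 14d Loki retention confuses users. When reducing retention, change the default dashboard range and the runbook text at the same time; otherwise support tickets increase and the hidden people cost goes up.

Appendix CC: Retention Cheat Sheet for the Four Components

Component	Configuration key	Typical hot	Typical cold	See
Prometheus	`--storage.tsdb.retention.time`	15d	Thanos S3	§4.2
Mimir	`compactor_blocks_retention_period`	Recent data in ingesters	365d raw blocks in object storage	§4.3, Appendix E
Loki	`retention_period` + compactor	7–31d	S3 lifecycle	§4.4, Appendix F
Tempo	`block_retention`	3–7d	Usually no long-term tier	§4.5, Appendix G

After changing any of these keys, monitor ingest rate, querier errors, and bucket object count. Observe all three for at least 14d before declaring success.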

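Appendix CD: Success Metrics for Cost Governance

Metric	Healthy trend
$/1M requests	Flat or ↓ while QPS ↑
Cold stream ratio	↑ after the retention pilot
MTTR	Flat (cost reduction does not hurt troubleshooting)
SLO attainment	≥ the level before cost reduction

If MTTR ↑ while $ ↓, retention or sampling has gone too far: roll back the Tier0 policy.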
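
The rollback rule can be written down so that the review meeting applies it consistently. This is a sketch only: the 5% MTTR tolerance is a hypothetical threshold that each team would set for itself, and the inputs are percentage changes relative to the pre-change baseline.

```python
def evaluate_change(
    cost_delta_pct: float,
    mttr_delta_pct: float,
    slo_attainment_delta_pct: float,
    mttr_tolerance_pct: float = 5.0,
) -> str:
    """Classify a cost-reduction change from before/after deltas (in percent)."""
    # Troubleshooting or reliability got worse: savings do not justify it.
    if mttr_delta_pct > mttr_tolerance_pct or slo_attainment_delta_pct < 0:
        return "rollback"
    if cost_delta_pct < 0:
        return "keep"
    return "no-benefit"


if __name__ == "__main__":
    print(evaluate_change(cost_delta_pct=-30, mttr_delta_pct=1, slo_attainment_delta_pct=0))
    print(evaluate_change(cost_delta_pct=-30, mttr_delta_pct=20, slo_attainment_delta_pct=0))
```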

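Appendix CE: Worksheet Field Checklist

Before publishing an internal cost RFC, confirm:

Every assumption has an owner and a measurement source
Unit-price rows carry a contract number or "to be filled in by FinOps"
Scenario B multipliers come from staging or are marked "to be validated"
Tier0 retention has not been touched
Metrics that 18-slo / 19-alerting depend on are marked exempt
A rollback plan and an observation period (≥14d recommended) are written down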

