gpu tag archive | 土法炼钢兴趣小组 algorithm knowledge backup

【GPU Operator Engineering】Overview: Where Operator Engineering Sits in the AI Compute Stack

2026-06-26 | gpu · architecture | #cuda #gpu #kernel #cublas #cudnn #operator #ai-stack #ptx #sass

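From a single matmul line in a framework down to PTX/SASS, this post takes apart the layers of the AI compute stack: framework operators, operator libraries, hand-written kernels, and compiler-generated code. It answers when an engineer actually needs to write or tune a kernel, and describes the experimental environment and methodology used throughout this series.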

【GPU Operator Engineering】The GPU Execution Model: SMs, Warps, the Thread Hierarchy, and Occupancy

2026-06-27 | gpu · architecture | #cuda #gpu #sm #warp #simt #occupancy #thread-hierarchy #divergence

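Explains how grid/block/warp map onto SMs, what SIMT execution and the 32-thread warp really are, why branch divergence is expensive (measured at 1.7x), and what occupancy means. Builds the hardware intuition behind all GPU performance optimization.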

GPU High-Performance Operator Engineering

2026-06-26 | gpu · architecture | #cuda #gpu #kernel #tensor-core #cutlass #triton #flash-attention #gemm #nsight #roofline #hpc

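Starting from the GPU execution model and memory hierarchy, a systematic guide to writing and tuning high-performance CUDA operators: memory coalescing, occupancy, Roofline, and Nsight-based tuning; implementations of core operators such as reduction, GEMM, Tensor Core kernels, and FlashAttention; and Triton, CUTLASS, kernel fusion, and operator-library engineering.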

【GPU Operator Engineering】The Memory Hierarchy: Bandwidth and Latency of global / L2 / shared / register

2026-06-27 | gpu · architecture | #cuda #gpu #memory #shared-memory #l2-cache #bandwidth #latency #hbm

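Takes apart the GPU storage pyramid: the capacity, bandwidth, and latency magnitudes of registers, shared memory, L1/L2, and global memory. Uses measurements to show that L2 hits (about 3.4 TB/s) and DRAM (about 400 GB/s) differ by nearly an order of magnitude, and explains why data placement determines operator performance.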

【LLM Infrastructure Engineering】02: Introduction to GPU Computing: SM, Tensor Core, HBM, NVLink

2026-04-22 | architecture · ai-infra | #llm #infra #gpu #cuda #tensor-core #hopper #blackwell #hbm #flashattention #ascend

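Starting from the architectural differences between CPUs and GPUs, explains what SM, Warp, Tensor Core, HBM, and NVLink mean in engineering terms, and, drawing on Roofline, FlashAttention, and the domestic Chinese compute stack, gives LLM engineers a GPU mental model they can put to use right away.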

【Compilers and MLIR】Code Generation for Heterogeneous Hardware

2026-06-09 | compiler · architecture | #mlir #llvm #compiler #gpu #spir-v #cuda #tiling #memory-hierarchy #iree #triton

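An analysis of MLIR's GPU code generation framework: the GPU dialect's hierarchical parallelism model (Block/Thread/Memory), the semantics of gpu.launch, the SPIR-V export path, memory hierarchy abstractions and tiling strategies, and how it works together with Triton and IREE.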

【Compilers and MLIR】Compiler Infrastructure for the AI Era

2026-06-09 | compiler · architecture | #mlir #llvm #compiler #dialect #linalg #affine #gpu #tensor #iree #codegen #ai-compiler #heterogeneous-computing #tablegen #pass

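Starting from the limitations of the three-phase compiler, a systematic walkthrough of MLIR dialects, progressive lowering, and the Pass infrastructure, covering the full compilation chain from Tensor/Linalg/Affine/GPU to framework bridging.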

【Transformers and Attention Mechanisms】42 | FlashAttention: A Hardware-Level Rewrite of Attention Computation

2026-04-15 | transformer | #transformer #flashattention #attention #gpu #memory-io

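The key to FlashAttention is neither approximating attention nor changing the formula, but rearranging the path that standard attention computation takes through the GPU memory hierarchy. This article explains why the bottleneck of standard attention is often HBM reads and writes, how FlashAttention uses tiling and online softmax to avoid materializing the full attention matrix, and why it saves memory and raises throughput without eliminating the fundamental O(n²) complexity.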

Parallel Sorting: From Merging Networks to GPU Bitonic Sort

2025-07-15 | algorithms | #sorting #parallel #gpu #bitonic-sort #simd

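When single-core performance hits its limit, how can sorting exploit the parallelism of multi-core CPUs and GPUs? From the theoretical elegance of sorting networks to the engineering compromises of industrial-grade parallel sorting.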