Megatron-LM

Posted 2025-04-29. Updated 2025-08-08. Category: llm

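Megatron-LM is a framework for efficient large-model training, built on top of Megatron-Core.

Megatron-Core is a GPU-optimized library of training techniques, focused on improving the performance and efficiency of large-model training.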

Todo list:

Data processing:

What does Indexed_dataset do?
What is apex used for?

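Background

Memory constraints: As models grow, the memory capacity of modern processors becomes the bottleneck. A single GPU's memory typically cannot hold a model with billions of parameters, so additional memory-management techniques such as activation checkpointing are required.
Limitations of existing approaches: Existing model-parallel approaches such as GPipe and Mesh-TensorFlow require rewriting the model and rely on custom compilers and frameworks that are still under development.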
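To make the memory constraint concrete, here is a rough back-of-the-envelope estimate. It is only a sketch under assumptions that are not from the notes above: mixed-precision training with Adam (about 16 bytes of state per parameter), a 32 GB GPU, and activations ignored entirely. The 1.2B and 8.3B sizes are used as example model sizes.

```python
# Rough estimate of per-model training state (weights + gradients + optimizer).
# Assumptions (not from the original notes): mixed-precision Adam, 32 GB GPU,
# activation memory ignored.

BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

GPU_MEMORY_GB = 32  # assumed device capacity


def training_state_gb(num_params: float) -> float:
    """Return GB (1e9 bytes) needed for weights, gradients, and optimizer state."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


for name, n in [("1.2B", 1.2e9), ("8.3B", 8.3e9)]:
    need = training_state_gb(n)
    fits = need <= GPU_MEMORY_GB
    print(f"{name}: {need:.1f} GB of state, fits on one GPU: {fits}")
```

Under these assumptions the 1.2B model needs about 19 GB and fits on one device, while the 8.3B model needs about 133 GB and does not, even before counting activations. That gap is what model parallelism and activation checkpointing are meant to close.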

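By training Transformer models with up to 8.3 billion parameters on 512 GPUs, Megatron-LM sustains 15.1 PetaFLOPs of throughput, which corresponds to 76% scaling efficiency relative to a single-GPU baseline.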
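The 76% figure can be checked with a few lines of arithmetic. One number here is not in the notes above: the single-GPU baseline of 39 TeraFLOPs, which is taken from the paper's abstract. The function name is my own.

```python
# Scaling efficiency = achieved throughput / (num_gpus * single-GPU throughput).
# The 39 TFLOPs single-GPU baseline comes from the paper's abstract,
# not from the notes above.

def scaling_efficiency(total_flops: float, num_gpus: int, single_gpu_flops: float) -> float:
    """Fraction of ideal linear scaling that was actually achieved."""
    return total_flops / (num_gpus * single_gpu_flops)


eff = scaling_efficiency(total_flops=15.1e15, num_gpus=512, single_gpu_flops=39e12)
print(f"scaling efficiency: {eff:.1%}")
```

Perfect linear scaling would give 512 × 39 TFLOPs ≈ 20 PetaFLOPs, so 15.1 PetaFLOPs works out to roughly three quarters of ideal.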

Model Parallel Transformers

Other notes

We show that the existing BERT architecture results in model degradation as the size increases. We overcome this challenge by rearranging the layer normalization and residual connection in the transformer layers and show that with this change, results for the downstream tasks on development sets improve monotonically as the model size increases.
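As I understand the change, the original BERT block applies layer normalization after the residual addition, while the rearranged block normalizes the sublayer input and leaves the residual path untouched. The sketch below only illustrates that ordering in plain Python. `sublayer` is a hypothetical stand-in for attention or the MLP, and the layer norm has no learned scale or shift.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for attention or the MLP: a fixed elementwise map."""
    return [0.5 * v + 1.0 for v in x]


def post_ln_block(x):
    """Original BERT ordering: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x):
    """Rearranged ordering: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 4.0, 8.0]
print("post-LN:", post_ln_block(x))
print("pre-LN: ", pre_ln_block(x))
```

In the rearranged form the input `x` passes through the block unchanged on the residual path, so the output is always `x` plus a correction. In the original form every block renormalizes the whole sum.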

References

http://arxiv.org/abs/1909.08053