BluTrain：基于C++/CUDA的AI训练框架

精选理由

这个新框架用C++从头写，训练GPT-2比PyTorch快3%且省内存22%，适合追求极致性能的开发者。

AI 摘要

BluTrain是一个用标准C++和CUDA实现的AI训练框架。在8-GPU 6000 Ada系统上训练124M参数GPT-2模型（FP32），其吞吐量达407K tokens/s，比PyTorch的395K tokens/s高约3%。同时内存占用减少22%，且严格保持数值精度。框架包含原生实现的张量模块、反向模式自动微分、线性代数库、缓存分配器、分布式执行和MLIR编译器。

AI 翻译 · 中文

arXiv cs.AIProgress in deep learning is, at scale, more a matter of systems engineering than of modelling: the behaviour of a model in training (its throughput, its memory footprint, and the numerical fidelity of the result) is det…

阅读原文