With the release of PyTorch 2.0 and ROCm 5.4, we are excited to announce that LLM training works out of the box on AMD datacenter GPUs, with zero code changes, and at high performance (144 TFLOP/s/GPU)! We are thrilled to see promising alternative options for AI hardware, and look forward to evaluating future devices and larger clusters soon.
With PyTorch 2.0 and ROCm 5.4+, LLM training works out of the box on AMD MI250 with zero code changes when running our LLM Foundry training stack.
Some highlights:
- LLM training was stable. With our highly deterministic LLM Foundry training stack, training an MPT-1B model on AMD MI250 vs. NVIDIA A100 produced nearly identical loss curves when starting from the same checkpoint. We were even able to switch back and forth between AMD and NVIDIA within a single training run!
- Performance was competitive with our existing A100 systems. We profiled training throughput of MPT models from 1B to 13B parameters and found that the per-GPU throughput of the MI250 reached 80% of the A100-40GB and 73% of the A100-80GB. We expect this gap to close as AMD software improves.
- It all just works. No code changes were needed.
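The throughput comparisons above are typically reported as achieved model FLOP/s, estimated from tokens processed per second using the standard 6 * N * D approximation for decoder-only transformers (forward plus backward pass). A minimal sketch of that arithmetic follows; the function name and the 24,000 tokens/s/GPU figure are illustrative values chosen here only because they reproduce the headline 144 TFLOP/s/GPU for a 1B-parameter model, not measurements from the post.

```python
def achieved_tflops_per_gpu(n_params: float, tokens_per_sec_per_gpu: float) -> float:
    """Estimate achieved model FLOP/s per GPU for a decoder-only transformer.

    Uses the common approximation of ~6 FLOPs per parameter per token
    (2 for the forward pass, 4 for the backward pass).
    """
    return 6 * n_params * tokens_per_sec_per_gpu / 1e12  # convert FLOP/s -> TFLOP/s


# Hypothetical example: a 1B-parameter model processing 24,000 tokens/s/GPU
# yields 6 * 1e9 * 24_000 / 1e12 = 144 TFLOP/s/GPU.
print(achieved_tflops_per_gpu(1e9, 24_000))
```

Dividing this number by the accelerator's peak bf16 FLOP/s gives the model FLOPs utilization (MFU), which is what makes per-GPU comparisons across different hardware meaningful.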
Today’s results are all measured on a single node of 4x MI250 GPUs, but we are actively working with hyperscalers to validate larger AMD GPU clusters, and we look forward to sharing those results soon! Overall, our initial tests show that AMD has built an efficient, easy-to-use software and hardware stack that can compete head-to-head with NVIDIA’s.
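The "zero code changes" claim rests on the fact that the ROCm build of PyTorch exposes AMD GPUs through the same `torch.cuda` API used for NVIDIA devices, so an unmodified training loop runs on either vendor. A minimal sketch of such vendor-agnostic code, with a CPU fallback so it runs anywhere:

```python
import torch

# On a ROCm build of PyTorch, "cuda" transparently targets AMD GPUs via HIP,
# so this exact code runs unchanged on NVIDIA A100 and AMD MI250.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(16, 16).to(device)
x = torch.randn(4, 16, device=device)

loss = model(x).sum()
loss.backward()  # identical forward/backward path on both vendors
```

On a ROCm build, `torch.version.hip` is set instead of `torch.version.cuda`, but no user code needs to branch on that.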