
In this tutorial, we introduce the DeepSpeed Flops Profiler and provide examples of its usage.

Effective use of hardware resources is critical to good performance, but performance inefficiency in existing implementations for large-scale model training and inference is often hard to spot and attribute to specific module components. The DeepSpeed Flops Profiler helps users easily measure both the model training/inference speed (latency, throughput) and efficiency (floating-point operations per second, i.e., FLOPS) of a model and its submodules, with an eye towards eliminating inefficiencies in existing implementations.
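As a quick illustration of how such a profile is produced, here is a minimal sketch using the profiler's high-level `get_model_profile` API on a Hugging Face BERT model. The model name, batch size, and sequence length are illustrative assumptions, not values tied to the sample output below.

```python
# Minimal sketch: profiling a BERT model with the DeepSpeed Flops Profiler.
# "bert-large-uncased", batch_size, and seq_len are assumed for illustration.
import torch
from deepspeed.profiling.flops_profiler import get_model_profile
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained("bert-large-uncased").cuda()
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

batch_size, seq_len = 80, 128
inputs = tokenizer(["profiler example"] * batch_size,
                   padding="max_length", max_length=seq_len,
                   truncation=True, return_tensors="pt").to("cuda")

with torch.no_grad():
    flops, macs, params = get_model_profile(
        model=model,
        kwargs=dict(inputs),   # keyword arguments forwarded to model(...)
        print_profile=True,    # print a profile like the sample below
        detailed=True,         # include the per-module breakdown
        as_string=True,        # return human-readable strings
    )
```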
Below is an example output for BERT-Large (NVIDIA) on an A100 GPU with batch size 80:

```
-------------------------- DeepSpeed Flops Profiler --------------------------
Notations:
data parallel size (dp_size), model parallel size (mp_size),
number of parameters (params), number of multiply-accumulate operations (MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

params of model = params per GPU * mp_size:                        336.23 M
fwd flops of model = fwd flops per GPU * mp_size:                  6279.86 G
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:               81.9 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:           116.27 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 102.0 TFLOPS
...

----------------------------- Aggregated Profile per GPU -----------------------------
Top modules in terms of params, MACs or fwd latency at different model depths:
...

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, ...
```
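The derived throughput lines above are plain ratios: backward flops are estimated as 2x forward flops, hence the factors of 2 and 3 in the bwd and fwd+bwd formulas. They can be checked by hand. In the sketch below, the latencies are back-solved from the reported TFLOPS figures (they are not shown in this excerpt), so treat them as approximate:

```python
# Recompute the profiler's derived FLOPS metrics from the summary above.
# Latencies are back-solved from the reported rates (assumed, not shown above).
fwd_flops = 6279.86e9                       # fwd flops per GPU
fwd_latency = fwd_flops / 81.9e12           # ~76.7 ms
bwd_latency = 2 * fwd_flops / 116.27e12     # ~108.0 ms (bwd flops ~ 2 * fwd)

fwd_tflops = fwd_flops / fwd_latency / 1e12                          # 81.9
bwd_tflops = 2 * fwd_flops / bwd_latency / 1e12                      # 116.27
fwd_bwd_tflops = 3 * fwd_flops / (fwd_latency + bwd_latency) / 1e12  # ~102.0
print(fwd_tflops, bwd_tflops, fwd_bwd_tflops)
```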
