Collective Operations

Collective operations are operations performed jointly by a group of devices on data that is distributed and/or replicated across them. Here, a device means a GPU; the devices may all sit in a single machine or be spread across a cluster.

AllReduce

Performs a reduction (such as sum, min, or max) on data from all devices and stores the result on every device.
(Figure: AllReduce)
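
A minimal sketch of AllReduce using PyTorch's torch.distributed, which dispatches to NCCL on GPUs; the four-process setup, tensor shapes, and values below are illustrative assumptions, not part of the original article.

```python
# Launch with: torchrun --nproc_per_node=4 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # one process per GPU
rank = dist.get_rank()
torch.cuda.set_device(rank)

x = torch.full((4,), float(rank), device="cuda")   # each rank holds different data

# Sum across all ranks; every rank ends up with the same reduced tensor.
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {x.tolist()}")   # every rank prints [6.0, 6.0, 6.0, 6.0] (0+1+2+3)

dist.destroy_process_group()
```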

Broadcast

Copies data from one root device to every other device in the group.
(Figure: Broadcast)
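
A sketch with torch.distributed, assuming the same process-group setup as the AllReduce example above:

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
if rank == 0:
    x = torch.arange(4.0, device="cuda")   # data that only the root holds
else:
    x = torch.empty(4, device="cuda")      # receive buffer on the other ranks

# Copy rank 0's tensor into every rank's buffer.
dist.broadcast(x, src=0)
# every rank now holds [0., 1., 2., 3.]
```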

Reduce

Performs the reduction across all devices and stores the result on a single specified device.
(Figure: Reduce)
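
A sketch with torch.distributed, again assuming the setup from the AllReduce example; the destination rank is an arbitrary choice.

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
x = torch.full((4,), float(rank), device="cuda")

# Sum across all ranks; the result is valid only on the destination rank.
dist.reduce(x, dst=0, op=dist.ReduceOp.SUM)
if rank == 0:
    print(x.tolist())   # [6.0, 6.0, 6.0, 6.0] with 4 processes
```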

AllGather

Gathers each device's value, concatenates them into a single vector, and stores that vector on every device.
(Figure: AllGather)
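
A sketch with torch.distributed, assuming the same setup as above:

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()

x = torch.tensor([float(rank)], device="cuda")                     # local value
out = [torch.empty(1, device="cuda") for _ in range(world_size)]   # one slot per rank

# Collect every rank's tensor; the full list lands on every rank.
dist.all_gather(out, x)
gathered = torch.cat(out)   # [0., 1., 2., 3.] on every rank (4 processes)
```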

ReduceScatter

Each device holds its own vector of data; ReduceScatter performs an element-wise reduction across the devices' vectors and then scatters the result, so each device ends up with the reduced values for its portion of the indices.
(Figure: ReduceScatter)

In simpler words,
Out0 -> reduction of the 0th element of in0, in1, in2, and in3 (stored on device 0).
Out1 -> reduction of the 1st element of in0, in1, in2, and in3 (stored on device 1).
...and so on.
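
A sketch with torch.distributed, assuming the same setup as above and a chunk size of 2 elements per rank:

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()

# On every rank: one input chunk per destination rank.
inputs = [torch.full((2,), float(rank), device="cuda") for _ in range(world_size)]
out = torch.empty(2, device="cuda")

# Chunk i is summed across all ranks and delivered to rank i.
dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)
# with 4 processes every rank ends up with out == [6., 6.] (0+1+2+3)
```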

AllToAll

Every device exchanges data with every other device: device i sends its j-th chunk to device j and receives the i-th chunk from each device.
(Figure: AllToAll)
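
A sketch with torch.distributed, assuming the same setup as above; the 10*rank + j encoding is only there to make it easy to see where each chunk ends up.

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()

# send[j] on rank i is delivered to rank j; recv[k] is what rank k sent to us.
send = [torch.tensor([10.0 * rank + j], device="cuda") for j in range(world_size)]
recv = [torch.empty(1, device="cuda") for _ in range(world_size)]

dist.all_to_all(recv, send)
# e.g. with 4 processes, rank 1 ends up with [1., 11., 21., 31.]
```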

Gather

Gathers each device's value and concatenates them into a single vector on one specified device.
(Figure: Gather)
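
A sketch with torch.distributed, assuming the same setup as above and a PyTorch version where gather is supported on the NCCL backend:

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()

x = torch.tensor([float(rank)], device="cuda")
# Only the destination rank needs receive buffers.
bufs = [torch.empty(1, device="cuda") for _ in range(world_size)] if rank == 0 else None

dist.gather(x, gather_list=bufs, dst=0)
if rank == 0:
    print(torch.cat(bufs).tolist())   # [0.0, 1.0, 2.0, 3.0] with 4 processes
```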

Scatter

Splits a vector from the root device across all available devices, with each device getting an equal share.
(Figure: Scatter)
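
A sketch with torch.distributed, assuming the same setup as above and a PyTorch version where scatter is supported on the NCCL backend:

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()

# Only the source rank provides the chunks to distribute.
chunks = [torch.full((2,), float(i), device="cuda") for i in range(world_size)] if rank == 0 else None
out = torch.empty(2, device="cuda")

dist.scatter(out, scatter_list=chunks, src=0)
# rank i now holds [i., i.]
```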

Note:
All the images in this article are from https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html.