Flash Attention

Related: Attention · Attention Mechanisms · Ring, Ulysses and Unified Attention · Gradient Checkpointing
Literature: FlashAttention Paper

An Efficient Attention Process

Standard Attention

We have the matrices $Q, K, V \in \mathbb{R}^{N \times d}$ stored in HBM (high-bandwidth memory).

  1. Load $Q$ and $K$ by blocks from HBM to the GPU's SRAM.
  2. Compute $S = QK^T$ in SRAM, then write $S$ back to HBM.
  3. Read $S$ from HBM, compute $P = \mathrm{softmax}(S)$, and write $P$ back to HBM.
  4. Load $P$ and $V$ by blocks from HBM, compute $O = PV$, then write $O$ back to HBM.
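As a minimal sketch, the four steps above can be written in NumPy. Everything here is illustrative: the function name `standard_attention` and the dimensions are assumptions, and the usual $1/\sqrt{d}$ scaling is omitted since the steps above omit it too. The point is that the full $N \times N$ matrices $S$ and $P$ are materialized, which is what forces the HBM round trips.

```python
import numpy as np

def standard_attention(Q, K, V):
    """Naive attention: materializes the full N x N score matrix S
    and probability matrix P (the HBM round trips in steps 2-4)."""
    S = Q @ K.T                               # step 2: (N, N) scores
    S = S - S.max(axis=-1, keepdims=True)     # subtract row max for stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)     # step 3: row-wise softmax
    return P @ V                              # step 4: (N, d) output

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
O = standard_attention(Q, K, V)
print(O.shape)  # (8, 4)
```

For sequence length $N$, both $S$ and $P$ take $O(N^2)$ memory, which is exactly what FlashAttention avoids writing to HBM.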

Flash Attention

Goal - Minimize the number of HBM accesses

To compute a numerically stable softmax of a block $x \in \mathbb{R}^B$:

  1. $m(x) = \max_i x_i$ -> find the maximum value of the input vector.
  2. $f(x) = [e^{x_1 - m(x)}, \ldots, e^{x_B - m(x)}]$ -> subtract the max from each term before exponentiating.
  3. $l(x) = \sum_i f(x)_i$ -> the running sum for the denominator.

Then $\mathrm{softmax}(x) = f(x) / l(x)$.
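The three steps translate directly into NumPy (the name `stable_softmax` is illustrative). Subtracting $m(x)$ keeps every exponent $\le 0$, so large inputs that would overflow a naive `np.exp(x)` stay finite:

```python
import numpy as np

def stable_softmax(x):
    # Step 1: m(x) = max_i x_i
    m = np.max(x)
    # Step 2: f(x) = [e^{x_1 - m(x)}, ..., e^{x_B - m(x)}]
    f = np.exp(x - m)
    # Step 3: l(x) = sum_i f(x)_i, the softmax denominator
    l = np.sum(f)
    return f / l

# Inputs this large would overflow the naive exp(x) / sum(exp(x));
# the stabilized version stays finite and still sums to 1.
x = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(x))
```

Note that the result is unchanged by the shift: the factor $e^{-m(x)}$ appears in both numerator and denominator and cancels.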

We would like a way to compute the softmax online, rather than only after all the input terms have been computed. If we can do that, we can compute the softmax without moving $S$ back and forth between SRAM and HBM.

Let $x = [x_1, x_2]$ be the concatenation of two blocks, and let $m(x) = \max(m(x_1), m(x_2))$. Then we can write $f(x)$ and $l(x)$ as follows:

$f(x) = [\, e^{m(x_1) - m(x)} f(x_1), \; e^{m(x_2) - m(x)} f(x_2) \,]$
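A small NumPy check of this decomposition (the helper names `block_stats` and `merge` are illustrative, not from the paper). Each block keeps its own $(m, f, l)$; merging rescales each block's values by $e^{m(x_i) - m(x)}$, and since $l$ is just the sum of $f$, it merges the same way:

```python
import numpy as np

def block_stats(x):
    """Per-block statistics: max m(x), scaled exponentials f(x), sum l(x)."""
    m = np.max(x)
    f = np.exp(x - m)
    return m, f, np.sum(f)

def merge(m1, f1, l1, m2, f2, l2):
    """Combine two blocks' statistics using the identity above."""
    m = max(m1, m2)                     # m(x) = max(m(x1), m(x2))
    f = np.concatenate([np.exp(m1 - m) * f1,
                        np.exp(m2 - m) * f2])
    l = np.exp(m1 - m) * l1 + np.exp(m2 - m) * l2
    return m, f, l

x1 = np.array([1.0, 3.0])
x2 = np.array([2.0, 5.0])
x = np.concatenate([x1, x2])

m, f, l = merge(*block_stats(x1), *block_stats(x2))
direct = np.exp(x - x.max())
# blockwise result matches the one-pass softmax over the full vector
print(np.allclose(f / l, direct / direct.sum()))  # True
```

This is the key property FlashAttention exploits: blocks of $S$ can be processed one at a time in SRAM, carrying only the scalars $m$ and $l$ between blocks, so $S$ never needs to be written to HBM.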