What is an attention mask when we tokenize documents during model training?

Vivek Singh

An attention mask is a mechanism used during model training, particularly in transformer-based models (like BERT, GPT, etc.), to differentiate between actual tokens (words or subwords) and padding tokens in a sequence. This mask is necessary because input sequences can have varying lengths, while models typically expect inputs of uniform length. To achieve this, shorter sequences are padded with extra tokens to match the length of the longest sequence in the batch.
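As a minimal sketch of how this looks in practice, assuming the Hugging Face Transformers library and the bert-base-uncased tokenizer (any similar tokenizer behaves the same way), padding a batch produces the attention mask automatically:

```python
from transformers import AutoTokenizer

# Assumption for illustration: the bert-base-uncased checkpoint is available.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = [
    "a short sentence",
    "a somewhat longer sentence that needs more tokens",
]

# padding=True pads every sequence to the length of the longest one in the batch
encoded = tokenizer(batch, padding=True)

print(encoded["input_ids"])       # token ids; the shorter sequence is padded with the PAD id
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
```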

Purpose of the Attention Mask:

  • Avoid processing padding tokens: Padding tokens are not meaningful, and the model should not attend to them during training or inference. The attention mask ensures that the model focuses only on the actual input tokens.

How it works:

  • The attention mask is a binary array of the same length as the input sequence.
  • 1 indicates that the corresponding token should be attended to (it's part of the actual input).
  • 0 indicates that the corresponding token is padding and should be ignored during attention.

For example, given the input sequence [word1, word2, word3, PAD, PAD], the attention mask would look like [1, 1, 1, 0, 0].
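To see why the 0s matter, here is a toy sketch, assuming PyTorch, of how such a mask is typically applied inside an attention layer: positions with mask 0 get a very large negative score before the softmax, so they receive (near-)zero attention weight. This is an illustration of the technique, not the exact internals of any particular model.

```python
import torch
import torch.nn.functional as F

# Raw attention scores for a 5-token sequence (query x key); random values for illustration.
scores = torch.randn(5, 5)
attention_mask = torch.tensor([1, 1, 1, 0, 0])  # 1 = real token, 0 = PAD

# Masked key positions are set to -inf, so softmax gives them ~0 weight.
masked_scores = scores.masked_fill(attention_mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)  # the columns for the two PAD positions are ~0 in every row
```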
