
Self-Attention for Transformer Neural Networks

Queries, Keys, and Values are calculated to determine the similarity between tokens.

Query Values

  1. Encode a word
  2. Take the values from that encoding and multiply them by a set of weights
  3. The new values are the Query values for that word (see the sketch after this list)
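
A minimal sketch of that multiplication, assuming a hypothetical 4-dimensional encoding and a 4x4 query weight matrix (both random here purely for illustration; in a real model the weights are learned during training):

```python
import numpy as np

# Hypothetical sizes, random values purely for illustration.
rng = np.random.default_rng(0)

encoded_word = rng.normal(size=4)   # word embedding plus positional encoding
W_query = rng.normal(size=(4, 4))   # query weights (learned during training)

query = encoded_word @ W_query      # the Query values for this word
print(query.shape)                  # (4,)
```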

Why do we do this?

Keys

Keys are a table of numbers that are calculated during training. There are keys for each word.

Use the Query values to calculate the similarity between two words:

  1. Create Key values for each word
  2. Take the Query values for the word you are calculating similarity for
  3. Take the dot product of those Query values with the Key values to calculate the similarity (sketched below)
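
A sketch of those three steps, assuming three hypothetical tokens with 4-dimensional encodings and made-up weight matrices W_query and W_key standing in for the trained ones:

```python
import numpy as np

# Hypothetical tokens and stand-in weights, random for illustration only.
rng = np.random.default_rng(1)

encodings = rng.normal(size=(3, 4))   # one row per token
W_query = rng.normal(size=(4, 4))
W_key = rng.normal(size=(4, 4))

queries = encodings @ W_query         # Query values for every token
keys = encodings @ W_key              # Key values for every token

# Similarity of token 0 to every token: dot its query with each key.
similarities = keys @ queries[0]
print(similarities)                   # three similarity scores
```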

Example

If there are query and key values for the word "Let's", then to find the similarity of "Let's" to itself, take the dot product of the query values and the key values for "Let's".

To find the similarity between "Let's" and "Go", take the dot product of the query values for "Let's" and the key values for "Go".
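
With some made-up 2-dimensional query and key vectors (real models learn these and use many more dimensions), the arithmetic looks like this:

```python
import numpy as np

# Made-up 2-dimensional vectors purely for illustration.
q_lets = np.array([1.0, 2.0])    # query values for "Let's"
k_lets = np.array([1.5, 1.0])    # key values for "Let's"
k_go = np.array([-0.5, 0.5])     # key values for "Go"

print(np.dot(q_lets, k_lets))    # 3.5 -> similarity of "Let's" to itself
print(np.dot(q_lets, k_go))      # 0.5 -> similarity of "Let's" to "Go"
```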

Values

The Values table is (I think?) another set of numbers for each word that gets scaled by the similarity scores after those have been run through the dot product and then SoftMax. Each word has its own set of values.

The similarity scores calculated using the query and key values are run through SoftMax to determine how much of each word will be used to encode the word. So in the example so far, the word "Let's" is going to be weighted much more heavily than the word "Go" because it is so much more similar to itself.
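
Continuing the made-up numbers from above, SoftMax turns the raw similarity scores into weights that sum to 1:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# The made-up similarity scores for "Let's" against ["Let's", "Go"] from above.
scores = np.array([3.5, 0.5])
weights = softmax(scores)
print(weights)  # roughly [0.95, 0.05]: "Let's" is weighted far more heavily
```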

The steps for calculating self-attention

  1. Calculate the query value.

Multiply the position-encoded values (multi-dimensional vectors) by some weights.

  2. Calculate similarity scores

For each token in the sequence, take the dot product of the query value and that token's Key values, then apply softmax to turn the results into similarity scores.

  3. Use the Value table to calculate the final value

Scale each word's numbers from the Value table by its dot-product/softmax similarity score, then add all of those scaled values together to get the self-attention score for the word being looked at in the sentence.

Keys and Values do not need to be recalculated for each word. The weights used to calculate Query values are the same for every token, and the same is true for the weights used to calculate the Keys and the Values for each token.
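
Putting the steps together, here is a minimal single-head sketch with hypothetical sizes (3 tokens, 4-dimensional encodings) and random stand-in weight matrices. The original Transformer also scales the scores by the square root of the key dimension before SoftMax, which this sketch skips:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(encodings, W_query, W_key, W_value):
    """Single-head self-attention over position-encoded token vectors.

    The same three weight matrices are applied to every token, so Queries,
    Keys, and Values are each computed once for the whole sequence.
    """
    Q = encodings @ W_query
    K = encodings @ W_key
    V = encodings @ W_value

    scores = Q @ K.T            # similarity of every token to every other token
    weights = softmax(scores)   # how much of each word to use for each token
    return weights @ V          # weighted sum of Values = self-attention output

# Hypothetical sizes and random stand-in weights, for illustration only.
rng = np.random.default_rng(2)
encodings = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(encodings, W_q, W_k, W_v).shape)  # (3, 4)
```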

Concepts and Insights

Why use Query, Key and Value tables?

The idea is that the token being looked at is the "query"; it is compared against "keys" (like in a database), and the resulting similarity scores are then used to weight the Values.

Each of the queries, keys, and values has a weight matrix that is created when the model is trained.

The whole point of this is to allow the training process to encode information in the embeddings about which words in the sentence the model should focus on.

"The dog found some pizza next to the baby and ate it"

In this sentence, the word "pizza" is much more important than the word "it", so the neural network needs to know that "it" refers to the pizza and not the baby. It tends to be true that "it" refers to the nearest preceding noun.

So by the time the transformer reaches this step, it has encoded position into the vector it's working with. Now it needs to figure out what it should focus on.

Initially, the model has no idea which words are more important to a given word, so it starts with some random weights to give itself a starting point; that is the Query. Then the starting point has to be compared to something, so there is a second set of values called Keys, which are also trained over time. Once the similarity of every token to the token being looked at has been calculated using the dot product and softmax, a third table, the Values, is used. I'm not sure why they needed the Values step, but it adds another layer for the neural network to train and add knowledge as it learns.

Insight about similarity

Initially, it might seem confusing how a token can be "similar" to another token. In this context, similarity refers to the relevance or importance of one token to another, determined by learned weights.