This is my scratch pad for keeping track of notes as I make my way through learning how to build an LLM.
The goal is to build a conceptual framework for how each of the pieces works and why certain choices were made. Where did Queries, Keys, and Values come from in self-attention, and why were they designed that way? Why use ReLU and not softmax as the activation function? How does gradient descent work? And so on.
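As a first taste of the Q/K/V question, here is a minimal sketch of scaled dot-product attention (the formula from "Attention Is All You Need") in plain NumPy. The function and variable names are my own; in self-attention the same token embeddings are used for all three roles, with learned projections (omitted here) making Q, K, and V differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # each query scores every key; dividing by sqrt(d_k) keeps the
    # dot products from growing with the embedding dimension
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of the values

# toy example: 3 tokens, 4-dimensional embeddings
x = np.random.randn(3, 4)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)  # (3, 4)
```

Note that softmax shows up here too, but as a way to turn attention scores into weights that sum to 1, not as a hidden-layer activation like ReLU.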