This is my scratch pad for keeping track of notes as I make my way through learning how to build an LLM.
The goal is to build a conceptual framework for how each of the pieces works and why certain choices were made. Where did Queries, Keys, and Values come from in self-attention, and why were they designed that way? Why use ReLU and not softmax as the activation function? How does gradient descent work? And so on.
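As a first taste of the Q/K/V question, here is a minimal sketch of scaled dot-product attention (the formula from "Attention Is All You Need") in plain NumPy. The function and variable names are my own; in self-attention the same token embeddings are used for all three roles, with learned projections (omitted here) making Q, K, and V differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # each query scores every key; dividing by sqrt(d_k) keeps the
    # dot products from growing with the embedding dimension
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of the values

# toy example: 3 tokens, 4-dimensional embeddings
x = np.random.randn(3, 4)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)  # (3, 4)
```

Note that softmax shows up here too, but as a way to turn attention scores into weights that sum to 1, not as a hidden-layer activation like ReLU.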