Notes reading 'Build a Large Language Model (From Scratch)' #2


June 23rd, 2025

How GPT models differ from classic transformer architecture

GPT models actually use a simplified version of the transformer architecture. While the original transformer has both an encoder and a decoder, GPT only requires the decoder: the input is tokenised, vectorised, and fed into the decoder, which generates the next token of the reply; that output is appended to the input and repeatedly fed back in to build a full response. This type of model is called an autoregressive model, and training it to predict the next token is a form of 'self-supervised learning'.
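To make that feedback loop concrete, here is a minimal sketch of autoregressive generation in Python. The `model` and `tokenizer` objects are hypothetical stand-ins, not any particular library's API; the point is only the shape of the loop.

    # Minimal sketch of the autoregressive loop: predict one token, append it,
    # and feed the growing sequence back into the decoder.
    # `model` and `tokenizer` are hypothetical stand-ins, not a real library API.
    def generate(model, tokenizer, prompt, max_new_tokens=20):
        token_ids = tokenizer.encode(prompt)               # tokenise the input
        for _ in range(max_new_tokens):
            next_id = model.predict_next_token(token_ids)  # decoder predicts one token
            token_ids.append(next_id)                      # feed it back in
        return tokenizer.decode(token_ids)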

The reason is that this decoder-only approach performs well for GPT's purpose, which is text generation, while also performing better than expected at translation tasks (which came as a surprise to researchers when they first observed it).

Preparing data for pre-training

In order to pre-train a model on a large dataset, the dataset has to be prepared/transformed into a format that can be understood by a computer. In practice, this means tokenising your input (breaking it apart into comparable chunks) and creating vectors for each token, which essentially encode the context and meaning of your input dataset in a computer-readable format. This computer-readable format is a continuous-valued vector, which is essentially an array of floating-point numbers.
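As a purely illustrative example (the numbers below are made up, and real models use far more dimensions), a token's vector might look something like this:

    # Illustrative only: each token ends up as an array of floats.
    # Real embedding vectors have hundreds or thousands of dimensions.
    cheap = [0.21, -1.03, 0.55, 0.08]
    genius = [0.19, -0.97, 0.61, 0.12]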

Step 1: Tokenisation for LLMs

Tokenisation for LLMs involves taking unstructured text (a book, a poem, a research paper, a WhatsApp message, etc.) and chunking it up into 'tokens'. Typically, the chunks are full words, as well as punctuation and other symbols we can derive meaning from.

Take the string: I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no

We could chunk this up into tokens like so:

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
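One simple way to sketch this chunking in Python is with the standard `re` module, splitting on whitespace, punctuation, and double dashes while keeping the separators as tokens of their own (the exact pattern here is just an illustration):

    import re

    text = "I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no"

    # Split on punctuation, '--', and whitespace, keeping the separators as
    # their own tokens, then drop empty and whitespace-only pieces.
    tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', text) if t.strip()]
    print(tokens)
    # ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap',
    #  'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no']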

Step 2: Assigning IDs to tokens

Once an array of tokens is generated, each token should be assigned a unique identifier that can be used to refer to a specific token, rather than the token itself. This reduces memory usage and improves the efficiency of the model, as referring to the full token every time increases complexity and memory usage relative to the length of the token.

This essentially means generating a map of tokens -> IDs, like so:

(',', 0) ('--', 1) ('Gisburn', 2) ('HAD', 3) ('I', 4) ('Jack', 5) ('a', 6) ('always', 7) ('cheap', 8) ('enough', 9) ('fellow', 10) ('genius', 11) ('good', 12) ('great', 13) ('hear', 14) ('in', 15) ('it', 16) ('me', 17) ('no', 18) ('rather', 19) ('so', 20) ('surprise', 21) ('that', 22) ('though', 23) ('thought', 24) ('to', 25) ('was', 26)

For example:

  • ID 20 refers to the token 'so'
  • ID 1 refers to the token '--'
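A minimal sketch of building such a map in Python: collect the unique tokens, sort them, and number them in order (this reproduces the IDs above):

    # Build a token -> ID map by sorting the unique tokens and numbering them.
    tokens = ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a',
              'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough',
              '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me',
              'to', 'hear', 'that', ',', 'in']

    vocab = {token: i for i, token in enumerate(sorted(set(tokens)))}

    print(vocab['so'])   # 20
    print(vocab['--'])   # 1

    # The reverse map turns IDs back into tokens when decoding.
    id_to_token = {i: token for token, i in vocab.items()}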

Step 3: Vectorising the tokens

To be covered next time…

What I still don't understand…

  • How vectors lend themselves to calculating similarity in context/meaning between two tokens
  • How/why vectors can sometimes support hundreds or thousands of dimensions, and how this impacts the performance and size of a model