June 23rd, 2025
How GPT models differ from classic transformer architecture
GPT models actually use a simplified version of the transformer architecture. While the original transformer architecture has both an encoder and a decoder, GPT only requires the decoder: the input is tokenised, vectorised, and fed into the decoder, which generates the next token of the reply; that output is repeatedly fed back into the model to build up a full response. This type of model is called an autoregressive model, and pre-training it to predict the next token is a form of 'self-supervised learning'.
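As a minimal sketch of that feedback loop (not the actual GPT implementation; the model here is a dummy next-token predictor invented purely for illustration), the autoregressive generation step looks something like this in Python:

```python
# A dummy stand-in for a decoder-only model's forward pass. A real GPT would
# return a probability distribution over its vocabulary and sample from it;
# this placeholder just derives a fake token ID so the loop is runnable.
def toy_next_token(token_ids: list[int]) -> int:
    return (sum(token_ids) + 1) % 100


def generate(prompt_ids: list[int], max_new_tokens: int = 5) -> list[int]:
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = toy_next_token(tokens)  # predict one token from everything so far...
        tokens.append(next_id)            # ...append it, and feed the whole sequence back in
    return tokens


print(generate([4, 3, 7]))  # the prompt IDs followed by 5 generated dummy IDs
```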
The reason is that this decoder-only approach performs well for GPT's purpose, which is text generation, while also performing better than expected at translation tasks (which came as a surprise to researchers when they first observed it).
Preparing data for pre-training
In order to pre-train a model on a large dataset, the dataset has to be prepared/transformed into a format that can be understood by a computer. In practice, this means tokenising your input (breaking it apart into comparable chunks), assigning each token an integer ID, and creating vectors for each token, which encode the context and meaning of your input dataset in a computer-readable format. These vectors are continuous-valued, which essentially means each one is an array of floating-point numbers rather than integers.
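To make that distinction concrete, here is a rough illustration (the numbers are made up): the token IDs produced in Step 2 are integers, while the vectors built from them in Step 3 hold floating-point values.

```python
# Integer IDs, one per token (Step 2)
token_ids = [4, 3, 7, 24]

# Each ID is then mapped to a continuous-valued vector (Step 3): an array of
# floating-point numbers, not integers. A tiny 4-dimensional example; real
# models use hundreds or thousands of dimensions per token.
embedding_for_id_4 = [0.21, -1.37, 0.05, 0.98]
```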
Step 1: Tokenisation for LLMs
Tokenisation for LLMs involves taking unstructured text (a book, a poem, a research paper, a WhatsApp message, etc.) and chunking it up into 'tokens'. Typically, the chunks are full words, as well as punctuation and other symbols we can derive meaning from.
Take the string: I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in
We could chunk this up into tokens like so:
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
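One simple way to produce a token list like that in Python is to split on whitespace and punctuation (a rough sketch of whole-word tokenisation; production GPT tokenisers actually use subword schemes such as byte-pair encoding):

```python
import re

text = ("I HAD always thought Jack Gisburn rather a cheap genius--though "
        "a good fellow enough--so it was no great surprise to me to hear that, in")

# Split on whitespace, common punctuation, and the '--' dash, keeping the
# punctuation as tokens in their own right, then drop the empty strings and
# bare whitespace left over from the split.
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]

print(tokens)
# ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap',
#  'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it',
#  'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
```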
Step 2: assigning IDs to tokens
Once an array of tokens is generated, each token should be assigned a unique identifier that can be used to refer to that token instead of the token itself. This reduces memory usage and improves the efficiency of the model, as referring to the full token every time increases complexity and memory usage relative to the length of the token.
This essentially means generating a map of tokens -> IDs, like so:
(',', 0)
('--', 1)
('Gisburn', 2)
('HAD', 3)
('I', 4)
('Jack', 5)
('a', 6)
('always', 7)
('cheap', 8)
('enough', 9)
('fellow', 10)
('genius', 11)
('good', 12)
('great', 13)
('hear', 14)
('in', 15)
('it', 16)
('me', 17)
('no', 18)
('rather', 19)
('so', 20)
('surprise', 21)
('that', 22)
('though', 23)
('thought', 24)
('to', 25)
('was', 26)
For example:
- ID 20 refers to token 'so'
- ID 1 refers to token '--'
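In code, that map can be built by collecting the unique tokens, sorting them, and numbering them in order (a sketch that carries on from the `tokens` list produced in Step 1):

```python
# Sort the unique tokens and number them: this produces exactly the
# token -> ID map listed above.
vocab = {token: token_id for token_id, token in enumerate(sorted(set(tokens)))}

# Encoding a text is then just looking up each token's ID...
ids = [vocab[token] for token in tokens]
print(ids[:5])  # [4, 3, 7, 24, 5] -> 'I', 'HAD', 'always', 'thought', 'Jack'

# ...and an inverse map turns IDs back into tokens.
id_to_token = {token_id: token for token, token_id in vocab.items()}
print(id_to_token[20], id_to_token[1])  # so --
```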
Step 3: vectorising the tokens
To be covered next time…
What I still don't understand…
- How vectors lend themselves to calculating similarity in context/meaning between two tokens
- How/why vectors can sometimes support hundreds or thousands of dimensions, and how this impacts the performance and size of a model