June 21st, 2025
In an effort to better understand machine learning and language models, I'm currently reading 'Build a Large Language Model (From Scratch)' by Sebastian Raschka.
Raschka has written a comprehensive (and comprehensible) guide to how Large Language Models (LLMs) specifically work, covering general machine learning theory along the way. Safe to say I'm really enjoying the book so far, and I'll try to document my understanding in a series of posts here as I read.
The eBook can be purchased from hive.co.uk, while a hardback copy can be purchased from Waterstones.
Here's what I've learned so far…
Large Language Models (LLMs)
Large Language Models (LLMs) are deep neural networks capable of natural language processing. They're essentially self-taught algorithms made up of many, many 'parameters' or 'weights' (which can be thought of as 'knowledge' at a high level) that can make scarily accurate textual predictions based on given input text. Some of them - depending on their underlying architecture - are great at generative tasks - such as text generation, autocompletion, and translation, to name a few - while others are better suited to more specialised tasks.
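To make the 'predicting the next bit of text' idea concrete, here's a deliberately tiny toy sketch (my own, not from the book): a real LLM learns a probability distribution over its entire vocabulary from billions of examples, whereas this toy just hard-codes one for a single context.

# Toy illustration of next-word prediction (not from the book):
# a real LLM learns these probabilities from data; here they are hard-coded.
import random

next_word_probs = {
    "the weather is": {"nice": 0.6, "awful": 0.3, "purple": 0.1},
}

def predict_next(context):
    probs = next_word_probs[context]
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights)[0]

print(predict_next("the weather is"))  # most often prints "nice"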
Transformer Architecture
Transformer Architecture (TA) is a specific architecture used by modern deep learning models, and has enabled LLMs as we know them today.
The GPT (Generative Pre-trained Transformer) architecture is based on the Transformer Architecture and lends itself well to generative tasks. Another popular architecture, BERT (Bidirectional Encoder Representations from Transformers), also based on the Transformer Architecture, lends itself better to more specialised tasks such as missing-word prediction. As a usage example, Twitter (X) uses a BERT-based model for auto-moderation and classification of 'toxic' content.
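As a quick aside (this uses the off-the-shelf Hugging Face transformers library, not the book's from-scratch code), the difference between the two styles is easy to see in practice: a GPT-style model continues text left-to-right, while a BERT-style model fills in a masked word using context from both sides.

# Contrast between a GPT-style generative model and a BERT-style masked model,
# using off-the-shelf models from Hugging Face (not the book's from-scratch code).
from transformers import pipeline

# GPT-style: continue the text left-to-right
generator = pipeline("text-generation", model="gpt2")
print(generator("The weather today is", max_new_tokens=10)[0]["generated_text"])

# BERT-style: fill in a masked word using context from both directions
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The weather today is [MASK].")[0]["token_str"])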
How models 'learn'
The 'learning process' can essentially be divided into two key stages:
- pre-training
- fine-tuning
Pre-training
To pre-train a model, you take a giant dataset of pertinent information and use it to generate billions of 'weights'. These weights can be thought of collectively as 'knowledge', and give a model an emergent understanding of the dataset given to it. In the context of LLMs, the GPT-3 LLM was given terabytes worth of textual data (books, crawled internet pages, and all of Wikipedia), which can also be thought of as billions and billions of 'tokens' (499 billion, precisely), from which its 175 billion 'parameters' (including its token 'embeddings') were learned.
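To get a feel for what a 'token' actually is, here's a small sketch using OpenAI's tiktoken package and the GPT-2 tokenizer (the example sentence is just my own); the last couple of lines also hint at what pre-training actually optimises, namely predicting each next token.

# What a 'token' is: a BPE tokenizer splits text into integer token IDs.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Large Language Models are deep neural networks."
token_ids = enc.encode(text)

print(token_ids)              # the sentence as a list of integer token IDs
print(len(token_ids))         # how many tokens the sentence 'costs'
print(enc.decode(token_ids))  # decodes back to the original text

# During pre-training the model sees (input, target) pairs, where the target
# is simply the input shifted one token to the right (next-token prediction):
inputs, targets = token_ids[:-1], token_ids[1:]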
Pre-training can be thought of as the language-giving step, as it gives an LLM the embeddings it needs to 'understand' and carry out general language-/text-based tasks.
As a general rule: the more parameters a model has, the more 'knowledge' it can be thought to have (though this does not directly speak to how efficient the model is, or how accurate its outputs are).
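To show what 'number of parameters' literally means, here's a sketch in PyTorch (the layer sizes are arbitrary, roughly GPT-2-shaped, and purely illustrative): the parameter count is just the total number of learnable values across all of a model's layers.

# 'Number of parameters' = the count of learnable values (weights and biases).
import torch.nn as nn

tiny_model = nn.Sequential(
    nn.Embedding(50257, 768),  # token embedding layer (GPT-2-sized vocabulary)
    nn.Linear(768, 768),       # a single hidden layer
    nn.Linear(768, 50257),     # output layer over the vocabulary
)

total_params = sum(p.numel() for p in tiny_model.parameters())
print(f"{total_params:,} parameters")  # tens of millions, vs. GPT-3's 175 billion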
A pre-trained model is also called a 'base' or 'foundation' model, and can be adapted to specialised tasks through fine-tuning. Examples of existing foundation models are OpenAI's GPT-4o and Meta's Llama (and there are many, many more).
Fine-tuning
Fine-tuning involves augmenting a foundation model with a dataset pertaining to a specific use case. There are a few ways to do this, but the two most prominent are:
- instruction fine-tuning
- classification fine-tuning
Instruction fine-tuning
Instruction fine-tuning essentially involves providing your base model with a dataset consisting of question-and-answer pairs. E.g. if you wanted it to be able to translate text from English -> Arabic, you could provide it with a set of translations, such as:
{
  "en": "How is the weather today?",
  "ar": "كيف الطقس اليوم؟"
}
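I haven't reached the fine-tuning chapters yet, but one common way to use a pair like this (an assumption on my part, not necessarily the book's exact approach) is to format it into a single piece of training text with a prompt template:

# Formatting an instruction pair into one training example via a prompt template
# (a common approach, not necessarily the book's exact one).
example = {
    "en": "How is the weather today?",
    "ar": "كيف الطقس اليوم؟",
}

training_text = (
    "### Instruction:\n"
    "Translate the following sentence from English to Arabic.\n\n"
    f"### Input:\n{example['en']}\n\n"
    f"### Response:\n{example['ar']}"
)

print(training_text)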
Classification fine-tuning
Classification fine-tuning involves providing your base model with a dataset consisting of raw data with 'classifications' or 'labels'. For example: if you wanted to build a model that can automatically assess an email as 'spam' or 'not spam', you could teach it what 'spam' is by giving it a dataset consisting of emails with an associated 'spam' or 'not spam' classification. Given enough of these, the model will be able to determine with greater and greater accuracy whether an email outside of its fine-tuning/training data is spam or not.
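Sketching the core idea in PyTorch (this is my own rough sketch, not the book's code, and base_model here is a hypothetical pre-trained network): you keep the pre-trained model and bolt a small two-class 'head' onto it, then train it on the labelled emails.

# Core idea of classification fine-tuning: reuse the pre-trained network, but
# replace its output layer with a small head that maps to two classes.
import torch.nn as nn

class SpamClassifier(nn.Module):
    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.base = base_model                 # hypothetical pre-trained LLM
        self.head = nn.Linear(hidden_size, 2)  # two classes: spam / not spam

    def forward(self, token_ids):
        hidden = self.base(token_ids)          # assumed shape: (batch, seq_len, hidden_size)
        return self.head(hidden[:, -1, :])     # classify from the last token's hidden state

# Training then minimises cross-entropy between these two-class outputs and
# the 'spam' / 'not spam' labels in the fine-tuning dataset.
loss_fn = nn.CrossEntropyLoss()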
Tags: books, programming, machine-learning, tech