Tokenization is one of the most crucial steps in building large language models (LLMs) from scratch. It plays a foundational role in data preprocessing, which ultimately impacts the training and performance of the LLM. In simple terms, tokenization is the process of breaking down text into smaller components called tokens. These tokens are later converted into numerical identifiers, called token IDs, which are used as inputs for machine learning models.
In this chapter, we delve into the mechanics of tokenization, exploring concepts such as splitting text into tokens, mapping tokens to unique token IDs, handling out-of-vocabulary (OOV) words, and using special tokens such as [UNK] (unknown token) and <|endoftext|> (end-of-text token). We will build a tokenizer from scratch using Python, gaining a hands-on understanding of how modern LLMs, like GPT, handle tokenization.
Tokenization is a critical preprocessing step in working with textual data for machine learning. It essentially transforms text into smaller, meaningful pieces called tokens. These tokens can vary depending on the approach chosen: they may represent whole words, subwords, characters, or even byte-pair encodings.
The main goal of tokenization is to simplify textual data so that it can be represented numerically for machine learning models. By doing so, raw text, which cannot be directly processed by models, becomes a structured input. This step ensures the model can learn relationships between words effectively.
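For instance, here is a minimal sketch of the idea using plain whitespace splitting on a made-up sentence (real tokenizers are more sophisticated, but the mapping from tokens to IDs works the same way):

text = "the cat sat on the mat"
tokens = text.split()                                  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]
print(token_ids)                                       # [4, 0, 3, 2, 4, 1]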
Why Tokenization Matters
Tokenization affects every subsequent step in the NLP pipeline. Proper tokenization ensures that important
semantic and structural features of the input data are preserved, leading to better model performance. For
example, tokenizing contractions like "can't" into "can" and "'t" can improve downstream processing.
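As a rough illustration (not the exact rule any particular tokenizer uses), a simple regular expression can pull the clitic off a contraction while keeping both pieces:

import re

# Hypothetical contraction split: "can't" becomes "can" and "'t"
print(re.findall(r"[A-Za-z]+|'[a-z]+", "I can't go"))  # ['I', 'can', "'t", 'go']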
Tokenization also helps in reducing vocabulary size and managing out-of-vocabulary tokens by adopting approaches like subword tokenization or Byte Pair Encoding (BPE), which we'll explore in the next chapter.
import re

# Load the raw text to tokenize
with open("verdict.txt", "r") as file:
    raw_text = file.read()

# Split on whitespace and punctuation; drop empty strings left by the split
tokens = [t for t in re.split(r'[\s,.;:?!]+', raw_text) if t]

# Map each unique token to an integer ID
unique_tokens = sorted(set(tokens))
vocab = {token: idx for idx, token in enumerate(unique_tokens)}

# Convert the token sequence into token IDs
token_ids = [vocab[token] for token in tokens]
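Assuming the file verdict.txt is available, a quick peek at the resulting vocabulary and IDs is a useful sanity check (the exact numbers depend on the file's contents):

print(len(vocab))                   # total number of unique tokens
print(list(vocab.items())[:5])      # first few (token, ID) pairs
print(tokens[:8])                   # first few tokens
print(token_ids[:8])                # ...and their IDs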
To encapsulate the tokenization process, we create a Python class with methods to encode and decode text. This modular approach ensures reusability, testability, and easier maintenance. A well-structured tokenizer class simplifies downstream tasks like handling unknown tokens or adding special tokens for LLMs.
class Tokenizer:
    def __init__(self, vocab):
        self.vocab = vocab                                      # token -> ID
        self.reverse_vocab = {v: k for k, v in vocab.items()}   # ID -> token

    def encode(self, text):
        # Split the text the same way the vocabulary was built
        tokens = [t for t in re.split(r'[\s,.;:?!]+', text) if t]
        # Unknown tokens raise a KeyError here; OOV handling is added below
        return [self.vocab[token] for token in tokens]

    def decode(self, token_ids):
        tokens = [self.reverse_vocab[token_id] for token_id in token_ids]
        return " ".join(tokens)
Out-of-vocabulary (OOV) words pose a significant challenge during tokenization. These are words that are not present in the predefined vocabulary. To address this, we introduce the [UNK] token, which acts as a placeholder for unknown words. This ensures that the tokenizer can handle a wide range of inputs, even if it doesn't recognize every token.
class TokenizerWithOOV(Tokenizer):
    def __init__(self, vocab):
        super().__init__(vocab)
        # Reserve the next free ID for the unknown-token placeholder
        self.vocab["[UNK]"] = max(self.vocab.values()) + 1
        self.reverse_vocab[self.vocab["[UNK]"]] = "[UNK]"

    def encode(self, text):
        tokens = [t for t in re.split(r'[\s,.;:?!]+', text) if t]
        # Fall back to the [UNK] ID for any token missing from the vocabulary
        return [self.vocab.get(token, self.vocab["[UNK]"]) for token in tokens]
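With the fallback in place, a sentence containing a word the vocabulary has never seen no longer fails; the unseen word simply maps to the [UNK] ID. A small hypothetical check:

oov_tokenizer = TokenizerWithOOV(vocab)

# "xylophone" is assumed to be absent from the vocabulary built from verdict.txt
ids = oov_tokenizer.encode("The xylophone played")
print(ids)
print(oov_tokenizer.decode(ids))    # the unseen word comes back as "[UNK]"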
To enhance the tokenizer, we introduce special tokens like <|unk|> and <|endoftext|>. These tokens improve the handling of unknown words and document boundaries, ensuring better compatibility with LLMs.
class EnhancedTokenizer:
    def __init__(self, vocab):
        self.vocab = dict(vocab)    # copy so the original vocabulary is not modified
        # Register the special tokens before building the reverse mapping
        self.vocab["<|unk|>"] = len(self.vocab)
        self.vocab["<|endoftext|>"] = len(self.vocab)
        self.reverse_vocab = {v: k for k, v in self.vocab.items()}

    def encode(self, text):
        # Append <|endoftext|> so the model can see where the document ends
        tokens = [t for t in re.split(r'[\s,.;:?!]+', text) if t] + ["<|endoftext|>"]
        return [self.vocab.get(token, self.vocab["<|unk|>"]) for token in tokens]

    def decode(self, token_ids):
        tokens = [self.reverse_vocab.get(token_id, "<|unk|>") for token_id in token_ids]
        return " ".join(tokens).replace(" <|endoftext|>", "")
This chapter provided a comprehensive understanding of tokenization and its role in building LLMs. In the next chapter, we will explore Byte Pair Encoding (BPE), the method used by GPT to handle subword tokenization. Stay tuned for more in-depth knowledge on how LLMs process and understand text.
You can visit the related notebook here: GitHub