The Modern Tokenization Stack for NLP: Byte Pair Encoding

In my previous blog post, I introduced you to the traditional tokenization stack for Natural Language Processing. Now it's high time to introduce you to the modern tokenization stack for NLP (in 2022): byte pair encoding (BPE). BPE is a tokenization method used by many popular transformer-based models like RoBERTa, GPT-2 and XLM.

Background

The field of Natural Language Processing has seen a tremendous amount of innovation and excitement over the past few years. We went from sparse text representations such as bag of words and tf-idf to word embeddings that capture semantic meaning, such as word2vec and fastText.

Even with all of these innovations, limitations remained. For instance, word2vec doesn't capture context, so it remained hard to model polysemy (words with multiple meanings) and other linguistic phenomena. Moreover, rare words that never occurred in the training data of these pre-trained models would all get the same default representation (the <unk> token), which led to a loss of information.

Luckily, the field hasn't stood still in the meantime. New innovations have brought transformer-based architectures for language models, and with these models came a new way of tokenizing text: subword tokenization. The transformer architecture, combined with this method of tokenization, addresses the aforementioned limitations: it captures context and still handles rare words.

To perform subword tokenization, the field has repurposed an old data compression algorithm: byte pair encoding.

In this tutorial, I’ll explain what it is, what makes it so effective and how to implement it yourself.

What is byte pair encoding?

Byte pair encoding (BPE) was originally invented in 1994 as a technique for data compression. Data was compressed by replacing commonly occurring pairs of consecutive bytes by a byte that wasn’t present in the data yet.

In order to make byte pair encoding suitable for subword tokenization in NLP, some amendments have been made. Frequently occurring subword pairs are merged instead of being replaced by another character, as they would be in the original compression algorithm.

Using BPE, we make sure that frequent words are represented as single tokens whereas low frequency words are split up into multiple subword tokens.

In order to understand byte pair encoding, we’ll demonstrate how it works with an example:

Suppose you have a corpus with the following set of words and their frequencies after initial word tokenization:

{
    "book": 12,
    "nook": 8,
    "noob": 14,
    "boob": 5,
    "books": 6
}

We start off with a character vocabulary of: {"b", "k", "n", "s", "o"}.

If we split our corpus into the character vocabulary tokens, we get:

Word     Characters     Count
book     b o o k        12
nook     n o o k        8
noob     n o o b        14
boob     b o o b        5
books    b o o k s      6

With byte pair encoding, we then count all pairs of adjacent symbols and compute their frequency across words. The most frequently occurring pair is then merged.

In the example above, the most frequently occurring pair is 'o' followed by 'o', because it occurs in every word (12 + 8 + 14 + 5 + 6 = 45 times). By merging the two 'o' characters into 'oo', we get:

Word     Characters     Count
book     b oo k         12
nook     n oo k         8
noob     n oo b         14
boob     b oo b         5
books    b oo k s       6

If we again count pairs of adjacent symbols, the pair 'oo' followed by 'k' is the most frequent (occurring 12 + 8 + 6 = 26 times). We merge them and get:

Word     Characters     Count
book     b ook          12
nook     n ook          8
noob     n oo b         14
boob     b oo b         5
books    b ook s        6

If we iterate once more, we see that 'oo' followed by 'b' is the most frequent pair (14 + 5 = 19 times).

The resulting corpus would then be:

Word     Characters     Count
book     b ook          12
nook     n ook          8
noob     n oob          14
boob     b oob          5
books    b ook s        6

If we stopped iterating at this point, our tokens would be:

{"b", "ook", "n", "oob", "s"}

While this vocabulary is still the same size as the character vocabulary (or the word vocabulary, for that matter), you can imagine that if we scale this toy example up to a larger corpus, this technique greatly reduces the number of tokens needed and generalises patterns such as pluralisation and degrees of comparison.
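
In code, the outcome of this toy run is just an ordered list of merge rules plus the merged corpus. A small sketch (the variable names merges and corpus are only for illustration):

# The three merges learned above, in the order they were applied
merges = [("o", "o"), ("oo", "k"), ("oo", "b")]

# The corpus after applying them (word -> frequency)
corpus = {"b ook": 12, "n ook": 8, "n oob": 14, "b oob": 5, "b ook s": 6}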

Why does it work?

As mentioned, previous approaches lacked representations for words that may be important to your domain but didn't occur in the vocabulary of your pre-trained model.

If your pre-trained model was trained on Wikipedia data and your vocabulary is based on tweets, you might miss representations for a lot of domain-specific data. With BPE, these words can still get representations by making clever use of subword tokens.

Note that characters and subwords that never occur in the corpus we base our subword tokens on will still be replaced by <unk>. For example, Wikipedia is unlikely to contain many emojis, so if your BPE tokens are learned from Wikipedia and you suddenly feed in tweets with emojis, those will still be replaced with <unk>.

Moreover, we've seen that a vocabulary based on word tokens can often blow up with the large variety of words that occur in any given language. With byte pair encoding, we keep the vocabulary smaller because many subwords are found in multiple words in our vocabulary.

We can also learn the meaning of subwords better with a smaller vocabulary. In word-level tokenization we often need separate tokens for 'eat/eating/eats' or 'big/bigger/biggest'. With subword tokenization, we can capture the root and suffix separately, so a suffix that indicates, say, plurality or a degree of comparison becomes a generalisation of that grammatical feature. Pretty neat, right?

Because of this, we can also mix and match subword representations for words that didn't occur in the original vocabulary the tokenizer was trained on. This way, rare words will often get a representation that captures at least part of their meaning.

Implementing byte pair encoding in Python

We've seen how byte pair encoding works step by step above. Now let's code up a Python implementation.

First, import some dependencies:

from collections import Counter, defaultdict
import re
import string

Then, we load in a dataset. In this case, we'll use The Raven by Edgar Allan Poe.

def load_data():
    with open('data/raven.txt', 'r') as f:
        text = f.read()
    return text

raven = load_data()

Preprocess the text

We are now ready to preprocess the text into tokens:

def preprocess_text(text):
    punctuation = string.punctuation + '“’'
    
    text = text.lower()
    text_clean = text.translate(str.maketrans('', '', punctuation))

    tokens = text_clean.split()
    tokens = [" ".join(tok) + ' </w>' for tok in tokens]
    return tokens
    
tokens = preprocess_text(raven)

We lowercase the text, remove the punctuation for simplicity's sake, and split the text into tokens on the whitespace character.

After each 'word token', we add a </w> as an indicator of a word boundary. It's important to include this boundary indicator, because it will help the algorithm distinguish between different uses of the same character combinations.

As an example, consider the suffix ess as in happiness and other words. By using the </w> boundary token we can make a distinction between this suffix and ess found in words such as essential.
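
For example, assuming data/raven.txt starts with the poem's first line (and no title header), the first few tokens look like this:

print(tokens[:5])
# ['o n c e </w>', 'u p o n </w>', 'a </w>', 'm i d n i g h t </w>', 'd r e a r y </w>']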

Compute word frequencies

word_counts = Counter(tokens)

We use Counter from the collections module to compute the frequencies of our tokens: it maps each space-separated word (ending in </w>) to the number of times that word occurs in the poem.
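
To peek at a few entries yourself (the exact counts depend on the copy of the text you use), something like this works:

for word, freq in list(word_counts.items())[:5]:
    print(word, freq)

# keys such as 'o n c e </w>' and 'w h i l e </w>' appear here,
# each mapped to how often that word occurs in the poem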

Compute pairs

Now, it's time to compute pair frequencies between combinations of characters.

def get_pairs(word_freqs):
    pairs = defaultdict(int)
    for word, freq in word_freqs.items():
        chars = word.split()
        for pair in zip(chars, chars[1:]):   # adjacent symbol pairs
            pairs[pair] += freq
    return pairs

pairs = get_pairs(word_counts)

This results in a defaultdict that maps each pair of adjacent symbols to its frequency across the corpus.
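
To see the most frequent pairs (again, counts depend on your copy of the text), you can sort the dictionary by value:

sorted_pairs = sorted(pairs.items(), key=lambda kv: kv[1], reverse=True)
print(sorted_pairs[:5])

# the top pair is ('e', '</w>'), which is exactly what we merge in the next step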

Find the best pair

The next step is to find the best pair in this dictionary of pairs. We can use a nifty one-liner for this:

best_pair = max(pairs, key=pairs.get)

In this case, the best pair is ('e', '</w>') with 187 occurrences.

Merging the best pair

def merge_byte_pairs(best_pair, word_freq_dict):
    best_pair_merged = "".join(best_pair)
    # Only merge where both symbols are whole tokens (bounded by spaces or the
    # string edges), so we never merge inside a longer token by accident.
    pattern = re.compile(r'(?<!\S)' + re.escape(" ".join(best_pair)) + r'(?!\S)')
    merged_dict = {}
    for word in word_freq_dict:
        word_merged = pattern.sub(best_pair_merged, word)
        merged_dict[word_merged] = word_freq_dict[word]
    return merged_dict

Here, we look for every occurrence of the best pair within the tokens of our corpus and remove the space in between these characters. So in this case every occurrence of e </w> becomes e</w>.

For example, in the resulting dictionary, o n c e </w> has become o n c e</w> and w h i l e </w> has become w h i l e</w>.
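
A quick sanity check, reusing the best_pair and word_counts variables from above:

word_counts = merge_byte_pairs(best_pair, word_counts)
print('o n c e</w>' in word_counts)   # True: the space before </w> is gone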

Compute the new subword tokens

def get_subword_tokens(word_counts):
    token_counts = defaultdict(int)
    for word, freq in word_counts.items():
        for token in word.split():   # every space-separated symbol is a subword token
            token_counts[token] += freq

    return token_counts

Now that we have merged our first pair of characters, we can compute the new subword token dictionary.

In that new subword token dictionary, e and </w> have indeed been merged into a single e</w> token.
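
We can verify this directly; the count matches the 187 occurrences of the pair we just merged:

subword_tokens = get_subword_tokens(word_counts)
print(subword_tokens['e</w>'])   # 187: every merged pair now counts as one token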

Rinse, repeat!

Now, of course, one merge is not enough. We'll have to repeat this process for multiple iterations. Let's combine the steps so we can iterate easily:

raven = load_data()
tokens = preprocess_text(raven)
word_counts = Counter(tokens)

N_ITER = 150
for i in range(N_ITER):
    pairs = get_pairs(word_counts)                           # count all adjacent symbol pairs
    best_pair = max(pairs, key=pairs.get)                    # find the most frequent pair
    word_counts = merge_byte_pairs(best_pair, word_counts)   # merge it throughout the corpus
    subword_tokens = get_subword_tokens(word_counts)         # recompute the subword vocabulary

You can play with the number of iterations and see which tokens get merged at each step. :)

With 150 iterations, we get quite a few sensible subword combinations. Here are some excerpts:

{
'on': 23,
'ce</w>': 4,
'upon</w>': 5,
'a</w>': 15,
'm': 32,
'i': 48,
'd': 44,
'n': 14,
'ight</w>': 5,
'rear': 1,
'y</w>': 15,
}

Note how it made subword tokens of the article a and the common suffixes -y and -ight, and kept the whole word for the commonly occurring word upon. Note that -ight is not a real suffix per se, but because we're only using one text (a poem) in which this word ending occurs so frequently, the algorithm has picked it up as a valuable subword token.

{
'sta': 2,
'ely</w>': 3,
'raven</w>': 11,
'stly</w>': 2,
'he</w>': 7,
'bird</w>': 10,
}

Note how it has found that raven, bird and he should get single tokens, because they occur often in the text. Furthermore, I see some interesting suffixes like -ely and -stly. The prefix sta also got its own token.

Perhaps this shows even more clearly in the subword tokenization of the actual words.
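
One way to inspect this yourself is a quick sketch that reuses the word_counts variable from the training loop (the exact splits depend on your text and the number of iterations):

for word in list(word_counts)[:10]:
    print(''.join(word.split()).replace('</w>', ''), '->', word.split())

# prints each original word next to the subword tokens it was split into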

In the tokenized words, we see the following:

  • Frequent words a, i, upon and and get their own tokens
  • Common suffixes -ight, -y, -e and -ered do as well
  • Some co-occurring vowels like ea are also merged

By using subword tokenization we went from 459 unique tokens (words) to 263 subword tokens. Not bad for one text, right?
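
If you want to reproduce those numbers (this reuses the tokens and subword_tokens variables from the training loop; your exact counts may differ slightly depending on the text file), you can compare the two vocabulary sizes like this:

print(len(set(tokens)))        # 459 unique word tokens
print(len(subword_tokens))     # 263 subword tokens after 150 merges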

How to encode new sentences?

To encode new sentences using the subword tokens that we've just learned, we take the following steps:

  1. Split the sentence into word tokens and add the </w> boundary marker like we did before
  2. Iterate over our subword vocabulary, ordered from the longest to the shortest token, and check whether substrings of each word in our sentence can be replaced with a token from our subword vocabulary.
  3. If any substrings of the sentence cannot be replaced by tokens from our vocabulary, we replace them with the <unk> token.

Note that this process is quite computationally expensive. So in practice, you'd keep track of words and their subword tokens in a dictionary, like this:

word_subword_dict = {}
for word in word_counts:
    subwords = word.split()
    complete_word = ''.join(subwords)
    word_subword_dict[complete_word] = subwords

That way, whenever you have to encode a word, you can just search for it in the word_subword_dict dictionary, which is a much faster lookup.

If the word isn't found in the dictionary, as in the case of an unseen word, we fall back to the approach described in step 2 to find subwords that 'construct' as much of the unknown word as possible.

So suppose your vocabulary contains the subword tokens 'un' and 'follow' but not the word 'unfollow'; we can then 'construct' the word 'unfollow' from the subword tokens 'un' and 'follow'.

In addition, we can now add the tokenization of this new word to our dictionary in order to make the lookup faster in the future.
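
Here's a rough sketch of how that lookup-plus-fallback could look in code. The function encode_word and its arguments are my own names, not part of any library, and the fallback scans the word left to right, always taking the longest matching subword token:

def encode_word(word, subword_tokens, word_subword_dict, unk='<unk>'):
    complete = word.lower() + '</w>'             # step 1: add the word boundary marker
    if complete in word_subword_dict:            # fast path: word seen during training
        return word_subword_dict[complete]

    vocab = sorted(subword_tokens, key=len, reverse=True)   # step 2: longest tokens first
    tokens, remaining = [], complete
    while remaining:
        for tok in vocab:
            if remaining.startswith(tok):
                tokens.append(tok)
                remaining = remaining[len(tok):]
                break
        else:                                    # step 3: no subword matches this position
            tokens.append(unk)
            remaining = remaining[1:]
    word_subword_dict[complete] = tokens         # cache the result for future lookups
    return tokens

# e.g. tokenize a word that was never seen as a whole word
print(encode_word('unfollow', subword_tokens, word_subword_dict))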

Limitations of byte pair encoding

Like any approach to tokenization, byte pair encoding also has its flaws.

By nature, byte pair encoding is greedy: it merges the locally best pair at every iteration, which means the resulting vocabulary is not guaranteed to be optimal.

In addition, the resulting subword tokens depend on the number of iterations we run the algorithm for: we may end up with different tokens (and a different vocabulary size), and thus different representations, depending on how long we keep iterating.

The stopping criteria are also not well-defined. Sometimes a fixed number of merges is set, other times a threshold is set on the frequency of tokens, and the algorithm stops when all token frequencies fall below that.

Finally, on a linguistic level, in order to use meaningful subwords to construct 'new words', we assume these subwords carry some linguistic meaning, which may not always be the case. As a toy example, consider ending up with the tokens 'h' and 'itting', whereas 'hit' (+ 't') and 'ing' would have been more satisfactory linguistically.

There is some research going on to tackle some of these issues, though; see this paper by Vilar and Federico if you're interested.

Conclusion

In this article, we dove deep into byte pair encoding: what it is, how it works, and a full implementation in Python.

Even though Byte Pair Encoding has its limitations, it is widely used in many modern NLP applications. I'm quite sure that the next few years will see some great contributions from academia and industry on how to improve it even further.

Let's keep in touch! 📫

If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.

If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.