Traditional Tokenization Methods for NLP in Python
Tokenization is a crucial step when working with text and building effective language models. But what is tokenization exactly, and how does it work?
In this article, I'll dive into traditional methods for tokenization. I'll explain what tokenization for Natural Language Processing (NLP) is, why we need it, and various approaches to implementing it in practice.
What is tokenization in NLP?
Tokenization can be defined as a way to split up a text into smaller units called tokens. Tokens can be words, characters or subwords. Traditionally, tokenization is primarily done on the word level, but the latest language models are starting to use subword tokenization more and more.
This blog post will focus on traditional word tokenization; I will cover subword tokenization techniques in a different blog post.
Why do we need it?
Tokenization is the first step in getting natural language ready for your model to process. It is important because it determines how your data, your text, will be presented to your model. The decisions you make in tokenization can therefore directly impact your model performance.
Tokenization is applied to every document in your corpus, resulting in a list of unique tokens. These tokens are then used to determine the final vocabulary.
The occurrences of the tokens in the text can be directly used as feature vectors in your machine learning model. This is basically how scikit-learn's CountVectorizer (a bag-of-words vectorizer) works. Of course, this is not the only way to convert tokens to feature vectors. Other examples of vectorization are tf-idf (term frequency-inverse document frequency) and (word) embeddings. I could write a whole new blog post about that topic alone. 😁
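As a rough sketch of that idea (assuming scikit-learn is installed; the two toy documents are made up for illustration):
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["the sheep is sick", "the sick sheep sleeps a lot"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(toy_corpus)    # sparse matrix of token counts
print(vectorizer.get_feature_names_out())   # the vocabulary learned from the corpus (get_feature_names on older scikit-learn versions)
print(X.toarray())                          # one count (feature) vector per document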
Approaches
There are multiple ways to go about (traditional) word tokenization. I’ll cover four approaches in this blog post:
- Whitespace tokenization
- NLTK’s word_tokenize
- Regex tokenization
- Spacy’s Tokenizer
For the approaches described, we will be using the following two sentences to demonstrate the different tokenization results:
tongue_twister = """The chic Sikh's sixty-sixth sheep is sick."""
tweet = """Elon Musk Sells $1.1 Billion In Tesla Stock, But Docs Reveal He Planned To Regardless Of Twitter Poll."""
Also make sure to install and import the following packages:
from nltk import word_tokenize, RegexpTokenizer
import spacy
And download Spacy's small English model from the command line:
python -m spacy download en_core_web_sm
Whitespace tokenization
The simplest approach to tokenization is to simply split your text on whitespace. While this may be simple, it is far from perfect.
Let's illustrate this with the tongue twister and the tweet. By default, the split() method splits on whitespace, which is exactly what we want:
tokens_tt = tongue_twister.split()
print(tokens_tt)
This outputs:
['The',
'chic',
"Sikh's",
'sixty-sixth',
'sheep',
'is',
'sick.']
tokens_tweet = tweet.split()
print(tokens_tweet)
This outputs:
['Elon',
'Musk',
'Sells',
'$1.1',
'Billion',
'In',
'Tesla',
'Stock,',
'But',
'Docs',
'Reveal',
'He',
'Planned',
'To',
'Regardless',
'Of',
'Twitter',
'Poll.']
Note how this method doesn’t split off the apostrophe s of Sikh’s and keeps the commas and periods attached to the words before them. Not optimal.
Tokenization methods in NLTK
NLTK, or the Natural Language Toolkit, is a Python library for statistical natural language processing. Together with Spacy, it is often the first place to go for anything from simple to more advanced text processing. It provides tooling ranging from tokenization and part-of-speech tagging to creating collocations, concordances and syntactic trees.
For this tutorial, all you need to install is nltk
. Note that for certain features, you’ll have to download specific nltk datasets.
word_tokenize
This is NLTK’s most popular tokenization option. With word_tokenize()
all you need is a text (sentence) and a language to tokenize in (every language has different punctuation rules / patterns for sentence tokenization). The default language is English.
The word_tokenize
function first splits the text into sentences using the sent_tokenize
function. This is the default, but you can switch off this step by setting the boolean preserve_line
to True.
Next, it uses an improved version of the Treebank word tokenizer to split each sentence into tokens.
Under the hood, the Treebank word tokenizer uses a lot of regexes. If you want to know exactly which regexes are applied and how it is implemented, have a look at NLTK's documentation.
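To get a feel for this, here's a minimal sketch on a made-up two-sentence string (it needs the Punkt models, see the download note further down):
from nltk import sent_tokenize, word_tokenize

text = "The sheep is sick. The vet was called."
print(sent_tokenize(text))                      # ['The sheep is sick.', 'The vet was called.']
print(word_tokenize(text))                      # sentence-splits first, then tokenizes each sentence
print(word_tokenize(text, preserve_line=True))  # skips the sentence-splitting step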
So now you must be wondering: how well does it do? Well, let's test it out on our sentences.
tokens_tt = word_tokenize(tongue_twister)
print(tokens_tt)
This outputs:
['The',
'chic',
'Sikh',
"'s",
'sixty-sixth',
'sheep',
'is',
'sick',
'.']
tokens_tweet = word_tokenize(tweet)
print(tokens_tweet)
This outputs:
['Elon',
'Musk',
'Sells',
'$',
'1.1',
'Billion',
'In',
'Tesla',
'Stock',
',',
'But',
'Docs',
'Reveal',
'He',
'Planned',
'To',
'Regardless',
'Of',
'Twitter',
'Poll',
'.']
As you can see, the apostrophe s is now split correctly, as is the punctuation. Depending on your use case you may or may not want the dollar sign to be split off from the quantity.
Note that if you would like to specify a different language to the word_tokenize function, you may get a LookupError. This is because NLTK doesn't download everything for you out of the box; you'll have to download any 'extras' yourself on an as-needed basis. You can simply run the following in a Jupyter notebook (or Python shell) to download the Punkt models used for sentence tokenization, in English and other languages alike:
import nltk
nltk.download('punkt')
RegexpTokenizer
With regex tokenization, we apply a regex to our text to ‘craft’ our tokens. For this we can use the RegexpTokenizer in NLTK, to which you supply your own regex pattern. It’s mostly useful when you need to find a very specific pattern that a regular tokenizer wouldn’t isolate, for example hashtags, emojis or mentions in a tweet.
If we use the following:
hashtag_tokenizer = RegexpTokenizer(r'#\w+|@\w+')
tweet_hashtag = "gotta love tokenization #cool #sweet #amazing @lucytalksdata"
hashtag_tokenizer.tokenize(tweet_hashtag)
We are able to extract:
['#cool', '#sweet', '#amazing', '@lucytalksdata']
For more general tokenization, I wouldn’t use this particular one since other functions (e.g. word_tokenize
) are available that are based on years of research and therefore have already figured out good patterns. No need to reinvent any wheels. 🤓
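Just to illustrate how such a pattern quickly grows if you did try, here's a rough general-purpose sketch (words, dollar amounts, and any other run of non-whitespace characters):
general_tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
print(general_tokenizer.tokenize(tweet))
# '$1.1' stays together and 'Stock,' becomes 'Stock' + ',', but edge cases pile up quickly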
Tokenization in Spacy
Spacy is another Python library for natural language processing. It prides itself on its industrial-strength text processing pipelines, which are ready for production use in terms of both performance and developer experience. Besides tokenization, its pipelines boast a plethora of other features, such as named entity recognition (NER), dependency parsing, part-of-speech tagging and word vectors. It has language models for many different languages, as well as a multi-language model.
Spacy’s tokenizer
Spacy works with language models. These language models can be downloaded and loaded in as nlp
objects. For this tutorial, you've already downloaded en_core_web_sm
, so we can load that in like so:
nlp = spacy.load("en_core_web_sm")
Calling this nlp object on your text will return a Doc, which consists of the text plus annotations. Each token’s text can be accessed via the .text attribute.
doc_tt = nlp(tongue_twister)
print([token.text for token in doc_tt])
This returns:
['The',
'chic',
'Sikh',
"'s",
'sixty',
'-',
'sixth',
'sheep',
'is',
'sick',
'.']
As you can see, the hyphenated word sixty-sixth is now split into three tokens, unlike with NLTK's word_tokenize. The apostrophe s and the period are correctly split off.
Let's have a look at the tweet again:
doc_tweet = nlp(tweet)
print([token.text for token in doc_tweet])
This returns:
['Elon',
'Musk',
'Sells',
'$',
'1.1',
'Billion',
'In',
'Tesla',
'Stock',
',',
'But',
'Docs',
'Reveal',
'He',
'Planned',
'To',
'Regardless',
'Of',
'Twitter',
'Poll',
'.']
As you can see, Spacy tokenizes this sentence the same way as NLTK's word_tokenize does.
The tokenization process is explained clearly in the Spacy docs, but roughly it follows these rules. The text is first split on whitespace and then processed from left to right. For each candidate token, the tokenizer performs two checks:
- Does it match an exception rule? For example, let's does not contain any whitespace, but the 's should get a separate token.
- Can a prefix, suffix or infix be split off? E.g. punctuation like periods and commas.
The exception rules are often language dependent, and are stored in the language models. So make sure to use a language model that matches your text's language.
If you have a specific use case that is not covered by the out-of-the-box tokenizer, you can add custom rules to your tokenizer.
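For example, here's a minimal sketch of adding an exception rule, borrowing the "gimme" example from the Spacy docs:
from spacy.symbols import ORTH

# always split "gimme" into two tokens: "gim" + "me"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([token.text for token in nlp("gimme the tongue twister")])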
Moreover, you can also add a third-party tokenizer to your Spacy pipeline if you’re using Spacy for other tasks as well, e.g. a WordPiece tokenizer for your BERT model.
These customizations make Spacy very flexible for many different types of applications.
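As a sketch of how pluggable this is, the snippet below wraps the naive whitespace approach from earlier as a drop-in tokenizer; a WordPiece tokenizer would be hooked in the same way:
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Deliberately naive custom tokenizer: split on whitespace only (not WordPiece)."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()
        return Doc(self.vocab, words=words)

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
print([token.text for token in nlp(tongue_twister)])  # back to whitespace-level tokens
# re-run spacy.load("en_core_web_sm") afterwards to restore the default tokenizer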
Approaches Wrap-up
So we've seen four types of tokenization: whitespace, NLTK's word_tokenize, custom regex and Spacy's tokenizer.
In general, the whitespace tokenizer performs subpar for most intents and purposes because it does not handle punctuation properly. The regex tokenizer is useful for custom use cases, but for all-purpose tokenization I would opt for NLTK's word_tokenize or Spacy's tokenizer.
I personally have a slight preference for Spacy's tokenizer for the simple reason that Spacy is more robust for production systems. It's flexible and actively maintained, whereas NLTK is more research-focused. This means NLTK might have slightly better accuracy at times, but it is more tedious to use in production and doesn't always have industry users in mind.
Why is tokenization hard?
Tokenization may seem ‘easy’ for languages using the Latin alphabet, but this doesn’t apply to all languages. Languages like Chinese and Arabic are more complicated to tokenize. More on that below.
Furthermore, the definition of a token is not set in stone; it depends on the use case. We might ascribe different meanings to what constitutes a token, depending on the context. Out-of-the-box tokenizers have the most general use case in mind. If you need something more custom, you’ll have to build it yourself.
Limitations of word tokenization
With word tokenization every unique word in your corpus becomes a token. If you don’t apply any filtering (removing stopwords, stemming etc.), the vocabulary size might blow up. If you one-hot encode the text, your feature count will be huge and your feature space sparse.
Moreover, your model will only know about the words it saw during training. In practice, this means it will likely encounter a lot of out-of-vocabulary words during inference. We can use a dummy token like UNK (unknown) for those, but that essentially means all unknown words get the same token, virtually losing their meaning.
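A minimal sketch of that idea (the vocabulary and tokens are made up for illustration):
# toy vocabulary built at training time; index 0 is reserved for unknown tokens
vocab = {"<UNK>": 0, "the": 1, "sheep": 2, "is": 3, "sick": 4}
tokens = ["the", "llama", "is", "sick"]
ids = [vocab.get(token, vocab["<UNK>"]) for token in tokens]
print(ids)  # [1, 0, 3, 4] -- "llama" was never seen during training, so it collapses to <UNK>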
There are remedies, though. Many of the more recent NLP models work with subword tokenization techniques. This way, the vocabulary is smaller and meaningful relationships such as comparisons (big - bigger - biggest) can be learned. Another blog post will cover these techniques in more detail.
A note on tokenization for non-Latin alphabets
Up until now, I’ve mainly focused on texts in languages using the Latin alphabet. In these languages, texts can indeed be split by white space and punctuation. Any of the packages that I mentioned before will give you decent results and you can pick the one that suits your use case best.
However, there are also languages that don’t separate words using whitespace and punctuation. Take Chinese as an example: a word can consist of one or more characters, and words are written without spaces in between, so there is no obvious boundary to split on. Figuring out those boundaries is called word segmentation and is an active area of research.
I’m not going into detail here; I might in the future if you’d like to read more about it. But there are actually a number of good word segmentation libraries out there, for instance pywordseg.
Spacy also has a Chinese language model. I haven’t tried it out, but knowing Spacy, that might be a good place to start as well.
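A starting point might look like this (assuming you've downloaded the model with python -m spacy download zh_core_web_sm; I can't vouch for the segmentation quality):
nlp_zh = spacy.load("zh_core_web_sm")
doc_zh = nlp_zh("我喜欢自然语言处理")  # "I like natural language processing"
print([token.text for token in doc_zh])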
Conclusion
In this blog post, I've explained what tokenization is and how you can implement word tokenization yourself using NLTK and Spacy.
I've also mentioned some of the limitations of using word tokenization in your models as well as challenges for non-Latin languages.
I'm curious, how do you tokenize your texts?
Very soon, I'll also release a blog post on subword tokenization, so stay tuned for that! :)
Let's keep in touch! 📫
If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.
If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.