Tokenization in NLP Explained: Transforming Text into Insights (Part 1 - Basic Techniques)
Subhra
In the ever-evolving world of machine learning and AI, NLP, or Natural Language Processing, stands out because of its unique ability to enable machines to communicate the way humans do. It sits at the heart of many everyday tools we use, such as chatbots that know what to say to us and search engines that find what we are looking for. But how do these machines make sense of human language? The answer begins with a simple but incredibly powerful technique called tokenization, which transforms human language into structured, understandable data for machines. This is the first of a two-part series that will walk you through several tokenization methods, each with its pros, cons, and typical uses.
Definition and Importance
Tokenization is the process of breaking down text into smaller, more manageable chunks called tokens, which can be words, subwords, characters, numbers, or punctuation marks, so that algorithms can easily process, analyse, and understand human language. It is the first of several steps in a journey that transforms raw text into meaningful insights, laying the foundation for machines to interact with human language in ways that feel increasingly natural and intuitive.
Like letters, which are the building blocks of words, tokens are the foundational elements that enable algorithms to process and interpret language. Tokens provide a structure that algorithms can work with, paving the way for more sophisticated NLP applications and opening up a world of possibilities for more intuitive and intelligent technology. Some prominent use cases where tokenization plays a significant role include machine translation, search engines, speech recognition, and sentiment analysis.
Types of Techniques
Tokenization techniques can be broadly divided into basic and advanced methods, each tailored to address specific challenges and requirements of language processing. Basic tokenization methods offer a straightforward way to break down text; they include Word (or whitespace) Tokenization, Character Tokenization, Punctuation-Based Tokenization, Dictionary-Based Tokenization, Rule-Based Tokenization, and Sentence Tokenization. These are instrumental in the initial processing and analysis of text, providing simple yet effective ways to dissect it into manageable tokens.
On the other end of the spectrum, advanced tokenization methods delve deeper into the linguistic nuances, pushing the boundaries of text analysis even further. Subword Tokenization, including methods like Byte-Pair Encoding (BPE) and its variant, Byte-Level Byte-Pair Tokenization, Unigram Tokenization, WordPiece Tokenization, SentencePiece Tokenization, Morphological Tokenization, and Multilingual Tokenization are some of the more advanced tokenization methods currently available which offer sophisticated approaches to deal with the intricacies of human language. These methods excel at handling novel words, reducing the size of the vocabulary, and accommodating linguistic diversity.
Let’s explore each of the basic tokenization techniques in more detail:
1. Word (whitespace) Tokenization:
This is one of the simplest yet highly effective methods: it breaks down text into its constituent words, using white spaces as delimiters. It is very effective for texts where spaces clearly separate words, though it treats everything between two whitespaces as a single token.
Example: Let’s consider the below sentence:
“AI, in 2024, is revolutionizing data analytics.”
Using Word tokenization, this sentence will be broken down into the following 7 tokens: 'AI,', 'in', '2024,', 'is', 'revolutionizing', 'data', 'analytics.'
Notice that the punctuation, the comma (,), is included in the token along with the word, e.g. 'AI,'. You can pre-process the text by removing punctuation before applying Word tokenization to prevent such instances.
Advantages:
Simplicity and ease of implementation.
Facilitates basic NLP tasks such as word frequency count and preliminary analysis.
Disadvantages:
Inability to recognize and preserve the context of words within the text.
Treats each word independently without taking into account its relationship with adjacent words.
Struggles with compound words and phrases that don't adhere to simple whitespace delimitation.
Implementation:
The following code snippet is a basic approach to implementing Word tokenization in Python, using the built-in split() method:
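```python
text = "AI, in 2024, is revolutionizing data analytics."

# Split on whitespace; punctuation stays attached to the neighbouring word
tokens = text.split()

print(tokens)
# ['AI,', 'in', '2024,', 'is', 'revolutionizing', 'data', 'analytics.']
```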
2. Character Tokenization:
It breaks down text into its most granular units: individual characters such as letters, punctuation marks, digits, and even white spaces. It's particularly useful for analysing languages with complex word structures, where a small change at the character level can have a big impact on the overall meaning of the text.
Example: Let’s consider the same sentence as before:
“AI, in 2024, is revolutionizing data analytics.”
Using Character tokenization, this piece of text will be broken down into the following 47 tokens (whitespaces included): 'A', 'I', ',', ' ', 'i', 'n', ' ', '2', '0', '2', '4', ',', ' ', 'i', 's', ' ', 'r', 'e', 'v', 'o', 'l', 'u', 't', 'i', 'o', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 'd', 'a', 't', 'a', ' ', 'a', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', '.'
Advantages:
Simple and straightforward to implement without the need for any pre-processing of text
Can capture the subtleties of a language where a minor change in characters can alter the meaning
Disadvantages:
Processing at the character level increases the number of tokens significantly, resulting in high computational cost
Loses higher-level meaning of the text due to the inability to recognize and preserve the word boundaries or syntactic structure
Implementation:
The following code snippet is a simple approach to implementing Character tokenization in Python, using the built-in list() constructor:
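```python
text = "AI, in 2024, is revolutionizing data analytics."

# Convert the string into a list of its individual characters (whitespace included)
tokens = list(text)

print(tokens[:6])   # ['A', 'I', ',', ' ', 'i', 'n']
print(len(tokens))  # 47
```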
3. Punctuation-based Tokenization:
It breaks down the text by recognizing punctuation marks as the primary delimiters and focusing on the structural elements denoted by them. It’s good at segmenting texts into phrases and compound structures denoted by punctuation.
Example: Let’s consider the same sentence as before:
“AI, in 2024, is revolutionizing data analytics.”
As you can see, the above sentence contains two types of punctuation: the comma (,) and the period (.). Using Punctuation-based tokenization, it will be broken down into the following 3 tokens: 'AI', 'in 2024', 'is revolutionizing data analytics'.
Advantages:
Efficiently handles sentences and some compound structures that rely on punctuation
Helps in evaluating readability and the structure of text
Disadvantages:
Punctuation marks used for other purposes, such as abbreviations or decimal points, can be misinterpreted as delimiters, causing incorrect tokenization
May not capture the full context or nuances of the language
Implementation:
Since Python's standard libraries don't provide a dedicated method for punctuation-based tokenization, we can achieve it with regular expressions using the `re` module. Here is one way to implement Punctuation-based tokenization, splitting on common punctuation marks:
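```python
import re

text = "AI, in 2024, is revolutionizing data analytics."

# Split on common punctuation marks, then strip whitespace and drop empty segments
segments = re.split(r"[,.;:!?]", text)
tokens = [segment.strip() for segment in segments if segment.strip()]

print(tokens)
# ['AI', 'in 2024', 'is revolutionizing data analytics']
```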
4. Dictionary-based Tokenization:
It breaks down the text based on a predefined dictionary or set of words. This technique matches text segments against the entries in the dictionary to identify and separate valid tokens. It's particularly useful in languages like Chinese and Japanese, where words are not separated by spaces, as well as in fields with specialised vocabulary and acronyms.
Process:
A comprehensive dictionary or vocabulary of valid tokens (words, phrases, and symbols) is prepared in advance, which serves as the basis for tokenization.
The tokenization algorithm searches for a sequence of characters from the text that matches an existing token in the dictionary.
Once a match is found, the text segment is considered as a valid token and separated from the rest of the text.
The algorithm then goes on to search for the next token from the rest of the text.
The unmatched text segments can either be marked as unknown tokens or tokenized using a different technique.
Example: Let’s consider the same sentence as before:
“AI, in 2024, is revolutionizing data analytics.”
Assuming our predefined dictionary includes the following words: {'AI', '2024', 'revolutionize', 'data', 'analytics', …}, Dictionary-based tokenization will generate valid tokens such as 'AI', '2024', 'data', and 'analytics', and unknown tokens such as 'in', 'is', and 'revolutionizing' (which does not exactly match the dictionary entry 'revolutionize').
Advantages:
Ensures accurate tokenization for known terms and phrases
Excels in domain-specific text analysis, such as medical or legal documents analysis with tailored dictionaries
Disadvantages:
Limited by the fixed vocabulary, hence may lead to a significant number of unknown tokens
Labor-intensive to maintain the dictionary by updating it with new words and phrases
May miss the true context of words which may lead to incorrect tokenization
Implementation:
The standard libraries in Python don't provide a direct method for this tokenization technique either. The following code snippet is a basic, word-level approach to Dictionary-based tokenization for the above example; a full implementation for languages without spaces would instead scan the text character by character for dictionary matches:
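```python
import re

text = "AI, in 2024, is revolutionizing data analytics."

# Predefined dictionary of valid tokens (the subset given in the example above)
dictionary = {"AI", "2024", "revolutionize", "data", "analytics"}

# Split the text into candidate tokens on whitespace and punctuation
candidates = [c for c in re.split(r"[\s,.;:!?]+", text) if c]

# Separate the candidates into valid tokens (found in the dictionary) and unknown tokens
valid_tokens = [c for c in candidates if c in dictionary]
unknown_tokens = [c for c in candidates if c not in dictionary]

print("Valid tokens:", valid_tokens)      # ['AI', '2024', 'data', 'analytics']
print("Unknown tokens:", unknown_tokens)  # ['in', 'is', 'revolutionizing']
```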
5. Rule-based Tokenization:
It breaks down the text based on customised rules, often utilizing regular expressions (regex) to segment text into tokens. Because of the great degree of flexibility this method provides, customised rules can be created to accurately handle linguistic intricacies, such as handling compound phrases, abbreviations, and other linguistic patterns. It is particularly effective when you need specific tokenization techniques that go beyond basic punctuation or whitespace delimitation.
Example: Let’s consider a different sentence with more variations than the earlier one:
“On average, an AI tool will cost $100.00 in 2024.”
Let's assume the rule is to break down the text into common words, punctuation, monetary amounts, and numbers. With this Rule-based tokenization method, the above sentence will be broken down into the following tokens: 'On', 'average', ',', 'an', 'AI', 'tool', 'will', 'cost', '$100.00', 'in', '2024', '.'
Advantages:
Flexibility to create custom rules to fit the specific needs
Ability to accurately handle complex texts with compound phrases, abbreviations, acronyms and other special linguistic features
Disadvantages:
Designing and maintaining rules to cover all cases can be challenging
Potential to overfit the tokenization process to specific characteristics of a corpus of text
Implementation:
Below is a Python code snippet using regex to implement a basic rule-based tokenization that could be applied to our example sentence:
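```python
import re

text = "On average, an AI tool will cost $100.00 in 2024."

# Rules, tried left to right: monetary amounts, plain numbers, words, punctuation
pattern = r"\$\d+(?:\.\d+)?|\d+|\w+|[^\w\s]"
tokens = re.findall(pattern, text)

print(tokens)
# ['On', 'average', ',', 'an', 'AI', 'tool', 'will', 'cost', '$100.00', 'in', '2024', '.']
```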
6. Sentence Tokenization:
It breaks down the text into its constituent sentences, while maintaining the context that each sentence carries. This method leverages punctuation such as question marks (?), exclamation marks (!), periods (.), and so on to recognize the boundaries of sentences. This helps in deciphering the overall meaning and flow of the text. This method is very useful for complex tasks such as sentiment analysis, where sentiment can vary from sentence to sentence within a document, and text summarization, where synthesizing and analysing the content on a sentence-by-sentence basis is crucial.
Example: Let’s consider the following text:
“AI, in 2024, is revolutionizing data analytics. It's transforming industries, one algorithm at a time. What's next? On average, an AI tool will cost $100.00 in 2024.”
Sentence tokenization would break this text down into the following four individual sentences:
"AI, in 2024, is revolutionizing data analytics."
"It's transforming industries, one algorithm at a time."
"What's next?"
"On average, an AI tool will cost $100.00 in 2024."
Advantages:
Preserves the full context of each sentence
Maintains the structure and flow of longer texts, thereby facilitating further analysis
Disadvantages:
Complex punctuation and dialogue can pose challenges while identifying sentence boundaries
Certain abbreviations, decimal points, and usage of periods (.) for purposes other than sentence demarcation may be misconstrued as sentence boundaries
Implementation:
Here's a Python code snippet using the NLTK package, a popular NLP library, to implement sentence tokenization:
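```python
import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt sentence tokenizer models (only needed once)
nltk.download("punkt")

text = ("AI, in 2024, is revolutionizing data analytics. "
        "It's transforming industries, one algorithm at a time. "
        "What's next? On average, an AI tool will cost $100.00 in 2024.")

sentences = sent_tokenize(text)
for sentence in sentences:
    print(sentence)
# AI, in 2024, is revolutionizing data analytics.
# It's transforming industries, one algorithm at a time.
# What's next?
# On average, an AI tool will cost $100.00 in 2024.
```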
We have explored the fundamentals of basic tokenization techniques, from straightforward word, character, and punctuation-based tokenization to the more sophisticated sentence, rule-based, and dictionary-based approaches. Each of these techniques offers unique features, addressing specific needs in the field of text analysis. The second part of this post will delve into advanced tokenization techniques such as Byte-Pair Encoding, giving you a sneak peek into the methods at the forefront of tackling some of the most complex challenges in NLP.