Tokenisation is a core step in natural language processing (NLP) where raw text is split into smaller units known as tokens. These tokens can be words, sentences, or subwords, depending on the level of analysis. Tokenisation prepares text for downstream tasks like embedding, classification, or parsing.
Tokenisation can be performed at different granularities (word, sentence, or subword level), and ready-made tokenisers are available in libraries such as NLTK (used in the examples below), spaCy, and Hugging Face Tokenizers.
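As a quick illustration of a library tokeniser, the sketch below tokenises a placeholder sentence with spaCy; it assumes the en_core_web_sm model has been installed (e.g. via python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")    # small English pipeline (assumed to be installed)
doc = nlp("Tokenisation splits raw text into tokens.")
print([token.text for token in doc])  # word-level tokens, with punctuation kept as separate tokens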
Types of Tokenisation
1. Word Tokenisation
Splits text into individual words, with punctuation marks kept as separate tokens.
from nltk.tokenize import word_tokenize

# text_original is the raw input string; requires the NLTK 'punkt' tokeniser data
tokens = word_tokenize(text_original)
print(tokens)
2. Sentence Tokenisation
Segments text into full sentences.
from nltk.tokenize import sent_tokenize

# split the same raw string into sentence-level tokens
sentences = sent_tokenize(text_original)
print(sentences)
3. Custom Tokenisation with Cleaning
For cleaner preprocessing (e.g. lowercasing, removing bracketed reference numbers and other non-alphanumeric characters, and filtering stopwords):
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires the NLTK 'punkt' and 'stopwords' resources
temp = text_original.lower()                   # lowercase the raw text
temp = re.sub(r"\[[0-9]+\]", " ", temp)        # remove bracketed numbers, e.g. citation markers like [12]
temp = re.sub(r"[^a-zA-Z0-9]", " ", temp)      # keep only letters and numbers
tokens = word_tokenize(temp)
stop_words = set(stopwords.words("english"))   # a set makes membership checks fast
tokens_clean = [t for t in tokens if t not in stop_words]
print(tokens_clean)
Special Tokens in Transformer Models
Modern models like BERT use special tokens during tokenisation to preserve input structure (see the example after this list):
- [CLS]: Marks the start of the input sequence; used for classification tasks.
- [SEP]: Separates different segments (e.g. question from context in QA tasks).
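As a minimal sketch of how these tokens appear in practice, the snippet below encodes a question and a context passage with the Hugging Face transformers library; the bert-base-uncased checkpoint and the sample strings are assumptions for illustration only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("What is tokenisation?", "Tokenisation splits text into tokens.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# the printed sequence starts with [CLS] and places a [SEP] after each segment
The first [SEP] marks the boundary between the question and the context, matching the QA use case described above.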