N-grams are a basic tool in NLP: they allow text data to be analyzed by breaking it down into smaller, manageable sequences.

An N-gram is a contiguous sequence of n items (or tokens) from a given sample of text or speech. In the context of natural language processing (NLP) and text analysis, these items are typically words or characters.

N-grams are used to analyze and model the structure of language, and they can help in various tasks such as text classification.

Types of N-grams

  • Unigram: An N-gram where n = 1. It represents individual words or tokens. For example, in the sentence “I love AI”, the unigrams are [“I”, “love”, “AI”].

  • Bigram: An N-gram where n = 2. It represents pairs of consecutive words. For the same sentence, the bigrams would be [“I love”, “love AI”].

  • Higher-order N-grams: N-grams where n is 3 or more, such as trigrams (n = 3), 4-grams (quadgrams), 5-grams, and so on. For the same sentence, the only trigram is [“I love AI”].
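The definitions above can be sketched in a few lines of plain Python. This is only an illustration: the helper name ngrams is hypothetical, and splitting on whitespace is a simplifying assumption standing in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the token list.
    # If n is larger than the number of tokens, the result is an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: whitespace splitting is good enough for this toy sentence.
tokens = "I love AI".split()

unigrams = ngrams(tokens, 1)  # [('I',), ('love',), ('AI',)]
bigrams = ngrams(tokens, 2)   # [('I', 'love'), ('love', 'AI')]
trigrams = ngrams(tokens, 3)  # [('I', 'love', 'AI')]

# The same helper yields character N-grams when given a list of characters.
char_bigrams = ngrams(list("love"), 2)  # [('l', 'o'), ('o', 'v'), ('v', 'e')]
```

A sentence with k tokens has k - n + 1 N-grams of order n, which is why the three-word example yields three unigrams, two bigrams, and one trigram.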

Code implementations:

This can be done through keyword arguments to scikit-learn's CountVectorizer, such as ngram_range.