Preprocessing in NLP tasks, also called normalization, involves reducing words to their base or root form, converting them to lowercase, and removing stop words.
What are some steps involved in the pre-processing of a text? These include making the text lowercase, removing punctuation, tokenizing the text (splitting a sentence into its words), removing stop words (as they convey grammar rather than meaning), and word stemming (reducing words to their stems).
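The first few steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the NLTK pipeline: the function name basic_preprocess and the tiny STOP_WORDS set are assumptions made up for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real pipeline would use a fuller list such as NLTK's English stop words.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def basic_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # Replace anything that is not a lowercase letter, digit or whitespace.
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(basic_preprocess("The cats are sitting in the garden, purring!"))
# ['cats', 'sitting', 'garden', 'purring']
```

Stemming is left out here because it needs a proper algorithm such as Porter's.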
Tokenisation: splits text into smaller units, such as words or sentences.
Stemming: returns the part of a word that doesn't change, i.e. “breaks” and “breaking” both give “break”. Use:
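To make the two kinds of tokenisation concrete, here is a rough regex-based sketch. It is a simplified stand-in for what NLTK's sent_tokenize and word_tokenize do; the helper names split_sentences and split_words are invented for this example, and the rules are far cruder than NLTK's (for instance, abbreviations like "e.g." would be split wrongly).

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Naive word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Tokenisation splits text. It works on words, too!"
sentences = split_sentences(sample)
words = split_words(sentences[1])

print(sentences)  # ['Tokenisation splits text.', 'It works on words, too!']
print(words)      # ['It', 'works', 'on', 'words', ',', 'too', '!']
```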
from nltk.stem.porter import PorterStemmer
temp = text  # text must already be tokenized (a list of word strings)
porter_stemmer = PorterStemmer()
stemmed_tokens = [porter_stemmer.stem(token) for token in temp]
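A small usage sketch of the stemmer on a hand-picked word list (the sample words are assumptions chosen for illustration). It shows that a stem is not always a real word: “studies” becomes “studi”.

```python
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# Sample tokens chosen for illustration; any list of word strings works.
words = ["breaks", "breaking", "running", "studies"]
stems = [porter_stemmer.stem(w) for w in words]

print(stems)  # ['break', 'break', 'run', 'studi']
```

The Porter stemmer is rule-based and needs no corpus download, unlike the WordNet lemmatizer.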
Lemmatization: reducing a word to its base (dictionary) form, e.g. the words “is”, “was”, “were” will turn into “be”.
from nltk.stem.wordnet import WordNetLemmatizer
temp = text  # text must already be tokenized (a list of word strings)
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [wordnet_lemmatizer.lemmatize(token, pos="v") for token in temp]
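The pos="v" argument above matters because WordNetLemmatizer.lemmatize treats every word as a noun (pos="n") unless told otherwise, so verbs such as “were” are left unchanged by default. The toy lookup below mimics that behaviour; LEMMA_TABLE and toy_lemmatize are hypothetical stand-ins for WordNet, not part of NLTK.

```python
# Hypothetical miniature lemma table standing in for WordNet.
# Keys are (word, part of speech); "v" = verb, "n" = noun.
LEMMA_TABLE = {
    ("is", "v"): "be",
    ("was", "v"): "be",
    ("were", "v"): "be",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}

def toy_lemmatize(word, pos="n"):
    """Return the table entry for (word, pos), or the word unchanged if absent."""
    return LEMMA_TABLE.get((word, pos), word)

print([toy_lemmatize(w, pos="v") for w in ["is", "was", "were"]])  # ['be', 'be', 'be']
print(toy_lemmatize("meeting", pos="v"))  # meet
print(toy_lemmatize("meeting", pos="n"))  # meeting
print(toy_lemmatize("were"))              # were (no noun entry, so unchanged)
```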
Main normaliser:
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# porter_stemmer and wordnet_lemmatizer are the instances created above
def normalize_document(document, stemmer=porter_stemmer, lemmatizer=wordnet_lemmatizer):
    """Normalizes data by performing the following steps:
    1. Changing each word in the corpus to lowercase.
    2. Replacing special characters and punctuation with spaces.
    3. Dividing text into tokens.
    4. Removing English stopwords.
    5. Stemming words.
    6. Lemmatizing words.
    """
    stop_words = set(stopwords.words("english"))  # build once, not per token
    temp = document.lower()
    temp = re.sub(r"[^a-zA-Z0-9]", " ", temp)
    temp = word_tokenize(temp)
    temp = [t for t in temp if t not in stop_words]
    temp = [stemmer.stem(token) for token in temp]  # use the stemmer argument
    temp = [lemmatizer.lemmatize(token) for token in temp]
    return temp