This article shows how you can do stemming and lemmatization on your text using NLTK.
You can read an introduction to NLTK in this article: Introduction to NLP & NLTK
The main goal of stemming and lemmatization is to convert related words to a common base/root word. It’s a special case of text normalization.
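The examples below assume that NLTK is already installed and that the resources they use are available. If you are running them for the first time, you may need this one-time download step (the resource names shown are the standard NLTK ones):
import nltk
nltk.download('punkt')    # tokenizer models used by word_tokenize
nltk.download('wordnet')  # lexical database used by WordNetLemmatizer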
STEMMING
Stemming a word means returning the stem of that word. A single word can have different forms, but all the different forms of that word have a single stem/base/root word. The stem does not need to be identical to the morphological root of the word.
Example:
The word work will be the stem word for working, worked, and works.
working => work
worked => work
works => work
Loading Stemmer Module
There are many stemming algorithms. The Porter stemming algorithm is the most popular one.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print (stemmer.stem('working')) # output: work
print (stemmer.stem('works')) # output: work
print (stemmer.stem('worked')) # output: work
Stemming text document
We need to first convert the text into word tokens. After that, we can stem each word of the token list.
We can see in the below code that the word jumps has been stemmed to jump, and the word lazy has been stemmed to lazi.
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
text = "A quick brown fox jumps over the lazy dog."
# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()
# tokenize text
words = word_tokenize(text)
print (words)
'''
Output:
['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
stemmer = PorterStemmer()
words_stem = [stemmer.stem(word) for word in words]
# The above line of code is a shorter version of the following code:
'''
words_stem = []
for word in words:
    words_stem.append(stemmer.stem(word))
'''
# Optional: convert each item to a plain str for display
# (useful in Python 2, where the stems print with a u'' prefix)
#words_stem_2 = [str(item) for item in words_stem]
#print (words_stem_2)
print (words_stem)
'''
Output:
['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
'''
Using split() function
You can simply test the stemmer on your text without word tokenizing. For this, you can use the split() method, which turns a string into a list based on a delimiter. By default, it splits on whitespace.
Note: Tokenizing sentences into words is useful as it separates punctuation from the words. In the below example, the last word dog will be taken as dog. (with the full stop at the end) because split() does not separate the punctuation mark from the word.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
text = "A quick brown fox jumps over the lazy dog."
text_stem = " ".join([stemmer.stem(word) for word in text.split()])
print (text_stem)
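Here is a quick check of the note above, comparing split() with word_tokenize() on the same sentence (a minimal sketch):
from nltk.tokenize import word_tokenize
text = "A quick brown fox jumps over the lazy dog."
# split() keeps the full stop attached to the last word
print (text.lower().split())
# ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
# word_tokenize() separates the punctuation into its own token
print (word_tokenize(text.lower()))
# ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']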
Stemming Non-English Words
There are other stemmers such as SnowballStemmer, LancasterStemmer, ISRIStemmer, RSLPStemmer, and RegexpStemmer (a short sketch of LancasterStemmer and RegexpStemmer appears after the SnowballStemmer examples below).
SnowballStemmer can stem words of different languages besides English.
from nltk.stem import SnowballStemmer
# Languages supported by SnowballStemmer
print (SnowballStemmer.languages)
'''
Output:
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish']
'''
Stemming Spanish Words using SnowballStemmer
Let’s stem some Spanish words.
Here’s the English translation of the Spanish words:
trabajando => working
trabajos => works
trabajó => worked
# -*- coding: utf-8 -*-
# The above line is only needed in Python 2; it solves the following error:
# SyntaxError: Non-ASCII character '\xc3' in file
from nltk.stem import SnowballStemmer
stemmer_spanish = SnowballStemmer('spanish')
print (stemmer_spanish.stem('trabajando')) # output: trabaj
print (stemmer_spanish.stem('trabajos')) # output: trabaj
print (stemmer_spanish.stem('trabajó')) # output: trabaj
# In Python 3, strings are Unicode by default, so the above line works as-is.
# In Python 2, decode the non-ASCII word first to avoid the following error:
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3
# print (stemmer_spanish.stem('trabajó'.decode('utf-8')))
Stemming English Words using SnowballStemmer
stemmer_english = SnowballStemmer('english')
print (stemmer_english.stem('working')) # output: work
print (stemmer_english.stem('works')) # output: work
print (stemmer_english.stem('worked')) # output: work
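The other stemmers mentioned above follow the same stem() interface. As a small sketch, here is how LancasterStemmer (a more aggressive stemmer) and RegexpStemmer (which strips suffixes matching a regular expression you supply) can be used; the regular expression below is just an illustrative choice:
from nltk.stem import LancasterStemmer, RegexpStemmer
stemmer_lancaster = LancasterStemmer()
print (stemmer_lancaster.stem('working')) # output: work
# RegexpStemmer removes any suffix matching the given pattern
# (min=4 means words shorter than 4 characters are left untouched)
stemmer_regexp = RegexpStemmer('ing$|s$|ed$', min=4)
print (stemmer_regexp.stem('working')) # output: work
print (stemmer_regexp.stem('worked')) # output: work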
LEMMATIZATION
Lemmatization is closely related to stemming. Lemmatization returns the lemma of a word, which is its base/root (dictionary) form.
Difference between Stemming and Lemmatization
– A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on the part of speech.
– While converting a word to its root/base form, stemming can create non-existent words, whereas lemmatization produces actual dictionary words (see the short sketch below).
– Stemmers are typically easier to implement than lemmatizers.
– Stemmers run faster than lemmatizers.
– The accuracy of stemming is less than that of lemmatization.
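To make the difference concrete, here is a minimal sketch comparing the Porter stemmer with the WordNet lemmatizer (introduced in the next section); studies is a word where the two clearly diverge:
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# stemming can produce a non-dictionary word
print (stemmer.stem('studies')) # output: studi
# lemmatization returns an actual dictionary word
print (lemmatizer.lemmatize('studies')) # output: study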
Lemmatization in NLTK can be done using WordNet’s Lemmatizer. WordNet is a lexical database of English.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatisation depends upon the Part of Speech of the word
# lemmatize(word, pos=NOUN)
# the default part of speech (pos) for lemmatize method is "n", i.e. noun
# we can specify part of speech (pos) value like below:
# noun = n, verb = v, adjective = a, adverb = r
print (lemmatizer.lemmatize('is')) # output: is
print (lemmatizer.lemmatize('are')) # output: are
print (lemmatizer.lemmatize('is', pos='v')) # output: be
print (lemmatizer.lemmatize('are', pos='v')) # output: be
print (lemmatizer.lemmatize('working', pos='n')) # output: working
print (lemmatizer.lemmatize('working', pos='v')) # output: work
Lemmatising text document
We need to first convert the text into word tokens. After that, we can lemmatize each word of the token list.
We can see in the below code that the word jumps has been converted to its base word jump.
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
text = "A quick brown fox jumps over the lazy dog."
# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()
# tokenize text
words = word_tokenize(text)
print (words)
'''
Output:
['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
lemmatizer = WordNetLemmatizer()
words_lemma = [lemmatizer.lemmatize(word) for word in words]
# The above line of code is a shorter version of the following code:
'''
words_lemma = []
for word in words:
    words_lemma.append(lemmatizer.lemmatize(word))
'''
# Optional: convert each item to a plain str for display
# (useful in Python 2, where the lemmas print with a u'' prefix)
#words_lemma_2 = [str(item) for item in words_lemma]
#print (words_lemma_2)
print (words_lemma)
'''
Output:
['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']
'''
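Since lemmatization depends on the part of speech, a common next step is to tag each token with nltk.pos_tag() and pass the corresponding WordNet POS to the lemmatizer. Below is a minimal sketch of that idea; the penn_to_wordnet() helper is just a name chosen here, and the 'averaged_perceptron_tagger' resource is assumed to be downloaded:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

def penn_to_wordnet(tag):
    # map a Penn Treebank tag (from nltk.pos_tag) to a WordNet POS value
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default, same as lemmatize()

text = "A quick brown fox jumps over the lazy dog."
words = word_tokenize(text.lower())
tagged = nltk.pos_tag(words)  # requires the 'averaged_perceptron_tagger' resource

lemmatizer = WordNetLemmatizer()
words_lemma = [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged]
print (words_lemma)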
Hope this helps. Thanks.