Python NLTK: Stemming & Lemmatization [Natural Language Processing (NLP)]

This article shows how you can do stemming and lemmatization on your text using NLTK.

For an introduction to NLTK, you can read this article: Introduction to NLP & NLTK

The main goal of stemming and lemmatization is to convert related words to a common base/root word. It’s a special case of text normalization.

STEMMING

Stemming a word means returning the stem of the word. A single word can have different versions, but all the different versions of that word have a single stem/base/root word. The stem need not be identical to the morphological root of the word.

Example:

The word work will be the stem for working, worked, and works.

working => work
worked => work
works => work

Loading Stemmer Module

There are many stemming algorithms. The Porter stemming algorithm is the most popular one.


from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem('working')) # output: work
print(stemmer.stem('works')) # output: work
print(stemmer.stem('worked')) # output: work

Stemming a text document
We need to first convert the text into word tokens. After that, we can stem each word in the token list.

We can see in the below code that the word jumps has been stemmed to jump and the word lazy has been stemmed to lazi.


from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

text = "A quick brown fox jumps over the lazy dog."

# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()

# tokenize text 
words = word_tokenize(text)

print(words)
'''
Output:

['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''

stemmer = PorterStemmer()

words_stem = [stemmer.stem(word) for word in words]

# The above line of code is a shorter version of the following code:
'''
words_stem = []

for word in words:
    words_stem.append(stemmer.stem(word))
'''

print(words_stem)
'''
Output:

['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
'''

Using split() function

You can simply test the stemmer on your text without word tokenizing. For this, you can use the split() method, which turns a string into a list based on any delimiter. The default delimiter is a space.

Note: Tokenizing sentences into words is useful because it separates punctuation from the words. In the below example, the last word dog will be taken as dog. (with the full stop at the end); split() does not separate the punctuation mark from the word.


from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
text = "A quick brown fox jumps over the lazy dog."
text_stem = " ".join([stemmer.stem(word) for word in text.split()])
print(text_stem)
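'''
Output (assuming a recent NLTK version, where PorterStemmer lowercases
its input by default):

a quick brown fox jump over the lazi dog.
'''

Notice that the full stop stays attached to dog., which is exactly why tokenizing first gives cleaner results.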

Stemming Non-English Words

There are other stemmers as well, such as SnowballStemmer, LancasterStemmer, ISRIStemmer, RSLPStemmer, and RegexpStemmer.
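As a quick illustration, here is a minimal sketch of two of these alternatives: LancasterStemmer is generally more aggressive than Porter, and RegexpStemmer simply strips suffixes matching a regular expression you provide (the pattern below is just an example, not a recommended one):

from nltk.stem import LancasterStemmer, RegexpStemmer

stemmer_lancaster = LancasterStemmer()
print(stemmer_lancaster.stem('working')) # output: work

# strip a trailing 'ing', 'ed' or 's' from words of at least 4 characters
stemmer_regexp = RegexpStemmer('ing$|ed$|s$', min=4)
print(stemmer_regexp.stem('working')) # output: work
print(stemmer_regexp.stem('works')) # output: work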

SnowballStemmer can stem words of different languages besides English.


from nltk.stem import SnowballStemmer

# Languages supported by SnowballStemmer
print(SnowballStemmer.languages)
'''
Output:

['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish']
'''

Note: 'porter' in the above list is not a language; it refers to the original Porter algorithm.

Stemming Spanish Words using SnowballStemmer

Let’s stem some Spanish words.

Here’s the English translation of the Spanish words:

trabajando => working
trabajos => works
trabajó => worked


from nltk.stem import SnowballStemmer

stemmer_spanish = SnowballStemmer('spanish')

print(stemmer_spanish.stem('trabajando')) # output: trabaj
print(stemmer_spanish.stem('trabajos')) # output: trabaj
print(stemmer_spanish.stem('trabajó')) # output: trabaj

# Note: In Python 3, strings are Unicode by default, so the accented
# character needs no special handling. In Python 2, this code needed a
# "# -*- coding: utf-8 -*-" declaration at the top of the file (to avoid
# "SyntaxError: Non-ASCII character '\xc3' in file") and
# 'trabajó'.decode('utf-8') (to avoid "UnicodeDecodeError: 'ascii' codec
# can't decode byte 0xc3").

Stemming English Words using SnowballStemmer


from nltk.stem import SnowballStemmer

stemmer_english = SnowballStemmer('english')

print(stemmer_english.stem('working')) # output: work
print(stemmer_english.stem('works')) # output: work
print(stemmer_english.stem('worked')) # output: work

LEMMATIZATION

Lemmatization is closely related to stemming. Lemmatization returns the lemma of a word, which is its base/root word.

Difference between Stemming and Lemmatization

– A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

– While converting a word to its root/base word, stemming can create non-existent words, whereas lemmatization produces actual dictionary words (see the sketch after the lemmatizer example below).

– Stemmers are typically easier to implement than Lemmatizers.

– Stemmers run faster than Lemmatizers.

– The accuracy of stemming is less than that of lemmatization.

Lemmatization in NLTK can be done using WordNet’s Lemmatizer. WordNet is a lexical database of English.
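If you have not used WordNet before, you may first need to download the WordNet data. This is a one-time step; the omw-1.4 entry is additionally required by some recent NLTK versions:

import nltk

nltk.download('wordnet')
nltk.download('omw-1.4') # needed by some recent NLTK versions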


from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Lemmatization depends upon the part of speech of the word
# lemmatize(word, pos=NOUN)
# the default part of speech (pos) for the lemmatize method is "n", i.e. noun
# we can specify the part of speech (pos) value like below:
# noun = n, verb = v, adjective = a, adverb = r

print(lemmatizer.lemmatize('is')) # output: is
print(lemmatizer.lemmatize('are')) # output: are

print(lemmatizer.lemmatize('is', pos='v')) # output: be
print(lemmatizer.lemmatize('are', pos='v')) # output: be

print(lemmatizer.lemmatize('working', pos='n')) # output: working
print(lemmatizer.lemmatize('working', pos='v')) # output: work
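To see the stemming vs. lemmatization difference from the list above in practice, here is a minimal sketch: the Porter stemmer chops studies down to a non-word, while the WordNet lemmatizer returns an actual dictionary word.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# stemming just strips the suffix and can leave a non-word
print(stemmer.stem('studies')) # output: studi

# lemmatization looks the word up in WordNet and returns a real word
print(lemmatizer.lemmatize('studies')) # output: study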

Lemmatizing a text document
We need to first convert the text into word tokens. After that, we can lemmatize each word in the token list.

We can see in the below code that the word jumps has been converted to its base word jump.


from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

text = "A quick brown fox jumps over the lazy dog."

# Normalize text
# NLTK considers capital letters and small letters differently.
# For example, Fox and fox are considered as two different words.
# Hence, we convert all letters of our text into lowercase.
text = text.lower()

# tokenize text 
words = word_tokenize(text)

print(words)
'''
Output:

['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''

lemmatizer = WordNetLemmatizer()

words_lemma = [lemmatizer.lemmatize(word) for word in words]

# The above line of code is a shorter version of the following code:
'''
words_lemma = []

for word in words:
    words_lemma.append(lemmatizer.lemmatize(word))
'''

print(words_lemma)
'''
Output:

['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']
'''
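In the above output, words like over and lazy come back unchanged because every token was lemmatized with the default noun pos. A common improvement is to tag each token with nltk.pos_tag first and pass a matching WordNet pos to the lemmatizer. Below is a minimal sketch of that idea; penn_to_wordnet is a small helper written here for illustration, and nltk.pos_tag needs the tagger data (nltk.download('averaged_perceptron_tagger')).

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# map Penn Treebank tags (returned by nltk.pos_tag) to WordNet pos values
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return wordnet.ADJ   # 'a'
    elif tag.startswith('V'):
        return wordnet.VERB  # 'v'
    elif tag.startswith('R'):
        return wordnet.ADV   # 'r'
    return wordnet.NOUN      # 'n' (the default used by lemmatize)

lemmatizer = WordNetLemmatizer()

text = "A quick brown fox jumps over the lazy dog."
words = word_tokenize(text.lower())

words_lemma = [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
               for word, tag in nltk.pos_tag(words)]

print(words_lemma)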

Hope this helps. Thanks.