This article shows how to use the default stopwords corpus that ships with the Natural Language Toolkit (NLTK). To use the stopwords corpus, you first have to download it using the NLTK downloader. In my previous article, Introduction to NLP & NLTK, I wrote about downloading NLTK corpus data and showed basic usage examples.
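If you haven't downloaded it yet, one quick way is from a Python shell (depending on your NLTK version, the 'punkt' tokenizer models may also be needed for the word_tokenize function used later in this article):
import nltk
nltk.download('stopwords') # the stopwords corpus used in this article
nltk.download('punkt')     # tokenizer models used by word_tokenize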
Stopwords are words that occur very frequently in text but carry little meaning on their own, for example: a, the, is, are, etc.
Loading the Stopwords Corpus
from nltk.corpus import stopwords
# List the languages for which a stopword list is available
print(stopwords.fileids())
'''
Output:
['arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'kazakh', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']
'''
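The words() function works for any of the languages listed above. For example, to peek at the first ten entries of the German stopword list:
from nltk.corpus import stopwords
print(stopwords.words('german')[:10])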
English Stopwords
# Print the complete list of English stopwords
print(stopwords.words('english'))
'''
Output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
'''
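A quick way to check whether a given word is a stopword is a simple membership test:
stop_words = stopwords.words('english')
print('the' in stop_words) # True
print('fox' in stop_words) # False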
Tokenize Words
We split the text (a sentence or a paragraph) into a list of words. Each word in the list is called a token.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
text = "A quick brown fox jumps over the lazy dog."
# Normalize the text.
# NLTK is case-sensitive: "Fox" and "fox" are treated as two different words,
# so we convert all letters of the text to lowercase first.
text = text.lower()
# Tokenize the text into words
words = word_tokenize(text)
print(words)
'''
Output:
['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
Removing Punctuation
As you can see in the output above, there is a separate token for the full stop (.). It's a punctuation mark, and we generally get rid of such punctuation marks when analyzing text.
# Keep only tokens made up entirely of alphabetic characters
words = [w for w in words if w.isalpha()]
print(words)
'''
Output:
['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
'''
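Note that isalpha() keeps only tokens made up entirely of letters, so it also drops tokens containing digits, hyphens, or apostrophes. If you would rather strip punctuation characters out of each token instead of discarding whole tokens, here's a minimal sketch using Python's built-in string module:
import string
# Build a translation table that deletes all ASCII punctuation characters
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
stripped = [w for w in stripped if w] # drop tokens that became empty
print(stripped)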
Removing Stop Words
Here, we will remove stop words from our text data using the default stopwords corpus present in NLTK.
Get the List of English Stopwords
stop_words = stopwords.words('english')
print(len(stop_words)) # output: 153

words_filtered = words[:] # create a copy of the words list
for word in words:
    if word in stop_words:
        words_filtered.remove(word)
print(words_filtered)
'''
Output:
['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
'''
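The same filtering is usually written as a list comprehension. Converting the stopword list to a set first also makes each membership test a constant-time lookup instead of a linear scan:
stop_words = set(stopwords.words('english'))
words_filtered = [w for w in words if w not in stop_words]
print(words_filtered) # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']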
Updating the Stopwords List
Suppose you don't want some of the default stopwords to be removed from your text. In that case, you have to remove those words from the stopwords list. Let's say you want to keep the words "over" and "under" for your text analysis. Both are present in the English stopwords list by default, so let's remove them from the list we use.
# Convert both lists to sets so we can use set difference
# to drop 'over' and 'under' from the stopwords
stop_words = set(stopwords.words('english')) - set(['over', 'under'])
print(stop_words)
'''
Output:
{'all', 'just', 'being', 'through', 'yourselves', 'its', 'before', 'hadn', 'with', 'll', 'had', 'should', 'to', 'only', 'won', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'd', 'did', 'didn', 'these', 't', 'each', 'where', 'because', 'doing', 'theirs', 'some', 'hasn', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'weren', 're', 'does', 'above', 'between', 'mustn', 'she', 'be', 'we', 'after', 'here', 'shouldn', 'hers', 'by', 'on', 'about', 'couldn', 'of', 'against', 's', 'isn', 'or', 'own', 'into', 'yourself', 'down', 'mightn', 'wasn', 'your', 'from', 'her', 'whom', 'aren', 'there', 'been', 'few', 'too', 'then', 'themselves', 'was', 'until', 'more', 'himself', 'both', 'but', 'off', 'herself', 'than', 'those', 'he', 'me', 'myself', 'ma', 'this', 'up', 'will', 'while', 'ain', 'below', 'can', 'were', 'my', 'at', 'and', 've', 'wouldn', 'is', 'in', 'am', 'it', 'doesn', 'an', 'as', 'itself', 'o', 'have', 'further', 'their', 'if', 'again', 'no', 'that', 'when', 'same', 'any', 'how', 'other', 'which', 'you', 'shan', 'needn', 'haven', 'who', 'most', 'such', 'why', 'a', 'don', 'i', 'm', 'having', 'so', 'y', 'the', 'yours', 'once'}
'''
print(len(stop_words)) # output: 151

words_filtered = words[:] # create a copy of the words list
for word in words:
    if word in stop_words:
        words_filtered.remove(word)
print(words_filtered)
'''
Output:
['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
'''
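You can also go the other way and extend the stopwords list with your own domain-specific words. A minimal sketch (the extra words 'quick' and 'lazy' here are just examples):
# Union of the default stopwords with our own custom stopwords
stop_words = set(stopwords.words('english')) | {'quick', 'lazy'}
words_filtered = [w for w in words if w not in stop_words]
print(words_filtered) # ['brown', 'fox', 'jumps', 'dog']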
Hope this helps. Thanks.