This article shows how to use the default stopwords corpus that ships with the Natural Language Toolkit (NLTK). To use the stopwords corpus, you first have to download it using the NLTK downloader. In my previous article, Introduction to NLP & NLTK, I wrote about downloading NLTK corpus data and showed basic usage examples.
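If you haven't downloaded it yet, one quick way is from a Python shell (depending on your NLTK version, the 'punkt' tokenizer models may also be needed for the word_tokenize function used later in this article):
import nltk
nltk.download('stopwords') # the stopwords corpus used in this article
nltk.download('punkt')     # tokenizer models used by word_tokenize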
Stopwords are words that occur very frequently in text but carry little meaning on their own, for example: a, the, is, are, etc.
Loading the Stopwords Corpus
from nltk.corpus import stopwords
# List the languages for which a stopword list is available
print(stopwords.fileids())
'''
Output:
['arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'kazakh', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']
'''
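The words() function works for any of the languages listed above. For example, to peek at the first ten entries of the German stopword list:
from nltk.corpus import stopwords
print(stopwords.words('german')[:10])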
English Stopwords
# Print the complete list of English stopwords
print(stopwords.words('english'))
'''
Output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
'''
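A quick way to check whether a given word is a stopword is a simple membership test:
stop_words = stopwords.words('english')
print('the' in stop_words) # True
print('fox' in stop_words) # False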
Tokenize Words
We split the text (a sentence or a paragraph) into a list of words. Each word in the list is called a token.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
text = "A quick brown fox jumps over the lazy dog."
# Normalize the text.
# NLTK is case-sensitive: "Fox" and "fox" are treated as two different words,
# so we convert all letters of the text to lowercase first.
text = text.lower()
# Tokenize the text into words
words = word_tokenize(text)
print(words)
'''
Output:
['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
'''
Removing Punctuation
As you can see in the output above, there is a separate token for the full stop (.). It's a punctuation mark, and we generally get rid of such punctuation marks when analyzing text.
# Keep only tokens made up entirely of alphabetic characters
words = [w for w in words if w.isalpha()]
print(words)
'''
Output:
['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
'''
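Note that isalpha() keeps only tokens made up entirely of letters, so it also drops tokens containing digits, hyphens, or apostrophes. If you would rather strip punctuation characters out of each token instead of discarding whole tokens, here's a minimal sketch using Python's built-in string module:
import string
# Build a translation table that deletes all ASCII punctuation characters
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
stripped = [w for w in stripped if w] # drop tokens that became empty
print(stripped)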
Removing Stop Words
Here, we will remove stop words from our text data using the default stopwords corpus present in NLTK.
Get the List of English Stopwords
stop_words = stopwords.words('english')
print(len(stop_words)) # output: 153

words_filtered = words[:] # create a copy of the words list
for word in words:
    if word in stop_words:
        words_filtered.remove(word)
print(words_filtered)
'''
Output:
['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
'''
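The same filtering is usually written as a list comprehension. Converting the stopword list to a set first also makes each membership test a constant-time lookup instead of a linear scan:
stop_words = set(stopwords.words('english'))
words_filtered = [w for w in words if w not in stop_words]
print(words_filtered) # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']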
Updating the Stopwords List
Suppose you don't want some of the default stopwords to be removed from your text. In that case, you have to remove those words from the stopwords list. Let's say you want to keep the words "over" and "under" for your text analysis. Both are present in the English stopwords list by default, so let's remove them from the list we use.
# Convert both lists to sets so we can use set difference
# to drop 'over' and 'under' from the stopwords
stop_words = set(stopwords.words('english')) - set(['over', 'under'])
print(stop_words)
'''
Output:
{'all', 'just', 'being', 'through', 'yourselves', 'its', 'before', 'hadn', 'with', 'll', 'had', 'should', 'to', 'only', 'won', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'd', 'did', 'didn', 'these', 't', 'each', 'where', 'because', 'doing', 'theirs', 'some', 'hasn', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'weren', 're', 'does', 'above', 'between', 'mustn', 'she', 'be', 'we', 'after', 'here', 'shouldn', 'hers', 'by', 'on', 'about', 'couldn', 'of', 'against', 's', 'isn', 'or', 'own', 'into', 'yourself', 'down', 'mightn', 'wasn', 'your', 'from', 'her', 'whom', 'aren', 'there', 'been', 'few', 'too', 'then', 'themselves', 'was', 'until', 'more', 'himself', 'both', 'but', 'off', 'herself', 'than', 'those', 'he', 'me', 'myself', 'ma', 'this', 'up', 'will', 'while', 'ain', 'below', 'can', 'were', 'my', 'at', 'and', 've', 'wouldn', 'is', 'in', 'am', 'it', 'doesn', 'an', 'as', 'itself', 'o', 'have', 'further', 'their', 'if', 'again', 'no', 'that', 'when', 'same', 'any', 'how', 'other', 'which', 'you', 'shan', 'needn', 'haven', 'who', 'most', 'such', 'why', 'a', 'don', 'i', 'm', 'having', 'so', 'y', 'the', 'yours', 'once'}
'''
print(len(stop_words)) # output: 151

words_filtered = words[:] # create a copy of the words list
for word in words:
    if word in stop_words:
        words_filtered.remove(word)
print(words_filtered)
'''
Output:
['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
'''
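You can also go the other way and extend the stopwords list with your own domain-specific words. A minimal sketch (the extra words 'quick' and 'lazy' here are just examples):
# Union of the default stopwords with our own custom stopwords
stop_words = set(stopwords.words('english')) | {'quick', 'lazy'}
words_filtered = [w for w in words if w not in stop_words]
print(words_filtered) # ['brown', 'fox', 'jumps', 'dog']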
Hope this helps. Thanks.