This article shows how you can perform sentiment analysis on movie reviews using Python and the Natural Language Toolkit (NLTK).
Sentiment Analysis means analyzing the sentiment of a given text or document and categorizing it into a specific class or category, such as positive or negative. In its basic form, the classification is done for two classes: positive and negative. However, we can add more classes like neutral, highly positive, highly negative, etc.
Sentiment Analysis is also referred to as Opinion Mining. It’s mostly used on social media and customer review data.
In this article, we will learn about labeling data, extracting features, training a classifier, and testing the accuracy of the classifier.
Supervised Classification
Here, we will be doing supervised text classification. In supervised classification, the classifier is trained with labeled training data.
In this article, we will use NLTK’s movie_reviews corpus as our labeled training data. The movie_reviews corpus contains 2,000 movie reviews with sentiment polarity classification, compiled by Pang and Lee.
Here, we have two categories for classification: positive and negative. The movie_reviews corpus already has the reviews categorized as positive and negative.
from nltk.corpus import movie_reviews
# Total reviews
print (len(movie_reviews.fileids())) # Output: 2000
# Review categories
print (movie_reviews.categories()) # Output: [u'neg', u'pos']
# Total positive reviews
print (len(movie_reviews.fileids('pos'))) # Output: 1000
# Total negative reviews
print (len(movie_reviews.fileids('neg'))) # Output: 1000
positive_review_file = movie_reviews.fileids('pos')[0]
print (positive_review_file) # Output: pos/cv000_29590.txt
Create a list of movie review documents
This list contains tuples, each holding the word list of a movie review and its respective category (pos or neg).
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        #documents.append((list(movie_reviews.words(fileid)), category))
        documents.append((movie_reviews.words(fileid), category))
print (len(documents)) # Output: 2000
# x = [str(item) for item in documents[0][0]]
# print (x)
# print first tuple
print (documents[0])
'''
Output:
(['plot', ':', 'two', 'teen', 'couples', 'go', ...], 'neg')
'''
# shuffle the document list
from random import shuffle
shuffle(documents)
Feature Extraction
To classify the text into any category, we need to define some criteria. On the basis of those criteria, our classifier will learn that a particular kind of text falls in a particular category. This kind of criterion is known as a feature. We can define one or more features to train our classifier.
In this example, we will use the top-N words feature.
Fetch all words from the movie reviews corpus
We first fetch all the words from all the movie reviews and create a list.
all_words = [word.lower() for word in movie_reviews.words()]
# print first 10 words
print (all_words[:10])
'''
Output:
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']
'''
Create Frequency Distribution of all words
The frequency distribution will calculate the number of occurrences of each word in the entire list of words.
from nltk import FreqDist
all_words_frequency = FreqDist(all_words)
print (all_words_frequency)
'''
Output:
<FreqDist with 39768 samples and 1583820 outcomes>
'''
# print 10 most frequently occurring words
print (all_words_frequency.most_common(10))
'''
Output:
[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
'''
Removing Punctuation and Stopwords
From the above frequency distribution of words, we can see that the most frequently occurring words are either punctuation marks or stopwords.
Stop words are frequently occurring words which do not carry any significant meaning in text analysis, for example: I, me, my, the, a, and, is, are, he, she, we, etc.
Punctuation marks like the comma, full stop, and inverted comma also occur very frequently in any text data.
We will do data cleaning by removing stop words and punctuation.
Remove Stop Words
from nltk.corpus import stopwords
stopwords_english = stopwords.words('english')
print (stopwords_english)
'''
Output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
'''
# create a new list of words by removing stopwords from all_words
all_words_without_stopwords = [word for word in all_words if word not in stopwords_english]
# print the first 10 words
print (all_words_without_stopwords[:10])
'''
Output:
['plot', ':', 'two', 'teen', 'couples', 'go', 'church', 'party', ',', 'drink']
'''
'''
# Above code is written using the List Comprehension feature of Python
# It's the same thing as writing the following, the output is the same
all_words_without_stopwords = []
for word in all_words:
    if word not in stopwords_english:
        all_words_without_stopwords.append(word)
print (all_words_without_stopwords[:10])
'''
You can see that after removing stopwords, the words to and a have been removed from the first 10 words of the result.
Remove Punctuation
import string
print (string.punctuation)
'''
Output:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
'''
# create a new list of words by removing punctuation from all_words
all_words_without_punctuation = [word for word in all_words if word not in string.punctuation]
# print the first 10 words
print (all_words_without_punctuation[:10])
'''
Output:
[u'plot', u'two', u'teen', u'couples', u'go', u'to', u'a', u'church', u'party', u'drink']
'''
You can see from the list that punctuation marks like the colon : and the comma , have been removed.
Remove both Stopwords & Punctuation
In the above examples, we first removed only stopwords, and then in the next code we removed only punctuation. Below, we will remove both stopwords and punctuation from the all_words list.
# Let's name the new list as all_words_clean
# because we clean stopwords and punctuations from the word list
all_words_clean = []
for word in all_words:
    if word not in stopwords_english and word not in string.punctuation:
        all_words_clean.append(word)
print (all_words_clean[:10])
'''
Output:
['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get']
'''
Frequency Distribution of cleaned words list
Below is the frequency distribution of the new list after removing stopwords and punctuation.
all_words_frequency = FreqDist(all_words_clean)
print (all_words_frequency)
'''
Output:
<FreqDist with 39586 samples and 710578 outcomes>
'''
# print 10 most frequently occurring words
print (all_words_frequency.most_common(10))
'''
Output:
[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565), ('time', 2411), ('good', 2411), ('story', 2169), ('would', 2109), ('much', 2049)]
'''
Previously, before removing stopwords and punctuation, the frequency distribution was:
FreqDist with 39768 samples and 1583820 outcomes
Now, the frequency distribution is:
FreqDist with 39586 samples and 710578 outcomes
This shows that after removing roughly 200 stop words and punctuation tokens, the total word count (outcomes) has dropped to less than half of its original size.
The list of most common words now also contains meaningful words. Before cleaning, the 10 most frequently occurring words were only stop words and punctuation marks.
Create Word Features using the 2000 most frequently occurring words
We take the 2000 most frequently occurring words as our features.
print (len(all_words_frequency)) # Output: 39586
# get the 2000 most frequently occurring words
most_common_words = all_words_frequency.most_common(2000)
print (most_common_words[:10])
'''
Output:
[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565), ('time', 2411), ('good', 2411), ('story', 2169), ('would', 2109), ('much', 2049)]
'''
print (most_common_words[1990:])
'''
Output:
[('genuinely', 64), ('path', 64), ('eve', 64), ('aware', 64), ('bank', 64), ('bound', 64), ('eric', 64), ('regular', 64), ('las', 64), ('niro', 64)]
'''
# the most common words list's elements are in the form of tuples
# get only the first element of each tuple of the word list
word_features = [item[0] for item in most_common_words]
print (word_features[:10])
'''
Output:
['film', 'one', 'movie', 'like', 'even', 'time', 'good', 'story', 'would', 'much']
'''
Create Feature Set
Now, we write a function that will be used to create the feature set. The feature set is used to train the classifier.
We define a feature extractor function that checks, for each word in the word_features list, whether it is present in a given document.
def document_features(document):
# "set" function will remove repeated/duplicate tokens in the given list
document_words = set(document)
features = {}
for word in word_features:
features['contains(%s)' % word] = (word in document_words)
return features
# get the first negative movie review file
movie_review_file = movie_reviews.fileids('neg')[0]
print (movie_review_file)
'''
Output:
neg/cv000_29416.txt
'''
#print (document_features(movie_reviews.words(movie_review_file)))
'''
Output:
{'contains(waste)': False, 'contains(lot)': False, 'contains(rent)': False, 'contains(black)': False, 'contains(rated)': False, 'contains(potential)': False, .............................................................................
.............................................. 'contains(smile)': False, 'contains(cross)': False, 'contains(barry)': False}
'''
At the beginning of this article, we created the documents list, which contains the data of all the movie reviews. Its elements are tuples with the word list as the first item and the review category as the second item.
# print first tuple of the documents list
print (documents[0])
'''
Output:
(['plot', ':', 'two', 'teen', 'couples', 'go', ...], 'neg')
'''
We now loop through the documents list and create a feature set list using the document_features function defined above.
- Each item of the feature_set list is a tuple.
- The first item of the tuple is the dictionary returned from the document_features function.
- The second item of the tuple is the category (pos or neg) of the movie review.
feature_set = [(document_features(doc), category) for (doc, category) in documents]
print (feature_set[0])
'''
Output:
({'contains(waste)': False, 'contains(lot)': False, 'contains(rent)': False, 'contains(black)': False, 'contains(rated)': False, 'contains(potential)': False, ...........................................................................
............................................................. 'contains(good)': False, 'contains(live)': False, 'contains(synopsis)': False, 'contains(appropriate)': False, 'contains(towards)': False, 'contains(smile)': False, 'contains(cross)': False, 'contains(barry)': False}, 'neg')
'''
'''
# In the above code, we have used list-comprehension feature of python
# The same code can be written as below:
feature_set = []
for (doc, category) in documents:
    feature_set.append((document_features(doc), category))
print (feature_set[0])
'''
Training Classifier
From the feature set we created above, we now create a separate training set and a separate testing/validation set. The train set is used to train the classifier and the test set is used to test the classifier to check how accurately it classifies the given text.
Creating Train and Test Dataset
In this example, we use the first 400 elements of the feature set as the test set and the rest of the data as the train set. Generally, an 80/20 split between the training and testing sets is fair, i.e. 80 percent for the training set and 20 percent for the testing set.
print (len(feature_set)) # Output: 2000
test_set = feature_set[:400]
train_set = feature_set[400:]
print (len(train_set)) # Output: 1600
print (len(test_set)) # Output: 400
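As a side note, the same split can be computed from a ratio instead of a hard-coded number. This is only a small sketch with our own variable names (test_ratio, split_index); it produces the same 400/1600 split for the 2000 reviews.
# derive the test-set size from a ratio (hypothetical helper variables)
test_ratio = 0.2
split_index = int(len(feature_set) * test_ratio)  # 400 for 2000 reviews
test_set = feature_set[:split_index]
train_set = feature_set[split_index:]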
Training a Classifier
Now, we train a classifier using the training dataset. There are different kinds of classifiers, namely the Naive Bayes Classifier, Maximum Entropy Classifier, Decision Tree Classifier, Support Vector Machine Classifier, etc.
In this example, we use the Naive Bayes Classifier. It’s a simple, fast, and easy classifier which performs well for small datasets. It’s a simple probabilistic classifier based on applying Bayes’ theorem. Bayes’ theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
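To make the idea concrete, here is a tiny toy sketch (not NLTK code; the probabilities are hypothetical and only for illustration). It shows the arithmetic behind Naive Bayes: the score of each label is its prior probability multiplied by the probability of each observed word given that label, and the label with the highest score wins.
# hypothetical prior and per-word probabilities, for illustration only
prior = {'pos': 0.5, 'neg': 0.5}                      # P(label)
likelihood = {                                        # P(word present | label)
    'pos': {'outstanding': 0.10, 'waste': 0.01},
    'neg': {'outstanding': 0.01, 'waste': 0.08},
}

def naive_bayes_score(words, label):
    score = prior[label]
    for word in words:
        score *= likelihood[label].get(word, 0.05)    # hypothetical default probability
    return score

review_words = ['outstanding']
scores = {label: naive_bayes_score(review_words, label) for label in prior}
total = sum(scores.values())
print ({label: score / total for label, score in scores.items()})
# Output: {'pos': 0.909..., 'neg': 0.0909...} -- 'pos' wins for this toy review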
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
Testing the trained Classifier
Let’s see the accuracy percentage of the trained classifier. The accuracy value changes each time you run the program because the documents list is shuffled above.
from nltk import classify
accuracy = classify.accuracy(classifier, test_set)
print (accuracy) # Output: 0.77
Let’s see the output of the classifier by providing some custom reviews.
from nltk.tokenize import word_tokenize
custom_review = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = document_features(custom_review_tokens)
print (classifier.classify(custom_review_set)) # Output: neg
# Negative review correctly classified as negative
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: neg
print (prob_result.prob("neg")) # Output: 0.999989264571
print (prob_result.prob("pos")) # Output: 1.07354285262e-05
custom_review = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = document_features(custom_review_tokens)
print (classifier.classify(custom_review_set)) # Output: neg
# Positive review is classified as negative
# We need to improve our feature set for more accurate prediction
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: neg
print (prob_result.prob("neg")) # Output: 0.999791868552
print (prob_result.prob("pos")) # Output: 0.000208131447797
Let’s see the most informative features among all the features in the feature set.
# show 10 most informative features
print (classifier.show_most_informative_features(10))
'''
Output:
Most Informative Features
contains(outstanding) = True pos : neg = 14.7 : 1.0
contains(mulan) = True pos : neg = 7.8 : 1.0
contains(poorly) = True neg : pos = 7.7 : 1.0
contains(wonderfully) = True pos : neg = 7.5 : 1.0
contains(seagal) = True neg : pos = 6.5 : 1.0
contains(awful) = True neg : pos = 6.1 : 1.0
contains(wasted) = True neg : pos = 6.1 : 1.0
contains(waste) = True neg : pos = 5.6 : 1.0
contains(damon) = True pos : neg = 5.3 : 1.0
contains(flynt) = True pos : neg = 5.1 : 1.0
'''
The result shows that the word outstanding is used in positive reviews 14.7 times more often than it is used in negative reviews, and the word poorly is used in negative reviews 7.7 times more often than it is used in positive reviews. Similarly for the other words. These ratios are also called likelihood ratios.
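Roughly speaking, for this two-class problem the reported ratio of a feature is the probability of that feature given one label divided by its probability given the other label, for example:
ratio(outstanding) = P(contains(outstanding) = True | pos) / P(contains(outstanding) = True | neg) ≈ 14.7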
Therefore, a review has a high chance of being classified as positive if it contains words like outstanding and wonderfully. Similarly, a review has a high chance of being classified as negative if it contains words like poorly, awful, waste, etc.
Note: You can modify the document_features function to generate a feature set which can improve the accuracy of the trained classifier. Feature extractors are built through a process of trial and error, guided by intuition. One possible variation is sketched below.
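For example, here is one hypothetical variation of our own (not from the NLTK book): it lowercases the document tokens before checking membership, which can help when classifying custom reviews tokenized with word_tokenize, since those tokens keep their original case (e.g. "Poor", "Best") while word_features contains lowercase words.
# hypothetical variation of document_features: lowercase tokens before matching
def document_features_lowercase(document):
    document_words = set(word.lower() for word in document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features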
Bag of Words Feature
In the above example, we used the top-N words feature, taking the 2000 most frequently occurring words as our features. The classifier correctly identified the negative review as negative. However, it was not able to classify the positive review correctly: it classified a positive review as negative.
Top-N words feature
– The top-N words feature is also a bag-of-words feature.
– But in the top-N feature, we only used the top 2000 words in the feature set.
– We combined the positive and negative reviews into a single list, randomized the list, and then separated the train and test set.
– This approach can result in an uneven distribution of positive and negative reviews across the train and test set.
Bag-of-words feature (shown below)
– We will use all the useful words of each review while creating the feature set.
– We take a fixed number of positive and negative reviews for the train and test set.
– This results in an equal distribution of positive and negative reviews across the train and test set.
In the approach shown below, we will modify the feature extractor function.
- We form a list of unique words of each review.
- The category (pos or neg) is assigned to each bag of words.
- Then the category of any given text is calculated by matching its bag of words against these bags of words and their respective categories.
from nltk.corpus import movie_reviews
pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append(words)
neg_reviews = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append(words)
# print first positive review item from the pos_reviews list
print (pos_reviews[0])
'''
Output:
['films', 'adapted', 'from', 'comic', 'books', ...]
'''
# print first negative review item from the neg_reviews list
print (neg_reviews[0])
'''
Output:
['plot', ':', 'two', 'teen', 'couples', 'go', ...]
'''
# print first 20 items of the first item of positive review
print (pos_reviews[0][:20])
'''
Output:
['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',']
'''
# print first 20 items of the first item of negative review
print (neg_reviews[0][:20])
'''
Output:
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an']
'''
Feature Extraction
We use the bag-of-words feature. Here, we clean the word list (i.e. remove stop words and punctuation). Then, we create a dictionary of cleaned words.
from nltk.corpus import stopwords
import string
stopwords_english = stopwords.words('english')
# feature extractor function
def bag_of_words(words):
    words_clean = []
    for word in words:
        word = word.lower()
        if word not in stopwords_english and word not in string.punctuation:
            words_clean.append(word)
    words_dictionary = dict([word, True] for word in words_clean)
    return words_dictionary
# using dict will remove duplicate words from the words list
# note the output: stopword 'the' is also removed
print (bag_of_words(['the', 'the', 'good', 'bad', 'the', 'good']))
'''
Output:
{'bad': True, 'good': True}
'''
Create Feature Set
We use the bag-of-words feature and tag each review with its respective category as positive or negative.
# positive reviews feature set
pos_reviews_set = []
for words in pos_reviews:
    pos_reviews_set.append((bag_of_words(words), 'pos'))
# negative reviews feature set
neg_reviews_set = []
for words in neg_reviews:
    neg_reviews_set.append((bag_of_words(words), 'neg'))
# print first positive review item from the pos_reviews list
print (pos_reviews_set[0])
'''
Output:
({'childs': True, 'steve': True, 'surgical': True, 'go': True, 'certainly': True, 'watchmen': True, 'song': True, 'simpsons': True, 'novel': True, ........................................................................
........................................................ 'menace': True, 'starting': True, 'original': True}, 'pos')
'''
# print first negative review item from the neg_reviews list
print (neg_reviews_set[0])
'''
Output:
({'concept': True, 'skip': True, 'insight': True, 'playing': True, 'executed': True, 'go': True, 'still': True, 'find': True, 'seemed': True, .............................................................................................
................................................. 'entertaining': True, 'years': True, 'away': True, 'came': True}, 'neg')
'''
Create Train and Test Set
There are 1000 positive reviews set and 1000 negative reviews set. We take 20% (i.e. 200) of positive reviews and 20% (i.e. 200) of negative reviews as a test set. The remaining negative and positive reviews will be taken as a training set.
Note:
– There is a difference between the pos_reviews and pos_reviews_set arrays defined above.
– The pos_reviews array contains only word lists.
– The pos_reviews_set array contains the word feature sets.
– The pos_reviews_set and neg_reviews_set arrays are used to create the train and test sets, as shown below.
print (len(pos_reviews_set), len(neg_reviews_set)) # Output: (1000, 1000)
# randomize pos_reviews_set and neg_reviews_set
# doing so will output a different accuracy result every time we run the program
from random import shuffle
shuffle(pos_reviews_set)
shuffle(neg_reviews_set)
test_set = pos_reviews_set[:200] + neg_reviews_set[:200]
train_set = pos_reviews_set[200:] + neg_reviews_set[200:]
print(len(test_set), len(train_set)) # Output: (400, 1600)
Training Classifier and Calculating Accuracy
We train the Naive Bayes Classifier using the training set and calculate the classification accuracy of the trained classifier using the test set.
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
accuracy = classify.accuracy(classifier, test_set)
print(accuracy) # Output: 0.7325
print (classifier.show_most_informative_features(10))
'''
Output:
Most Informative Features
breathtaking = True pos : neg = 20.3 : 1.0
dazzling = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 12.2 : 1.0
outstanding = True pos : neg = 10.6 : 1.0
insipid = True neg : pos = 10.3 : 1.0
stretched = True neg : pos = 10.3 : 1.0
stupidity = True neg : pos = 10.2 : 1.0
annual = True pos : neg = 9.7 : 1.0
headache = True neg : pos = 9.7 : 1.0
avoids = True pos : neg = 9.7 : 1.0
'''
Testing Classifier with Custom Review
We provide custom review text and check the classification output of the trained classifier. The classifier correctly predicts both negative and positive reviews provided.
from nltk.tokenize import word_tokenize
custom_review = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = bag_of_words(custom_review_tokens)
print (classifier.classify(custom_review_set)) # Output: neg
# Negative review correctly classified as negative
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: neg
print (prob_result.prob("neg")) # Output: 0.776128854994
print (prob_result.prob("pos")) # Output: 0.223871145006
custom_review = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = bag_of_words(custom_review_tokens)
print (classifier.classify(custom_review_set)) # Output: pos
# Positive review correctly classified as positive
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: pos
print (prob_result.prob("neg")) # Output: 0.0972171562901
print (prob_result.prob("pos")) # Output: 0.90278284371
Bi-gram Features
N-grams are common terms in text processing and analysis. An n-gram is a contiguous sequence of n words from a text. There are different n-grams like unigram, bigram, trigram, etc.
Unigram = Item having a single word, i.e. the n-gram of size 1. For example, good.
Bigram = Item having two words, i.e. the n-gram of size 2. For example, very good.
Trigram = Item having three words, i.e. the n-gram of size 3. For example, not so good.
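As a quick illustration with a toy sentence of our own, NLTK’s ngrams function can generate these directly; it is the same function used in the feature extractor below.
from nltk import ngrams

tokens = ['not', 'so', 'good']
print (list(ngrams(tokens, 1)))  # [('not',), ('so',), ('good',)]
print (list(ngrams(tokens, 2)))  # [('not', 'so'), ('so', 'good')]
print (list(ngrams(tokens, 3)))  # [('not', 'so', 'good')]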
In the above bag-of-words model, we only used the unigram feature. In the example below, we will use both unigram and bigram feature, i.e. we will deal with both single words and double words.
Feature Extraction
In this case, both unigrams and bigrams are used as features.
We define two functions:
– bag_of_words: extracts only unigram features from the movie review words
– bag_of_ngrams: extracts only bigram features from the movie review words
We then define another function:
– bag_of_all_words: combines both unigram and bigram features
from nltk import ngrams
from nltk.corpus import stopwords
import string
stopwords_english = stopwords.words('english')
# clean words, i.e. remove stopwords and punctuation
def clean_words(words, stopwords_english):
    words_clean = []
    for word in words:
        word = word.lower()
        if word not in stopwords_english and word not in string.punctuation:
            words_clean.append(word)
    return words_clean
# feature extractor function for unigram
def bag_of_words(words):
    words_dictionary = dict([word, True] for word in words)
    return words_dictionary
# feature extractor function for ngrams (bigram)
def bag_of_ngrams(words, n=2):
    words_ng = []
    for item in iter(ngrams(words, n)):
        words_ng.append(item)
    words_dictionary = dict([word, True] for word in words_ng)
    return words_dictionary
'''
# Alternative Bi-gram feature extractor
# using BigramCollocationFinder module
# Collocations are multiple words which commonly co-occur.
# http://www.nltk.org/howto/collocations.html
# https://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
# feature extractor function for ngrams (bigram)
# get 200 most frequently occurring bigrams from every review
def bag_of_ngrams(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
'''
from nltk.tokenize import word_tokenize
text = "It was a very good movie."
words = word_tokenize(text.lower())
print (words)
'''
Output:
['it', 'was', 'a', 'very', 'good', 'movie', '.']
'''
print (bag_of_ngrams(words))
'''
Output:
{('very', 'good'): True, ('movie', '.'): True, ('it', 'was'): True, ('good', 'movie'): True, ('was', 'a'): True, ('a', 'very'): True}
'''
# working with cleaning words
# i.e. removing stopwords and punctuation
words_clean = clean_words(words, stopwords_english)
print (words_clean)
'''
Output:
['good', 'movie']
'''
# cleaning words is fine for unigrams
# but this can omit important words for bigrams
# for example, stopwords like very, over, under, so, etc. are important for bigrams
# we create a new stopwords list specifically for bigrams by omitting such important words
important_words = ['above', 'below', 'off', 'over', 'under', 'more', 'most', 'such', 'no', 'nor', 'not', 'only', 'so', 'than', 'too', 'very', 'just', 'but']
stopwords_english_for_bigrams = set(stopwords_english) - set(important_words)
words_clean_for_bigrams = clean_words(words, stopwords_english_for_bigrams)
print (words_clean_for_bigrams)
'''
Output:
['very', 'good', 'movie']
'''
# We will use general stopwords for unigrams
# And special stopwords list for bigrams
unigram_features = bag_of_words(words_clean)
print (unigram_features)
'''
Output:
{'movie': True, 'good': True}
'''
bigram_features = bag_of_ngrams(words_clean_for_bigrams)
print (bigram_features)
'''
Output:
{('very', 'good'): True, ('good', 'movie'): True}
'''
# combine both unigram and bigram features
all_features = unigram_features.copy()
all_features.update(bigram_features)
print (all_features)
'''
Output:
{'movie': True, ('very', 'good'): True, 'good': True, ('good', 'movie'): True}
'''
# let's define a new function that extracts all features
# i.e. that extracts both unigram and bigrams features
def bag_of_all_words(words, n=2):
    words_clean = clean_words(words, stopwords_english)
    words_clean_for_bigrams = clean_words(words, stopwords_english_for_bigrams)
    unigram_features = bag_of_words(words_clean)
    bigram_features = bag_of_ngrams(words_clean_for_bigrams)
    all_features = unigram_features.copy()
    all_features.update(bigram_features)
    return all_features
print (bag_of_all_words(words))
'''
Output:
{'movie': True, ('very', 'good'): True, 'good': True, ('good', 'movie'): True}
'''
Working with NLTK’s movie reviews corpus
from nltk.corpus import movie_reviews
pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append(words)
neg_reviews = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append(words)
Create Feature Set
# positive reviews feature set
pos_reviews_set = []
for words in pos_reviews:
    pos_reviews_set.append((bag_of_all_words(words), 'pos'))
# negative reviews feature set
neg_reviews_set = []
for words in neg_reviews:
    neg_reviews_set.append((bag_of_all_words(words), 'neg'))
Create Train and Test Set
There are 1000 positive reviews set and 1000 negative reviews set. We take 20% (i.e. 200) of positive reviews and 20% (i.e. 200) of negative reviews as the test set. The remaining negative and positive reviews will be taken as the training set.
print (len(pos_reviews_set), len(neg_reviews_set)) # Output: (1000, 1000)
# randomize pos_reviews_set and neg_reviews_set
# doing so will output a different accuracy result every time we run the program
from random import shuffle
shuffle(pos_reviews_set)
shuffle(neg_reviews_set)
test_set = pos_reviews_set[:200] + neg_reviews_set[:200]
train_set = pos_reviews_set[200:] + neg_reviews_set[200:]
print(len(test_set), len(train_set)) # Output: (400, 1600)
Training Classifier and Calculating Accuracy
We train the Naive Bayes Classifier using the training set and calculate the classification accuracy of the trained classifier using the test set.
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
accuracy = classify.accuracy(classifier, test_set)
print(accuracy) # Output: 0.8025
print (classifier.show_most_informative_features(10))
'''
Output:
Most Informative Features
insulting = True neg : pos = 17.0 : 1.0
outstanding = True pos : neg = 14.7 : 1.0
('nice', 'see') = True pos : neg = 11.7 : 1.0
('one', 'worst') = True neg : pos = 11.4 : 1.0
('would', 'think') = True neg : pos = 11.0 : 1.0
('quite', 'well') = True pos : neg = 11.0 : 1.0
('makes', 'no') = True neg : pos = 10.3 : 1.0
('but', 'script') = True neg : pos = 10.3 : 1.0
('quite', 'frankly') = True neg : pos = 10.3 : 1.0
animators = True pos : neg = 10.3 : 1.0
'''
Note:
– The accuracy of the classifier has significantly increased when trained with the combined feature set (unigram + bigram).
– Accuracy was 73% when using only unigram features.
– Accuracy increased to 80% when using the combined (unigram + bigram) features.
Testing Classifier with Custom Review
We provide custom review text and check the classification output of the trained classifier. The classifier correctly predicts both negative and positive reviews provided.
from nltk.tokenize import word_tokenize
custom_review = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = bag_of_all_words(custom_review_tokens)
print (classifier.classify(custom_review_set)) # Output: neg
# Negative review correctly classified as negative
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: neg
print (prob_result.prob("neg")) # Output: 0.770612685688
print (prob_result.prob("pos")) # Output: 0.229387314312
custom_review = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = bag_of_all_words(custom_review_tokens)
print (classifier.classify(custom_review_set)) # Output: pos
# Positive review correctly classified as positive
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: pos
print (prob_result.prob("neg")) # Output: 0.00677736186354
print (prob_result.prob("pos")) # Output: 0.993222638136
Hope this helps. Thanks.