Machine Learning & Sentiment Analysis: Text Classification using Python & NLTK

This article deals with using different feature sets to train three different classifiers [Naive Bayes Classifier, Maximum Entropy (MaxEnt) Classifier, and Support Vector Machine (SVM) Classifier].

Bag of Words, Stopword Filtering, and Bigram Collocations are the methods used to generate the feature sets.

Text reviews from the Yelp Academic Dataset are used to create the training dataset.

Cross-validation is also performed as part of the evaluation.

The code used in this article is based upon this article from StreamHacker.

The code is written in Python and uses Python’s NLTK (Natural Language Toolkit) library.

I selected 500 positive reviews (reviews with a 5-star rating) and 500 negative reviews (reviews with a 1-star rating) from the Yelp dataset. Positive reviews are kept in a CSV file named positive-data.csv and negative reviews are kept in a CSV file named negative-data.csv.

Download: Positive and Negative Training Data

Some words (e.g. no, not, more, most, below, over, too, very) have been removed from the standard English stopword list available in NLTK, because those words can carry sentiment in our review dataset.


stopset = set(stopwords.words('english')) - set(('over', 'under', 'below', 'more', 'most', 'no', 'not', 'only', 'such', 'few', 'so', 'too', 'very', 'just', 'any', 'once'))
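To see the effect of the adjusted stopword set, here is a minimal sketch written for Python 3. It assumes a tiny hand-written list (`base_stopwords`, a hypothetical stand-in for `stopwords.words('english')`) so that it runs without downloading the NLTK corpus; the real list is much longer.

```python
# Hypothetical stand-in for stopwords.words('english'); NLTK's list is much longer.
base_stopwords = {'the', 'was', 'a', 'is', 'not', 'very', 'and', 'it', 'too'}

# Sentiment-bearing words that should survive the filter (a subset of the article's list).
keep_words = {'no', 'not', 'too', 'very'}

stopset = base_stopwords - keep_words

def filter_stopwords(words):
    """Drop stopwords, but keep the words removed from the stopword set."""
    return [w for w in words if w not in stopset]

review = 'the food was not very good'.split()
filtered = filter_stopwords(review)
print(filtered)  # ['food', 'not', 'very', 'good']
```

With the unmodified list, "not" and "very" would be dropped and the review would reduce to "food good", which reads as positive.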

One quarter (1/4) of the data is used as the test feature set and the remaining three quarters (3/4) as the training feature set.

testset = 25% of pos_data + 25% of neg_data
trainset = 75% of pos_data + 75% of neg_data


negfeats = [(featx(f), 'neg') for f in word_split(negdata)]
posfeats = [(featx(f), 'pos') for f in word_split(posdata)]

negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
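The split can be checked on dummy data. The following sketch (Python 3) uses a hypothetical helper `split_features` and placeholder feature dictionaries in place of the real review features; it only illustrates the 75/25 slicing.

```python
def split_features(feats, num=3, den=4):
    """Split a list into the first num/den part (train) and the rest (test)."""
    cutoff = len(feats) * num // den  # integer cutoff, so it can be used as a slice index
    return feats[:cutoff], feats[cutoff:]

# Placeholder (feature_dict, label) pairs standing in for the 500 + 500 reviews.
negfeats = [({'word%d' % i: True}, 'neg') for i in range(500)]
posfeats = [({'word%d' % i: True}, 'pos') for i in range(500)]

neg_train, neg_test = split_features(negfeats)
pos_train, pos_test = split_features(posfeats)

trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test
print(len(trainfeats), len(testfeats))  # 750 250
```

Splitting each class separately keeps the training and test sets balanced between positive and negative reviews.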

Classification is done using three different classifiers, each trained and evaluated on the same training and test feature sets.

Training Naive Bayes Classifier


classifier = NaiveBayesClassifier.train(trainfeats)

Training Maximum Entropy Classifier

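I have used the Generalized Iterative Scaling (GIS) algorithm. The other algorithms available are Improved Iterative Scaling (IIS) and the LM-BFGS algorithm, with training performed by Megam (megam). Note that the call below sets max_iter = 1, so training stops after a single iteration; a larger value may give different results. See more at: http://www.nltk.org/_modules/nltk/classify/maxent.html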


classifier = MaxentClassifier.train(trainfeats, 'GIS', trace=0, encoding=None, labels=None, sparse=True, gaussian_prior_sigma=0, max_iter = 1)

Training Support Vector Machine Classifier

I have used the Linear Support Vector Classification (LinearSVC) model from scikit-learn. BernoulliNB and LogisticRegression can also be used in place of LinearSVC. See more at: http://www.nltk.org/_modules/nltk/classify/scikitlearn.html


classifier = SklearnClassifier(LinearSVC(), sparse=False)
classifier.train(trainfeats)

Classification accuracy is measured in terms of general Accuracy, Precision, Recall, and F-measure.
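As a worked example of these measures, here is a minimal sketch in Python 3 that computes them from sets of item indices, in the same reference-set/test-set style the full code uses. The gold labels and predictions are made up for illustration, and the three helper functions are plain re-implementations of the standard definitions, not NLTK's own code; empty sets are not handled.

```python
from collections import defaultdict

def precision(reference, test):
    """Share of items predicted with a label that really carry that label."""
    return len(reference & test) / len(test)

def recall(reference, test):
    """Share of items carrying a label that were predicted with it."""
    return len(reference & test) / len(reference)

def f_measure(reference, test):
    """Harmonic mean of precision and recall."""
    p = precision(reference, test)
    r = recall(reference, test)
    return 2 * p * r / (p + r)

# Hypothetical gold labels and classifier outputs for eight test items.
gold      = ['pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg']
predicted = ['pos', 'pos', 'pos', 'neg', 'pos', 'pos', 'neg', 'neg']

refsets = defaultdict(set)   # label -> indices that truly have the label
testsets = defaultdict(set)  # label -> indices predicted to have the label
for i, (label, observed) in enumerate(zip(gold, predicted)):
    refsets[label].add(i)
    testsets[observed].add(i)

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
pos_precision = precision(refsets['pos'], testsets['pos'])
pos_recall = recall(refsets['pos'], testsets['pos'])
neg_precision = precision(refsets['neg'], testsets['neg'])
neg_recall = recall(refsets['neg'], testsets['neg'])

print('accuracy:', accuracy)            # 0.625
print('pos precision:', pos_precision)  # 0.6
print('pos recall:', pos_recall)        # 0.75
print('avg precision:', (pos_precision + neg_precision) / 2)
print('avg recall:', (pos_recall + neg_recall) / 2)
print('pos f-measure:', f_measure(refsets['pos'], testsets['pos']))
```

The single precision, recall, and F-measure figures reported later in the article are the unweighted average of the 'pos' and 'neg' values, as in the last lines above.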

The evaluation is also done using cross-validation. First, the positive and negative features are combined and randomly shuffled. Shuffling is necessary because negfeats + posfeats places all negative reviews before all positive ones, so without it a test chunk could contain only negative or only positive data. In the code below, n is the number of folds; n = 5 means 5-fold cross-validation.


trainfeats = negfeats + posfeats	
random.shuffle(trainfeats)	
n = 5
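The fold slicing itself can be sketched as follows (Python 3). The generator `k_fold_splits` is a hypothetical helper that mirrors the slicing used in the full code; the 20 dummy items and the fixed random seed are assumptions made only to keep the example small and reproducible.

```python
import random

def k_fold_splits(items, n):
    """Yield (train, test) pairs; each fold uses one contiguous chunk as the test set."""
    subset_size = len(items) // n
    for fold in range(n):
        start = fold * subset_size
        end = start + subset_size
        test = items[start:end]
        train = items[:start] + items[end:]
        yield train, test

# Dummy labelled items: ten negatives followed by ten positives.
data = [(i, 'neg') for i in range(10)] + [(i, 'pos') for i in range(10, 20)]

random.seed(0)        # fixed seed only so the sketch is reproducible
random.shuffle(data)  # mix the classes before cutting the folds

folds = list(k_fold_splits(data, 5))
for train, test in folds:
    print(len(train), len(test))  # 16 4 for every fold
```

If the number of items is not divisible by n, the leftover items at the end stay in every training set and are never tested, which is also how the slicing in the full code behaves.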

The evaluation can be done using different feature sets like all words feature set, all words feature set with stopword filter, bigram word feature set, and bigram word feature set with stopword filter.


evaluate_classifier(word_feats)
evaluate_classifier(stopword_filtered_word_feats)
evaluate_classifier(bigram_word_feats)	
evaluate_classifier(bigram_word_feats_stopwords)
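To make the shape of these feature sets concrete, here is a simplified sketch in Python 3 with no NLTK dependency. `word_feats` matches the bag-of-words extractor in the full code. `all_bigram_word_feats` is a hypothetical simplification: it keeps every adjacent word pair, whereas the article's `bigram_word_feats` keeps only the 200 highest-scoring collocations by chi-square.

```python
def word_feats(words):
    """Bag of words: every word present in the review becomes a True feature."""
    return {word: True for word in words}

def all_bigram_word_feats(words):
    """Single words plus every adjacent word pair (no collocation scoring)."""
    feats = word_feats(words)
    # zip pairs each word with its successor, giving the adjacent bigrams
    feats.update({bigram: True for bigram in zip(words, words[1:])})
    return feats

words = 'not good at all'.split()

print(word_feats(words))
# {'not': True, 'good': True, 'at': True, 'all': True}

bigram_keys = sorted(k for k in all_bigram_word_feats(words) if isinstance(k, tuple))
print(bigram_keys)
# [('at', 'all'), ('good', 'at'), ('not', 'good')]
```

The bigram ('not', 'good') is the kind of feature that single words cannot express: "good" alone suggests a positive review, while the pair points the other way.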

Here is the full code:


import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier, MaxentClassifier, SklearnClassifier
import csv
from sklearn.svm import LinearSVC
import random
from nltk.corpus import stopwords
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Read the review text (first column) from each row of the positive CSV file
posdata = []
with open('positive-data.csv', 'r') as myfile:
	reader = csv.reader(myfile, delimiter=',')
	for val in reader:
		posdata.append(val[0])

# Do the same for the negative reviews
negdata = []
with open('negative-data.csv', 'r') as myfile:
	reader = csv.reader(myfile, delimiter=',')
	for val in reader:
		negdata.append(val[0])

def word_split(data):	
	data_new = []
	for word in data:
		word_filter = [i.lower() for i in word.split()]
		data_new.append(word_filter)
	return data_new

def word_split_sentiment(data):
	data_new = []
	for (word, sentiment) in data:
		word_filter = [i.lower() for i in word.split()]
		data_new.append((word_filter, sentiment))
	return data_new
	
def word_feats(words):	
	return dict([(word, True) for word in words])

stopset = set(stopwords.words('english')) - set(('over', 'under', 'below', 'more', 'most', 'no', 'not', 'only', 'such', 'few', 'so', 'too', 'very', 'just', 'any', 'once'))
     
def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # Keep the n highest-scoring bigram collocations alongside the single words
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
    
def bigram_word_feats_stopwords(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    # Bigrams are tuples, so this membership test only ever removes single-word stopwords
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams) if ngram not in stopset])

# Calculating Precision, Recall & F-measure
def evaluate_classifier(featx):
	
	negfeats = [(featx(f), 'neg') for f in word_split(negdata)]
	posfeats = [(featx(f), 'pos') for f in word_split(posdata)]
	    
	negcutoff = len(negfeats) * 3 // 4
	poscutoff = len(posfeats) * 3 // 4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
	
	# using 3 classifiers
	classifier_list = ['nb', 'maxent', 'svm'] 	
		
	for cl in classifier_list:
		if cl == 'maxent':
			classifierName = 'Maximum Entropy'
			classifier = MaxentClassifier.train(trainfeats, 'GIS', trace=0, encoding=None, labels=None, sparse=True, gaussian_prior_sigma=0, max_iter = 1)
		elif cl == 'svm':
			classifierName = 'SVM'
			classifier = SklearnClassifier(LinearSVC(), sparse=False)
			classifier.train(trainfeats)
		else:
			classifierName = 'Naive Bayes'
			classifier = NaiveBayesClassifier.train(trainfeats)
			
		refsets = collections.defaultdict(set)
		testsets = collections.defaultdict(set)

		for i, (feats, label) in enumerate(testfeats):
				refsets[label].add(i)
				observed = classifier.classify(feats)
				testsets[observed].add(i)

		accuracy = nltk.classify.util.accuracy(classifier, testfeats)
		pos_precision = nltk.metrics.precision(refsets['pos'], testsets['pos'])
		pos_recall = nltk.metrics.recall(refsets['pos'], testsets['pos'])
		pos_fmeasure = nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
		neg_precision = nltk.metrics.precision(refsets['neg'], testsets['neg'])
		neg_recall = nltk.metrics.recall(refsets['neg'], testsets['neg'])
		neg_fmeasure =  nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
		
		print('')
		print('---------------------------------------')
		print('SINGLE FOLD RESULT (' + classifierName + ')')
		print('---------------------------------------')
		# Precision, recall and F-measure are averaged over the 'pos' and 'neg' classes
		print('accuracy: %s' % accuracy)
		print('precision %s' % ((pos_precision + neg_precision) / 2))
		print('recall %s' % ((pos_recall + neg_recall) / 2))
		print('f-measure %s' % ((pos_fmeasure + neg_fmeasure) / 2))
				
		#classifier.show_most_informative_features()
	
	print('')
	
	## CROSS VALIDATION
	
	trainfeats = negfeats + posfeats	
	
	# SHUFFLE TRAIN SET
	# As in cross validation, the test chunk might have only negative or only positive data	
	random.shuffle(trainfeats)	
	n = 5 # 5-fold cross-validation	
	
	for cl in classifier_list:
		
		subset_size = len(trainfeats) // n
		accuracy = []
		pos_precision = []
		pos_recall = []
		neg_precision = []
		neg_recall = []
		pos_fmeasure = []
		neg_fmeasure = []
		cv_count = 1
		for i in range(n):		
			testing_this_round = trainfeats[i*subset_size:][:subset_size]
			training_this_round = trainfeats[:i*subset_size] + trainfeats[(i+1)*subset_size:]
			
			if cl == 'maxent':
				classifierName = 'Maximum Entropy'
				classifier = MaxentClassifier.train(training_this_round, 'GIS', trace=0, encoding=None, labels=None, sparse=True, gaussian_prior_sigma=0, max_iter = 1)
			elif cl == 'svm':
				classifierName = 'SVM'
				classifier = SklearnClassifier(LinearSVC(), sparse=False)
				classifier.train(training_this_round)
			else:
				classifierName = 'Naive Bayes'
				classifier = NaiveBayesClassifier.train(training_this_round)
					
			refsets = collections.defaultdict(set)
			testsets = collections.defaultdict(set)
			for j, (feats, label) in enumerate(testing_this_round):
				refsets[label].add(j)
				observed = classifier.classify(feats)
				testsets[observed].add(j)
			
			cv_accuracy = nltk.classify.util.accuracy(classifier, testing_this_round)
			cv_pos_precision = nltk.metrics.precision(refsets['pos'], testsets['pos'])
			cv_pos_recall = nltk.metrics.recall(refsets['pos'], testsets['pos'])
			cv_pos_fmeasure = nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
			cv_neg_precision = nltk.metrics.precision(refsets['neg'], testsets['neg'])
			cv_neg_recall = nltk.metrics.recall(refsets['neg'], testsets['neg'])
			cv_neg_fmeasure =  nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
					
			accuracy.append(cv_accuracy)
			pos_precision.append(cv_pos_precision)
			pos_recall.append(cv_pos_recall)
			neg_precision.append(cv_neg_precision)
			neg_recall.append(cv_neg_recall)
			pos_fmeasure.append(cv_pos_fmeasure)
			neg_fmeasure.append(cv_neg_fmeasure)
			
			cv_count += 1
				
		print('---------------------------------------')
		print('N-FOLD CROSS VALIDATION RESULT (' + classifierName + ')')
		print('---------------------------------------')
		# Each measure is first averaged over the n folds, then over the two classes
		print('accuracy: %s' % (sum(accuracy) / n))
		print('precision %s' % ((sum(pos_precision)/n + sum(neg_precision)/n) / 2))
		print('recall %s' % ((sum(pos_recall)/n + sum(neg_recall)/n) / 2))
		print('f-measure %s' % ((sum(pos_fmeasure)/n + sum(neg_fmeasure)/n) / 2))
		print('')
	
		
evaluate_classifier(word_feats)
#evaluate_classifier(stopword_filtered_word_feats)
#evaluate_classifier(bigram_word_feats)	
#evaluate_classifier(bigram_word_feats_stopwords)

Here is the result obtained using the all-words feature set, evaluate_classifier(word_feats):

---------------------------------------
SINGLE FOLD RESULT (Naive Bayes)
---------------------------------------
accuracy: 0.712
precision 0.808857808858
recall 0.712
f-measure 0.6875

---------------------------------------
SINGLE FOLD RESULT (Maximum Entropy)
---------------------------------------
accuracy: 0.696
precision 0.801753867376
recall 0.696
f-measure 0.666806958474

---------------------------------------
SINGLE FOLD RESULT (SVM)
---------------------------------------
accuracy: 0.884
precision 0.884221311475
recall 0.884
f-measure 0.883983293594

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (Naive Bayes)
---------------------------------------
accuracy: 0.742
precision 0.820669619013
recall 0.742023637087
f-measure 0.724301799825

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (Maximum Entropy)
---------------------------------------
accuracy: 0.723
precision 0.808616815505
recall 0.725142220446
f-measure 0.702686890214

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (SVM)
---------------------------------------
accuracy: 0.855
precision 0.854878928286
recall 0.855295825428
f-measure 0.854608585556

Here is the result obtained using the bigram words feature set, evaluate_classifier(bigram_word_feats):

---------------------------------------
SINGLE FOLD RESULT (Naive Bayes)
---------------------------------------
accuracy: 0.812
precision 0.863372093023
recall 0.812
f-measure 0.805111874077

---------------------------------------
SINGLE FOLD RESULT (Maximum Entropy)
---------------------------------------
accuracy: 0.784
precision 0.849162011173
recall 0.784
f-measure 0.773429108485

---------------------------------------
SINGLE FOLD RESULT (SVM)
---------------------------------------
accuracy: 0.884
precision 0.884024577573
recall 0.884
f-measure 0.88399814397

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (Naive Bayes)
---------------------------------------
accuracy: 0.827
precision 0.861070429762
recall 0.827596155604
f-measure 0.822565942413

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (Maximum Entropy)
---------------------------------------
accuracy: 0.8
precision 0.84954715877
recall 0.802450691272
f-measure 0.792653687642

---------------------------------------
N-FOLD CROSS VALIDATION RESULT (SVM)
---------------------------------------
accuracy: 0.868
precision 0.86793244094
recall 0.868212258492
f-measure 0.867717178745

The results can probably be improved by increasing the size of the dataset. Currently, it contains a total of 1000 reviews (500 positive and 500 negative). This number can be increased to see whether more data improves the accuracy.

The accuracy may also be improved by using only the best words and best bigrams as the feature set instead of all words and all bigrams. ‘Best’ means the most frequently occurring words or bigrams. This approach of eliminating low-information features (that is, removing noisy data) is a kind of dimensionality reduction. Here is a good tutorial on eliminating low information features by creating a feature set of best words and best bigrams.
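A minimal sketch of the frequency-based idea, in Python 3, could look like this. The helper names `most_frequent_words` and `best_word_feats` and the three toy reviews are hypothetical; the sketch ranks words purely by count, as described above, and other scoring measures could be swapped in.

```python
from collections import Counter

def most_frequent_words(documents, n):
    """Return the set of the n most frequent words across all tokenised documents."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, _ in counts.most_common(n)}

def best_word_feats(words, bestwords):
    """Bag-of-words features restricted to the selected 'best' words."""
    return {word: True for word in words if word in bestwords}

# Toy tokenised reviews, made up for illustration.
docs = [
    ['great', 'food', 'great', 'service'],
    ['bad', 'food'],
    ['great', 'place'],
]

best = most_frequent_words(docs, 2)  # 'great' occurs 3 times, 'food' twice
feats = best_word_feats(['great', 'place', 'food'], best)
print(feats)  # {'great': True, 'food': True}
```

In practice the word counts would be computed from the training reviews only, so that the selection does not use information from the test set.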

Hope this helps.
Thanks.