
Machine Learning & Sentiment Analysis: Text Classification using Python & NLTK

This article deals with using different feature sets to train three different classifiers [Naive Bayes Classifier, Maximum Entropy (MaxEnt) Classifier, and Support Vector Machine (SVM) Classifier].

The Bag of Words, Stopword Filtering, and Bigram Collocations methods are used for feature set generation.

Text reviews from the Yelp Academic Dataset are used to create the training dataset.

Cross-validation is also done in the evaluation process.

The code used in this article is based upon this article from StreamHacker.

The Python programming language is used along with Python's NLTK (Natural Language Toolkit) library.

I selected 500 positive reviews (reviews with a 5-star rating) and 500 negative reviews (reviews with a 1-star rating) from the Yelp dataset. Positive reviews are kept in a CSV file named positive-data.csv and negative reviews are kept in a CSV file named negative-data.csv.

Download: Positive and Negative Training Data
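
The loading step itself is not shown here, so the following is only a minimal sketch of how the two CSV files could be read, assuming one review per row with the review text in the first column (the helper name load_reviews and that column layout are assumptions):

import csv

def load_reviews(filename):
    # Read one review per row and split it into lowercase tokens.
    # Assumes the review text is in the first column of the CSV file.
    reviews = []
    with open(filename, 'r') as f:
        for row in csv.reader(f):
            if row:
                reviews.append(row[0].lower().split())
    return reviews

pos_data = load_reviews('positive-data.csv')
neg_data = load_reviews('negative-data.csv')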

Some words (e.g. no, not, more, most, below, over, too, very) have been removed from the standard stopword list available in NLTK. This is done because those words can carry sentiment in our review dataset.
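
A minimal sketch of how such a customized stopword list could be built; the words kept are the examples listed above, and the variable name stopset is an assumption:

from nltk.corpus import stopwords

# Start from NLTK's English stopword list, but keep words that can
# carry sentiment in reviews.
sentiment_words = {'no', 'not', 'more', 'most', 'below', 'over', 'too', 'very'}
stopset = set(stopwords.words('english')) - sentiment_words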

1/4 (one-fourth) of the data is used as the test feature set and the remaining 3/4 (three-fourths) is used as the training feature set, as sketched below.

testset = 25% of pos_data + 25% of neg_data
trainset = 75% of pos_data + 75% of neg_data
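
A sketch of that split, assuming pos_feats and neg_feats are lists of (feature dict, label) pairs produced by a feature extractor (those variable names are assumptions):

# Use the first 3/4 of each class for training and the last 1/4 for testing.
pos_cutoff = len(pos_feats) * 3 // 4
neg_cutoff = len(neg_feats) * 3 // 4

trainfeats = pos_feats[:pos_cutoff] + neg_feats[:neg_cutoff]
testfeats = pos_feats[pos_cutoff:] + neg_feats[neg_cutoff:]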

Evaluation is done by training three different classifiers on the same feature sets and comparing their results.

Training Naive Bayes Classifier
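
Training the Naive Bayes classifier is a single call in NLTK; a minimal sketch, using the trainfeats list from the split above:

from nltk.classify import NaiveBayesClassifier

# Train on the (feature dict, label) pairs built above.
nb_classifier = NaiveBayesClassifier.train(trainfeats)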

Training Maximum Entropy Classifier

I have used the Generalized Iterative Scaling (GIS) algorithm. The other algorithms available are Improved Iterative Scaling (IIS) and the LM-BFGS algorithm, with training performed by MEGAM (megam). See more at: http://www.nltk.org/_modules/nltk/classify/maxent.html
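
A sketch of the MaxEnt training call with the GIS algorithm; the max_iter value is illustrative and only bounds training time:

from nltk.classify import MaxentClassifier

# algorithm='gis' selects Generalized Iterative Scaling;
# 'iis' and 'megam' are the other options mentioned above.
me_classifier = MaxentClassifier.train(trainfeats, algorithm='gis',
                                       trace=0, max_iter=10)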

Training Support Vector Machine Classifier

I have used the Linear Support Vector Classification (LinearSVC) model. BernoulliNB and LogisticRegression can also be used in place of LinearSVC. See more at: http://www.nltk.org/_modules/nltk/classify/scikitlearn.html
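
A sketch of the SVM training call, wrapping scikit-learn's LinearSVC in NLTK's SklearnClassifier so that it accepts the same feature dicts as the other classifiers:

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

# LinearSVC could be swapped for BernoulliNB or LogisticRegression.
svm_classifier = SklearnClassifier(LinearSVC()).train(trainfeats)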

Classification performance is measured in terms of Accuracy, Precision, Recall, and F-measure.
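
A sketch of how these metrics can be computed with NLTK for one classifier; the 'pos' and 'neg' label strings are assumptions, and the single precision/recall figures reported below are presumably aggregated over both labels:

import collections
import nltk.classify.util
from nltk.metrics.scores import precision, recall, f_measure

# Group gold labels and predicted labels by label name.
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    testsets[nb_classifier.classify(feats)].add(i)

print('accuracy:', nltk.classify.util.accuracy(nb_classifier, testfeats))
print('pos precision:', precision(refsets['pos'], testsets['pos']))
print('pos recall:', recall(refsets['pos'], testsets['pos']))
print('pos f-measure:', f_measure(refsets['pos'], testsets['pos']))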

The evaluation is also done using cross-validation. In this process, the positive and negative features are first combined, and the combined list is then randomly shuffled. This is necessary because, without shuffling, a test chunk in cross-validation might contain only negative or only positive data. n in the code below indicates the number of folds; n = 5 means 5-fold cross-validation.
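
A sketch of the cross-validation loop, using the Naive Bayes classifier as an example; the same loop is repeated for the other two classifiers, and the function name cross_validate is an assumption:

import random
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

def cross_validate(features, n=5):
    # Shuffle the combined positive + negative features so that every
    # fold contains a mix of both classes.
    random.shuffle(features)
    subset_size = len(features) // n
    accuracies = []
    for i in range(n):
        test_chunk = features[i * subset_size:(i + 1) * subset_size]
        train_chunk = features[:i * subset_size] + features[(i + 1) * subset_size:]
        classifier = NaiveBayesClassifier.train(train_chunk)
        accuracies.append(nltk.classify.util.accuracy(classifier, test_chunk))
    return sum(accuracies) / n

print(cross_validate(pos_feats + neg_feats, n=5))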

The evaluation can be done using different feature sets: all words, all words with stopword filtering, bigram words, and bigram words with stopword filtering.
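
Sketches of these feature extractors, in the spirit of the StreamHacker code this article is based on; stopset is the customized stopword list built earlier, and the cutoff of 200 bigrams is illustrative:

import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def word_feats(words):
    # Plain bag of words: every token becomes a boolean feature.
    return dict([(word, True) for word in words])

def stopword_filtered_word_feats(words):
    # Bag of words with the customized stopword list removed.
    return dict([(word, True) for word in words if word not in stopset])

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # Top-n bigram collocations (scored by chi-squared) plus all single words.
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

# The (feature dict, label) pairs used in the train/test split are built as:
pos_feats = [(word_feats(words), 'pos') for words in pos_data]
neg_feats = [(word_feats(words), 'neg') for words in neg_data]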

The full code combines these pieces into an evaluate_classifier() function that takes a feature extractor, trains the three classifiers, and prints both the single-fold and the cross-validation metrics.

Here are the results obtained using the all words feature set, i.e. evaluate_classifier(word_feats).

—————————————
SINGLE FOLD RESULT (Naive Bayes)
—————————————
accuracy: 0.712
precision 0.808857808858
recall 0.712
f-measure 0.6875

—————————————
SINGLE FOLD RESULT (Maximum Entropy)
—————————————
accuracy: 0.696
precision 0.801753867376
recall 0.696
f-measure 0.666806958474

—————————————
SINGLE FOLD RESULT (SVM)
—————————————
accuracy: 0.884
precision 0.884221311475
recall 0.884
f-measure 0.883983293594

—————————————
N-FOLD CROSS VALIDATION RESULT (Naive Bayes)
—————————————
accuracy: 0.742
precision 0.820669619013
recall 0.742023637087
f-measure 0.724301799825

—————————————
N-FOLD CROSS VALIDATION RESULT (Maximum Entropy)
—————————————
accuracy: 0.723
precision 0.808616815505
recall 0.725142220446
f-measure 0.702686890214

—————————————
N-FOLD CROSS VALIDATION RESULT (SVM)
—————————————
accuracy: 0.855
precision 0.854878928286
recall 0.855295825428
f-measure 0.854608585556

Here are the results obtained using the bigram words feature set, i.e. evaluate_classifier(bigram_word_feats).

—————————————
SINGLE FOLD RESULT (Naive Bayes)
—————————————
accuracy: 0.812
precision 0.863372093023
recall 0.812
f-measure 0.805111874077

—————————————
SINGLE FOLD RESULT (Maximum Entropy)
—————————————
accuracy: 0.784
precision 0.849162011173
recall 0.784
f-measure 0.773429108485

—————————————
SINGLE FOLD RESULT (SVM)
—————————————
accuracy: 0.884
precision 0.884024577573
recall 0.884
f-measure 0.88399814397

—————————————
N-FOLD CROSS VALIDATION RESULT (Naive Bayes)
—————————————
accuracy: 0.827
precision 0.861070429762
recall 0.827596155604
f-measure 0.822565942413

—————————————
N-FOLD CROSS VALIDATION RESULT (Maximum Entropy)
—————————————
accuracy: 0.8
precision 0.84954715877
recall 0.802450691272
f-measure 0.792653687642

—————————————
N-FOLD CROSS VALIDATION RESULT (SVM)
—————————————
accuracy: 0.868
precision 0.86793244094
recall 0.868212258492
f-measure 0.867717178745

The results can be improved by increasing the size of the training dataset. Currently, the dataset contains a total of 1000 reviews (500 positive and 500 negative). This number can be increased to see whether a larger dataset improves accuracy.

The accuracy can also be improved by using the best words and best bigrams as the feature set instead of all words and all bigrams, where 'best' means the most informative words or bigrams. This approach of eliminating low-information features (i.e., removing noisy data) is a kind of dimensionality reduction. Here is a good tutorial on eliminating low information features by creating a feature set of best words and best bigrams.
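
A sketch of that idea for words, following that tutorial's approach: each word is scored by its chi-squared association with the two classes and only the highest-scoring words are kept as features. The cutoff of 1000 words and the helper name best_word_feats are illustrative:

from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import BigramAssocMeasures

# Count how often each word occurs overall and within each class.
word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()
for words in pos_data:
    for word in words:
        word_fd[word] += 1
        label_word_fd['pos'][word] += 1
for words in neg_data:
    for word in words:
        word_fd[word] += 1
        label_word_fd['neg'][word] += 1

pos_count = label_word_fd['pos'].N()
neg_count = label_word_fd['neg'].N()
total_count = pos_count + neg_count

# Score each word by how strongly it is associated with either class.
word_scores = {}
for word, freq in word_fd.items():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                                           (freq, pos_count), total_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                                           (freq, neg_count), total_count)
    word_scores[word] = pos_score + neg_score

best_words = set(sorted(word_scores, key=word_scores.get, reverse=True)[:1000])

def best_word_feats(words):
    # Keep only the high-information words as features.
    return dict([(word, True) for word in words if word in best_words])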

Hope this helps.
Thanks.
