Python NLTK: Twitter Sentiment Analysis [Natural Language Processing (NLP)]

This article shows how you can perform sentiment analysis on Twitter tweets using Python and Natural Language Toolkit (NLTK).

Sentiment Analysis means analyzing the sentiment of a given text or document and categorizing it into a specific class or category (like positive and negative). In its basic form, the classification is done for two classes: positive and negative. However, we can add more classes like neutral, highly positive, highly negative, etc.

Sentiment Analysis is also referred to as Opinion Mining. It’s mostly applied to social media and customer review data.

Supervised Classification

Here, we will be doing supervised text classification. In supervised classification, the classifier is trained with labeled training data.

In this article, we will use NLTK’s twitter_samples corpus as our labeled training data. The twitter_samples corpus contains tweets annotated with sentiment polarity (5k positive and 5k negative), plus a larger sample of unannotated tweets.

Here, we have two categories for classification: positive and negative. The twitter_samples corpus already has its tweets categorized as positive and negative.

The twitter_samples corpus contains 3 files.

1) negative_tweets.json: contains 5k negative tweets
2) positive_tweets.json: contains 5k positive tweets
3) tweets.20150430-223406.json: contains 20k tweets collected from the Twitter public stream


from nltk.corpus import twitter_samples
print (twitter_samples.fileids())
'''
Output:

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
'''

pos_tweets = twitter_samples.strings('positive_tweets.json')
print (len(pos_tweets)) # Output: 5000

neg_tweets = twitter_samples.strings('negative_tweets.json')
print (len(neg_tweets)) # Output: 5000

all_tweets = twitter_samples.strings('tweets.20150430-223406.json')
print (len(all_tweets)) # Output: 20000

for tweet in pos_tweets[:5]:
    print (tweet)
'''
Output:

#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!
@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!
@97sides CONGRATS :)
yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days
'''

Tokenize Tweets

NLTK has a TweetTokenizer module that does a good job of tokenizing (splitting text into a list of tokens) tweets.

Three parameters can be passed when creating a TweetTokenizer instance:

preserve_case: if False, the tokenizer converts the tweet to lowercase; if True (the default), the original case is preserved.
strip_handles: if True, the tokenizer removes Twitter handles (e.g. @username) from the tweet.
reduce_len: if True, the tokenizer shortens elongated words like hurrayyyy and yipppiieeee by limiting runs of a repeated character to three.
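
To see what each flag does in isolation, here is a quick illustration (the sample tweet below is made up):


from nltk.tokenize import TweetTokenizer

sample = "@someuser Hellooooo World!!! #nlp" # made-up tweet for illustration

# default settings keep the case, the handle, and the elongated word
print (TweetTokenizer().tokenize(sample))
# Output: ['@someuser', 'Hellooooo', 'World', '!', '!', '!', '#nlp']

# preserve_case=False lowercases every token
print (TweetTokenizer(preserve_case=False).tokenize(sample))
# Output: ['@someuser', 'hellooooo', 'world', '!', '!', '!', '#nlp']

# strip_handles=True drops @someuser; reduce_len=True trims the run of o's to three
print (TweetTokenizer(strip_handles=True, reduce_len=True).tokenize(sample))
# Output: ['Hellooo', 'World', '!', '!', '!', '#nlp']

Below, we apply all three settings together to the positive tweets loaded earlier.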


from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

for tweet in pos_tweets[:5]:
    print (tweet_tokenizer.tokenize(tweet))
'''
Output:

['#followfriday', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']
['hey', 'james', '!', 'how', 'odd', ':/', 'please', 'call', 'our', 'contact', 'centre', 'on', '02392441234', 'and', 'we', 'will', 'be', 'able', 'to', 'assist', 'you', ':)', 'many', 'thanks', '!']
['we', 'had', 'a', 'listen', 'last', 'night', ':)', 'as', 'you', 'bleed', 'is', 'an', 'amazing', 'track', '.', 'when', 'are', 'you', 'in', 'scotland', '?', '!']
['congrats', ':)']
['yeaaah', 'yipppy', '!', '!', '!', 'my', 'accnt', 'verified', 'rqst', 'has', 'succeed', 'got', 'a', 'blue', 'tick', 'mark', 'on', 'my', 'fb', 'profile', ':)', 'in', '15', 'days']
'''

Cleaning Tweets

In the tweet cleaning process, we will do the following:

– Remove stock market tickers like $GE
– Remove retweet text “RT”
– Remove hyperlinks
– Remove hashtags (only the hashtag # and not the word)
– Remove stop words like a, and, the, is, are, etc.
– Remove emoticons like :), :D, :(, :-), etc.
– Remove punctuation like full-stop, comma, exclamation sign, etc.
– Convert words to their stem/base form using the Porter stemming algorithm, e.g. words like ‘working’, ‘works’, and ‘worked’ are all reduced to their base/stem word “work” (see the quick check after this list)
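
As a quick check of the stemming step, here is the Porter stemmer on its own (a minimal sketch using the example words above):


from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['working', 'works', 'worked']:
    print (stemmer.stem(word)) # Output: work (printed for all three words)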

We will define a function named clean_tweets that applies all of the above steps and returns a list of cleaned words for any given tweet.


import string
import re

from nltk.corpus import stopwords 
stopwords_english = stopwords.words('english')

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

from nltk.tokenize import TweetTokenizer

# Happy Emoticons
emoticons_happy = set([
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'
    ])

# Sad Emoticons
emoticons_sad = set([
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('
    ])

# all emoticons (happy + sad)
emoticons = emoticons_happy.union(emoticons_sad)

def clean_tweets(tweet):
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)

    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)

    # remove hyperlinks
    tweet = re.sub(r'https?://\S+', '', tweet)
    
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)

    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and   # remove stopwords
                word not in emoticons and       # remove emoticons
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stem the word
            tweets_clean.append(stem_word)

    return tweets_clean

custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

# print cleaned tweet
print (clean_tweets(custom_tweet))
'''
Output:

['hello', 'great', 'day', 'good', 'morning']
'''

print (pos_tweets[5])
'''
Output:

@BhaktisBanter @PallaviRuhail This one is irresistible :)
#FlipkartFashionFriday http://t.co/EbZ0L2VENM
'''

print (clean_tweets(pos_tweets[5]))
'''
Output:

['one', 'irresistible', 'flipkartfashionfriday']
'''

Feature Extraction

We define a simple bag_of_words function that extracts unigram features from the tweets.


# feature extractor function
def bag_of_words(tweet):
    words = clean_tweets(tweet)
    words_dictionary = {word: True for word in words}
    return words_dictionary

custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
print (bag_of_words(custom_tweet))
'''
Output:

{'great': True, 'good': True, 'morning': True, 'hello': True, 'day': True}
'''

# positive tweets feature set
pos_tweets_set = []
for tweet in pos_tweets:
    pos_tweets_set.append((bag_of_words(tweet), 'pos')) 

# negative tweets feature set
neg_tweets_set = []
for tweet in neg_tweets:
    neg_tweets_set.append((bag_of_words(tweet), 'neg'))

print (len(pos_tweets_set), len(neg_tweets_set)) # Output: 5000 5000

Create Train and Test Set

There are 5000 positive tweets and 5000 negative tweets. We take 20% (i.e. 1000) of the positive tweets and 20% (i.e. 1000) of the negative tweets as the test set. The remaining positive and negative tweets form the training set.


# randomize pos_tweets_set and neg_tweets_set
# doing so will output a different accuracy result every time we run the program
from random import shuffle
shuffle(pos_tweets_set)
shuffle(neg_tweets_set)

test_set = pos_tweets_set[:1000] + neg_tweets_set[:1000]
train_set = pos_tweets_set[1000:] + neg_tweets_set[1000:]

print (len(test_set), len(train_set)) # Output: 2000 8000
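
If you want the same split (and hence the same accuracy) on every run, you can seed the random number generator before shuffling. A minimal sketch, assuming the arbitrary seed value 10:


from random import seed, shuffle

seed(10) # any fixed integer makes shuffle() deterministic across runs
shuffle(pos_tweets_set)
shuffle(neg_tweets_set)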

Training Classifier and Calculating Accuracy

We train a Naive Bayes classifier using the training set and calculate the classification accuracy of the trained classifier on the test set.


from nltk import classify
from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_set)

accuracy = classify.accuracy(classifier, test_set)
print(accuracy) # Output: 0.765

classifier.show_most_informative_features(10)
'''
Output:

Most Informative Features
                     via = True              pos : neg    =     37.0 : 1.0
                    glad = True              pos : neg    =     25.0 : 1.0
                     sad = True              neg : pos    =     22.6 : 1.0
                      aw = True              neg : pos    =     21.7 : 1.0
                     bam = True              pos : neg    =     21.0 : 1.0
                     x15 = True              neg : pos    =     19.7 : 1.0
                 appreci = True              pos : neg    =     17.7 : 1.0
                   arriv = True              pos : neg    =     15.0 : 1.0
                     ugh = True              neg : pos    =     14.3 : 1.0
                  justin = True              neg : pos    =     13.0 : 1.0
'''
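
In this table, the ratio on each row compares how strongly a feature is associated with each label. For example, via = True with pos : neg = 37.0 : 1.0 means that tweets containing the word “via” are about 37 times more likely to be positive than negative in the training data.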

Testing Classifier with Custom Tweet

We provide custom tweets and check the classification output of the trained classifier. The classifier correctly predicts both the negative and the positive tweets provided.


custom_tweet = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_tweet_set = bag_of_words(custom_tweet)
print (classifier.classify(custom_tweet_set)) # Output: neg
# Negative tweet correctly classified as negative

# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: neg
print (prob_result.prob("neg")) # Output: 0.941844352481
print (prob_result.prob("pos")) # Output: 0.0581556475194


custom_tweet = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_tweet_set = bag_of_words(custom_tweet)

print (classifier.classify(custom_tweet_set)) # Output: pos
# Positive tweet correctly classified as positive

# probability result
prob_result = classifier.prob_classify(custom_tweet_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: pos
print (prob_result.prob("neg")) # Output: 0.00131055449755
print (prob_result.prob("pos")) # Output: 0.998689445502

Precision, Recall & F1-Score

Accuracy is (number of correct predictions) / (total number of predictions).

Precision is about being precise (exactness).
– Of all the predictions the classifier made for a class, it measures how many were actually correct.
– For example, out of 100 questions, if you answered only 1 question and answered it correctly, then you have 100% precision: everything you claimed was correct, even though you claimed very little.

Recall (as opposed to precision) is about completeness.
– Of all the items that actually belong to a class, it measures how many the classifier correctly identified.
– In other words, it checks how often the classifier predicts “yes” when the result is actually “yes”.

F1 Score (or F-measure) is the harmonic mean of precision and recall.

For the calculation of precision and recall, we need to count the classifier’s correct and incorrect predictions for each class. Concretely, we need to understand the following four quantities:

True Positive (TP): e.g. the number of patients who did have cancer whom we correctly diagnosed as having cancer
True Negative (TN): e.g. the number of patients who did not have cancer whom we correctly diagnosed as not having cancer

False Positive (FP): e.g. the number of patients who did not have cancer whom we incorrectly diagnosed as having cancer (Also known as Type I error)
False Negative (FN): e.g. the number of patients who did have cancer whom we incorrectly diagnosed as not having cancer (Also known as Type II error)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = (TP) / (TP + FP)

Recall = (TP) / (TP + FN)

F1 Score = 2 * (precision * recall) / (precision + recall)
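

As a sanity check, here is a minimal sketch that plugs the counts from our test run into these formulas (TP = 769, TN = 761, FP = 239, FN = 231 for the positive class, taken from the confusion matrix shown later):


tp, tn, fp, fn = 769, 761, 239, 231 # counts for the 'pos' class from our test run

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * (precision_pos * recall_pos) / (precision_pos + recall_pos)

print (accuracy) # Output: 0.765
print (precision_pos) # Output: 0.762896825397 (approx)
print (recall_pos) # Output: 0.769
print (f1_pos) # Output: 0.76593625498 (approx)

These values match the results computed with nltk.metrics below.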


from collections import defaultdict

actual_set = defaultdict(set)
predicted_set = defaultdict(set)

actual_set_cm = []
predicted_set_cm = []

for index, (feature, actual_label) in enumerate(test_set):
    actual_set[actual_label].add(index)
    actual_set_cm.append(actual_label)

    predicted_label = classifier.classify(feature)

    predicted_set[predicted_label].add(index)
    predicted_set_cm.append(predicted_label)
    
from nltk.metrics import precision, recall, f_measure, ConfusionMatrix

print ('pos precision:', precision(actual_set['pos'], predicted_set['pos'])) # Output: pos precision: 0.762896825397
print ('pos recall:', recall(actual_set['pos'], predicted_set['pos'])) # Output: pos recall: 0.769
print ('pos F-measure:', f_measure(actual_set['pos'], predicted_set['pos'])) # Output: pos F-measure: 0.76593625498

print ('neg precision:', precision(actual_set['neg'], predicted_set['neg'])) # Output: neg precision: 0.767137096774
print ('neg recall:', recall(actual_set['neg'], predicted_set['neg'])) # Output: neg recall: 0.761
print ('neg F-measure:', f_measure(actual_set['neg'], predicted_set['neg'])) # Output: neg F-measure: 0.7640562249

Confusion Matrix

A confusion matrix is a table that is used to describe the performance of a classifier.

It is represented in the following format:


'''
           |   Predicted NO      |   Predicted YES     |
-----------+---------------------+---------------------+
Actual NO  | True Negative (TN)  | False Positive (FP) |
Actual YES | False Negative (FN) | True Positive (TP)  |
-----------+---------------------+---------------------+
'''

The confusion matrix output below shows the following performance of our trained classifier:

– 761 negative tweets were correctly classified as negative (TN)
– 239 negative tweets were incorrectly classified as positive (FP)
– 231 positive tweets were incorrectly classified as negative (FN)
– 769 positive tweets were correctly classified as positive (TP)


# Confusion Matrix for the test set
# 
# Output: 
# row = actual_set_cm 
# column = predicted_set_cm
cm = ConfusionMatrix(actual_set_cm, predicted_set_cm)
print (cm)
'''
Output:

    |   n   p |
    |   e   o |
    |   g   s |
----+---------+
neg |<761>239 |
pos | 231<769>|
----+---------+
(row = reference; col = test)
'''

print (cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
'''
Output:

    |      n      p |
    |      e      o |
    |      g      s |
----+---------------+
neg | <38.0%> 11.9% |
pos |  11.6% <38.5%>|
----+---------------+
(row = reference; col = test)
'''
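
In the percentage format, each cell shows its count as a share of all 2000 test tweets, e.g. 761/2000 ≈ 38.0% for the true negatives.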

Hope this helps. Thanks.