Python NLTK: Text Classification [Natural Language Processing (NLP)]

This article shows how you can classify text into different categories using Python and Natural Language Toolkit (NLTK).

Examples of text classification include spam filtering, sentiment analysis (analyzing text as positive or negative), genre classification, categorizing news articles, etc. In each case, there is a set of available categories, and we need to analyze the text/document and classify it into one of them.

In this article, we will learn about labeling data, extracting features, training a classifier, and testing the accuracy of the classifier.

Supervised Classification

Here, we will be doing supervised text classification. In supervised classification, the classifier is trained with labeled training data.

In this article, we will use NLTK’s names corpus as our labeled training data. The names corpus contains around 8,000 male and female names, compiled by Mark Kantrowitz and Bill Ross.

So, we have two categories for classification. They are male and female. Our training data (the “names” corpus) has names that are already labeled as male and names that are already labeled as female.


from nltk.corpus import names 

print (names.fileids()) # Output: ['female.txt', 'male.txt']

male_names = names.words('male.txt')
female_names = names.words('female.txt')

print (len(male_names)) # Output: 2943
print (len(female_names)) # Output: 5001

# male_names_2 = [str(item) for item in male_names]
# print (male_names_2[0:10])

# print first 10 male names
print (male_names[0:10])
'''
Output: 

['Aamir', 'Aaron', 'Abbey', 'Abbie', 'Abbot', 'Abbott', 'Abby', 'Abdel', 'Abdul', 'Abdulkarim']
'''

# female_names_2 = [str(item) for item in female_names]
# print (female_names_2[0:10])

# print first 10 female names
print (female_names[0:10]) 
'''
Output:

['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie', 'Abby', 'Abigael', 'Abigail', 'Abigale']
'''

Feature Extraction

To classify text into a category, we need to define some criteria. On the basis of those criteria, our classifier will learn that a particular kind of text falls into a particular category. Each such criterion is known as a feature. We can define one or more features to train our classifier.

In this example, we will use the last letter of the names as the feature.

We will define a function that extracts the last letter of any provided word. The function will return a dictionary containing the last letter of the given word.


def gender_features(word):
    return {'last_letter' : word[-1]}

print (gender_features('Mukesh')) # Output: {'last_letter': 'h'}

The dictionary returned by the above function is called a feature set. This feature set is used to train the classifier.

We will now create a feature set using all the male and female names.

For this, we first combine the male and female names and shuffle the combined array.

Combining and Labeling Names Array


from nltk.corpus import names 
import random 

male_names = names.words('male.txt')
female_names = names.words('female.txt')

labeled_male_names = [(str(name), 'male') for name in male_names]

# printing first 10 labeled male names
print (labeled_male_names[:10])
'''
Output:

[('Aamir', 'male'), ('Aaron', 'male'), ('Abbey', 'male'), ('Abbie', 'male'), ('Abbot', 'male'), ('Abbott', 'male'), ('Abby', 'male'), ('Abdel', 'male'), ('Abdul', 'male'), ('Abdulkarim', 'male')]
'''

labeled_female_names = [(str(name), 'female') for name in female_names]

# printing first 10 labeled female names
print (labeled_female_names[:10])
'''
Output:

[('Abagael', 'female'), ('Abagail', 'female'), ('Abbe', 'female'), ('Abbey', 'female'), ('Abbi', 'female'), ('Abbie', 'female'), ('Abby', 'female'), ('Abigael', 'female'), ('Abigail', 'female'), ('Abigale', 'female')]
'''

# combine labeled male and labeled female names
labeled_all_names = labeled_male_names + labeled_female_names

# shuffle the labeled names array
random.shuffle(labeled_all_names)

# printing first 10 labeled all/combined names
print (labeled_all_names[:10])
'''
Output:

[('Aggie', 'female'), ('Eugenie', 'female'), ('Lottie', 'female'), ('Ansell', 'male'), ('Dexter', 'male'), ('Regina', 'female'), ('Tre', 'male'), ('Adelice', 'female'), ('Joelly', 'female'), ('Fran', 'male')]
'''

Extracting Feature & Creating Feature Set

We use the gender_features function that we defined above to extract the feature from the labeled names data. As mentioned above, the feature for this example is the last letter of the names. So, we extract the last letter of each labeled name and create a new array containing, for each name, the last-letter dictionary and the associated label. This new array is called the feature set.


feature_set = [(gender_features(name), gender) for (name, gender) in labeled_all_names]

print (labeled_all_names[:10])
'''
Output:

[('Rebe', 'female'), ('Flory', 'female'), ('Jaquenette', 'female'), ('Inna', 'female'), ('Andra', 'female'), ('Collie', 'female'), ('Almira', 'female'), ('Rodger', 'male'), ('Beau', 'male'), ('Bruno', 'male')]
'''

print (feature_set[:10])
'''
Output:

[({'last_letter': 'e'}, 'female'), ({'last_letter': 'y'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'r'}, 'male'), ({'last_letter': 'u'}, 'male'), ({'last_letter': 'o'}, 'male')]
'''

Training Classifier

From the feature set we created above, we now create a separate training set and a separate testing/validation set. The training set is used to train the classifier, and the test set is used to check how accurately the classifier classifies the given text.

Creating Train and Test Dataset

In this example, we use the first 1500 elements of the feature set array as the test set and the rest of the data as the training set. Generally, an 80/20 split between the training and testing sets is fair, i.e. 80 percent training set and 20 percent testing set.


print (len(feature_set)) # Output: 7944

test_set = feature_set[:1500]
train_set = feature_set[1500:]

print (len(train_set)) # Output: 6944
print (len(test_set)) # Output: 1500
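The fixed cut-off of 1500 happens to be close to 20 percent of the 7944 names. If you prefer to derive the split point from a ratio rather than hard-coding it, a minimal sketch (using a plain list as a stand-in for the real feature set) could look like this:

```python
# Hypothetical sketch: compute an 80/20 split index from the data size
# instead of hard-coding 1500. A plain list stands in for the feature set.
feature_set = list(range(7944))

split_index = int(len(feature_set) * 0.8)
train_set = feature_set[:split_index]
test_set = feature_set[split_index:]

print(len(train_set))  # 6355
print(len(test_set))   # 1589
```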

Training a Classifier

Now, we train a classifier using the training dataset. There are different kinds of classifiers, namely the Naive Bayes Classifier, Maximum Entropy Classifier, Decision Tree Classifier, Support Vector Machine Classifier, etc.

In this example, we use the Naive Bayes Classifier. It’s a simple, fast classifier that performs well on small datasets. It’s a probabilistic classifier based on applying Bayes’ theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event.
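As a quick illustration of Bayes’ theorem at work on this problem, here is a worked calculation. The numbers below are assumed for illustration only, not measured from the names corpus:

```python
# Illustrative (assumed) numbers, not computed from the names corpus:
p_female = 0.63           # prior: roughly 5001 of 7944 names are female
p_a_given_female = 0.35   # assumed chance a female name ends in 'a'
p_a_given_male = 0.01     # assumed chance a male name ends in 'a'

# Bayes' theorem: P(female | name ends in 'a')
p_a = p_a_given_female * p_female + p_a_given_male * (1 - p_female)
p_female_given_a = p_a_given_female * p_female / p_a

print(round(p_female_given_a, 3))  # 0.983
```

Even with a rough prior, a feature strongly associated with one label pushes the posterior heavily toward that label, which is exactly what the classifier exploits.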


from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_set)

Testing the Trained Classifier

Let’s see the output of the classifier by providing some names to it.


print (classifier.classify(gender_features('John'))) # Output: male

print (classifier.classify(gender_features('Mary'))) # Output: female

Let’s see the accuracy percentage of the trained classifier. The accuracy value changes each time you run the program because the names array is shuffled above.
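If you want reproducible accuracy numbers across runs, one option (an optional tweak, not part of the original code) is to seed the random number generator before shuffling. A small sketch:

```python
import random

# Seeding makes random.shuffle produce the same order on every run,
# so the train/test split (and hence the accuracy) stays reproducible.
random.seed(42)
items = ['John', 'Mary', 'Jack', 'Eliza']
random.shuffle(items)
print(items)
```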


from nltk import classify 

accuracy = classify.accuracy(classifier, test_set)

print (accuracy) # Output: 0.76

Let’s see the most informative features among the entire features in the feature set.

The result shows that names ending with the letter “k” are male 36.9 times more often than they are female, while names ending with the letter “a” are female 34.1 times more often than they are male. The same goes for the other letters. These ratios are known as likelihood ratios.

Therefore, if you provide a name ending with the letter “k” to the trained classifier above, it will predict “male”, and if you provide a name ending with the letter “a”, it will predict “female”.


# show 5 most informative features
classifier.show_most_informative_features(5)
'''
Output:

Most Informative Features
             last_letter = 'k'              male : female =     36.9 : 1.0
             last_letter = 'a'            female : male   =     34.1 : 1.0
             last_letter = 'f'              male : female =     16.6 : 1.0
             last_letter = 'd'              male : female =      9.4 : 1.0
             last_letter = 'm'              male : female =      8.8 : 1.0
'''

print (classifier.classify(gender_features('Jack'))) # Output: male

print (classifier.classify(gender_features('Eliza'))) # Output: female
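Besides the hard label from classify, NLTK’s Naive Bayes classifier can also report how confident it is via prob_classify. A small self-contained sketch (trained on a tiny hand-made feature set for illustration, not the full names corpus):

```python
from nltk import NaiveBayesClassifier

# Tiny hand-labeled training data (illustrative, not the real corpus)
train_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'e'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'o'}, 'male'),
    ({'last_letter': 'n'}, 'male'),
]
classifier = NaiveBayesClassifier.train(train_set)

# prob_classify returns a probability distribution over the labels
dist = classifier.prob_classify({'last_letter': 'a'})
print(dist.max())                               # female
print(dist.prob('female') > dist.prob('male'))  # True
```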

Note: You can modify the gender_features function to generate a feature set that improves the accuracy of the trained classifier. For example, you can use both the first and last letters of the names as features. Feature extractors are built through a process of trial and error, guided by intuition.
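For instance, a richer feature extractor using both the first and last letters might look like this (gender_features2 is a hypothetical name; whether it actually improves accuracy is something you would verify by re-running the training and accuracy steps above):

```python
def gender_features2(word):
    # Hypothetical richer extractor: use both the first and last letters
    word = word.lower()
    return {'first_letter': word[0], 'last_letter': word[-1]}

print(gender_features2('Mukesh'))
# {'first_letter': 'm', 'last_letter': 'h'}
```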

Reference: Learning to Classify Text

Hope this helps. Thanks.