12 February 2018

Natural Language Processing (NLP): Basic Introduction to NLTK [Python]

Natural Language Processing (NLP) is about processing natural language with a computer. It is about making a computer or machine understand natural language, meaning the language that humans speak and write.

Natural Language Toolkit (NLTK) is a suite of Python libraries for Natural Language Processing (NLP). NLTK contains different text processing libraries for classification, tokenization, stemming, tagging, parsing, etc.

This tutorial assumes you have already installed Python.

Install NLTK

On Linux/Mac, run the following command in a terminal:

For Python 2.x
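The usual command (assuming pip is tied to your Python 2 installation) is:

```shell
sudo pip install -U nltk
```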

For Python 3.x
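The usual command (assuming pip3 is tied to your Python 3 installation) is:

```shell
sudo pip3 install -U nltk
```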

After it is installed, you can verify the installation by running python in a terminal:
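In the interpreter, try importing the package and printing its version:

```python
import nltk

# If this runs without an ImportError, NLTK is installed correctly.
print(nltk.__version__)
```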

If the import works without any error, then NLTK has been properly installed on your system.

For Windows users, you can follow the instructions provided here: http://www.nltk.org/install.html (This link also contains installation instructions for Linux & Mac users.)

Install NLTK Packages

Run python in a terminal:

Then, import NLTK and run nltk.download()
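For example:

```python
import nltk

nltk.download()  # opens the interactive NLTK downloader window
```

If you prefer a non-interactive download, `nltk.download()` also accepts a package or collection name as an argument, e.g. `nltk.download('popular')` for the most commonly used packages.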

This will open the NLTK downloader from where you can choose the corpora and models to download. You can also download all packages at once.


Simple Text Processing with NLTK

Convert raw text into tokens

Convert the tokens into NLTK's Text format

Search for any word in the text

The concordance view shows the searched word along with some surrounding context.

Find words similar to the searched word

Similar words are those that appear in a similar range of contexts.
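These steps can be sketched on a toy text. A real pipeline would tokenize with nltk.word_tokenize (which needs the punkt model downloaded); a plain split() keeps this example self-contained:

```python
import nltk

raw_text = ("the cat sat on the mat . the dog sat on the rug . "
            "the cat chased the dog .")
tokens = raw_text.split()        # crude tokenization of the raw text
text = nltk.Text(tokens)         # wrap the tokens in an nltk.Text object

text.concordance("cat")          # searched word with surrounding context
text.similar("cat")              # words appearing in similar contexts
```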

Count the number of tokens in the text

Tokens include words and punctuation symbols.

Get unique tokens only by removing repeated tokens

Count the number of unique tokens

Sort tokens alphabetically
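On the same kind of toy token list, these three operations look like this:

```python
import nltk

text = nltk.Text("the cat sat on the mat .".split())

print(len(text))              # 7 tokens, punctuation included
print(len(set(text)))         # 6 unique tokens ('the' repeats)
print(sorted(set(text)))      # unique tokens in alphabetical order
```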

Lexical Diversity

Lexical Diversity = Ratio of unique tokens to the total number of tokens

len(set(text)) / len(text)

Total number of occurrences of any particular word

Percentage of any particular word in the whole text

100 * (Total count of the particular word) / (Total number of tokens in the text)
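Both formulas in one short sketch (note the division is Python 3 style; under Python 2 you would need a float conversion):

```python
import nltk

text = nltk.Text("the cat sat on the mat .".split())

diversity = len(set(text)) / len(text)           # 6 unique / 7 total
occurrences = text.count("the")                  # 'the' appears 2 times
percentage = 100 * text.count("the") / len(text)
print(diversity, occurrences, percentage)
```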

Frequency Distribution

Frequency (number of occurrences) of each vocabulary item in the text.

Frequency Distribution Plot

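A minimal sketch of both steps on a toy token list (the plot call needs matplotlib, so it is left commented out):

```python
import nltk

tokens = "the cat sat on the mat . the dog sat on the rug .".split()
fdist = nltk.FreqDist(tokens)       # frequency of each vocabulary item

print(fdist["the"])                 # 4
print(fdist.most_common(3))         # the three most frequent tokens
# fdist.plot(10)                    # frequency plot of the top 10 tokens
```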

Cumulative Frequency Distribution Plot

Cumulative Frequency = Running total of absolute frequency

Running total means the sum of all the frequencies up to the current point.

Example:

Suppose, there are three words X, Y, and Z.
And their respective frequency is 1, 2, and 3.
This frequency is their absolute frequency.
The cumulative frequency of X, Y, and Z will be as follows:
X -> 1
Y -> 2+1 = 3
Z -> 3+3 = 6
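The running totals from the example above can be computed with a FreqDist (the cumulative plot itself needs matplotlib and is commented out):

```python
import nltk

fdist = nltk.FreqDist({"X": 1, "Y": 2, "Z": 3})   # absolute frequencies
# fdist.plot(cumulative=True)                     # cumulative frequency plot

running_total = 0
for word, freq in sorted(fdist.items(), key=lambda item: item[1]):
    running_total += freq                         # sum of frequencies so far
    print(word, running_total)                    # X 1, Y 3, Z 6
```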


Collocations

Collocations = sequences of words that commonly occur together

BIGRAMS

Finding the 10 best bigrams in the text

Here, scoring of ngrams is done by PMI (pointwise mutual information) method.

Here, scoring of ngrams is done by likelihood ratios method.
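A sketch of both scoring methods on a toy token list:

```python
import nltk

tokens = ("the quick brown fox jumped over the lazy dog . "
          "the quick brown fox ran past the lazy dog .").split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)

best_pmi = finder.nbest(bigram_measures.pmi, 10)              # 10 best by PMI
best_lr = finder.nbest(bigram_measures.likelihood_ratio, 10)  # 10 best by likelihood ratio
print(best_pmi)
print(best_lr)
```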

Frequency Distribution of Bigrams

Ignore all bigrams that occur less than 2 times in the text
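Both steps in one sketch, using the finder's built-in frequency distribution:

```python
import nltk

tokens = "the cat sat on the mat . the cat ran .".split()
finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)

print(finder.ngram_fd.most_common())  # frequency of each bigram
finder.apply_freq_filter(2)           # drop bigrams occurring fewer than 2 times
print(list(finder.ngram_fd))          # only ('the', 'cat') survives
```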

TRIGRAMS

Finding the 10 best trigrams in the text

Here, scoring of ngrams is done by PMI (pointwise mutual information) method.

Here, scoring of ngrams is done by likelihood ratios method.
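The trigram version mirrors the bigram sketch above:

```python
import nltk

tokens = "the quick brown fox jumped . the quick brown fox ran .".split()

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.TrigramCollocationFinder.from_words(tokens)

best_pmi = finder.nbest(trigram_measures.pmi, 10)              # 10 best by PMI
best_lr = finder.nbest(trigram_measures.likelihood_ratio, 10)  # 10 best by likelihood ratio
print(best_pmi)
print(best_lr)
```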

Accessing Text Corpora

Text Corpus = Large collection of text

Text corpora can be downloaded from NLTK with the nltk.download() command, as mentioned at the beginning of this article.

To access any text corpus, it must be downloaded first.

Here are the basic functions that can be used with the nltk text corpus:

fileids() = the files of the corpus
fileids([categories]) = the files of the corpus corresponding to these categories
categories() = the categories of the corpus
categories([fileids]) = the categories of the corpus corresponding to these files
raw() = the raw content of the corpus
raw(fileids=[f1,f2,f3]) = the raw content of the specified files
raw(categories=[c1,c2]) = the raw content of the specified categories
words() = the words of the whole corpus
words(fileids=[f1,f2,f3]) = the words of the specified fileids
words(categories=[c1,c2]) = the words of the specified categories
sents() = the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) = the sentences of the specified fileids
sents(categories=[c1,c2]) = the sentences of the specified categories
abspath(fileid) = the location of the given file on disk
encoding(fileid) = the encoding of the file (if known)
open(fileid) = open a stream for reading the given corpus file
root = the path to the root of the locally installed corpus
readme() = the contents of the README file of the corpus

Movie Reviews Corpus

The movie_reviews corpus contains 2K movie reviews with sentiment polarity classification. It was compiled by Pang and Lee.

Stopwords Corpus

The stopwords corpus contains high-frequency words (words that occur frequently in any text document). In text processing, a document is often filtered by removing these stop words.

Names Corpus

The names corpus contains 8K male and female names. It was compiled by Kantrowitz and Ross.

References:

1. http://www.nltk.org/book/ch01.html
2. http://www.nltk.org/howto/collocations.html
3. http://www.nltk.org/book/ch02.html

Hope this helps. Thanks.
