The logic behind this captivating technology seems complex, but it isn’t. With a solid grasp of basic Python programming, you can build your own DIY natural language processing tools with the Natural Language Toolkit (NLTK).
Here’s how to get started with Python’s NLTK.
What Is NLTK and How Does It Work?
Written in Python, NLTK features a variety of string-manipulation functions. It’s a versatile natural language library with a vast repository of models for various natural language applications.
With NLTK, you can process raw text and extract meaningful features from it. It also offers text-analysis models, feature-based grammars, and rich lexical resources for building a complete language model.
How to Set Up NLTK
First, create a project root folder anywhere on your PC. To start using the NLTK library, open a terminal in the root folder you just created and create a virtual environment.
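For example, on macOS or Linux (the environment name venv is just a convention, and the activation command differs on Windows):

```shell
python3 -m venv venv
source venv/bin/activate
```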
Then, install the natural language toolkit into this environment using pip:
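```shell
pip install nltk
```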
NLTK also features a variety of datasets that serve as a basis for new natural language models. To access them, you need to spin up NLTK’s built-in data downloader.
Once you’ve successfully installed NLTK, create a Python file and open it using any code editor.
Then import the nltk module and instantiate the data downloader using the following code:
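```python
import nltk

nltk.download()
```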
Running the above code from the terminal brings up a graphical user interface for selecting and downloading data packages. Here, you’ll need to choose a package and click the Download button to get it.
Any data package you download goes into the directory specified in the Download Directory field. You can change this if you like, but it’s best to keep the default location at this stage.
Note: Downloaded data packages are added to NLTK’s data path by default, so you can keep using them in subsequent projects regardless of the Python environment you’re using.
How to Use NLTK Tokenizers
Out of the box, NLTK offers trained tokenizing models for words and sentences. Using these tools, you can generate a list of words from a sentence, or transform a paragraph into a sensible array of sentences.
Here’s an example of how to use the NLTK word_tokenize function:
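A minimal sketch (the sample text is our own, and word_tokenize assumes you’ve downloaded the punkt package via the downloader):

```python
from nltk.tokenize import word_tokenize

text = "NLTK makes it easy to split text into tokens."
print(word_tokenize(text))
# ['NLTK', 'makes', 'it', 'easy', 'to', 'split', 'text', 'into', 'tokens', '.']
```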
NLTK also includes a pre-trained sentence tokenizer called PunktSentenceTokenizer. It works by chunking a paragraph into a list of sentences.
Let’s see how this works with a two-sentence paragraph:
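Here’s a sketch with a made-up two-sentence paragraph; sent_tokenize is a convenience function that loads the pre-trained PunktSentenceTokenizer for English under the hood:

```python
from nltk.tokenize import sent_tokenize

paragraph = "NLTK is a powerful library. It ships with many pre-trained models."
sentences = sent_tokenize(paragraph)
print(sentences)
# ['NLTK is a powerful library.', 'It ships with many pre-trained models.']
```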
You can further tokenize each sentence in the array generated by the above code using word_tokenize and a Python for loop.
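For instance, continuing from the previous snippet:

```python
from nltk.tokenize import word_tokenize

# Break each sentence down into its individual words.
for sentence in sentences:
    print(word_tokenize(sentence))
```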
Examples of How to Use NLTK
While we can’t demonstrate every possible use case of NLTK, here are a few examples of how you can start using it to solve real-life problems.
Get Word Definitions and Their Parts of Speech
NLTK features models for determining parts of speech, getting detailed word semantics, and exploring the possible contextual uses of various words.
You can use the wordnet model to generate the synonym sets (synsets) of a word, then determine its meaning and part of speech.
For instance, let’s check the possible synsets for “monkey:”
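A sketch (this assumes you’ve downloaded the wordnet package; the exact list can vary with your WordNet version):

```python
from nltk.corpus import wordnet

print(wordnet.synsets("monkey"))
# e.g. [Synset('monkey.n.01'), Synset('imp.n.02'), Synset('tamper.v.01'), Synset('putter.v.02')]
```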
The above code outputs the possible synsets for “monkey,” each tagged with its part of speech (n for noun, v for verb).
Now check the meaning of “monkey” using the definition() method:
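For example, picking the first noun sense from the list above:

```python
from nltk.corpus import wordnet

print(wordnet.synset("monkey.n.01").definition())
# e.g. 'any of various long-tailed primates (excluding the prosimians)'
```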
You can replace the string in the parentheses with any of the other generated synsets to see what NLTK outputs.
The pos_tag model, on the other hand, determines the part of speech of each word. You can use it with word_tokenize, or with PunktSentenceTokenizer() if you’re dealing with longer paragraphs.
Here’s how that works:
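A sketch (the sentence is our own, and pos_tag assumes you’ve downloaded the averaged_perceptron_tagger package):

```python
from nltk import pos_tag
from nltk.tokenize import word_tokenize

sentence = "Python is a great language for natural language processing."
print(pos_tag(word_tokenize(sentence)))
# [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ...]
```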
The above code pairs each tokenized word with its part-of-speech tag in a tuple. You can check the meaning of these tags in the Penn Treebank tagset.
For a cleaner result, you can remove the periods in the output using the replace() method:
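Continuing from the snippet above, one way to do that is to strip the period before tokenizing:

```python
print(pos_tag(word_tokenize(sentence.replace(".", ""))))
```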
Visualizing Feature Trends Using NLTK Plot
Extracting features from raw text is often tedious and time-consuming. But you can view the strongest feature determiners in a text using the NLTK frequency distribution trend plot.
NLTK also integrates with matplotlib, which you can leverage to visualize a specific trend in your data.
The code below, for instance, compares a set of positive and negative words on a distribution plot using their last two letters:
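A sketch using ConditionalFreqDist (the word lists below are self-generated samples chosen purely for illustration, and matplotlib must be installed for the plot to render):

```python
from nltk.probability import ConditionalFreqDist

# Self-generated sample data; swap in your own labeled words.
positives = ["ambience", "rewards", "agreeable", "kind", "brilliant"]
negatives = ["abysmal", "costly", "rejection", "obsolete"]

# Condition each word's last two letters on its sentiment label.
cfd = ConditionalFreqDist(
    (label, word[-2:])
    for label, words in (("positive", positives), ("negative", negatives))
    for word in words
)
cfd.plot()  # opens a matplotlib window with the distribution plot
```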
Running the code opens a plot of the last-two-letter distribution for each group.
Looking closely at the graph, words ending with ce, ds, le, nd, and nt have a higher likelihood of being positive words, while those ending with al, ly, on, and te are more likely negative.
Note: Although we’ve used self-generated data here, you can access some of NLTK’s built-in datasets using its corpus reader by calling them from the nltk.corpus class. You might want to look at the corpus package documentation to see how you can use it.
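For example, a sketch using the built-in movie_reviews dataset (this assumes you’ve downloaded the movie_reviews package via the downloader):

```python
from nltk.corpus import movie_reviews

# Print the first ten words of the movie reviews corpus.
print(movie_reviews.words()[:10])
```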
Keep Exploring the Natural Language Toolkit
With the emergence of technologies like Alexa, spam detection, chatbots, and sentiment analysis, natural language processing is evolving rapidly. Although we’ve only considered a few examples of what NLTK offers in this article, the tool has more advanced applications that are beyond the scope of this tutorial.
Having read this article, you should have a good idea of how to use NLTK at a basic level. All that’s left for you to do now is put this knowledge into action yourself!