Word Predictor App

Gomez

5/27/2020

Summary

This application was developed for the Coursera Data Science Capstone Project. It uses n-grams to predict the next word of a sentence. The training dataset is built from three source files:

  1. en_US.blogs.txt
  2. en_US.news.txt
  3. en_US.twitter.txt

A sample file was created by taking random lines from each of the three files. The final sample file includes a total of 600,000 lines.
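The sampling step can be sketched as follows. This is an illustrative Python sketch, not the application's own code: the equal share per file, the fixed seed, and the names `sample_lines` and `per_file` are assumptions, and small in-memory lists stand in for the three source files.

```python
import random

def sample_lines(lines, k, seed=0):
    """Return k lines chosen at random without replacement.

    If fewer than k lines are available, all of them are returned.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    if k >= len(lines):
        return list(lines)
    return rng.sample(lines, k)

# Hypothetical stand-ins for the contents of the three source files.
sources = {
    "blogs": [f"blog line {i}" for i in range(10)],
    "news": [f"news line {i}" for i in range(10)],
    "twitter": [f"tweet {i}" for i in range(10)],
}

per_file = 3  # assumption: each file contributes the same number of lines
sample = []
for name, lines in sources.items():
    sample.extend(sample_lines(lines, per_file, seed=42))

print(len(sample))  # 3 files x 3 lines each
```

With the real files, `per_file` would be chosen so that the combined sample reaches the 600,000 lines mentioned above.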

Using Natural Language Processing (NLP) packages, we tokenize the sample and create n-grams (n = 2, 3, and 4), on which we base our prediction model.

The application predicts the next word in a sentence and updates its suggestions after each word the user enters. Additionally, the user can choose to display the probability of each suggested word.

How it works

Our first step is to create tokens from our sample file. A token is a unit of text, such as a word, that we use for analysis. Using the tokens, we can find the most commonly used words in our sample file.
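As a rough illustration of tokenizing and counting words, here is a minimal Python sketch. It is not the application's own code: the tokenization rule (lowercase, letters and apostrophes only) and the name `tokenize` are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

# Two made-up lines standing in for the sample file.
lines = ["The cat sat on the mat.", "The dog sat too."]

# Count every token across all lines, then pick the most frequent ones.
word_counts = Counter(tok for line in lines for tok in tokenize(line))
top_words = word_counts.most_common(2)
print(top_words)
```

`Counter.most_common` returns the tokens sorted by descending count, which is the kind of ranking a word-frequency chart is built from.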

Next, we create bigrams, which tokenize the text by pairs of adjacent words rather than by individual words. By counting how often each bigram occurs, we can estimate the probability that a second word appears given that a first word has already appeared. We extend this concept to n=3 and n=4. By including bigrams (n=2), trigrams (n=3), and fourgrams (n=4) in our model, we can predict the next word given a series of words already in place.
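One simple way to turn n-gram counts into next-word probabilities is the relative frequency: the count of a context followed by a given word, divided by the count of that context followed by any word. The Python sketch below assumes this unsmoothed estimate; the function names `ngrams` and `next_word_probs` and the toy sentence are hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def next_word_probs(tokens, n):
    """Map each (n-1)-word context to {next_word: relative frequency}."""
    counts = defaultdict(Counter)
    for gram in ngrams(tokens, n):
        counts[gram[:-1]][gram[-1]] += 1
    return {
        ctx: {word: c / sum(nxt.values()) for word, c in nxt.items()}
        for ctx, nxt in counts.items()
    }

tokens = "i love my dog and i love my cat and i like tea".split()

bigram_probs = next_word_probs(tokens, 2)   # context is one word
trigram_probs = next_word_probs(tokens, 3)  # context is two words

# "i" is followed by "love" twice and by "like" once.
print(bigram_probs[("i",)])
print(trigram_probs[("love", "my")])
```

The same function covers n=2, 3, and 4; only the length of the context changes.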

The following graphs represent the most common words and bigrams present in our sample file.

Word & Bigram frequencies

WordClouds

The following wordclouds are another way to visualize the most common words and bigrams found in our sample file.

Example

In our example, we try to find the next word for the expression “from the bottom of my”. The application presents a list of words that could follow the expression, with the calculated probability next to each one.

## [1] "from the bottom of my "
Next Word    Probability
heart        0.72
favorite     0.08
life         0.07
blog         0.06
right        0.06
card         0.06
cup          0.06
dresser      0.06
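A lookup of this kind can be sketched in Python as below. How the application combines its bigram, trigram, and fourgram tables is not described here, so the sketch assumes a simple backoff: try the longest context first and fall back to shorter ones when no match is found. The names `build_model` and `predict` and the toy corpus are hypothetical, and the probabilities it prints come from the toy corpus, not from the application.

```python
from collections import Counter, defaultdict

def build_model(tokens, orders=(2, 3, 4)):
    """For each n, count which word follows every (n-1)-word context."""
    model = {n: defaultdict(Counter) for n in orders}
    for n in orders:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            model[n][gram[:-1]][gram[-1]] += 1
    return model

def predict(model, phrase, top=3):
    """Return up to `top` (word, probability) pairs for the next word.

    Assumed strategy: use the highest-order n-gram whose context matches
    the end of the phrase, backing off to lower orders otherwise.
    """
    words = phrase.lower().split()
    for n in sorted(model, reverse=True):
        context = tuple(words[-(n - 1):])
        if len(context) == n - 1 and context in model[n]:
            followers = model[n][context]
            total = sum(followers.values())
            return [(w, round(c / total, 2)) for w, c in followers.most_common(top)]
    return []  # no context of any order was seen in training

corpus = (
    "from the bottom of my heart i thank you "
    "from the bottom of my heart and "
    "from the bottom of my cup"
).split()
model = build_model(corpus)

print(predict(model, "from the bottom of my"))  # matched by the fourgram table
print(predict(model, "in my"))                  # backs off to the bigram table
```

Backing off keeps the most specific evidence when it exists while still producing a suggestion for phrases whose longer contexts never appeared in the sample.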