Capstone Project for the Johns Hopkins Data Science Specialization - Next Word Prediction

IrinaS

2023-03-19

Summary

The goal of this task: build a smart application that presents options for what the next word might be.

Source data: a large, unstructured corpus of English text in txt format, provided by SwiftKey.

Raw data processing: RMarkdown

Source code in Git: Git Repository

Shiny application: ShinyApp

Work Stages

Cleaning and analyzing text data

The raw data contains corpora (collected from publicly available sources by a web crawler) in 4 different languages; for this project, only the en_US locale files were used.

Since the full data set (a vector of about 5.55 GB) is too large for my computer to run exploratory data analysis on, I randomly sampled 1% of the data. The sampled data is still sufficient for statistical analysis.
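As an illustration, here is a minimal R sketch of this sampling step; the file paths follow the SwiftKey download layout and are assumptions, and the seed is arbitrary:

```r
# Randomly keep ~1% of the lines from each en_US source file
set.seed(123)  # arbitrary seed, for reproducibility
sample_lines <- function(path, rate = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

blogs   <- sample_lines("final/en_US/en_US.blogs.txt")
news    <- sample_lines("final/en_US/en_US.news.txt")
twitter <- sample_lines("final/en_US/en_US.twitter.txt")
corpus_sample <- c(blogs, news, twitter)
```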

To perform the data analysis, the sampled text was normalized.
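The exact normalization steps are not listed in this deck; a typical cleaning pass in base R might look like the following (the specific steps shown are assumptions, not the project's actual code):

```r
# Typical text normalization pass (the exact steps used are an assumption)
normalize_text <- function(x) {
  x <- tolower(x)                  # lower-case everything
  x <- gsub("[^a-z' ]", " ", x)    # drop numbers, punctuation and other symbols
  x <- gsub("\\s+", " ", x)        # collapse repeated whitespace
  trimws(x)
}

clean_sample <- normalize_text(corpus_sample)
```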

Building and sampling from a predictive text model

  1. If the user enters several words:

    a. The app takes the last 2 words and looks for a Trigram that starts with them. The prediction is the last word of the matching Trigram. If no such Trigram exists, the app falls back to the last word only (see 1b);

    b. The app looks for a Bigram that starts with this word; the prediction is the last word of the matching Bigram. If none exists, the app looks for a Trigram that starts with this word; the prediction is its second word. If that also fails, it looks for a Trigram whose second word matches; the prediction is the last word of that Trigram. If none of these cases exist, the app returns the most popular words from the Unigram table.

  2. If the user enters 1 word, see step 1b.

  3. If the user enters no words, the app returns the most popular words from the Unigram table (a minimal sketch of this backoff logic follows the list).
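The sketch below illustrates the backoff described above. The n-gram tables (`unigrams`, `bigrams`, `trigrams`) and their columns (`w1`, `w2`, `w3`, sorted by descending frequency) are assumptions for illustration, not the app's actual data structures:

```r
# Backoff prediction: trigram -> bigram -> trigram fallbacks -> unigram.
# Assumes data frames unigrams, bigrams, trigrams with word columns
# w1, w2, w3, already sorted by frequency in descending order.
predict_next <- function(phrase, n = 3) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  len   <- length(words)

  if (len >= 2) {                                    # step 1a: last two words
    hit <- trigrams[trigrams$w1 == words[len - 1] &
                    trigrams$w2 == words[len], ]
    if (nrow(hit) > 0) return(head(hit$w3, n))
  }
  if (len >= 1) {                                    # step 1b: last word only
    last <- words[len]
    hit <- bigrams[bigrams$w1 == last, ]
    if (nrow(hit) > 0) return(head(hit$w2, n))
    hit <- trigrams[trigrams$w1 == last, ]
    if (nrow(hit) > 0) return(head(hit$w2, n))
    hit <- trigrams[trigrams$w2 == last, ]
    if (nrow(hit) > 0) return(head(hit$w3, n))
  }
  head(unigrams$w1, n)                               # step 3: most frequent words
}
```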

Shiny App

Link to the ShinyApp

The application has the following options:

  1. The user can enter a word or sentence for next-word prediction
  2. The user can choose the number of most probable words to predict (I limited this number to 10, since there is no business case for a larger number within this assignment)
  3. The user can choose the prediction model - Unigram, Bigram, Trigram or Cumulative (the optimal choice among the N-grams). The logic of how this works is described on the slide above.

On the mainPanel, after entering all necessary parameters, you can see:

  1. Which model you have chosen
  2. A table with the most probable predicted words (with the frequency and probability of each combination in the source data); the number of predicted words is the one specified by the user
  3. A wordcloud visualization of the most probable next words (a minimal UI sketch follows this list).
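For illustration, here is a minimal sketch of how these controls and outputs could be wired up in Shiny; the widget IDs, labels and default values are assumptions, not the app's actual source code:

```r
library(shiny)

# Minimal UI sketch of the controls and outputs described above;
# input IDs and labels are assumptions for illustration only.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter a word or sentence:"),
      numericInput("n_words", "Number of predicted words:",
                   value = 3, min = 1, max = 10),
      selectInput("model", "Prediction model:",
                  choices = c("Unigram", "Bigram", "Trigram", "Cumulative"))
    ),
    mainPanel(
      textOutput("chosen_model"),   # which model the user selected
      tableOutput("predictions"),   # predicted words with frequency and probability
      plotOutput("wordcloud")       # wordcloud of the most probable next words
    )
  )
)
```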

Thank you for your attention!