Data Science Capstone Final Project

12/17/2022

Summary

The presentation is about the Word Prediction Shiny App built using R. This project is created for the Final Project Submission of Data Science Capstone Course on Coursera.

The goal of this project is to build a predictive model for text processing and present it using Shiny app. The shiny app has the User Interface that takes a phrase or a partial sentence as an input and outputs a prediction of the next word.

Datasets Used

The data for word prediction model project is given by SwiftKey. The training data includes three types of datasets Blogs, News and Twitter. The data is from a corpus called HC Corpora. The corpora are collected from publicly available sources by a web crawler.

Datasets Used and Data Pre-Processing

First step is data sampling, the appropriate sample (one percent) of data from all three files is selected. This sampling is done to handle the performance issues that could result due to the big size of the input files.
The corpus is then created using the combined sample of all these datasets
The regular text pre-processing techniques are applied such as lower case conversion, removing numbers, removing punctuation marks, removing extra white spaces.
The general text modeling practice includes stopwards removal. In this use case of word prediction, I am keeping the stopwords as this app can be used for predicting words used in common English language sentences. I have built the algorithm using N-grams created by removing stopwords, but then the alogrithm fails to predict the commonly used associated words.
Next step is for the additional cleanup for removing the bad or profane words from the sample data or the training data. I have downloaded the list of bad words from google for performing this cleanup step on the data.
The bigram (2-gram), trigram (3-gram), fourgram (4-gram) and fivegram (5-gram) sets are generated and sorted by frequency.
These N-gram datasets are saved so that they can be later used for creating a predictive model.

Word Prediction Algorithm

Processing of the partially entered sentence by user

The partial sentence accepted from the user interface of Shiny App goes through the text pre-processing steps.

The user input text goes through regular text-preprocessing stages such as lower case conversion, removal of punctuation marks and numbers, removal of extra white spaces
Next remove twitter hashtags, rt, foreign characters
Convert commonly used English language short forms to full words and commonly used single letters with full words

Determine the appropriate dataset to use for predicting next word

The word prediction algorithm is using the Supid Back-Off apporach along with N-Grams created in the data pre-processing stage.

The code checks the number of words in the cleaned input text
Low frequency n-grams are filtered
If the number of words entered is 1, the bigram created during data pre-processing step is used to predict the most likely word with an entered input word. If the match is not found in the bigram dataset then the algorithm will return the most likely used unigram word using the unigram dataset.

Word Prediction Algorithm - Continued

If the number of words entered is 1, the bigram created during data pre-processing step is used to predict the most likely word with an entered input word. If the match is not found in the bigram dataset then the algorithm will return the most likely used unigram word using the unigram dataset.
If the number of words entered are 2, the trigram is used to predict the next likely word depending on the combination of given two words. If no match is found then the same prediction function is called using the secone word of the entered sentence which ultimately uses the trigram to predict the next best match.
If the number of words entered are 3, the fourgram is used to predict the most likely word depending on the combination of first three words. If no match is found then the same prediction function is called using last two words of the entered sentence which ultimately uses the bigram to predict the next best match.
If the number of words entered are 4, use the fivegram created during data pre-processing step to predict the next likely word. If no match is found then the same prediction function is called using last three words of the entered sentence which ultimately uses the trigram to predict the next best match.
If the number of words entered are more than 5, the fivgrams dataset will be used and the word prediction will happen based on the first four words. This is done to avoid the performance issues and delays with higher N-gram datasets.
If no matches are found then word with the highest frequency from the unigram dataset is returned to the user as this is the most likely used single word.

The Shiny App For Word Prediction

The shiny app contains two sections, the User Interface and the Server.

User Interface

The User Interface of this application consists of two sections. First section on the left hand side gives the brief idea about this app and also set of basic instructions to interact with the app. The second section or the main section includes the text box on the top. This text box is used to enter the partial sentense by the users. In the main panel right below the text box, the predicted word gets displayed.

Server

The server side code includes the reference to the word prediction script and the call to the word prediction function. The new word is predicted using N-Grams Back-off model and then passed to the user interface.

Links

Shiny Web App Link

https://siddhimk.shinyapps.io/word_prediction/

Github Code Link

https://github.com/siddhimk/Data_Science_Capstone_Word_Prediction