Capstone Project Final Report

5/26/2020

Introduction

I developed WordPredictR, a web application built with RStudio Shiny that predicts the most likely next words as soon as the user starts typing.

The app offers five candidate words, ordered by likelihood given the words the user has typed so far.

WordPredictR

The Data Set

The corpus used to build the app comes from three source files (blogs, news, and Twitter tweets) provided by SwiftKey for this project.
To improve accuracy, I also incorporated review data taken from Amazon's site.
Basic cleaning and preprocessing steps were applied before the data was fed to the algorithm.

NOTE: All of the data was used to build the model and algorithm for the web app.

         feature  freq pred       base
1     one_of_the 20730  the     one_of
2       a_lot_of 20062   of      a_lot
3 thanks_for_the 14545  the thanks_for
4        to_be_a 13744    a      to_be
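
In the sample above, each n-gram (feature) is stored with its count (freq), its last word (pred) and the words before it (base). A minimal sketch of that split is given here in Python purely for illustration; the app itself is written in R, and the helper name split_ngram is hypothetical.

```python
def split_ngram(feature):
    """Split an underscore-joined n-gram into (base, pred).

    'one_of_the' -> ('one_of', 'the'); a single word gets an empty base.
    """
    base, _, pred = feature.rpartition("_")
    return base, pred


# The four rows shown in the sample table above.
rows = [
    ("one_of_the", 20730),
    ("a_lot_of", 20062),
    ("thanks_for_the", 14545),
    ("to_be_a", 13744),
]

table = []
for feature, freq in rows:
    base, pred = split_ngram(feature)
    table.append({"feature": feature, "freq": freq, "pred": pred, "base": base})

for row in table:
    print(row)
```

Storing base and pred separately means a prediction is a lookup on base followed by a ranking of the matching pred values.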

The Algorithm

To keep the app efficient, I implemented the simple “Stupid Backoff” model, which scores each candidate next word by how often it follows the preceding words in the corpus.

I implemented a 5-gram model to predict the next word; when the user has entered fewer than four words, lower-order n-gram models are used instead.
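
The scoring rule can be sketched as follows. This is an illustrative Python version, not the app's R code. The backoff factor of 0.4 is the value commonly used with Stupid Backoff and is an assumption here, since the report does not state it; the function names and the toy sentence are also made up for the example.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; not stated in the report


def count_ngrams(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts, total_tokens, max_n=5):
    """Stupid Backoff score of `word` after `context` (max_n >= 2).

    Use the relative frequency under the longest context in which the
    n-gram was seen; each time the context is shortened, multiply the
    score by ALPHA. With no context left, fall back to the unigram
    relative frequency.
    """
    context = tuple(context)[-(max_n - 1):]
    penalty = 1.0
    while context:
        full = context + (word,)
        if counts[full] > 0:
            return penalty * counts[full] / counts[context]
        context = context[1:]   # drop the oldest word
        penalty *= ALPHA
    return penalty * counts[(word,)] / total_tokens


# Toy corpus, invented for this example.
tokens = "one of the best one of the most one of a kind".split()
counts = count_ngrams(tokens)

print(score("the", ["one", "of"], counts, len(tokens)))   # seen trigram
print(score("kind", ["one", "of"], counts, len(tokens)))  # backs off twice
```

The values are scores rather than true probabilities, because they are not renormalized after backing off. That is what keeps the method cheap.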

Efficiency of the App

The n-gram counts and scores are computed ahead of time and stored, so when the user enters text the app only needs to look up the matching n-grams and return the five best-scoring words, which keeps it efficient.
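
A rough sketch of the precompute-then-look-up idea, again in Python for illustration only. The counts below are invented, the function names are hypothetical, and the 0.4 backoff factor is an assumption rather than a value taken from the app.

```python
from collections import defaultdict

ALPHA = 0.4  # assumed backoff factor; not stated in the report


def build_lookup(ngram_freq, base_freq):
    """Precompute, once, a map: base -> [(pred, score), ...] by descending score."""
    by_base = defaultdict(list)
    for feature, freq in ngram_freq.items():
        base, _, pred = feature.rpartition("_")
        by_base[base].append((pred, freq / base_freq[base]))
    return {b: sorted(c, key=lambda pc: -pc[1]) for b, c in by_base.items()}


def suggest(text, lookup, k=5, max_context=4):
    """Return up to k next-word suggestions using only table lookups."""
    words = text.lower().split()[-max_context:]
    best = {}
    penalty = 1.0
    # Try the longest context first, then drop the oldest word each round;
    # the empty base "" holds the single-word (unigram) entries.
    for start in range(len(words) + 1):
        base = "_".join(words[start:])
        for pred, s in lookup.get(base, []):
            # Keep the score from the longest context that saw this word.
            best.setdefault(pred, penalty * s)
        penalty *= ALPHA
    return sorted(best, key=best.get, reverse=True)[:k]


# Invented toy counts; base "" is the total number of tokens.
ngram_freq = {
    "one_of_the": 8, "one_of_a": 2,
    "of_the": 30, "of_my": 10, "of_a": 5,
    "the": 100, "a": 50, "my": 20,
}
base_freq = {"one_of": 10, "of": 60, "": 1000}

lookup = build_lookup(ngram_freq, base_freq)
print(suggest("one of", lookup))
print(suggest("kind of", lookup))
```

All counting and division happens in build_lookup, so answering a query costs a handful of dictionary lookups plus a sort over a short candidate list.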

Data tables were used to store the data set; they are much faster than data frames and allow the data to be accessed easily through indexing.
The complete provided corpus, together with the extra Amazon data, was used, which improves the performance of the algorithm.
WordPredictR has an out-of-sample accuracy of around 40 percent.
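
The report does not say how the accuracy figure was measured. One common way to measure out-of-sample accuracy for a next-word predictor is sketched below in Python, under the assumption that a prediction counts as correct when the true next word appears among the top k suggestions on held-out text. The predictor and the sentences are placeholders for illustration, not the app's model or data.

```python
def top_k_accuracy(predict, sentences, k=5):
    """Fraction of next-word positions where the true word is in the top k.

    `predict` takes the list of preceding words and returns ranked suggestions.
    """
    hits = total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            total += 1
            if words[i] in predict(words[:i])[:k]:
                hits += 1
    return hits / total if total else 0.0


def dummy_predict(context):
    """Placeholder predictor that always suggests the same three words."""
    return ["the", "of", "a"]


# Placeholder held-out sentences, not taken from the project's data.
held_out = ["one of the best", "a lot of fun"]
print(top_k_accuracy(dummy_predict, held_out))
```

The held-out text should not overlap with the text used to build the n-gram tables, otherwise the figure overstates real-world performance.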