TextPredictor

Padma Panchapakesan
28th February 2020

Capstone Project for Coursera Data Science Course

Introduction

In this project we develop a word-prediction app.
The app takes a string as user input and predicts the next word.
The prediction model is incorporated in a Shiny app, which provides a front end for user input.

Corpus Data

The data used for the prediction model was obtained from blogs, tweets and news.
This data set was provided as part of the Coursera capstone project.

Preprocessing

The following preprocessing was performed on the data before use:

– Remove extra whitespace

– Convert all letters to lower case

– Remove punctuation

– Remove numbers

Trigrams, bigrams and unigrams were generated from the preprocessed corpus.
The n-grams are sorted by frequency of occurrence (highest to lowest) and stored in RData files.
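
The cleaning steps and the n-gram tables can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's R code. The function names `preprocess` and `sorted_ngrams` and the sample sentence are hypothetical. The sketch also assumes that n-grams may span sentence boundaries, which the project may handle differently.

```python
import re
from collections import Counter

def preprocess(text):
    """Apply the four cleaning steps listed above (illustrative only)."""
    text = text.lower()                     # convert all letters to lower case
    text = re.sub(r"[^\w\s]|_", "", text)   # remove punctuation
    text = re.sub(r"\d+", "", text)         # remove numbers
    return " ".join(text.split())           # remove extra whitespace

def sorted_ngrams(tokens, n):
    """Count n-grams and return (ngram, count) pairs, most frequent first."""
    counts = Counter(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common()

# Hypothetical sample text standing in for the blogs/tweets/news corpus.
sample = "The cat sat.  The cat ran 2 miles!  The dog sat."
tokens = preprocess(sample).split()

unigrams = sorted_ngrams(tokens, 1)
bigrams = sorted_ngrams(tokens, 2)
trigrams = sorted_ngrams(tokens, 3)
```

In this sketch each table is a list already ordered by frequency, so the first entry that matches a lookup is also the most frequent one.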

Algorithm

The user input is first searched for in the sorted trigrams.
If a match is found, the last word of the most frequent matching trigram is predicted as the next word.
Otherwise, the user input is searched for in the bigrams, and the last word of the most frequent matching bigram is predicted.
If no match is found in either the trigrams or the bigrams, the most frequent unigram is predicted as the next word.
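
The lookup order described above can be sketched as follows, again in Python and not taken from the project's R code. The function `predict_next` and the small frequency tables are hypothetical. The sketch assumes that a trigram matches when its first two words equal the last two words of the input, and that a bigram matches when its first word equals the last input word; the source does not spell out the matching rule.

```python
def predict_next(text, trigrams, bigrams, unigrams):
    """Predict the next word by falling back from trigrams to bigrams to unigrams.

    Each table is a list of (ngram_tuple, count) pairs sorted from most to
    least frequent, so the first match found is the most frequent one.
    """
    words = text.lower().split()

    # 1. Trigrams: compare the last two input words (assumed matching rule).
    if len(words) >= 2:
        context = tuple(words[-2:])
        for gram, _count in trigrams:
            if gram[:2] == context:
                return gram[2]

    # 2. Bigrams: compare the last input word only.
    if words:
        for gram, _count in bigrams:
            if gram[0] == words[-1]:
                return gram[1]

    # 3. Fallback: the most frequent unigram.
    return unigrams[0][0][0]

# Hypothetical tables, already sorted by frequency (highest to lowest).
trigrams = [(("thank", "you", "for"), 5), (("thank", "you", "so"), 3)]
bigrams = [(("of", "the"), 9), (("you", "are"), 4), (("you", "for"), 2)]
unigrams = [(("the",), 20), (("you",), 7)]

print(predict_next("Thank you", trigrams, bigrams, unigrams))  # for
print(predict_next("hey you", trigrams, bigrams, unigrams))    # are
print(predict_next("xyzzy", trigrams, bigrams, unigrams))      # the
```

A linear scan mirrors the description of searching the sorted tables. For a large corpus, a dictionary keyed by the context words would probably be faster.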

Shiny App and Code

Shiny App: https://padmapanchapakesan.shinyapps.io/TextPredictor/
GitHub: https://github.com/PadmaPanchapakesan/DataScienceCapstone