TextPredictor
Padma Panchapakesan
28th February 2020
Capstone Project for Coursera Data Science Course
Introduction
- In this project we develop a word-prediction app
- The app takes a string as user input and predicts the next word
- The prediction model is incorporated in a Shiny app, which provides a front end for user input
Corpus Data
- The data used for the prediction model was obtained from blogs, tweets and news.
- This data set was provided as part of the Coursera capstone project.
Preprocessing
The following preprocessing was performed on the data before use:
– Remove extra whitespace
– Convert all letters to lower case
– Remove punctuation
– Remove numbers
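The four cleaning steps above can be sketched in a few lines. This is a minimal Python illustration, not the R code used in the project; the function name `preprocess` is hypothetical, and `string.punctuation` covers ASCII punctuation only, which is an assumption about what "punctuation" means here. Whitespace is collapsed last so that gaps left by removed numbers and punctuation are also cleaned up.

```python
import re
import string


def preprocess(text: str) -> str:
    """Apply the four cleaning steps to one piece of text (hypothetical helper)."""
    # Convert all letters to lower case
    text = text.lower()
    # Remove punctuation (ASCII punctuation only, by assumption)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove numbers
    text = re.sub(r"\d+", "", text)
    # Remove extra whitespace: collapse runs and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text


print(preprocess("  Hello,   World! It's 2020.  "))  # prints: hello world its
```

Note that removing punctuation also removes apostrophes, so "It's" becomes "its"; whether that is desirable depends on the corpus.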
Trigrams, bigrams and unigrams were generated from the preprocessed corpus.
The n-grams were sorted by frequency of occurrence (highest to lowest) and
stored in RData files.
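As a rough illustration of generating and sorting n-grams, the sketch below counts them with Python's `collections.Counter`. It is not the project's R code: the helper name `sorted_ngrams` and the toy sentence are made up for the example, and the step of saving the tables to disk (RData in the project) is left out.

```python
from collections import Counter


def sorted_ngrams(tokens, n):
    """Return (ngram, count) pairs ordered from most to least frequent."""
    # zip over n shifted views of the token list yields every n-word window
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


# Toy corpus, already preprocessed and split into words (illustrative only)
tokens = "i love data i love r i like data".split()

unigrams = sorted_ngrams(tokens, 1)
bigrams = sorted_ngrams(tokens, 2)
trigrams = sorted_ngrams(tokens, 3)

print(unigrams[0])  # (('i',), 3)
print(bigrams[0])   # (('i', 'love'), 2)
```

Each n-gram is stored as a tuple of words, so the same function serves all three table sizes.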
Algorithm
- User input is first searched in the sorted trigrams
- If the string is found, then the last word of the most frequent trigram is predicted as the next word
- Otherwise, the user input is searched in the sorted bigrams
- If it is found in neither the trigrams nor the bigrams, the most frequent unigram is predicted as the next word
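The back-off lookup described above could look roughly like the following Python sketch. Several details are assumptions rather than facts from the project: the last two words of the input are matched against the first two words of each trigram, the last word is matched against the first word of each bigram, and the tables are lists of `(ngram, count)` pairs already sorted from most to least frequent. The function name `predict_next` and the toy tables are hypothetical.

```python
def predict_next(text, trigrams, bigrams, unigrams):
    """Predict the next word by backing off from trigrams to bigrams to unigrams."""
    words = text.lower().split()

    # Step 1: the first (most frequent) trigram starting with the last two words
    if len(words) >= 2:
        for gram, _count in trigrams:
            if gram[:2] == tuple(words[-2:]):
                return gram[2]

    # Step 2: the first (most frequent) bigram starting with the last word
    if words:
        for gram, _count in bigrams:
            if gram[0] == words[-1]:
                return gram[1]

    # Step 3: fall back to the most frequent unigram
    return unigrams[0][0][0]


# Toy frequency tables, sorted highest to lowest (illustrative data only)
trigrams = [(("i", "love", "data"), 2), (("i", "love", "r"), 1)]
bigrams = [(("i", "love"), 3), (("love", "data"), 2)]
unigrams = [(("the",), 5), (("i",), 3)]

print(predict_next("I love", trigrams, bigrams, unigrams))       # data
print(predict_next("really love", trigrams, bigrams, unigrams))  # data
print(predict_next("hello world", trigrams, bigrams, unigrams))  # the
```

Because the tables are pre-sorted, the first match is also the most frequent one, so no counts need to be compared at prediction time. A linear scan is fine for a toy table; a real corpus would likely need an index keyed on the leading words.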