TextPredictor

Padma Panchapakesan
28th February 2020

Capstone Project for Coursera Data Science Course

Introduction

  • TextPredictor is a word-prediction app
  • The prediction model is embedded in a Shiny app, which provides a front end for user input
  • The user enters text in the text box and presses the Submit button
  • The predicted next word appears in the main panel
  • The links for the Shiny App and the source code are given below

Data Set

  • The data used for the prediction model was obtained from blogs, tweets and news.
  • This data set was provided as part of the Coursera capstone project

  • The data obtained was preprocessed to remove extra whitespace, convert all letters to lower case, remove punctuation and remove numbers

  • Trigrams, bigrams and unigrams were generated from the preprocessed data

  • The n-grams were sorted by frequency of occurrence (highest to lowest) and stored in RData files
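
The preprocessing and n-gram steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual R code: the function names (`preprocess`, `ngrams`, `sorted_ngram_counts`) are hypothetical, tokens are assumed to be split on whitespace, and punctuation is assumed to be deleted rather than replaced by a space.

```python
import re
from collections import Counter

def preprocess(text):
    """Lower-case, drop numbers and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[0-9]+", "", text)      # remove numbers
    text = re.sub(r"[^\w\s]|_", "", text)   # remove punctuation (assumption: deleted, not spaced)
    return re.sub(r"\s+", " ", text).strip()  # remove extra whitespace

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sorted_ngram_counts(lines, n):
    """Count n-grams over all lines, sorted from most to least frequent."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(preprocess(line).split(), n))
    return counts.most_common()

# Made-up sample lines, for illustration only.
sample = ["The cat sat.", "The cat ran 2 miles!"]
bigrams = sorted_ngram_counts(sample, 2)
```

In this sketch the sorted list plays the role of the stored n-gram tables; the original project saves its tables to RData files instead.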

Algorithm

  • User input is first searched in the sorted trigrams
  • If the string is found, then the last word of the most frequent trigram is predicted as the next word
  • Otherwise, the user input is searched in the sorted bigrams, and the last word of the most frequent matching bigram is predicted
  • If it is found in neither the trigrams nor the bigrams, the most frequent unigram is predicted as the next word
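
The back-off logic above can be sketched as below. This is an illustrative Python version under stated assumptions, not the app's R implementation: it assumes the last two words of the input are matched against the first two words of each trigram, and the last word against the first word of each bigram. The names `build_table` and `predict_next` and all counts are hypothetical.

```python
from collections import Counter

def build_table(ngram_counts):
    """Map each prefix (all words but the last) to its most frequent final word."""
    table = {}
    for gram, _count in ngram_counts.most_common():
        # most_common() is sorted high to low, so the first hit per prefix wins
        table.setdefault(gram[:-1], gram[-1])
    return table

def predict_next(text, trigram_table, bigram_table, top_unigram):
    """Try trigrams, then bigrams, then fall back to the top unigram."""
    words = text.lower().split()
    if len(words) >= 2 and tuple(words[-2:]) in trigram_table:
        return trigram_table[tuple(words[-2:])]
    if words and (words[-1],) in bigram_table:
        return bigram_table[(words[-1],)]
    return top_unigram

# Made-up counts, for illustration only.
trigram_counts = Counter({("thanks", "for", "the"): 5, ("thanks", "for", "your"): 3})
bigram_counts = Counter({("for", "the"): 9, ("the", "first"): 4, ("the", "same"): 3})
unigram_counts = Counter({"the": 50, "to": 30})

tri = build_table(trigram_counts)
bi = build_table(bigram_counts)
top = unigram_counts.most_common(1)[0][0]

print(predict_next("Thanks for", tri, bi, top))    # trigram match
print(predict_next("in the", tri, bi, top))        # bigram fallback
print(predict_next("hello world", tri, bi, top))   # unigram fallback
```

Precomputing one best word per prefix keeps each lookup to a single dictionary access, which matters for an interactive front end; scanning the sorted table on every request would give the same answer more slowly.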

Links