TextPredictor

Padma Panchapakesan
28th February 2020

Capstone Project for Coursera Data Science Course

Introduction

  • In this project we develop a word-prediction app
  • The app takes a string as user input and predicts the next word
  • The prediction model is built into a Shiny app, which provides the front end for user input

Corpus Data

  • The data used for the prediction model was obtained from blogs, tweets and news
  • This data set was provided as part of the Coursera capstone project

Preprocessing

  • The following preprocessing was performed on the data before use:

    Remove extra whitespace

    Convert all letters to lower case

    Remove punctuation

    Remove numbers

  • Trigrams, bigrams and unigrams were generated from the preprocessed corpus

  • The n-grams were sorted by frequency of occurrence (highest to lowest) and stored in RData files
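
The preprocessing and n-gram steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the app's own R code: the function names `preprocess` and `ngram_counts` are hypothetical, whitespace cleanup is done last here (an assumption, since removing punctuation and digits can leave extra spaces behind), and the result is kept in memory rather than saved to RData files.

```python
import re
import string
from collections import Counter


def preprocess(text):
    """Apply the four cleaning steps to one piece of text."""
    text = text.lower()                                               # lower case
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\d+", "", text)                                   # numbers
    # Done last (assumption): the removals above can create extra spaces.
    text = re.sub(r"\s+", " ", text).strip()                          # whitespace
    return text


def ngram_counts(lines, n):
    """Count n-grams of size n and return them sorted by frequency.

    Returns a list of (ngram_tuple, count) pairs, highest count first.
    """
    counts = Counter()
    for line in lines:
        words = preprocess(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common()


# Toy corpus (hypothetical data, for illustration only)
corpus = ["I like tea.", "I like coffee!"]
bigrams = ngram_counts(corpus, 2)
print(bigrams[0])  # (('i', 'like'), 2)
```

Calling `ngram_counts(corpus, 1)` and `ngram_counts(corpus, 3)` on the same corpus would give the unigram and trigram tables in the same sorted form.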

Algorithm

  • The user input is first looked up in the sorted trigrams
  • If a match is found, the last word of the most frequent matching trigram is predicted as the next word
  • Otherwise, the input is looked up in the bigrams, and the last word of the most frequent matching bigram is predicted
  • If no match is found in either the trigrams or the bigrams, the most frequent unigram is predicted as the next word
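
The fallback order above can be sketched as below. This is an illustrative Python sketch, not the app's R code. It assumes that the lookup compares the last two words of the input with the first two words of each trigram, and the last word of the input with the first word of each bigram; the slides do not spell out the matching rule. The function name `predict_next` and the toy tables are hypothetical, and the input is only lower-cased here rather than fully preprocessed.

```python
def predict_next(text, trigrams, bigrams, unigrams):
    """Predict the next word by falling back from trigrams to unigrams.

    Each table is a list of (ngram_tuple, count) pairs sorted by count,
    highest first, so the first match is the most frequent one.
    """
    words = text.lower().split()

    # 1. Trigrams: match the last two input words (assumed matching rule).
    if len(words) >= 2:
        prefix = tuple(words[-2:])
        for gram, _count in trigrams:
            if gram[:2] == prefix:
                return gram[2]

    # 2. Bigrams: match the last input word.
    if words:
        for gram, _count in bigrams:
            if gram[0] == words[-1]:
                return gram[1]

    # 3. Nothing matched: fall back to the most frequent unigram.
    return unigrams[0][0][0]


# Toy tables (hypothetical data, already sorted by count)
trigrams = [(("i", "like", "tea"), 3), (("i", "like", "coffee"), 1)]
bigrams = [(("like", "tea"), 3), (("green", "tea"), 2)]
unigrams = [(("the",), 10), (("tea",), 5)]

print(predict_next("I like", trigrams, bigrams, unigrams))        # tea
print(predict_next("really green", trigrams, bigrams, unigrams))  # tea
print(predict_next("xyz", trigrams, bigrams, unigrams))           # the
```

Because the tables are sorted in advance, each lookup can stop at the first match instead of comparing counts at prediction time.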

Shiny App and Code