Coursera Data Science Specialization Capstone Project

Michiko
Jan. 21, 2016

Introduction

The objective of the assignment:

  • Develop a shiny app in which the next word is predicted based on a text input.

My approach:

  • By exploring the given data sets, I found that the frequency distributions of words used on Twitter differ from those in blogs and news articles. Therefore, I built a prediction model specialized for Twitter.
  • The prediction model takes into account not only the frequency of each word used on Twitter, but also the frequency of word sequences made up of multiple words (n-gram words). In this model, bigram, trigram, and quadrigram (4-gram) words are considered.
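
As a rough illustration of what counting n-gram words involves, here is a minimal Python sketch. It is not the app's actual code: the function names, the whitespace tokenizer, and the two sample tweets are all assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()

def count_ngrams(lines, n):
    """Count every run of n consecutive words across the given lines."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        # Slide a window of length n over the words of this line.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical sample data standing in for the Twitter data set.
tweets = ["thanks for the follow", "thanks for the love"]
bigrams = count_ngrams(tweets, 2)
trigrams = count_ngrams(tweets, 3)
quadrigrams = count_ngrams(tweets, 4)
```

With counts like these, the words that most often follow a given sequence can be looked up directly.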

Shiny app: Word prediction (Twitter ver.)

The application is available here: https://cat-fish.shinyapps.io/shiny2/

How to use?

  • Type words into the “Input text here” box.
  • The predicted next word appears in the “Predicted word” box!

This app has some unique features…

  • Input one of the seven dirty words. What do you get in the prediction box?
  • Input a long sentence. Do you see the algorithm behind it?

The special features

Treatment of swear words

  • I believe that what matters is not swear words themselves but how words are used. Therefore, instead of prohibiting swear words, I set up the model to show a warning whenever a swear word is entered.
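
A warning of this kind could be sketched as follows in Python. This is only an illustration under assumptions: the word list is a mild placeholder, and the function name and warning text are hypothetical rather than taken from the app.

```python
import re

# Placeholder list: the words the app actually checks are not reproduced here.
FLAGGED_WORDS = {"darn", "heck"}

def swear_warning(text):
    """Return a warning string if the input contains a flagged word, else None."""
    # Extract lowercase word tokens so punctuation does not hide a match.
    words = re.findall(r"[a-z']+", text.lower())
    if any(word in FLAGGED_WORDS for word in words):
        return "Warning: please mind your language."
    return None
```

The input is still accepted in this sketch; the warning is shown in addition to, not instead of, handling the text.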

Methodology to improve the prediction accuracy

  • When more than 3 words are given, only the last 3 words are used to predict the next word, because this model considers at most 4-gram words. If these 3 words do not match any 4-gram word in the database, the last 2 words are matched against the 3-gram words instead. If that also fails, the same step is repeated with the last word and the 2-gram words.
  • If no match is found at all, one of the most frequently used words in the database is chosen at random.
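
The fallback steps above can be sketched in Python as below. This is a minimal sketch, not the app's implementation: the layout of `tables`, the sample counts, the `top_words` list, and the choice of the highest-count continuation when several candidates match are all assumptions.

```python
import random

def predict_next(text, tables, top_words, rng=random):
    """Predict the next word by trying the longest context first.

    tables maps a context length k (3, 2, 1) to a dict of
    {context tuple: {next word: count}} (an assumed layout).
    """
    words = text.lower().split()
    for k in (3, 2, 1):  # 4-gram, then 3-gram, then 2-gram lookup
        if len(words) < k:
            continue
        context = tuple(words[-k:])
        candidates = tables.get(k, {}).get(context)
        if candidates:
            # Assumption: return the continuation with the highest count.
            return max(candidates, key=candidates.get)
    # No n-gram matched: pick one of the most frequent words at random.
    return rng.choice(top_words)

# Hypothetical counts for illustration only.
tables = {
    3: {("thanks", "for", "the"): {"follow": 5, "love": 2}},
    2: {("for", "the"): {"first": 4}},
    1: {("the",): {"same": 9}},
}
top_words = ["the", "to", "i"]
```

For example, `predict_next("waiting for the", tables, top_words)` finds no 4-gram match for the last 3 words and falls back to the 3-gram entry for "for the".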

For further improvement

  • The accuracy of the model is expected to be improved by using larger data sets to build the word frequency database.
  • Affixes could be removed from words in a better way when building the word frequency database.
  • It would be great if emoji and emoticons could be taken into account.