Data Science Specialization Capstone Presentation

HM
July 26, 2015

Background

  • Over the last decade, the explosion of “smart” phones and other “smart” wearable electronics, coupled with the rapid development of computer chips, has made data input devices smaller and smaller. To offset the difficulty of typing on such small devices, manufacturers and programmers rely on software that facilitates typing.
  • The purpose of this capstone project is to create an application that will suggest words to the user as they are typing.
  • This presentation is a quick description of the training data and of the steps used to develop the application.

Methods

  • The data used was graciously made available at: http://www.corpora.heliohost.org/

  • After loading the data sets, I will create a corpus from each one.

  • I will clean each corpus using the tm package in R, removing punctuation, extra white space, profanity, and stop words.

  • I will then tokenize each corpus using the RWeka package.

  • This will produce 2-gram, 3-gram, 4-gram, and 5-gram tokens.
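
The cleaning and tokenizing steps above can be sketched in base R, so the example runs without tm or RWeka installed. This is only an illustration under stated assumptions: the helper names `clean_text` and `make_ngrams`, the tiny stop-word list, and the sample sentence are all hypothetical, and the project itself plans to use the tm and RWeka packages for these steps.

```r
# Hypothetical stop-word list; a real run would use a much longer one.
stop_words <- c("the", "a", "an", "of", "to", "is")

# Lower-case the text, drop punctuation and stop words, collapse white space.
clean_text <- function(x, stop_words) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  words[nzchar(words) & !(words %in% stop_words)]
}

# Build all n-grams of size n from a vector of words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words    <- clean_text("The quick, brown fox jumps over the lazy dog!", stop_words)
bigrams  <- make_ngrams(words, 2)
trigrams <- make_ngrams(words, 3)
```

Calling `make_ngrams(words, 4)` and `make_ngrams(words, 5)` in the same way would give the 4-gram and 5-gram tokens.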

Methods (Continued)

  • After tokenizing the texts, I will remove the duplicates:
tokens <- unique(tokens)
  • Finally, I will summarize the tokens by computing the frequency of each term.
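
A minimal base-R sketch of the frequency summary, using a made-up token vector. It assumes the counts are taken from the full token vector, because calling `unique()` first would leave every n-gram with a count of 1. The variable names `counts` and `freq` are illustrative only.

```r
# Made-up bigram tokens standing in for the tokenizer output.
tokens <- c("new york", "i am", "new york", "thank you", "i am", "new york")

# Count occurrences of each n-gram, then sort from most to least frequent.
counts <- sort(table(tokens), decreasing = TRUE)

# Store the result as a data frame: one row per distinct n-gram.
freq <- data.frame(ngram = names(counts),
                   count = as.integer(counts),
                   stringsAsFactors = FALSE)
```

The resulting table has one row per distinct n-gram, so the duplicates are collapsed while their counts are kept.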

Creating the Model

  • In the next step, I will split the data into a training set and a test set.
  • The training set will be approximately 70 percent of the data.
  • The test set will be the remaining 30 percent or so.
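
One way to make the 70/30 split is a random sample of line indices, sketched below in base R. The stand-in data, the seed, and the variable names are assumptions for illustration; the real input would be the lines of the cleaned corpora.

```r
set.seed(2015)  # arbitrary seed so the split is reproducible

# Stand-in for the cleaned lines of text.
text_lines <- paste("sample line", 1:100)

# Randomly choose about 70 percent of the line indices for training.
train_idx <- sample(seq_along(text_lines),
                    size = round(0.7 * length(text_lines)))

train <- text_lines[train_idx]   # about 70 percent
test  <- text_lines[-train_idx]  # the remaining lines
```

Sampling indices rather than taking the first 70 percent avoids any bias from the order of the lines in the source files.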

Final Steps

  • I will cross-validate the model using the RWeka package and the n-grams created in the previous step.
  • The purpose is to train the system so that it can pick the n-gram with the highest score.
  • The user interface will be created using Shiny. It will have a text box for the user input and another text box where the predicted next word will appear after the user presses the “Go” button.
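
Picking the n-gram with the highest score could look like the base-R sketch below. It rests on several assumptions that are not in the plan above: the score is taken to be the raw frequency count, the function `predict_next` and the toy table `ngram_freq` are hypothetical, and the lookup falls back to a shorter context when the longer one has no match.

```r
# Toy frequency table: the context (preceding words), the next word, its count.
ngram_freq <- data.frame(
  context = c("thank", "thank", "thank you", "thank you", "new"),
  nxt     = c("you", "god", "very", "so", "york"),
  count   = c(10L, 2L, 5L, 3L, 7L),
  stringsAsFactors = FALSE
)

# Return the most frequent next word for the longest matching context,
# or NA if no context matches. A 5-gram has at most 4 context words.
predict_next <- function(input, freq, max_context = 4) {
  words <- strsplit(trimws(tolower(input)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  n <- min(length(words), max_context)
  while (n >= 1) {
    context <- paste(tail(words, n), collapse = " ")
    hits <- freq[freq$context == context, ]
    if (nrow(hits) > 0) return(hits$nxt[which.max(hits$count)])
    n <- n - 1  # back off to a shorter context
  }
  NA_character_
}

predict_next("I want to thank you", ngram_freq)
predict_next("Thank", ngram_freq)
```

In the Shiny application, a function along these lines would be called with the contents of the input text box when the user presses the “Go” button.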

Summary

  • In conclusion, the application will be simple to use and will be trained by learning from the three available data sets.
  • The following R packages will do most of the work: tm, RWeka, and data.table.
  • Speed and accuracy will be a challenge, but careful coding and package choices will be used to minimize errors and improve the user experience.
date()
[1] "Sun Jul 26 16:49:38 2015"