September 21, 2019

Creating the Data Set and N-Grams

  • For my model, I built a sample data set from the three source data sets (news, blogs, and Twitter). I then cleaned the sample with VCorpus and tm_map to remove profanity, non-English words, etc.
  • Next, I built n-grams using NGramTokenizer and TermDocumentMatrix.
  • I then converted the results to a data.frame, which gave me an easy-to-read table of each set of words and its frequency in my data set.
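
As a rough illustration of the frequency table described above, here is a minimal base R sketch. It does not use the tm or RWeka functions named in the list; the toy text, the `make_bigrams` helper, and the `bigram_freq` name are all assumptions made for this example, and it only builds bigrams.

```r
# Toy stand-in for the cleaned sample text (hypothetical data).
lines <- c("i am happy", "i am here", "he is happy")

# Split each line into lowercase words.
tokens <- strsplit(tolower(lines), "\\s+")

# Build bigrams within a line by pairing each word with the one after it.
make_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}
bigrams <- unlist(lapply(tokens, make_bigrams))

# Count each bigram and sort from most to least frequent.
counts <- sort(table(bigrams), decreasing = TRUE)

# Store the result as a data frame of bigrams and their frequencies.
bigram_freq <- data.frame(
  bigram = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
```

Printing `bigram_freq` shows one row per distinct bigram, with the most frequent one in the first row.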

Tables for Predicting Words

  • From here, I thought about how I would do a task like this in Excel (I use Excel daily and am very familiar with its functions). In Excel, I would use a VLOOKUP with the words you want predicted in one column, the outcome in a second column, and the frequency in a third, sorted from largest to smallest.
  • For this data set, I built those three columns and combined a filter with “which” to make a function that works like a VLOOKUP.
  • I then added a backup prediction in case the word entered was not in my data set. Since the most frequent word in my data set was “the”, the function returns “the” whenever it finds no match.
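
The lookup described above could be sketched in base R roughly as follows. This is not the app's actual code: the `lookup` table, its frequencies, and the `predict_word` function are hypothetical, and matching on only the last word typed is an assumption made for the example.

```r
# Hypothetical lookup table: input word, predicted next word, frequency.
lookup <- data.frame(
  input = c("i", "i", "he", "she"),
  outcome = c("am", "have", "is", "is"),
  freq = c(120L, 80L, 60L, 55L),
  stringsAsFactors = FALSE
)

# Sort so the most frequent outcome for each input comes first.
lookup <- lookup[order(-lookup$freq), ]

predict_word <- function(text, tbl = lookup, fallback = "the") {
  # Assumption: only the last word typed is used for the lookup.
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  if (length(words) == 0) return(fallback)
  last_word <- words[length(words)]

  # "which" returns the matching row positions, like a VLOOKUP match.
  hits <- which(tbl$input == last_word)

  # No match: fall back to the most frequent word in the data set.
  if (length(hits) == 0) return(fallback)

  # Rows are sorted by frequency, so the first hit is the most frequent.
  tbl$outcome[hits[1]]
}
```

For example, `predict_word("then he")` looks up “he” in the table, while a word that is not in the table falls through to the backup word.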

Drawbacks

  • One major limitation of my model is the size of the sample data set. Because of the computational limits of my computer, I could only pull 30,000 lines of text from each source, for a total of 90,000 lines.
  • Because of this limitation, my prediction is the word “the” most of the time, unless very simple words such as “I”, “he”, or “she” are used.
  • In the future, I would use data.table instead of data.frame so the model can handle more lines of text, which should improve its accuracy.

How the App Works

  • To run the app, first enter the words you would like a prediction for under the instructions. You can type as many or as few words as you like.
  • When you click “Submit”, your predicted word will come up under “Word Prediction” to the right.
  • Please note: unless you use very basic words such as “I”, “he”, or “she”, the result will be the word “the”. Because the output does not change, it can look like the app hasn’t refreshed even after you click “Submit”, so please test the model with very simple words.
  • Thank you again, and I hope you have a great day!