The purpose of this report is to present an analysis of the dataset being used to create a text prediction app and to give a brief explanation of the algorithm that will be used to make fast and accurate predictions. The application will take a word or series of words and predict the next word, much like the text prediction on mobile phone messaging or when typing into the Google search bar.
The data for this project is available here: Capstone Dataset. It is part of the HC Corpora.
An analysis has been done on the 3 text files in the en_US folder: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
A portion of the data in these files will be used to train a model for text prediction. For the purpose of this analysis, I have taken a 1% random sample of each file to alleviate memory and speed issues. These samples should be representative of the full data set.
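One way to take such a sample in R is sketched below; the seed, helper name and file paths are illustrative assumptions rather than the exact code used for this analysis.

```r
# Minimal sampling sketch: keep each line with probability 1%.
# The paths and seed here are illustrative assumptions.
set.seed(1234)
sample_file <- function(path, fraction = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, fraction) == 1]
}
writeLines(sample_file("en_US/en_US.blogs.txt"), "blog_sample.txt")
```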
Below I have summarised some of the initial findings from my exploratory data analysis of these files.
Using the GnuWin32 ‘file’ command, we see that each file is UTF-8 Unicode, English (mostly) text with CRLF line terminators.
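For reference, the same check can be run from within R (assuming the GnuWin32 `file` executable is on the PATH):

```r
# Illustrative: invoke the GnuWin32 'file' command from R.
system2("file", args = "en_US/en_US.blogs.txt", stdout = TRUE)
```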
The number of lines in each file is as follows:
## [1] "Number of lines in en_US.blogs.txt: 899288"
## [1] "Number of lines in en_US.news.txt: 1010242"
## [1] "Number of lines in en_US.twitter.txt: 2360148"
The number of words in each file is as follows:
## [1] "Number of words in en_US.blogs.txt: 37296225"
## [1] "Number of words in en_US.news.txt: 34258969"
## [1] "Number of words in en_US.twitter.txt: 29959336"
As a general overview, this graph shows the most common single words that appear in the 3 text documents.
Breaking the counts down by source (blog, news and twitter), we can see that the most common single words are broadly similar across the 3 files; a sketch of how this comparison can be built follows the table.
## blog_sample.txt news_sample.txt twitter_sample.txt
## one 1257 2450 1541
## like 998 859 1245
## just 966 707 1128
## can 946 625 1064
## time 881 579 957
## get 722 576 944
## know 653 554 881
## now 601 521 879
## new 571 503 855
## also 558 498 820
## us 544 490 796
## people 541 490 788
## day 539 455 756
## even 532 454 746
## much 525 442 744
## good 519 431 740
## make 502 426 726
## first 498 348 714
## well 498 344 675
## think 489 340 636
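The sketch below shows one way to build such a comparison with the tm package. The directory name and pre-processing steps are assumptions; note that stopwords are removed, which is why words like "the" and "and" are absent above.

```r
library(tm)

# Build a term-document matrix over the three sample files.
corp <- VCorpus(DirSource("samples", pattern = "_sample\\.txt$"))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("en"))
tdm  <- TermDocumentMatrix(corp)

# Most common words across all three documents.
m <- as.matrix(tdm)
head(m[order(rowSums(m), decreasing = TRUE), ], 20)
```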
As these words are common to just about any English language text, I’m going to use Term Frequency Inverse Document Frequency (tf-idf) to find the important words in each document. It works by decreasing the weight for commonly used words and increasing the weight for less common words. It ends up finding the common (but not too common) words in a document which should give us a better flavour of each document in the corpus.
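One way to compute this weighting is tm's weightTfIdf. The sketch below reuses the corpus from the previous chunk; the choice of unnormalised weights is my assumption, not necessarily what produced the figures that follow.

```r
# Re-weight the term-document matrix with tf-idf.
tdm_tfidf <- TermDocumentMatrix(corp, control = list(
  weighting = function(x) weightTfIdf(x, normalize = FALSE)))
m_tfidf <- as.matrix(tdm_tfidf)

# Top 20 terms per document, ranked by tf-idf weight.
top_terms <- lapply(colnames(m_tfidf), function(doc)
  head(sort(m_tfidf[, doc], decreasing = TRUE), 20))
names(top_terms) <- colnames(m_tfidf)
top_terms
```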
## $blog_sample.txt
## tsp coloured cardstock muffin knit
## 7.633940 6.679698 6.679698 6.202576 5.725455
## stir stamp ideals realised flour
## 5.458829 5.282738 5.248334 5.248334 4.578373
## layer christians realise satan intentions
## 4.402281 4.402281 4.294091 4.294091 4.294091
## attachment lol allah consciousness passages
## 4.294091 3.874008 3.816970 3.816970 3.816970
##
## $news_sample.txt
## team's spokesman portland commission trenton voters
## 12.405153 11.093749 10.565476 10.037202 9.542425 9.156745
## sheriff's township dimora averaged prosecutors minneapolis
## 9.065304 9.065304 8.588183 8.111061 7.748015 7.633940
## kasich coordinator declined christie enforcement attorney's
## 7.633940 7.633940 7.571924 7.571924 7.219742 7.156819
## winery toyota
## 6.679698 6.679698
##
## $twitter_sample.txt
## rt lol haha lmao tweet shit dm
## 144.39483 118.86160 47.36855 35.78409 32.04861 31.34424 29.10440
## tho thx wanna ff ur nigga smh
## 26.24167 25.76455 24.47669 23.37894 23.24405 21.47046 20.03909
## ass omg ya fuck congrats aw
## 19.54613 19.54613 19.37004 19.01786 18.66567 17.65349
The good people of Twitter are expressing themselves enthusiastically. My first thought was to filter out the profanities, but that has complications: it would likely make some sentences nonsensical, strip out the sentiment of what is being said, and complicate text prediction. The better way might be to check for profanity in the prediction phase, i.e., don't suggest a profanity but rather the next most likely word.
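A minimal sketch of that idea, with a placeholder profanity list and an already-ranked candidate vector:

```r
# Drop profane candidates and return the next most likely words.
suggest <- function(candidates, profanity, n = 3) {
  head(candidates[!candidates %in% profanity], n)
}
# candidates are assumed to be sorted most-likely first
suggest(c("badword", "day", "deal", "dear"), profanity = c("badword"))
## [1] "day"  "deal" "dear"
```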
The common single words of each document show some variation in the texts but don’t really give us much information. Now we’ll look at common groups of words (ngrams). As an example here, we’ll look at groups of 4 consecutive words.
Splitting the text into 4-grams in this way and applying a tf-idf weighting to find the important phrases shows up some differences between the 3 documents. Whereas the blog text is all about 'I, me, my' (e.g. 'I thought I would', 'I have to say', 'this is my first'), twitter is more about 'you', as in 'thank you', 'what are you doing', 'how have you been' etc. Interesting. Not surprisingly, the news text does not generally make use of the first or second person pronouns.
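A base-R sketch of the 4-gram extraction; the underscore join matches the style of the examples below.

```r
# Split a line into overlapping 4-grams, joined with underscores.
ngrams <- function(line, n = 4) {
  w <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = "_"),
         character(1))
}
ngrams("I thought I would write this down")
## [1] "i_thought_i_would" "thought_i_would_write" ...
```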
It must be noted that there are fairly common occurrences of repeated words in both the twitter and blog texts (e.g. happy_happy_happy_happy, u_u_u_u, harry_harry_harry_harry). This would almost certainly skew text prediction probabilities, so I will look at collapsing those runs into single words.
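A sketch of how those runs could be collapsed with a regular expression before tokenising:

```r
# Collapse immediate word repetitions to a single occurrence.
collapse_repeats <- function(line) {
  gsub("\\b(\\w+)(\\s+\\1\\b)+", "\\1", line, perl = TRUE)
}
collapse_repeats("happy happy happy happy birthday")
## [1] "happy birthday"
```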
My plan is to implement in R the Katz Back-Off Model, which deals with the conditional probability of a word given the previous word or sequence of words (ngrams). The idea is to create a Shiny app web page that will take a text input and dynamically produce the 3 most likely next words, similar to how mobile phone predictive text works. The more data used to create the model, the more accurate the predictions. There is a trade-off with speed, however, so I will be testing various sizes of training and test data to find the optimum balance between speed and accuracy.
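The sketch below shows the back-off lookup in its simplest form, trying the longest matching context first. Full Katz back-off additionally applies Good-Turing discounting and back-off weights (alpha), which are omitted here, and the `tables` structure is an assumption: one data frame per n-gram order with columns context, word and count.

```r
# Simplified back-off: try the longest context, fall back to shorter.
# tables[[k]] holds k-grams as (context, word, count); unigram rows
# use context == "".
predict_next <- function(tables, context, n = 3) {
  for (k in rev(seq_along(tables))) {
    ctx  <- paste(tail(context, k - 1), collapse = "_")
    hits <- tables[[k]][tables[[k]]$context == ctx, ]
    if (nrow(hits) > 0)
      return(head(hits$word[order(hits$count, decreasing = TRUE)], n))
  }
  character(0)
}
# e.g. predict_next(tables, c("how", "are")) might return
# "you", "we", "they" given suitable count tables.
```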
There is much still to be done to tidy up the file data. The Twitter data especially has quite freeform spelling as well as text abbreviations and non-conventional sentence structure. Ideally the app will not make text-speak suggestions (e.g. 2day, wtf, etc.). These will either have to be filtered out of the training data or translated into standard English words and phrases.
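A sketch of the translation approach, with a deliberately tiny, illustrative lookup table:

```r
# Map text-speak tokens to standard English before training.
textspeak <- c("2day" = "today", "u" = "you", "thx" = "thanks")
normalise <- function(line) {
  w <- strsplit(line, "\\s+")[[1]]
  paste(ifelse(w %in% names(textspeak), textspeak[w], w),
        collapse = " ")
}
normalise("thx see u 2day")
## [1] "thanks see you today"
```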