The purpose of this report is to present an analysis of the dataset being used to create a text prediction app and to give a brief explanation of the algorithm that will be used to make fast and accurate predictions. The application will take a word or series of words and predict the next word, much like the text prediction on mobile phone messaging or when typing into the Google search bar.
The data for this project is available here: Capstone Dataset. It is part of the HC Corpora.
An analysis has been done on the 3 text files in the en_US folder: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
A portion of the data in these files will be used to train a model for text prediction. For the purpose of this analysis, I have taken a 1% random sample of each file to alleviate memory and speed issues. These samples should be representative of the full data set.
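One way to take such a sample in R is sketched below; the seed, helper name and file paths are illustrative assumptions rather than the exact code used for this analysis.

```r
# Minimal sampling sketch: keep each line with probability 1%.
# The paths and seed here are illustrative assumptions.
set.seed(1234)
sample_file <- function(path, fraction = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, fraction) == 1]
}
writeLines(sample_file("en_US/en_US.blogs.txt"), "blog_sample.txt")
```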
Below I have summarised some of the initial findings from my exploratory data analysis of these files.
Using the GnuWin32 ‘file’ command, we see that each file is UTF-8 Unicode, English (mostly) text with CRLF line terminators.
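For reference, the same check can be run from within R (assuming the GnuWin32 `file` executable is on the PATH):

```r
# Illustrative: invoke the GnuWin32 'file' command from R.
system2("file", args = "en_US/en_US.blogs.txt", stdout = TRUE)
```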
The number of lines in each file is as follows:
## [1] "Number of lines in en_US.blogs.txt: 899288"
## [1] "Number of lines in en_US.news.txt: 1010242"
## [1] "Number of lines in en_US.twitter.txt: 2360148"
The number of words in each file is as follows:
## [1] "Number of words in en_US.blogs.txt: 37296225"
## [1] "Number of words in en_US.news.txt: 34258969"
## [1] "Number of words in en_US.twitter.txt: 29959336"
As a general overview, this graph shows the most common single words that appear in the 3 text documents.
Breaking the counts down by source (blog, news and twitter), we can see that the most common single words are broadly similar across the 3 files; a sketch of how this comparison can be built follows the table.
## blog_sample.txt news_sample.txt twitter_sample.txt
## one 1257 2450 1541
## like 998 859 1245
## just 966 707 1128
## can 946 625 1064
## time 881 579 957
## get 722 576 944
## know 653 554 881
## now 601 521 879
## new 571 503 855
## also 558 498 820
## us 544 490 796
## people 541 490 788
## day 539 455 756
## even 532 454 746
## much 525 442 744
## good 519 431 740
## make 502 426 726
## first 498 348 714
## well 498 344 675
## think 489 340 636
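The sketch below shows one way to build such a comparison with the tm package. The directory name and pre-processing steps are assumptions; note that stopwords are removed, which is why words like "the" and "and" are absent above.

```r
library(tm)

# Build a term-document matrix over the three sample files.
corp <- VCorpus(DirSource("samples", pattern = "_sample\\.txt$"))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("en"))
tdm  <- TermDocumentMatrix(corp)

# Most common words across all three documents.
m <- as.matrix(tdm)
head(m[order(rowSums(m), decreasing = TRUE), ], 20)
```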
As these words are common to just about any English language text, I’m going to use Term Frequency Inverse Document Frequency (tf-idf) to find the important words in each document. It works by decreasing the weight for commonly used words and increasing the weight for less common words. It ends up finding the common (but not too common) words in a document which should give us a better flavour of each document in the corpus.
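One way to compute this weighting is tm's weightTfIdf. The sketch below reuses the corpus from the previous chunk; the choice of unnormalised weights is my assumption, not necessarily what produced the figures that follow.

```r
# Re-weight the term-document matrix with tf-idf.
tdm_tfidf <- TermDocumentMatrix(corp, control = list(
  weighting = function(x) weightTfIdf(x, normalize = FALSE)))
m_tfidf <- as.matrix(tdm_tfidf)

# Top 20 terms per document, ranked by tf-idf weight.
top_terms <- lapply(colnames(m_tfidf), function(doc)
  head(sort(m_tfidf[, doc], decreasing = TRUE), 20))
names(top_terms) <- colnames(m_tfidf)
top_terms
```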
## $blog_sample.txt
## tsp coloured cardstock muffin knit
## 7.633940 6.679698 6.679698 6.202576 5.725455
## stir stamp ideals realised flour
## 5.458829 5.282738 5.248334 5.248334 4.578373
## layer christians realise satan intentions
## 4.402281 4.402281 4.294091 4.294091 4.294091
## attachment lol allah consciousness passages
## 4.294091 3.874008 3.816970 3.816970 3.816970
##
## $news_sample.txt
## team's spokesman portland commission trenton voters
## 12.405153 11.093749 10.565476 10.037202 9.542425 9.156745
## sheriff's township dimora averaged prosecutors minneapolis
## 9.065304 9.065304 8.588183 8.111061 7.748015 7.633940
## kasich coordinator declined christie enforcement attorney's
## 7.633940 7.633940 7.571924 7.571924 7.219742 7.156819
## winery toyota
## 6.679698 6.679698
##
## $twitter_sample.txt
## rt lol haha lmao tweet shit dm
## 144.39483 118.86160 47.36855 35.78409 32.04861 31.34424 29.10440
## tho thx wanna ff ur nigga smh
## 26.24167 25.76455 24.47669 23.37894 23.24405 21.47046 20.03909
## ass omg ya fuck congrats aw
## 19.54613 19.54613 19.37004 19.01786 18.66567 17.65349
The good people of Twitter are expressing themselves enthusiastically. My first thought was to filter out the profanities, but that has complications: it would likely make some sentences nonsensical, strip out the sentiment of what is being said, and complicate text prediction. The better way might be to check for profanity in the prediction phase, i.e., don't suggest a profanity but rather the next most likely word.
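A minimal sketch of that idea, with a placeholder profanity list and an already-ranked candidate vector:

```r
# Drop profane candidates and return the next most likely words.
suggest <- function(candidates, profanity, n = 3) {
  head(candidates[!candidates %in% profanity], n)
}
# candidates are assumed to be sorted most-likely first
suggest(c("badword", "day", "deal", "dear"), profanity = c("badword"))
## [1] "day"  "deal" "dear"
```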
The common single words of each document show some variation in the texts but don’t really give us much information. Now we’ll look at common groups of words (ngrams). As an example here, we’ll look at groups of 4 consecutive words.
Splitting the text into 4-grams in this way and applying a tf-idf weighting to find the important phrases shows up some differences between the 3 documents. Whereas the blog text is all about 'I, me, my' (e.g. 'I thought I would', 'I have to say', 'this is my first'), twitter is more about 'you', as in 'thank you', 'what are you doing', 'how have you been' etc. Interesting. Not surprisingly, the news text does not generally make use of the first or second person pronouns.
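A base-R sketch of the 4-gram extraction; the underscore join matches the style of the examples below.

```r
# Split a line into overlapping 4-grams, joined with underscores.
ngrams <- function(line, n = 4) {
  w <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = "_"),
         character(1))
}
ngrams("I thought I would write this down")
## [1] "i_thought_i_would" "thought_i_would_write" ...
```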
It must be noted that there are fairly common occurrences of repeated words in both the twitter and blog texts (e.g. happy_happy_happy_happy, u_u_u_u, harry_harry_harry_harry). This would almost certainly skew text prediction probabilities, so I will look at collapsing those runs into single words.
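A sketch of how those runs could be collapsed with a regular expression before tokenising:

```r
# Collapse immediate word repetitions to a single occurrence.
collapse_repeats <- function(line) {
  gsub("\\b(\\w+)(\\s+\\1\\b)+", "\\1", line, perl = TRUE)
}
collapse_repeats("happy happy happy happy birthday")
## [1] "happy birthday"
```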
My plan is to implement in R the Katz Back-Off Model, which deals with the conditional probability of a word given the previous word or sequence of words (ngrams). The idea is to create a Shiny app web page that will take a text input and dynamically produce the 3 most likely next words, similar to how mobile phone predictive text works. The more data used to create the model, the more accurate the predictions. There is a trade-off with speed, however, so I will be testing various sizes of training and test data to find the optimum balance between speed and accuracy.
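The sketch below shows the back-off lookup in its simplest form, trying the longest matching context first. Full Katz back-off additionally applies Good-Turing discounting and back-off weights (alpha), which are omitted here, and the `tables` structure is an assumption: one data frame per n-gram order with columns context, word and count.

```r
# Simplified back-off: try the longest context, fall back to shorter.
# tables[[k]] holds k-grams as (context, word, count); unigram rows
# use context == "".
predict_next <- function(tables, context, n = 3) {
  for (k in rev(seq_along(tables))) {
    ctx  <- paste(tail(context, k - 1), collapse = "_")
    hits <- tables[[k]][tables[[k]]$context == ctx, ]
    if (nrow(hits) > 0)
      return(head(hits$word[order(hits$count, decreasing = TRUE)], n))
  }
  character(0)
}
# e.g. predict_next(tables, c("how", "are")) might return
# "you", "we", "they" given suitable count tables.
```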
There is much still to be done to tidy up the file data. The Twitter data especially has quite freeform spelling as well as text abbreviations and non-conventional sentence structure. Ideally the app will not make text-speak suggestions (e.g. 2day, wtf, etc.). These will either have to be filtered out of the training data or translated into standard English words and phrases.
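A sketch of the translation approach, with a deliberately tiny, illustrative lookup table:

```r
# Map text-speak tokens to standard English before training.
textspeak <- c("2day" = "today", "u" = "you", "thx" = "thanks")
normalise <- function(line) {
  w <- strsplit(line, "\\s+")[[1]]
  paste(ifelse(w %in% names(textspeak), textspeak[w], w),
        collapse = " ")
}
normalise("thx see u 2day")
## [1] "thanks see you today"
```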