Executive Summary

The aim of the Data Science Specialisation Capstone project is to produce a Shiny App web application that tries to predict the next word in a user-generated sentence. The data has been provided by Swiftkey and includes content from blogs, news and Twitter.

The aim of this milestone report is to briefly summarise my plan for the prediction algorithm and Shiny App in a way that would be understandable to a non-data scientist (note: for this reason I have hidden the R code from the RPubs document).

I have focused initially on the English datasets for blogs, news and Twitter. I have taken a very small (c.1%) sample of the data to reduce its size and allow it to run on my local machine (note: I have taken larger samples of the data outside the scope of this milestone report and achieved very similar frequency results).

I have then cleaned the data by converting the text to lower case, removing white space and punctuation, and applying a profanity filter. I have then carried out some word frequency analysis and plotted graphs and word clouds.

I have also looked at pairs of words (note: using n-grams) and my intention is to extend this n-gram approach up to groups of 5 words to create my prediction algorithm.


Data Processing. Data Downloaded and Loaded into R

The following code is used to load the data into R. We have loaded the Twitter, blog and news feeds from the English datasets.

# Load the packages 
# library(bitops); library(class); library(ggplot2); library(plyr); 
# library(RCurl); library(rjson); library(stringr);
# library(twitteR); library(XML); library(tm); library(RWeka); library(dplyr);
# library(wordcloud); library(stringi)
# 
# Set directory
# setwd("~/Desktop/Capstone/final/en_US")
# 
# Load the data 
# en_news <- as.matrix(readLines("en_US.news.txt",-1,skipNul = TRUE))
# en_twit <- as.matrix(readLines("en_US.twitter.txt",-1,skipNul = TRUE))
# en_blog <- as.matrix(readLines("en_US.blogs.txt",-1,skipNul = TRUE))

Exploratory Data Analysis. Basic summary statistics

The following table contains the character and word counts for the three full datasets:

##                 Blog      News   Twitter
## Character  162464653 162227130 125570778
## Word        37570839  34494539  30451170
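
For illustration, counts of this kind could be produced with the stringi package along the following lines (a sketch only; it assumes the full files are already loaded as the en_blog, en_news and en_twit objects from the code above):

# Sketch: total character and word counts per dataset using stringi
# (assumes en_blog, en_news and en_twit hold one line of text per element)
library(stringi)

count_stats <- function(x) {
  c(Character = sum(stri_length(x)),      # total characters
    Word      = sum(stri_count_words(x))) # total words
}

sapply(list(Blog = en_blog, News = en_news, Twitter = en_twit), count_stats)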

I have taken small samples of this data to allow my computer to cope with the processing for this milestone report (note: my intention is to use as large a sample as possible for the n-grams that go towards generating the final algorithm). The data was then cleaned to convert all text to lower case, remove punctuation, strip white space and remove numbers. A profanity filter was used to remove specific blacklisted words.
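
As an illustration, the sampling and cleaning steps could look something like the following sketch using the tm package (the 1% rate and the profanity list file name are placeholders, not my final choices):

# Sketch: take a ~1% sample of each dataset and clean it with tm
library(tm)
set.seed(1234)
sample_text <- c(sample(en_blog, round(length(en_blog) * 0.01)),
                 sample(en_news, round(length(en_news) * 0.01)),
                 sample(en_twit, round(length(en_twit) * 0.01)))

corpus <- VCorpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))  # convert to lower case
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
corpus <- tm_map(corpus, stripWhitespace)               # strip white space
# Profanity filter: assumes a hypothetical one-word-per-line blacklist file
# profanity <- readLines("profanity.txt", skipNul = TRUE)
# corpus <- tm_map(corpus, removeWords, profanity)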


Plots and some interesting findings

The following graphs show the 25 highest frequency words in each of the three cleaned and transformed datasets:

[Three bar charts: the 25 highest frequency words in the blog, news and Twitter samples]
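
For reference, a frequency plot of this kind could be generated along the following lines (a sketch that assumes the cleaned corpus object from the sampling step above):

# Sketch: 25 most frequent words in a cleaned sample, plotted with ggplot2
library(tm); library(ggplot2)
tdm   <- TermDocumentMatrix(corpus)
freq  <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
top25 <- data.frame(word = names(freq)[1:25], count = freq[1:25])

ggplot(top25, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Top 25 words")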


Word Clouds

The following word clouds were created for each of the three cleaned and transformed datasets (note: the first cloud (furthest to the left) is based on blog data, the second (middle cloud) on news data and the last (furthest to the right) on Twitter data):

[Word clouds for the blog (left), news (middle) and Twitter (right) samples]
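
For reference, a word cloud of this kind could be generated with the wordcloud package along these lines (a sketch reusing the freq vector from the frequency plot above):

# Sketch: word cloud of the most frequent words in a cleaned sample
library(wordcloud); library(RColorBrewer)
set.seed(5678)
wordcloud(words = names(freq), freq = freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))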

These word clouds show the overwhelming dominance of the word ‘the’, with ‘and’ also showing significant usage in the datasets.

I therefore intend to use these top-frequency single words when the algorithm fails to find a longer string of words to predict on. If the user-generated word sequence cannot be identified anywhere in my n-gram groups, the algorithm will default to the most likely next word based on frequency of word use alone, offering ‘the’ or ‘and’.


N-grams

Having looked at the separate datasets I am confident they are similar enough to be combined for the purposes of this project. For the next graph I have combined the three samples taken from the three datasets and generated word pairs. The following graph shows the 25 highest frequency pairs of words:

[Bar chart: the 25 highest frequency word pairs in the combined sample]

This again shows the clear dominance of the word ‘the’ in the dataset, which is something I will have to consider when creating the algorithm.
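
For reference, the word-pair counts could be generated with the RWeka tokenizer along these lines (a sketch assuming the combined cleaned corpus from the earlier steps):

# Sketch: count word pairs (bigrams) using an RWeka n-gram tokenizer
library(tm); library(RWeka)
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
head(bigram_freq, 25)  # the 25 most frequent word pairs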


Conclusions

Initial analysis indicates the three English datasets are similar enough for me to treat them as one corpus for the purpose of this project. This initial look has been based on word frequency to allow me to understand the dataset; I will use this data to apply a default prediction if the final algorithm is unable to find an appropriate prediction (note: our guess will be the most frequent word, so we will offer the word ‘the’ first if our algorithm fails to find a possible candidate). This will allow me to offer something in the case of typos and unfamiliar input words.

This initial analysis has also allowed me to identify the number of n-grams I will need to run (note: n-grams will give us groups of words from the source text). I will group words in pairs, threes, fours and so on. If a four-word string from the input is identified, I will take the fifth word from the matching n-gram as the prediction for the user.


Next steps

I will need to find a sample size that allows me to run the required n-grams on my computer (note: my initial idea is to run 5-word groups to allow me to predict on a 4-word input). I will then write the code to offer the user a prediction based on the longest n-gram possible (if a 5-word sequence is not found, I will look for a 4-word sequence matching the end of the input sentence).
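
To illustrate the back-off idea, the prediction step could look something like the following sketch. It assumes a hypothetical list ngram_tables where element n holds the (n+1)-gram counts as a data frame with ‘prefix’ (n words), ‘prediction’ and ‘count’ columns; this is not the final implementation.

# Sketch: back off from the longest matching n-gram to shorter ones,
# then to the most frequent single word ("the") if nothing matches
predict_next <- function(input, ngram_tables, default = "the") {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  for (n in 4:1) {                          # try a 4-word prefix first
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    hits   <- ngram_tables[[n]][ngram_tables[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      return(head(hits$prediction[order(-hits$count)], 3))  # top 3 candidates
    }
  }
  default
}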

I would also like to do some sentence recognition to allow some analysis to be done on important words in the sentence (note: if food is mentioned prior to the four-word sequence ‘and a case of’, we should be able to predict the next word as ‘beer’).


Shiny App

I have developed a shell Shiny App that allows for user-generated text input, with a submit button that initiates the algorithm to find the next words. The app will then present its three best predictions. I will then need to work on speed.
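
For illustration, the shell could look something like this sketch (it assumes a predict_next() function and an ngram_tables object like those sketched above; the actual app may differ):

# Sketch: minimal Shiny shell with a text input, a submit button and three predictions
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a sentence:"),
  actionButton("go", "Predict"),
  verbatimTextOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderText({
    input$go                                 # re-run when the button is pressed
    isolate(paste(predict_next(input$phrase, ngram_tables), collapse = ", "))
  })
}

shinyApp(ui, server)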