Milestone Overview

The following is from the Coursera Milestone Report peer-graded assignment instructions, which outline the overall purpose of this report:

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

The following report will show the reader the progress I have made on this challenging assignment.

This project will attempt to meet those criteria in the sections that follow.

Obtaining the data

The data was downloaded from the following site: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The zip file was extracted and the data files were loaded into character vectors using the readLines function. Although blog, Twitter, and news files were provided in several languages, I have chosen to focus my project on the English data sets only. English is my native language, and deciphering the results of the text analysis and model in a language I do not speak would make this fairly daunting project even more difficult.

#define the file path
fpath.us <- "dat/final/en_US/"
#load the files
en.blogs <- readLines(paste0(fpath.us,"en_US.blogs.txt"))
en.news <- readLines(paste0(fpath.us,"en_US.news.txt"))
en.twitter <- readLines(paste0(fpath.us,"en_US.twitter.txt"))

The lengths of the character vectors are as follows:

  • en.blogs: 899288
  • en.twitter: 2360148
  • en.news: 77259

Despite the fact that my laptop has 16GB of RAM and a 4-core Intel Core i7 processor, the en.blogs and en.twitter files proved to be too large to perform any form of timely analysis on. To lessen the size of the files, I will use the sample function to sample 10% of each file, storing the reduced vectors in en.news.red, en.blogs.red, and en.twitter.red.
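
As a quick check (not part of the original analysis above), the approximate in-memory size of each full vector can be inspected with base R’s object.size():

#approximate in-memory size of the full character vectors
print(object.size(en.blogs), units = "MB")
print(object.size(en.news), units = "MB")
print(object.size(en.twitter), units = "MB")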

set.seed(1)

en.blogs.red <- sample(en.blogs, length(en.blogs)*.1)
en.twitter.red <- sample(en.twitter, length(en.twitter)*.1)
en.news.red <- sample(en.news, length(en.news)*.1)

length(en.blogs.red)
## [1] 89928
length(en.twitter.red)
## [1] 236014
length(en.news.red)
## [1] 7725

Data Exploration and Cleaning

I will now seek to learn more about the files. The summaries below describe the full, unsampled data sets; a short sketch of how such summaries can be computed follows the lists.

Word Counts

  • Word count for en.blogs: 36893516
  • Word count for en.twitter: 29430648
  • Word count for en.news: 2579113

Line Counts

  • Line Count for en.blogs: 899288
  • Line Count for en.twitter: 2360148
  • Line Count for en.news: 77259

Longest Text String

  • The longest text string in en.blogs is 40835 characters.
  • The longest text string in en.news is 5760 characters.
  • The longest text string in en.twitter is 213 characters.
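
The exact commands behind these figures are not shown above; a minimal base-R sketch of how such summaries can be computed (shown for en.blogs, with the same calls applying to the other vectors) is:

#line count: one element of the character vector per line of the source file
length(en.blogs)
#word count: split each line on whitespace and sum the resulting pieces
sum(sapply(strsplit(en.blogs, "\\s+"), length))
#longest text string: the maximum number of characters in any single line
max(nchar(en.blogs))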

My observation from this analysis is that the files are too large for my computer to handle at their full size. en.news is the smallest of the files, but all three will be sampled to ensure consistency across the files.

To further explore these files, I will now create a corpus and begin my text mining analysis: identifying the most frequent terms and how those terms relate to one another via bigrams, trigrams, and 4-grams. To keep the objects to a manageable size, I will also immediately trim out any features that occur fewer than 5 times in the matrix. This threshold may be adjusted as I develop my model.

For this section of the exploration, I will use quanteda as my primary text mining tool. I explored several packages, most notably tm, quanteda, RWeka, and text2vec, to determine which would fit my needs. quanteda offered the best balance of performance and ease of use, so it will be used to create the document feature matrix (dfm) and extract the ngrams from it. Please note that, at this time, I am only removing stopwords for the unigrams; my thinking is that removing stopwords from ngrams greater than 1 would leave incomplete phrases. I may adjust this approach later, if needed.

library(quanteda)
#create the corpus
q.en.corpus <- corpus(c(en.twitter.red,en.blogs.red,en.news.red))

#create the top word list
q.en.corp.1 <- dfm(q.en.corpus, ignoredFeatures = stopwords(kind = "english"))
q.en.corp.1 <- trim(q.en.corp.1, minCount = 5)
#create the top bigrams
q.en.corp.2 <- dfm(q.en.corpus, ngrams = 2)
q.en.corp.2 <- trim(q.en.corp.2, minCount = 5)
#create the top trigrams
q.en.corp.3 <- dfm(q.en.corpus, ngrams = 3)
q.en.corp.3 <- trim(q.en.corp.3, minCount = 5)
#create the top quadgrams
q.en.corp.4 <- dfm(q.en.corpus, ngrams = 4)
q.en.corp.4 <- trim(q.en.corp.4, minCount = 5)
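
Note that the calls above use the quanteda interface that was current when this analysis was run (the ignoredFeatures and ngrams arguments to dfm(), and trim()). In more recent quanteda releases those steps live in separate functions; a rough equivalent using the newer tokens-based interface (assuming quanteda >= 1.3, so dfm_trim() and tokens_ngrams() are available) would be:

#sketch only: the same pipeline with the newer quanteda functions
toks <- tokens(q.en.corpus, remove_punct = TRUE)
#unigrams with English stopwords removed, trimmed to features seen at least 5 times
q.en.corp.1 <- dfm_trim(dfm(tokens_remove(toks, stopwords("en"))), min_termfreq = 5)
#bigrams, trigrams, and 4-grams built from the same tokens object
q.en.corp.2 <- dfm_trim(dfm(tokens_ngrams(toks, n = 2)), min_termfreq = 5)
q.en.corp.3 <- dfm_trim(dfm(tokens_ngrams(toks, n = 3)), min_termfreq = 5)
q.en.corp.4 <- dfm_trim(dfm(tokens_ngrams(toks, n = 4)), min_termfreq = 5)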

I will now create a list of the top features for each ngram and perform some rudimentary cleanup on the data frames before graphing them.

#find the top features for each of the various ngrams (1:4)
tp.1 <- as.data.frame(topfeatures(q.en.corp.1,25))
tp.2 <- as.data.frame(topfeatures(q.en.corp.2,25))
tp.3 <- as.data.frame(topfeatures(q.en.corp.3,25))
tp.4 <- as.data.frame(topfeatures(q.en.corp.4,25))

#create a function to fix the column names
fix.col <- function(x) {
        names(x)[1] <- "count"
        x$top.words <- rownames(x)
        x <- x[,c(2,1)]
        return(x)
}

#execute the function
tp.1 <- fix.col(tp.1)
tp.2 <- fix.col(tp.2)
tp.3 <- fix.col(tp.3)
tp.4 <- fix.col(tp.4)

Graphing the Ngrams

The following graphs display the top features for each of the 4 ngrams that were created (1-gram, 2-gram, 3-gram, 4-gram).

library(ggplot2)
#graph the unigram
ggplot(data = tp.1, aes(x = reorder(top.words,-count), y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) + xlab("Top Words") + ylab("Count of words in the corpus") + ggtitle("Top Features in the unigram analysis")

#graph the bigram
ggplot(data = tp.2, aes(x = reorder(top.words,-count), y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) + xlab("Top Words") + ylab("Count of words in the corpus") + ggtitle("Top Features in the bigram analysis")

#graph the trigram
ggplot(data = tp.3, aes(x = reorder(top.words,-count), y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) + xlab("Top Words") + ylab("Count of words in the corpus") + ggtitle("Top Features in the trigram analysis")

#graph the 4-gram
ggplot(data = tp.4, aes(x = reorder(top.words,-count), y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) + xlab("Top Words") + ylab("Count of words in the corpus") + ggtitle("Top Features in the 4-gram analysis")
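
Since the four plot calls differ only in the data frame and the title, they could also be wrapped in a small helper; plot.top below is a name introduced here purely for illustration and is not part of the code above:

#helper to plot the top features of any of the ngram data frames
plot.top <- function(df, plot.title) {
        ggplot(data = df, aes(x = reorder(top.words, -count), y = count)) +
                geom_bar(stat = "identity") +
                theme(axis.text.x = element_text(angle = 90)) +
                xlab("Top Words") +
                ylab("Count of words in the corpus") +
                ggtitle(plot.title)
}
#e.g. plot.top(tp.4, "Top Features in the 4-gram analysis")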

Conclusion and next steps

This now concludes my analysis of the data. This has been quite a nerve-racking experience, as this is a subject matter with which I (like most of us in this class) had no prior experience. From this point, my plan is as follows: