Milestone Overview

The following is from the Coursera Milestone Report peer-graded assignment instructions, which outline the overall purpose of this report:

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

The following report will show the reader the progress I have made on this challenging assignment.

This project will attempt to meet those criteria in the sections that follow.

Obtaining the data

The data was downloaded from the following site: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The zip file was extracted and the data files were loaded into character vectors using the readLines function. Although blog, Twitter, and news files were provided in several languages, I have chosen to focus my project on the English data sets only. English is my native language, and deciphering the results of the text analysis and model in a language I do not speak would make this fairly daunting project even more difficult.

#define the file path
fpath.us <- "dat/final/en_US/"
#load the files
en.blogs <- readLines(paste0(fpath.us,"en_US.blogs.txt"))
en.news <- readLines(paste0(fpath.us,"en_US.news.txt"))
en.twitter <- readLines(paste0(fpath.us,"en_US.twitter.txt"))

The lengths of the character vectors are as follows:

  • en.blogs: 899288
  • en.twitter: 2360148
  • en.news: 77259

Despite the fact that my laptop has 16GB of RAM and a 4-core Intel Core i7 processor, the en.blogs and en.twitter files proved to be too large to perform any form of timely analysis on. To lessen the size of the files, I will use the sample function to sample 10% of each file, storing the reduced vectors in en.news.red, en.blogs.red, and en.twitter.red.
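
As a quick check (not part of the original analysis above), the approximate in-memory size of each full vector can be inspected with base R’s object.size():

#approximate in-memory size of the full character vectors
print(object.size(en.blogs), units = "MB")
print(object.size(en.news), units = "MB")
print(object.size(en.twitter), units = "MB")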

set.seed(1)

en.blogs.red <- sample(en.blogs, length(en.blogs)*.1)
en.twitter.red <- sample(en.twitter, length(en.twitter)*.1)
en.news.red <- sample(en.news, length(en.news)*.1)

length(en.blogs.red)
## [1] 89928
length(en.twitter.red)
## [1] 236014
length(en.news.red)
## [1] 7725

Data Exploration and Cleaning

I will now seek to learn more about the files. The summaries below describe the full, unsampled data sets; a short sketch of how such summaries can be computed follows the lists.

Word Counts

  • Word count for en.blogs: 36893516
  • Word count for en.twitter: 29430648
  • Word count for en.news: 2579113

Line Counts

  • Line Count for en.blogs: 899288
  • Line Count for en.twitter: 2360148
  • Line Count for en.news: 77259

Longest Text String

  • The longest text string in en.blogs is 40835 characters.
  • The longest text string in en.news is 5760 characters.
  • The longest text string in en.twitter is 213 characters.
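
The exact commands behind these figures are not shown above; a minimal base-R sketch of how such summaries can be computed (shown for en.blogs, with the same calls applying to the other vectors) is:

#line count: one element of the character vector per line of the source file
length(en.blogs)
#word count: split each line on whitespace and sum the resulting pieces
sum(sapply(strsplit(en.blogs, "\\s+"), length))
#longest text string: the maximum number of characters in any single line
max(nchar(en.blogs))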

My observation from this analysis is that the files are too large for my computer to handle at their full size. en.news is the smallest of the files, but all three will be sampled to ensure consistency across the files.

To further explore these files, I will now create a corpus and begin my text mining analysis: identifying the most frequent terms and how those terms relate to one another via bigrams, trigrams, and 4-grams. To keep the objects to a manageable size, I will also immediately trim out any features that occur fewer than 5 times in the matrix. This threshold may be adjusted as I develop my model.

For this section of the exploration, I will use quanteda as my primary text mining tool. I explored several packages, most notably tm, quanteda, RWeka, and text2vec, to determine which would fit my needs. quanteda offered the best balance of performance and ease of use, so it will be used to create the document feature matrix (dfm) and extract the ngrams from it. Please note that, at this time, I am only removing stopwords for the unigrams; my thinking is that removing stopwords from ngrams greater than 1 would leave incomplete phrases. I may adjust this approach later, if needed.

library(quanteda)
#create the corpus
q.en.corpus <- corpus(c(en.twitter.red,en.blogs.red,en.news.red))

#create the top word list
q.en.corp.1 <- dfm(q.en.corpus, ignoredFeatures = stopwords(kind = "english"))
q.en.corp.1 <- trim(q.en.corp.1, minCount = 5)
#create the top bigrams
q.en.corp.2 <- dfm(q.en.corpus, ngrams = 2)
q.en.corp.2 <- trim(q.en.corp.2, minCount = 5)
#create the top trigrams
q.en.corp.3 <- dfm(q.en.corpus, ngrams = 3)
q.en.corp.3 <- trim(q.en.corp.3, minCount = 5)
#create the top quadgrams
q.en.corp.4 <- dfm(q.en.corpus, ngrams = 4)
q.en.corp.4 <- trim(q.en.corp.4, minCount = 5)
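
Note that the calls above use the quanteda interface that was current when this analysis was run (the ignoredFeatures and ngrams arguments to dfm(), and trim()). In more recent quanteda releases those steps live in separate functions; a rough equivalent using the newer tokens-based interface (assuming quanteda >= 1.3, so dfm_trim() and tokens_ngrams() are available) would be:

#sketch only: the same pipeline with the newer quanteda functions
toks <- tokens(q.en.corpus, remove_punct = TRUE)
#unigrams with English stopwords removed, trimmed to features seen at least 5 times
q.en.corp.1 <- dfm_trim(dfm(tokens_remove(toks, stopwords("en"))), min_termfreq = 5)
#bigrams, trigrams, and 4-grams built from the same tokens object
q.en.corp.2 <- dfm_trim(dfm(tokens_ngrams(toks, n = 2)), min_termfreq = 5)
q.en.corp.3 <- dfm_trim(dfm(tokens_ngrams(toks, n = 3)), min_termfreq = 5)
q.en.corp.4 <- dfm_trim(dfm(tokens_ngrams(toks, n = 4)), min_termfreq = 5)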

I will now create a list of the top features for each ngram and perform some rudimentary cleanup on the data frames before graphing them.

#find the top features for each of the various ngrams (1:4)
tp.1 <- as.data.frame(topfeatures(q.en.corp.1,25))
tp.2 <- as.data.frame(topfeatures(q.en.corp.2,25))
tp.3 <- as.data.frame(topfeatures(q.en.corp.3,25))
tp.4 <- as.data.frame(topfeatures(q.en.corp.4,25))

#create a function to fix the column names
fix.col <- function(x) {
        names(x)[1] <- "count"
        x$top.words <- rownames(x)
        x <- x[,c(2,1)]
        return(x)
}

#execute the function
tp.1 <- fix.col(tp.1)
tp.2 <- fix.col(tp.2)
tp.3 <- fix.col(tp.3)
tp.4 <- fix.col(tp.4)

Graphing the Ngrams

The following graphs display the top features for each of the 4 ngrams that were created (1-gram, 2-gram, 3-gram, 4-gram).

library(ggplot2)
#graph the unigram
ggplot(data = tp.1, aes(x = reorder(top.words,-count), y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) + xlab("Top Words") + ylab("Count of words in the corpus") + ggtitle("Top Features in the unigram analysis")

#graph the bigram
ggplot(data = tp.2, aes(x = reorder(top.words,-count), y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) + xlab("Top Words") + ylab("Count of words in the corpus") + ggtitle("Top Features in the bigram analysis")

#graph the trigram
ggplot(data = tp.3, aes(x = reorder(top.words,-count), y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) + xlab("Top Words") + ylab("Count of words in the corpus") + ggtitle("Top Features in the trigram analysis")

#graph the 4-gram
ggplot(data = tp.4, aes(x = reorder(top.words,-count), y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90)) + xlab("Top Words") + ylab("Count of words in the corpus") + ggtitle("Top Features in the 4-gram analysis")
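
Since the four plot calls differ only in the data frame and the title, they could also be wrapped in a small helper; plot.top below is a name introduced here purely for illustration and is not part of the code above:

#helper to plot the top features of any of the ngram data frames
plot.top <- function(df, plot.title) {
        ggplot(data = df, aes(x = reorder(top.words, -count), y = count)) +
                geom_bar(stat = "identity") +
                theme(axis.text.x = element_text(angle = 90)) +
                xlab("Top Words") +
                ylab("Count of words in the corpus") +
                ggtitle(plot.title)
}
#e.g. plot.top(tp.4, "Top Features in the 4-gram analysis")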

Conclusion and next steps

This now concludes my analysis of the data. This has been quite a nerve-racking experience, as this is a subject matter with which I (like most of us in this class) had no prior experience. From this point, my plan is as follows: