Introduction

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Review criteria

  1. Does the link lead to an HTML page describing the exploratory analysis of the training data set?
  2. Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
  3. Has the data scientist made basic plots, such as histograms to illustrate features of the data?
  4. Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

Executive Summary

This report presents a preliminary analysis of the Twitter, News, and Blogs datasets: word counts, line counts, and basic data tables, along with monogram (single-word) frequency plots of the 20 most popular words in each dataset. A plan for the rest of the project is also laid out.

Data Acquisition and Preprocessing

Libraries

First, we load ggplot2 for plotting, along with tm, RWeka, and stringi, which provide a number of useful NLP tools.

suppressMessages(library(tm))
## Warning: package 'tm' was built under R version 3.4.2
suppressMessages(library(RWeka))
## Warning: package 'RWeka' was built under R version 3.4.2
suppressMessages(library(ggplot2))
suppressMessages(library(stringi))

Data Acquisition

I will load a subset of 10000 lines from each dataset, since processing more than that takes a very long time, especially during tokenization.

download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
                destfile = "Coursera-SwiftKey.zip")
unzip('Coursera-SwiftKey.zip')  
tweets <- readLines("final/en_US/en_US.twitter.txt", n = 10000)
tweetsdf <- data.frame(tweets)
news <- readLines("final/en_US/en_US.news.txt", n = 10000)
newsdf <- data.frame(news)
blogs <- readLines("final/en_US/en_US.blogs.txt", n = 10000)
blogsdf <- data.frame(blogs)
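
Since the zip file is large, the download and unzip steps could be guarded so they only run once (a small sketch, reusing the same file names as above):

# Only download and extract the archive if it is not already present.
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = "Coursera-SwiftKey.zip")
    unzip("Coursera-SwiftKey.zip")
}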

Let’s look at the sizes of the full files, in MB:

file.info("final/en_US/en_US.twitter.txt")$size/1024^2
## [1] 159.3641
file.info("final/en_US/en_US.news.txt")$size/1024^2
## [1] 196.2775
file.info("final/en_US/en_US.blogs.txt")$size/1024^2
## [1] 200.4242

Hence my reluctance to load the whole datasets. Let’s take a peek inside:

head(tweets)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
head(news)
## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"                                                                                                                                                                                                                                                            
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
head(blogs)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"

Preprocessing

The following preprocessing converts each sample into a tm Corpus and cleans it up: replacing certain punctuation with spaces, converting all letters to lower case, removing numbers, stripping extra whitespace, and removing stop words – extremely common words like “a” and “the” that would otherwise dominate our monograms.

space  <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

tweetCorp <- Corpus(VectorSource(tweets))
tweetCorp <- tm_map(tweetCorp, space,"\"|/|@|\\|")
tweetCorp <- tm_map(tweetCorp, content_transformer(tolower))
tweetCorp <- tm_map(tweetCorp, removeNumbers)
tweetCorp <- tm_map(tweetCorp, stripWhitespace)
tweetCorp <- tm_map(tweetCorp, removeWords, stopwords('english'))

newsCorp <- Corpus(VectorSource(news))
newsCorp <- tm_map(newsCorp, space,"\"|/|@|\\|")
newsCorp <- tm_map(newsCorp, content_transformer(tolower))
newsCorp <- tm_map(newsCorp, removeNumbers)
newsCorp <- tm_map(newsCorp, stripWhitespace)
newsCorp <- tm_map(newsCorp, removeWords, stopwords('english'))

blogsCorp <- Corpus(VectorSource(blogs))
blogsCorp <- tm_map(blogsCorp, space,"\"|/|@|\\|")
blogsCorp <- tm_map(blogsCorp, content_transformer(tolower))
blogsCorp <- tm_map(blogsCorp, removeNumbers)
blogsCorp <- tm_map(blogsCorp, stripWhitespace)
blogsCorp <- tm_map(blogsCorp, removeWords, stopwords('english'))

Here, I convert each Corpus back into a data frame, tokenize the text (i.e. pick out all the individual words), and build my table of monograms: the words that remain after stop-word removal, sorted by frequency.

tweetCorpDF <- data.frame(text = unlist(tweetCorp), stringsAsFactors = FALSE)
tokensTweets <- NGramTokenizer(tweetCorpDF, Weka_control(min = 1, max = 1, delimiters = " \\r\\n\\t.,;:\"()?!"))
tokensTweets <- data.frame(table(tokensTweets))
tweetMonogram <- tokensTweets[order(tokensTweets$Freq, decreasing = TRUE),]
colnames(tweetMonogram) <- c("words","count")

newsCorpDF <- data.frame(text = unlist(newsCorp), stringsAsFactors = FALSE)
tokensNews <- NGramTokenizer(newsCorpDF, Weka_control(min = 1, max = 1, delimiters = " \\r\\n\\t.,;:\"()?!"))
tokensNews <- data.frame(table(tokensNews))
newsMonogram <- tokensNews[order(tokensNews$Freq, decreasing = TRUE),]
colnames(newsMonogram) <- c("words","count")

blogsCorpDF <- data.frame(text = unlist(blogsCorp), stringsAsFactors = FALSE)
tokensBlogs <- NGramTokenizer(blogsCorpDF, Weka_control(min = 1, max = 1, delimiters = " \\r\\n\\t.,;:\"()?!"))
tokensBlogs <- data.frame(table(tokensBlogs))
blogsMonogram <- tokensBlogs[order(tokensBlogs$Freq, decreasing = TRUE),]
colnames(blogsMonogram) <- c("words","count")

Data Tables

Line Count

The stringi package makes line and character counts very easy, although since each sample is a 10000-line subset, the line counts are very predictable:

stri_stats_general(tweets)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       10000       10000      682791      565117
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       10000       10000     2041079     1707148
stri_stats_general(blogs)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       10000       10000     2294142     1893513

As one would expect, we have 10000 lines each.
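
If we later want exact line counts for the full files without holding them in memory, one option (a sketch, reusing the file paths above) is to read each file through a connection in chunks:

countFileLines <- function(path, chunk = 100000) {
    # Read the file 100,000 lines at a time and add up the chunk sizes.
    con <- file(path, open = "r")
    on.exit(close(con))
    total <- 0
    repeat {
        got <- length(readLines(con, n = chunk, skipNul = TRUE))
        total <- total + got
        if (got < chunk) break
    }
    total
}
countFileLines("final/en_US/en_US.twitter.txt")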

Word Count

Now, let’s look at word count, again with stringi’s help:

sum(stri_count_words(tweets))
## [1] 127024
sum(stri_count_words(news))
## [1] 350478
sum(stri_count_words(blogs))
## [1] 419478

Of course, these counts would be much larger if we were using the full data sets.

NGrams

We can also look at the most popular words in each set:

head(tweetMonogram)
##      words count
## 8250  just   660
## 8773  like   510
## 67       -   494
## 6646   get   474
## 8974  love   440
## 6789  good   424
head(newsMonogram)
##       words count
## 24613  said  2484
## 31147  will  1084
## 347       $  1055
## 90        -   911
## 19978   one   835
## 19278   new   675
head(blogsMonogram)
##       words count
## 20962   one  1324
## 32447  will  1235
## 16424  just  1153
## 5412    can  1134
## 17519  like  1091
## 29966  time  1023

This isn’t terribly illuminating. Let’s look at the monogram plots instead.

Monogram Plots

Finally, we plot the twenty most popular words in each subset. This gives us a good idea of the sorts of things people write. Of course, the most common things people actually write are stop words, which is why we removed them; leaving them in would not tell us anything new.

ggplot(tweetMonogram[1:20,], aes(words, count)) + geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(newsMonogram[1:20,], aes(words, count)) + geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(blogsMonogram[1:20,], aes(words, count)) + geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
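
By default, ggplot orders the bars alphabetically. If we wanted the bars sorted by frequency instead, a small variation (a sketch, using the same tweetMonogram data) is to re-level the factor with reorder():

# Sketch: the same tweet plot, with bars ordered by descending count.
top20 <- tweetMonogram[1:20, ]
ggplot(top20, aes(reorder(words, -count), count)) + geom_bar(stat = "identity") +
    labs(x = "words") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))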

Conclusions and Next Steps

Before anything else, these datasets need to be cleaned further: there is an unfortunate presence of dashes, stray punctuation, and encoding artifacts that needs to be expunged. The step after that is to use machine learning to build the word predictor. My plan is to split each dataset into three parts – training, cross-validation, and test sets. I will train a variety of techniques on the training set, compare them on the cross-validation set to see which performs best, retrain the best one, and only then evaluate it on the test set. The features I will feed the algorithms will be bigrams, trigrams, and quadgrams, weighted according to frequency, which should suffice for the program to learn the most likely next word; a first, simplified sketch of that idea is shown below. Finally, the model will be wrapped in a Shiny app for the final presentation.
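
As an illustration only (not the final algorithm), the bigram version of this idea can be sketched with the same NGramTokenizer used above: tabulate bigram frequencies, then look up the most frequent continuations of a given word.

# Sketch: frequency-weighted bigram table built from the tweet sample.
bigrams <- NGramTokenizer(tweetCorpDF$text,
                          Weka_control(min = 2, max = 2,
                                       delimiters = " \\r\\n\\t.,;:\"()?!"))
bigramFreq <- data.frame(table(bigrams))
bigramFreq <- bigramFreq[order(bigramFreq$Freq, decreasing = TRUE), ]

# Return the n most frequent words observed to follow 'word'.
predictNext <- function(word, freq = bigramFreq, n = 3) {
    pattern <- paste0("^", word, " ")
    matches <- freq[grepl(pattern, freq$bigrams), ]
    head(sub(pattern, "", matches$bigrams), n)
}
predictNext("happy")

The real model would extend this to trigrams and quadgrams and back off to shorter n-grams when a longer one has not been seen.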