Milestone Report - Data Science Capstone

Introduction

Welcome to my Milestone Report, written as part of the Data Science Capstone. This report uses data from a corpus called HC Corpora, which features blog, news and Twitter data.

The motivation for this project is to:

Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.

Summary Statistics

Below is some R code to load the needed libraries and read in the data:

library(tm)
library(knitr)
library(RWeka)
library(wordcloud)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
setwd("~/Coursera/Capstone")
# Reads in the Blogs data
blog <- readLines("Data/en_US.blogs.txt")
# Reads in the News data
news <- readLines("Data/en_US.news.txt")
# Reads in the Twitter data
twitter <- readLines("Data/en_US.twitter.txt")

Now that the data is read into R, below is a table with some summary statistics:

File	File Size	Total Lines	Total Words	Avg Words/Line
en_US.news.txt	19.2 MB	1,010,242	34,372,530	34.1
en_US.blogs.txt	248.5 MB	899,288	37,334,147	41.5
en_US.twitter.txt	301.4 MB	2,360,148	30,373,563	12.9

As can be seen, these files are rather large and will require a great deal of computing power to analyze. Additionally, while the blogs data has less than half the number of lines in the Twitter data, each line is much longer on average. This makes sense because Twitter caps the number of characters each tweet can contain (140).
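
For readers who want to reproduce figures like those in the table, here is a minimal sketch of how the line and word counts could be computed in base R. The helper name summarize_lines and the toy input are my own illustrations, not part of the original analysis, and a word is assumed to be anything separated by whitespace.

```r
# Hypothetical helper (not from the original report): summarize a character
# vector in which each element is one line of text.
summarize_lines <- function(lines) {
  # Split each line on runs of whitespace and count the non-empty tokens
  words_per_line <- vapply(
    strsplit(trimws(lines), "\\s+"),
    function(tokens) sum(nzchar(tokens)),
    integer(1)
  )
  data.frame(
    total_lines = length(lines),
    total_words = sum(words_per_line),
    avg_words_per_line = round(mean(words_per_line), 1)
  )
}

# Toy input standing in for blog, news or twitter
toy <- c("the quick brown fox", "jumps over", "the lazy dog")
stats <- summarize_lines(toy)
print(stats)
```

The same function could be called on blog, news and twitter in turn. File sizes on disk are not covered by this sketch; base R's file.size() returns them in bytes.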

Data Processing

Since these files are rather large, only a random sample of the data will be retained for analysis. Using the rbinom function, which draws random samples from the binomial distribution, each line is kept with probability 0.01, so roughly 1% of the data is retained. With a more powerful computer, more data could be kept. Finally, the three samples are concatenated into one main character vector.

# Draw a random sample of roughly 1% of each data set, then remove the large original
set.seed(1005)
blog.s <- blog[rbinom(length(blog), 1, .01) == 1]
rm(blog);gc()
news.s <- news[rbinom(length(news), 1, .01) == 1]
rm(news);gc()
twitter.s <- twitter[rbinom(length(twitter), 1, .01) == 1]
rm(twitter);gc()
# Combine the three samples into one character vector and remove the pieces
text <- c(blog.s, news.s, twitter.s); rm(blog.s,news.s,twitter.s);gc()
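
To make the sampling step easier to follow, here is a small sketch on a toy vector. The vector lines and the fixed-size alternative at the end are assumptions added for illustration; only the rbinom approach comes from the report.

```r
set.seed(1005)  # same seed as the report, so the draw is reproducible

# Toy stand-in for one of the large character vectors
lines <- paste("line", 1:10000)

# One Bernoulli(0.01) draw per line; keep the lines whose draw equals 1
keep <- rbinom(length(lines), size = 1, prob = 0.01) == 1
sampled <- lines[keep]

# The sample size is itself random, so it is only close to 1% of the input
print(length(sampled) / length(lines))

# Alternative (an assumption, not the report's method): fix the sample size
fixed <- sample(lines, size = round(0.01 * length(lines)))
print(length(fixed))
```

With rbinom the number of retained lines varies from run to run unless the seed is set, whereas sample() returns exactly the requested number of lines. Either approach should work for this kind of exploratory analysis.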

The next step is to turn our data into a corpus and remove some unneeded parts of the text. All of the following functions are from the tm package.

# Turn the data into a corpus, then lower-case it and remove numbers, punctuation and extra whitespace
corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
# Strip whitespace last, since the removals above can leave extra spaces behind
corpus <- tm_map(corpus, stripWhitespace)
rm(text);gc()
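
To show what these cleaning steps do to a single line, here is a rough base-R approximation. It is not the tm implementation: clean_text is a hypothetical helper, the regular expressions are my own stand-ins for the tm transformations, and trimming the ends is an extra step added here.

```r
# Hypothetical helper: rough base-R equivalents of the cleaning steps
clean_text <- function(x) {
  x <- tolower(x)                     # lower-case everything
  x <- gsub("[[:digit:]]+", "", x)    # drop digits
  x <- gsub("[[:punct:]]+", "", x)    # drop punctuation
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # extra step: trim leading/trailing spaces
}

raw <- "Hello,   World! It's 2016."
cleaned <- clean_text(raw)
print(cleaned)
```

Note that removing punctuation turns "It's" into "its", which is one reason the choice of cleaning steps matters for a prediction model.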

Creating N-Grams

Now that the data is in a good place, it is time to start tokenizing the corpus and create some n-grams. I created four tokenizers: (1) unigrams, (2) bigrams, (3) trigrams and (4) four-grams. Next, the DocumentTermMatrix function from tm is used to create a matrix of term counts, from which the most frequent words or word sequences can be found. An n-gram is a sequence of n consecutive words, so knowing which n-grams are frequent allows us to start predicting the next word. We will use these datasets in our future algorithm.

# This tells the function where to break
TDelimiters <- " \\t\\r\\n.!?,;\"()"
#Create Functions
token1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
token2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = TDelimiters))
token3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = TDelimiters))
token4 <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4, delimiters = TDelimiters))

# Create a document-term matrix for each n-gram size and sum the counts per term.
# slam::col_sums (slam is installed with tm) avoids building a dense matrix.
dtm <- DocumentTermMatrix(corpus,control=list(tokenize = token1))
# Keep only unigrams that appear in more than 20% of documents
dtm <- removeSparseTerms(dtm, .8)
freq <- slam::col_sums(dtm)
dtm2 <- DocumentTermMatrix(corpus,control=list(tokenize = token2))
freq2 <- slam::col_sums(dtm2)
dtm3 <- DocumentTermMatrix(corpus,control=list(tokenize = token3))
freq3 <- slam::col_sums(dtm3)
dtm4 <- DocumentTermMatrix(corpus,control=list(tokenize = token4))
freq4 <- slam::col_sums(dtm4)

rm(dtm,dtm2,dtm3,dtm4);gc()
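
For intuition about what the tokenizers and frequency counts produce, here is a minimal base-R sketch that builds n-grams by hand. It does not use RWeka or tm; make_ngrams and the toy lines are hypothetical, and the input is assumed to be already cleaned and separated by single spaces.

```r
# Hypothetical helper: build all n-grams from one cleaned line of text
make_ngrams <- function(line, n) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]
  # A line with fewer than n words has no n-grams
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_lines <- c("the cat sat on the mat", "the cat ran")

# Collect the bigrams from every line and count how often each one occurs
bigrams <- unlist(lapply(toy_lines, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(bigram_freq)
```

Changing n to 3 or 4 gives trigrams and four-grams in the same way. On the real sample, the RWeka tokenizers above do the same job.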

Now that all four n-grams are created, let's examine each to see what the most frequent words are.

It’s no surprise that the word “the” is the most frequent. As we move forward, we should take into account that very common words like this may not make useful predictions.
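
One possible way to inspect the most frequent terms, and to see what remains once very common words are set aside, is sketched below. The counts in freq_toy are made up for illustration and are not results from the corpus. The short stop_words vector is also hypothetical; tm::stopwords("en") provides a fuller list.

```r
# Toy frequency vector with made-up counts, standing in for freq
freq_toy <- c(the = 50, and = 30, data = 12, of = 25, model = 8)

# Rank the terms from most to least frequent
ranked <- sort(freq_toy, decreasing = TRUE)
print(head(ranked, 3))

# Hypothetical short stop-word list used to set aside very common words
stop_words <- c("the", "and", "of")
content_terms <- ranked[!names(ranked) %in% stop_words]
print(content_terms)
```

Whether stop words should actually be dropped from a next-word model is an open question, since users type them often. This sketch only shows how they could be separated out for inspection.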

Next Steps

Now that the exploratory analysis is complete and we have begun to build n-grams, the next step is to create a prediction algorithm. The model will then be deployed as a Shiny application that allows users to type in a word and returns the predicted next word(s).
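
As a rough idea of where this is heading, here is a minimal sketch of next-word lookup from bigram counts. It is not the final algorithm: predict_next is a hypothetical helper, the counts are made up, and the input is assumed to be a named vector shaped like freq2.

```r
# Toy bigram counts (made-up numbers) in the same shape as freq2
bigram_counts <- c("the cat" = 4, "the dog" = 2, "cat sat" = 3, "the end" = 1)

# Hypothetical helper: suggest up to k next words after a given word
predict_next <- function(word, counts, k = 3) {
  prefix <- paste0(tolower(word), " ")
  # Keep only the bigrams whose first word matches the input
  matches <- counts[startsWith(names(counts), prefix)]
  if (length(matches) == 0) return(character(0))
  ranked <- sort(matches, decreasing = TRUE)
  # Strip the prefix so that only the predicted word remains
  head(substring(names(ranked), nchar(prefix) + 1), k)
}

suggestions <- predict_next("the", bigram_counts)
print(suggestions)
```

A fuller model would likely try the four-gram and trigram tables first and fall back to shorter n-grams when no match is found, but that design is still to be decided.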