Data Science Capstone Project: Milestone Report

J. Halitsky

### March 28, 2015

#### Synopsis:

The goal of this report is to introduce the data provided through the Coursera Capstone Data Science Project and to address the following:

    1.  Demonstrate that you've downloaded the data and have successfully loaded it in.  
    2.  Create a basic report of summary statistics about the data sets.  
    3.  Report any interesting findings that you amassed so far.  
    4.  Get feedback on your plans for creating a prediction algorithm and Shiny app.  

#### Understanding the Problem:

The training data used for this project comes from a corpus called HC Corpora (www.corpora.hellohost.org). HC Corpora provides free collections of corpora for various languages. These corpora were collected from numerous webpages with the aim of building a varied and comprehensive corpus for each language.

The training data was downloaded from the Coursera site (http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) on March 22, 2015. The files are named LOCALE.blogs.txt, LOCALE.news.txt, and LOCALE.twitter.txt, where LOCALE is one of the four locales en_US, de_DE, ru_RU, and fi_FI, so each locale has three types of sources: blogs, news, and Twitter updates. For this exercise, I focus only on the en_US data.

Obtaining the data:

library(knitr)
opts_knit$set(progress=FALSE, verbose = TRUE)
opts_chunk$set(echo=TRUE, message=FALSE, tidy=TRUE, comment=NA,
               fig.path="figure/", fig.keep="high", fig.width=10, fig.height=6,
               fig.align="center")
library(ggplot2)
library(dplyr)
library(wordcloud)
library(NLP)
library(tm)
library(qdap)
setwd("~/Desktop/CourseraClass/CAPSTONE/")
corpus <- VCorpus(DirSource("./final/en_US", "UTF-8"), readerControl = list(language = "en"))
paste(meta(corpus[[1]], "id"), " ", meta(corpus[[2]], "id"), " ", meta(corpus[[3]], 
    "id"))
[1] "en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt"

Basic Summaries:

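# Blogs
# object.size() reports the in-memory size of the loaded document, not the size on disk
nsize <- format(object.size(corpus[[1]]), "MB")
nlines <- length(corpus[[1]]$content)  # one element per line of text
# Approximate word count: split every line on single spaces
nwords <- length(unlist(strsplit(corpus[[1]]$content, " ")))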
blogs <- paste("en_US.blogs.txt file size:", nsize, " number of lines:", nlines, 
    " number of words:", nwords)
# News
nsize <- format(object.size(corpus[[2]]), "MB")
nlines <- length(corpus[[2]]$content)
nwords <- length(unlist(strsplit(corpus[[2]]$content, " ")))  #count words
news <- paste("en_US.news.txt file size:", nsize, " number of lines:", nlines, 
    " number of words:", nwords)
# Twitter
nsize <- format(object.size(corpus[[3]]), "MB")
nlines <- length(corpus[[3]]$content)
nwords <- length(unlist(strsplit(corpus[[3]]$content, " ")))  #count words
twitter <- paste("en_US.twitter.txt file size:", nsize, " number of lines:", 
    nlines, " number of words:", nwords)

cat(blogs, "\n", news, "\n", twitter)
en_US.blogs.txt file size: 248.5 Mb  number of lines: 899288  number of words: 37334131 
 en_US.news.txt file size: 249.6 Mb  number of lines: 1010242  number of words: 34372530 
 en_US.twitter.txt file size: 301.4 Mb  number of lines: 2360148  number of words: 30373543
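
The three nearly identical blocks above could be folded into one helper. What follows is a minimal sketch, not the code that produced the numbers above: `summarize_text` is a hypothetical name, and it splits on runs of whitespace rather than on single spaces, so its word counts may differ slightly from those reported.

```r
# Hypothetical helper (not part of the original analysis): summarize a
# character vector in which each element is one line of text.
summarize_text <- function(lines) {
  # Split each line on runs of whitespace
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  # Drop the empty strings produced by leading whitespace
  tokens <- tokens[nzchar(tokens)]
  data.frame(
    lines = length(lines),
    words = length(tokens),
    # In-memory size in megabytes, as with object.size() above
    size_mb = as.numeric(object.size(lines)) / 1024^2
  )
}

# Toy input standing in for corpus[[1]]$content
toy <- c("the quick brown fox", "jumps  over the lazy dog", " hello world")
result <- summarize_text(toy)
print(result)
```

With the real corpus, the same helper could be applied to each document in turn, for example `summarize_text(corpus[[1]]$content)`.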

#### Data Sample and Cleaning:

As the summary above shows, each of the training files is very large. Working with a smaller, tidier dataset should dramatically reduce the time needed to run the code. To be more efficient, R's sample function is used to draw a random sample containing 2% of the lines in each of the documents provided.

Sample Set:

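set.seed(23678)  # fix the seed so the sample is reproducible
# Blogs: keep a random 2% of the lines
content <- corpus[[1]]$content
sz <- length(content) * 0.02  # a fractional size is truncated by sample()
content <- sample(content, sz)
corpus[[1]]$content <- content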

content <- corpus[[2]]$content
sz <- length(content) * 0.02
content <- sample(content, sz)
corpus[[2]]$content <- content

content <- corpus[[3]]$content
sz <- length(content) * 0.02
content <- sample(content, sz)
corpus[[3]]$content <- content
content <- NULL
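
The same sampling step can be expressed as a small function. This is only a sketch under stated assumptions: `sample_lines` and its `fraction` argument are hypothetical names, and the toy vector stands in for a document's content.

```r
# Hypothetical helper: draw a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction = 0.02) {
  # floor() makes the truncation to a whole number of lines explicit
  n <- floor(length(lines) * fraction)
  sample(lines, n)  # sampling without replacement
}

set.seed(23678)  # same seed as above, so the draw is reproducible
toy <- paste("line", 1:1000)
toy_sample <- sample_lines(toy)
print(length(toy_sample))
```

Applied to the corpus, each document would be reduced with a call such as `corpus[[1]]$content <- sample_lines(corpus[[1]]$content)`.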

Data Cleaning:

Part of tidying the data is cleaning the dataset: removing special characters, extra whitespace, punctuation, numbers, stopwords, and profanity, and converting all text to lowercase. This makes it easier to count how often each word is repeated.

GitHub hosts a list of dirty, naughty, obscene, and otherwise bad words (https://raw.githubusercontent.com/dannygj/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en) that can be used for the profanity filter.

# Helper: replace every match of a regular expression with a space
rmvSpclChar <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, rmvSpclChar, "[^[:graph:]]")  # non-graphical characters
corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase everything
corpus <- tm_map(corpus, stripWhitespace)  # collapse runs of whitespace
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Profanity list read from a local file, one word per line
corpus <- tm_map(corpus, removeWords, readLines("final/en_US/profanity.txt"))
Warning in mclapply(content(x), FUN, ...): all scheduled cores encountered
errors in user code
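
To make the effect of each cleaning step concrete, here is a minimal base R sketch that applies comparable steps to a plain character vector. It is not the tm pipeline above: `clean_text` and `toy_stopwords` are hypothetical names, the tiny word list stands in for `stopwords("english")` and the profanity file, and whitespace is collapsed last (an assumption that differs from the order above) so that gaps left by removed words are also cleaned up.

```r
# Hypothetical helper: base R approximation of the cleaning steps.
# Assumes drop_words contains plain alphabetic words (no regex metacharacters).
clean_text <- function(x, drop_words = character(0)) {
  x <- gsub("[^[:graph:]]", " ", x)  # non-graphical characters become spaces
  x <- tolower(x)                    # lowercase everything
  x <- gsub("[[:punct:]]", "", x)    # remove punctuation
  x <- gsub("[[:digit:]]", "", x)    # remove numbers
  if (length(drop_words) > 0) {
    # Match whole words only, so "a" does not hit the "a" inside "bargain"
    pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  gsub("^ +| +$", "", x)             # trim leading and trailing spaces
}

toy_stopwords <- c("the", "a", "is")
cleaned <- clean_text("The  price is 42 dollars, a BARGAIN!", toy_stopwords)
print(cleaned)
```

On this toy sentence, only the words "price", "dollars", and "bargain" survive, which shows how much of a short line the stopword and number filters can remove.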

#### Conclusion:

The analysis showed that the three en_US training files each occupy roughly 250 to 300 MB in memory. Taking a sample of this data makes it practical to examine the distribution of word frequencies. Only the English dataset was examined, but many foreign words turned up when reviewing the data. With other languages present, we would have to identify the foreign word roots, stems, and characters. Misspelled, abbreviated, and hashtag-type words will also require additional cleaning.

#### Final Project Planning:

My next step will be to study statistical natural language processing and n-grams in more depth, which will help in designing an approach for the final project of predicting the next word in a sentence. From what was observed in this project, the size of the data will be a limiting factor for model accuracy.
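
As a rough illustration of the n-gram idea, the sketch below counts bigrams in a few toy sentences and predicts the most frequent follower of a word. It is one possible starting point under simplifying assumptions, not the planned final model: `build_bigrams` and `predict_next` are hypothetical names, the input is assumed to be already cleaned, and there is no smoothing or backoff.

```r
# Hypothetical helper: count adjacent word pairs in cleaned lines of text.
build_bigrams <- function(lines) {
  first <- character(0)
  second <- character(0)
  for (line in lines) {
    words <- unlist(strsplit(line, " +"))
    words <- words[nzchar(words)]
    if (length(words) >= 2) {
      first <- c(first, head(words, -1))   # every word except the last
      second <- c(second, tail(words, -1)) # the word that follows it
    }
  }
  counts <- as.data.frame(table(first = first, second = second),
                          stringsAsFactors = FALSE)
  counts[counts$Freq > 0, ]  # keep only pairs that actually occur
}

# Hypothetical helper: most frequent follower of a word, or NA if unseen.
predict_next <- function(bigrams, word) {
  candidates <- bigrams[bigrams$first == word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$Freq)]
}

toy <- c("i love data science", "i love r", "i like data")
bigrams <- build_bigrams(toy)
prediction <- predict_next(bigrams, "i")
print(prediction)
```

Growing vectors inside a loop is fine for a toy example but would be slow on the sampled corpus, which is one reason data size is expected to constrain the final model.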