Introduction

The goal of this Capstone milestone project is to demonstrate my ability to work with data in R and to explain the exploratory analysis that will inform a prediction algorithm and Shiny app. The report will be published on RPubs (http://rpubs.com/).

This report uses tables and plots to illustrate important summaries of the data set. Its goals are to:

  1. Download the data and load it into R - demonstrate that the data have been downloaded and successfully loaded into the R workspace.
  2. Summary statistics - create a basic report of summary statistics about the data sets, including tables and plots.
  3. Interesting findings - report any interesting findings amassed so far.
  4. Strategy - outline a plan for the prediction algorithm and Shiny app.

Getting the Data

Download the Data

The dataset is available for download as a zip file (see References for the URL). Check whether the corpora folder already exists; if not, download the zip file and unzip it to extract the raw data into the working directory.
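A minimal sketch of this check-then-download step (the folder name "data" and the variable names are assumptions for illustration; the URL comes from the References):

zip.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip.file <- "Coursera-SwiftKey.zip"
# download and unzip only if the data folder is not already present
if (!dir.exists("data")) {
    download.file(zip.url, destfile = zip.file, mode = "wb")
    unzip(zip.file, exdir = "data")
}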

Load the Raw Data Files into R

Use the readLines() function to load the data into the R workspace as follows.

# suppressWarnings() hides readLines() warnings (e.g., about an incomplete final line)
suppressWarnings(twitter.raw <- readLines("C:/Users/leigh/Desktop/Coursera/Data Science Certification/Course 10 - Data Science Capstone/data/en_US.twitter.txt"))
suppressWarnings(blogs.raw <- readLines("C:/Users/leigh/Desktop/Coursera/Data Science Certification/Course 10 - Data Science Capstone/data/en_US.blogs.txt"))
suppressWarnings(news.raw <- readLines("C:/Users/leigh/Desktop/Coursera/Data Science Certification/Course 10 - Data Science Capstone/data/en_US.news.txt"))

Summary Statistics

Statistical Summary: Basic Text Data Information

Large databases of text in a target language are commonly used when building language models. To see what the data look like, basic statistical summaries are computed.

Note that the files are very large, so available memory needs to be considered. Find the length of each file and view a basic summary of its size; object.size() reports sizes in bytes.

US Blogs

matrix(c("Statistic", "File Size", "Length", "Max Char per Line", 
"Blogs", object.size(blogs.raw), length(blogs.raw), max(nchar(blogs.raw))), nrow=2, ncol=4, byrow=TRUE)
##      [,1]        [,2]        [,3]     [,4]               
## [1,] "Statistic" "File Size" "Length" "Max Char per Line"
## [2,] "Blogs"     "260564320" "899288" "40835"

US News

matrix(c("Statistic", "File Size", "Length", "Max Char per Line", 
"News", object.size(news.raw), length(news.raw), max(nchar(news.raw))), nrow=2, ncol=4, byrow=TRUE)
##      [,1]        [,2]        [,3]     [,4]               
## [1,] "Statistic" "File Size" "Length" "Max Char per Line"
## [2,] "News"      "20111392"  "77259"  "5760"

US Twitter

matrix(c("Statistic", "File Size", "Length", "Max Char per Line", "Twitter", object.size(twitter.raw), length(twitter.raw), max(nchar(twitter.raw))), nrow=2, ncol=4, byrow=TRUE)
##      [,1]        [,2]        [,3]      [,4]               
## [1,] "Statistic" "File Size" "Length"  "Max Char per Line"
## [2,] "Twitter"   "316037344" "2360148" "213"

Create Subsets Using Random Sampling

Since building an accurate model does not require all of the data, randomly drawn samples of each text file (one tenth of the lines) can serve as the training set.

set.seed(1234)   # fix the RNG seed so the sampling is reproducible (seed value is arbitrary)
blogs.sample <- sample(blogs.raw, size = length(blogs.raw)/10)
news.sample <- sample(news.raw, size = length(news.raw)/10)
twitter.sample <- sample(twitter.raw, size = length(twitter.raw)/10)

Data Summary

After sampling the data to create the training set, a summary of the new data is helpful.

# Create a matrix of summary statistics for the samples
summatrix <- matrix(c(object.size(blogs.sample), length(blogs.sample), max(nchar(blogs.sample)),
                      object.size(news.sample), length(news.sample), max(nchar(news.sample)),
                      object.size(twitter.sample), length(twitter.sample), max(nchar(twitter.sample))),
                    nrow=3, ncol=3, byrow=TRUE)
colnames(summatrix) <- c("FileSize", "Length", "MaxChar")
rownames(summatrix) <- c("Blogs", "News", "Twitter")
summatrix
##         FileSize Length MaxChar
## Blogs   25913768  89928    7082
## News     1993016   7725    1170
## Twitter 31895192 236014     211
# Create a multi-panel barplot of the summary statistics
old.par <- par(mfrow=c(1,3), mar=c(3,3,1,1), oma=c(0,0,3,1))
barplot(summatrix[,1], main = "File Size (bytes)", col=c("blue","red", "green"))
barplot(summatrix[,2], main = "Number of Lines", col=c("blue","red", "green"))
barplot(summatrix[,3], main = "Max Char per Line", col=c("blue","red", "green"))
mtext("Barplot of Summary Statistics for Corpora Data", side=3, line=1, outer=TRUE, cex=1, font=2)

par(old.par)

The data summaries suggest it would be advantageous to join the data sets together while maintaining the characteristics of each data file.
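A one-line sketch of that combination step (the object name corpus.sample is illustrative):

# combine the three samples into a single corpus for later analysis
corpus.sample <- c(blogs.sample, news.sample, twitter.sample)
length(corpus.sample)   # total lines in the combined sample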


Interesting Findings

  1. From the bar charts above, the Twitter data has the largest file size and by far the most lines, yet the shortest lines (a maximum of 213 characters, reflecting Twitter's character limit). The Blogs data has the longest lines, while the News data has a much smaller file size and far fewer lines.
  2. The file sizes are very large and use a lot of memory, so it is necessary to take samples for analysis.
  3. From the summary statistics, I feel it would be advantageous to combine all three data sets into one corpus for analysis.

Summary of Strategy to Build Predictive Model

The corpus samples will be tokenized to build an n-gram model. An n-gram model estimates the probability of a word occurring in a phrase based on the previous words. Under maximum-likelihood estimation, the probability of a word given its preceding (n-1)-word context is the count of the full n-gram divided by the count of that context; for a bigram model, P(w2 | w1) = count(w1 w2) / count(w1).
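A minimal base-R illustration of that maximum-likelihood calculation (the toy sentences and object names are invented for this example):

toy <- c("the cat sat", "the cat ran", "the dog sat")
words <- unlist(strsplit(toy, " "))   # unigram tokens
# build bigrams within each sentence so pairs do not cross sentence boundaries
bigrams <- unlist(lapply(strsplit(toy, " "), function(w) paste(head(w, -1), tail(w, -1))))
# P(sat | cat) = count("cat sat") / count("cat")
sum(bigrams == "cat sat") / sum(words == "cat")   # returns 0.5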

These considerations will be addressed based on initial testing of a simple n-gram model. The strategy is to start with a simple n-gram model and build in complexity as determined by how the model performs against held-out test sets of phrases (see the sketch below).
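A minimal sketch of holding out such a test set, assuming the combined corpus.sample from above (the 80/20 split ratio and variable names are illustrative choices):

set.seed(42)   # arbitrary seed for a reproducible split
# hold out 20% of the sampled lines as test phrases
test.idx <- sample(seq_along(corpus.sample), size = round(0.2 * length(corpus.sample)))
train.set <- corpus.sample[-test.idx]
test.set <- corpus.sample[test.idx]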


References and Research

(1). Dataset: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

(2). Readme file for the corpora: http://www.corpora.heliohost.org/aboutcorpus.html.

(3). Natural Language Processing, Wikipedia: https://en.wikipedia.org/wiki/Natural_language_processing.

(4). CRAN Task View: Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.

(5). Text Mining, Wikipedia: https://en.wikipedia.org/wiki/Text_mining.

(6). Nelson, Paul. “Five Steps to Tackling Big Data with Natural Language Processing.” Inside Big Data, 10 July 2017, <insidebigdata.com/2017/07/10/five-steps-tackling-big-data-natural-language-processing/>. Accessed 17 Jan. 2018.

(7). Kirk, Chris. “The Most Popular Swear Words on Facebook.” Lexicon Valley, 11 Sept. 2013, <www.slate.com/blogs/lexicon_valley/2013/09/11/top_swear_words_most_popular_curse_words_on_facebook.html>. Accessed 24 Jan. 2018.

(8). R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

(9). Ingo Feinerer and Kurt Hornik (2014). tm: Text Mining Package. R package version 0.6. http://CRAN.R-project.org/package=tm.

(10). Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: http://www.jstatsoft.org/v25/i05/.