The goal of the Capstone milestone project is to show my ability to work with data in R, and explains exploratory analysis for building a prediction algorithm and Shiny app. The report will be submitted on R Pubs (http://rpubs.com/).
The main focus points of this report You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to: 1. Downlad Data and Load in R - Demonstrate that you’ve downloaded the data and have successfully loaded it into the R workspage. 2. Summary Statistics - Create a basic report of summary statistics about the data sets, including tables and plots. 3. Interesting Findings - Report any interesting findings that you amassed so far. 4. Strategy for prediction algorithm and Shiny app.
The dataset is available for download as a zip file (see References for website). Check to see if the Corpora file already exists; if not, download the file from and unzip the folder to extract the raw data into the selected working directory.
Use the readlines() function to load the data into the workspace in R as follows.
suppressWarnings(twitter.raw <- readLines("C:/Users/leigh/Desktop/Coursera/Data Science Certification/Course 10 - Data Science Capstone/data/en_US.twitter.txt"))
suppressWarnings(blogs.raw <- readLines("C:/Users/leigh/Desktop/Coursera/Data Science Certification/Course 10 - Data Science Capstone/data/en_US.blogs.txt"))
suppressWarnings(news.raw <- readLines("C:/Users/leigh/Desktop/Coursera/Data Science Certification/Course 10 - Data Science Capstone/data/en_US.news.txt"))
Large databases comprising of text in a target language are commonly used when generating language models for various purposes. To find out what the data look like, statistical summaries are performed.
Note that the file sizes are very large and thus computing memory needs to be considered. Find the lengths of the files and view a basic summary of the file sizes.
matrix(c("Statistic", "File Size", "Length", "Max Char per Line",
"Blogs", object.size(blogs.raw), length(blogs.raw), max(nchar(blogs.raw))), nrow=2, ncol=4, byrow=TRUE)
## [,1] [,2] [,3] [,4]
## [1,] "Statistic" "File Size" "Length" "Max Char per Line"
## [2,] "Blogs" "260564320" "899288" "40835"
matrix(c("Statistic", "File Size", "Length", "Max Char per Line",
"News", object.size(news.raw), length(news.raw), max(nchar(news.raw))), nrow=2, ncol=4, byrow=TRUE)
## [,1] [,2] [,3] [,4]
## [1,] "Statistic" "File Size" "Length" "Max Char per Line"
## [2,] "News" "20111392" "77259" "5760"
matrix(c("Statistic", "File Size", "Length", "Max Char per Line", "Twitter", object.size(twitter.raw), length(twitter.raw), max(nchar(twitter.raw))), nrow=2, ncol=4, byrow=TRUE)
## [,1] [,2] [,3] [,4]
## [1,] "Statistic" "File Size" "Length" "Max Char per Line"
## [2,] "Twitter" "316037344" "2360148" "213"
Since building models doesn’t require use of all the given data to be accurate, we can use randomly generated samples of the text files as the training set.
blogs.sample <- sample(blogs.raw, size = length(blogs.raw)/10)
news.sample <- sample(news.raw, size = length(news.raw)/10)
twitter.sample <- sample(twitter.raw, size = length(twitter.raw)/10)
After sampling the data to create the training set, a summary of the new data is helpful.
#Create a matrix of summary statistics
summatrix <-matrix(c(object.size(blogs.sample), length(blogs.sample), max(nchar(blogs.sample)),
object.size(news.sample), length(news.sample), max(nchar(news.sample)),
object.size(twitter.sample), length(twitter.sample), max(nchar(twitter.sample))),
nrow=3, ncol=3, byrow=TRUE);
colnames(summatrix) <- c("FileSize", "Length", "MaxChar");
rownames(summatrix) <- c("Blogs", "News", "Twitter"); summatrix;
## FileSize Length MaxChar
## Blogs 25913768 89928 7082
## News 1993016 7725 1170
## Twitter 31895192 236014 211
#Create a Multi-panel Barplot
plot <- par(mfrow=c(1,3), mar=c(3,3,1,1), oma=c(0,0,3,1))
barplot(summatrix[,1], main = "File Size (Mb in Millions)", col=c("blue","red", "green"))
barplot(summatrix[,2], main = "Number of Lines", col=c("blue","red", "green"))
barplot(summatrix[,1], main = "Max Char Per Line", col=c("blue","red", "green"))
mtext("BarPlot of Summary Statistics for Corpora Data", side=3, line=1, outer=TRUE, cex=1, font=2)
par(plot)
It appears from the data summaries that it would be advantageous to join the data sets together, while maintaining the charateristics of each data file.
The corpus samples will be tokenized to build an n-gram model. An N-gram model estimates the probability of a word occuring in a phrase based on the previous words in the phrase. N-grams calculate this by probability by looking at the number of times the last word occurs in a phrase followed by the number of times the phrase minus the last word that occurs.
The above considerations will be address based on initial testing of a simple n-gram model. The strategy is to start out with a simple n-gram model, and build in complexity as determined by how the model performs against test sets of phrases.
(1). Dataset https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
(3). Natural Language Processing Wikipedia https://en.wikipedia.org/wiki/Natural_language_processing.
(4). CRAN Task View: Natural Language Processing https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.
(5). Text Mining https://en.wikipedia.org/wiki/Text_mining https://insidebigdata.com/2017/07/10/five-steps-tackling-big-data-natural-language-processing/.
(6). Nelson, Paul. “Five Steps to Tackling Big Data with Natural Language Processing.” Inside Big Data, 10 July 2017, <insidebigdata.com/2017/07/10/five-steps-tackling-big-data-natural-language-processing/>. Accessed 17 Jan. 2018.
(7). Kirk, Chris. “The Most Popular Swear Words on Facebook.” Lexicon Valley, 11 Sept. 2013, <www.slate.com/blogs/lexicon_valley/2013/09/11/top_swear_words_most_popular_curse_words_on_facebook.html>. Accessed 24 Jan.2018.
(8). R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
(9). Ingo Feinerer and Kurt Hornik (2014). tm: Text Mining Package. R package version 0.6. http://CRAN.R-project.org/package=tm.
(10). Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: http://www.jstatsoft.org/v25/i05/.