This presentation covers Week Two of the Coursera Data Science Capstone and will review the following criteria for this week's assignment:

- Does the link lead to an HTML page describing the exploratory analysis of the training data set?
- Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
- Has the data scientist made basic plots, such as histograms to illustrate features of the data?
- Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
The motivation for this project is to demonstrate that the data has been downloaded and loaded into R, produce basic summary statistics about the data sets, report any interesting findings so far, and outline plans for the prediction algorithm and Shiny app.
Download the data from the location provided:
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the archive only if it is not already present
if (!file.exists("coursera-swiftkey.zip")){
  download.file(url, destfile="coursera-swiftkey.zip")
}
This downloads a zip file that contains the following data sets in text format (this report uses the English-language files):

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
After loading the required libraries, I'll read in the data and perform some basic analysis of the resulting data sets.
First, extract the data from the zip file.
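The archive isn't actually unpacked anywhere in the code shown here, so below is a minimal sketch of that step. It assumes the zip keeps its usual final/en_US/ folder layout and that only the three English files are needed; adjust the paths if your copy is organised differently.
# Unpack only the English files; junkpaths = TRUE drops the folder structure so the
# en_US.*.txt files land in the working directory, matching the file names used below.
if (!file.exists("en_US.blogs.txt")){
  unzip("coursera-swiftkey.zip",
        files = c("final/en_US/en_US.blogs.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.twitter.txt"),
        junkpaths = TRUE)
}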
# Open each file in binary mode ("rb") so readLines handles embedded special characters; skipNul = TRUE drops embedded nulls
blogdata <- "en_US.blogs.txt"; con <- file(blogdata,open = "rb")
blogdata <- readLines(con, skipNul = TRUE); close(con)
newsdata <- "en_US.news.txt"; con <- file(newsdata,open = "rb")
newsdata <- readLines(con,skipNul = TRUE); close(con)
## Warning in readLines(con, skipNul = TRUE): incomplete final line found on
## 'en_US.news.txt'
twitterdata <- "en_US.twitter.txt"; con <- file(twitterdata,open = "rb")
twitterdata <- readLines(con,skipNul = TRUE); close(con)
Calculate data set statistics: Num_lines = total number of lines in the data set. Num_chars = total number of characters in the data set. Max_chars = maximum number of characters in a single line within the data set. (These are character counts rather than word counts; a rough word count is sketched after the table below.)
# Number of lines and characters in each data set
num_lines <- sapply(list(blogdata,newsdata,twitterdata),length)
num_char <- sapply(list(blogdata,newsdata,twitterdata),nchar)
stat_calc <- cbind(c("blog_data","news_data","twitter_data"),num_lines,sapply(num_char,sum),sapply(num_char,max))
stat_table <- as.data.frame(stat_calc)
colnames(stat_table) <- c("dataset","Num_lines","Num_chars","Max_chars")
knitr::kable(stat_table)
| dataset | Num_lines | Num_chars | Max_chars |
|---|---|---|---|
| blog_data | 899288 | 208361438 | 40835 |
| news_data | 1010242 | 203791405 | 11384 |
| twitter_data | 2360148 | 162385035 | 213 |
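The Num_chars column is a character count rather than a word count. Since the assignment also asks for word counts, a rough count could be added along the following lines; this is only a sketch using base R, num_words is my own variable name, and splitting on whitespace is an approximation of word boundaries.
# Approximate word counts: split each line on whitespace and count the resulting tokens.
# This is memory-hungry on the full data sets, so it may be better to run it on the sample.
num_words <- sapply(list(blogdata,newsdata,twitterdata),
                    function(x) sum(lengths(strsplit(x, "\\s+"))))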
Due to the large size of the data files, I will take a random sample of each data set and then perform additional analysis on this representative data.
First I will combine all of the files into one object and take a 1% sample of each data set:
set.seed(1803)  # set the seed before sampling so the sample is reproducible
all.data <- c(blogdata, newsdata, twitterdata)
all.data.sample <- c(sample(blogdata,length(blogdata)/100),sample(newsdata,length(newsdata)/100),sample(twitterdata,length(twitterdata)/100))
Then, I will create a corpus (https://en.wikipedia.org/wiki/Text_corpus) based on the resulting sample data set and clean up the data (e.g., transform to ASCII, convert to all lowercase, remove numbers, punctuation and extra whitespace). These transformations are available using the "tm" package and the R gsub function.
library(tm)
## Loading required package: NLP
corp <- VCorpus(VectorSource(all.data.sample))
# Convert to ASCII, dropping any characters that cannot be converted
corp <- tm_map(corp, content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = "")))
# Pull the text back out of the corpus as a character vector for the gsub clean-up steps
corp <- sapply(corp, as.character)
corp <- tolower(corp)                  # convert to lowercase
corp <- gsub("[[:digit:]]","",corp)    # remove numbers
corp <- gsub("[[:punct:]]","",corp)    # remove punctuation
corp <- gsub("[[:cntrl:]]","",corp)    # remove control characters
corp <- gsub("[^[:print:]]","",corp)   # remove any remaining non-printable characters
corp <- gsub("\\s+"," ",corp)          # collapse extra whitespace
Next, I'll parse out 1- to 4-grams using the NGramTokenizer function from the R RWeka package. This function also lets the developer specify which word delimiters to use; I've used the standard recommended delimiters.
I then place the resulting N-Grams into data frames and prepare them for plotting.
library(RWeka)
# Word delimiters passed to the Weka tokenizer: whitespace and common punctuation
delim <- " \\r\\n\\t.!?,;:\"()"
onegram <- NGramTokenizer(corp, Weka_control(min=1,max=1, delimiters = delim))
bigram <- NGramTokenizer(corp, Weka_control(min=2,max=2, delimiters = delim))
trigram <- NGramTokenizer(corp, Weka_control(min=3,max=3, delimiters = delim))
quadgram <- NGramTokenizer(corp, Weka_control(min=4,max=4, delimiters = delim))
# Build a frequency table for each set of n-grams and keep the 30 most frequent
top_ngrams <- function(ngrams, n = 30) {
  df <- data.frame(table(ngrams))
  colnames(df) <- c("words", "frequency")
  df <- df[order(df$frequency, decreasing = TRUE), ][1:n, ]
  rownames(df) <- 1:n
  df
}
onegram30 <- top_ngrams(onegram)
bigram30 <- top_ngrams(bigram)
trigram30 <- top_ngrams(trigram)
quadgram30 <- top_ngrams(quadgram)
Then plot the N-Grams.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
plot_ngrams <- function(data1, title1)
{
  ggplot(data = data1, aes(reorder(words, -frequency), frequency)) +
    labs(x = "words", y = "frequency", title = title1) +
    geom_bar(stat = "identity", fill = "blue") +
    geom_text(aes(label = frequency), vjust = -0.1, size = 2, angle = 45, hjust = -.05) +
    theme(axis.text = element_text(size = 7.5, angle = 45, hjust = 1))
}
plot_ngrams(onegram30,"30 Most frequent 1-grams")
plot_ngrams(bigram30,"30 Most frequent 2-grams")
plot_ngrams(trigram30,"30 Most frequent 3-grams")
plot_ngrams(quadgram30,"30 Most frequent 4-grams")
The most interesting finding isn't really that interesting: "the" is the most commonly used word in the English language, and it shows up as the top 1-gram in my data, so I'm happy that my approach has confirmed that the data sets represent real English-language usage. The same holds for practically all of my top 1-grams (https://en.wikipedia.org/wiki/Most_common_words_in_English).
My next challenge will be to develop a prediction algorithm and Shiny app that will suggest the next word after a user types a word or phrase. I've already bumped up against memory issues on my computer, so any advice in the comments is welcome. However, I'm pretty confident that my corpus is ready to go based on the findings above.
I'm considering a backoff approach: when a word has zero observed probability at the n-gram level, I'll back off to the (n-1)-gram level to estimate it.
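As a rough illustration of that idea (a sketch, not the final implementation), the function below assumes full n-gram frequency data frames with words and frequency columns; bi.freq, tri.freq and quad.freq are hypothetical names for tables built like those above but without the top-30 cut. It tries the longest matching context first and backs off to shorter n-grams when nothing matches.
# Hypothetical backoff lookup: try the longest n-gram context first,
# then back off to shorter contexts when no match is found.
predict_next <- function(phrase, quad.freq, tri.freq, bi.freq) {
  typed <- unlist(strsplit(tolower(phrase), "\\s+"))
  tables <- list(quad.freq, tri.freq, bi.freq)  # columns: words, frequency
  context_len <- c(3, 2, 1)                     # words of context each table needs
  for (i in seq_along(tables)) {
    n <- context_len[i]
    if (length(typed) < n) next
    context <- paste(tail(typed, n), collapse = " ")
    # keep the n-grams whose first n words match the typed context
    hits <- tables[[i]][grepl(paste0("^", context, " "), tables[[i]]$words), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$words[which.max(hits$frequency)])
      return(tail(strsplit(best, " ")[[1]], 1))  # last word of the best n-gram
    }
  }
  NA_character_  # nothing matched at any level
}
# e.g. predict_next("thanks for the", quad.freq, tri.freq, bi.freq)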