This presentation covers Week Two of the Coursera Data Science Capstone and will review the following criteria for this week's assignment:

- Does the link lead to an HTML page describing the exploratory analysis of the training data set?
- Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
- Has the data scientist made basic plots, such as histograms to illustrate features of the data?
- Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
The motivation for this project is to demonstrate that the data has been downloaded and loaded into R, produce basic summary statistics about the data sets, report any interesting findings so far, and outline plans for the prediction algorithm and Shiny app.
Download the data from the location provided:
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the archive only if it is not already present
if (!file.exists("coursera-swiftkey.zip")){
  download.file(url, destfile="coursera-swiftkey.zip")
}
This downloads a zip file that contains the following data sets in text format (this report uses the English-language files):

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
After loading the required libraries, I'll read in the data and perform some basic analysis of the resulting data sets.
First, extract the data from the zip file.
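The archive isn't actually unpacked anywhere in the code shown here, so below is a minimal sketch of that step. It assumes the zip keeps its usual final/en_US/ folder layout and that only the three English files are needed; adjust the paths if your copy is organised differently.
# Unpack only the English files; junkpaths = TRUE drops the folder structure so the
# en_US.*.txt files land in the working directory, matching the file names used below.
if (!file.exists("en_US.blogs.txt")){
  unzip("coursera-swiftkey.zip",
        files = c("final/en_US/en_US.blogs.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.twitter.txt"),
        junkpaths = TRUE)
}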
# Open each file in binary mode ("rb") so readLines handles embedded special characters; skipNul = TRUE drops embedded nulls
blogdata <- "en_US.blogs.txt"; con <- file(blogdata,open = "rb")
blogdata <- readLines(con, skipNul = TRUE); close(con)
newsdata <- "en_US.news.txt"; con <- file(newsdata,open = "rb")
newsdata <- readLines(con,skipNul = TRUE); close(con)
## Warning in readLines(con, skipNul = TRUE): incomplete final line found on
## 'en_US.news.txt'
twitterdata <- "en_US.twitter.txt"; con <- file(twitterdata,open = "rb")
twitterdata <- readLines(con,skipNul = TRUE); close(con)
Calculate data set statistics: Num_lines = total number of lines in the data set. Num_chars = total number of characters in the data set. Max_chars = maximum number of characters in a single line within the data set. (These are character counts rather than word counts; a rough word count is sketched after the table below.)
# Number of lines and characters in each data set
num_lines <- sapply(list(blogdata,newsdata,twitterdata),length)
num_char <- sapply(list(blogdata,newsdata,twitterdata),nchar)
stat_calc <- cbind(c("blog_data","news_data","twitter_data"),num_lines,sapply(num_char,sum),sapply(num_char,max))
stat_table <- as.data.frame(stat_calc)
colnames(stat_table) <- c("dataset","Num_lines","Num_chars","Max_chars")
knitr::kable(stat_table)
| dataset | Num_lines | Num_chars | Max_chars |
|---|---|---|---|
| blog_data | 899288 | 208361438 | 40835 |
| news_data | 1010242 | 203791405 | 11384 |
| twitter_data | 2360148 | 162385035 | 213 |
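The Num_chars column is a character count rather than a word count. Since the assignment also asks for word counts, a rough count could be added along the following lines; this is only a sketch using base R, num_words is my own variable name, and splitting on whitespace is an approximation of word boundaries.
# Approximate word counts: split each line on whitespace and count the resulting tokens.
# This is memory-hungry on the full data sets, so it may be better to run it on the sample.
num_words <- sapply(list(blogdata,newsdata,twitterdata),
                    function(x) sum(lengths(strsplit(x, "\\s+"))))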
Due to the large size of the data files, I will take a random sample of each data set and then perform additional analysis on this representative data.
First I will combine all of the files into one object and take a 1% sample of each data set:
set.seed(1803)  # set the seed before sampling so the sample is reproducible
all.data <- c(blogdata, newsdata, twitterdata)
all.data.sample <- c(sample(blogdata,length(blogdata)/100),sample(newsdata,length(newsdata)/100),sample(twitterdata,length(twitterdata)/100))
Then, I will create a corpus (https://en.wikipedia.org/wiki/Text_corpus) based on the resulting sample data set and clean up the data (e.g., transform to ASCII, convert to all lowercase, remove numbers, punctuation and extra whitespace). These transformations are available using the "tm" package and the R gsub function.
library(tm)
## Loading required package: NLP
corp <- VCorpus(VectorSource(all.data.sample))
# Convert to ASCII, dropping any characters that cannot be converted
corp <- tm_map(corp, content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = "")))
# Pull the text back out of the corpus as a character vector for the gsub clean-up steps
corp <- sapply(corp, as.character)
corp <- tolower(corp)                  # convert to lowercase
corp <- gsub("[[:digit:]]","",corp)    # remove numbers
corp <- gsub("[[:punct:]]","",corp)    # remove punctuation
corp <- gsub("[[:cntrl:]]","",corp)    # remove control characters
corp <- gsub("[^[:print:]]","",corp)   # remove any remaining non-printable characters
corp <- gsub("\\s+"," ",corp)          # collapse extra whitespace
Next, I'll parse out 1- to 4-grams using the NGramTokenizer function from the R RWeka package. This function also lets the developer specify which word delimiters to use; I've used the standard recommended delimiters.
I then place the resulting N-Grams into data frames and prepare them for plotting.
library(RWeka)
# Word delimiters passed to the Weka tokenizer: whitespace and common punctuation
delim <- " \\r\\n\\t.!?,;:\"()"
onegram <- NGramTokenizer(corp, Weka_control(min=1,max=1, delimiters = delim))
bigram <- NGramTokenizer(corp, Weka_control(min=2,max=2, delimiters = delim))
trigram <- NGramTokenizer(corp, Weka_control(min=3,max=3, delimiters = delim))
quadgram <- NGramTokenizer(corp, Weka_control(min=4,max=4, delimiters = delim))
# Build a frequency table for each set of n-grams and keep the 30 most frequent
top_ngrams <- function(ngrams, n = 30) {
  df <- data.frame(table(ngrams))
  colnames(df) <- c("words", "frequency")
  df <- df[order(df$frequency, decreasing = TRUE), ][1:n, ]
  rownames(df) <- 1:n
  df
}
onegram30 <- top_ngrams(onegram)
bigram30 <- top_ngrams(bigram)
trigram30 <- top_ngrams(trigram)
quadgram30 <- top_ngrams(quadgram)
Then plot the N-Grams.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
plot_ngrams <- function(data1, title1)
{
  ggplot(data = data1, aes(reorder(words, -frequency), frequency)) +
    labs(x = "words", y = "frequency", title = title1) +
    geom_bar(stat = "identity", fill = "blue") +
    geom_text(aes(label = frequency), vjust = -0.1, size = 2, angle = 45, hjust = -.05) +
    theme(axis.text = element_text(size = 7.5, angle = 45, hjust = 1))
}
plot_ngrams(onegram30,"30 Most frequent 1-grams")
plot_ngrams(bigram30,"30 Most frequent 2-grams")
plot_ngrams(trigram30,"30 Most frequent 3-grams")
plot_ngrams(quadgram30,"30 Most frequent 4-grams")
The most interesting finding isn't really that interesting: "the" is the most commonly used word in the English language, and it shows up as the top 1-gram in my data, so I'm happy that my approach has confirmed that the data sets represent real English-language usage. The same holds for practically all of my top 1-grams (https://en.wikipedia.org/wiki/Most_common_words_in_English).
My next challenge will be to develop a prediction algorithm and Shiny app that will suggest the next word after a user types a word or phrase. I've already bumped up against memory issues on my computer, so any advice in the comments is welcome. However, I'm pretty confident that my corpus is ready to go based on the findings above.
I'm considering a backoff approach: when a word has zero observed probability at the n-gram level, I'll back off to the (n-1)-gram level to estimate it.
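As a rough illustration of that idea (a sketch, not the final implementation), the function below assumes full n-gram frequency data frames with words and frequency columns; bi.freq, tri.freq and quad.freq are hypothetical names for tables built like those above but without the top-30 cut. It tries the longest matching context first and backs off to shorter n-grams when nothing matches.
# Hypothetical backoff lookup: try the longest n-gram context first,
# then back off to shorter contexts when no match is found.
predict_next <- function(phrase, quad.freq, tri.freq, bi.freq) {
  typed <- unlist(strsplit(tolower(phrase), "\\s+"))
  tables <- list(quad.freq, tri.freq, bi.freq)  # columns: words, frequency
  context_len <- c(3, 2, 1)                     # words of context each table needs
  for (i in seq_along(tables)) {
    n <- context_len[i]
    if (length(typed) < n) next
    context <- paste(tail(typed, n), collapse = " ")
    # keep the n-grams whose first n words match the typed context
    hits <- tables[[i]][grepl(paste0("^", context, " "), tables[[i]]$words), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$words[which.max(hits$frequency)])
      return(tail(strsplit(best, " ")[[1]], 1))  # last word of the best n-gram
    }
  }
  NA_character_  # nothing matched at any level
}
# e.g. predict_next("thanks for the", quad.freq, tri.freq, bi.freq)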