This is a progress brief for my Coursera Data Science Capstone Project. The objective is to describe the exploratory data analysis that will lead to the eventual prediction algorithm and app.
The raw corpus data are located on my local hard disk:

- Blogs: "./data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
- News: "./data/Coursera-SwiftKey/final/en_US/en_US.news.txt"
- Twitter: "./data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
The corpus dataset is downloaded from the Coursera data source (Corpus Data Source) and then unzipped to the local disk, as sketched below.
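A minimal sketch of this step is shown below; the download URL is the usual Coursera capstone location and, like the local paths, is an assumption that may need adjusting.

# Sketch of the download-and-unzip step (URL and local paths are assumptions)
zipUrl  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "./data/Coursera-SwiftKey.zip"
if (!dir.exists("./data")) dir.create("./data")
if (!file.exists(zipFile)) {
  download.file(zipUrl, destfile = zipFile, mode = "wb")
}
unzip(zipFile, exdir = "./data/Coursera-SwiftKey")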
As the original data files (Blogs, News and Twitter) are extremely large, a small sample will be generated to study the data. One percent (1%) of the contents of each dataset (Blogs, News and Twitter) will be sampled to create the sample corpus. The corpus will then be generated from the sample created.
blogs <- readLines(con <- file("C:/Users/553168/Desktop/Data Science Coursera/Capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt"),
encoding = "UTF-8", skipNul = TRUE)
close(con)
head(blogs)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [6] "If you have an alternative argument, let's hear it! :)"
summary(blogs)
## Length Class Mode
## 899288 character character
max(nchar(blogs))
## [1] 40833
news <- readLines(con <- file("C:/Users/553168/Desktop/Data Science Coursera/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt"),
encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con <- file("C:/Users/553168/Desktop/Data Science
## Coursera/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt"), :
## incomplete final line found on 'C:/Users/553168/Desktop/Data Science
## Coursera/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt'
close(con)
head(news)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
summary(news)
## Length Class Mode
## 77259 character character
max(nchar(news))
## [1] 5760
twitter <- readLines(con <- file("C:/Users/553168/Desktop/Data Science Coursera/Capstone/Coursera-SwiftKey/final/en_US/en_US.twitter.txt"),
encoding = "UTF-8", skipNul = TRUE)
close(con)
head(twitter)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
summary(twitter)
## Length Class Mode
## 2360148 character character
max(nchar(twitter))
## [1] 140
The basic statistics and summary information are tabulated for each data type as follows:
library(stringi)
## Warning: package 'stringi' was built under R version 3.4.4
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.3
statsRaw <- data.frame(
Dataset = c("Blogs","News","Twitter"),
FileSizeinMB = c(file.info("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size/1024^2,
file.info("./Coursera-SwiftKey/final/en_US/en_US.news.txt")$size/1024^2,
file.info("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size/1024^2
),
t(rbind(sapply(list(blogs,news,twitter),stri_stats_general),
WordCount=sapply(list(blogs,news,twitter),stri_stats_latex)[4,])
)
)
kable(statsRaw)
| Dataset | FileSizeinMB | Lines | LinesNEmpty | Chars | CharsNWhite | WordCount |
|---|---|---|---|---|---|---|
| Blogs | 200.4242 | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| News | 196.2775 | 77259 | 77259 | 15639408 | 13072698 | 2651432 |
| Twitter | 159.3641 | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
A sample dataset consisting of only 1% of each of the 3 original datasets is selected for further analysis, due to the huge file size of the original data and the limited computational power available.
Then, the sampled data are used to create a corpus, and the following clean-up steps are performed: converting all words to lowercase, removing punctuation, excluding numbers, stripping whitespace, eliminating English stop words and stemming. Finally, a Document Term Matrix (DTM) is created for each corpus.
library(tm)
## Warning: package 'tm' was built under R version 3.4.4
## Loading required package: NLP
library(SnowballC)
# Set random seed for reproducibility
set.seed(11)
sampleBlogs <- blogs[sample(length(blogs), round(length(blogs)*0.01), replace=FALSE)]
sampleNews <- news[sample(length(news), round(length(news)*0.01), replace=FALSE)]
sampleTwitter <- twitter[sample(length(twitter), round(length(twitter)*0.01), replace=FALSE)]
sampleBlogs <- iconv(sampleBlogs, "UTF-8", "ASCII", sub="")
sampleNews <- iconv(sampleNews, "UTF-8", "ASCII", sub="")
sampleTwitter <- iconv(sampleTwitter, "UTF-8", "ASCII", sub="")
# Save merged corpus sample to local disk
sampleData <- c(sampleBlogs, sampleNews, sampleTwitter)
filecon <- file("./sampleData.txt")
writeLines(sampleData, filecon)
close(filecon)
sampleMerged <- list(sampleBlogs, sampleNews, sampleTwitter)
# Create corpus and Document Term Matrix (DTM) vector
sampleCorpus <- list()
dtm <- list()
for (i in 1:length(sampleMerged)) {
sampleCorpus[[i]] <- Corpus(VectorSource(sampleMerged[[i]])) #Create corpus
sampleCorpus[[i]] <- tm_map(sampleCorpus[[i]], content_transformer(tolower)) #lowercase
sampleCorpus[[i]] <- tm_map(sampleCorpus[[i]], removePunctuation) #Remove punctuation
sampleCorpus[[i]] <- tm_map(sampleCorpus[[i]], removeNumbers) #Remove numbers
sampleCorpus[[i]] <- tm_map(sampleCorpus[[i]], stripWhitespace) #Strip Whitespace
sampleCorpus[[i]] <- tm_map(sampleCorpus[[i]], removeWords, stopwords("english")) #Remove English stopwords
sampleCorpus[[i]] <- tm_map(sampleCorpus[[i]], stemDocument) #Perform stemming
# Create document term frequency for corpus
dtm[[i]] <- DocumentTermMatrix(sampleCorpus[[i]], control=list(wordLengths=c(0,Inf)))
}
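As an optional sanity check (not part of the original analysis), the most frequent terms of a cleaned corpus can be listed directly from its DTM, for example for the blogs sample:

# Optional check: terms appearing at least 100 times in the blogs DTM
findFreqTerms(dtm[[1]], lowfreq = 100)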
The first step is understanding the distribution of, and the relationships between, the words, tokens, and phrases in the text. Thus, a word cloud is used to plot the word frequency map for each of the US English Blogs, News and Twitter corpora. The maximum number of words plotted is set to 100, and the words are plotted in decreasing order of frequency.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.4
## Loading required package: RColorBrewer
library(slam)
## Warning: package 'slam' was built under R version 3.4.4
# Set random seed for reproducibility
set.seed(11)
Headings <- c("US English Blogs",
"US English News",
"US English Twitter")
# Plot word cloud (Max = 100) using DTM for each corpus
for (i in 1:length(sampleCorpus)) {
wordcloud(words = colnames(dtm[[i]]), freq = slam::col_sums(dtm[[i]]),
scale = c(2, 1), max.words = 100, random.order = FALSE, rot.per = 0.5,
use.r.layout = FALSE, colors = brewer.pal(8, "Accent")
)
title(Headings[i])
}
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.4.4
library(rJava)
## Warning: package 'rJava' was built under R version 3.4.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.4.3
plotNGrams <- function (x, subTitle, num) {
# Develop Unigram token
uniToken <- RWeka::NGramTokenizer(x, Weka_control(min = 1, max = 1))
unigram <- as.data.frame(table(uniToken))
unigram <- unigram[order(unigram$Freq, decreasing = TRUE), ]
colnames(unigram) <- c("Word", "Freq")
unigram <- head(unigram, num)
uniplot <- ggplot(unigram, aes(x=reorder(Word, Freq), y=Freq)) +
geom_bar(stat="identity", fill="red") +
ggtitle(paste("Unigrams of", subTitle)) +
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=90, hjust=1))
# Develop Bigram token
biToken <- RWeka::NGramTokenizer(x, Weka_control(min = 2, max = 2,
delimiters = " \\r\\n\\t.,;:\"()?!"))
bigram <- as.data.frame(table(biToken))
bigram <- bigram[order(bigram$Freq, decreasing = TRUE), ]
colnames(bigram) <- c("Word", "Freq")
bigram <- head(bigram, num)
biplot <- ggplot(bigram, aes(x=reorder(Word, Freq), y=Freq)) +
geom_bar(stat="identity", fill="blue") +
ggtitle(paste("Bigrams of", subTitle)) +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=90, hjust=1))
# Develop Trigram token
triToken <- RWeka::NGramTokenizer(x, Weka_control(min = 3, max = 3,
delimiters = " \\r\\n\\t.,;:\"()?!"))
trigram <- as.data.frame(table(triToken))
trigram <- trigram[order(trigram$Freq, decreasing = TRUE), ]
colnames(trigram) <- c("Word", "Freq")
trigram <- head(trigram, num)
triplot <- ggplot(trigram, aes(x=reorder(Word, Freq),y=Freq)) +
geom_bar(stat="identity", fill="green") +
ggtitle(paste("Trigrams of", subTitle)) +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=90, hjust=1))
# Arrange three plots into one row
grid.arrange(uniplot, biplot, triplot, ncol = 3)
}
Weka is a collection of machine learning algorithms for data mining tasks written in Java, containing tools for data pre-processing, classification, regression, clustering, association rules, and visualization. In this report, the US English Blogs, News and Twitter corpora are the character vectors whose strings are to be tokenized, and NGramTokenizer splits those strings into n-grams with given minimal and maximal numbers of grams.
Hence, the RWeka package is used to develop a function that tokenizes the sample corpora and creates unigram, bigram and trigram frequency plots with the ggplot2 package. A thorough exploratory analysis of the data is performed in order to understand the distribution and frequencies of words and word pairs in the corpora. The complete code is provided in the appendix.
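As a quick illustration (the example sentence below is made up and not taken from the corpora), NGramTokenizer can be applied directly to a short string:

# Toy illustration of NGramTokenizer on a made-up sentence
library(RWeka)
NGramTokenizer("the quick brown fox jumps", Weka_control(min = 2, max = 2))
## expected: "the quick" "quick brown" "brown fox" "fox jumps"

The plotNGrams function defined above applies the same tokenizer to the full samples: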
plotNGrams(x = sampleBlogs, subTitle = "Blogs", num = 10)
plotNGrams(x = sampleTwitter, subTitle = "Twitter", num = 10)
## Future Goals and Plans

This concludes the initial exploratory analysis of the major features of the data. The next steps are as follows:
- To build a simple n-gram model of the relationships between words, to predict the next word based on the previous 1, 2, or 3 words (a rough sketch of this idea is shown below).
- To build a predictive text mining application, deployed as a new data product in a Shiny app, that suggests the most likely next word based on the user's input.
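As a preview of the first step, the following is a minimal sketch (with illustrative names such as predictNext; this is not the final model) of how a bigram frequency table built with RWeka could drive a naive next-word lookup on the merged sample. The eventual model will extend this to higher-order n-grams.

# Minimal sketch of next-word prediction from a bigram table (illustrative only)
library(RWeka)
bigramTokens <- NGramTokenizer(sampleData, Weka_control(min = 2, max = 2))
bigramFreq <- as.data.frame(table(bigramTokens), stringsAsFactors = FALSE)
colnames(bigramFreq) <- c("Bigram", "Freq")
# Return the n most frequent words observed immediately after 'word'
predictNext <- function(word, n = 3) {
  hits <- bigramFreq[grepl(paste0("^", word, " "), bigramFreq$Bigram), ]
  hits <- hits[order(hits$Freq, decreasing = TRUE), ]
  sub(paste0("^", word, " "), "", head(hits$Bigram, n))
}
predictNext("happy")  # e.g. the three most frequent words following "happy" in the sample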