This milestone report is part of the Data Science Capstone project from the "Data Science Specialization" learning path on Coursera, offered by Johns Hopkins University. The capstone class allows students to create a usable, public data product that can be used to show their skills to potential employers. Projects are drawn from real-world problems and are conducted with industry, government, and academic partners. This project is supported by SwiftKey, the corporate partner in this capstone.
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
“I went to the”
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
In this project we will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, you will use the knowledge you gained in data products to build a predictive text product you can show off to your family, friends, and potential employers.
The training data that forms the basis for most of the capstone is provided on the Coursera site. For this project I use the data downloaded from the link below, as instructed in the course.
The text documents are available in four languages: English (en_US, ~550MB), German (de_DE, ~240MB), Russian (ru_RU, ~325MB) and Finnish (fi_FI, ~220MB), each in the form of blogs, news and twitter files. I am planning to use the English (en_US) data for this analysis.
I have downloaded the data manually, but it can also be downloaded programmatically using the script below:
setwd("D:\\dscapstone") # Project Workspace
wbsource<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip" # Data Scource
projFileName<-"Coursera-SwiftKey.zip" # Destination.
download.file(wbsource, projWorkspace) # Download and save
unzip(projFileName) # Uncompress file
In this capstone project, a series of tasks is performed to achieve the overall objective: to build and evaluate a predictive text model. The n-gram and backoff models built in the earlier tasks will be used for this, with the aim of making the model both efficient and accurate.
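As a preview of that modeling step, below is a minimal sketch of a backoff-style next-word lookup. It is not the final model; the tables tri_freq and bi_freq (with columns prefix, word and frequency) are hypothetical placeholders for the n-gram frequency tables built later in the project.
# Backoff sketch: try the trigram table first, then fall back to the bigram table
predict_next <- function(phrase, tri_freq, bi_freq) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    hits <- tri_freq[tri_freq$prefix == paste(words[n - 1], words[n]), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$frequency)])
  }
  hits <- bi_freq[bi_freq$prefix == words[n], ]  # back off to bigrams
  if (nrow(hits) > 0) return(hits$word[which.max(hits$frequency)])
  NA_character_                                  # no match found
}
# Example (once the hypothetical tables exist): predict_next("I went to the", tri_freq, bi_freq)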
I am planning to use the following R packages. They need to be installed with install.packages() if not already available, and then loaded with library().
## Load Packages
library("tm")
library("stringi")
library("wordcloud")
library("clue")
library("ggplot2")
library("RColorBrewer")
library("SnowballC")
library("RWeka")
library("qdap")
Before going ahead, it is important to download the data and keep it in the right path. So I check whether the files already exist and, if not, download them from the web source.
setwd("D:\\dscapstone")
# Check if the file has been extracted
if (!file.exists("./final")) {
if(!file.exists("Swiftkey.zip")){
url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "Swiftkey.zip", method = "curl")
dateDownload <- date()
}
SwiftKey.zip<- "Coursera-SwiftKey.zip"
outDir<-"."
unzip(SwiftKey.zip,exdir=outDir)
}
For this project I am using the English (US) documents (en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt), although documents in German, Russian and Finnish are also available.
I have used the readLines() function to read the text files; for files of this size we could also read from a file connection in chunks (see the sketch after the code below).
# Read files in the R using
us_blogs<-readLines("final\\en_US\\en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
us_news<-readLines("final\\en_US\\en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
us_twitter<-readLines("final\\en_US\\en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
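For reference, a minimal sketch of the connection-based approach, which reads a large file in chunks instead of all at once (the chunk size of 5,000 lines is just an example):
# Open a read connection and pull the file in chunks
con <- file("final\\en_US\\en_US.twitter.txt", "r")
chunk <- readLines(con, n = 5000, encoding = "UTF-8", skipNul = TRUE)  # first 5,000 lines
# ...further readLines(con, n = 5000) calls continue where the last one stopped
close(con)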
# Summary US blogs
summary(us_blogs)
## Length Class Mode
## 899288 character character
# Structure of the data
str(us_blogs)
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods." ...
# A few sample lines from US blogs
us_blogs[3:5]
## [1] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [2] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [3] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
# Summary US News
summary(us_news)
## Length Class Mode
## 77259 character character
str(us_news)
## chr [1:77259] "He wasn't home alone, apparently." ...
# A few sample lines from US news
us_news[3:5]
## [1] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [2] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [3] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
# Summary US Twitter
summary(us_twitter)
## Length Class Mode
## 2360148 character character
str(us_twitter)
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
# A few sample lines from US twitter
us_twitter[3:5]
## [1] "they've decided its more fun if I don't."
## [2] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [3] "Words from a complete stranger! Made my birthday even better :)"
# Longest line (in characters) in each file
longest_lines <- c(max(nchar(us_blogs)), max(nchar(us_news)), max(nchar(us_twitter)))
type <- c("US-Blogs", "US-News", "US-Twitter")
longest_lines <- data.frame(file.name = type, line.length = longest_lines)
print(longest_lines)
# Shortest line (in characters) in each file
shortest_lines <- c(min(nchar(us_blogs)), min(nchar(us_news)), min(nchar(us_twitter)))
type <- c("US-Blogs", "US-News", "US-Twitter")
shortest_lines <- data.frame(file.name = type, line.length = shortest_lines)
print(shortest_lines)
## file.name line.length
## 1 US-Blogs 1
## 2 US-News 2
## 3 US-Twitter 2
Since we have large datasets to work with, we will use a representative sample for this project to improve the performance of the Shiny app and the predictive models. I have randomly selected representative samples of the data for analysis and modeling using the sample() function, which reduces memory usage and improves the performance of the Shiny application.
# Set random seed so that samples do not change
set.seed(12345)
# Sample from US-Blogs
# Help ?sample
us_blogs_sample <- sample(us_blogs, length(us_blogs)*0.01)
# Sample from US-News
us_news_sample <- sample(us_news, length(us_news)*0.10)
# Sample from US Twitter
us_twitter_sample <- sample(us_twitter, length(us_twitter)*0.01)
# Combine all the sample
us_data_sample <- c(us_blogs_sample,us_news_sample,us_twitter_sample)
# remove data to make some space
rm(list=c("type", "us_blogs", "us_news", "us_twitter", "us_blogs_sample", "us_news_sample", "us_twitter_sample"))
# Quick check on data structure
us_data_sample[1:3]
## [1] "And now Im home. Older, wiser, a little slimmer and hopefully secure in the knowledge of what Im good at, and what I want to do with my time."
## [2] "I turned the Today show on to catch up on the morning's news and immediately I knew something terrible had happened in New York City."
## [3] "Ill take this opportunity to diverge from the usual Take Three path, and, instead of focusing on one last role, offer up an Arkin Remix - a concisely-potted overview. Arkin has long been seen as one of the exemplary supporting actors. So many of his roles before his resurgence in popularity during the 90s and 00s, to present day, were memorable; its hard to single one last role out. He added charm and a studious commitment to characterising a range of films from his debut (Thats Me, in 1963) onwards."
summary(us_data_sample)
## Length Class Mode
## 40318 character character
Before analyzing the text data, we will preprocess it: parse, clean and transform the text as necessary for better results and predictions.
The main tools are iconv() with the latin1 option, to drop non-ASCII characters, and gsub() with regular expressions such as [^0-9a-z], to replace special characters, collapse repeated letters, remove numbers and normalize whitespace. See ?iconv, ?gsub and ?regex for more details on these functions and on regular expressions.
# Remove non-English (non-ASCII) characters
# Help ?iconv
us_data_sample<-iconv(us_data_sample, "latin1", "ASCII", sub="")
# Replace special characters with spaces
# Help ?gsub
us_data_sample_1 <- gsub("[^0-9a-z]", " ", us_data_sample, ignore.case = TRUE)
rm(us_data_sample)
# Collapse runs of a repeated letter to at most two (e.g. "soooo" -> "soo")
us_data_sample_1 <- gsub('([[:alpha:]])\\1+', '\\1\\1', us_data_sample_1)
# Replace numbers with spaces
us_data_sample_1 <- gsub("[^a-z]", " ", us_data_sample_1, ignore.case = TRUE)
# Collapse multiple spaces into one
us_data_sample_1 <- gsub("\\s+", " ", us_data_sample_1)
# Trim leading and trailing spaces
us_data_sample_1 <- gsub("^\\s", "", us_data_sample_1)
us_data_sample_1 <- gsub("\\s$", "", us_data_sample_1)
# Summary
summary(us_data_sample_1)
## Length Class Mode
## 40318 character character
str(us_data_sample_1)
## chr [1:40318] "And now Im home Older wiser a little slimmer and hopefully secure in the knowledge of what Im good at and what "| __truncated__ ...
Create a volatile (in-memory) corpus using the VCorpus() function.
# create Corpus
# Help ??VCorpus
myCorpus <- VCorpus(VectorSource(us_data_sample_1))
rm(us_data_sample_1)
Perform the necessary transformation/preprocessing steps using tm_map() from the tm package. The objective is to obtain clean text by removing stop words, punctuation, extra whitespace etc. We will perform the following transformations. For example, "My name Is Ariful" will be converted to lower case: "my name is ariful".
# Help ??tm_map
# Convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# Remove Stop Words
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
# Remove Punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# Remove Numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# Create plain text documents
myCorpus <- tm_map(myCorpus, PlainTextDocument)
# Stem words in a text document using Porter's stemming algorithm.
myCorpus <- tm_map(myCorpus, stemDocument, "english")
# Strip White Spaces
myCorpus <- tm_map(myCorpus, stripWhitespace)
# Remove text within brackets using qdap library
#myCorpus_1<-bracketX(myCorpus)
# Replace numbers with words
#myCorpus_1<-replace_number(myCorpus_1)
# Replace abbreviations
#myCorpus_1<-replace_abbreviation(myCorpus_1)
General differences between basic n-gram models are explained at http://recognize-speech.com/language-model/n-gram-model/comparison.
As an example of splitting a phrase into uni-, bi- and trigrams, "I went to the gym" yields the unigrams "I", "went", "to", "the", "gym"; the bigrams "I went", "went to", "to the", "the gym"; and the trigrams "I went to", "went to the", "to the gym".
Now we will use TermDocumentMatrix() to create a term-document matrix, a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a term-document matrix, rows correspond to terms and columns correspond to documents in the collection (a document-term matrix is simply the transpose). There are various schemes for determining the value that each entry in the matrix should take; here we use raw term counts. Read more on n-grams on Wikipedia.
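To make the structure concrete, here is a toy illustration (the two short documents are made up for this sketch):
# Build a tiny corpus of two made-up documents and inspect its term-document matrix
toy <- VCorpus(VectorSource(c("the cat sat on the mat", "the cat ran")))
inspect(TermDocumentMatrix(toy))
# Rows are terms, columns are the two documents, entries are raw term counts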
Now we will find the most frequently occurring words (unigrams, bigrams and trigrams) and visualize them.
# Unigram
uni_token <- function(x) {NGramTokenizer(x, Weka_control(min = 1, max = 1))}
uni_tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = uni_token))
#uni_tdm <- removeSparseTerms(uni_tdm, 0.95)
uni_corpus <- findFreqTerms(uni_tdm,lowfreq = 50)
uni_corpus_freq <- rowSums(as.matrix(uni_tdm[uni_corpus,]))
uni_corpus_freq <- data.frame(word=names(uni_corpus_freq), frequency=uni_corpus_freq)
df1<- uni_corpus_freq[order(-uni_corpus_freq$frequency),][1:20,]
df1<-df1[complete.cases(df1),]
The 20 most frequently occurring words are:
df1[1:20,]
## word frequency
## just just 2974
## get get 2968
## will will 2961
## one one 2959
## like like 2947
## can can 2940
## said said 2516
## time time 2492
## year year 2215
## day day 2196
## love love 2025
## make make 1961
## good good 1906
## know know 1800
## now now 1785
## new new 1775
## work work 1682
## thank thank 1567
## see see 1543
## want want 1524
# Word cloud of the most frequent unigrams in the sample data
wordcloud(words = df1$word, freq = df1$frequency, min.freq = 50,
max.words=50, random.order=TRUE, rot.per=0.75,
colors=brewer.pal(8, "Dark2"), c(5,.5), vfont=c("script","plain"))
barplot(df1[1:20, ]$frequency, las = 2, names.arg = df1[1:20, ]$word,
        col = df1[1:20, ]$frequency, main = "",
        ylab = "Word frequencies", cex.axis = .5, cex = .5, cex.lab = 0.75, cex.main = .75)
#Bigram
bi_token <- function(x) {NGramTokenizer(x, Weka_control(min = 2, max = 2))}
bi_tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = bi_token))
bi_corpus <- findFreqTerms(bi_tdm,lowfreq = 10)
bi_corpus_freq <- rowSums(as.matrix(bi_tdm[bi_corpus,]))
bi_corpus_freq <- data.frame(word=names(bi_corpus_freq), frequency=bi_corpus_freq)
df2 <- bi_corpus_freq[order(-bi_corpus_freq$frequency),][1:20,]
wordcloud(words = df2$word, freq = df2$frequency, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.75,
colors=brewer.pal(8, "Dark2"), c(5,.5), vfont=c("script","plain"))
ggplot(df2, aes(x = word, y = frequency)) +
geom_bar(stat = "identity", fill = "orange") +
labs(title = " ") +
xlab("Words") +
ylab("Count")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Trigram
tri_token <- function(x) {NGramTokenizer(x, Weka_control(min = 3, max = 3))}
tri_tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = tri_token))
tri_corpus <- findFreqTerms(tri_tdm,lowfreq = 5)
tri_corpus_freq <- rowSums(as.matrix(tri_tdm[tri_corpus,]))
tri_corpus_freq <- data.frame(word=names(tri_corpus_freq), frequency=tri_corpus_freq)
df3<-tri_corpus_freq[order(-tri_corpus_freq$frequency),][1:20,]
wordcloud(words = df3$word, freq = df3$frequency, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.60,
colors=brewer.pal(8, "Dark2"), c(5,.5), vfont=c("script","plain"))
ggplot(df3, aes(x = word, y = frequency)) +
geom_bar(stat = "identity", fill = "#FF6666") +
labs(title = " ") +
xlab("Words") +
ylab("Count")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
output:
html_document:
toc: true
toc_depth: 5
toc_float: true
number_sections: true
code_folding: hide
To know more about formatting the HTML output, see http://rmarkdown.rstudio.com/html_document_format.html#table_of_contents.