This is the Milestone Report for the Coursera Data Science Specialization offered by Johns Hopkins University. The Capstone project is done in collaboration with SwiftKey, and the goal of this project is to design a Shiny application, written in R, with text prediction capabilities. This report outlines the exploratory analysis of the dataset and the plans for implementing the text prediction algorithm.
The dataset comprises the output of crawls of news sites, blogs and Twitter, downloaded from Datasets. It contains three files for each of four languages (Russian, Finnish, German and English); this project will focus on the English language files. The names of the data files are as follows:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

The datasets will be referred to as “Blogs”, “Twitter” and “News” for the remainder of this report.
if(!file.exists("Coursera-SwiftKey.zip")){
    # Download the dataset
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  "Coursera-SwiftKey.zip")
    Download_Date <- Sys.time()
    Download_Date
    # "2016-03-15 13:30:48 GMT"

    # Unzip the dataset
    unzip("Coursera-SwiftKey.zip")
} else {
    print("Dataset is already downloaded!")
}
There is a minor problem with the News dataset: it contains an unusual character on line 77,259. To address this issue, a small piece of code was written to remove the character before processing the dataset.
if(!file.exists("./en_US/en_US.news_edit.txt")){
    con <- file("./en_US/en_US.news.txt", "rb")
    News_Data <- readLines(con)
    close(con)

    # Remove the odd symbol (SUB, \032) on line 77259
    News_Data <- gsub("\032", "", News_Data, ignore.case=FALSE, perl=TRUE)

    # Write the cleaned text to a new file, then close the connection
    out_con <- file("./en_US/en_US.news_edit.txt", "w")
    writeLines(News_Data, out_con)
    close(out_con)

    # Move the original out of ./en_US so only the edited file is picked up later
    file.rename("./en_US/en_US.news.txt", "./en_US.news.txt")
} else {
    con <- file("./en_US/en_US.news_edit.txt", "rb")
    News_Data <- readLines(con)
    close(con)
}
#Load libraries
library(NLP)
## Warning: package 'NLP' was built under R version 3.2.3
library(tm)
## Warning: package 'tm' was built under R version 3.2.3
library(stringi)
## Warning: package 'stringi' was built under R version 3.2.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.4
library(data.table)
## Warning: package 'data.table' was built under R version 3.2.4
#Generate Corpus for text analysis
cname <- file.path(".","en_US")
docs <- Corpus(DirSource(cname))
The first part of this exploratory analysis is to determine the basic characteristics of each dataset. These characteristics are shown in the table below.
| Dataset | File Size (bytes) | Number of Lines | Smallest entry | Largest entry |
|---|---|---|---|---|
| Blogs | 210160014 | 899288 | 1 | 40835 |
| Twitter | 167105338 | 2360148 | 2 | 213 |
| News | NA | 77259 | 2 | 5760 |
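The code used to produce these figures is not reproduced in this report. Below is a minimal sketch of how such a summary might be computed with the stringi package loaded above; the summarize_file helper and the use of stri_length to measure entry lengths are illustrative assumptions, not the exact code behind the table.

# Sketch: summarize one data file (file size, line count, shortest/longest entry)
summarize_file <- function(path){
    lines <- readLines(path, skipNul=TRUE, warn=FALSE)
    lens <- stri_length(lines)
    data.frame(File_Size=file.info(path)$size,
               Lines=length(lines),
               Smallest=min(lens),
               Largest=max(lens))
}
summarize_file("./en_US/en_US.blogs.txt")
summarize_file("./en_US/en_US.twitter.txt")
summarize_file("./en_US/en_US.news_edit.txt")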
Each of the datasets (Blogs, Twitter and News) is large enough that processing time becomes a concern. To address this, a representative random sample of each dataset was taken for the remainder of this analysis. The characteristics of each subset are outlined in the table below.
#Limit Dataset to a random subset of 20% of the data
set.seed(1337)
Subset <- docs
Subset[[1]]$content <- Subset[[1]]$content[as.logical(rbinom(length(Subset[[1]]$content),
1, prob=0.2))]
Subset[[2]]$content <- Subset[[2]]$content[as.logical(rbinom(length(Subset[[2]]$content),
1, prob=0.2))]
Subset[[3]]$content <- Subset[[3]]$content[as.logical(rbinom(length(Subset[[3]]$content),
1, prob=0.2))]
| Dataset | File Size (bytes) | Number of Lines | Smallest entry | Largest entry |
|---|---|---|---|---|
| Subset Blogs | 52658576 | 179180 | 1 | 12421 |
| Subset Twitter | 63762584 | 471986 | 2 | 213 |
| Subset News | 4035936 | 15495 | 3 | 1346 |
Before the subsetted data can be fully analyzed, it needs to be pre-processed to standardize the words and characters in each dataset. An example entry from the Blogs dataset is shown below:
## [1] "Point 2: If itâ<U+0080><U+0099>s a show that your kid wants, a show about a book is always better than CRACCC."
#Plot Word Frequencies
ggplot(wf[wf$freq>60000, ], aes(x=word, y=freq)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("") +
ylab("Frequency") +
ggtitle("Words that appear over 60,000\ntimes in the three Datasets")
The high frequency of “connecting” words such as “the”, “and” and “that” suggests that word frequency alone will not be sufficient for text prediction. The next analysis looks at common word combinations.
For brevity, the N-gram analysis in this report was limited to 2-grams.
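The tokenization code that produces WF_Ngram for the plot below is not shown. A sketch of how it might be built with RWeka's NGramTokenizer (loaded above) follows; the control settings and object names are assumptions for illustration.

# Sketch: tokenize the sampled corpus into 2-grams and tally their frequencies
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
tdm_bigram <- TermDocumentMatrix(Subset, control=list(tokenize=BigramTokenizer))
freq_bigram <- sort(slam::row_sums(tdm_bigram), decreasing=TRUE)
WF_Ngram <- data.frame(word=names(freq_bigram), freq=freq_bigram)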
#Plot Word Frequencies
ggplot(WF_Ngram[WF_Ngram$freq>16000, ], aes(x=word, y=freq)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("") +
ylab("Frequency") +
ggtitle("Two Word Combinations (2-grams)that appear\n over 16,000 times in the three Datasets")
The distribution of 2-grams gives an idea of the prevalence of prepositions in natural language. The text prediction model will have to take this into account.
The plan is to develop a text prediction application (a Shiny app written in R) that uses the frequencies of 4-grams, 3-grams and 2-grams to estimate the most likely word to follow the entered text. The main challenge will be offering valid predictions when an entered N-gram has not been observed in the dataset. In these cases the algorithm will likely fall back to a list of “non-common” words (i.e. factoring out words like “the”, “and” and “that”) and estimate the best possible choice.
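As a rough illustration of the back-off idea described above, the sketch below looks up the longest available context in pre-computed N-gram frequency tables and falls back to shorter contexts when no match is found. The tables (quadgrams, trigrams, bigrams, each assumed to have columns prefix, word and freq) and the predict_next helper are hypothetical placeholders, not the final application code.

# Sketch: frequency-based back-off lookup over pre-computed N-gram tables
predict_next <- function(input, quadgrams, trigrams, bigrams, n=3){
    words <- tolower(unlist(strsplit(input, "\\s+")))

    # Return the n most frequent completions for a given context string
    lookup <- function(dt, ctx){
        hits <- dt[dt$prefix == ctx, ]
        head(hits[order(-hits$freq), ]$word, n)
    }

    # Try the longest available context first, then back off to shorter N-grams
    if(length(words) >= 3){
        res <- lookup(quadgrams, paste(tail(words, 3), collapse=" "))
        if(length(res) > 0) return(res)
    }
    if(length(words) >= 2){
        res <- lookup(trigrams, paste(tail(words, 2), collapse=" "))
        if(length(res) > 0) return(res)
    }
    lookup(bigrams, tail(words, 1))
}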
This analysis was performed with the following R session and platform configuration:
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_Malaysia.1252 LC_CTYPE=English_Malaysia.1252
## [3] LC_MONETARY=English_Malaysia.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Malaysia.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.9.6 RWeka_0.4-24 ggplot2_2.1.0 stringi_1.0-1
## [5] tm_0.6-2 NLP_0.1-9
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.0 knitr_1.11 magrittr_1.5
## [4] RWekajars_3.7.12-1 munsell_0.4.2 colorspace_1.2-6
## [7] stringr_1.0.0 plyr_1.8.3 tools_3.2.0
## [10] parallel_3.2.0 grid_3.2.0 gtable_0.1.2
## [13] htmltools_0.3 yaml_2.1.13 digest_0.6.9
## [16] rJava_0.9-8 formatR_1.2 evaluate_0.8
## [19] slam_0.1-32 rmarkdown_0.6.1 labeling_0.3
## [22] scales_0.3.0 chron_2.3-47
1: SwiftKey - http://swiftkey.com/en/
2: Datasets - https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip