This milestone report is part of the Data Science Specialization Capstone by Johns Hopkins University on Coursera. To make the project possible, the course creators have partnered with SwiftKey, a company that builds a smart keyboard that makes it easier for people to type on their mobile devices. The final goal of the project is to develop a similar smart keyboard with the help of various predictive models, which means exploring the area of natural language processing in depth. The aim of this report is to demonstrate the first essential skills required to tackle this objective: loading and processing the large text data sets and carrying out exploratory analysis of their main features. Finally, we will outline the next steps required to create the prediction algorithm and deploy it within a user-friendly Shiny web application.
The document consists of the following parts:
Loading the data
Exploratory data analysis
Processing the data and creating corpora
Tokenization
Visualisation
Key findings
Next steps
The data set for this project is available at the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. To proceed with the tasks in scope of this project, let's first download the data and unzip it into a separate directory.
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the archive only if it is not already present, then extract it
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(fileURL, destfile = "Coursera-SwiftKey.zip", method = "curl")
}
unzip("Coursera-SwiftKey.zip")
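As a quick optional sanity check (assuming the archive extracted into the final/ directory as expected), we can list the English files and their sizes:
# List the extracted English files and their sizes in megabytes
en_files <- list.files("final/EN_US", full.names = TRUE)
data.frame(file = basename(en_files),
           size_MB = round(file.info(en_files)$size / 1024^2))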
This project requires several R packages. Having installed them, let's load them all up front.
library(ggplot2) # data sets visualisation
library(stringi) # string manipulation
library(tm) # main package used in text mining
library(wordcloud) # for creation of word clouds
The source data consists of four folders, one per language. For the purposes of our analysis we will use the en_US data sets. To load the full data sets into R, we'll use the readLines() function:
twitter_lines <- readLines("final/EN_US/en_US.twitter.txt")
blogs_lines <- readLines("final/EN_US/en_US.blogs.txt")
news_lines <- readLines("final/EN_US/en_US.news.txt")
all_lines <- c(twitter_lines, blogs_lines, news_lines)
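Note: on some platforms readLines() may warn about embedded nulls or an incomplete final line for the news file and stop early. If that happens, a hedged workaround is to open the connection in binary mode and skip the null characters:
# Workaround if the news file is truncated by readLines() in text mode
con <- file("final/EN_US/en_US.news.txt", open = "rb")
news_lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)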
In this section, we'll perform basic exploratory analysis of all three data sets, including the length of lines in characters, the number of lines and, finally, the number of words in each data set.
# Shortest non-empty line, average line length and longest line (in characters)
shortest <- Inf; longest <- 0; total <- 0; cnt <- 0
for (i in 1:length(twitter_lines)) {
    length <- nchar(twitter_lines[i])
    if (length > longest) longest <- length
    if (length != 0 && length < shortest) shortest <- length
    if (length > 0) cnt <- cnt + 1    # count non-empty lines
    total <- total + length
}
shortest; round(total / cnt); longest
## [1] 2
## [1] 69
## [1] 213
Having run the same code for the other data sets, we get the following statistics for line length in characters. Note that the shortest line has only 1 character, while the longest is almost 41,000 characters long.
| Length of line | Twitter | Blogs | News |
|---|---|---|---|
| Shortest line | 2 | 1 | 2 |
| Average line | 69 | 232 | 203 |
| Longest line | 213 | 40835 | 5760 |
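The same figures can be reproduced for all three data sets at once; a compact sketch (the helper name line_length_stats is ours):
# Shortest non-empty, average and longest line length for each data set
line_length_stats <- function(lines) {
    len <- nchar(lines)
    c(shortest = min(len[len > 0]),
      average = round(mean(len[len > 0])),
      longest = max(len))
}
sapply(list(Twitter = twitter_lines, Blogs = blogs_lines, News = news_lines),
       line_length_stats)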
total_raw <- lapply(list(twitter_lines, blogs_lines, news_lines), stri_count_words)
stats <- data.frame(
    data_set = c("Twitter", "Blogs", "News"),
    t(rbind(sapply(list(twitter_lines, blogs_lines, news_lines), stri_stats_general),
            total_words = sapply(list(twitter_lines, blogs_lines, news_lines), stri_stats_latex)[4, ])),
    all_data = rbind(summary(total_raw[[1]]), summary(total_raw[[2]]), summary(total_raw[[3]]))
)
print(stats)
## data_set Lines LinesNEmpty Chars CharsNWhite total_words
## 1 Twitter 2360148 2360148 162384825 134370864 30556524
## 2 Blogs 899288 899288 208361438 171926076 37746231
## 3 News 77259 77259 15683765 13117038 2661443
## all_data.Min. all_data.1st.Qu. all_data.Median all_data.Mean
## 1 1 7 12 12.79289
## 2 0 9 29 42.29050
## 3 1 19 32 34.81225
## all_data.3rd.Qu. all_data.Max.
## 1 18 47
## 2 60 6725
## 3 46 1123
To start with, as we deal with a fairly large amount of data (approx. 556 MB for the three English data sets), the tasks of the project can be resource-intensive. We will assume that it is enough to restrict the data to a sample of lines: this might be a reasonably accurate approximation of the results that could have been obtained using all the data. We'll start data processing by loading only the first 1,000 lines of each of the three data sets (a randomly drawn sample, sketched after the code below, would serve the same purpose).
con <- file("final/EN_US/en_US.twitter.txt", 'r')
twitter_lines <- readLines(con, n = 1000)
close(con)
con <- file("final/EN_US/en_US.blogs.txt", 'r')
blogs_lines <- readLines(con, n = 1000)
close(con)
con <- file("final/EN_US/en_US.news.txt", 'r')
news_lines<- readLines(con, n = 1000)
close(con)
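As an alternative to taking the first 1,000 lines, one could draw a random sample instead. A minimal sketch, which would replace the chunk above and assumes the full vectors from the exploratory section are still loaded:
# Draw a reproducible random sample of 1,000 lines from each data set
set.seed(1234)
twitter_lines <- sample(twitter_lines, 1000)
blogs_lines <- sample(blogs_lines, 1000)
news_lines <- sample(news_lines, 1000)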
Next, we’ll create data corpora.
en_twitter <- VCorpus(VectorSource(twitter_lines))
en_blogs <- VCorpus(VectorSource(blogs_lines))
en_news <- VCorpus(VectorSource(news_lines))
Having created the corpora, we can proceed with data cleansing and processing. It will be carried out in the following order:
Removal of punctuation
Removal of numbers
Conversion to lower case
Removal of extra whitespace
en_twitter<- tm_map(en_twitter, removePunctuation)
en_blogs <- tm_map(en_blogs, removePunctuation)
en_news <- tm_map(en_news, removePunctuation)
en_twitter <- tm_map(en_twitter, removeNumbers)
en_blogs <- tm_map(en_blogs, removeNumbers)
en_news <- tm_map(en_news, removeNumbers)
# content_transformer() keeps the documents as PlainTextDocuments while lower-casing,
# so no separate conversion back to plain-text documents is needed afterwards
en_twitter <- tm_map(en_twitter, content_transformer(tolower))
en_blogs <- tm_map(en_blogs, content_transformer(tolower))
en_news <- tm_map(en_news, content_transformer(tolower))
en_twitter <- tm_map(en_twitter, stripWhitespace)
en_blogs <- tm_map(en_blogs, stripWhitespace)
en_news <- tm_map(en_news, stripWhitespace)
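Since the three corpora go through identical steps, the same pipeline could also be wrapped in a small helper and applied to each corpus in turn (a sketch; the name clean_corpus is ours):
# Apply the full cleaning pipeline to a single corpus
clean_corpus <- function(corpus) {
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, content_transformer(tolower))
    tm_map(corpus, stripWhitespace)
}
# Usage: en_twitter <- clean_corpus(VCorpus(VectorSource(twitter_lines)))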
As a first step, we'll perform basic preparation for tokenization: generation of the document-term matrices. A DTM, by a widely accepted definition, is a matrix with documents as rows and terms as columns, whose elements are the counts (or weights) of the terms in the documents. Additionally, we remove sparse terms. These operations are carried out for each data set.
twitter_dtm <- DocumentTermMatrix(en_twitter)
twitter_dtm <- removeSparseTerms(twitter_dtm, 0.999)
blogs_dtm <- DocumentTermMatrix(en_blogs)
blogs_dtm <- removeSparseTerms(blogs_dtm, 0.999)
news_dtm <- DocumentTermMatrix(en_news)
news_dtm <- removeSparseTerms(news_dtm, 0.999)
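To get a feel for the structure of a DTM, we can inspect a small corner of it (an optional check, not part of the original analysis):
# A few documents (rows) and terms (columns) of the Twitter DTM
inspect(twitter_dtm[1:5, 1:5])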
In the next chunk of code, we'll identify the most frequent terms in each data set.
twitter_freq <- colSums(as.matrix(twitter_dtm))
twitter_ordered <- order(twitter_freq)
blogs_freq <- colSums(as.matrix(blogs_dtm))
blogs_ordered <- order(blogs_freq)
news_freq <- colSums(as.matrix(news_dtm))
news_ordered <- order(news_freq)
twitter_freq[tail(twitter_ordered,20)]
## but can one with was what like not all have its are just this your
## 50 50 50 52 55 56 57 57 58 58 58 62 62 69 75
## that for and you the
## 103 144 180 215 417
blogs_freq[tail(blogs_ordered,20)]
## just what like they about one from all are not but have
## 128 128 137 137 138 141 149 156 165 173 222 244
## this you was with for that and the
## 266 290 317 361 396 540 1229 2046
news_freq[tail(news_ordered,20)]
## more you who will has they its not are from have his but was with
## 101 104 105 105 108 112 129 133 139 139 139 145 173 204 251
## said that for and the
## 259 332 386 795 1874
In Natural Language Processing (NLP), tokenization refers to breaking up a sequence of strings into pieces, called tokens, such as words, keywords, phrases and symbols. Tokens can be individual words, phrases or even whole sentences. Next, we'll define functions for extracting 2-gram and 3-gram word structures from the cleaned text corpora and use them in the subsequent analysis.
# Tokenizers that split a document into 2-gram and 3-gram tokens
two_gram_tokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
three_gram_tokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
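To illustrate what ngrams() produces, here is a toy example on a hand-made token vector (not project data):
# ngrams() returns every run of n consecutive tokens, which we paste back together
toy_tokens <- c("to", "be", "or", "not", "to", "be")
unlist(lapply(ngrams(toy_tokens, 2), paste, collapse = " "), use.names = FALSE)
# expected: "to be" "be or" "or not" "not to" "to be"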
In the following chunks, we again generate DTMs, remove sparse terms and organize the data by frequency, this time for 2-grams and 3-grams respectively.
#Document term matrices for 2-grams for each data set
dtm_twitter_two <- DocumentTermMatrix(en_twitter, control = list(tokenize = two_gram_tokenizer))
dtm_twitter_two_removed <- removeSparseTerms(dtm_twitter_two, 0.999)
freq_twitter_two <- colSums(as.matrix(dtm_twitter_two_removed)); order_twitter_two <- order(freq_twitter_two)
dtm_blogs_two <- DocumentTermMatrix(en_blogs, control = list(tokenize = two_gram_tokenizer))
dtm_blogs_two_removed <- removeSparseTerms(dtm_blogs_two, 0.999)
freq_blogs_two <- colSums(as.matrix(dtm_blogs_two_removed)); order_blogs_two <- order(freq_blogs_two)
dtm_news_two <- DocumentTermMatrix(en_news, control = list(tokenize = two_gram_tokenizer))
dtm_news_two_removed <- removeSparseTerms(dtm_news_two, 0.999)
freq_news_two <- colSums(as.matrix(dtm_news_two_removed)); order_news_two <- order(freq_news_two)
#Document term matrices for 3-grams for each data set
dtm_twitter_three <- DocumentTermMatrix(en_twitter, control = list(tokenize = three_gram_tokenizer))
dtm_twitter_three_removed <- removeSparseTerms(dtm_twitter_three, 0.999)
freq_twitter_three <- colSums(as.matrix(dtm_twitter_three_removed)); order_twitter_three <- order(freq_twitter_three)
dtm_blogs_three <- DocumentTermMatrix(en_blogs, control = list(tokenize = three_gram_tokenizer))
dtm_blogs_three_removed <- removeSparseTerms(dtm_blogs_three, 0.999)
freq_blogs_three <- colSums(as.matrix(dtm_blogs_three_removed)); order_blogs_three <- order(freq_blogs_three)
dtm_news_three <- DocumentTermMatrix(en_news, control = list(tokenize = three_gram_tokenizer))
dtm_news_three_removed <- removeSparseTerms(dtm_news_three, 0.999)
freq_news_three <- colSums(as.matrix(dtm_news_three_removed)); order_news_three <- order(freq_news_three)
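The most frequent 3-grams can then be listed in the same way as the single terms above, for instance:
# Top 10 3-grams in the Twitter sample
freq_twitter_three[tail(order_twitter_three, 10)]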
In the following plots we show the 2-grams that appear most frequently in each data set.
ggplot(subset(data.frame(word=names(freq_twitter_two), freq=freq_twitter_two), freq_twitter_two>12), aes(word, freq)) +
geom_bar(stat="identity", color = "blue", fill = "darkblue") +
ggtitle("Twitter data set")
ggplot(subset(data.frame(word=names(freq_blogs_two), freq=freq_blogs_two) , freq_blogs_two>50), aes(word, freq)) +
geom_bar(stat="identity", color = "magenta", fill = "red") +
ggtitle("Blogs data set")
ggplot(subset(data.frame(word=names(freq_news_two), freq=freq_news_two), freq_news_two>30), aes(word, freq)) +
geom_bar(stat="identity", color = "black", fill = "green") +
ggtitle("News data set")
A word cloud can also be a very useful tool when you need to highlight the most frequently occurring terms in a text with a quick visualization. In the following section, we'll create word clouds of the top 40 3-grams in each of the three data sets.
wordcloud(names(freq_twitter_three), freq_twitter_three, max.words=40, scale=c(3, .5), colors=brewer.pal(6, "Accent"))
wordcloud(names(freq_blogs_three), freq_blogs_three, max.words=40, scale=c(3,.3), colors=brewer.pal(7, "Set2"))
wordcloud(names(freq_news_three), freq_news_three, max.words=40, scale=c(2,.1), colors=brewer.pal(5, "Paired"))
It took less than a minute to load and process all the scripts in this exercise (on a reasonably powerful laptop). We significantly restricted the size of the data by loading only small chunks, to keep the resource requirements manageable. In the future, to handle the large data sets more efficiently without losing prediction quality, it might be reasonable to apply parallel computing.
Additionally, I would suggest storing the large amounts of data in a local database, such as SQLite, MS SQL Server or MS Access.
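For example, the n-gram frequency tables could be persisted in SQLite; a minimal sketch, assuming the RSQLite package (not used in this report) is installed:
library(RSQLite)
# Store the Twitter 2-gram frequencies in a local SQLite database
con <- dbConnect(SQLite(), "ngrams.db")
dbWriteTable(con, "twitter_bigrams",
             data.frame(ngram = names(freq_twitter_two), freq = freq_twitter_two),
             overwrite = TRUE)
dbDisconnect(con)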
As was noted before, the data are very heterogeneous: line lengths range from a single character up to roughly 41,000 characters, and the sizes of the data sets themselves also differ considerably.
Stop words and swear words were not removed. In some cases, it might also be advisable to filter out profanity (depending on the type and purpose of the application and on who the final user is).
Among the most frequent words we find articles, pronouns, particles, auxiliary verbs and prepositions.
In the next steps of the capstone project, we'll dig deeper into predictive modelling. We might base it on n-gram tokenization, explore text mining techniques such as "Bag of Words", apply sentiment analysis to improve predictions, and so on. Having settled on a prediction model, we will continue the project by designing and developing a Shiny application.
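As a rough illustration of the direction (not the final model), the 2-gram frequencies already computed could drive a naive next-word guess; the helper predict_next below is hypothetical:
# Guess the next word as the second half of the most frequent 2-gram
# starting with the given word (illustrative only)
predict_next <- function(word, bigram_freq) {
    hits <- bigram_freq[startsWith(names(bigram_freq), paste0(word, " "))]
    if (length(hits) == 0) return(NA_character_)
    sub("^\\S+\\s+", "", names(which.max(hits)))
}
predict_next("for", freq_twitter_two)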
The session information is attached below:
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251
## [3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C
## [5] LC_TIME=Russian_Russia.1251
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] wordcloud_2.6 RColorBrewer_1.1-2 tm_0.7-5
## [4] NLP_0.2-0 stringi_1.1.7 ggplot2_3.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 pillar_1.2.3 compiler_3.5.0 plyr_1.8.4
## [5] bindr_0.1.1 tools_3.5.0 digest_0.6.15 evaluate_0.10.1
## [9] tibble_1.4.2 gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.1
## [13] yaml_2.1.19 parallel_3.5.0 bindrcpp_0.2.2 withr_2.1.2
## [17] dplyr_0.7.5 stringr_1.3.1 knitr_1.20 xml2_1.2.0
## [21] rprojroot_1.3-2 grid_3.5.0 tidyselect_0.2.4 glue_1.2.0
## [25] R6_2.2.2 rmarkdown_1.10 purrr_0.2.5 magrittr_1.5
## [29] backports_1.1.2 scales_1.0.0 htmltools_0.3.6 assertthat_0.2.0
## [33] colorspace_1.3-2 labeling_0.3 lazyeval_0.2.1 munsell_0.5.0
## [37] slam_0.1-43