This milestone report is part of the Capstone course of the Data Science Specialization on Coursera. The goal of the project is to build a predictive text model using data provided by SwiftKey. The data was obtained by scraping the internet for blog posts, news articles, and tweets. The code used is collected in an appendix at the end of the report.
The data comes from three separate files, one each for the blogs, news, and Twitter sources. I start the report by analysing each dataset on its own; later, I combine them and analyse them together. Before exploring the data I will remove common words known as stop words (words like: the, is, and, etc.), as shown in the appendix.
First, let's look at the size of the datasets and the mean length of a blog post, a news article, and a tweet.
| | Blogs | News | Twitter |
|---|---|---|---|
| Corpus size (documents) | 899,288 | 1,010,242 | 2,360,148 |
| Mean length | 290 | 201 | 72 |
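Here corpus size is the number of documents read from each file, and mean length is, presumably, measured in characters per document. A minimal sketch of how such figures can be computed, shown for the blogs set:

length(blogs)        # corpus size: number of documents
mean(nchar(blogs))   # mean document length in characters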
We see that the corpora differ both in document count and in document length. When we combine them into one big set later on, we should give the Twitter set a higher weight than the others to ensure an even distribution across sources.
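A minimal sketch of how such a weighting could work, assuming the aim is for each source to contribute roughly the same total volume of text (the target value and helper name below are hypothetical):

# Hypothetical sketch: sample so each source contributes about the same
# total number of characters; shorter documents (tweets) get more draws.
target_chars <- 50e6
sample_docs <- function(x, target) {
  n <- ceiling(target / mean(nchar(x)))
  sample(x, size = min(n, length(x)))
}
balanced <- c(sample_docs(blogs, target_chars),
              sample_docs(news, target_chars),
              sample_docs(twitter, target_chars))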
First I make a plot of the most common words in each of the three datasets.
We see many similarities between the sources, but also some differences. The Twitter data contains more abbreviations like lol and rt, the news data contains words like police and bounty, and the blogs dataset seems to have the most ordinary vocabulary.
Now that we have looked at the most common individual words, we can look at the top n-grams: sequences of n consecutive words. Since the dataset is so big, we will only explore a random 20% sample drawn from the combined set. When analysing the n-grams I will not remove stop words, since the end goal of this project is text prediction, not just analysis.
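To make the n-gram idea concrete, a minimal tidytext sketch on a made-up sentence:

data_frame(text = 'thanks for the follow') %>%
  unnest_tokens(digram, text, token = 'ngrams', n = 2)
# yields the digrams: 'thanks for', 'for the', 'the follow'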
Let's see which n-grams are used the most. I have also added a simple word-count plot from the combined dataset.
One simple idea for text prediction is to split the data into quadgrams and predict the fourth word from the first three. If only two preceding words are available, we can fall back to trigrams, and so on. Let's plot a connected graph of our quadgrams: it draws an arrow from one word to another when the first word frequently leads into the second in the texts.
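A minimal sketch of this lookup, assuming the quadgrams count table (columns word1 to word4 plus n) built in the appendix; the helper name and example inputs are hypothetical:

# Predict the most frequent fourth word following three given words
predict_word <- function(w1, w2, w3, quadgrams) {
  hits <- quadgrams %>%
    filter(word1 == w1, word2 == w2, word3 == w3) %>%
    arrange(desc(n))
  if (nrow(hits) > 0) hits$word4[1] else NA_character_
}
# e.g. predict_word('at', 'the', 'end', quadgrams)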
We see that if we were to simply move between words on the graph in the direction of the arrows, we would get sentences that sort of make sense. This gives some good ideas for how the model should look. My plan is to look into Markov models: they model the probability of moving from one state (rainy weather, word number 1, etc.) to another state (sunny weather, word number 2, etc.), which makes them well suited to text prediction.
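As a first step in that direction, the digram counts built in the appendix can be turned into estimated transition probabilities; a minimal sketch:

# Estimate P(word2 | word1) by normalising digram counts within each word1
transitions <- digrams %>%
  group_by(word1) %>%
  mutate(p = n / sum(n)) %>%
  ungroup()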
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] purrr_0.2.4 Hmisc_4.1-1 Formula_1.2-2
## [4] survival_2.41-3 lattice_0.20-35 scales_0.5.0
## [7] knitr_1.18 reshape2_1.4.3 ggraph_1.0.0
## [10] igraph_1.1.2 stringr_1.2.0 readr_1.1.1
## [13] gridExtra_2.3 ggthemes_3.4.0 cluster_2.0.6
## [16] wordcloud_2.5 RColorBrewer_1.1-2 ggplot2_2.2.1
## [19] tm_0.7-3 NLP_0.1-11 tidytext_0.1.6
## [22] tibble_1.4.1 tidyr_0.7.2 dplyr_0.7.4
## [25] plyr_1.8.4
##
## loaded via a namespace (and not attached):
## [1] viridis_0.4.0 viridisLite_0.2.0 splines_3.4.3
## [4] assertthat_0.2.0 highr_0.6 latticeExtra_0.6-28
## [7] yaml_2.1.16 slam_0.1-42 ggrepel_0.7.0
## [10] pillar_1.0.1 backports_1.1.2 glue_1.2.0
## [13] digest_0.6.13 checkmate_1.8.5 colorspace_1.3-2
## [16] htmltools_0.3.6 Matrix_1.2-12 psych_1.7.8
## [19] pkgconfig_2.0.1 broom_0.4.3 tweenr_0.1.5
## [22] ggforce_0.1.1 htmlTable_1.11.1 nnet_7.3-12
## [25] lazyeval_0.2.1 mnormt_1.5-5 magrittr_1.5
## [28] evaluate_0.10.1 tokenizers_0.1.4 janeaustenr_0.1.5
## [31] nlme_3.1-131 SnowballC_0.5.1 MASS_7.3-48
## [34] xml2_1.1.1 foreign_0.8-69 tools_3.4.3
## [37] data.table_1.10.4-3 hms_0.4.0 munsell_0.4.3
## [40] bindrcpp_0.2 compiler_3.4.3 rlang_0.1.6
## [43] units_0.5-1 grid_3.4.3 rstudioapi_0.7
## [46] htmlwidgets_0.9 labeling_0.3 base64enc_0.1-3
## [49] rmarkdown_1.8 gtable_0.2.0 R6_2.2.2
## [52] udunits2_0.13 bindr_0.1 rprojroot_1.3-2
## [55] stringi_1.1.6 parallel_3.4.3 Rcpp_0.12.14
## [58] rpart_4.1-11 acepack_1.4.1
# Libraries
library(plyr)        # load before dplyr to avoid masking issues
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)
library(ggthemes)
library(gridExtra)
library(igraph)
library(ggraph)
library(grid)
# Directories
topdir <- getwd()
readdir <- file.path(topdir, 'final', 'en_US')
# Reading the data
blogs <- readLines(con = file.path(readdir, 'en_US.blogs.txt'), skipNul = TRUE)
news <- readLines(con = file.path(readdir, 'en_US.news.txt'), skipNul = TRUE)
twitter <- readLines(con = file.path(readdir, 'en_US.twitter.txt'), skipNul = TRUE)
# Create word-count tables for each source
data("stop_words")
## Blogs
blogdf <- data_frame(text = blogs)
blogwords <- blogdf %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% 0:100) %>%   # drop plain number tokens (0-100)
  anti_join(stop_words) %>%      # drop stop words
  count(word, sort = TRUE)
## News
newsdf <- data_frame(text = news)
newswords <- newsdf %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% 0:100) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Twitter
twitterdf <- data_frame(text = twitter)
twitterwords <- twitterdf %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% 0:100) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
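## NOTE: sketch filling a gap in the appendix. The combined data frame `df`
## and the `words`, `digrams`, and `trigrams` tables used below were not
## shown; they are reconstructed here following the 20% random sample
## described in the report, mirroring the quadgram code. The seed value
## is hypothetical.
set.seed(1234)
alltext <- c(blogs, news, twitter)
df <- data_frame(text = sample(alltext, size = floor(0.2 * length(alltext))))
## Word counts from the combined sample (stop words kept for prediction)
words <- df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
## Digrams
digrams <- df %>%
  unnest_tokens(digram, text, token = 'ngrams', n = 2) %>%
  separate(digram, c('word1', 'word2'), sep = ' ') %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n > 3)
## Trigrams
trigrams <- df %>%
  unnest_tokens(trigram, text, token = 'ngrams', n = 3) %>%
  separate(trigram, c('word1', 'word2', 'word3'), sep = ' ') %>%
  count(word1, word2, word3, sort = TRUE) %>%
  filter(n > 3)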
## Quadgrams from the combined sample
quadgrams <- df %>%
  unnest_tokens(quadgrams, text, token = 'ngrams', n = 4) %>%
  separate(quadgrams, c('word1', 'word2', 'word3', 'word4'), sep = ' ') %>%
  count(word1, word2, word3, word4, sort = TRUE) %>%
  filter(n > 3)   # keep quadgrams seen more than three times
# Frequency plots of the top 20 words and n-grams
g_words <- ggplot(words[1:20,], aes(x = reorder(word, n), y = n)) +
  geom_col() + theme_tufte() + coord_flip() +
  xlab('Word') + ylab('Frequency') + ggtitle('Words')
di_words <- ggplot(digrams[1:20,], aes(x = reorder(paste(word1, word2), n), y = n)) +
  geom_col() + theme_tufte() + coord_flip() +
  xlab('Word Pair') + ylab('Frequency') + ggtitle('Digrams')
tri_words <- ggplot(trigrams[1:20,], aes(x = reorder(paste(word1, word2, word3), n), y = n)) +
  geom_col() + theme_tufte() + coord_flip() +
  xlab('Word Trio') + ylab('Frequency') + ggtitle('Trigrams')
quad_words <- ggplot(quadgrams[1:20,],
                     aes(x = reorder(paste(word1, word2, word3, word4), n), y = n)) +
  geom_col() + theme_tufte() + coord_flip() +
  xlab('Word Quartet') + ylab('Frequency') + ggtitle('Quadgrams')
grid.arrange(g_words, di_words, tri_words, quad_words, ncol = 2)
Only the quadgram connection graph was shown in the report, but I kept the code for the digram and trigram graphs in, just in case.
# Common arrow style for the connection graphs
a <- arrow(type = 'closed', length = unit(.15, 'inches'))
digram_graph <- digrams %>%
  filter(n > 6500) %>%
  graph_from_data_frame()
d <- ggraph(digram_graph, 'fr') +
  geom_edge_link(aes(edge_alpha = n), edge_colour = 'grey30', show.legend = FALSE,
                 arrow = a, end_cap = circle(.12, 'inches')) +
  geom_node_point(color = 'lightpink', size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() + ggtitle('Connection Graph of Digrams')
# Make digram edges from trigram counts: each trigram contributes the
# pairs (word1, word2) and (word2, word3)
tgc1 <- trigrams[, c(1, 2, 4)] %>%
  filter(n > 100)
tgc2 <- trigrams[, c(2, 3, 4)] %>%
  filter(n > 100)
colnames(tgc2) <- c('word1', 'word2', 'n')
tgc <- rbind(tgc1, tgc2) %>%
  arrange(desc(n))
trigram_graph <- tgc %>%
  filter(n > 1200) %>%
  graph_from_data_frame()
t <- ggraph(trigram_graph, 'fr') +
  geom_edge_link(aes(edge_alpha = n), edge_colour = 'grey30', show.legend = FALSE,
                 arrow = a, end_cap = circle(.12, 'inches')) +
  geom_node_point(color = 'lightblue', size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() + ggtitle('Connection Graph of Trigrams')
# Make digram edges from quadgram counts: each quadgram contributes the
# pairs (word1, word2), (word2, word3), and (word3, word4)
qgcfilt <- quadgrams %>%
  filter(n > 100)
qgc1 <- qgcfilt[, c(1, 2, 5)]
qgc2 <- qgcfilt[, c(2, 3, 5)]
colnames(qgc2) <- c('word1', 'word2', 'n')
qgc3 <- qgcfilt[, c(3, 4, 5)]
colnames(qgc3) <- c('word1', 'word2', 'n')
qgc <- rbind(qgc1, qgc2, qgc3) %>%
  ddply(.variables = c('word1', 'word2'), numcolwise(sum)) %>%   # sum duplicate pairs
  arrange(desc(n)) %>%
  dplyr::as_data_frame()
qgcgraph <- qgc %>%
  filter(n > 700) %>%
  graph_from_data_frame()
q <- ggraph(qgcgraph, 'fr') +
  geom_edge_link(aes(edge_alpha = n), edge_colour = 'grey30', show.legend = FALSE,
                 arrow = a, end_cap = circle(.12, 'inches')) +
  geom_node_point(color = 'goldenrod', size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() + ggtitle('Connection Graph of Quadgrams')