Synopsis

This milestone report is part of the Capstone Course of the Data Science Specialization on Coursera. The goal of the project is to build a predictive text model using data provided by SwiftKey. The data was obtained by scraping the web for blog posts, news articles, and tweets. The code used for the report can be found in an appendix at the end.

Data Reading and Cleaning

The data comes from three separate files, one each for the blogs, news, and Twitter sources. I start the report by analysing each dataset on its own; later, I combine them and analyse them together. Before exploring the data I remove common words known as stop words, such as "the", "is", and "and".

First, let's look at the size of each dataset and the mean length of a blog post, a news article, and a tweet.

Size of datasets and mean document length

                           Blogs     News       Twitter
Corpus size (documents)    899,288   1,010,242  2,360,148
Mean length (characters)   290       201        72
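
These figures can be reproduced directly from the raw character vectors read in the appendix; a minimal sketch of how they could be computed, assuming mean length is measured in characters per document:

# Sketch: corpus sizes (number of documents) and mean document length in
# characters, computed from the raw vectors read in the appendix.
sizes <- c(Blogs = length(blogs), News = length(news), Twitter = length(twitter))
lengths <- c(Blogs = mean(nchar(blogs)), News = mean(nchar(news)), Twitter = mean(nchar(twitter)))
rbind('Corpus size' = sizes, 'Mean length' = round(lengths))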

We see that document length differs considerably between the corpora: tweets are much shorter than blog posts or news articles. When we combine them into one big set later on, we should give the Twitter set a higher weight than the others to keep the distribution even.

Exploratory Analysis

First I make a plot of the most common words in each of the three datasets.

Wordcount Graphs

We see a lot of similarities between the sources, but also some differences. The Twitter data contains more abbreviations like "lol" and "rt", the news data contains words like "police" and "bounty", while the blogs dataset seems to have the most ordinary vocabulary.

Blending the Datasets

Now that we have looked at the most frequent individual words, we can look at the top n-grams, i.e. sequences of n consecutive words. Since the dataset is so large, I only explore a random sample of 20% drawn from the combined set. When analysing the n-grams I do not remove stop words, since the end goal of this project is text prediction, not just analysis.

Exploratory plots

Word Counts

Let's see which n-grams are used the most. I have also added a simple word-count plot of the combined dataset.

Graphs

One simple idea for text prediction is to split the data into quadgrams and predict the fourth word from the first three. If only two words precede the word to be predicted, we could fall back to trigrams, and so on. Let's plot a connected graph of our quadgrams: it draws an arrow from one word to another when the first word frequently leads into the second in the text.
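
As a rough sketch of that idea, the prediction step could be a simple lookup in the quadgram count table built in the appendix (the helper name below is hypothetical):

# Hypothetical lookup sketch: given three preceding words, return the
# most frequent fourth words seen in the quadgram counts.
predict_next <- function(w1, w2, w3, counts = quadgrams, top = 3) {
    counts %>%
        filter(word1 == w1, word2 == w2, word3 == w3) %>%
        arrange(desc(n)) %>%
        head(top) %>%
        pull(word4)
}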

We see that simply walking along the arrows of the graph produces sentences that more or less make sense. This gives some good ideas for how the model should look. My plan is to look into Markov models: they describe the probability of moving from one state (rainy weather, word number 1, etc.) to another state (sunny weather, word number 2, etc.), which makes them well suited to text prediction.
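
In that framing, the quadgram counts translate directly into estimated transition probabilities: P(word4 | word1 word2 word3) is the quadgram count divided by the count of its leading trigram. A minimal sketch of that computation, again assuming the quadgrams table from the appendix:

# Sketch: turn quadgram counts into estimated transition probabilities
# P(word4 | word1, word2, word3) = n(quadgram) / n(leading trigram).
quadgram_probs <- quadgrams %>%
    group_by(word1, word2, word3) %>%
    mutate(p = n / sum(n)) %>%
    ungroup() %>%
    arrange(desc(n))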

Appendix

Code from the report

Session Info

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] purrr_0.2.4        Hmisc_4.1-1        Formula_1.2-2     
##  [4] survival_2.41-3    lattice_0.20-35    scales_0.5.0      
##  [7] knitr_1.18         reshape2_1.4.3     ggraph_1.0.0      
## [10] igraph_1.1.2       stringr_1.2.0      readr_1.1.1       
## [13] gridExtra_2.3      ggthemes_3.4.0     cluster_2.0.6     
## [16] wordcloud_2.5      RColorBrewer_1.1-2 ggplot2_2.2.1     
## [19] tm_0.7-3           NLP_0.1-11         tidytext_0.1.6    
## [22] tibble_1.4.1       tidyr_0.7.2        dplyr_0.7.4       
## [25] plyr_1.8.4        
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.4.0       viridisLite_0.2.0   splines_3.4.3      
##  [4] assertthat_0.2.0    highr_0.6           latticeExtra_0.6-28
##  [7] yaml_2.1.16         slam_0.1-42         ggrepel_0.7.0      
## [10] pillar_1.0.1        backports_1.1.2     glue_1.2.0         
## [13] digest_0.6.13       checkmate_1.8.5     colorspace_1.3-2   
## [16] htmltools_0.3.6     Matrix_1.2-12       psych_1.7.8        
## [19] pkgconfig_2.0.1     broom_0.4.3         tweenr_0.1.5       
## [22] ggforce_0.1.1       htmlTable_1.11.1    nnet_7.3-12        
## [25] lazyeval_0.2.1      mnormt_1.5-5        magrittr_1.5       
## [28] evaluate_0.10.1     tokenizers_0.1.4    janeaustenr_0.1.5  
## [31] nlme_3.1-131        SnowballC_0.5.1     MASS_7.3-48        
## [34] xml2_1.1.1          foreign_0.8-69      tools_3.4.3        
## [37] data.table_1.10.4-3 hms_0.4.0           munsell_0.4.3      
## [40] bindrcpp_0.2        compiler_3.4.3      rlang_0.1.6        
## [43] units_0.5-1         grid_3.4.3          rstudioapi_0.7     
## [46] htmlwidgets_0.9     labeling_0.3        base64enc_0.1-3    
## [49] rmarkdown_1.8       gtable_0.2.0        R6_2.2.2           
## [52] udunits2_0.13       bindr_0.1           rprojroot_1.3-2    
## [55] stringi_1.1.6       parallel_3.4.3      Rcpp_0.12.14       
## [58] rpart_4.1-11        acepack_1.4.1

Reading the data

# Directories
topdir <- getwd()
readdir <- paste(topdir, '/final/en_US', sep = '')
setwd(readdir)
# Reading the Data
blogs <- readLines(con = 'en_US.blogs.txt', skipNul = TRUE)
news <- readLines(con = 'en_US.news.txt', skipNul = TRUE)
twitter <- readLines(con = 'en_US.twitter.txt', skipNul = TRUE)
setwd(topdir)

Unnesting tokens

# Create word-count tables for each source
data("stop_words")
## Blogs
blogdf <- data_frame(text = blogs)
blogwords <- blogdf %>%
    unnest_tokens(word, text) %>%
    filter(!word %in% 0:100) %>%   # drop tokens that are just the numbers 0-100
    anti_join(stop_words) %>%      # remove stop words
    count(word, sort = TRUE)

## News
newsdf <- data_frame(text = news)
newswords <- newsdf %>%
    unnest_tokens(word, text) %>%
    filter(!word %in% 0:100) %>%
    anti_join(stop_words) %>%
    count(word, sort = TRUE)

## Twitter
twitterdf <- data_frame(text = twitter)
twitterwords <- twitterdf %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    filter(!word %in% 0:100) %>%
    count(word, sort = TRUE)

Quadgrams
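
The quadgram code below works on a combined data frame df whose construction is not shown above. A minimal sketch consistent with the 20% sample described in the report (the seed is an arbitrary choice):

# Assumed construction of the combined data frame used below:
# pool the three sources and keep a 20% random sample of the documents.
set.seed(1234)  # arbitrary seed for reproducibility
alltext <- c(blogs, news, twitter)
df <- data_frame(text = sample(alltext, size = round(0.2 * length(alltext))))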

## Quadgram
quadgrams <- df %>%
    unnest_tokens(quadgrams, text, token = 'ngrams', n = 4) %>% 
    separate(quadgrams, c('word1', 'word2', 'word3', 'word4'), sep = ' ') %>%
    count(word1, word2, word3, word4, sort = TRUE) %>%
    filter(n > 3) 

Count plots
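
The plots and word graphs below also use words, digrams, and trigrams count tables. Their construction is not shown in the original; a sketch analogous to the quadgram code above:

# Assumed companions to the quadgram table: word, digram and trigram
# counts built from the same sampled data frame df.
words <- df %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE)

digrams <- df %>%
    unnest_tokens(digram, text, token = 'ngrams', n = 2) %>%
    separate(digram, c('word1', 'word2'), sep = ' ') %>%
    count(word1, word2, sort = TRUE)

trigrams <- df %>%
    unnest_tokens(trigram, text, token = 'ngrams', n = 3) %>%
    separate(trigram, c('word1', 'word2', 'word3'), sep = ' ') %>%
    count(word1, word2, word3, sort = TRUE)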

g_words <- ggplot(words[1:20,], aes(x=reorder(word, n), y=n)) + geom_col() + theme_tufte() +
    coord_flip() + xlab('Word') + ylab('Frequency') + ggtitle('Words')

di_words <- ggplot(digrams[1:20,], aes(x=reorder(paste(word1, word2), n), y=n)) + 
    geom_col() + theme_tufte() + coord_flip() + xlab('Word Pair') + ylab('Frequency') +
    ggtitle('Digrams')

tri_words <- ggplot(trigrams[1:20,], aes(x=reorder(paste(word1, word2, word3), n), y=n)) +
    geom_col() + theme_tufte() + coord_flip() + xlab('Word Trio') + ylab('Frequency') + 
    ggtitle('Trigrams')


quad_words <- ggplot(quadgrams[1:20,], aes(x=reorder(paste(word1, word2, word3, word4), n), y=n)) +
    geom_col() + theme_tufte() + coord_flip() + ylab('Frequency') + xlab('Word Quartet') +
    ggtitle('Quadgrams')

grid.arrange(g_words, di_words, tri_words,  quad_words, ncol = 2) 

Word graph

I only show the quadgram graph in the report, but I kept the code for the digram and trigram graphs as well, just in case.

a  <- arrow(type = 'closed', length = unit(.15, 'inches'))

digram_graph <- digrams %>%
    filter(n > 6500) %>%
    graph_from_data_frame()

d <- ggraph(digram_graph, 'fr') +
    geom_edge_link(aes(edge_alpha = n), edge_colour = 'grey30', show.legend = FALSE,
                   arrow = a, end_cap = circle(.12, 'inches')) + 
    geom_node_point(color = 'lightpink', size = 5) +
    geom_node_text(aes(label = name),repel = TRUE) +
    theme_void() + ggtitle('Connection Graph of Digrams')


# Make Digram from Trigram Counts
tgc1 <- trigrams[,c(1,2,4)] %>%
    filter(n>100)
tgc2 <- trigrams[,c(2,3,4)] %>%
    filter(n>100)
colnames(tgc2) <- c('word1', 'word2', 'n')
tgc <- rbind(tgc1, tgc2) %>%
    arrange(desc(n))
trigram_graph <- tgc %>%
    filter(n > 1200) %>%
    graph_from_data_frame()
t <- ggraph(trigram_graph, 'fr') +
    geom_edge_link(aes(edge_alpha = n), edge_colour = 'grey30', show.legend = FALSE,
                   arrow = a, end_cap = circle(.12, 'inches')) + 
    geom_node_point(color = 'lightblue', size = 5) +
    geom_node_text(aes(label = name),repel = TRUE) +
    theme_void() + ggtitle('Connection Graph of Trigrams') 
# Make Digram from Quadgram Counts
qgcfilt <- quadgrams %>%
    filter(n > 100)
qgc1 <- qgcfilt[,c(1,2,5)]
qgc2 <- qgcfilt[,c(2,3,5)]
colnames(qgc2) <- c('word1', 'word2', 'n')
qgc3 <- qgcfilt[,c(3,4,5)]
colnames(qgc3) <- c('word1', 'word2', 'n')
qgc <- rbind(qgc1, qgc2, qgc3) %>%
    ddply(.variables = c('word1', 'word2'), numcolwise(sum)) %>%
    arrange(desc(n)) %>%
    dplyr::as_data_frame()
qgcgraph <- qgc %>%
    filter(n > 700) %>%
    graph_from_data_frame()
q <- ggraph(qgcgraph, 'fr') +
    geom_edge_link(aes(edge_alpha = n), edge_colour = 'grey30', show.legend = FALSE,
                   arrow = a, end_cap = circle(.12, 'inches')) + 
    geom_node_point(color = 'goldenrod', size = 5) +
    geom_node_text(aes(label = name),repel = TRUE) +
    theme_void() + ggtitle('Connection Graph of Quadgrams')