Milestone Report

Synopsis
Getting and Processing the Data
Visualizing Word Frequency
Planning the Prediction Model

library(tidytext)
library(tibble)
library(dplyr)
library(stringr)
library(kableExtra)
library(tidyr)
library(wordcloud)
library(ggplot2)
library(scales)
library(gridExtra)
library(reshape2)
library(igraph)
library(ggraph)
library(data.table)

Synopsis

The data for this project was provided by SwiftKey and can be found https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The dataset comprises of three character vectors: news, blogs, and twitter with each index representing a single sample (i.e. a full article or tweet).

The goal of this project is to produce an application that accepts a string of words and predicts the next word using a corpus sample extracted from the SwiftKey dataset.

The large corpus of text documents was processed, cleaned, analyzed and visualized using a number of language processing packages in R including tidytext, stringr and wordcloud to name a few. tidytext is especially useful as it provides the capability to interact with textual data using ‘dplyr’ like functions for quick and easy exploration.

A Markov Chain model is produced using quadgrams, trigrams, and bigrams structured into frequency tables. The tables allow for easy determination of the highest probability ngram given a series of words as an input.

In a future project I plan to explore the use of LSTM deep learning to compare the effectiveness of these different methods at predicting the next word in a sentence.

Getting and Processing the Data

Importing Raw Data

newsRaw <- readLines('../data/final/en_US/en_US.news.txt', encoding = 'UTF-8')
blogsRaw <- readLines('../data/final/en_US/en_US.blogs.txt', encoding = 'UTF-8')
twitterRaw <- readLines('../data/final/en_US/en_US.twitter.txt', encoding = 'UTF-8')

Comparison of Unprocessed Data

tibble(Source = c('news', 'blogs', 'twitter'),
       'Number of Samples' = c(length(newsRaw),
                               length(blogsRaw),
                               length(twitterRaw)),
       'Total Characters' = c(sum(nchar(newsRaw)), 
                              sum(nchar(blogsRaw)), 
                              sum(nchar(twitterRaw))),
       'Average Characters' = c(round(mean(nchar(newsRaw))),
                                round(mean(nchar(blogsRaw))),
                                round(mean(nchar(twitterRaw))))
       ) %>%
        kbl() %>%
        kable_styling(bootstrap_options = c("striped", 
                                            "hover", 
                                            "condensed"),
                      full_width = F)

Source	Number of Samples	Total Characters	Average Characters
news	77259	15639408	202
blogs	899288	206824505	230
twitter	2360148	162096031	69

Processing the Raw Data

After taking a look at the least common ‘words’ in the files, many were found to contain large amounts of white space, special characters, numbers, and non-english words. I kept only ‘alpha’ characters and spaces for this project. Any web urls also were removed.

news <- gsub('(https?|www.)\\S*', '', newsRaw, perl = T)
news <- gsub('[^a-zA-Z[:space:]]', '', news, perl = T)
news <- str_squish(news)
blogs <- gsub('(https?|www.)\\S*', '', blogsRaw, perl = T)
blogs <- gsub('[^a-zA-Z[:space:]]', '', blogs, perl = T)
blogs <- str_squish(blogs)
twitter <- gsub('(https?|www.)\\S*', '', twitterRaw, perl = T)
twitter<- gsub('[^a-zA-Z[:space:]]', '', twitter, perl = T)
twitter <- str_squish(twitter)

Creation of a data table containing samples

tidyData <- rbind(tibble(source = 'news', 
                         text = news,
                         sample = 1:length(news)),
                  tibble(source = 'blogs', 
                         text = blogs,
                         sample = 1:length(blogs)),
                  tibble(source = 'twitter', 
                         text = twitter,
                         sample = 1:length(twitter)))
tidyData$source <- factor(tidyData$source, 
                          levels = c('news', 'blogs', 'twitter'))

The starting data consists of 3 character vectors containing a sample at each index. Each vector was first converted to a data frame (here, a ‘tibble’) with the a line number and text column for easier handling and viewing.

Tokenization

My goal of this section is to create a tidy text format: A table with one-token-per-row. A token is a meaningful unit of text, such as a word or punctuation. Tokenization is the process of turning text into tokens.

unnest_tokens() is a part of the tidytext package in R and splits a column of character strings into tokens (one-token-per-row).

tidyWords <- tidyData %>%
        unnest_tokens(word, text)

tidyWords %>%
        group_by(source) %>%
        count(word, name = 'Times Word Occurs') %>%
        count(`Times Word Occurs`, name = 'Number of Words') %>%
        arrange(`Times Word Occurs`)

## # A tibble: 7,135 x 3
## # Groups:   source [3]
##    source  `Times Word Occurs` `Number of Words`
##    <fct>                 <int>             <int>
##  1 news                      1             42022
##  2 blogs                     1            230477
##  3 twitter                   1            316891
##  4 news                      2             11912
##  5 blogs                     2             44980
##  6 twitter                   2             49061
##  7 news                      3              6128
##  8 blogs                     3             21405
##  9 twitter                   3             21107
## 10 news                      4              4009
## # ... with 7,125 more rows

Blogs and twitter have significantly more words that appear only 1 time, likely due to incorrect spelling or non-words used in these lax sources.

The data frame stop_words is also provided by the tidytext package which provides 1149 stop words from three lexicons (‘SMART’, ‘onix’, or ‘snowball’). Stop words are the most common words that provide little to no insight into our data, which is why they’re removed before analysis. The FCC ‘7 banned words’ were used to remove profanity.

profanity <- tibble(word = readLines('../data/curses'))
tidyWords <- tidyWords %>% anti_join(profanity)
cleanWords <- tidyWords %>% anti_join(stop_words)

Visualizing Word Frequency

Data that is useful in this exploration phase is frequency count of words and the proportion of words to compare sources to each other.

commonWords <- cleanWords %>%
        group_by(source) %>%
        count(word, sort = T, name = 'freq') %>%
        mutate(prop = freq / sum(freq)) %>%
        ungroup()

Frequency Table

First I made a table with the top 10 most common words, showing their frequency.

commonWordsTable <- cbind(commonWords %>%
                                  filter(source == 'news') %>%
                                  select(word, freq) %>%
                                  slice_head(n = 10),
                           commonWords %>%
                                   filter(source == 'blogs') %>%
                                   select(word, freq) %>%
                                   slice_head(n = 10), 
                           commonWords %>%
                                   filter(source == 'twitter') %>%
                                   select(word, freq) %>%
                                   slice_head(n = 10))

commonWordsTable %>%
        kbl() %>%
        kable_styling(bootstrap_options = c("striped", 
                                            "hover", 
                                            "condensed"),
                      full_width = F) %>%
        add_header_above(c('News' = 2, 'Blogs' = 2, 'Twitter' = 2))

News		Blogs		Twitter
word	freq	word	freq	word	freq
time	3992	time	88143	im	158535
people	3667	im	66951	love	105583
city	2828	people	59220	dont	90104
percent	2629	dont	56261	day	90060
school	2611	day	50485	rt	88697
game	2368	love	44736	time	74698
home	2343	life	39721	lol	66704
million	2343	ive	33617	people	51496
dont	2272	world	28524	happy	48536
county	2237	didnt	27666	follow	47326

Frequency Plots

Using ggplot2 the most frequent words can be visualized very easily:

commonWords %>% 
        group_by(source) %>% 
        arrange(-prop) %>% 
        slice_head(n = 10) %>%
        ungroup() %>%
        (function(df) merge(df, count(df, word, name = 'overlap'))) %>%
# propGraph <- merge(propGraph, count(propGraph, word, name = 'overlap'))
# propGraph %>%
        ggplot(aes(x = prop, y = word, fill = factor(overlap))) +
        geom_col() +
        facet_grid(~source) +
        # theme(legend.title = element_text('# of Overlaps'))
        labs(y = NULL, x = '%') +
        scale_fill_discrete(name = 'Overlaps Between Sources:') +
        theme(legend.position = 'top', legend.justification = 0)

The words ‘time’, ‘people’, and ‘dont’ are found in all 3 top 10 words, while ‘love’, ‘im’, and ‘day’ are common only between blogs and twitter.

Coverage

Sorting words by frequency gives a metric to visualize how much of the data is coverved with a specific number of top words.

coverageData <- tidyWords %>% 
        count(word, sort = T, name = 'freq') %>% 
        mutate(prop = freq / sum(freq)) %>%
        mutate(coverage = cumsum(prop)) %>%
        mutate(numwords = row_number())

coverageData %>% 
        ggplot(aes(numwords, coverage)) +
        geom_line(size = 1) +
        scale_x_log10() +
        scale_y_continuous(labels = percent_format())

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

rbind(
        coverageData %>% 
                filter(coverage >= 0.5) %>% 
                slice_head(n = 1) %>%
                select(coverage, numwords),
        coverageData %>% 
                filter(coverage >= 0.9) %>% 
                slice_head(n = 1) %>%
                select(coverage, numwords)
)

## # A tibble: 2 x 2
##   coverage numwords
##      <dbl>    <int>
## 1    0.500      120
## 2    0.900     6863

Word Plots

plotWords <- commonWords %>% 
        select(-freq) %>% 
        filter(prop >= 0.00001) %>% 
        spread(source, prop)
nbplot <- plotWords %>%
        ggplot(aes(x = blogs, y = news, color = abs(news - blogs))) +
        geom_abline(color = 'gray40', lty = 2) +
        geom_jitter(alpha = 0.1, size = 2, width = 0.2, height = 0.2) +
        geom_text(aes(label = word), check_overlap = T, vjust = 1.5) +
        scale_x_log10(labels = percent_format()) + 
        scale_y_log10(labels = percent_format()) +
        scale_color_gradient(limits = c(0, 0.0001),
                             low = 'darkslategray4', high = 'gray75') +
        theme(legend.position = 'none')
ntplot <- plotWords %>%
        ggplot(aes(x = twitter, y = news, color = abs(news - twitter))) +
        geom_abline(color = 'gray40', lty = 2) +
        geom_jitter(alpha = 0.1, size = 2, width = 0.2, height = 0.2) +
        geom_text(aes(label = word), check_overlap = T, vjust = 1.5) +
        scale_x_log10(labels = percent_format()) + 
        scale_y_log10(labels = percent_format()) +
        scale_color_gradient(limits = c(0, 0.0001),
                             low = 'darkslategray4', high = 'gray75') +
        theme(legend.position = 'none')
tbplot <- plotWords %>%
        ggplot(aes(x = blogs, y = twitter, color = abs(twitter - blogs))) +
        geom_abline(color = 'gray40', lty = 2) +
        geom_jitter(alpha = 0.1, size = 2, width = 0.2, height = 0.2) +
        geom_text(aes(label = word), check_overlap = T, vjust = 1.5) +
        scale_x_log10(labels = percent_format()) + 
        scale_y_log10(labels = percent_format()) +
        scale_color_gradient(limits = c(0, 0.0001),
                             low = 'darkslategray4', high = 'gray75') +
        theme(legend.position = 'none')
grid.arrange(nbplot, ntplot, tbplot, nrow = 1)

Wordclouds

The package tidytext is very handy for manipulating textual data with dplyr and other common R operations. However, some functions require a ‘term frequency matrix’ as their input rather than a data frame. The reshape2 package allows easy transformation to a matrix and vice versa.

In addition to transforming a data frame into a matrix, the acast function can also ‘spread’ the data to a ‘wide’ format at the same time (similar to the spread function in tidyr).

cleanWords %>%
        filter('news' == 'news') %>%
        inner_join(get_sentiments('bing')) %>%
        count(word, sentiment, sort = T) %>%
        acast(word ~ sentiment, value.var = 'n', fill = 0) %>%
        comparison.cloud(colors = c('firebrick3', 'dodgerblue3'))

cleanWords %>%
        filter('blogs' == 'blogs') %>%
        inner_join(get_sentiments('bing')) %>%
        count(word, sentiment, sort = T) %>%
        acast(word ~ sentiment, value.var = 'n', fill = 0) %>%
        comparison.cloud(colors = c('firebrick3', 'dodgerblue3'))

cleanWords %>%
        filter('twitter' == 'twitter') %>%
        inner_join(get_sentiments('bing')) %>%
        count(word, sentiment, sort = T) %>%
        acast(word ~ sentiment, value.var = 'n', fill = 0) %>%
        comparison.cloud(colors = c('firebrick3', 'dodgerblue3'))

Planning the Prediction Model

Up to this point the exploration has mainly been exploring single words, with some sentiment analysis. In the final model, bigrams, trigrams and quadgrams will be used to predict the next word in a string of words.

markov chain with transition matrix (stochastic matrix).

ngram Exploration

set.seed(13579)
modelText <- tibble(text = tidyData$text, sample = 1:length(tidyData$text)) %>%
        sample_frac(0.1)
bigram <- modelText %>%
        unnest_tokens(ngram, text, token = 'ngrams', n = 2)
trigram <- modelText %>%
        unnest_tokens(ngram, text, token = 'ngrams', n = 3) %>%
        filter(!is.na(ngram))
quadgram <- modelText %>%
        unnest_tokens(ngram, text, token = 'ngrams', n = 4) %>%
        filter(!is.na(ngram))

tibble('ngram Length' = c(2, 3, 4),
       'Number of Grams Occuring More Than Once' =
               c(dim(bigram %>% count(ngram) %>% filter(n > 1))[1],
                 dim(trigram %>% count(ngram) %>% filter(n > 1))[1],
                 dim(quadgram %>% count(ngram) %>% filter(n > 1))[1])
) %>% kbl() %>%
        kable_styling(bootstrap_options = c("striped",
                                            "hover",
                                            "condensed"),
                      full_width = F)

ngram Length	Number of Grams Occuring More Than Once
2	457553
3	485459
4	224925

ngram Frequencies

bigram_counts <- bigram %>% count(ngram, sort = T)
trigram_counts <- trigram %>% count(ngram, sort = T)
quadgram_counts <- quadgram %>% count(ngram, sort = T)

bigram_plot <- bigram_counts %>%
        slice_head(n = 20) %>%
        mutate(ngram = reorder(ngram, n)) %>%
        ggplot(aes(n, ngram)) +
        geom_col() +
        ylab(NULL)
trigram_plot <- trigram_counts %>%
        slice_head(n = 20) %>%
        mutate(ngram = reorder(ngram, n)) %>%
        ggplot(aes(n, ngram)) +
        geom_col() +
        ylab(NULL)
quadgram_plot <- quadgram_counts %>%
        slice_head(n = 20) %>%
        mutate(ngram = reorder(ngram, n)) %>%
        ggplot(aes(n, ngram)) +
        geom_col() +
        ylab(NULL)
grid.arrange(bigram_plot, trigram_plot, quadgram_plot, nrow = 1)

bigram_counts_sep <- bigram_counts %>%
        separate(ngram, c('word1', 'word2'), sep = ' ')
trigram_counts_sep <- trigram_counts %>%
        separate(ngram, c('word1', 'word2', 'word3'), sep = ' ')
quadgram_counts_sep <- quadgram_counts %>%
        separate(ngram, c('word1', 'word2', 'word3', 'word4'), sep = ' ')

Arrow plot

bigram_graph <- bigram_counts_sep %>% 
        filter(!word1 %in% stop_words$word) %>%
        filter(!word2 %in% stop_words$word) %>%
        filter(n > 60) %>% 
        graph_from_data_frame()

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
        geom_edge_link(aes(edge_alpha = n), 
                       show.legend = FALSE,
                       arrow = a, 
                       end_cap = circle(.07, 'inches')) +
        geom_edge_link() +
        geom_node_point(color = "lightblue", size = 5) +
        geom_node_text(aes(label = name), vjust = 1, hjust = 1)

Description of Prediction Model

The model for this project makes use of a Markov model chain to decide which word is the most likely to occur after the input words based on the sample corpus provided. Once the tables are made, using the model is as simple as looking a string of words up in a table and determining the highest frequency word that follows.

Here is an example of what a portion of the quadgram frequency table looks like:

quadgram_freq_matrix <- modelText %>%
    unnest_tokens(ngram, text, token = 'ngrams', n = 4) %>%
    filter(!is.na(ngram)) %>%
    count(ngram, sort = T) %>%
    mutate(n = n/sum(n), 
           coverage = cumsum(n)) %>%
    separate(ngram, c('word1', 'word2', 'word3', 'word4'), sep = ' ') %>%
    unite(query, c(word1, word2, word3), sep = ' ') %>%
    cast_dfm(query, word4, n)

quadgram_freq_matrix[1:6, 1:5] %>% 
        kbl() %>%
        kable_styling(bootstrap_options = c("striped", 
                                            "hover"),
                      full_width = F)

doc_id	follow	the	of	time	be
thanks for the	9.94e-05	0.00e+00	0.00e+00	2.0e-07	0.0e+00
the end of	0.00e+00	8.26e-05	0.00e+00	1.4e-06	0.0e+00
the rest of	0.00e+00	8.00e-05	0.00e+00	2.0e-07	0.0e+00
at the end	0.00e+00	8.00e-07	7.46e-05	0.0e+00	0.0e+00
for the first	0.00e+00	0.00e+00	3.00e-07	6.3e-05	0.0e+00
is going to	0.00e+00	1.70e-06	0.00e+00	0.0e+00	6.2e-05