library(tidytext)
library(tibble)
library(dplyr)
library(stringr)
library(kableExtra)
library(tidyr)
library(wordcloud)
library(ggplot2)
library(scales)
library(gridExtra)
library(reshape2)
library(igraph)
library(ggraph)
library(data.table)
The data for this project was provided by SwiftKey and can be found https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The dataset comprises of three character vectors: news, blogs, and twitter with each index representing a single sample (i.e. a full article or tweet).
The goal of this project is to produce an application that accepts a string of words and predicts the next word using a corpus sample extracted from the SwiftKey dataset.
The large corpus of text documents was processed, cleaned, analyzed and visualized using a number of language processing packages in R including tidytext, stringr and wordcloud to name a few. tidytext is especially useful as it provides the capability to interact with textual data using ‘dplyr’ like functions for quick and easy exploration.
A Markov Chain model is produced using quadgrams, trigrams, and bigrams structured into frequency tables. The tables allow for easy determination of the highest probability ngram given a series of words as an input.
In a future project I plan to explore the use of LSTM deep learning to compare the effectiveness of these different methods at predicting the next word in a sentence.
newsRaw <- readLines('../data/final/en_US/en_US.news.txt', encoding = 'UTF-8')
blogsRaw <- readLines('../data/final/en_US/en_US.blogs.txt', encoding = 'UTF-8')
twitterRaw <- readLines('../data/final/en_US/en_US.twitter.txt', encoding = 'UTF-8')
tibble(Source = c('news', 'blogs', 'twitter'),
'Number of Samples' = c(length(newsRaw),
length(blogsRaw),
length(twitterRaw)),
'Total Characters' = c(sum(nchar(newsRaw)),
sum(nchar(blogsRaw)),
sum(nchar(twitterRaw))),
'Average Characters' = c(round(mean(nchar(newsRaw))),
round(mean(nchar(blogsRaw))),
round(mean(nchar(twitterRaw))))
) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped",
"hover",
"condensed"),
full_width = F)
| Source | Number of Samples | Total Characters | Average Characters |
|---|---|---|---|
| news | 77259 | 15639408 | 202 |
| blogs | 899288 | 206824505 | 230 |
| 2360148 | 162096031 | 69 |
After taking a look at the least common ‘words’ in the files, many were found to contain large amounts of white space, special characters, numbers, and non-english words. I kept only ‘alpha’ characters and spaces for this project. Any web urls also were removed.
news <- gsub('(https?|www.)\\S*', '', newsRaw, perl = T)
news <- gsub('[^a-zA-Z[:space:]]', '', news, perl = T)
news <- str_squish(news)
blogs <- gsub('(https?|www.)\\S*', '', blogsRaw, perl = T)
blogs <- gsub('[^a-zA-Z[:space:]]', '', blogs, perl = T)
blogs <- str_squish(blogs)
twitter <- gsub('(https?|www.)\\S*', '', twitterRaw, perl = T)
twitter<- gsub('[^a-zA-Z[:space:]]', '', twitter, perl = T)
twitter <- str_squish(twitter)
tidyData <- rbind(tibble(source = 'news',
text = news,
sample = 1:length(news)),
tibble(source = 'blogs',
text = blogs,
sample = 1:length(blogs)),
tibble(source = 'twitter',
text = twitter,
sample = 1:length(twitter)))
tidyData$source <- factor(tidyData$source,
levels = c('news', 'blogs', 'twitter'))
The starting data consists of 3 character vectors containing a sample at each index. Each vector was first converted to a data frame (here, a ‘tibble’) with the a line number and text column for easier handling and viewing.
My goal of this section is to create a tidy text format: A table with one-token-per-row. A token is a meaningful unit of text, such as a word or punctuation. Tokenization is the process of turning text into tokens.
unnest_tokens() is a part of the tidytext package in R and splits a column of character strings into tokens (one-token-per-row).
tidyWords <- tidyData %>%
unnest_tokens(word, text)
tidyWords %>%
group_by(source) %>%
count(word, name = 'Times Word Occurs') %>%
count(`Times Word Occurs`, name = 'Number of Words') %>%
arrange(`Times Word Occurs`)
## # A tibble: 7,135 x 3
## # Groups: source [3]
## source `Times Word Occurs` `Number of Words`
## <fct> <int> <int>
## 1 news 1 42022
## 2 blogs 1 230477
## 3 twitter 1 316891
## 4 news 2 11912
## 5 blogs 2 44980
## 6 twitter 2 49061
## 7 news 3 6128
## 8 blogs 3 21405
## 9 twitter 3 21107
## 10 news 4 4009
## # ... with 7,125 more rows
Blogs and twitter have significantly more words that appear only 1 time, likely due to incorrect spelling or non-words used in these lax sources.
The data frame stop_words is also provided by the tidytext package which provides 1149 stop words from three lexicons (‘SMART’, ‘onix’, or ‘snowball’). Stop words are the most common words that provide little to no insight into our data, which is why they’re removed before analysis. The FCC ‘7 banned words’ were used to remove profanity.
profanity <- tibble(word = readLines('../data/curses'))
tidyWords <- tidyWords %>% anti_join(profanity)
cleanWords <- tidyWords %>% anti_join(stop_words)
Data that is useful in this exploration phase is frequency count of words and the proportion of words to compare sources to each other.
commonWords <- cleanWords %>%
group_by(source) %>%
count(word, sort = T, name = 'freq') %>%
mutate(prop = freq / sum(freq)) %>%
ungroup()
First I made a table with the top 10 most common words, showing their frequency.
commonWordsTable <- cbind(commonWords %>%
filter(source == 'news') %>%
select(word, freq) %>%
slice_head(n = 10),
commonWords %>%
filter(source == 'blogs') %>%
select(word, freq) %>%
slice_head(n = 10),
commonWords %>%
filter(source == 'twitter') %>%
select(word, freq) %>%
slice_head(n = 10))
commonWordsTable %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped",
"hover",
"condensed"),
full_width = F) %>%
add_header_above(c('News' = 2, 'Blogs' = 2, 'Twitter' = 2))
| word | freq | word | freq | word | freq |
|---|---|---|---|---|---|
| time | 3992 | time | 88143 | im | 158535 |
| people | 3667 | im | 66951 | love | 105583 |
| city | 2828 | people | 59220 | dont | 90104 |
| percent | 2629 | dont | 56261 | day | 90060 |
| school | 2611 | day | 50485 | rt | 88697 |
| game | 2368 | love | 44736 | time | 74698 |
| home | 2343 | life | 39721 | lol | 66704 |
| million | 2343 | ive | 33617 | people | 51496 |
| dont | 2272 | world | 28524 | happy | 48536 |
| county | 2237 | didnt | 27666 | follow | 47326 |
Using ggplot2 the most frequent words can be visualized very easily:
commonWords %>%
group_by(source) %>%
arrange(-prop) %>%
slice_head(n = 10) %>%
ungroup() %>%
(function(df) merge(df, count(df, word, name = 'overlap'))) %>%
# propGraph <- merge(propGraph, count(propGraph, word, name = 'overlap'))
# propGraph %>%
ggplot(aes(x = prop, y = word, fill = factor(overlap))) +
geom_col() +
facet_grid(~source) +
# theme(legend.title = element_text('# of Overlaps'))
labs(y = NULL, x = '%') +
scale_fill_discrete(name = 'Overlaps Between Sources:') +
theme(legend.position = 'top', legend.justification = 0)
The words ‘time’, ‘people’, and ‘dont’ are found in all 3 top 10 words, while ‘love’, ‘im’, and ‘day’ are common only between blogs and twitter.
Sorting words by frequency gives a metric to visualize how much of the data is coverved with a specific number of top words.
coverageData <- tidyWords %>%
count(word, sort = T, name = 'freq') %>%
mutate(prop = freq / sum(freq)) %>%
mutate(coverage = cumsum(prop)) %>%
mutate(numwords = row_number())
coverageData %>%
ggplot(aes(numwords, coverage)) +
geom_line(size = 1) +
scale_x_log10() +
scale_y_continuous(labels = percent_format())
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
rbind(
coverageData %>%
filter(coverage >= 0.5) %>%
slice_head(n = 1) %>%
select(coverage, numwords),
coverageData %>%
filter(coverage >= 0.9) %>%
slice_head(n = 1) %>%
select(coverage, numwords)
)
## # A tibble: 2 x 2
## coverage numwords
## <dbl> <int>
## 1 0.500 120
## 2 0.900 6863
plotWords <- commonWords %>%
select(-freq) %>%
filter(prop >= 0.00001) %>%
spread(source, prop)
nbplot <- plotWords %>%
ggplot(aes(x = blogs, y = news, color = abs(news - blogs))) +
geom_abline(color = 'gray40', lty = 2) +
geom_jitter(alpha = 0.1, size = 2, width = 0.2, height = 0.2) +
geom_text(aes(label = word), check_overlap = T, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.0001),
low = 'darkslategray4', high = 'gray75') +
theme(legend.position = 'none')
ntplot <- plotWords %>%
ggplot(aes(x = twitter, y = news, color = abs(news - twitter))) +
geom_abline(color = 'gray40', lty = 2) +
geom_jitter(alpha = 0.1, size = 2, width = 0.2, height = 0.2) +
geom_text(aes(label = word), check_overlap = T, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.0001),
low = 'darkslategray4', high = 'gray75') +
theme(legend.position = 'none')
tbplot <- plotWords %>%
ggplot(aes(x = blogs, y = twitter, color = abs(twitter - blogs))) +
geom_abline(color = 'gray40', lty = 2) +
geom_jitter(alpha = 0.1, size = 2, width = 0.2, height = 0.2) +
geom_text(aes(label = word), check_overlap = T, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.0001),
low = 'darkslategray4', high = 'gray75') +
theme(legend.position = 'none')
grid.arrange(nbplot, ntplot, tbplot, nrow = 1)
The package tidytext is very handy for manipulating textual data with dplyr and other common R operations. However, some functions require a ‘term frequency matrix’ as their input rather than a data frame. The reshape2 package allows easy transformation to a matrix and vice versa.
In addition to transforming a data frame into a matrix, the acast function can also ‘spread’ the data to a ‘wide’ format at the same time (similar to the spread function in tidyr).
cleanWords %>%
filter('news' == 'news') %>%
inner_join(get_sentiments('bing')) %>%
count(word, sentiment, sort = T) %>%
acast(word ~ sentiment, value.var = 'n', fill = 0) %>%
comparison.cloud(colors = c('firebrick3', 'dodgerblue3'))
cleanWords %>%
filter('blogs' == 'blogs') %>%
inner_join(get_sentiments('bing')) %>%
count(word, sentiment, sort = T) %>%
acast(word ~ sentiment, value.var = 'n', fill = 0) %>%
comparison.cloud(colors = c('firebrick3', 'dodgerblue3'))
cleanWords %>%
filter('twitter' == 'twitter') %>%
inner_join(get_sentiments('bing')) %>%
count(word, sentiment, sort = T) %>%
acast(word ~ sentiment, value.var = 'n', fill = 0) %>%
comparison.cloud(colors = c('firebrick3', 'dodgerblue3'))
Up to this point the exploration has mainly been exploring single words, with some sentiment analysis. In the final model, bigrams, trigrams and quadgrams will be used to predict the next word in a string of words.
markov chain with transition matrix (stochastic matrix).
set.seed(13579)
modelText <- tibble(text = tidyData$text, sample = 1:length(tidyData$text)) %>%
sample_frac(0.1)
bigram <- modelText %>%
unnest_tokens(ngram, text, token = 'ngrams', n = 2)
trigram <- modelText %>%
unnest_tokens(ngram, text, token = 'ngrams', n = 3) %>%
filter(!is.na(ngram))
quadgram <- modelText %>%
unnest_tokens(ngram, text, token = 'ngrams', n = 4) %>%
filter(!is.na(ngram))
tibble('ngram Length' = c(2, 3, 4),
'Number of Grams Occuring More Than Once' =
c(dim(bigram %>% count(ngram) %>% filter(n > 1))[1],
dim(trigram %>% count(ngram) %>% filter(n > 1))[1],
dim(quadgram %>% count(ngram) %>% filter(n > 1))[1])
) %>% kbl() %>%
kable_styling(bootstrap_options = c("striped",
"hover",
"condensed"),
full_width = F)
| ngram Length | Number of Grams Occuring More Than Once |
|---|---|
| 2 | 457553 |
| 3 | 485459 |
| 4 | 224925 |
bigram_counts <- bigram %>% count(ngram, sort = T)
trigram_counts <- trigram %>% count(ngram, sort = T)
quadgram_counts <- quadgram %>% count(ngram, sort = T)
bigram_plot <- bigram_counts %>%
slice_head(n = 20) %>%
mutate(ngram = reorder(ngram, n)) %>%
ggplot(aes(n, ngram)) +
geom_col() +
ylab(NULL)
trigram_plot <- trigram_counts %>%
slice_head(n = 20) %>%
mutate(ngram = reorder(ngram, n)) %>%
ggplot(aes(n, ngram)) +
geom_col() +
ylab(NULL)
quadgram_plot <- quadgram_counts %>%
slice_head(n = 20) %>%
mutate(ngram = reorder(ngram, n)) %>%
ggplot(aes(n, ngram)) +
geom_col() +
ylab(NULL)
grid.arrange(bigram_plot, trigram_plot, quadgram_plot, nrow = 1)
bigram_counts_sep <- bigram_counts %>%
separate(ngram, c('word1', 'word2'), sep = ' ')
trigram_counts_sep <- trigram_counts %>%
separate(ngram, c('word1', 'word2', 'word3'), sep = ' ')
quadgram_counts_sep <- quadgram_counts %>%
separate(ngram, c('word1', 'word2', 'word3', 'word4'), sep = ' ')
bigram_graph <- bigram_counts_sep %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(n > 60) %>%
graph_from_data_frame()
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n),
show.legend = FALSE,
arrow = a,
end_cap = circle(.07, 'inches')) +
geom_edge_link() +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
The model for this project makes use of a Markov model chain to decide which word is the most likely to occur after the input words based on the sample corpus provided. Once the tables are made, using the model is as simple as looking a string of words up in a table and determining the highest frequency word that follows.
Here is an example of what a portion of the quadgram frequency table looks like:
quadgram_freq_matrix <- modelText %>%
unnest_tokens(ngram, text, token = 'ngrams', n = 4) %>%
filter(!is.na(ngram)) %>%
count(ngram, sort = T) %>%
mutate(n = n/sum(n),
coverage = cumsum(n)) %>%
separate(ngram, c('word1', 'word2', 'word3', 'word4'), sep = ' ') %>%
unite(query, c(word1, word2, word3), sep = ' ') %>%
cast_dfm(query, word4, n)
quadgram_freq_matrix[1:6, 1:5] %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped",
"hover"),
full_width = F)
| doc_id | follow | the | of | time | be |
|---|---|---|---|---|---|
| thanks for the | 9.94e-05 | 0.00e+00 | 0.00e+00 | 2.0e-07 | 0.0e+00 |
| the end of | 0.00e+00 | 8.26e-05 | 0.00e+00 | 1.4e-06 | 0.0e+00 |
| the rest of | 0.00e+00 | 8.00e-05 | 0.00e+00 | 2.0e-07 | 0.0e+00 |
| at the end | 0.00e+00 | 8.00e-07 | 7.46e-05 | 0.0e+00 | 0.0e+00 |
| for the first | 0.00e+00 | 0.00e+00 | 3.00e-07 | 6.3e-05 | 0.0e+00 |
| is going to | 0.00e+00 | 1.70e-06 | 0.00e+00 | 0.0e+00 | 6.2e-05 |