Summary

The following is a milestone report summarizing the exploratory analysis and steps taken thus far to prepare for building an NLP text prediction model/app. As part of the capstone project for the Data Science Specialization offered by Johns Hopkins University, students are instructed to use web-scraped data provided by SwiftKey to build a Shiny app that predicts the next word of a sentence given a few words of input (similar to the predictive keyboard features found on modern smartphones). As per the official instructions for this milestone report/assignment:

“The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager.”

Loading/reading in Data

The first step of our analysis is retrieving the given datasets and loading them into our R environment. As stated previously, the datasets are provided courtesy of SwiftKey, who have scraped web data from three sources: blog sites, news sites, and Twitter. The data from each source is contained in its own corresponding .txt file, and we will only be working with the "en_US" locale datasets for this project. The data is provided to students via a download link; to preserve reproducibility in our analysis, we download the dataset from that link directly within our R script.

#Loading in general libraries
library(tidyverse)
library(magrittr)
library(R.utils)

#Downloading & unzipping data into a directory
if(!file.exists("projData")){dir.create("projData")}
trainURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(trainURL, destfile = "projData/data.zip", method = "curl")
unzip("projData/data.zip", exdir = "projData")   #extract into projData/ so the file paths used below resolve
unlink("projData/data.zip")

From there, we will assign each dataset to its own corresponding object/character vector using readLines.

#Reading in data (closing each connection after its file is read)
con <- file("projData/final/en_US/en_US.twitter.txt", "r")
twit <- readLines(con)
close(con)
con <- file("projData/final/en_US/en_US.news.txt", "r")
news <- readLines(con)
close(con)
con <- file("projData/final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con)
close(con)

Let's now run some quick summary statistics on each dataset.

Blogs dataset:
summary(blogs)
##    Length     Class      Mode 
##    899288 character character
sum(stringi::stri_count_words(blogs))
## [1] 37546250
mean(stringi::stri_count_words(blogs))
## [1] 41.75109

As you can see, the blogs dataset contains 899,288 total lines, 37,546,250 total words, and an average of ~41.8 words per line.

News dataset:
summary(news)
##    Length     Class      Mode 
##   1010242 character character
sum(stringi::stri_count_words(news))
## [1] 34762395
mean(stringi::stri_count_words(news))
## [1] 34.40997

The news dataset contains 1,010,242 total lines, 34,762,395 total words, and an average of ~34.4 words per line.

Twitter dataset:
summary(twit)
##    Length     Class      Mode 
##   2360148 character character
sum(stringi::stri_count_words(twit))
## [1] 30093372
mean(stringi::stri_count_words(twit))
## [1] 12.75063

Lastly, the Twitter dataset contains 2,360,148 total lines, 30,093,372 total words, and an average of ~12.8 words per line. Notice the much lower average words per line in the Twitter dataset compared to the blogs and news datasets; this is largely due to the 140-character limit imposed on tweets at the time the data was collected.

Data Processing

Since each dataset contains on the order of a million or more lines, we will randomly sample 100,000 lines from each dataset for our exploratory analysis. The main reason is simply to reduce the loading/processing time of some of the tokenizing functions we will use later (especially when creating bigram and trigram tokens, which increase the dataset sizes substantially). A random sample of 100,000 lines from each dataset still gives an accurate representation of the larger population being sampled from.

#Taking a 100,000 line sample of each dataset (seeded so the sample is reproducible; any fixed seed works)
set.seed(1234)
twitSample <- sample(twit, size = 100000)
newsSample <- sample(news, size = 100000)
blogsSample <- sample(blogs, size = 100000)

Now that we have a sample of each dataset, we will analyze the frequency of words and word sequences by way of n-grams. More specifically, we will look at the top 10 most frequent unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences) in each dataset. We will be using the tm and tokenizers packages to create the tokens and load them into dataframes.

First, we will create sub-tables (which we will later feed to gt tables for visualization purposes) of the most frequent n-gram counts for each dataset. These sub-tables will then be column-bound/merged into one aggregate table per n-gram.

Let's now create an aggregate table named topUnigrams that lists the top 10 most frequent unigrams (words) in each dataset, along with their respective shares.

library(tm)
library(tokenizers)

#Creating Top Unigrams Dataframe
blogsTopUnigrams <- tokenize_words(blogsSample) %>% 
                        unlist %>% 
                        data.frame() %>% 
                        rename(blogs_unigrams = 1) %>% 
                        group_by(blogs_unigrams) %>% 
                        summarise(count = n()) %>% 
                        arrange(desc(count)) %>%
                        mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
                        head(10)

newsTopUnigrams <- tokenize_words(newsSample) %>% 
                        unlist %>% 
                        data.frame() %>% 
                        rename(news_unigrams = 1) %>% 
                        group_by(news_unigrams) %>% 
                        summarise(count = n()) %>% 
                        arrange(desc(count)) %>%
                        mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
                        head(10)

twitTopUnigrams <- tokenize_words(twitSample) %>% 
                        unlist %>% 
                        data.frame() %>% 
                        rename(twit_unigrams = 1) %>% 
                        group_by(twit_unigrams) %>% 
                        summarise(count = n()) %>% 
                        arrange(desc(count)) %>%
                        mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
                        head(10)

topUnigrams <- cbind(blogsTopUnigrams, newsTopUnigrams, twitTopUnigrams)
topUnigrams %<>% rename(count1 = 2, percent_share1 = 3, 
                       count2 = 5, percent_share2 = 6, 
                       count3 = 8, percent_share3 = 9)
rm(blogsTopUnigrams, newsTopUnigrams, twitTopUnigrams)

We can now create the exact same aggregate table for bigrams, named topBigrams.

#Creating Top Bigrams Dataframe
blogsTopBigrams <- tokenize_ngrams(blogsSample, n = 2) %>%
                        unlist %>% 
                        data.frame() %>% 
                        rename(blogs_bigrams = 1) %>% 
                        group_by(blogs_bigrams) %>% 
                        summarise(count = n()) %>% 
                        arrange(desc(count)) %>%
                        mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
                        head(10) 

newsTopBigrams <- tokenize_ngrams(newsSample, n = 2) %>%
                        unlist %>% 
                        data.frame() %>% 
                        rename(news_bigrams = 1) %>% 
                        group_by(news_bigrams) %>% 
                        summarise(count = n()) %>% 
                        arrange(desc(count)) %>%
                        mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
                        head(10)

twitTopBigrams <- tokenize_ngrams(twitSample, n = 2) %>%
                        unlist %>% 
                        data.frame() %>% 
                        rename(twit_bigrams = 1) %>% 
                        group_by(twit_bigrams) %>% 
                        summarise(count = n()) %>% 
                        arrange(desc(count)) %>%
                        mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
                        head(10)

topBigrams <- cbind(blogsTopBigrams, newsTopBigrams, twitTopBigrams)
topBigrams %<>% rename(count1 = 2, percent_share1 = 3, 
                       count2 = 5, percent_share2 = 6, 
                       count3 = 8, percent_share3 = 9)
rm(blogsTopBigrams, newsTopBigrams, twitTopBigrams)

Lastly, we will create the same table for trigrams, named topTrigrams.

#Creating Top Trigrams Dataframe
blogsTopTrigrams <- tokenize_ngrams(blogsSample, n = 3) %>%
                        unlist %>% 
                        data.frame() %>% 
                        rename(blogs_trigrams = 1) %>% 
                        drop_na() %>%
                        group_by(blogs_trigrams) %>% 
                        summarise(count = n()) %>% 
                        arrange(desc(count)) %>%
                        mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
                        head(10) 

newsTopTrigrams <- tokenize_ngrams(newsSample, n = 3) %>%
                        unlist %>% 
                        data.frame() %>% 
                        rename(news_trigrams = 1) %>%
                        drop_na() %>%
                        group_by(news_trigrams) %>% 
                        summarise(count = n()) %>% 
                        arrange(desc(count)) %>%
                        mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
                        head(10)

twitTopTrigrams <- tokenize_ngrams(twitSample, n = 3) %>%
                        unlist %>% 
                        data.frame() %>% 
                        rename(twit_trigrams = 1) %>%
                        drop_na() %>%
                        group_by(twit_trigrams) %>% 
                        summarise(count = n()) %>% 
                        arrange(desc(count)) %>%
                        mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
                        head(10)

topTrigrams <- cbind(blogsTopTrigrams, newsTopTrigrams, twitTopTrigrams)
topTrigrams %<>% rename(count1 = 2, percent_share1 = 3, 
                       count2 = 5, percent_share2 = 6, 
                       count3 = 8, percent_share3 = 9)
rm(blogsTopTrigrams, newsTopTrigrams, twitTopTrigrams)

Exploratory Analysis

Now that we have our topUnigrams, topBigrams, and topTrigrams tables, we can pipe them into functions from the gt package, which is used for creating aesthetically pleasing tables.

library(gt)
library(gtExtras)

#Unigrams Chart
topUnigrams %>% gt() %>%
                tab_header(title = md("**Top Unigrams by Count: Blogs, News, & Twitter Datasets**"),
                           subtitle = md("**Using a random sample of n = 100,000 lines from each dataset**")) %>%
                tab_spanner(label = md("**Blogs**"),
                            columns = 1:3) %>%
                tab_spanner(label = md("**News**"),
                            columns = 4:6) %>%
                tab_spanner(label = md("**Twitter**"),
                            columns = 7:9) %>%
                cols_label(blogs_unigrams = md("**Unigram**"),
                           news_unigrams = md("**Unigram**"),
                           twit_unigrams = md("**Unigram**"),
                           count1 = md("**Count**"),
                           count2 = md("**Count**"),
                           count3 = md("**Count**"),
                           percent_share1 = md("**% Share**"),
                           percent_share2 = md("**% Share**"),
                           percent_share3 = md("**% Share**")) %>%
                fmt_percent(columns = c(3,6,9), 
                            scale_values = F, 
                            decimals = 1) %>%
                fmt_integer(columns = c(2,5,8)) %>%
                cols_align(align = "center",
                           columns = c(2,5,8)) %>%
                tab_style(style = cell_borders(sides = "right",
                                               weight = px(1.5)),
                          locations = cells_body(columns = c(3,6))) %>%
                gt_color_rows(columns = c(3,6,9),
                              palette = "ggsci::blue_material",
                              pal_type = "continuous",
                              domain = c(0,8))
Top Unigrams by Count: Blogs, News, & Twitter Datasets
Using a random sample of n = 100,000 lines from each dataset

Blogs                         | News                          | Twitter
Unigram    Count    % Share   | Unigram    Count    % Share   | Unigram    Count    % Share
the        207,393  5.0%      | the        196,027  5.7%      | the        39,422   3.1%
and        121,489  2.9%      | to         89,822   2.6%      | to         33,297   2.6%
to         119,006  2.8%      | and        87,679   2.5%      | i          30,681   2.4%
a          100,273  2.4%      | a          86,743   2.5%      | a          25,603   2.0%
of         97,772   2.3%      | of         76,612   2.2%      | you        23,461   1.8%
i          87,137   2.1%      | in         67,127   2.0%      | and        18,424   1.4%
in         66,684   1.6%      | for        35,184   1.0%      | for        16,254   1.3%
that       51,523   1.2%      | that       34,458   1.0%      | in         16,167   1.3%
is         47,922   1.1%      | is         28,413   0.8%      | of         15,179   1.2%
it         44,620   1.1%      | on         26,657   0.8%      | is         14,856   1.2%

As you can see, the most frequent word in every dataset is by far "the", with a share ranging from roughly 3-6% of all words across the datasets. Similarly, the next most common words are almost all short function words (articles, prepositions, conjunctions, and pronouns). This is to be expected, as such function words make up a large portion of most English text.

#Bigrams Chart
topBigrams %>% gt() %>%
                tab_header(title = md("**Top Bigrams by Count: Blogs, News, & Twitter Datasets**"),
                           subtitle = md("**Using a random sample of n = 100,000 lines from each dataset**")) %>%
                tab_spanner(label = md("**Blogs**"),
                            columns = 1:3) %>%
                tab_spanner(label = md("**News**"),
                            columns = 4:6) %>%
                tab_spanner(label = md("**Twitter**"),
                            columns = 7:9) %>%
                cols_label(blogs_bigrams = md("**Bigram**"),
                           news_bigrams = md("**Bigram**"),
                           twit_bigrams = md("**Bigram**"),
                           count1 = md("**Count**"),
                           count2 = md("**Count**"),
                           count3 = md("**Count**"),
                           percent_share1 = md("**% Share**"),
                           percent_share2 = md("**% Share**"),
                           percent_share3 = md("**% Share**")) %>%
                fmt_percent(columns = c(3,6,9), 
                            scale_values = F, 
                            decimals = 1) %>%
                fmt_integer(columns = c(2,5,8)) %>%
                cols_align(align = "center",
                           columns = c(2,5,8)) %>%
                tab_style(style = cell_borders(sides = "right",
                                               weight = px(1.5)),
                          locations = cells_body(columns = c(3,6))) %>%
                gt_color_rows(columns = c(3,6,9),
                              palette = "ggsci::green_material",
                              pal_type = "continuous",
                              domain = c(0,1.5))
Top Bigrams by Count: Blogs, News, & Twitter Datasets
Using a random sample of n = 100,000 lines from each dataset

Blogs                          | News                           | Twitter
Bigram      Count    % Share   | Bigram      Count    % Share   | Bigram       Count   % Share
of the      20,963   0.5%      | of the      18,571   0.6%      | in the       3,309   0.3%
in the      17,081   0.4%      | in the      17,551   0.5%      | for the      3,114   0.3%
to the      9,563    0.2%      | to the      8,238    0.2%      | of the       2,443   0.2%
on the      8,319    0.2%      | on the      7,283    0.2%      | on the       2,168   0.2%
to be       7,746    0.2%      | for the     6,749    0.2%      | to be        2,040   0.2%
and the     6,553    0.2%      | at the      5,836    0.2%      | thanks for   1,785   0.2%
for the     6,515    0.2%      | and the     5,201    0.2%      | to the       1,778   0.2%
i was       5,572    0.1%      | in a        5,023    0.2%      | at the       1,558   0.1%
and i       5,538    0.1%      | to be       4,684    0.1%      | i love       1,496   0.1%
at the      5,375    0.1%      | with the    4,308    0.1%      | if you       1,443   0.1%

Again, many of the top bigrams are combinations of the same common function words we saw among the top unigrams. Interestingly, some of the Twitter bigrams, such as "thanks for" and "i love", hint at the more "personal" character of social media (i.e. the more frequent expression of opinions and emotions, compared to the relatively impartial and formal tone of much news and blog writing).

Additionally, one can immediately notice that the percent share of the top ten bigrams is significantly smaller than that of the top ten unigrams. This is expected given how the tokenizers construct bigrams (and higher-order n-grams): there are far more distinct two-word combinations than single words, so the total count is spread across many more types, and any one exact word pair accounts for a smaller fraction of the dataset than any individual word does.
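To see why, consider a quick check on the blogs sample (a rough illustration only; the exact counts will depend on the random sample drawn). The number of distinct bigram types is far larger than the number of distinct unigram types, so the same total count is spread much more thinly.

#Comparing distinct unigram vs. bigram types in the blogs sample
uniTypes <- unique(unlist(tokenize_words(blogsSample)))
biTypes <- unique(unlist(tokenize_ngrams(blogsSample, n = 2)))
length(uniTypes)   #number of distinct unigram types
length(biTypes)    #number of distinct bigram types (far larger)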

#Trigrams Chart
topTrigrams %>% gt() %>%
                tab_header(title = md("**Top Trigrams by Count: Blogs, News, & Twitter Datasets**"),
                           subtitle = md("**Using a random sample of n = 100,000 lines from each dataset**")) %>%
                tab_spanner(label = md("**Blogs**"),
                            columns = 1:3) %>%
                tab_spanner(label = md("**News**"),
                            columns = 4:6) %>%
                tab_spanner(label = md("**Twitter**"),
                            columns = 7:9) %>%
                cols_label(blogs_trigrams = md("**Trigram**"),
                           news_trigrams = md("**Trigram**"),
                           twit_trigrams = md("**Trigram**"),
                           count1 = md("**Count**"),
                           count2 = md("**Count**"),
                           count3 = md("**Count**"),
                           percent_share1 = md("**% Share**"),
                           percent_share2 = md("**% Share**"),
                           percent_share3 = md("**% Share**")) %>%
                fmt_percent(columns = c(3,6,9), 
                            scale_values = F, 
                            decimals = 1) %>%
                fmt_integer(columns = c(2,5,8)) %>%
                cols_align(align = "center",
                           columns = c(2,5,8)) %>%
                tab_style(style = cell_borders(sides = "right",
                                               weight = px(1.5)),
                          locations = cells_body(columns = c(3,6))) %>%
                gt_color_rows(columns = c(3,6,9),
                              palette = "ggsci::indigo_material",
                              pal_type = "continuous",
                              domain = c(0,1))
Top Trigrams by Count: Blogs, News, & Twitter Datasets
Using a random sample of n = 100,000 lines from each dataset

Blogs                           | News                              | Twitter
Trigram        Count   % Share  | Trigram            Count  % Share | Trigram              Count  % Share
one of the     1,616   0.0%     | one of the         1,356  0.0%    | thanks for the       979    0.1%
a lot of       1,348   0.0%     | a lot of           1,168  0.0%    | looking forward to   399    0.0%
out of the     788     0.0%     | as well as         585    0.0%    | thank you for        391    0.0%
to be a        788     0.0%     | the end of         578    0.0%    | i love you           362    0.0%
as well as     758     0.0%     | to be a            558    0.0%    | for the follow       335    0.0%
some of the    743     0.0%     | according to the   555    0.0%    | can't wait to        307    0.0%
it was a       741     0.0%     | part of the        544    0.0%    | i want to            290    0.0%
the end of     709     0.0%     | in the first       536    0.0%    | going to be          281    0.0%
be able to     688     0.0%     | out of the         535    0.0%    | i have a             272    0.0%
a couple of    672     0.0%     | going to be        519    0.0%    | i need to            271    0.0%

Once again, many of the top trigrams are built from the same common function words. This time, the differences between the Twitter dataset and the blogs and news datasets are even more apparent: phrases like "thanks for the follow" appear to be among the most common on Twitter (with the trigrams "thanks for the", "thank you for", and "for the follow" occupying three of the top ten spots).

We also see that the percent share of each top trigram rounds to essentially 0%, with the exception of "thanks for the", which makes up about 0.1% of all trigrams in the Twitter dataset (979 occurrences out of roughly a million trigram tokens in the sample). This relates back to the point made earlier about bigrams: the higher the n-gram order, the more distinct types there are, and the smaller the share claimed by any single one.

What's Next: Model & App Creation

The next part of this project involves building the prediction model itself, which will then power the final Shiny app. To do this, I plan to pivot away from the tm and tokenizers packages and instead use the quanteda package, which provides many useful text processing and tokenizing functions that should streamline the analysis and modeling process. First, we will most likely sample a small training set from the main datasets, at least for the initial model building (we don't want to bog down our algorithms, so it is important to strike a balance between dataset size and the loading/response times of the model). From there, we will convert each training set into a corpus, create tokens and n-grams (while removing punctuation, numbers, URLs, symbols, profanity, and other unnecessary words), and assemble everything into a document-feature matrix. Once the data is in that form, building the prediction model will be much easier. Additionally, I plan to use the Katz back-off method to assign non-zero probabilities to n-grams that are rare or unseen in the training data.
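To give a rough idea of the planned quanteda workflow (a sketch only, not the final implementation), the preprocessing steps might look something like the following. Here trainSample is an assumed character vector of sampled training lines and profanityList an assumed character vector of words to remove.

library(quanteda)

#Assumed inputs: 'trainSample' (character vector of sampled training lines)
#and 'profanityList' (character vector of words to remove) - both hypothetical
trainCorpus <- corpus(trainSample)

#Tokenize while stripping punctuation, numbers, symbols, and URLs
trainTokens <- tokens(trainCorpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE,
                      remove_symbols = TRUE,
                      remove_url = TRUE) %>%
               tokens_tolower() %>%
               tokens_remove(pattern = profanityList)

#Build unigram, bigram, and trigram document-feature matrices
uniDFM <- dfm(trainTokens)
biDFM <- dfm(tokens_ngrams(trainTokens, n = 2, concatenator = " "))
triDFM <- dfm(tokens_ngrams(trainTokens, n = 3, concatenator = " "))

#Collapse each dfm into a named frequency vector for the prediction model
uniFreq <- colSums(uniDFM)
biFreq <- colSums(biDFM)
triFreq <- colSums(triDFM)

Purely to convey the back-off idea (this sketch omits the Katz discounting math that redistributes probability mass to unseen n-grams), the prediction step would try the highest-order n-gram first and fall back to lower orders when no match is found.

#Simplified back-off lookup (illustrative only; omits Katz discounting)
#Assumes the named frequency vectors uniFreq, biFreq, triFreq from above
predictNext <- function(lastTwoWords, uniFreq, biFreq, triFreq) {
    triMatches <- triFreq[startsWith(names(triFreq), paste0(lastTwoWords, " "))]
    if (length(triMatches) > 0) {
        best <- names(which.max(triMatches))
    } else {
        lastWord <- sub("^\\S+\\s+", "", lastTwoWords)
        biMatches <- biFreq[startsWith(names(biFreq), paste0(lastWord, " "))]
        best <- if (length(biMatches) > 0) names(which.max(biMatches)) else names(which.max(uniFreq))
    }
    tail(strsplit(best, " ")[[1]], 1)   #return the final word of the chosen n-gram
}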