The following is a milestone report summarizing the exploratory analysis and steps taken thus far to prepare for building an NLP text prediction model/app. As part of the capstone project for the Data Science Specialization offered by Johns Hopkins University, students are instructed to use web-scraped data provided by SwiftKey to build a Shiny app that predicts the next word of a sentence given a few words of input (similar to the modern-day predictive keyboard features found on iPhones, for example). As per the official instructions for this milestone report/assignment:
“The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager.”
The first step of our analysis is retrieving the given datasets and loading them into our R environment. As stated previously, the datasets are provided courtesy of SwiftKey, who have scraped web data from three sources: blog sites, news sites, and Twitter. The data from each source is contained in its own corresponding .txt file. Furthermore, we will only be working with the "en_US" locale datasets for this project. The data is provided to students via a download link; to preserve reproducibility in our analysis, we will download the dataset using this link directly within our R script.
#Loading in general libraries
library(tidyverse)
library(magrittr)
library(R.utils)
#Downloading & unzipping data into a directory
if(!file.exists("projData")){dir.create("projData")}
trainURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(trainURL, destfile = "projData/data.zip", method = "curl")
unzip("projData/data.zip", exdir = "projData")
unlink("projData/data.zip")
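As a quick optional check, we can confirm that the three en_US files were extracted to the location the rest of the script expects (only the file names and approximate sizes are inspected here):

#Optional sanity check: confirming the en_US files were extracted where expected
list.files("projData/final/en_US")
file.size(list.files("projData/final/en_US", full.names = TRUE))/1e6  #approximate sizes in MB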
From there, we will assign each dataset to its own corresponding character vector using readLines().
#Reading in data, closing each connection after use
con <- file("projData/final/en_US/en_US.twitter.txt", "r")
twit <- readLines(con)
close(con)
con <- file("projData/final/en_US/en_US.news.txt", "r")
news <- readLines(con)
close(con)
con <- file("projData/final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con)
close(con)
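(Note: depending on the platform, readLines() may warn about embedded nul characters or an incomplete final line, particularly for the news file. If that occurs, a slightly more defensive alternative is to let readLines() open and close the file itself while skipping nuls; this is optional and does not change the analysis below.)

#Optional alternative read that skips embedded nuls and declares the encoding
news <- readLines("projData/final/en_US/en_US.news.txt",
                  encoding = "UTF-8", skipNul = TRUE)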
Let's now run some quick summary statistics on each dataset.
summary(blogs)
## Length Class Mode
## 899288 character character
sum(stringi::stri_count_words(blogs))
## [1] 37546250
mean(stringi::stri_count_words(blogs))
## [1] 41.75109
As you can see, the blogs dataset contains 899,288 total lines, 37,546,250 total words, and an average of ~41.8 words per line.
summary(news)
## Length Class Mode
## 1010242 character character
sum(stringi::stri_count_words(news))
## [1] 34762395
mean(stringi::stri_count_words(news))
## [1] 34.40997
The news dataset contains 1,010,242 total lines, 34,762,395 total words, and an average of ~34.4 words per line.
summary(twit)
## Length Class Mode
## 2360148 character character
sum(stringi::stri_count_words(twit))
## [1] 30093372
mean(stringi::stri_count_words(twit))
## [1] 12.75063
Lastly, the Twitter dataset contains 2,360,148 total lines, 30,093,372 total words, and an average of ~12.8 words per line. Notice the discrepancy in the average words per line in the Twitter dataset compared to the blogs and news datasets; this is largely due to the 140-character limit on tweets at the time the data was collected.
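For reference, the same statistics can be collected into a single summary data frame (a small convenience sketch using the same stringi word counts; the object names here are arbitrary):

#Collecting the line/word statistics for all three datasets into one data frame
wordCounts <- list(blogs = blogs, news = news, twitter = twit) %>%
  map(stringi::stri_count_words)
summaryStats <- tibble(dataset = names(wordCounts),
                       lines = map_int(wordCounts, length),
                       total_words = map_dbl(wordCounts, sum),
                       mean_words_per_line = map_dbl(wordCounts, mean))
summaryStats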
Since the datasets range from roughly 900,000 to 2.4 million lines, we will randomly sample 100,000 lines from each dataset for our exploratory analysis. The main reason for this is simply to reduce the loading/processing times of some of the tokenizing functions that we will be using later (especially when creating bigram and trigram tokens, which substantially increase the size of the data). A random sample of this size should still give an accurate representation of the larger population we are sampling from.
#Taking a 100,000 line sample of each dataset
set.seed(1234)    #arbitrary seed so the sampling is reproducible on re-runs
twitSample <- sample(twit, size = 100000)
newsSample <- sample(news, size = 100000)
blogsSample <- sample(blogs, size = 100000)
Now that we have a sample of each dataset, we will analyze the frequency of n-grams. More specifically, we will look at the top 10 most frequent unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences). We will be using the tm and tokenizers packages to create the tokens and load them into data frames.
First, we will create sub-tables (that we will later feed to gt tables for visualization purposes) of frequent n-gram counts for each dataset. These sub-tables will then be column-bound/merged into one aggregate table for each n-gram.
Let's now create an aggregate table named topUnigrams that lists the top 10 most frequent unigrams (words) in each dataset, along with their respective share.
library(tm)
library(tokenizers)
#Creating Top Unigrams Dataframe
blogsTopUnigrams <- tokenize_words(blogsSample) %>%
unlist %>%
data.frame() %>%
rename(blogs_unigrams = 1) %>%
group_by(blogs_unigrams) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
head(10)
newsTopUnigrams <- tokenize_words(newsSample) %>%
unlist %>%
data.frame() %>%
rename(news_unigrams = 1) %>%
group_by(news_unigrams) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
head(10)
twitTopUnigrams <- tokenize_words(twitSample) %>%
unlist %>%
data.frame() %>%
rename(twit_unigrams = 1) %>%
group_by(twit_unigrams) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
head(10)
topUnigrams <- cbind(blogsTopUnigrams, newsTopUnigrams, twitTopUnigrams)
topUnigrams %<>% rename(count1 = 2, percent_share1 = 3,
count2 = 5, percent_share2 = 6,
count3 = 8, percent_share3 = 9)
rm(blogsTopUnigrams, newsTopUnigrams, twitTopUnigrams)
We can now create the exact same aggregate table for bigrams, named topBigrams.
#Creating Top Bigrams Dataframe
blogsTopBigrams <- tokenize_ngrams(blogsSample, n = 2) %>%
unlist %>%
data.frame() %>%
rename(blogs_bigrams = 1) %>%
group_by(blogs_bigrams) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
head(10)
newsTopBigrams <- tokenize_ngrams(newsSample, n = 2) %>%
unlist %>%
data.frame() %>%
rename(news_bigrams = 1) %>%
group_by(news_bigrams) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
head(10)
twitTopBigrams <- tokenize_ngrams(twitSample, n = 2) %>%
unlist %>%
data.frame() %>%
rename(twit_bigrams = 1) %>%
group_by(twit_bigrams) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
head(10)
topBigrams <- cbind(blogsTopBigrams, newsTopBigrams, twitTopBigrams)
topBigrams %<>% rename(count1 = 2, percent_share1 = 3,
count2 = 5, percent_share2 = 6,
count3 = 8, percent_share3 = 9)
rm(blogsTopBigrams, newsTopBigrams, twitTopBigrams)
Lastly, we will create the same table for trigrams, named topTrigrams.
#Creating Top Trigrams Dataframe
blogsTopTrigrams <- tokenize_ngrams(blogsSample, n = 3) %>%
unlist %>%
data.frame() %>%
rename(blogs_trigrams = 1) %>%
drop_na() %>%
group_by(blogs_trigrams) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
head(10)
newsTopTrigrams <- tokenize_ngrams(newsSample, n = 3) %>%
unlist %>%
data.frame() %>%
rename(news_trigrams = 1) %>%
drop_na() %>%
group_by(news_trigrams) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
head(10)
twitTopTrigrams <- tokenize_ngrams(twitSample, n = 3) %>%
unlist %>%
data.frame() %>%
rename(twit_trigrams = 1) %>%
drop_na() %>%
group_by(twit_trigrams) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
head(10)
topTrigrams <- cbind(blogsTopTrigrams, newsTopTrigrams, twitTopTrigrams)
topTrigrams %<>% rename(count1 = 2, percent_share1 = 3,
count2 = 5, percent_share2 = 6,
count3 = 8, percent_share3 = 9)
rm(blogsTopTrigrams, newsTopTrigrams, twitTopTrigrams)
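As an aside, the nine nearly identical pipelines above could be collapsed into a single helper function; a possible sketch is shown below (the function name top_ngrams() is my own and not part of any package):

#Possible refactor: one helper that returns the top n-grams for a given sample
top_ngrams <- function(text, n_words = 1, top = 10) {
  tokens <- if (n_words == 1) tokenize_words(text) else tokenize_ngrams(text, n = n_words)
  data.frame(ngram = unlist(tokens)) %>%
    drop_na() %>%
    group_by(ngram) %>%
    summarise(count = n()) %>%
    arrange(desc(count)) %>%
    mutate(percent_share = round(count/sum(count)*100, digits = 1)) %>%
    head(top)
}
#Example: blogsTopBigrams <- top_ngrams(blogsSample, n_words = 2) %>% rename(blogs_bigrams = ngram)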
Now that we have our topUnigrams,
topBigrams, and topTrigrams tables, we can
pipe them into gt functions from the gt
package, used for creating aesthetically pleasing tables.
library(gt)
library(gtExtras)
#Unigrams Chart
topUnigrams %>% gt() %>%
tab_header(title = md("**Top Unigrams by Count: Blogs, News, & Twitter Datasets**"),
subtitle = md("**Using a random sample of n = 100,000 lines from each dataset**")) %>%
tab_spanner(label = md("**Blogs**"),
columns = 1:3) %>%
tab_spanner(label = md("**News**"),
columns = 4:6) %>%
tab_spanner(label = md("**Twitter**"),
columns = 7:9) %>%
cols_label(blogs_unigrams = md("**Unigram**"),
news_unigrams = md("**Unigram**"),
twit_unigrams = md("**Unigram**"),
count1 = md("**Count**"),
count2 = md("**Count**"),
count3 = md("**Count**"),
percent_share1 = md("**% Share**"),
percent_share2 = md("**% Share**"),
percent_share3 = md("**% Share**")) %>%
fmt_percent(columns = c(3,6,9),
scale_values = F,
decimals = 1) %>%
fmt_integer(columns = c(2,5,8)) %>%
cols_align(align = "center",
columns = c(2,5,8)) %>%
tab_style(style = cell_borders(sides = "right",
weight = px(1.5)),
locations = cells_body(columns = c(3,6))) %>%
gt_color_rows(columns = c(3,6,9),
palette = "ggsci::blue_material",
pal_type = "continuous",
domain = c(0,8))
**Top Unigrams by Count: Blogs, News, & Twitter Datasets**
Using a random sample of n = 100,000 lines from each dataset

| Blogs | | | News | | | Twitter | | |
|---|---|---|---|---|---|---|---|---|
| Unigram | Count | % Share | Unigram | Count | % Share | Unigram | Count | % Share |
| the | 207,393 | 5.0% | the | 196,027 | 5.7% | the | 39,422 | 3.1% |
| and | 121,489 | 2.9% | to | 89,822 | 2.6% | to | 33,297 | 2.6% |
| to | 119,006 | 2.8% | and | 87,679 | 2.5% | i | 30,681 | 2.4% |
| a | 100,273 | 2.4% | a | 86,743 | 2.5% | a | 25,603 | 2.0% |
| of | 97,772 | 2.3% | of | 76,612 | 2.2% | you | 23,461 | 1.8% |
| i | 87,137 | 2.1% | in | 67,127 | 2.0% | and | 18,424 | 1.4% |
| in | 66,684 | 1.6% | for | 35,184 | 1.0% | for | 16,254 | 1.3% |
| that | 51,523 | 1.2% | that | 34,458 | 1.0% | in | 16,167 | 1.3% |
| is | 47,922 | 1.1% | is | 28,413 | 0.8% | of | 15,179 | 1.2% |
| it | 44,620 | 1.1% | on | 26,657 | 0.8% | is | 14,856 | 1.2% |
As you can see, the most frequent word in every dataset is by far "the", with a share ranging from roughly 3-6% of all words across the datasets. Similarly, the next most common words are almost all short function words (articles, prepositions, conjunctions, and pronouns). This is to be expected, as these so-called stop words account for a large share of the words in most English text.
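As a side note, if we wanted to look past these function words at the most frequent "content" words, the same counts could be filtered against a stop-word list; a quick sketch using tm's built-in English stop words (shown for the blogs sample only):

#Sketch: top blog unigrams after filtering out common English stop words
tokenize_words(blogsSample) %>%
  unlist %>%
  data.frame() %>%
  rename(word = 1) %>%
  filter(!word %in% tm::stopwords("en")) %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(10)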
#Bigrams Chart
topBigrams %>% gt() %>%
tab_header(title = md("**Top Bigrams by Count: Blogs, News, & Twitter Datasets**"),
subtitle = md("**Using a random sample of n = 100,000 lines from each dataset**")) %>%
tab_spanner(label = md("**Blogs**"),
columns = 1:3) %>%
tab_spanner(label = md("**News**"),
columns = 4:6) %>%
tab_spanner(label = md("**Twitter**"),
columns = 7:9) %>%
cols_label(blogs_bigrams = md("**Bigram**"),
news_bigrams = md("**Bigram**"),
twit_bigrams = md("**Bigram**"),
count1 = md("**Count**"),
count2 = md("**Count**"),
count3 = md("**Count**"),
percent_share1 = md("**% Share**"),
percent_share2 = md("**% Share**"),
percent_share3 = md("**% Share**")) %>%
fmt_percent(columns = c(3,6,9),
scale_values = F,
decimals = 1) %>%
fmt_integer(columns = c(2,5,8)) %>%
cols_align(align = "center",
columns = c(2,5,8)) %>%
tab_style(style = cell_borders(sides = "right",
weight = px(1.5)),
locations = cells_body(columns = c(3,6))) %>%
gt_color_rows(columns = c(3,6,9),
palette = "ggsci::green_material",
pal_type = "continuous",
domain = c(0,1.5))
**Top Bigrams by Count: Blogs, News, & Twitter Datasets**
Using a random sample of n = 100,000 lines from each dataset

| Blogs | | | News | | | Twitter | | |
|---|---|---|---|---|---|---|---|---|
| Bigram | Count | % Share | Bigram | Count | % Share | Bigram | Count | % Share |
| of the | 20,963 | 0.5% | of the | 18,571 | 0.6% | in the | 3,309 | 0.3% |
| in the | 17,081 | 0.4% | in the | 17,551 | 0.5% | for the | 3,114 | 0.3% |
| to the | 9,563 | 0.2% | to the | 8,238 | 0.2% | of the | 2,443 | 0.2% |
| on the | 8,319 | 0.2% | on the | 7,283 | 0.2% | on the | 2,168 | 0.2% |
| to be | 7,746 | 0.2% | for the | 6,749 | 0.2% | to be | 2,040 | 0.2% |
| and the | 6,553 | 0.2% | at the | 5,836 | 0.2% | thanks for | 1,785 | 0.2% |
| for the | 6,515 | 0.2% | and the | 5,201 | 0.2% | to the | 1,778 | 0.2% |
| i was | 5,572 | 0.1% | in a | 5,023 | 0.2% | at the | 1,558 | 0.1% |
| and i | 5,538 | 0.1% | to be | 4,684 | 0.1% | i love | 1,496 | 0.1% |
| at the | 5,375 | 0.1% | with the | 4,308 | 0.1% | if you | 1,443 | 0.1% |
Again, we see that many of the top bigrams are combinations of the same common function words we observed among the top unigrams. Interestingly, the Twitter dataset shows some slightly different bigrams such as "thanks for" and "i love", which to me reflects the more "personal" nature of social media platforms (i.e., opinions and emotions are expressed more frequently than in the comparatively impartial and formal tone of many news and blog writings).
Additionally, when observing the frequency of bigrams, one immediately notices that the percent share of the top ten bigrams is significantly smaller than the percent share of the top ten unigrams. This makes sense given how the tokenizers construct bigrams (and higher n-grams) in this context: an exact two-word (or longer) sequence is much rarer relative to the entire dataset than any individual word, because we are only counting occurrences of that exact sequence rather than of each word on its own (which inherently "covers" more of the dataset).
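A tiny example makes this concrete: a single five-word line yields five unigram tokens but only four bigram tokens, and each bigram is a far more specific (and therefore rarer) unit than any individual word.

#Illustration: one short line produces fewer, more specific bigrams than unigrams
exampleLine <- "thanks for the follow back"
tokenize_words(exampleLine)          #5 unigram tokens
tokenize_ngrams(exampleLine, n = 2)  #4 bigram tokens (e.g. "thanks for", "for the")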
#Trigrams Chart
topTrigrams %>% gt() %>%
tab_header(title = md("**Top Trigrams by Count: Blogs, News, & Twitter Datasets**"),
subtitle = md("**Using a random sample of n = 100,000 lines from each dataset**")) %>%
tab_spanner(label = md("**Blogs**"),
columns = 1:3) %>%
tab_spanner(label = md("**News**"),
columns = 4:6) %>%
tab_spanner(label = md("**Twitter**"),
columns = 7:9) %>%
cols_label(blogs_trigrams = md("**Bigram**"),
news_trigrams = md("**Bigram**"),
twit_trigrams = md("**Bigram**"),
count1 = md("**Count**"),
count2 = md("**Count**"),
count3 = md("**Count**"),
percent_share1 = md("**% Share**"),
percent_share2 = md("**% Share**"),
percent_share3 = md("**% Share**")) %>%
fmt_percent(columns = c(3,6,9),
scale_values = F,
decimals = 1) %>%
fmt_integer(columns = c(2,5,8)) %>%
cols_align(align = "center",
columns = c(2,5,8)) %>%
tab_style(style = cell_borders(sides = "right",
weight = px(1.5)),
locations = cells_body(columns = c(3,6))) %>%
gt_color_rows(columns = c(3,6,9),
palette = "ggsci::indigo_material",
pal_type = "continuous",
domain = c(0,1))
**Top Trigrams by Count: Blogs, News, & Twitter Datasets**
Using a random sample of n = 100,000 lines from each dataset

| Blogs | | | News | | | Twitter | | |
|---|---|---|---|---|---|---|---|---|
| Trigram | Count | % Share | Trigram | Count | % Share | Trigram | Count | % Share |
| one of the | 1,616 | 0.0% | one of the | 1,356 | 0.0% | thanks for the | 979 | 0.1% |
| a lot of | 1,348 | 0.0% | a lot of | 1,168 | 0.0% | looking forward to | 399 | 0.0% |
| out of the | 788 | 0.0% | as well as | 585 | 0.0% | thank you for | 391 | 0.0% |
| to be a | 788 | 0.0% | the end of | 578 | 0.0% | i love you | 362 | 0.0% |
| as well as | 758 | 0.0% | to be a | 558 | 0.0% | for the follow | 335 | 0.0% |
| some of the | 743 | 0.0% | according to the | 555 | 0.0% | can't wait to | 307 | 0.0% |
| it was a | 741 | 0.0% | part of the | 544 | 0.0% | i want to | 290 | 0.0% |
| the end of | 709 | 0.0% | in the first | 536 | 0.0% | going to be | 281 | 0.0% |
| be able to | 688 | 0.0% | out of the | 535 | 0.0% | i have a | 272 | 0.0% |
| a couple of | 672 | 0.0% | going to be | 519 | 0.0% | i need to | 271 | 0.0% |
Once again, many of the top trigrams are built from the same common function words. This time, the difference between the Twitter dataset and the blogs and news datasets is even more apparent. Here, we are able to see that the phrase "thanks for the follow" appears to be among the most common for Twitter (with the trigrams "thanks for the", "thank you for", and "for the follow" occupying three of the top ten spots).
We also see that the percent share of each top trigram is essentially 0%, with the exception of "thanks for the", which makes up 0.1% of all trigrams in the Twitter dataset. This again relates back to the point made earlier about bigrams.
The next part of this project involves creating the actual prediction model itself, which will then be used in the final Shiny app. To do this, I plan to pivot away from the tm and tokenizers packages and instead use the quanteda package, which provides many useful text processing and tokenizing functions that should help streamline the entire analysis and modeling process. First, we will most likely sample a small training set from the main datasets, at least for the initial model-building process (we don't want to bog down our algorithms, so it is important to strike a balance between dataset size and the loading/response times of our model). From there, we will convert each training set into a corpus, create tokens and n-grams (while removing punctuation, numbers, URLs, symbols, profanity, and other unnecessary words), and load it all into a document-feature matrix. Once the data is in that form, building the prediction model will be much easier. Additionally, I am planning to use the Katz back-off method to assign non-zero probabilities to less likely n-grams.
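To give a rough sense of what that preprocessing might look like, here is a preliminary sketch only; the exact arguments and the profanity list are placeholders rather than final choices:

#Preliminary sketch of the planned quanteda preprocessing pipeline
library(quanteda)
trainCorpus <- corpus(c(blogsSample, newsSample, twitSample))
trainTokens <- tokens(trainCorpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE,
                      remove_symbols = TRUE,
                      remove_url = TRUE)
#profanityList would be a character vector of terms to filter out (not defined here)
#trainTokens <- tokens_remove(trainTokens, pattern = profanityList)
trainTrigrams <- tokens_ngrams(trainTokens, n = 3, concatenator = " ")
trainDfm <- dfm(trainTrigrams)
topfeatures(trainDfm, 10)   #most frequent trigrams in the training sample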