Objective

The objective of this capstone project is first to understand how text prediction models work and then to create an app based on that model. This report is an exploratory analysis of the data sets provided. The Data Science Specialization is offered by Johns Hopkins University on Coursera; it comprises nine courses covering everything from the basics of R programming through machine learning.

Overview of the data

There are 3 datasets, blogs, Twitter posts, and news articles, for each of 4 languages: English, German, Russian, and French. I assume my audience is entirely English-speaking, so I will limit my model and app to just that language. Here are a few summary statistics about the 3 English datasets. No cleaning or transformations have been performed yet.

|Dataset    | Longest_Char| Num_Items| Object_size| Median_char|
|:----------|------------:|---------:|-----------:|-----------:|
|US Blogs   |        40833|    0.9 Mn|    248.5 Mb|         156|
|US Twitter |          140|   2.36 Mn|    301.4 Mb|          64|
|US News    |        11384|   1.01 Mn|    249.6 Mb|         185|

We can make a few observations from this preliminary view. The maximum length of a tweet is 140 characters, as expected, since that is the limit the platform imposes. Another observation is that these datasets are fairly large for working in local RAM. I'm not intending to run my scripts on a cloud server, so I'll take random samples of the data later in the process. Finally, it's surprising to see such a small median character count for a blog post; the typical blog entry is little more than the length of a tweet. There was some discussion about removing lines of text to anonymize the data, which could be the cause.

Sampling

I'll take a random sample of 10,000 items from each of the datasets. That sample size is approximately 1% of the Blogs and News datasets, but less than 0.5% of the Twitter dataset. I chose 10,000 because I felt it would give sufficient diversity of words without overburdening my processor. I'll also filter out profanity, using a published word list, when tokenizing the samples.

Word Counts

I'll use the tidytext package to do some analysis on the corpus. It makes working with the text straightforward and fits naturally into the tidy workflow. The package also comes with a set of stop words, which I exclude from the counts.

Observations

First off, I'm drawn to the numbers in the pairs. The top pair in the combined dataset was 1-2, and there were similar single-digit pairs in the individual datasets' top-20 lists. My first reaction was to remove these since they aren't 'words', but I ultimately concluded that I want to keep them in the dataset so that they are available to the final product and can show up as suggested next words.
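For reference, if I later change my mind about keeping them, purely numeric bigrams are easy to filter out. The snippet below is an illustrative sketch only, not part of the pipeline; the tiny bigrams table and its word1/word2 columns are stand-ins for the output of the bigram code in the Code section.

library(dplyr)

# Hypothetical stand-in; in the real pipeline word1/word2 come from unnest_tokens() + separate()
bigrams <- tibble(word1 = c("1", "happy", "3"),
                  word2 = c("2", "birthday", "4"))

# Keep only bigrams where at least one half is not purely numeric
bigrams %>%
  filter(!(grepl("^[0-9]+$", word1) & grepl("^[0-9]+$", word2)))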

The next observation is that none of the words in the network graph above are particularly surprising; they are all very common words, so their prominence in the chart is expected.

Next Steps

The next step in the capstone project will be to find the best way to store the data. In their current sampled form the datasets may be too large to store in a Shiny app and will need to be pared down further. Next, I'll build a predictive model from these n-grams, separating the data into training and test sets to evaluate the results. Finally, I'll build a Shiny app that I can publish publicly. Most likely, I will try to replicate the keyboard typing scenario that SwiftKey produces.
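To make that plan concrete, here is a rough sketch of the kind of bigram lookup the model might start from. The toy corpus, the 80/20 split, and the "pick the most frequent continuation" rule are all assumptions for illustration, not final design decisions.

library(dplyr)
library(tidytext)
library(tidyr)

# Toy corpus standing in for the combined samples (illustration only)
corpus <- tibble(text = c("thank you so much",
                          "so much fun",
                          "thank you for the follow"))

# Split the lines into training and test sets (the 80/20 split is an assumption)
set.seed(1234)
train_idx <- sample(nrow(corpus), floor(0.8 * nrow(corpus)))
train <- corpus[train_idx, ]
test  <- corpus[-train_idx, ]

# Build a bigram frequency table from the training lines
bigram_freq <- train %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

# Suggest up to k next words: the most frequent continuations of the previous word
predict_next <- function(prev_word, k = 3) {
  bigram_freq %>%
    filter(word1 == prev_word) %>%
    head(k) %>%
    pull(word2)
}

predict_next("thank")  # e.g. "you"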

Code

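# Load packages for data manipulation, tokenization, and plotting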
library(dplyr)
library(tidytext)
library(ggplot2)
library(tidyr)
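
# Read in the three English-language files; skipNul drops embedded nul characters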
USBlogs   <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-16LE", skipNul = TRUE)
USTwitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-16LE", skipNul = TRUE)
USNews    <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-16LE", skipNul = TRUE)
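
# Summary table: longest item, item count, in-memory size, and median item length per dataset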
sumstats<- data.frame(Dataset= c("US Blogs", 
                                "US Twitter", 
                                "US News"),
                     Longest_Char= c(max(nchar(USBlogs)),
                                     max(nchar(USTwitter)),
                                     max(nchar(USNews))),
                     Num_Items= paste0(round(c(length(USBlogs),
                                  length(USTwitter),
                                  length(USNews))/1000000,2)," Mn"),
                     Object_size= c(format(object.size(USBlogs),units="Mb"),
                                    format(object.size(USTwitter),units="Mb"),
                                    format(object.size(USNews),units="Mb")),
                     Median_char= c(median(nchar(USBlogs)),
                                    median(nchar(USTwitter)),
                                    median(nchar(USNews))))
knitr::kable(sumstats, align=c("l","r","r","r","r"))
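
# Draw a reproducible random sample of 10,000 items from each dataset, then free the full objects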
set.seed(1234)
size <- 10000

USBlogs_samp   <- USBlogs[sample(length(USBlogs), size)]
USTwitter_samp <- USTwitter[sample(length(USTwitter), size)]
USNews_samp    <- USNews[sample(length(USNews), size)]

rm(USBlogs, USTwitter, USNews)
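
# Read a published profanity word list to filter out later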
profanity <- read.table("http://www.bannedwordlist.com/lists/swearWords.txt",
                        col.names = "word",
                        stringsAsFactors = FALSE)
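
# US Blogs: tokenize into single words, drop stop words and profanity, plot the 20 most frequent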
data_frame(line=1:size, text=USBlogs_samp) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  anti_join(profanity) %>%
  count(word, sort=TRUE) %>%
  mutate(word=reorder(word, n)) %>%
  head(20) %>%
  ggplot(aes(word, n)) + 
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(title= "US Blogs: Count of Top 20 Words")
data_frame(line=1:size, text=USTwitter_samp) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  anti_join(profanity) %>%
  count(word, sort=TRUE) %>%
  mutate(word=reorder(word, n)) %>%
  head(20) %>%
  ggplot(aes(word, n)) + 
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(title= "US Twitter: Count of Top 20 Words")
data_frame(line=1:size, text=USBlogs_samp) %>%
  unnest_tokens(word, text, token="ngrams", n=2) %>%
  
  # split each bigram into its two words; drop stop words and profanity
  separate(word, c("word1","word2")) %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word1 %in% profanity$word,
         !word2 %in% profanity$word) %>% 
  mutate(word=paste(word1, word2)) %>%
  select(-word1, -word2) %>%
  # count and keep the top 20 bigrams by frequency
  count(word, sort=TRUE) %>%
  mutate(word=reorder(word, n)) %>%
  head(20) %>%
  
  # plot the results
  ggplot(aes(word, n)) + 
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(title= "US Blogs: Top 20 bigrams")
data_frame(line=1:size, text=USTwitter_samp) %>%
  unnest_tokens(word, text, token="ngrams", n=2) %>%
  
  # split each bigram into its two words; drop stop words and profanity
  separate(word, c("word1","word2")) %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word1 %in% profanity$word,
         !word2 %in% profanity$word) %>% 
  mutate(word=paste(word1, word2)) %>%
  select(-word1, -word2) %>%
  # count and keep the top 20 bigrams by frequency
  count(word, sort=TRUE) %>%
  mutate(word=reorder(word, n)) %>%
  head(20) %>%
  
  # plot the results
  ggplot(aes(word, n)) + 
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(title= "US Twitter: Top 20 bigrams")
library(ggraph)
library(igraph)

a <- grid::arrow(type = "closed", length = grid::unit(.15, "inches"))

graphdat<- data_frame(line=1:(size*3), text=c(USTwitter_samp, USBlogs_samp, USNews_samp)) %>%
  unnest_tokens(word, text, token="ngrams", n=2) %>%
  
  # split each bigram into its two words; drop stop words and profanity
  separate(word, c("word1","word2")) %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word1 %in% profanity$word,
         !word2 %in% profanity$word) %>% 
  # count bigram pairs and keep those that appear more than 20 times
  count(word1, word2, sort=TRUE) %>%
  filter(n>20) %>% 
  graph_from_data_frame()


ggraph(graphdat, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()