Introduction

This is a milestone report for the JHU data science capstone course on Coursera. The end goal of the capstone is to produce a Shiny app that accepts user text input and predicts the next word, as is common with SwiftKey and other mobile phone keyboards. Data from blog posts, news articles, and tweets were provided for use in training the model.

In this milestone report, I load the data, describe it, explore it, and create some n-gram data sets for later use. I do not yet filter for profanity, as discussed in the course content; that may be a later step.

Please note: much of the code for this report is hidden to discourage reuse within this course. There is enough information here to reconstruct it, for example by reading about the tidytext package.

Loading the Data

We’ll start by loading the packages we’ll use.

library(dplyr)
library(ggraph)
library(here)
library(igraph)
library(readr)
library(tidyr)
library(tidytext)

Next, we will acquire the data for our analysis. We check first whether the zip file already exists, to avoid downloading it again.

if (!file.exists(here("data", "Coursera-SwiftKey.zip"))) {
  fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  fileName <- here("data", "Coursera-SwiftKey.zip")
  download.file(url = fileURL, destfile = fileName, method = "curl")
  unzip(zipfile = fileName, exdir = here("data", "final"))
  rm(fileURL)
  rm(fileName)
}

Summary Data and Statistics

Let’s load in our three data sets and look at them. Here are the first few lines of each data set and some summary statistics.

## # A tibble: 6 x 2
##   value                                                                   corpus
##   <chr>                                                                   <chr> 
## 1 In the years thereafter, most of the Oil fields and platforms were nam~ blogs 
## 2 We love you Mr. Brown.                                                  blogs 
## 3 Chad has been awesome with the kids and holding down the fort while I ~ blogs 
## 4 so anyways, i am going to share some home decor inspiration that i hav~ blogs 
## 5 With graduation season right around the corner, Nancy has whipped up a~ blogs 
## 6 If you have an alternative argument, let's hear it! :)                  blogs
## # A tibble: 6 x 2
##   value                                                                   corpus
##   <chr>                                                                   <chr> 
## 1 He wasn't home alone, apparently.                                       news  
## 2 The St. Louis plant had to close. It would die of old age. Workers had~ news  
## 3 WSU's plans quickly became a hot topic on local online sites. Though m~ news  
## 4 The Alaimo Group of Mount Holly was up for a contract last fall to eva~ news  
## 5 And when it's often difficult to predict a law's impact, legislators s~ news  
## 6 There was a certain amount of scoffing going around a few years ago wh~ news
## # A tibble: 6 x 2
##   value                                                                   corpus
##   <chr>                                                                   <chr> 
## 1 How are you? Btw thanks for the RT. You gonna be in DC anytime soon? L~ twitt~
## 2 When you meet someone special... you'll know. Your heart will beat mor~ twitt~
## 3 they've decided its more fun if I don't.                                twitt~
## 4 So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like I~ twitt~
## 5 Words from a complete stranger! Made my birthday even better :)         twitt~
## 6 First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go C~ twitt~
Summary Statistics

Source    Word Count    Line Count
Blogs     37,546,253       899,288
News      34,762,395     1,010,242
Twitter   30,218,166     2,360,148
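The loading code itself is hidden, but a minimal sketch of how the three raw files might be read into tibbles with readr is shown below. The en_US_dir path and the readCorpus helper are illustrative placeholders, not the actual implementation.

# Sketch only: en_US_dir stands in for wherever the unzipped en_US text
# files ended up under data/final.
en_US_dir <- here("data", "final", "final", "en_US")

readCorpus <- function(path, label) {
  tibble(value = read_lines(path), corpus = label)
}

dataBlogs   <- readCorpus(file.path(en_US_dir, "en_US.blogs.txt"), "blogs")
dataNews    <- readCorpus(file.path(en_US_dir, "en_US.news.txt"), "news")
dataTwitter <- readCorpus(file.path(en_US_dir, "en_US.twitter.txt"), "twitter")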

Sampling and Cleaning the Data

Now let’s create a sample and clean it up for our analysis. We’ll combine the three data sets and draw a sample of 10 percent of the entries. In the initial work with this data, we found a lot of non-ASCII characters, so we’ll strip anything outside the printable ASCII range. We’ll also save the sample data set for our records.

dataFull <- bind_rows(dataBlogs, dataNews, dataTwitter)
rm(dataBlogs)
rm(dataNews)
rm(dataTwitter)

seedValue <- 113021
set.seed(seedValue)
dataSample <- dataFull %>% sample_n(nrow(dataFull) * 0.1)
rm(dataFull)

# keep only printable ASCII characters (0x20-0x7E); this drops emoji and other non-ASCII symbols
dataSample$value <- gsub("[^\x20-\x7E]", "", dataSample$value)

fileName <- paste0("dataSample_", seedValue, ".csv")
write_csv(dataSample, here("data", "proc", fileName))
rm(seedValue)
rm(fileName)

Exploring the Base Data

Let’s start by looking at the word frequencies from our sample. We will tokenize the data to pull out each word as its own observation. Note that the tokenization process removes punctuation and changes all letters to lowercase.
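The tokenization code is hidden; a minimal sketch using tidytext’s unnest_tokens, assuming dataSample still has its value and corpus columns, could look like this:

# Sketch only: unnest_tokens() lowercases the text and strips punctuation
# by default, giving one row per word alongside its corpus label.
data_n1 <- dataSample %>%
  unnest_tokens(word, value)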

data_n1 %>%
  count(corpus, word, sort = TRUE)
## # A tibble: 290,080 x 3
##    corpus  word       n
##    <chr>   <chr>  <int>
##  1 news    the   198103
##  2 blogs   the   185770
##  3 blogs   and   108932
##  4 blogs   to    107068
##  5 twitter the    93833
##  6 news    to     91617
##  7 blogs   a      90063
##  8 news    and    88490
##  9 news    a      88076
## 10 blogs   of     87932
## # ... with 290,070 more rows

As we can see, the most frequent words are “the,” “and,” “to,” and so on. We can remove those using tidytext’s stop_words.

data("stop_words")

data_n1 %>%
  anti_join(stop_words) %>%
  count(corpus, word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 287,940 x 3
##    corpus  word       n
##    <chr>   <chr>  <int>
##  1 twitter love   10792
##  2 blogs   time    9179
##  3 twitter day     9024
##  4 twitter rt      8994
##  5 twitter time    7598
##  6 twitter lol     7027
##  7 blogs   people  5880
##  8 news    time    5832
##  9 twitter 3       5647
## 10 twitter people  5198
## # ... with 287,930 more rows

Now we see that “love,” “time,” “day,” and other words are most frequent. Next, let’s drop the corpus grouping to get an overall look at our sample.

data_n1 %>%
  select(word) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 192,123 x 2
##    word       n
##    <chr>  <int>
##  1 time   22609
##  2 day    17376
##  3 love   16214
##  4 people 16001
##  5 3      10511
##  6 2      10490
##  7 1       9404
##  8 rt      9003
##  9 life    8884
## 10 home    8269
## # ... with 192,113 more rows

In graphical form, here are our top words.

data_n1 %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "Count") +
  coord_flip()
## Joining, by = "word"

We’ll retain data_n1 with counts for later, since it can be useful as a fallback prediction.

data_n1 <- data_n1 %>%
  count(corpus, word, sort = TRUE)
write_csv(data_n1, file=here("data", "proc", "data_n1.csv"))
rm(data_n1)

Creating N-grams and Exploring the Data

Next, we will create N-grams from the data. First, we will create pairs of adjacent words, called bigrams. Again, we’ll save this processed data for later use, since it takes time to compute.
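The bigram code is hidden; a minimal sketch of one way data_n2 could be built with unnest_tokens, separate, and count follows. The output file name is illustrative.

# Sketch only: split the sample into two-word sequences, then count them.
data_n2 <- dataSample %>%
  unnest_tokens(bigram, value, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

write_csv(data_n2, here("data", "proc", "data_n2.csv"))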

data_n2
## # A tibble: 2,732,626 x 3
##    word1 word2     n
##    <chr> <chr> <dbl>
##  1 of    the   43066
##  2 in    the   40862
##  3 to    the   21350
##  4 for   the   20070
##  5 on    the   19818
##  6 to    be    16364
##  7 at    the   14353
##  8 and   the   12633
##  9 in    a     11768
## 10 with  the   10657
## # ... with 2,732,616 more rows

As we can see, these are mostly common words: “of the,” “in the,” and “to the” are the most common pairs! This isn’t particularly interesting, but it is indicative of actual writing. We’ll save these pairs for later use.

Despite all of the bigrams built from common words, the data set still contains a lot of useful information for prediction. For example, what are the most common words to come after “blue” or before “asparagus”?

data_n2 %>% filter(word1=="blue")
## # A tibble: 671 x 3
##    word1 word2      n
##    <chr> <chr>  <dbl>
##  1 blue  and       68
##  2 blue  sky       34
##  3 blue  eyes      32
##  4 blue  cheese    29
##  5 blue  jays      29
##  6 blue  ribbon    28
##  7 blue  collar    27
##  8 blue  skies     21
##  9 blue  jeans     19
## 10 blue  line      16
## # ... with 661 more rows
data_n2 %>% filter(word2=="asparagus")
## # A tibble: 49 x 3
##    word1   word2         n
##    <chr>   <chr>     <dbl>
##  1 the     asparagus    17
##  2 and     asparagus     6
##  3 grilled asparagus     5
##  4 like    asparagus     3
##  5 of      asparagus     3
##  6 for     asparagus     2
##  7 roasted asparagus     2
##  8 shaved  asparagus     2
##  9 400     asparagus     1
## 10 add     asparagus     1
## # ... with 39 more rows

Next, to explore more distinctive bigrams, let’s remove the common words and see what we find. We’ll present the results graphically, showing only the top 25 bigrams.

bigram_graph <- data_n2 %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !is.na(word1)) %>%
  top_n(25, n) %>%
  graph_from_data_frame()

a <- grid::arrow(type = "closed", length = unit(0.07, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a,
                 end_cap = circle(0.03, 'inches')) +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

Finally, we’ll finish this milestone report with some cleanup and then the creation of trigrams and quadgrams.
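As before, the n-gram code is hidden; a sketch along the same lines as the bigram step, with illustrative column names, might be:

# Sketch only: trigrams and quadgrams built the same way as the bigrams.
data_n3 <- dataSample %>%
  unnest_tokens(trigram, value, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  count(word1, word2, word3, sort = TRUE)

data_n4 <- dataSample %>%
  unnest_tokens(quadgram, value, token = "ngrams", n = 4) %>%
  separate(quadgram, c("word1", "word2", "word3", "word4"), sep = " ") %>%
  count(word1, word2, word3, word4, sort = TRUE)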

data_n3
## # A tibble: 6,424,492 x 4
##    word1  word2 word3     n
##    <chr>  <chr> <chr> <dbl>
##  1 one    of    the    3429
##  2 a      lot   of     2984
##  3 thanks for   the    2346
##  4 to     be    a      1823
##  5 going  to    be     1782
##  6 i      want  to     1574
##  7 the    end   of     1508
##  8 it     was   a      1476
##  9 out    of    the    1445
## 10 as     well  as     1390
## # ... with 6,424,482 more rows
data_n4
## # A tibble: 8,158,596 x 5
##    word1  word2 word3 word4      n
##    <chr>  <chr> <chr> <chr>  <dbl>
##  1 the    end   of    the      790
##  2 at     the   end   of       688
##  3 the    rest  of    the      645
##  4 thanks for   the   follow   612
##  5 for    the   first time     609
##  6 at     the   same  time     519
##  7 is     going to    be       476
##  8 one    of    the   most     413
##  9 is     one   of    the      410
## 10 when   it    comes to       394
## # ... with 8,158,586 more rows

Initial Predictive Approach

My approach at the time of this report is to make text predictions using n-grams of 4, 3, 2, and 1 words, in that order of preference. The app will tokenize the user’s entry, take the last three words, and compare them to the n-grams. If the last three words the user enters appear in the quadgram (n = 4) data, the model will predict the fourth word. If not, but the last two words appear in the trigram data, the model will predict the third word. Failing that, if the last word appears in the bigram data, the model will predict the second word. If the bigram lookup also fails, the model will suggest the most common word from the sample. The model may introduce some degree of randomness, but that is to be determined later. For efficiency, I will use the pre-processed n-gram data sets saved above. Depending on whether I build in randomness or simply take the top choice, I will keep either the single most frequent completion for each prefix or a small sample of completions. I am also undecided whether it will be more efficient to keep only the highest-order n-gram data set and back off within it using the last n-1 words, or to keep separate data sets for each order and step through them in sequence.
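To make the approach concrete, here is a rough sketch of the planned backoff lookup. It is not the final app code: it assumes the n-gram tables created above (data_n2 through data_n4) are already loaded, and the predictNextWord helper, its crude tokenization, and its default fallback word are illustrative.

# Rough sketch of the planned backoff, not the final implementation.
# Assumes data_n2, data_n3, and data_n4 are in memory; topWord defaults to
# "the", the most frequent word overall in this sample.
predictNextWord <- function(input, topWord = "the") {
  words <- unlist(strsplit(tolower(input), "[^a-z']+"))
  words <- tail(words[words != ""], 3)

  if (length(words) >= 3) {
    hit <- data_n4 %>%
      filter(word1 == words[1], word2 == words[2], word3 == words[3]) %>%
      arrange(desc(n)) %>% slice(1)
    if (nrow(hit) == 1) return(hit$word4)
  }
  if (length(words) >= 2) {
    hit <- data_n3 %>%
      filter(word1 == words[length(words) - 1], word2 == words[length(words)]) %>%
      arrange(desc(n)) %>% slice(1)
    if (nrow(hit) == 1) return(hit$word3)
  }
  if (length(words) >= 1) {
    hit <- data_n2 %>%
      filter(word1 == words[length(words)]) %>%
      arrange(desc(n)) %>% slice(1)
    if (nrow(hit) == 1) return(hit$word2)
  }
  topWord
}

# Example call (hypothetical): predictNextWord("thanks for the")

In the eventual Shiny app, the pre-processed tables would be loaded once at startup and a function along these lines called on the user’s input.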