This is a milestone report for the JHU data science capstone course on Coursera. The end goal of the capstone is to produce a Shiny app that accepts user text input and predicts the next word, as is common with SwiftKey and other mobile phone keyboards. Data from blog posts, news articles, and tweets were provided for use in training the model.
In this milestone report, I load the data, describe it, explore it, and create some n-gram data sets for later use. I do not yet filter for profanity, which was discussed in the course content; that may come in a later step.
Please note: much of the code for this report is hidden to discourage reuse within this course. There is still enough information here for someone to reproduce the work, for example by reading about the tidytext package.
We’ll start by loading the packages we’ll use.
library(dplyr)
library(ggraph)
library(here)
library(igraph)
library(readr)
library(tidyr)
library(tidytext)
Next, we will acquire the data for our analysis. We will first check whether the files already exist, to avoid repeating this step.
if (!file.exists(here("data", "Coursera-SwiftKey.zip"))) {
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fileName <- here("data", "Coursera-SwiftKey.zip")
download.file(url = fileURL, destfile = fileName, method = "curl")
unzip(zipfile = here("data", "Coursera-SwiftKey.zip"), exdir = here("data",
"final"))
rm(fileURL)
rm(fileName)
}
Let’s load in our three data sets and look at them. Here are the first few lines of each data set and some summary statistics.
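The loading code is hidden, but a minimal sketch is below. It assumes readr’s read_lines() and the en_US text files from the zip; the exact paths are an assumption and depend on where the archive extracted.
# Adjust the paths to wherever the zip extracted the en_US files
dataBlogs <- tibble(value = read_lines(here("data", "final", "en_US", "en_US.blogs.txt")),
                    corpus = "blogs")
dataNews <- tibble(value = read_lines(here("data", "final", "en_US", "en_US.news.txt")),
                   corpus = "news")
dataTwitter <- tibble(value = read_lines(here("data", "final", "en_US", "en_US.twitter.txt")),
                      corpus = "twitter")
head(dataBlogs)
head(dataNews)
head(dataTwitter)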
## # A tibble: 6 x 2
## value corpus
## <chr> <chr>
## 1 In the years thereafter, most of the Oil fields and platforms were nam~ blogs
## 2 We love you Mr. Brown. blogs
## 3 Chad has been awesome with the kids and holding down the fort while I ~ blogs
## 4 so anyways, i am going to share some home decor inspiration that i hav~ blogs
## 5 With graduation season right around the corner, Nancy has whipped up a~ blogs
## 6 If you have an alternative argument, let's hear it! :) blogs
## # A tibble: 6 x 2
## value corpus
## <chr> <chr>
## 1 He wasn't home alone, apparently. news
## 2 The St. Louis plant had to close. It would die of old age. Workers had~ news
## 3 WSU's plans quickly became a hot topic on local online sites. Though m~ news
## 4 The Alaimo Group of Mount Holly was up for a contract last fall to eva~ news
## 5 And when it's often difficult to predict a law's impact, legislators s~ news
## 6 There was a certain amount of scoffing going around a few years ago wh~ news
## # A tibble: 6 x 2
## value corpus
## <chr> <chr>
## 1 How are you? Btw thanks for the RT. You gonna be in DC anytime soon? L~ twitt~
## 2 When you meet someone special... you'll know. Your heart will beat mor~ twitt~
## 3 they've decided its more fun if I don't. twitt~
## 4 So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like I~ twitt~
## 5 Words from a complete stranger! Made my birthday even better :) twitt~
## 6 First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go C~ twitt~
| Source | Word Count | Line Count |
|---|---|---|
| Blogs | 37546253 | 899288 |
| News | 34762395 | 1010242 |
| Twitter | 30218166 | 2360148 |
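The code behind these counts is also hidden; one possible approach is sketched below, counting words with a simple whitespace split. The exact counting method used above is an assumption, so the numbers may differ slightly.
# Approximate word count: split each line on whitespace and sum the token counts
countWords <- function(data) {
  sum(lengths(strsplit(data$value, "\\s+")))
}
tibble(Source = c("Blogs", "News", "Twitter"),
       WordCount = c(countWords(dataBlogs), countWords(dataNews), countWords(dataTwitter)),
       LineCount = c(nrow(dataBlogs), nrow(dataNews), nrow(dataTwitter)))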
Now let’s create a sample and clean it up for our analysis. We’ll combine the three data sets and take a sample with 10 percent of the entries. In the initial work with this data, we found a lot of non-ASCII characters, so we’ll remove those. We’ll also save the sample data set for our records.
dataFull <- bind_rows(dataBlogs, dataNews, dataTwitter)
rm(dataBlogs)
rm(dataNews)
rm(dataTwitter)
seedValue <- 113021
set.seed(seedValue)
dataSample <- dataFull %>% sample_n(nrow(dataFull) * 0.1)
rm(dataFull)
dataSample$value <- gsub("[^\x20-\x7E]", "", dataSample$value)
fileName <- paste0("dataSample_", seedValue, ".csv")
write_csv(dataSample, here("data", "proc", fileName))
rm(seedValue)
rm(fileName)
Let’s start by looking at the word frequencies from our sample. We will tokenize the data to pull out each word as its own observation. Note that the tokenization process removes punctuation and changes all letters to lowercase.
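The tokenization code is hidden; a minimal sketch using tidytext’s unnest_tokens() is below. It keeps the corpus column alongside each word, which is what the counts that follow rely on.
# One word per row; punctuation is stripped and text is lowercased by default
data_n1 <- dataSample %>%
  unnest_tokens(word, value)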
data_n1 %>%
count(corpus, word, sort = TRUE)
## # A tibble: 290,080 x 3
## corpus word n
## <chr> <chr> <int>
## 1 news the 198103
## 2 blogs the 185770
## 3 blogs and 108932
## 4 blogs to 107068
## 5 twitter the 93833
## 6 news to 91617
## 7 blogs a 90063
## 8 news and 88490
## 9 news a 88076
## 10 blogs of 87932
## # ... with 290,070 more rows
As we can see, the most frequent words are “the,” “and,” “to,” and so on. We can remove those using tidytext’s stop_words.
data("stop_words")
data_n1 %>%
anti_join(stop_words) %>%
count(corpus, word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 287,940 x 3
## corpus word n
## <chr> <chr> <int>
## 1 twitter love 10792
## 2 blogs time 9179
## 3 twitter day 9024
## 4 twitter rt 8994
## 5 twitter time 7598
## 6 twitter lol 7027
## 7 blogs people 5880
## 8 news time 5832
## 9 twitter 3 5647
## 10 twitter people 5198
## # ... with 287,930 more rows
Now we see that “love,” “time,” “day,” and other words are most frequent. Next, let’s drop the corpus column to get an overall look at our sample.
data_n1 %>%
select(word) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 192,123 x 2
## word n
## <chr> <int>
## 1 time 22609
## 2 day 17376
## 3 love 16214
## 4 people 16001
## 5 3 10511
## 6 2 10490
## 7 1 9404
## 8 rt 9003
## 9 life 8884
## 10 home 8269
## # ... with 192,113 more rows
In graphical form, here are the top 20 words from our sample.
data_n1 %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
top_n(20, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "Count") +
coord_flip()
## Joining, by = "word"
We’ll retain data_n1 with counts for later, since it can be useful as a fallback prediction.
data_n1 <- data_n1 %>%
count(corpus, word, sort = TRUE)
write_csv(data_n1, file=here("data", "proc", "data_n1.csv"))
rm(data_n1)
Next, we will create N-grams from the data. First, we will create pairs of two words, called bigrams. Again, we’ll save this processed data for later use since it takes processing time.
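The bigram code is hidden; a sketch is below. The token = "ngrams" tokenizer, the separate() step, and the output file name are assumptions based on the column names shown in the results.
data_n2 <- dataSample %>%
  unnest_tokens(bigram, value, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)
write_csv(data_n2, here("data", "proc", "data_n2.csv"))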
data_n2
## # A tibble: 2,732,626 x 3
## word1 word2 n
## <chr> <chr> <dbl>
## 1 of the 43066
## 2 in the 40862
## 3 to the 21350
## 4 for the 20070
## 5 on the 19818
## 6 to be 16364
## 7 at the 14353
## 8 and the 12633
## 9 in a 11768
## 10 with the 10657
## # ... with 2,732,616 more rows
As we can see, these are mostly pairs of common words. “of the,” “in the,” and “to the” are the most common pairs! This isn’t particularly interesting, but it is indicative of actual writing. We’ll save these pairs for later use.
Despite all of the bigrams built from common words, the data set still contains a lot of useful information for prediction. For example, what are the most common words to come after “blue” or before “asparagus”?
data_n2 %>% filter(word1=="blue")
## # A tibble: 671 x 3
## word1 word2 n
## <chr> <chr> <dbl>
## 1 blue and 68
## 2 blue sky 34
## 3 blue eyes 32
## 4 blue cheese 29
## 5 blue jays 29
## 6 blue ribbon 28
## 7 blue collar 27
## 8 blue skies 21
## 9 blue jeans 19
## 10 blue line 16
## # ... with 661 more rows
data_n2 %>% filter(word2=="asparagus")
## # A tibble: 49 x 3
## word1 word2 n
## <chr> <chr> <dbl>
## 1 the asparagus 17
## 2 and asparagus 6
## 3 grilled asparagus 5
## 4 like asparagus 3
## 5 of asparagus 3
## 6 for asparagus 2
## 7 roasted asparagus 2
## 8 shaved asparagus 2
## 9 400 asparagus 1
## 10 add asparagus 1
## # ... with 39 more rows
Next, to explore unique bigrams, let’s remove the common words and see what we find. We’ll present the results graphically. Note that we are only showing the top 25 bigrams.
bigram_graph <- data_n2 %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!is.na(word1)) %>%
top_n(25, n) %>%
graph_from_data_frame()
a <- grid::arrow(type = "closed", length = unit(0.07, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a,
end_cap = circle(0.03, 'inches')) +
geom_node_point(size = 3) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Finally, we’ll finish this milestone report with some cleanup and then the creation of trigrams and quadgrams.
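As before, the code is hidden; the trigram and quadgram steps likely mirror the bigram sketch above, with the file names again being assumptions.
data_n3 <- dataSample %>%
  unnest_tokens(trigram, value, token = "ngrams", n = 3) %>%
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
  count(word1, word2, word3, sort = TRUE)
data_n4 <- dataSample %>%
  unnest_tokens(quadgram, value, token = "ngrams", n = 4) %>%
  separate(quadgram, into = c("word1", "word2", "word3", "word4"), sep = " ") %>%
  count(word1, word2, word3, word4, sort = TRUE)
write_csv(data_n3, here("data", "proc", "data_n3.csv"))
write_csv(data_n4, here("data", "proc", "data_n4.csv"))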
data_n3
## # A tibble: 6,424,492 x 4
## word1 word2 word3 n
## <chr> <chr> <chr> <dbl>
## 1 one of the 3429
## 2 a lot of 2984
## 3 thanks for the 2346
## 4 to be a 1823
## 5 going to be 1782
## 6 i want to 1574
## 7 the end of 1508
## 8 it was a 1476
## 9 out of the 1445
## 10 as well as 1390
## # ... with 6,424,482 more rows
data_n4
## # A tibble: 8,158,596 x 5
## word1 word2 word3 word4 n
## <chr> <chr> <chr> <chr> <dbl>
## 1 the end of the 790
## 2 at the end of 688
## 3 the rest of the 645
## 4 thanks for the follow 612
## 5 for the first time 609
## 6 at the same time 519
## 7 is going to be 476
## 8 one of the most 413
## 9 is one of the 410
## 10 when it comes to 394
## # ... with 8,158,586 more rows
My approach at the time of this report is to make text predictions using n-grams of 4, 3, 2, and 1 words, in that order of preference. The app will tokenize the user’s entry, read in the last three words, and compare them to the n-grams. If the last three words the user entered appear in the quadgram (n = 4) data, the model will predict the fourth word. If not, but the last two words appear in the trigram data, the model will predict the third word. Failing that, if the last word appears in the bigram data, the model will predict the second word; if the bigram lookup also fails, the model will suggest the most common word from the sample. The model may introduce some degree of randomness, but that is to be determined later. For efficiency, I will use the pre-processed n-gram data sets. Depending on whether I build in randomness or simply go with the top choice, I will either include just the top entry from each data set or a sample of entries. I am undecided at this point whether it will be more efficient to include only the highest-order n-gram data set and pull predictions from its last n-1 words, or to include multiple data sets and step through them in sequence.
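To make that backoff logic concrete, here is a rough sketch of the kind of prediction function I have in mind. It is not the final implementation: the function name, arguments, and the unigram fallback are placeholders, and it assumes the n-gram tables described above are already loaded.
predictNextWord <- function(input, data_n2, data_n3, data_n4, topWord = "the") {
  # Tokenize the input the same way the training data was tokenized
  words <- tibble(value = input) %>%
    unnest_tokens(word, value) %>%
    pull(word)
  len <- length(words)

  # Quadgram lookup on the last three words
  if (len >= 3) {
    hit <- data_n4 %>%
      filter(word1 == words[len - 2], word2 == words[len - 1], word3 == words[len]) %>%
      arrange(desc(n)) %>%
      head(1)
    if (nrow(hit) > 0) return(hit$word4)
  }

  # Back off to the trigram lookup on the last two words
  if (len >= 2) {
    hit <- data_n3 %>%
      filter(word1 == words[len - 1], word2 == words[len]) %>%
      arrange(desc(n)) %>%
      head(1)
    if (nrow(hit) > 0) return(hit$word3)
  }

  # Back off to the bigram lookup on the last word
  if (len >= 1) {
    hit <- data_n2 %>%
      filter(word1 == words[len]) %>%
      arrange(desc(n)) %>%
      head(1)
    if (nrow(hit) > 0) return(hit$word2)
  }

  # Final fallback: the most common word overall (placeholder default)
  topWord
}

# Based on the quadgram counts shown above, this would likely return "follow"
predictNextWord("thanks for the", data_n2, data_n3, data_n4)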