This is the Milestone Report for the Capstone Course of the Johns Hopkins Data Science Coursera specialization. The goal of the capstone project is to use a corpus of blogs, news articles, and tweets, provided by SwiftKey, to build a next-word prediction app.
In this Milestone Report, I give a brief summary of the raw data, demonstrate how I process the data to create a clean corpus, tokenize the corpus into n-grams, and explore how many words from the corpus are needed to get various levels of coverage.
The corpus that will be the basis for the model built for this project consists of three .txt files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
Before reading the data into R, I used the Unix command wc to get information about each of these files, including the number of lines, words, and characters. I also used file.size to find the size of each file on disk.
library(dplyr)
library(knitr)

my_files <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")
my_paths <- paste0("data/", my_files)
MBsize <- sapply(my_paths, function(x) round(file.size(x) / 1024^2, 2))
MBsize <- data.frame(file = my_files, "size (MB)" = MBsize, check.names = FALSE)
system("wc data/*.txt > wc_out.txt")
wc <- read.csv("wc_out.txt", sep = "", header = FALSE)
names(wc) <- c("lines", "words", "characters", "file")
wc <- filter(wc, file %in% my_paths)
wc$file <- gsub("data/", "", wc$file)
wc <- select(wc, file, lines, words, characters)
wc <- left_join(wc, MBsize)
kable(wc)
| file | lines | words | characters | size (MB) |
|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334690 | 210160014 | 200.42 |
| en_US.news.txt | 1010242 | 34372720 | 205811889 | 196.28 |
| en_US.twitter.txt | 2360148 | 30374206 | 167105338 | 159.36 |
Because of the large amount of text in these files, it would be impractical to use their entire contents to train a model. In fact, as noted in the course material, it is possible to build a serviceable model using a relatively small fraction of the total available data.
To perform the sampling, I wrote a function read_prob (see Appendix) which uses rbinom to randomly choose a specified percentage of lines from each file to read into R. I used this function to sample 0.5% of the lines in the Twitter file and 1% each of the blogs and news files. I will use these samples throughout this exploratory analysis.
set.seed(722)
tweets05 <- read_prob("en_US.twitter.txt", 0.005)
head(tweets05)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
blogs1 <- read_prob("en_US.blogs.txt", 0.01)
head(blogs1)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [6] "If you have an alternative argument, let's hear it! :)"
news1 <- read_prob("en_US.news.txt", 0.01)
head(news1)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
These glimpses of the raw data show several features that will need to be addressed before building a next-word prediction model. For example, the Twitter data contains emoticons, which should not be included in the final dictionary for the model. Capitalization is inconsistent, so all characters should be converted to lowercase before we proceed. Punctuation will also need to be removed so that the model does not treat punctuation marks as words to predict.
The first step in processing the sampled data is to transform the character vectors into a corpus using the tm package. In creating the corpus, I am removing numbers, punctuation, and extra white space, in addition to converting all characters to lowercase.
library(tm)
clean_corpus <- function(char_vec) {
  temp <- VCorpus(VectorSource(char_vec))
  temp <- tm_map(temp, removeNumbers)
  temp <- tm_map(temp, stripWhitespace)
  temp <- tm_map(temp, removePunctuation)
  temp <- tm_map(temp, content_transformer(tolower))
  return(temp)
}
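One detail worth noting: punctuation-based emoticons such as “:)” should be stripped by removePunctuation, but emoji and other non-ASCII symbols would pass through the transformations above untouched. A possible extra cleaning step, not applied here (the remove_non_ascii helper below is hypothetical), would be to drop non-ASCII characters with iconv:
# Hypothetical extra step: strip emoji and other non-ASCII characters with iconv()
remove_non_ascii <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = ""))
# temp <- tm_map(temp, remove_non_ascii)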
I am creating separate corpuses for the tweets, blogs, and news sources because I would like to explore them separately for now.
tweets05_c <- clean_corpus(tweets05)
blogs1_c <- clean_corpus(blogs1)
news1_c <- clean_corpus(news1)
Now that the corpuses have been generated, the next step is to form tokenized versions of them. My function ngram_freq (see Appendix) takes as input a corpus and an integer n (1 for unigrams, 2 for bigrams, etc.) and returns a data frame with two variables: the n-grams and their frequencies in the corpus.
tweets05_unigrams <- ngram_freq(tweets05_c, 1)
tweets05_bigrams <- ngram_freq(tweets05_c, 2)
tweets05_trigrams <- ngram_freq(tweets05_c, 3)
blogs1_unigrams <- ngram_freq(blogs1_c, 1)
blogs1_bigrams <- ngram_freq(blogs1_c, 2)
blogs1_trigrams <- ngram_freq(blogs1_c, 3)
news1_unigrams <- ngram_freq(news1_c, 1)
news1_bigrams <- ngram_freq(news1_c, 2)
news1_trigrams <- ngram_freq(news1_c, 3)
I’m using the function freq_plot (see Appendix) to visualize the top 20 most frequent n-grams from each source.
library(ggplot2)
library(gridExtra)

p1_t <- freq_plot(tweets05_unigrams, "green3", "Tweets")
p1_b <- freq_plot(blogs1_unigrams, "violetred", "Blogs")
p1_n <- freq_plot(news1_unigrams, "dodgerblue", "News")
grid.arrange(p1_t, p1_b, p1_n, nrow = 1, left = "20 Most Frequent Unigrams")
p2_t <- freq_plot(tweets05_bigrams, "green3", "Tweets")
p2_b <- freq_plot(blogs1_bigrams, "violetred", "Blogs")
p2_n <- freq_plot(news1_bigrams, "dodgerblue", "News")
grid.arrange(p2_t, p2_b, p2_n, nrow = 1, left = "20 Most Frequent Bigrams")
p3_t <- freq_plot(tweets05_trigrams, "green3", "Tweets")
p3_b <- freq_plot(blogs1_trigrams, "violetred", "Blogs")
p3_n <- freq_plot(news1_trigrams, "dodgerblue", "News")
grid.arrange(p3_t, p3_b, p3_n, nrow = 1, left = "20 Most Frequent Trigrams")
Note that the Twitter data contains a high frequency of “for the rt” (rt = retweet), which I will go back and remove later because I do not want my model to predict that abbreviation.
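One possible way to do this (a minimal sketch, not yet part of the pipeline; the helper name drop_token is hypothetical) is to filter any n-gram containing “rt” as a standalone word out of the frequency tables produced by ngram_freq:
# Hypothetical helper: drop n-grams containing a given standalone token, e.g. "rt"
drop_token <- function(fm, token = "rt") {
  fm[!grepl(paste0("\\b", token, "\\b"), fm$ngram), ]
}
# e.g. tweets05_trigrams <- drop_token(tweets05_trigrams, "rt")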
From the distributions of n-grams shown above, it is apparent that a small number of very high-frequency unigrams dominates the unigram distribution. For bigrams this effect is less pronounced, and for trigrams an even larger number of high-frequency trigrams is needed to cover a considerable percentage of total occurrences in the corpus.
A more direct way to look at n-gram coverage is to plot cumulative frequency vs. fraction of n-grams. This way we can get an idea of how many of the low-frequency n-grams we can drop from our dictionary and still retain the desired level of coverage.
The function calc_coverage (see Appendix) takes the frequency-sorted data frame produced by ngram_freq and adds columns giving each n-gram's share of total occurrences, the cumulative share covered by the n-grams ranked at or above it, and the corresponding fraction of unique n-grams.
tweets05_unigrams <- calc_coverage(tweets05_unigrams)
tweets05_bigrams <- calc_coverage(tweets05_bigrams)
tweets05_trigrams <- calc_coverage(tweets05_trigrams)
blogs1_unigrams <- calc_coverage(blogs1_unigrams)
blogs1_bigrams <- calc_coverage(blogs1_bigrams)
blogs1_trigrams <- calc_coverage(blogs1_trigrams)
news1_unigrams <- calc_coverage(news1_unigrams)
news1_bigrams <- calc_coverage(news1_bigrams)
news1_trigrams <- calc_coverage(news1_trigrams)
p_t <- coverage_plot(tweets05_unigrams, tweets05_bigrams, tweets05_trigrams,
title = "Tweets (0.5% sample)")
p_b <- coverage_plot(blogs1_unigrams, blogs1_bigrams, blogs1_trigrams,
title = "Blogs (1% sample)")
p_n <- coverage_plot(news1_unigrams, news1_bigrams, news1_trigrams,
title = "News (1% sample)")
p_t
p_b
p_n
It is clear from these plots that while most of the corpus can be covered by a relatively small number of unigrams (the 25% highest-frequency unigrams in the blogs sample account for 90% of the total words in that sample), we would have to retain 80% of the unique bigrams, and nearly 90% of the unique trigrams, to reach 90% coverage.
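These thresholds can be read directly off the output of calc_coverage; a small helper along these lines (the name frac_for_coverage is hypothetical, and exact values will vary with the sample) makes the comparison explicit:
# Hypothetical helper: fraction of unique n-grams (ranked by frequency) needed
# to reach a target coverage level, using the cumperc and x columns from calc_coverage()
frac_for_coverage <- function(fm, target = 0.9) {
  fm$x[min(which(fm$cumperc >= target))]
}
frac_for_coverage(blogs1_unigrams)   # about 0.25 for the blogs unigrams
frac_for_coverage(blogs1_trigrams)   # close to 0.9 for the blogs trigrams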
Moving forward with building my model, these are some of the issues I will need to address:
Determining what portion of the unigrams, bigrams, and trigrams from each of the data sources to include in the dictionary. This decision will be informed by the exploratory analysis of coverage illustrated above. There will be a tradeoff between coverage and efficiency.
Exploring options for efficiently storing n-grams in R, for example using the data.table package for fast keyed lookup (a rough sketch follows this list).
Model building and testing, beginning with a simple backoff model, then exploring different smoothing algorithms and methods for handling unseen data.
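To illustrate the data.table idea (the object and column names below, such as bigram_dt, prefix, and prediction, are hypothetical and not part of the current code), a keyed table would allow fast lookup of candidate next words by prefix:
library(data.table)
# Split each bigram into its prefix (first word) and the word it predicts
bigram_dt <- as.data.table(blogs1_bigrams)
bigram_dt[, c("prefix", "prediction") := tstrsplit(as.character(ngram), " ", fixed = TRUE)]
setkey(bigram_dt, prefix)   # keyed for fast binary-search lookup
# Most frequent continuations of a given word:
head(bigram_dt["of"][order(-freq)], 5)
A backoff model would consult the trigram table first and fall back to a bigram table like this one (and finally to unigrams) when no match is found.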
library(R.utils)   # provides countLines()

read_prob <- function(file, p) {
  path <- paste0("data/", file)
  con <- file(path, "r")
  nLines <- countLines(path)
  temp <- character(0)
  for (i in seq_len(nLines)) {
    # read the next line and keep it with probability p
    current <- readLines(con, 1)
    if (rbinom(1, 1, p) == 1) {
      temp <- c(temp, current)
    }
  }
  close(con)
  return(temp)
}
ngram_freq <- function(corpus, n) {
  # ngrams() and words() come from the NLP package, which is attached with tm
  ngram_tokenizer <- function(x)
    unlist(lapply(ngrams(words(x), n), paste, collapse = " "), use.names = FALSE)
  tdm <- as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = ngram_tokenizer)))
  freq <- rowSums(tdm)
  fm <- data.frame(ngram = names(freq), freq = freq)
  fm <- fm %>% arrange(desc(freq))
  return(fm)
}
freq_plot <- function(fm, fill = "gray", title = "") {
  fm %>%
    head(20) %>%
    ggplot(aes(x = reorder(ngram, freq), y = freq)) +
    geom_col(fill = fill) +
    theme_bw() +
    coord_flip() +
    labs(x = "", y = "Frequency", title = title)
}
calc_coverage <- function(fm) {
  fm$percentage <- fm$freq / sum(fm$freq)  # share of total occurrences
  fm$cumperc <- cumsum(fm$percentage)      # cumulative coverage
  fm$row <- 1:nrow(fm)
  fm$x <- fm$row / nrow(fm)                # fraction of unique n-grams
  return(fm)
}
coverage_plot <- function(fm_u, fm_b, fm_t, title = "") {
  fm_u$ngram <- "unigrams"
  fm_b$ngram <- "bigrams"
  fm_t$ngram <- "trigrams"
  fm <- rbind(fm_u, fm_b, fm_t)
  fm %>%
    ggplot(aes(x = x, y = cumperc)) +
    geom_hline(yintercept = 0.5, col = "gray", lty = 2) +
    geom_hline(yintercept = 0.9, col = "gray", lty = 2) +
    geom_path(aes(col = ngram)) +
    theme_bw() +
    labs(x = "Fraction of n-grams", y = "Coverage", title = title)
}