The goal of this milestone report is to explore the major features of the text data given to us for the Coursera Data Science Capstone through Johns Hopkins University. The project is sponsored by SwiftKey. The end-goal is to create text-prediction application with R’s Shiny package that predicts words using a natural language processing model.
The first step here is to get an idea of what kinds of pre-processing will be necessary to prepare the data for creating the model. Specifically, certain kinds of characters and words need to be removed and/or modified to aid in prediction accuracy. Finally, I need to create list of single words, and two/three word phrases to see which occur most frequently.
The following R packages are necessary for this analysis.
library(tm)
library(quanteda)
library(dplyr)
library(ggplot2)
library(stringr)
library(pander)
The following downloads the dataset and unzips it into the current working directory. Next, the files of interest are read into R with the readLines function. The data is available here.
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists(basename(fileURL))){
download.file(fileURL, basename(fileURL))
unzip(basename(fileURL))
}
blog <- readLines(con = "./final/en_US/en_US.blogs.txt", encoding= "UTF-8", skipNul = T)
news <- readLines(con = "./final/en_US/en_US.news.txt", encoding= "UTF-8", skipNul = T)
twit <- readLines(con = "./final/en_US/en_US.twitter.txt", encoding= "UTF-8", skipNul = T)
Now that the dataset has been obtained and the Corpus created, it needs to be split into a training set, development set, and test set. Since the files contain a large number of documents, a small sample of the training set will be created for exploratory analysis. The ratio will be 60% training, 20% development, and 20% test. The small training dataset for exploratory analysis will be 20% of the training set, or 12% of the total dataset.
#Randomly permute the order of lines in the files for splitting
set.seed(310)
blog <- blog[sample(seq(length(blog)))]
news <- news[sample(seq(length(news)))]
twit <- twit[sample(seq(length(twit)))]
#Split the blog text
n <- length(blog)
train.blog <- blog[1:floor(n*0.6)]
dev.blog <- blog[(floor(n*0.6)+1):floor(n*0.8)]
test.blog <- blog[(floor(n*0.8)+1):n]
#Split the news text
n <- length(news)
train.news <- news[1:floor(n*0.6)]
dev.news <- news[(floor(n*0.6)+1):floor(n*0.8)]
test.news <- news[(floor(n*0.8)+1):n]
#Split the twitter text
n <- length(twit)
train.twit <- twit[1:floor(n*0.6)]
dev.twit <- twit[(floor(n*0.6)+1):floor(n*0.8)]
test.twit <- twit[(floor(n*0.8)+1):n]
#Take a smaller sample of the training sets for exploratory analysis
set.seed(424)
n <- sample(seq(length(train.blog)))
small.blog <- train.blog[n[1:floor(length(n)*0.2)]]
n <- sample(seq(length(train.news)))
small.news <- train.news[n[1:floor(length(n)*0.2)]]
n <- sample(seq(length(train.twit)))
small.twit <- train.twit[n[1:floor(length(n)*0.2)]]
Note: At this point, all the datasets are written to local files so that they can be loaded later. Code is not shown.
Here we can see the list of files created, their file size, and a rough estimate of the number of words contained in each file. To get the word estimate, the str_count function from the stringer package was used to count the number of character sequences separated by a space. Specifically, with the call: str_count(x, "\\S+").
As you can see, the total number of words in the three “small” files totals approximately 12 million, which should provide enough data for a representative exploratory analysis.
| fileName | fileSize | nLines | numWords | avgNumWords |
|---|---|---|---|---|
| en_US.blogs | 200.4 Mb | 899,288 | 37,334,131 | 41.5 |
| en_US.news | 196.3 Mb | 1,010,243 | 34,372,529 | 34 |
| en_US.twitter | 159.4 Mb | 2,360,148 | 30,373,583 | 12.9 |
| train.blog | 119.4 Mb | 539,572 | 22,387,983 | 41.5 |
| train.news | 117.6 Mb | 606,145 | 20,635,720 | 34 |
| train.twit | 96 Mb | 1,416,088 | 18,230,256 | 12.9 |
| dev.blog | 39.9 Mb | 179,858 | 7,484,601 | 41.6 |
| dev.news | 39.2 Mb | 202,049 | 6,881,963 | 34.1 |
| dev.twit | 32 Mb | 472,030 | 6,072,570 | 12.9 |
| test.blog | 39.8 Mb | 179,858 | 7,461,547 | 41.5 |
| test.news | 39.1 Mb | 202,049 | 6,854,846 | 33.9 |
| test.twit | 32 Mb | 472,030 | 6,070,757 | 12.9 |
| small.blog | 24 Mb | 107,914 | 4,504,186 | 41.7 |
| small.news | 23.5 Mb | 121,229 | 4,126,578 | 34 |
| small.twit | 19.2 Mb | 283,217 | 3,647,260 | 12.9 |
I will briefly describe the transformations I did to the Corpus and my reasons or doing so.
NOTE: Other common pre-processing steps such as stopword removal and stemming will not be done. Stopwords are the most commonly used words in a language and removing them will negatively affect prediction accuracy. Similarly, stemming, or removal of suffixes could have detrimental effects on prediction.
corp <- VCorpus(DirSource("data/small"))
myChars <- function(x, n=seq(x)) {
# x: a Corpus
# n: the elements of x for which characters will be returned
require(dplyr)
t <- character()
for(i in n){
t <- c(t, x[[i]][[1]])
}
t %>%
str_split("") %>%
sapply(function(x) x[-1]) %>%
unlist %>%
unique %>%
sort(dec=F)
}
chars <- myChars(corp)
print(chars, quote = F)
## [1] \003 \b \020 \035 ' -
## [11] ! " # $ % & ( ) * ,
## [21] . / : ; ? @ [ \\ ] ^
## [31] _ ` { | } ~ ¡ ¦ ¨ ¯
## [41] ´ ¸ ¿
## [51] ¢ £ ¤ ¥ + < =
## [61] > ± « » × ÷ § © ¬ ®
## [71] ° µ ¶ ·
## [81] 0 ¼ ½ ¾ 1 ¹ 2 ² 3
## [91] ³ 4 5 6 7 8 9 a A á
## [101] Á à À â Â ä Ä ã Ã å
## [111] Å æ Æ b B c C ç Ç d
## [121] D ð Ð e E é É è È ê
## [131] Ê ë f F g G h H i I
## [141] í Í ì î Î ï j J k K
## [151] l L m M n N ñ o O º
## [161] ó Ó ò ô ö Ö õ ø Ø
## [171] p P q Q r R s S ß
## [181] t T þ Þ u U ú Ú ù
## [191] û ü Ü v V w W x X y
## [201] Y ý ÿ z Z
There are lots of foreign characters and symbols that are unnecessary or harmful to prediction, so they need to be converted or deleted.
dat <- sapply(corp, function(row) iconv(row, "latin1", "ASCII", sub=""))
corp <- VCorpus(VectorSource(dat)); rm(dat)
chars <- myChars(corp)
print(chars, quote = F)
## [1] \003 \b \020 \035 ' - ! " # $ % & (
## [15] ) * , . / : ; ? @ [ \\ ] ^ _
## [29] ` { | } ~ + < = > 0 1 2 3 4
## [43] 5 6 7 8 9 a A b B c C d D e
## [57] E f F g G h H i I j J k K l
## [71] L m M n N o O p P q Q r R s
## [85] S t T u U v V w W x X y Y z
## [99] Z
Now, the number of characters has been drastically reduced and can be more easily dealt with in the following transformations.
swap <- content_transformer(function(x, from, to) gsub(from, to, x))
corp <- tm_map(corp, content_transformer(tolower))
# Remove profanity words
profanityWords <- readLines(con="data/profanityWords.txt", skipNul = T)
corp <- tm_map(corp, removeWords, profanityWords)
# Replace all foreign unicode character codes with a space
corp <- tm_map(corp, swap, "<.*>", " ")
# Delete all twitter-style hashtag references
corp <- tm_map(corp, swap, "#.*", " ")
# Delete website names
corp <- tm_map(corp, swap, "www\\..*", " ")
corp <- tm_map(corp, swap, ".*\\.com", " ")
# Replace all punctuation except EOS punctuation and apostrophe with a space
corp <- tm_map(corp, swap, "[^[:alnum:][:space:]\'\\.\\?!]", " ")
# Delete numbers with decimal places
corp <- tm_map(corp, swap, "[0-9]+\\.[0-9]+", "")
# Replace all instances of multiple EOS punctuation with one instance
corp <- tm_map(corp, swap, "([\\.\\?!]){2,}", ". ")
# Replace . ? ! with <EOS> tag
corp <- tm_map(corp, swap, "\\. |\\.$", " <EOS> ")
corp <- tm_map(corp, swap, "\\? |\\?$", " <EOS> ")
corp <- tm_map(corp, swap, "! |!$", " <EOS> ")
# Fix instances of probable accidental typo with EOS punctuation
corp <- tm_map(corp, swap, "[[:alnum:]]+\\?[[:alnum:]]+", " <EOS> ")
corp <- tm_map(corp, swap, "[[:alnum:]]+![[:alnum:]]+", " <EOS> ")
# Remove any extra ? !
corp <- tm_map(corp, swap, "!", " ")
corp <- tm_map(corp, swap, "\\?", " ")
# Convert very common occurence of u.s to US
corp <- tm_map(corp, swap, "u\\.s", "US")
corp <- tm_map(corp, swap, "\\.", "")
# Clean up leftover punctuation artifacts
corp <- tm_map(corp, swap, " 's", " ")
corp <- tm_map(corp, swap, " ' ", " ")
corp <- tm_map(corp, swap, "\\\\", " ")
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)
if(!dir.exists("./data/corp")) {dir.create("./data/corp")}
if(!dir.exists("./data/corp/small")) {dir.create("./data/corp/small")}
writeCorpus(corp, "./data/corp/small",
filenames = c("cleanSmallBlog", "cleanSmallNews", "cleanSmallTwit"))
After the transformations, I am ready to do some exploratory analysis to determine the most frequent unigrams, bigrams, and trigrams (sets of 1, 2, and 3 words that occur together).
Let’s first look and the list of characters in the corpus and a small sample of text from one of the documents to get a feel for how it looks.
rm(corp) # Reload the corpus from the new file to ensure changes are set
corp <- VCorpus(DirSource("data/corp/small"))
print(myChars(corp), quote = F); print(strwrap(corp[[2]]$content[c(4,6)]), quote=F)
## [1] ' < > a b c d e E f g h i j k l m n o O p q r s S t u U v w x y z
## [1] the storm prediction center has placed chicago in a severe weather
## [2] risk area wednesday and wednesday night and again thursday and
## [3] thursday night <EOS> this indicates thunderstorms may at times
## [4] grow severe <EOS>
## [5] vascular technologist lauren sherrill presses a hand held
## [6] ultrasound device which looks like a sturdy silver pen on the top
## [7] of his foot near his big toe <EOS> the ultrasound called a doppler
## [8] produces a static sound <EOS> then theres the thumping of
## [9] granthams pulse <EOS>
As you can see, the text has far fewer characters than the original document and only apostrophes for punctuation. This will make it so that all instances of the same word are matched regardless of accent marks and capitalization.
At this point, I realized the limitations of my computer’s ability to create document-term matrices using the tm package’s DocumentTermMatrix() function. Luckily I found the quanteda package, which performs many of the same functions as tm, but much faster, as much of its backend code is done in C++. (More on this in the Next Steps section below)
corp <- quanteda::corpus(corp)
freq_df <- function(x){
# This helper function takes a token output and outputs a sorted N-gram frequency table
fr <- sort(colSums(as.matrix(x)),decreasing = TRUE)
df <- data.frame(n_gram = names(fr), freq=fr, row.names = NULL)
return(df)
}
# Create N-grams and dataframes
uni <- dfm(tokens(corp, removeSymbols=TRUE), tolower=FALSE)
uni_freq <- freq_df(uni)
rm(uni)
uni_freq <- uni_freq[-1,]
biToks <- tokens_ngrams(tokens(corp, removeSymbols=TRUE), n=2L)
bi <- dfm(biToks, tolower=FALSE); rm(biToks)
bi_freq <- freq_df(bi)
rm(bi)
bi_freq <- bi_freq[-grep("EOS", bi_freq$n_gram),]
triToks <- tokens_ngrams(tokens(corp, removeSymbols=TRUE), n=3L)
tri <- dfm(triToks, tolower=FALSE); rm(triToks)
tri_freq <- freq_df(tri)
rm(tri)
tri_freq <- tri_freq[-grep("EOS", tri_freq$n_gram),]
Let’s take a quick look at some plots of the most frequent unigrams, bigrams, and trigrams. I removed all instances of phrases that included the
top40 <- function(df, title) {
df <- df[1:40,]
df$n_gram <- factor(df$n_gram, levels = df$n_gram[order(-df$freq)])
ggplot(df, aes(x = n_gram, y = freq)) +
geom_bar(stat = "identity", fill = "dodgerblue3", colour = "gray40") +
labs(title = title, x="N-Gram", y="Count") +
theme(axis.text.x = element_text(angle=60, size=12, hjust = 1),
axis.title = element_text(size=14, face="bold"),
plot.title = element_text(size=16, face="bold"))
}
top40(uni_freq, "40 Most Common Unigrams")
top40(bi_freq, "40 Most Common Bigrams")
top40(tri_freq, "40 Most Common Trigrams")
As I mentioned earlier, the quanteda package’s functions ran much faster on my machine. I will definitely use it for the remainder of the project.
I need to do the following:
- Carefully reconsider all the pre-processing steps I used to see if there is anything important I missed.
- Create dataframes of 1-,2-,3-, and possibly 4-grams based on the larger training dataset including word-relation frequencies.
- Look into using the filehash package to load parts of the data at a time to get around my machine’s RAM limitations.
- Create and test several prediction algorithms, with and without
- Find a way to use either Katz backoff or Kneser-Ney smoothing to deal with unknown words.
- The final goal is to create a Shiny app, with a simple user interface that provides reactive predictions as quickly and accurately as possible.