I have downloaded the data provided for the Capstone project by SwiftKey and cleaned null values from the texts so that each file can be read into RStudio. I then converted each of the lists of texts (for tweets, blogs, and news) into its own corpus, tokenized it, and built document-feature matrices (DFMs, a standard representation in natural language processing, NLP). I analyzed the counts of n-grams in the three corpora in order to inspect which combinations are most popular. These efforts are aimed at gaining a better understanding of the data we have been given.
After unzipping the SwiftKey files (en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt) onto my computer, I tried reading them in using the read() and read_lines() functions from the sourcetools package, only to discover that the files contained null values. So I had to do an initial cleaning pass by reading in the raw bytes and converting every nul byte (as.raw(0)) to a space (as.raw(0x20)).
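A minimal sketch of that cleaning step (the file path here is illustrative; the full version with the exact paths I used appears in the Appendix):

r <- readBin("en_US.twitter.txt", raw(), file.info("en_US.twitter.txt")$size)  ## read the file as raw bytes
r[r == as.raw(0)] <- as.raw(0x20)                                              ## replace nul bytes with spaces
writeBin(r, "en_US.twitter_clean.txt")                                         ## write out a clean copy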
Once I had saved clean versions of each of these files, I could read the data into R, convert all characters to lowercase to simplify text comparisons, and then strip out a couple dozen of the most common English-language profanities via a series of gsub() calls. Originally, I tried one of the profanity word lists I found online, but it was so large that it caused RStudio to crash, and many of its entries were not actually profanities, even if they referred to concepts that one would not necessarily discuss in polite company.
I created three corpora: tweetsCorpus, blogsCorpus, and newsCorpus, using the corpus() function in quanteda. The corpus is a basic format for storing text data for NLP research.
After creating the three corpora for tweets, blogs, and news, I needed to start understanding the data in each corpus, as well as what the statistics were for all corpora put together.
I discovered some basic stats about each corpus. Here are the text counts for each, where the text count is the number of documents (texts) in the corpus; each document corresponds to one line of the original file and may comprise multiple sentences. We also calculate the total number of words in each corpus: each token is a word, and we sum the tokens over all texts (rows of data) in the corpus. A short sketch of this computation follows the table.
| corpusName | textCount | wordCount |
|---|---|---|
| tweetsCorpus | 2358696 | 29349013 |
| blogsCorpus | 899288 | 36802103 |
| newsCorpus | 1010242 | 33399421 |
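These counts can also be obtained directly with quanteda's ndoc() and ntoken() helpers; a minimal sketch (the Appendix computes the word counts by summing token-vector lengths instead):

tokenTweet <- tokens(tweetsCorpus, remove_numbers = TRUE, remove_punct = TRUE)
ndoc(tweetsCorpus)       ## number of texts in the corpus (textCount)
sum(ntoken(tokenTweet))  ## total number of word tokens (wordCount)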
Next, I generated 2-grams and 3-grams for each corpus. This was done by creating a document-feature matrix (DFM) using the dfm() function in the quanteda package. In the process of creating the DFM, I removed typical English stopwords (e.g., “a”, “an”, “the”, and “and”) that would otherwise dominate the statistics for the 1-grams. I also removed numbers and punctuation so that those elements would not distract from identifying correlations between words.
Looking at the basic properties of the DFM tells us the number of documents (which we already calculated above) and the number of features. The features are the unique tokens, which can be words (1-grams), 2-grams, or 3-grams, depending on the value we set for ngrams in our call to dfm(). All of these have 417,684 features as identified by the DFM conversion.
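For reference, the Appendix builds these DFMs with the single-call dfm() interface from the quanteda version I was running; in more recent quanteda releases (3.x) the same matrices would be built by chaining the tokens functions. A sketch of that newer style, under the assumption of a 3.x installation:

toks <- tokens(tweetsCorpus, remove_numbers = TRUE, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))   ## drop common English stopwords
tweet2gramDfm <- dfm(tokens_ngrams(toks, n = 2))    ## 2-gram document-feature matrix
nfeat(tweet2gramDfm)                                ## number of unique features (2-grams)
ndoc(tweet2gramDfm)                                 ## number of documents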
We can plot the top 1-grams, 2-grams, and 3-grams in each corpus, as well as across all corpora combined, and see how they vary with the corpus type. First, the frequency distribution of 1-grams (words): the most popular words in the tweetsCorpus are clearly less formal (e.g., “u”, “rt”, and “lol”) than the most popular language in the blogs or news clips.
Here are the distributions of the top 20 most frequent 2-grams in the corpora. The blogs and news articles show more similar common 2-grams than the tweets. The top 2-grams in all cases are combinations of a preposition and “the”, which reflects some common structures in the English language.
Finally, the distribution of the top 20 most frequent 3-grams. Again, the blogs and news articles show more similar common 3-grams than the tweets do.
One thing I did learn was that while it was possible to generate n-grams in relatively short order on each individual corpus (tweets, blogs, and news), if I tried to combine all three corpora into a single corpus and then create the n-grams, RStudio would crash. So it made more sense to compute the n-grams for each individual corpus and then combine them later. However, this may not make a difference when it comes to putting together the dictionary to be used for the text-completion algorithm.
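The combining step for the composite plots simply sums the named frequency vectors produced by topfeatures() on each per-corpus DFM; this is the same tapply() approach used in the Appendix:

topAll <- c(topfeatures(tweet2gramDfm, 1000),
            topfeatures(blog2gramDfm, 1000),
            topfeatures(news2gramDfm, 1000))
topAllTotal <- tapply(topAll, names(topAll), sum)   ## sum counts of n-grams shared across corpora
head(sort(topAllTotal, decreasing = TRUE), 20)      ## top 20 combined 2-grams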
In my explorations, I also estimated that a dictionary of fewer than 20,000 words is enough to cover 90 percent of all 2-gram combinations.
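One way to arrive at this kind of coverage estimate (a sketch, not the exact code I used) is to sort the word frequencies from a DFM and find where the cumulative share of tokens first reaches 90 percent:

freqs <- topfeatures(tweetsdfm, n = nfeat(tweetsdfm))  ## all word frequencies, sorted descending
coverage <- cumsum(freqs) / sum(freqs)                 ## cumulative share of all tokens
which(coverage >= 0.9)[1]                              ## number of dictionary words for 90% coverage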
Finally, using quanteda enabled me to process the corpora in their entirety within a reasonable time (less than an hour to compile this knitr document from scratch) on my 2017-vintage laptop with 16 GB of memory. So using the full corpora is quite doable on a computer with a decent amount of RAM.
The corpora for tweets, blogs, and news display some basic differences in the types of language used, as demonstrated by the different frequencies for the most common 1-grams, 2-grams, and 3-grams in the respective DFMs. Based on my reading of the literature, a Markov chain n-gram model with backoff seems like the most productive direction to pursue. The dictionary would be assembled from the most frequently appearing 2-grams across the combined corpora, as calculated for the respective corpora and summarized in the composite graphs above. I am currently investigating how to implement this using the R package quanteda.
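To make the backoff idea concrete, here is a toy sketch (not my implementation; predictNext and its arguments are hypothetical names): it looks the previous word up in a table of 2-gram frequencies and, if nothing matches, falls back to the most frequent single word.

predictNext <- function(prevWord, bigramFreq, unigramFreq) {
  ## bigramFreq and unigramFreq are named frequency vectors such as those returned
  ## by topfeatures(); quanteda names 2-grams in the form "word1_word2"
  hits <- bigramFreq[grepl(paste0("^", prevWord, "_"), names(bigramFreq))]
  if (length(hits) > 0) {
    return(sub(paste0("^", prevWord, "_"), "", names(hits)[which.max(hits)]))
  }
  names(unigramFreq)[which.max(unigramFreq)]  ## back off to the most common word overall
}
## e.g. predictNext("of", bigram2Freq, word1Freq), with hypothetical frequency tables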
In case other students want to see how I arrived at my results, I am including the R code used to generate the results above in this Appendix.
library(sourcetools)  ## provides read() and read_lines()
library(quanteda)     ## provides corpus(), tokens(), dfm(), and topfeatures()
## Strip a couple dozen common profanities; input is assumed to be lower-case text.
## Patterns use "(^| )" to anchor at the start of the string or a preceding space.
removeProfanity <- function(str) {
  str <- gsub("[a-z]*fuck[a-z]*","",str)
  str <- gsub("[a-z]*fcuk[a-z]*","",str)
  str <- gsub("[a-z]*phuk[a-z]*","",str)
  str <- gsub("[a-z]*m[0o]f[0o][a-z]*","",str)
  str <- gsub("jerk[-]?off[a-z]*","",str)
  str <- gsub("(^| )a[s$*][s$*]+[a-z]*","",str)
  str <- gsub("cunt[a-z]*","",str)
  str <- gsub("(^| )c[o0]ck ","",str)
  str <- gsub("(^| )[ck]um ","",str)
  str <- gsub("b[i1]tch[a-z]*","",str)
  str <- gsub("(^| )arse[a-z]*","",str)
  str <- gsub("(^| )crap[a-z]*","",str)
  str <- gsub("(^| )n[i1]gg[a4]*","",str)
  str <- gsub("(^| )n[i1]gg[e3]r*","",str)
  str <- gsub("sh[i1]+t[a-z]*","",str)
  str <- gsub("twat[a-z]*","",str)
  str <- gsub("(^| )t[i1]ts?t?[a-z]*","",str)
  str <- gsub("(^| )wan[gk]?[a-z]*","",str)
  str <- gsub("(^| )b[o0][o0]+b[a-z]*","",str)
  str <- gsub("(^| )c[o0][o0]+n[a-z]*","",str)
  str
}
r = readBin("/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.twitter.txt", raw(), file.info("/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.twitter.txt")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.twitter_clean.txt")
tweets <- read_lines("/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.twitter_clean.txt")
tweets <- as.list(tolower(tweets))
tweets <- removeProfanity(tweets)
tweetsCorpus <- corpus(tweets)
r = readBin("/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.blogs.txt", raw(), file.info("/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.blogs.txt")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.blogs_clean.txt")
blogs <- read_lines("/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.blogs_clean.txt")
blogs <- as.list(tolower(blogs))
blogs <- removeProfanity(blogs)
blogsCorpus <- corpus(blogs)
r = readBin("/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.news.txt", raw(), file.info("/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.news.txt")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.news_clean.txt")
news <- read_lines("/Users/kristinabkemeier/DataScience/Capstone/final/en_US/en_US.news_clean.txt")
news <- as.list(tolower(news))
news <- removeProfanity(news)
newsCorpus <- corpus(news)
tokenTweet <- tokens(tweetsCorpus, remove_numbers = TRUE, remove_punct = TRUE)
tweetwc <- lapply(tokenTweet, length)
totalTweetWords <- sum(unlist(tweetwc))
tokenBlog <- tokens(blogsCorpus, remove_numbers = TRUE, remove_punct = TRUE)
blogwc <- lapply(tokenBlog, length)
totalBlogWords <- sum(unlist(blogwc))
tokenNews <- tokens(newsCorpus, remove_numbers = TRUE, remove_punct = TRUE)
newswc <- lapply(tokenNews, length)
totalNewsWords <- sum(unlist(newswc))
library(knitr)
corpusName <- c('tweetsCorpus','blogsCorpus','newsCorpus')
textCount <- c(length(tokenTweet),length(tokenBlog),length(tokenNews))
wordCount <- c(totalTweetWords,totalBlogWords,totalNewsWords)
infodf <- data.frame(corpusName, textCount, wordCount)
kable(infodf)
tweetsdfm <- dfm(tweetsCorpus, remove = stopwords("english"), remove_numbers = TRUE, remove_punct = TRUE)
##tweetsdfm ## This tells us the number of documents and number of features
tweet2gramDfm <- dfm(tweetsCorpus, stem = FALSE, remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"), ngrams=2)
##tweet2gramDfm
tweet3gramDfm <- dfm(tweetsCorpus, stem = FALSE, remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"), ngrams=3)
##tweet3gramDfm
blogsdfm <- dfm(blogsCorpus, remove = stopwords("english"), remove_numbers = TRUE, remove_punct = TRUE)
##blogsdfm ## This tells us the number of documents and number of features
blog2gramDfm <- dfm(blogsCorpus, stem = FALSE, remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"), ngrams=2)
##blog2gramDfm
blog3gramDfm <- dfm(blogsCorpus, stem = FALSE, remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"), ngrams=3)
##blog3gramDfm
newsdfm <- dfm(newsCorpus, remove = stopwords("english"), remove_numbers = TRUE, remove_punct = TRUE)
##newsdfm ## This tells us the number of documents and number of features
news2gramDfm <- dfm(newsCorpus, stem = FALSE, remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"), ngrams=2)
##news2gramDfm
news3gramDfm <- dfm(newsCorpus, stem = FALSE, remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"), ngrams=3)
##news3gramDfm
par(mfrow=c(2,2))
topTweets <- topfeatures(tweetsdfm, 20)
barplot(topTweets, main="Top 1-grams in tweetsCorpus",las=2)
topBlogs <- topfeatures(blogsdfm, 20)
barplot(topBlogs, main="Top 1-grams in blogsCorpus",las=2)
topNews <- topfeatures(newsdfm, 20)
barplot(topNews, main="Top 1-grams in newsCorpus",las=2)
topTweets1000 <- topfeatures(tweetsdfm, 1000)
topBlogs1000 <- topfeatures(blogsdfm, 1000)
topNews1000 <- topfeatures(newsdfm, 1000)
topAll <- c(topTweets1000, topBlogs1000, topNews1000)
topAllTotal <- tapply(topAll, names(topAll), sum)
topAllTotalSorted <- sort(topAllTotal, decreasing = TRUE)
barplot(topAllTotalSorted[1:20], main="Top 1-grams Across All Corpora",las=2)
par(mfrow=c(2,2))
topTweet2 <- topfeatures(tweet2gramDfm, 20)
barplot(topTweet2, main="Top 2-grams in tweetsCorpus",las=2)
topBlog2 <- topfeatures(blog2gramDfm, 20)
barplot(topBlog2, main="Top 2-grams in blogsCorpus",las=2)
topNews2 <- topfeatures(news2gramDfm, 20)
barplot(topNews2, main="Top 2-grams in newsCorpus",las=2)
topTweets1000 <- topfeatures(tweet2gramDfm, 1000)
topBlogs1000 <- topfeatures(blog2gramDfm, 1000)
topNews1000 <- topfeatures(news2gramDfm, 1000)
topAll <- c(topTweets1000, topBlogs1000, topNews1000)
topAllTotal <- tapply(topAll, names(topAll), sum)
topAllTotalSorted <- sort(topAllTotal, decreasing = TRUE)
barplot(topAllTotalSorted[1:20], main="Top 2-grams Across All Corpora",las=2)
par(mfrow=c(2,2))
topTweet3 <- topfeatures(tweet3gramDfm, 20)
barplot(topTweet3, main="Top 3-grams in tweetsCorpus",las=2)
topBlog3 <- topfeatures(blog3gramDfm, 20)
barplot(topBlog3, main="Top 3-grams in blogsCorpus",las=2)
topNews3 <- topfeatures(news3gramDfm, 20)
barplot(topNews3, main="Top 3-grams in newsCorpus",las=2)
topTweets1000 <- topfeatures(tweet3gramDfm, 1000)
topBlogs1000 <- topfeatures(blog3gramDfm, 1000)
topNews1000 <- topfeatures(news3gramDfm, 1000)
topAll <- c(topTweets1000, topBlogs1000, topNews1000)
topAllTotal <- tapply(topAll, names(topAll), sum)
topAllTotalSorted <- sort(topAllTotal, decreasing = TRUE)
barplot(topAllTotalSorted[1:20], main="Top 3-grams Across All Corpora",las=2)
dev.off()