knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(readtext)
library(quanteda)
library(wordcloud)
library(stringi)
library(ggplot2)
library(knitr)
pathBlogs<-c("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
pathTwit<-c("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
pathNews<-c("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
rawB<-readLines(pathBlogs, encoding = "UTF-8", skipNul = TRUE)
rawT<-readLines(pathTwit, encoding = "UTF-8", skipNul = TRUE)
rawN<-readLines(pathNews, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
Before diving deeper into natural language processing, an initial exploration of the underlying dataset is critical for determining our approach to restructuring the data and optimizing our eventual algorithm.
summ<-data.frame(c("Blog","News","Twitter"))
names(summ)[1]<-"source"
summ$size <- c(object.size(rawB)/1000,object.size(rawN)/1000,object.size(rawT)/1000)
summ$doccount<-c(length(rawB),length(rawN),length(rawT))
summ$wrdcount<-c(sum(stri_count_words(rawB)),sum(stri_count_words(rawN)),sum(stri_count_words(rawT)))
names(summ)<-c("Da ta Source","Size (in KB)","Doc. Count","Word Count")
kable(summ)
| Data Source | Size (in KB) | Doc. Count | Word Count |
|---|---|---|---|
| Blog | 267758.63 | 899288 | 37546239 |
| News | 20729.47 | 77259 | 2693898 |
| Twitter | 334484.99 | 2360148 | 30093413 |
Twitter, while the largest source in terms of in-memory size (roughly 334 MB per the table above), has a smaller word count than Blogs (about 30M words versus 37M), so it is apparent that the number of documents in a corpus also drives the memory allocation requirements.
News, smaller on both counts, is dwarfed by either of the other corpora.
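One caveat worth flagging (an assumption on my part, not something verified here): readLines() can stop early at an embedded control character when a file is opened in text mode, which might explain the comparatively small News line count. Reading through a binary connection would rule that out:
# Hedged workaround: read the news file through a binary connection so an
# embedded control character cannot truncate the read
conNews<-file(pathNews, open = "rb")
rawN<-readLines(conNews, encoding = "UTF-8", skipNul = TRUE)
close(conNews)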
A few steps were taken here to minimize the impact of cleaning on the underlying data set (in terms of how preceding and following tokens interact) and to preserve the underlying structure of the sentences.
These include:
To deal with explicit words, I used the list from http://www.bannedwordlist.com, manually edited to remove words that have multiple meanings or that I did not consider “bad words.”
# Read the (manually edited) list of explicit words
bad<-read.csv("swearWords.txt")
# Collapse the words into a single alternation pattern of the form \b(word1|word2|...)\b
badp<-""
for(val in bad){
  badp<-paste(badp,val,"|",sep="",collapse="")
}
badp<-substr(badp,1,nchar(badp)-1)   # drop the trailing "|"
badp<-paste("\\b(",badp,")\\b",sep="",collapse=" ")
I also designed a function that handles the manipulation from start to finish, so that we can randomly select chunks from any of the corpora and produce results quickly.
fullprocess <- function(text){
# Draw a 1% random sample of the documents and rebuild a corpus from it
sam<-sample(texts(text), size = floor(.01*length(text)), replace = FALSE) %>% corpus()
# Collapse runs of periods (e.g. ellipses) into a single period before segmenting
texts(sam)<-gsub("\\.{2,}",".",texts(sam))
# Split the sample into one document per sentence, segmenting after ".", "?", and "!"
sam<-corpus_segment(sam, pattern = ".", valuetype = "fixed",
                    pattern_position = "after", extract_pattern = FALSE) %>%
  corpus_segment(pattern = "?", valuetype = "fixed",
                 pattern_position = "after", extract_pattern = FALSE) %>%
  corpus_segment(pattern = "!", valuetype = "fixed",
                 pattern_position = "after", extract_pattern = FALSE)
texts(sam)<-gsub("[[:punct:]]","",texts(sam))
texts(sam)<-tolower((texts(sam)))
texts(sam)<-gsub(badp,"cnsrd",texts(sam))
texts(sam)<-gsub("\\b1st\\b","first",texts(sam))
texts(sam)<-gsub("\\b2nd\\b","second",texts(sam))
texts(sam)<-gsub("\\b3rd\\b","third",texts(sam))
texts(sam)<-gsub("\\b4th\\b","fourth",texts(sam))
texts(sam)<-gsub("\\b5th\\b","fifth",texts(sam))
texts(sam)<-gsub("\\b6th\\b","sixth",texts(sam))
texts(sam)<-gsub("\\b7th\\b","seventh",texts(sam))
texts(sam)<-gsub("\\b8th\\b","eigth",texts(sam))
texts(sam)<-gsub("\\b9th\\b","ninth",texts(sam))
texts(sam)<-gsub("\\b(\\d)+th\\b","nth",texts(sam))
texts(sam)<-gsub("\\b1\\b","one",texts(sam))
texts(sam)<-gsub("\\b2\\b","two",texts(sam))
texts(sam)<-gsub("\\b3\\b","three",texts(sam))
texts(sam)<-gsub("\\b4\\b","four",texts(sam))
texts(sam)<-gsub("\\b5\\b","five",texts(sam))
texts(sam)<-gsub("\\b6\\b","six",texts(sam))
texts(sam)<-gsub("\\b7\\b","seven",texts(sam))
texts(sam)<-gsub("\\b8\\b","eight",texts(sam))
texts(sam)<-gsub("\\b9\\b","nine",texts(sam))
texts(sam)<-gsub("\\b0\\b","zero",texts(sam))
texts(sam)<-gsub("\\b(\\d)+\\b","numrep",texts(sam))
#tok<-tokens(sam)%>%tokens_ngrams(n=4)
#dfmFin<-dfm(tok)
#textplot_wordcloud(dfmFin, min_count = 3, random_order = FALSE,
# max_words = 100,rotation=0,
# color = RColorBrewer::brewer.pal(8,"Dark2"))
sam   # return the cleaned, sentence-segmented sample corpus
}
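As a usage sketch (the calling convention here is my assumption, matching the blog-sample exploration below): wrap the raw character vector in a quanteda corpus and pass it in, and the function returns the cleaned sentence-level sample.
# Hypothetical call: draw and clean a 1% sample of the blog corpus
sam<-fullprocess(corpus(rawB))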
Overall, this process is repeatable across multiple test conditions.
Towards the end of the function above there is a commented-out section containing the code we use to create word clouds for our initial exploration of the sample:
tok<-tokens(sam)%>%tokens_ngrams(n=4)   # tokenize the cleaned sample and form 4-grams
dfmFin<-dfm(tok)                        # document-feature matrix of 4-gram counts
textplot_wordcloud(dfmFin, min_count = 3, random_order = FALSE,
                   max_words = 100, rotation = 0,
                   color = RColorBrewer::brewer.pal(8, "Dark2"))
What we’re seeing here is a random 1% sample of our blog set converted into a “wordcloud” with an n-gram size of 4. Adjusting the function code can change the size of the n-grams, or a few tweaks could be made to take the n-gram size as one of the input variables, as sketched below.
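One hedged sketch of such a tweak (a hypothetical wrapper, not code from the original script):
# Hypothetical helper: the same word-cloud exploration, with a configurable n-gram size
plot_ngram_cloud <- function(sam, n = 4) {
  tok    <- tokens(sam) %>% tokens_ngrams(n = n)
  dfmFin <- dfm(tok)
  textplot_wordcloud(dfmFin, min_count = 3, random_order = FALSE,
                     max_words = 100, rotation = 0,
                     color = RColorBrewer::brewer.pal(8, "Dark2"))
}
plot_ngram_cloud(sam, n = 3)   # e.g. a trigram cloud from the same sample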
One thing to note about the prior function is that the segmentation step converts every sentence into a separate document, to eliminate wraparounds when we explore the data. For example, “We went to the store. It was good” would be separated into “We went to the store” and “It was good”.
This creates a situation where n-grams that span the sentence break, such as “the store it was” or “store it was good”, never show up in our 4-gram model, which improves the accuracy of the predictive model we’re building.
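A minimal check of that behavior, reusing the same corpus_segment() call as in the function (the demo object is made up here):
# The example sentence from above should split into two documents
demo<-corpus("We went to the store. It was good")
corpus_segment(demo, pattern = ".", valuetype = "fixed",
               pattern_position = "after", extract_pattern = FALSE)
# -> two documents: "We went to the store." and "It was good"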
Now that we’ve fully explored the underlying dataset and come up with a means of consistently trimming and transforming our test sets, we can begin efficiently tackling the problem of building a predictive model, whether through the use of Markov chains or by other means.
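As a rough illustration of where that could go (a sketch only, assuming a 4-gram document-feature matrix like dfmFin from the exploration above; the names pred and nextw are placeholders, not a finished model):
# Sketch: turn 4-gram counts into a crude "given three words, predict the fourth" table
freq  <- colSums(dfmFin)                            # n-gram frequencies
parts <- strsplit(names(freq), "_", fixed = TRUE)   # tokens_ngrams joins tokens with "_"
pred  <- data.frame(
  prefix = sapply(parts, function(x) paste(x[1:3], collapse = " ")),
  nextw  = sapply(parts, function(x) x[4]),
  count  = as.numeric(freq),
  stringsAsFactors = FALSE
)
# Most frequent next word for each three-word prefix
pred %>% group_by(prefix) %>% slice_max(count, n = 1, with_ties = FALSE)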
A more comprehensive way of dealing with censored words, such as randomizing the replacement token, could help eliminate the collective impact of the single “cnsrd” placeholder on the n-gram counts (and similarly for the “numrep” number replacement); one possible approach is sketched below.
All in all, however, our existing clean-up mechanism and data-subsetting procedure should be sufficient for moving forward.
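A hedged base-R sketch of that randomization idea (illustrative only; censor_random and the placeholder pool are made up here):
# Illustrative: swap each profanity match for one of several randomized
# placeholder tokens instead of the single "cnsrd" token
censor_random <- function(txt, pattern = badp, pool = paste0("cnsrd", 1:5)) {
  m <- gregexpr(pattern, txt)
  regmatches(txt, m) <- lapply(regmatches(txt, m),
                               function(x) sample(pool, length(x), replace = TRUE))
  txt
}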