knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(readtext)
library(quanteda)
library(wordcloud)
library(stringi)
library(ggplot2)
library(knitr)
pathBlogs<-c("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
pathTwit<-c("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
pathNews<-c("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
rawB<-readLines(pathBlogs, encoding = "UTF-8", skipNul = TRUE)
rawT<-readLines(pathTwit, encoding = "UTF-8", skipNul = TRUE)
rawN<-readLines(pathNews, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
Before diving deeper into natural language processing, an initial exploration of the underlying dataset is critical for determining our approach to restructuring the data and optimizing our eventual algorithm.
summ<-data.frame(c("Blog","News","Twitter"))
names(summ)[1]<-"source"
summ$size <- c(object.size(rawB)/1000,object.size(rawN)/1000,object.size(rawT)/1000)
summ$doccount<-c(length(rawB),length(rawN),length(rawT))
summ$wrdcount<-c(sum(stri_count_words(rawB)),sum(stri_count_words(rawN)),sum(stri_count_words(rawT)))
names(summ)<-c("Da ta Source","Size (in KB)","Doc. Count","Word Count")
kable(summ)
| Data Source | Size (in KB) | Doc. Count | Word Count |
|---|---|---|---|
| Blog | 267758.63 | 899288 | 37546239 |
| News | 20729.47 | 77259 | 2693898 |
| Twitter | 334484.99 | 2360148 | 30093413 |
Twitter, while the largest source in terms of in-memory size (roughly 334 MB per the table above), has a smaller word count than Blogs (about 30M words versus 37M), so it is apparent that the number of documents in a corpus also drives the memory allocation requirements.
News, smaller on both counts, is dwarfed by either of the other corpora.
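One caveat worth flagging (an assumption on my part, not something verified here): readLines() can stop early at an embedded control character when a file is opened in text mode, which might explain the comparatively small News line count. Reading through a binary connection would rule that out:
# Hedged workaround: read the news file through a binary connection so an
# embedded control character cannot truncate the read
conNews<-file(pathNews, open = "rb")
rawN<-readLines(conNews, encoding = "UTF-8", skipNul = TRUE)
close(conNews)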
A few steps were taken here to minimize the impact of cleaning on the underlying data set (in terms of how preceding and following tokens interact) and to preserve the underlying structure of the sentences.
These include:
To deal with explicit words, I used the list from http://www.bannedwordlist.com, manually edited to remove words that have multiple meanings or that I did not consider “bad words.”
# Read the (manually edited) list of explicit words
bad<-read.csv("swearWords.txt")
# Collapse the words into a single alternation pattern of the form \b(word1|word2|...)\b
badp<-""
for(val in bad){
  badp<-paste(badp,val,"|",sep="",collapse="")
}
badp<-substr(badp,1,nchar(badp)-1)   # drop the trailing "|"
badp<-paste("\\b(",badp,")\\b",sep="",collapse=" ")
I also designed a function that handles the manipulation from start to finish, so that we can randomly select chunks from any of the corpora and produce results quickly.
fullprocess <- function(text){
# Draw a 1% random sample of the documents and rebuild a corpus from it
sam<-sample(texts(text), size = floor(.01*length(text)), replace = FALSE) %>% corpus()
# Collapse runs of periods (e.g. ellipses) into a single period before segmenting
texts(sam)<-gsub("\\.{2,}",".",texts(sam))
# Split the sample into one document per sentence, segmenting after ".", "?", and "!"
sam<-corpus_segment(sam, pattern = ".", valuetype = "fixed",
                    pattern_position = "after", extract_pattern = FALSE) %>%
  corpus_segment(pattern = "?", valuetype = "fixed",
                 pattern_position = "after", extract_pattern = FALSE) %>%
  corpus_segment(pattern = "!", valuetype = "fixed",
                 pattern_position = "after", extract_pattern = FALSE)
texts(sam)<-gsub("[[:punct:]]","",texts(sam))
texts(sam)<-tolower((texts(sam)))
texts(sam)<-gsub(badp,"cnsrd",texts(sam))
texts(sam)<-gsub("\\b1st\\b","first",texts(sam))
texts(sam)<-gsub("\\b2nd\\b","second",texts(sam))
texts(sam)<-gsub("\\b3rd\\b","third",texts(sam))
texts(sam)<-gsub("\\b4th\\b","fourth",texts(sam))
texts(sam)<-gsub("\\b5th\\b","fifth",texts(sam))
texts(sam)<-gsub("\\b6th\\b","sixth",texts(sam))
texts(sam)<-gsub("\\b7th\\b","seventh",texts(sam))
texts(sam)<-gsub("\\b8th\\b","eigth",texts(sam))
texts(sam)<-gsub("\\b9th\\b","ninth",texts(sam))
texts(sam)<-gsub("\\b(\\d)+th\\b","nth",texts(sam))
texts(sam)<-gsub("\\b1\\b","one",texts(sam))
texts(sam)<-gsub("\\b2\\b","two",texts(sam))
texts(sam)<-gsub("\\b3\\b","three",texts(sam))
texts(sam)<-gsub("\\b4\\b","four",texts(sam))
texts(sam)<-gsub("\\b5\\b","five",texts(sam))
texts(sam)<-gsub("\\b6\\b","six",texts(sam))
texts(sam)<-gsub("\\b7\\b","seven",texts(sam))
texts(sam)<-gsub("\\b8\\b","eight",texts(sam))
texts(sam)<-gsub("\\b9\\b","nine",texts(sam))
texts(sam)<-gsub("\\b0\\b","zero",texts(sam))
texts(sam)<-gsub("\\b(\\d)+\\b","numrep",texts(sam))
#tok<-tokens(sam)%>%tokens_ngrams(n=4)
#dfmFin<-dfm(tok)
#textplot_wordcloud(dfmFin, min_count = 3, random_order = FALSE,
# max_words = 100,rotation=0,
# color = RColorBrewer::brewer.pal(8,"Dark2"))
sam   # return the cleaned, sentence-segmented sample corpus
}
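As a usage sketch (the calling convention here is my assumption, matching the blog-sample exploration below): wrap the raw character vector in a quanteda corpus and pass it in, and the function returns the cleaned sentence-level sample.
# Hypothetical call: draw and clean a 1% sample of the blog corpus
sam<-fullprocess(corpus(rawB))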
Overall, this process is repeatable across multiple test conditions.
Towards the end of the function above there is a commented-out section containing the code we use to create word clouds for our initial exploration of the sample:
tok<-tokens(sam)%>%tokens_ngrams(n=4)   # tokenize the cleaned sample and form 4-grams
dfmFin<-dfm(tok)                        # document-feature matrix of 4-gram counts
textplot_wordcloud(dfmFin, min_count = 3, random_order = FALSE,
                   max_words = 100, rotation = 0,
                   color = RColorBrewer::brewer.pal(8, "Dark2"))
What we’re seeing here is a random 1% sample of our blog set converted into a “wordcloud” with an n-gram size of 4. Adjusting the function code can change the size of the n-grams, or a few tweaks could be made to take the n-gram size as one of the input variables, as sketched below.
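One hedged sketch of such a tweak (a hypothetical wrapper, not code from the original script):
# Hypothetical helper: the same word-cloud exploration, with a configurable n-gram size
plot_ngram_cloud <- function(sam, n = 4) {
  tok    <- tokens(sam) %>% tokens_ngrams(n = n)
  dfmFin <- dfm(tok)
  textplot_wordcloud(dfmFin, min_count = 3, random_order = FALSE,
                     max_words = 100, rotation = 0,
                     color = RColorBrewer::brewer.pal(8, "Dark2"))
}
plot_ngram_cloud(sam, n = 3)   # e.g. a trigram cloud from the same sample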
One thing to note about the prior function is that the segmentation step converts every sentence into a separate document, to eliminate wraparounds when we explore the data. For example, “We went to the store. It was good” would be separated into “We went to the store” and “It was good”.
This creates a situation where n-grams that span the sentence break, such as “the store it was” or “store it was good”, never show up in our 4-gram model, which improves the accuracy of the predictive model we’re building.
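A minimal check of that behavior, reusing the same corpus_segment() call as in the function (the demo object is made up here):
# The example sentence from above should split into two documents
demo<-corpus("We went to the store. It was good")
corpus_segment(demo, pattern = ".", valuetype = "fixed",
               pattern_position = "after", extract_pattern = FALSE)
# -> two documents: "We went to the store." and "It was good"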
Now that we’ve fully explored the underlying dataset and come up with a means of consistently trimming and transforming our test sets, we can begin efficiently tackling the problem of building a predictive model, whether through the use of Markov chains or by other means.
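As a rough illustration of where that could go (a sketch only, assuming a 4-gram document-feature matrix like dfmFin from the exploration above; the names pred and nextw are placeholders, not a finished model):
# Sketch: turn 4-gram counts into a crude "given three words, predict the fourth" table
freq  <- colSums(dfmFin)                            # n-gram frequencies
parts <- strsplit(names(freq), "_", fixed = TRUE)   # tokens_ngrams joins tokens with "_"
pred  <- data.frame(
  prefix = sapply(parts, function(x) paste(x[1:3], collapse = " ")),
  nextw  = sapply(parts, function(x) x[4]),
  count  = as.numeric(freq),
  stringsAsFactors = FALSE
)
# Most frequent next word for each three-word prefix
pred %>% group_by(prefix) %>% slice_max(count, n = 1, with_ties = FALSE)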
A more comprehensive way of dealing with censored words, such as randomizing the replacement token, could help eliminate the collective impact of the single “cnsrd” placeholder on the n-gram counts (and similarly for the “numrep” number replacement); one possible approach is sketched below.
All in all, however, our existing clean-up mechanism and data-subsetting procedure should be sufficient for moving forward.
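A hedged base-R sketch of that randomization idea (illustrative only; censor_random and the placeholder pool are made up here):
# Illustrative: swap each profanity match for one of several randomized
# placeholder tokens instead of the single "cnsrd" token
censor_random <- function(txt, pattern = badp, pool = paste0("cnsrd", 1:5)) {
  m <- gregexpr(pattern, txt)
  regmatches(txt, m) <- lapply(regmatches(txt, m),
                               function(x) sample(pool, length(x), replace = TRUE))
  txt
}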