After doing some reading, I concluded that the quanteda library (and its companion packages) is the right tool for processing the text, so that is what I use for the analysis here.
Whatever words come out of the corpus then need to be validated against an English dictionary. After some searching I found the qdapDictionaries package, which provides such a word list, and I will use it here.
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)
library(quanteda.textstats)
library(qdapDictionaries)
library(RColorBrewer)
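As a quick sanity check of the dictionary mentioned above: GradyAugmented is the English word list exported by qdapDictionaries, and membership can be tested directly. The two words below are just arbitrary examples.
# GradyAugmented is a large character vector of English words from qdapDictionaries
length(GradyAugmented)
c("corpus", "qwerty") %in% GradyAugmented  # arbitrary example lookups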
The files are quite large, so for the purpose of this analysis let us take only a small sample: 1% of each of the three files, clubbed together into one sample file. Working on the full files would run into memory issues.
Based on some discussion forum inputs, I read the files in binary mode ("rb") and pass skipNul = TRUE to readLines() to avoid embedded-NUL warnings.
set.seed(1234)
# Open the sample file for writing so that the three samples written below all end up in one file
fileSample <- file("en_US.sample.txt", open = "w")
dfFileStats <- data.frame(FileName=character(),Linecount=integer(),WordCount=integer(),stringsAsFactors = FALSE)
getfBufSample <- function(fileName, fileSample, dfFileStats)
{
  # Read the full file in binary mode, skipping embedded NULs
  con <- file(fileName, "rb")
  fBuf <- readLines(con, skipNul = TRUE)
  close(con)
  fBuf.length <- length(fBuf)
  # Build a corpus and DFM just to count the unique word types in the file
  fText <- corpus(fBuf)
  myToken <- tokens(fText, remove_punct = TRUE)
  myDFM <- dfm(myToken)
  words.count <- ncol(myDFM)
  # Record the per-file statistics
  dfFileStats <- rbind(dfFileStats, c(fileName, fBuf.length, words.count), stringsAsFactors = FALSE)
  # Keep a 1% random sample of the lines and write it to the shared sample file
  fBuf.Sample <- fBuf[sample(seq_len(fBuf.length), 0.01 * fBuf.length)]
  rm(myDFM)
  rm(myToken)
  rm(fText)
  rm(fBuf)
  writeLines(fBuf.Sample, fileSample)
  return(dfFileStats)
}
dfFileStats <- getfBufSample("./final/en_US/en_US.blogs.txt",fileSample,dfFileStats)
dfFileStats <- getfBufSample("./final/en_US/en_US.news.txt",fileSample,dfFileStats)
dfFileStats <- getfBufSample("./final/en_US/en_US.twitter.txt",fileSample,dfFileStats)
close(fileSample)
colnames(dfFileStats) <- c("FILE_NAME","LINE_COUNT","WORD_COUNT")
dfFileStats
## FILE_NAME LINE_COUNT WORD_COUNT
## 1 ./final/en_US/en_US.blogs.txt 899288 447500
## 2 ./final/en_US/en_US.news.txt 1010242 400542
## 3 ./final/en_US/en_US.twitter.txt 2360148 463818
Note that WORD_COUNT above is the number of unique word types in each file (the number of DFM columns), not the total number of words. Let us now load the sample data and create the tokens object in quanteda.
con <- file("en_US.sample.txt", "rb" )
fBuf <- readLines(con,skipNul = TRUE)
close(con)
fBuf.length <- length(fBuf)
fText <- corpus(fBuf)
myToken <- tokens(fText,remove_punct = TRUE)
During my initial analysis I found some offensive words in the corpus which I would rather not display here. I obtained a dictionary of such words from a CMU web page, added a few more patterns of my own, and removed every sentence containing any word from that list straight away.
I am hiding that part of the code given the sensitive nature of such words; we do not want them to be predicted by our algorithm.
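For completeness, the hidden step has roughly the shape sketched below. The actual word list and file name are withheld; bad.words and bad_words.txt here are placeholders only.
# Hypothetical sketch of the hidden filtering step; bad_words.txt is a placeholder name
bad.words <- readLines("bad_words.txt", skipNul = TRUE)
bad.pattern <- paste0("\\b(", paste(bad.words, collapse = "|"), ")\\b")
keep <- !grepl(bad.pattern, fBuf, ignore.case = TRUE)
fBuf <- fBuf[keep]
# Rebuild the corpus and tokens from the filtered lines
fText <- corpus(fBuf)
myToken <- tokens(fText, remove_punct = TRUE)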
Let us now create a DFM. First we need a list of tokens to remove; the obvious candidates are single digits, stray single letters other than 'a' and 'i', and the literal strings TRUE and FALSE.
I have decided not to exclude stop words, since they will be needed to predict the next word in a sequence. Stop words are common words such as "the" and "and" that add little meaning to a sentence on their own.
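For reference, quanteda's built-in English stop word list (which we are keeping, not removing) looks like this:
# First few entries of the built-in English stop word list
head(stopwords("en"), 10)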
my.addon.words <- NULL
# A word counts as "in the dictionary" if it appears in GradyAugmented (from
# qdapDictionaries) or in my own add-on list
is.word <- function(x) x %in% c(GradyAugmented, my.addon.words)
# Build the removal list: single digits, single letters other than 'a'/'i',
# and the literal strings TRUE / FALSE
remove.list <- NULL
remove.list <- c(remove.list, 0:9)
remove.list <- c(remove.list, LETTERS[!(LETTERS %in% c('A','I'))])
remove.list <- c(remove.list, letters[!(letters %in% c('a','i'))])
remove.list <- c(remove.list, TRUE, FALSE)
myToken <- myToken %>% tokens_select(remove.list, selection = "remove", valuetype = "fixed")
myDFM <- dfm(myToken)
top.features <- as.data.frame(topfeatures(myDFM,ncol(myDFM)))
colnames(top.features) <- "FREQUENCY"
top.features$IN_DICTIONARY <- is.word(rownames(top.features))
head(top.features,20)
## FREQUENCY IN_DICTIONARY
## the 9487 TRUE
## to 7836 TRUE
## i 7185 TRUE
## a 6083 TRUE
## you 5471 TRUE
## and 4365 TRUE
## for 3817 TRUE
## in 3768 TRUE
## of 3610 TRUE
## is 3594 TRUE
## it 2910 TRUE
## my 2901 TRUE
## on 2625 TRUE
## that 2333 TRUE
## me 1967 TRUE
## be 1871 TRUE
## at 1772 TRUE
## with 1721 TRUE
## have 1689 TRUE
## your 1670 TRUE
non.dictionary.features <- top.features[(top.features$IN_DICTIONARY==FALSE),]
head(non.dictionary.features,20)
## FREQUENCY IN_DICTIONARY
## rt 856 FALSE
## \200 810 FALSE
## lol 686 FALSE
## â 522 FALSE
## < 517 FALSE
## im 303 FALSE
## $ 295 FALSE
## haha 263 FALSE
## œ 214 FALSE
## \231 209 FALSE
## > 206 FALSE
## ðÿ 193 FALSE
## \230 168 FALSE
## = 154 FALSE
## ur 150 FALSE
## wanna 141 FALSE
## ~ 140 FALSE
## dont 134 FALSE
## + 127 FALSE
## omg 120 FALSE
We can see several special characters here. We want to exclude all features containing special characters from the corpus (which includes foreign-language characters). We only want tokens built from the letters A to Z, plus in-word punctuation such as apostrophes and hyphens.
# Collect the non-dictionary features containing anything outside A-Z, a-z and the
# allowed in-word punctuation (the '-. portion of the class is a range, so a few
# symbols such as '+' slip through and survive this step, as seen in the next table)
special.character.features <- non.dictionary.features[grep("[^A-Za-z\'-._]+", rownames(non.dictionary.features)),] %>% rownames()
myToken <- myToken %>% tokens_select(special.character.features, selection = "remove", valuetype = "fixed")
myDFM <- dfm(myToken)
top.features <- as.data.frame(topfeatures(myDFM,ncol(myDFM)))
colnames(top.features) <- "FREQUENCY"
top.features$IN_DICTIONARY <- is.word(rownames(top.features))
non.dictionary.features <- top.features[(top.features$IN_DICTIONARY==FALSE),]
head(non.dictionary.features,20)
## FREQUENCY IN_DICTIONARY
## rt 856 FALSE
## lol 686 FALSE
## im 303 FALSE
## haha 263 FALSE
## ur 150 FALSE
## wanna 141 FALSE
## dont 134 FALSE
## + 127 FALSE
## omg 120 FALSE
## congrats 114 FALSE
## email 105 FALSE
## bro 89 FALSE
## facebook 84 FALSE
## lmao 74 FALSE
## thats 69 FALSE
## nyc 65 FALSE
## cuz 61 FALSE
## online 61 FALSE
## ppl 59 FALSE
## chicago 59 FALSE
We can see that some common internet slang terms are used. Let us focus only on the ones with high frequency and try to replace them with their standard English equivalents :)
One more thing: each slang term is replaced with a single-word equivalent only, otherwise we would have to re-tokenize and redo the entire pipeline. The replacement step itself is sketched below.
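The actual mapping I used is not shown here; the sketch below only illustrates the mechanism with a few made-up pairs, using tokens_replace() from quanteda.
# Hypothetical example mapping; the real slang list used in the report is not shown
slang.words <- c("ur", "im", "wanna")
proper.words <- c("your", "i'm", "want")
myToken <- tokens_replace(myToken, pattern = slang.words, replacement = proper.words,
                          valuetype = "fixed", case_insensitive = TRUE)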
After all these changes, let us recreate the Document Feature Matrix.
myDFM <- dfm(myToken)
top.features <- as.data.frame(topfeatures(myDFM,ncol(myDFM)))
colnames(top.features) <- "FREQUENCY"
top.features$IN_DICTIONARY <- is.word(rownames(top.features))
non.dictionary.features <- top.features[(top.features$IN_DICTIONARY==FALSE),]
head(non.dictionary.features,20)
## FREQUENCY IN_DICTIONARY
## dont 134 FALSE
## congrats 114 FALSE
## email 105 FALSE
## facebook 84 FALSE
## thats 69 FALSE
## nyc 65 FALSE
## online 61 FALSE
## chicago 59 FALSE
## vs 54 FALSE
## blog 54 FALSE
## app 51 FALSE
## dc 47 FALSE
## kinda 46 FALSE
## fb 40 FALSE
## tech 37 FALSE
## cc 35 FALSE
## today's 34 FALSE
## mt 34 FALSE
## obama 33 FALSE
## dj 33 FALSE
# Drop the remaining non-dictionary features that occur fewer than 50 times
non.dictionary.words.remove <- rownames(non.dictionary.features[non.dictionary.features$FREQUENCY < 50,])
myToken <- myToken %>% tokens_select(non.dictionary.words.remove, selection = "remove", valuetype = "fixed")
myDFM <- dfm(myToken)
top.features <- as.data.frame(topfeatures(myDFM,ncol(myDFM)))
colnames(top.features) <- "FREQUENCY"
top.features$IN_DICTIONARY <- is.word(rownames(top.features))
Let us now create a word cloud, taking only the words with a minimum frequency of 50. A word cloud serves much the same purpose as a histogram but is better suited to text data: words with higher frequency appear nearer the center, in a larger font, and generally in a darker color.
textplot_wordcloud(myDFM,min_count = 50,color = brewer.pal(12, "Paired"))
# Build bi-gram, tri-gram and quad-gram tokens and their DFMs, then plot a word cloud
# for each (with progressively lower minimum counts, since longer n-grams are rarer)
biGramToken <- tokens_ngrams(myToken, n = 2)
triGramToken <- tokens_ngrams(myToken, n = 3)
quadGramToken <- tokens_ngrams(myToken, n = 4)
dfmBG <- dfm(biGramToken)
dfmTG <- dfm(triGramToken)
dfmQG <- dfm(quadGramToken)
textplot_wordcloud(dfmBG,min_count = 20,color = brewer.pal(12, "Paired"))
textplot_wordcloud(dfmTG,min_count = 10,color = brewer.pal(12, "Paired"))
textplot_wordcloud(dfmQG,min_count = 5,color = brewer.pal(12, "Paired"))
Note: not all the n-grams are shown in the word clouds. Through some random spot checks I have verified that bi-, tri- and quad-grams are formed for every word in the word cloud, even for the low-frequency ones.
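One way to run such a spot check is with textstat_frequency() from quanteda.textstats; the word searched for below is just an arbitrary example.
# Spot check: list a few quad-grams that contain an arbitrarily chosen word
freq.qg <- textstat_frequency(dfmQG)
head(freq.qg[grepl("love", freq.qg$feature), c("feature", "frequency")], 5)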