Initial Analysis on the Capstone Project (Data Science Specialization)

Problem Statement

  1. We have to build a text prediction system based on the text corpus provided.
  2. While corpora are available in multiple languages, we shall work with English only.
  3. There are three sources, viz. blogs, news and Twitter.

Approach

After doing some reading, I concluded that I would need the quanteda library to process the text, so I have used that library (and its companion packages) for the analysis here.

Whatever words I extract from the corpus then need to be validated against an English dictionary. After some searching, I found the qdapDictionaries package, which I will use for this purpose.

Load Libraries

library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)
library(quanteda.textstats)
library(qdapDictionaries)
library(RColorBrewer)

Sampling Data

The files are quite large, so for this analysis let us take only a small sample (1% of each of the three files) and club them together; otherwise we will run into memory issues.

Based on some discussion forum inputs, I read the files in binary mode ("rb") and pass skipNul = TRUE to readLines() to avoid embedded-NUL warnings.

set.seed(1234)

# Open the sample file for writing and keep the connection open, so that the
# samples from all three source files are appended to the same file
fileSample <- file("en_US.sample.txt", open = "w")

dfFileStats <- data.frame(FileName=character(),Linecount=integer(),WordCount=integer(),stringsAsFactors = FALSE)

# Read one source file, record its line count and unique word count,
# and write a 1% random sample of its lines to the open sample connection
getfBufSample <- function(fileName, fileSample, dfFileStats)
{
  con <- file(fileName, "rb")
  fBuf <- readLines(con, skipNul = TRUE)
  close(con)

  fBuf.length <- length(fBuf)
  fText <- corpus(fBuf)
  myToken <- tokens(fText, remove_punct = TRUE)
  myDFM <- dfm(myToken)
  words.count <- ncol(myDFM)   # number of unique word types in this file

  dfFileStats <- rbind(dfFileStats, c(fileName, fBuf.length, words.count), stringsAsFactors = FALSE)

  # Draw a 1% random sample of the lines
  fBuf.Sample <- fBuf[sample(seq_len(fBuf.length), floor(0.01 * fBuf.length))]

  # Free the large intermediate objects before returning
  rm(myDFM, myToken, fText, fBuf)

  writeLines(fBuf.Sample, fileSample)
  return(dfFileStats)
}
dfFileStats <- getfBufSample("./final/en_US/en_US.blogs.txt",fileSample,dfFileStats)

dfFileStats <- getfBufSample("./final/en_US/en_US.news.txt",fileSample,dfFileStats)

dfFileStats <- getfBufSample("./final/en_US/en_US.twitter.txt",fileSample,dfFileStats)

close(fileSample)

File Statistics

Note that WORD_COUNT below is the number of unique word types in each file (the number of DFM columns), not the total number of words.

colnames(dfFileStats) <- c("FILE_NAME","LINE_COUNT","WORD_COUNT")
dfFileStats
##                         FILE_NAME LINE_COUNT WORD_COUNT
## 1   ./final/en_US/en_US.blogs.txt     899288     447500
## 2    ./final/en_US/en_US.news.txt    1010242     400542
## 3 ./final/en_US/en_US.twitter.txt    2360148     463818

Loading Sample Data

Let us now load the sampled data and create a tokens object in quanteda.

con <- file("en_US.sample.txt", "rb" )
fBuf <- readLines(con,skipNul = TRUE)
close(con)

fBuf.length <- length(fBuf)

fText <- corpus(fBuf)

myToken <- tokens(fText,remove_punct = TRUE)

Offensive Words Removal

During my exploration I found some offensive words in the corpus, which I would rather not display here. I obtained a list of such words from a CMU web page and I am going to exclude, right away, all sentences that contain any word from that list.

I have also added a few more patterns of my own to this list. I am hiding that part of the code given the sensitive nature of these words; we do not want such words to be predicted by our algorithm.
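Purely as an illustration of the kind of filter applied (the real list and code are withheld), a minimal sketch could look like the following, where bad-words.txt is a hypothetical file with one such word per line:

# Illustrative sketch only - the actual word list and code are not shown here
# "bad-words.txt" is a hypothetical file with one offensive word per line
bad.words <- readLines("bad-words.txt", skipNul = TRUE)

# Flag documents (lines) that contain at least one listed word ...
has.bad <- ntoken(tokens_select(myToken, pattern = bad.words,
                                selection = "keep", valuetype = "fixed")) > 0

# ... and keep only the clean documents
myToken <- tokens_subset(myToken, !has.bad)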

Document Feature Matrix (DFM)

Let us now create a DFM. First, we need a list of tokens to remove. The obvious candidates are as follows -

  1. All numbers
  2. All special characters
  3. Single letters (except 'a' and 'i')
  4. Punctuation

Stop words are common words which do not add much meaning to a sentence. I have decided not to exclude them, as they will be needed to predict the next word in a sequence.
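For reference, quanteda bundles a standard English stop word list (via the stopwords package); peeking at the first few entries gives a feel for the kind of words being kept:

# A peek at the standard English stop word list bundled with quanteda,
# shown here only to illustrate the kind of words we are keeping
head(stopwords("en"), 10)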

# A token counts as a valid word if it appears in the GradyAugmented dictionary
# (from qdapDictionaries) or in our own add-on list
my.addon.words <- NULL
is.word  <- function(x) x %in% c(GradyAugmented,my.addon.words)

# Build the removal list: single-digit tokens, single letters other than
# a/A and i/I, and the literal tokens "TRUE"/"FALSE"
remove.list <- NULL
remove.list <- c(remove.list,0:9)
remove.list <- c(remove.list,LETTERS[!(LETTERS %in% c('A','I'))])
remove.list <- c(remove.list,letters[!(letters %in% c('a','i'))])
remove.list <- c(remove.list, TRUE, FALSE)

myToken <- myToken %>% tokens_select(remove.list,selection="remove",valuetype = "fixed")

myDFM <- dfm(myToken)

top.features <- as.data.frame(topfeatures(myDFM,ncol(myDFM)))
colnames(top.features) <- "FREQUENCY"
top.features$IN_DICTIONARY <- is.word(rownames(top.features)) 

Observing the Top Features

head(top.features,20)
##      FREQUENCY IN_DICTIONARY
## the       9487          TRUE
## to        7836          TRUE
## i         7185          TRUE
## a         6083          TRUE
## you       5471          TRUE
## and       4365          TRUE
## for       3817          TRUE
## in        3768          TRUE
## of        3610          TRUE
## is        3594          TRUE
## it        2910          TRUE
## my        2901          TRUE
## on        2625          TRUE
## that      2333          TRUE
## me        1967          TRUE
## be        1871          TRUE
## at        1772          TRUE
## with      1721          TRUE
## have      1689          TRUE
## your      1670          TRUE
non.dictionary.features <- top.features[(top.features$IN_DICTIONARY==FALSE),]
head(non.dictionary.features,20)
##       FREQUENCY IN_DICTIONARY
## rt          856         FALSE
## \200           810         FALSE
## lol         686         FALSE
## â           522         FALSE
## <           517         FALSE
## im          303         FALSE
## $           295         FALSE
## haha        263         FALSE
## œ           214         FALSE
## \231           209         FALSE
## >           206         FALSE
## ðÿ          193         FALSE
## \230           168         FALSE
## =           154         FALSE
## ur          150         FALSE
## wanna       141         FALSE
## ~           140         FALSE
## dont        134         FALSE
## +           127         FALSE
## omg         120         FALSE

Excluding words with Special Characters

We can see several special characters here. We want to exclude tokens containing special characters (which may include mis-encoded and foreign-language characters) from the corpus; essentially we only want tokens made up of the letters A to Z, with the pattern below also tolerating apostrophes, hyphens, periods and underscores.

# Flag the non-dictionary features containing characters outside the allowed set
# (letters plus apostrophe, hyphen, period and underscore; note that the '-. range
# in the character class also admits a few symbols such as "+", which is why "+"
# still appears in the next table), then remove those tokens
special.character.features <- non.dictionary.features[grep("[^A-Za-z\'-._]+",rownames(non.dictionary.features)),] %>% rownames()

myToken <- myToken %>% tokens_select(special.character.features,selection="remove",valuetype = "fixed")

myDFM <- dfm(myToken)

top.features <- as.data.frame(topfeatures(myDFM,ncol(myDFM)))
colnames(top.features) <- "FREQUENCY"
top.features$IN_DICTIONARY <- is.word(rownames(top.features)) 

non.dictionary.features <- top.features[(top.features$IN_DICTIONARY==FALSE),]
head(non.dictionary.features,20)
##          FREQUENCY IN_DICTIONARY
## rt             856         FALSE
## lol            686         FALSE
## im             303         FALSE
## haha           263         FALSE
## ur             150         FALSE
## wanna          141         FALSE
## dont           134         FALSE
## +              127         FALSE
## omg            120         FALSE
## congrats       114         FALSE
## email          105         FALSE
## bro             89         FALSE
## facebook        84         FALSE
## lmao            74         FALSE
## thats           69         FALSE
## nyc             65         FALSE
## cuz             61         FALSE
## online          61         FALSE
## ppl             59         FALSE
## chicago         59         FALSE

Internet Slang

We can see that some common internet slang terms are used. Let us focus only on the ones with high frequency and try to replace them with more standard words :)

One more thing: we replace a slang term only with a single-word equivalent, otherwise we would have to redo the entire tokenization. A sketch of what such a replacement could look like is shown below.
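The replacement code itself is hidden along with the offensive-words handling; purely as an illustration, a one-to-one mapping could be applied with tokens_replace(), where the mapping below is a made-up example rather than the list actually used:

# Illustrative sketch only - the mapping actually used in this report is hidden
slang.map <- c(lol = "laugh", haha = "laugh", ur = "your",
               wanna = "want", dont = "don't", omg = "wow")

# Replace each slang token with its single-word equivalent
myToken <- tokens_replace(myToken,
                          pattern = names(slang.map),
                          replacement = unname(slang.map))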

Recreate DFM

After all these changes, let us recreate the Document Feature Matrix.

myDFM <- dfm(myToken)

top.features <- as.data.frame(topfeatures(myDFM,ncol(myDFM)))
colnames(top.features) <- "FREQUENCY"
top.features$IN_DICTIONARY <- is.word(rownames(top.features)) 

non.dictionary.features <- top.features[(top.features$IN_DICTIONARY==FALSE),]
head(non.dictionary.features,20)
##          FREQUENCY IN_DICTIONARY
## dont           134         FALSE
## congrats       114         FALSE
## email          105         FALSE
## facebook        84         FALSE
## thats           69         FALSE
## nyc             65         FALSE
## online          61         FALSE
## chicago         59         FALSE
## vs              54         FALSE
## blog            54         FALSE
## app             51         FALSE
## dc              47         FALSE
## kinda           46         FALSE
## fb              40         FALSE
## tech            37         FALSE
## cc              35         FALSE
## today's         34         FALSE
## mt              34         FALSE
## obama           33         FALSE
## dj              33         FALSE

Removing Non-Dictionary Words with Low Frequency

  1. It is not practical to look up each and every word in the non-dictionary word list.
  2. It is also possible that our dictionary itself is not exhaustive.
  3. So I have decided to remove the non-dictionary tokens with a frequency below 50. This threshold would change with the sample size; there is no science behind the number, it is just gut feel.

non.dictionary.words.remove <- rownames(non.dictionary.features[non.dictionary.features$FREQUENCY < 50,])
myToken <- myToken %>% tokens_select(non.dictionary.words.remove,selection="remove", valuetype = "fixed")

myDFM <- dfm(myToken) 

top.features <- as.data.frame(topfeatures(myDFM,ncol(myDFM)))
colnames(top.features) <- "FREQUENCY"
top.features$IN_DICTIONARY <- is.word(rownames(top.features)) 

Creating Word Cloud

Let us create a word cloud now, taking only those words with a frequency of at least 50. A word cloud is similar to a histogram but is more useful for text data: words with higher frequency appear near the center, in a larger font and generally in a darker color.

textplot_wordcloud(myDFM,min_count = 50,color = brewer.pal(12, "Paired"))

Word Cloud Analysis

  1. As expected, the stop words have the highest frequency.
  2. In the lower tiers we have other function words, which will appear in conjunction with the stop words in text.
  3. There seem to be no (or very few) invalid words in the word cloud, so our pre-processing appears to be working.

Creating N-grams

  1. Put very simply, N-grams are sequences of N consecutive words appearing in a sentence.
  2. We will use N-grams to make predictions on the text provided as input by the user.
  3. Let us create bigrams, trigrams and quadgrams for now and analyze them.

biGramToken <- tokens_ngrams(myToken, n = 2)
triGramToken <- tokens_ngrams(myToken, n = 3)
quadGramToken <- tokens_ngrams(myToken, n = 4)
dfmBG <- dfm(biGramToken)
dfmTG <- dfm(triGramToken)
dfmQG <- dfm(quadGramToken)

Bi-Gram Word Cloud

textplot_wordcloud(dfmBG,min_count = 20,color = brewer.pal(12, "Paired"))

Tri-Gram Word Cloud

textplot_wordcloud(dfmTG,min_count = 10,color = brewer.pal(12, "Paired"))

Quad-Gram Word Cloud

textplot_wordcloud(dfmQG,min_count = 5,color = brewer.pal(12, "Paired"))

Note: not all the N-grams are shown in the word clouds. Through some random spot checks I have verified that n-grams up to quadgrams are formed for the words in the word cloud, even the low-frequency ones.
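As an illustration of such a spot check (the word chosen here is arbitrary), one can search the quadgram feature names for a given unigram:

# Spot check (illustrative): list a few quadgrams containing an arbitrary word
check.word <- "morning"
head(grep(paste0("(^|_)", check.word, "(_|$)"), featnames(dfmQG), value = TRUE))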

Approach for Prediction

  1. Load the entire data set from all three sources while keeping memory usage under control.
  2. Remove all the undesired tokens.
  3. Create bigrams, trigrams and quadgrams.
  4. Build a Shiny web app to accept user input.
  5. Based on the user input, predict the best candidates for the next word(s) from the available bigrams, trigrams and quadgrams (a rough sketch of such a lookup follows below).
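To make point 5 concrete, here is a minimal sketch of how such a lookup might work, assuming the n-gram DFMs built above stay in scope; the final app will need proper frequency tables and fall-back logic, neither of which is implemented here.

# Minimal sketch of a next-word lookup - not the final prediction algorithm
# Assumes dfmQG (the quadgram DFM built earlier) is available

# Turn an n-gram DFM into a table of (history, next word, frequency)
ngram.table <- function(ngram.dfm) {
  freqs <- textstat_frequency(ngram.dfm)   # from quanteda.textstats
  parts <- strsplit(freqs$feature, "_", fixed = TRUE)
  data.frame(
    history   = vapply(parts, function(p) paste(head(p, -1), collapse = " "), character(1)),
    nextword  = vapply(parts, function(p) tail(p, 1), character(1)),
    frequency = freqs$frequency,
    stringsAsFactors = FALSE
  )
}

tableQG <- ngram.table(dfmQG)

# Given the last three words typed by the user, rank the candidate next words
user.history <- "one of the"   # illustrative input
candidates <- tableQG[tableQG$history == user.history, ]
head(candidates[order(-candidates$frequency), "nextword"], 3)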

Thank You