Data Science Capstone: Exploratory Data Analysis of Text Data Sets

By: Narendra S. Shukla (September 3rd, 2016)

Executive Summary

NLP, or Natural Language Processing, is an important branch of Data Science. Modern NLP platforms are based on Machine Learning: they analyze large corpora and draw statistical inferences from them. Some common applications of NLP are Sentiment Analysis, Topic Modelling, Speech Recognition and Question Answering.

This project involves downloading corpora consisting of blogs, news and tweets; cleaning the corpus of non-English characters, swear-words and English stop-words; tidying the data by removing punctuation, extra whitespace and numbers; applying stemming; building a Document Term Matrix; and computing word frequencies, bigrams and trigrams.

We also analyze the top-frequency words as well as the most common bigrams and trigrams.

Loading Data from HC Corpora

For this exercise, we shall use the en_US locale. We are given three text files: en_US.blogs, en_US.news and en_US.twitter. Their line counts are: en_US.blogs = 899,288, en_US.news = 1,010,242 and en_US.twitter = 2,360,148. Since these files are huge, we shall work with a sample of 10,000 lines from each file.
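
As a quick check, the line counts quoted above can be reproduced with a few lines of R. This is only a sketch, assuming the raw files sit under ../Data/ as they do in the sampling code below:

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
# warn = FALSE suppresses warnings about embedded nuls / missing final EOL
sapply(files, function(f) length(readLines(file.path("../Data", f), warn = FALSE)))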

We also notice that there are a lot of non-English characters in the original data set. We shall remove them before creating a Corpus.

library("tm")
library("stringi")
set.seed(1000)

con1 <- file("../Data/en_US.blogs.txt")
allBlogs <- readLines(con1)
blogs <- sample(allBlogs,10000)
blogs <- stri_trans_general(blogs, "latin-ascii")
writeLines(blogs,"../SampleData/blogs.txt")
close(con1)
con2 <- file("../Data/en_US.news.txt")
allNews <- readLines(con2)
news <- sample(allNews,10000)
news <- stri_trans_general(news, "latin-ascii")
writeLines(news,"../SampleData/news.txt")
close(con2)
con3 <- file("../Data/en_US.twitter.txt")
allTweets <- readLines(con3)
tweets <- sample(allTweets,10000)
tweets <- stri_trans_general(tweets, "latin-ascii")
writeLines(tweets,"../SampleData/tweets.txt")
close(con3)

Now that we have three sample files, let's create our Corpus and inspect it:

docs <- Corpus(DirSource("../SampleData", encoding="UTF-8"))
inspect(docs)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 2309418
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 2035245
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 693589

Pre-Processing the Corpus Data

We shall perform several pre-processing steps on this Corpus: removing numbers, removing punctuation, stripping extra whitespace and converting everything to lower case. Let's get started.

docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))

We notice that there are quite a few swear-words in this Corpus. We downloaded a swear-word list from the internet and saved it as "swearWords.txt".

con <- file("swearWords.txt")
badWords <- readLines(con)
close(con)
str(badWords)
##  chr [1:77] "anal" "anus" "arse" "ass" "ballsack" "balls" ...

Next, we remove the swear-words and the English stop-words from our Corpus.

docs <- tm_map(docs, removeWords, badWords)
docs <- tm_map(docs, removeWords, stopwords("en"))

The last transformation we apply to the data is stemming.

library(SnowballC)
docs <- tm_map(docs,stemDocument)

Now we are ready to do the actual analysis on this Corpus.

What are the Distributions of Word Frequencies?

We start by building a Document Term Matrix (DTM).

dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 39797)>>
## Non-/sparse entries: 57527/61864
## Sparsity           : 52%
## Maximal term length: 83
## Weighting          : term frequency (tf)

Let's get the word counts for each document in our Corpus:

rowSums(as.matrix(dtm))
##  blogs.txt   news.txt tweets.txt 
##     207403     191747      69190

Then we compute the word frequencies across the whole Corpus and print the top-frequency words.

# Column sums of the DTM give per-term frequencies across all three documents
freq <- colSums(as.matrix(dtm))
ord <- order(freq, decreasing=TRUE)
topFreqTerms <- freq[head(ord, 37)]
topFreqTerms
##  said  will   one  like  just   get  time   can  year  make   day   new 
##  3007  2737  2573  2352  2314  2171  2151  2054  1909  1722  1636  1610 
##  love  want  work  know   say peopl   use   now   see  also  look  back 
##  1411  1391  1387  1352  1322  1320  1318  1300  1276  1246  1222  1217 
## first think  good  come  even thing  take   way  need   two  last  well 
##  1216  1203  1195  1119  1106  1105  1083  1067  1060  1056  1053  1004 
##  dont 
##  1003

Notice that each of these words appears 1,000 times or more in our Corpus.

Now let’s plot them,

library(ggplot2)
# Plot every term that occurs more than 1,000 times
wf <- data.frame(term=names(freq), occurrences=freq)
p <- ggplot(subset(wf, occurrences > 1000), aes(term, occurrences)) +
           geom_bar(stat="identity", fill="green") +
           xlab("Terms") + ylab("Frequency") +
           ggtitle("Figure 1: Top Frequency Terms") +
           theme(axis.text.x=element_text(angle=90, hjust=1))
p

What are the frequencies of 2-grams and 3-grams in the dataset?

Let’s get Bigram details first.

library("RWeka")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2 <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
freq2 <- colSums(as.matrix(dtm2))
ord2 <- order(freq2,decreasing=TRUE)
topFreqBigramTerms <- freq2[(ord2)][1:50]
topFreqBigramTerms
##      new york     last year     look like      year ago     right now 
##           221           179           170           162           141 
##     last week   high school     dont know     feel like    first time 
##           139           122           117           113           109 
##  look forward    last night     make sure     cant wait     dont want 
##           109           108           104            91            90 
##   even though       st loui    unit state         im go      year old 
##            90            89            88            86            86 
##       can see       can get     come back    new jersey       one day 
##            74            72            67            67            67 
##     one thing       go back     next year     just like        let go 
##            67            65            65            63            63 
##     next week san francisco       im sure    sound like    mani peopl 
##            63            63            62            62            61 
##      two year     long time     littl bit     seem like      get back 
##            61            59            58            58            56 
##      let know     los angel    dont think     will make     everi day 
##            55            55            53            52            51 
##     just want    last month    three year       tri get      will get 
##            49            49            49            49            48

Now, let's visualize them with a word cloud.

library(wordcloud)
wordcloud(names(freq2),freq2,min.freq=50,colors=brewer.pal(6,"Dark2"))

We now do a similar analysis for trigrams.

# Tokenize the Corpus into 3-grams and build the corresponding DTM
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3 <- DocumentTermMatrix(docs, control = list(tokenize = TrigramTokenizer))
freq3 <- colSums(as.matrix(dtm3))
ord3 <- order(freq3, decreasing=TRUE)
topFreqTrigramTerms <- freq3[ord3][1:25]
topFreqTrigramTerms
##       new york citi    happi mother day   gov chris christi 
##                  28                  25                  18 
## presid barack obama      st loui counti         let us know 
##                  17                  17                  16 
##       new york time       cant wait see     first time sinc 
##                  16                  15                  12 
##      dont even know       cant wait get       four year ago 
##                  11                  10                  10 
##     will take place        world war ii       cinco de mayo 
##                  10                  10                   9 
##     dream come true        fund class r      happi new year 
##                   9                   9                   9 
## wall street journal        class r bbif        feel like im 
##                   9                   8                   8 
## graduat high school high school student     im look forward 
##                   8                   8                   8 
##      im pretti sure 
##                   8

Again, let's visualize them with a word cloud.

wordcloud(names(freq3),freq3,min.freq=9,colors=brewer.pal(6,"Dark2"))

How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances in the language?

Let's see how many word instances we have in our Corpus in total.

totWords <- as.numeric(sum(freq))
totWords
## [1] 468340

Fifty percent of the total word count is computed below. We then sort the word frequencies in decreasing order, take their cumulative sum, and count how many words are needed before the cumulative sum reaches that threshold.

# 50% of all word instances in the sampled Corpus
fiftyPercentTotalWordCount <- 0.5 * sum(freq)
fiftyPercentTotalWordCount
## [1] 234170
dfFreq <- as.data.frame(freq)
# Sort frequencies in decreasing order and take their cumulative sum
dfFreqNew <- dfFreq[order(dfFreq$freq, decreasing=TRUE),]
cumWordFreq <- cumsum(dfFreqNew)
# Count how many of the most frequent words stay below the 50% threshold
flagWords <- cumWordFreq < fiftyPercentTotalWordCount
sum(flagWords)
## [1] 639
percentOfWords <- round(sum(flagWords)*100/length(freq))
percentOfWords
## [1] 2

Interestingly, only about 2% of the unique (stemmed) words (639 out of 39,797) are needed to cover 50% of all word instances in our sampled Corpus.
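
The same cumulative-sum idea can be wrapped in a small helper so that any coverage level can be queried. This is a hedged sketch, not part of the original analysis; it reuses the freq vector built above, and its result for 50% should be consistent with the count found there.

# Sketch: number of unique (stemmed) words needed to reach a given coverage level
wordsForCoverage <- function(freq, coverage = 0.5) {
    sortedFreq <- sort(freq, decreasing = TRUE)
    # words whose cumulative total stays below the target, plus the one word
    # that pushes the total over it
    sum(cumsum(sortedFreq) < coverage * sum(sortedFreq)) + 1
}
wordsForCoverage(freq, 0.5)   # about 640: one more than the 639 above, since
                              # this count includes the word crossing 50%
wordsForCoverage(freq, 0.9)   # many more words are needed for 90% coverage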

Conclusion

  1. Before analyzing Corpus data, you have to clean it. Eliminating extra whitespace, removing punctuation and stop-words, and stemming are absolutely essential.
  2. The Document Term Matrix gives you very useful information about word frequencies. Building the DTM is a crucial step.
  3. Understanding relationships between words by generating unigrams, bigrams and trigrams is vital.
  4. The above steps are the basic building blocks of text mining. They help you formulate your strategy for subsequent model building.

Next Steps

  1. Build a basic N-gram model for predicting the next word based on the previous 1, 2, or 3 words (a rough sketch using this report's bigram counts appears after this list).
  2. Build a model to handle unseen N-grams.
  3. Also explore Markov Chains.
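
As a starting point for step 1, here is a rough, hedged sketch of next-word prediction built directly on the bigram counts (freq2) and unigram counts (freq) computed in this report. The function name predictNextWord is illustrative only, the input is assumed to be a single cleaned, lower-cased (and stemmed) word, and this is not the final model.

# Sketch: suggest the n most likely next words for a given word, using the
# bigram frequencies computed above; back off to the top unigrams when the
# word was never seen as the first half of a bigram.
predictNextWord <- function(word, bigramFreq = freq2, uniFreq = freq, n = 3) {
    prefix <- paste0("^", word, " ")
    candidates <- bigramFreq[grepl(prefix, names(bigramFreq))]
    if (length(candidates) == 0) {
        # unseen context: fall back to the overall most frequent words
        return(names(head(sort(uniFreq, decreasing = TRUE), n)))
    }
    # keep only the continuation (second word) of the top-scoring bigrams
    sub(prefix, "", names(head(sort(candidates, decreasing = TRUE), n)))
}
predictNextWord("new")   # e.g. "york", "jersey", ...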

References

  1. The dataset comes from a corpus called HC Corpora: http://www.corpora.heliohost.org
  2. Documentation for the R tm package is available at https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Acknowledgement

Thank you for your time.