Natural Language Processing (NLP) is an important branch of Data Science. Modern NLP systems are based on Machine Learning: they analyze large corpora and apply statistical inference to them. Some common applications of NLP are Sentiment Analysis, Topic Modelling, Speech Recognition and Question Answering.
Today's project involves downloading a corpus consisting of blogs, news and tweets; cleaning it of non-English characters, swear-words and English stop-words; tidying the data by removing punctuation, extra white-space and numbers; applying stemming; building a Document-Term Matrix; and computing word frequencies, bigrams and trigrams.
We also analyze the top-frequency words as well as the bigrams and trigrams.
For this exercise we use the en_US locale. We are given three text files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, with 899,288, 1,010,242 and 2,360,148 lines respectively. Since these files are huge, we shall work with a sample of 10,000 lines from each.
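These line counts can be reproduced, for example, by reading each file and counting its lines (the paths below assume the ../Data/ layout used in the rest of this post):
# Count the lines in each raw file; this reads the whole file into memory,
# which is slow but matches how the files are read below.
length(readLines("../Data/en_US.blogs.txt"))
length(readLines("../Data/en_US.news.txt"))
length(readLines("../Data/en_US.twitter.txt"))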
We also notice that there are a lot of non-English characters in the original data set; we shall remove them before creating the corpus.
library("tm")
library("stringi")
set.seed(1000)
con1 <- file("../Data/en_US.blogs.txt")
allBlogs <- readLines(con1)
blogs <- sample(allBlogs,10000)
blogs <- stri_trans_general(blogs, "latin-ascii")
writeLines(blogs,"../SampleData/blogs.txt")
close(con1)
con2 <- file("../Data/en_US.news.txt")
allNews <- readLines(con2)
news <- sample(allNews,10000)
news <- stri_trans_general(news, "latin-ascii")
writeLines(news,"../SampleData/news.txt")
close(con2)
con3 <- file("../Data/en_US.twitter.txt")
allTweets <- readLines(con3)
tweets <- sample(allTweets,10000)
tweets <- stri_trans_general(tweets, "latin-ascii")
writeLines(tweets,"../SampleData/tweets.txt")
close(con3)
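As an aside, the three nearly identical blocks above could be collapsed into a small helper function; a minimal sketch (the name sampleFile is just illustrative):
# Read a raw file, draw a random sample of n lines, transliterate to ASCII
# and write the sample out -- equivalent to the three blocks above.
sampleFile <- function(inFile, outFile, n = 10000) {
  lines <- readLines(inFile)
  writeLines(stri_trans_general(sample(lines, n), "latin-ascii"), outFile)
}
# e.g. sampleFile("../Data/en_US.blogs.txt", "../SampleData/blogs.txt")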
Now that we have three sample files, let's create our corpus and inspect it:
docs <- Corpus(DirSource("../SampleData", encoding="UTF-8"))
inspect(docs)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2309418
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2035245
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 693589
We shall perform several pre-processing steps on this corpus. Let's get started.
docs <- tm_map(docs, removeNumbers)                 # drop digits
docs <- tm_map(docs, removePunctuation)             # drop punctuation
docs <- tm_map(docs, stripWhitespace)               # collapse repeated white-space
docs <- tm_map(docs, content_transformer(tolower))  # convert to lower-case
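As a quick sanity check, we could peek at a few lines of the cleaned blogs document (purely illustrative, not part of the pipeline):
# The first three cleaned lines: no digits, punctuation or upper-case letters.
head(content(docs[[1]]), 3)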
We notice that there are quite a few swear-words in this corpus. We download a swear-word list from the internet, save it as swearWords.txt, and load it:
con <- file("swearWords.txt")
badWords <- readLines(con)
close(con)
str(badWords)
## chr [1:77] "anal" "anus" "arse" "ass" "ballsack" "balls" ...
Next, we remove the swear-words and the English stop-words from our corpus.
docs <- tm_map(docs, removeWords, badWords)
docs <- tm_map(docs, removeWords, stopwords("en"))
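Note that removeWords() strips whole words only, not substrings, so words that merely contain a stop-word are left intact; a tiny illustration on a plain string:
# "is" and "an" are removed as whole words; "island" is untouched.
removeWords("this is an island", c("is", "an"))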
The last transformation we apply to the data is stemming:
library(SnowballC)
docs <- tm_map(docs,stemDocument)
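To see what the stemmer does, we can apply SnowballC's wordStem() to a few example word forms (the words here are just for illustration):
# Inflected forms are reduced to their stems, e.g. "walking" -> "walk".
wordStem(c("walking", "walked", "walks", "cats"))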
Now we are ready to do the actual analysis on this corpus.
We start by building the Document-Term Matrix:
dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 39797)>>
## Non-/sparse entries: 57527/61864
## Sparsity : 52%
## Maximal term length: 83
## Weighting : term frequency (tf)
Let's get the word count for each document in our corpus:
rowSums(as.matrix(dtm))
## blogs.txt news.txt tweets.txt
## 207403 191747 69190
Then we compute the word frequencies and print the top-frequency words:
freq <- colSums(as.matrix(dtm))        # total frequency of every term
ord <- order(freq, decreasing = TRUE)  # term indices, most frequent first
topFreqTerms <- freq[head(ord, 37)]    # the top 37 terms (each occurring over 1,000 times)
topFreqTerms
## said will one like just get time can year make day new
## 3007 2737 2573 2352 2314 2171 2151 2054 1909 1722 1636 1610
## love want work know say peopl use now see also look back
## 1411 1391 1387 1352 1322 1320 1318 1300 1276 1246 1222 1217
## first think good come even thing take way need two last well
## 1216 1203 1195 1119 1106 1105 1083 1067 1060 1056 1053 1004
## dont
## 1003
Notice that each of these words appears 1,000 times or more in our corpus.
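As a cross-check, tm's findFreqTerms() lists every term with at least a given frequency directly from the Document-Term Matrix:
# All terms occurring at least 1,000 times across the whole corpus.
findFreqTerms(dtm, lowfreq = 1000)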
Now let's plot them:
library(ggplot2)
wf <- data.frame(term = names(freq), occurrences = freq)
# Bar chart of every term that occurs more than 1,000 times.
p <- ggplot(subset(wf, occurrences > 1000), aes(term, occurrences)) +
  geom_bar(stat = "identity", fill = "green") +
  xlab("Terms") + ylab("Frequency") +
  ggtitle("Figure 1: Top Frequency Terms") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
library("RWeka")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2 <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
freq2 <- colSums(as.matrix(dtm2))
ord2 <- order(freq2,decreasing=TRUE)
topFreqBigramTerms <- freq2[(ord2)][1:50]
topFreqBigramTerms
## new york last year look like year ago right now
## 221 179 170 162 141
## last week high school dont know feel like first time
## 139 122 117 113 109
## look forward last night make sure cant wait dont want
## 109 108 104 91 90
## even though st loui unit state im go year old
## 90 89 88 86 86
## can see can get come back new jersey one day
## 74 72 67 67 67
## one thing go back next year just like let go
## 67 65 65 63 63
## next week san francisco im sure sound like mani peopl
## 63 63 62 62 61
## two year long time littl bit seem like get back
## 61 59 58 58 56
## let know los angel dont think will make everi day
## 55 55 53 52 51
## just want last month three year tri get will get
## 49 49 49 49 48
Now let's visualize the most frequent bigrams as a word cloud.
library(wordcloud)
wordcloud(names(freq2),freq2,min.freq=50,colors=brewer.pal(6,"Dark2"))
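A side note: RWeka needs a working Java installation. If that is an obstacle, a comparable bigram tokenizer can be built from the NLP package that tm already loads; a sketch (the name BigramTokenizerNLP is just illustrative):
# Pure-R bigram tokenizer based on NLP::words() and NLP::ngrams(),
# usable as a drop-in replacement for the Weka tokenizer above.
BigramTokenizerNLP <- function(x)
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "), use.names = FALSE)
# dtm2b <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizerNLP))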
Similarly, we tokenize the corpus into trigrams (three-word sequences) and build a trigram Document-Term Matrix:
# Weka tokenizer that splits the text into three-word sequences (trigrams).
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3 <- DocumentTermMatrix(docs, control = list(tokenize = TrigramTokenizer))
freq3 <- colSums(as.matrix(dtm3))
ord3 <- order(freq3,decreasing=TRUE)
topFreqTrigramTerms <- freq3[(ord3)][1:25]
topFreqTrigramTerms
## new york citi happi mother day gov chris christi
## 28 25 18
## presid barack obama st loui counti let us know
## 17 17 16
## new york time cant wait see first time sinc
## 16 15 12
## dont even know cant wait get four year ago
## 11 10 10
## will take place world war ii cinco de mayo
## 10 10 9
## dream come true fund class r happi new year
## 9 9 9
## wall street journal class r bbif feel like im
## 9 8 8
## graduat high school high school student im look forward
## 8 8 8
## im pretti sure
## 8
Again, let's visualize the most frequent trigrams as a word cloud.
wordcloud(names(freq3),freq3,min.freq=9,colors=brewer.pal(6,"Dark2"))
Let's see how many words we have in our corpus in total:
totWords <- as.numeric(sum(freq))
totWords
## [1] 468340
Fifty percent of the total word count is therefore as follows. We then take the cumulative sum of the word frequencies, sorted in decreasing order, and count how many of the most frequent words are needed to reach that mark.
fiftyPercentTotalWordCount <- 0.5 * sum(freq)
fiftyPercentTotalWordCount
## [1] 234170
dfFreq <- as.data.frame(freq)
# Sort the frequencies in decreasing order and accumulate them.
dfFreqNew <- dfFreq[order(dfFreq$freq, decreasing = TRUE), ]
cumWordFreq <- cumsum(dfFreqNew)
# Number of top words whose cumulative frequency stays below the 50% mark.
flagWords <- cumWordFreq < fiftyPercentTotalWordCount
sum(flagWords)
## [1] 639
percentOfWords <- round(sum(flagWords)*100/length(freq))
percentOfWords
## [1] 2
Interestingly enough, only about 2% of the unique words are needed to cover 50% of all word occurrences in the corpus.
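To visualize this, we could plot the coverage curve from the cumulative sums computed above; a minimal base-R sketch:
# Fraction of all word occurrences covered by the top-ranked unique words.
coverage <- cumWordFreq / sum(freq)
plot(coverage, type = "l", log = "x",
     xlab = "Number of top-ranked words (log scale)",
     ylab = "Fraction of word occurrences covered",
     main = "Word coverage")
abline(h = 0.5, lty = 2)  # the 50% mark, reached after 639 words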
Thank you for your time.