The objective of this milestone assignment is to explore text mining using R. The analysis uses three corpus documents drawn from blogs, tweets and news articles. At this stage we analyze the text and build basic n-gram models; these findings will later be used to implement a text prediction model.
The data was obtained from the HC Corpora site. We first load the necessary libraries, such as tm, RWeka, ggplot2 and SnowballC (full list in the appendix).
The locally saved corpus files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) are located at mainDir. A custom function (code in the appendix) takes as inputs a source file name, the percentage of lines to sample at random and an output file name; it writes the sampled file into subDir and returns the line and word counts of the source file.
mainDir <- "D:/Coursera/CapstoneProject/Coursera-SwiftKey/final/en_US/"
subDir <- "SampledFilesDir"
#Create the output sub-directory; paths below assume the working directory is mainDir
if(!file.exists(file.path(mainDir, subDir)))
{
dir.create(file.path(mainDir, subDir))
}
#Randomly sampling only 0.07% of each file, using the custom function
sampleCurrFile(paste(mainDir, "en_US.twitter.txt", sep=""), 0.07, paste(subDir, "twitter.txt",sep ="/"))
## [1] "Lines:" "2360148" "Words:" "30373543"
sampleCurrFile(paste(mainDir, "en_US.news.txt", sep=""), 0.07, paste(subDir, "news.txt", sep ="/"))
## [1] "Lines:" "1010242" "Words:" "2643969"
sampleCurrFile(paste(mainDir, "en_US.blogs.txt", sep=""), 0.07, paste(subDir, "blogs.txt", sep ="/"))
## [1] "Lines:" "899288" "Words:" "37334131"
We observe that the data sets are large, so randomly sampling them into smaller files is necessary given the apparent memory restrictions. More importantly, a random sample can still be used to draw inferences about the full population of text.
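An alternative (more memory-hungry) way to draw such a sample is to flip a biased coin for each line; below is a minimal sketch, assuming the full file fits in memory (the output name twitter_alt.txt is only for illustration).
#Alternative sampling sketch: keep each line with probability 0.0007 (~0.07%)
set.seed(123)
twitterLines <- readLines(paste(mainDir, "en_US.twitter.txt", sep=""),
                          encoding = "UTF-8", skipNul = TRUE)
keep <- rbinom(length(twitterLines), size = 1, prob = 0.0007) == 1
writeLines(twitterLines[keep], paste(subDir, "twitter_alt.txt", sep="/"))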
We need to perform some pre-processing of the data before analysis. Functions from the tm package are used to apply transformations to the corpus built from the three sampled documents. Numbers, punctuation and stopwords are removed, as they carry no analytic value. Common word endings such as "ing" and "es" are also stripped; this is called "stemming" and is applied to the dataset. These transformations leave extra white space, which is removed as well. After processing, each document in the corpus is stored as a plain text document.
mypath <- file.path("D:","Coursera","CapstoneProject", "Coursera-SwiftKey", "final", "en_US", subDir)
corpus <- Corpus(DirSource(mypath, encoding = "UTF-8", ignore.case = TRUE))
corpus <- tm_map(corpus, removePunctuation) #remove punctuation
corpus <- tm_map(corpus, removeNumbers) #remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))#remove stopwords
#stemming, removing the suffixes such as "ing", "ed", "es", "s"
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace) #remove white spaces
#Setting the corpus as a text document
corpus <- tm_map(corpus, PlainTextDocument)
To begin the text analysis, we create a document-term matrix: a matrix with documents as rows, terms as columns and term frequency counts as the cells.
docTM <- DocumentTermMatrix(corpus)
docTM
## <<DocumentTermMatrix (documents: 3, terms: 7019)>>
## Non-/sparse entries: 9363/11694
## Sparsity : 56%
## Maximal term length: 32
## Weighting : term frequency (tf)
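To see the documents-as-rows, terms-as-columns layout described above, one can peek at a small slice of the matrix (a quick check, not part of the main pipeline).
#Peek at the first few terms for the three documents
inspect(docTM[1:3, 1:6])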
We can compute term frequencies and list the terms that occur 75 or more times, as shown below.
freq <- colSums(as.matrix(docTM))
findFreqTerms(docTM, lowfreq = 75)
## [1] "can" "come" "day" "dont" "get" "good" "great" "just"
## [9] "know" "like" "look" "love" "make" "new" "now" "one"
## [17] "peopl" "right" "see" "thank" "that" "the" "think" "time"
## [25] "use" "want" "will" "work"
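Notice that stopwords such as "the" and "that" still appear in this list, most likely because capitalised forms like "The" escape the case-sensitive removeWords step and are only lower-cased later by DocumentTermMatrix. A possible refinement, not applied in the pipeline above, would be to lower-case before removing stopwords:
#Optional: lower-case first so capitalised stopwords are also removed
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))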
We now generate the frequency count of every word in the corpus, then plot the words whose frequency is greater than 75.
freq <- sort(colSums(as.matrix(docTM)), decreasing = TRUE)
wfreq <- data.frame(word = names(freq), freq = freq)
s <- subset(wfreq, freq > 75)
#Plot the findings in a bar graph
g <- ggplot(s, aes(word, freq)) +
ggtitle("Words with frequency greater than 75") +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = freq), vjust=-0.40, size = 3, color = "tan1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, color = "darkgreen"))+
theme(axis.text.y = element_text(color= "darkred"))
g
A word cloud of the most frequent words (frequency greater than 75) provides a better visualization.
#Plot the findings as a word cloud
set.seed(999)
colorb <- brewer.pal(9, "Spectral")
wordcloud(names(freq), freq, max.words = 100, rot.per = 0.2, colors= colorb,
scale = c(4,0.1))
We can also find which words correlate highly with a particular word; the example below uses a correlation limit of 0.99.
findAssocs(docTM, "tax", corlimit = 0.99)
## $tax
## absolut all amaz answer area avail award
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## behind best better big blind blood break
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## bus career cat chang chicago citi complain
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## crap cultur cute david definit detail dinner
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## done door everybodi everyth far forget general
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## get good got guess handl hang head
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## heat help higher hold ice joke late
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## leav less life like liter look love
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## market match media never news night now
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## okay page panel presid random reaction readi
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## real realiz releas right sell septemb shes
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## show shut sick social sometim song stage
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## still student sun sunday system talk team
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## technolog was wast wave we'r weekend where
## 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## which wit yesterday babi bring care check
## 1.00 1.00 1.00 0.99 0.99 0.99 0.99
## coffe email forward hell hope mind perfect
## 0.99 0.99 0.99 0.99 0.99 0.99 0.99
## pleas quot see serious sleep tire weather
## 0.99 0.99 0.99 0.99 0.99 0.99 0.99
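Since findAssocs computes correlations across documents and the corpus here has only three documents, most terms saturate at 1.00, so these associations are not very informative at this sample size. For reference, several terms can be queried in a single call (illustrative example, not run above):
#Query associations for more than one term at once
findAssocs(docTM, c("love", "time"), corlimit = 0.9)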
So far we have looked at a unigram model, where only one term is considered at a time. The main idea behind an n-gram model is to compute the probability of a word given the few words preceding it: one preceding word in the case of a 2-gram model, two in the case of a 3-gram model, and so on. We convert the corpus into a data frame for the tokenizer of the RWeka package.
corpusdf <- data.frame(text = unlist( sapply( corpus, '[', "content")),
stringsAsFactors = FALSE)
tokenizers <- " \\t\\r\\n.;:,?!\"()"
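As a quick illustration (made-up sentence, not from the corpus), the tokenizer with min and max both set to 2 returns every contiguous word pair:
#Toy example of the bigram tokenizer
NGramTokenizer("thank you for the follow",
               Weka_control(min = 2, max = 2, delimiters = tokenizers))
#the four bigrams: "thank you", "you for", "for the", "the follow"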
Using the RWeka tokenizer, we create the 2-gram (bigram) and 3-gram (trigram) sets.
biGram <- data.frame(table(NGramTokenizer(corpusdf, Weka_control(min = 2, max = 2, delimiters = tokenizers))))
biGramSorted <- biGram[order(biGram$Freq, decreasing = TRUE), ]
s2 <- biGramSorted[1:25, ]
colnames(s2) <- c("Word", "Frequency")
g2 <- ggplot(s2, aes(Word, Frequency)) +
ggtitle("Top 25 bigrams and their frequency") +
geom_bar(stat = "identity", fill = "skyblue3") +
geom_text(aes(label = Frequency), vjust=-0.40, size = 3, color = "tan1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, color = "darkseagreen3")) +
theme(axis.text.y = element_text(color = "darkred")) +
xlab("2-Words")
g2
set.seed(999)
colorb <- brewer.pal(6, "Dark2")
wordcloud(words = s2[, 1], freq= s2[,2], max.words =25, colors = colorb,
scale = c(4,0.1))
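To tie these counts back to the conditional-probability idea described earlier, the maximum-likelihood estimate of P(w2 | w1) is count(w1 w2) / count(w1). Below is a minimal sketch using the bigram table; the helper name bigramProb and the example word pair are ours, and the column names Var1 and Freq come from the table() call above.
#Hedged sketch: MLE of P(second word | first word) from the bigram counts
bigramProb <- function(w1, w2) {
    num <- biGramSorted$Freq[biGramSorted$Var1 == paste(w1, w2)]
    den <- sum(biGramSorted$Freq[grepl(paste0("^", w1, " "), biGramSorted$Var1)])
    if (length(num) == 0 || den == 0) return(0)
    num / den
}
bigramProb("happi", "new")   #e.g. estimate of P("new" | "happi") on this sample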
triGram <- data.frame(table(NGramTokenizer(corpusdf, Weka_control(min = 3, max = 3, delimiters = tokenizers))))
triGramSorted <- triGram[order(triGram$Freq, decreasing = TRUE), ]
s3 <- triGramSorted[1:25, ]
colnames(s3) <- c("Word", "Frequency")
g3 <- ggplot(s3, aes(Word, Frequency)) +
ggtitle("Top 25 trigrams and their frequency") +
geom_bar(stat = "identity", fill = "skyblue4") +
geom_text(aes(label = Frequency), vjust=-0.40, size = 3, color = "tan1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, color = "darkseagreen4")) +
theme(axis.text.y = element_text(color= "darkred")) +
xlab("3-Words")
g3
set.seed(999)
colorb <- brewer.pal(6, "Dark2")
wordcloud(words = s3[, 1], freq= s3[,2], max.words =25, colors = colorb,
scale = c(3,0.1))
From the analysis and visualizations we observe that frequencies decrease as the "n" in the n-gram models increases, and the data sparsity of the matrices becomes severe. Context becomes important for deriving meaningful predictions. Unknown words need to be addressed with smoothing techniques, and the language model can be further improved with backoff and interpolation methods.
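As a rough illustration of the backoff idea, the sketch below falls back from trigram to bigram to unigram counts when a longer context is unseen. This is a minimal sketch, not the final model: the function name predictNextWord is ours, and it relies on the freq, biGramSorted and triGramSorted objects (with table()'s Var1/Freq columns) built above.
#Rough backoff sketch (illustrative only): given the previous two words, try
#the trigram table, then the bigram table, then the most frequent unigram
predictNextWord <- function(w1, w2) {
    tri <- triGramSorted[grepl(paste0("^", w1, " ", w2, " "), triGramSorted$Var1), ]
    if (nrow(tri) > 0)
        return(sub(paste0("^", w1, " ", w2, " "), "", as.character(tri$Var1[1])))
    bi <- biGramSorted[grepl(paste0("^", w2, " "), biGramSorted$Var1), ]
    if (nrow(bi) > 0)
        return(sub(paste0("^", w2, " "), "", as.character(bi$Var1[1])))
    names(freq)[1]                    #most frequent unigram as a last resort
}
predictNextWord("happi", "new")       #hypothetical stemmed input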
This assignment gave an insight into how messy real-life text data can be. For the final capstone project, the n-gram frequencies built here need to be turned into a word prediction model, applying the smoothing and backoff ideas noted above. The appendix below lists the libraries used and the code for the sampling function.
library(tm)
library(R.utils)
library(knitr)
library(RWeka)
library(ggplot2)
library(RColorBrewer)
library(SnowballC)
library(wordcloud)
#Custom function: randomly samples 'percentage' percent of the lines of the
#source file, writes them to 'outfilename' and returns the line and word
#counts of the source file.
sampleCurrFile <- function(filepath, percentage, outfilename) {
    con <- file(filepath, "r")
    wordCount <- 0
    lineCount <- as.integer(countLines(filepath))               #total lines (R.utils)
    percentCount <- as.integer((percentage / 100) * lineCount)  #how many lines to keep
    sampledIndices <- sample(1:lineCount, percentCount)         #which lines to keep
    outCon <- file(outfilename, "w", encoding = "UTF-8")
    #readLines pulls up to lineCount lines, i.e. the whole file in one pass
    while (length(input <- readLines(con, n = lineCount)) > 0) {
        for (i in 1:length(input)) {
            s <- strsplit(input[i], " ")
            wordsperline <- length(s[[1]])
            wordCount <- wordCount + wordsperline
            if (is.element(i, sampledIndices)) {
                cat(input[i], file = outCon, sep = "\n")        #write the sampled line
            }
        }
    }
    close(outCon)
    close(con)
    return(c("Lines:", lineCount, "Words:", wordCount))
}