Text Prediction Milestone Report

Bhavana Shah

The objective of this milestone assignment is to explore text mining using R. The analysis uses three corpus documents drawn from blogs, tweets and news articles. At this stage we analyze the text and build basic n-gram models; these findings will later be used to implement the text prediction model.

Data Source, Sampling and Statistics

The data has been obtained from the HC Corpora site. We first load all the necessary libraries, such as tm, RWeka, ggplot2 and SnowballC (full list in the appendix).

The locally saved corpus files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) are located at mainDir. The custom function (code in the appendix) takes three inputs: the source file name, the percentage of lines to be randomly sampled and the output file name. It writes the sampled file into subDir and returns the line and word counts of the source file.

mainDir <- "D:/Coursera/CapstoneProject/Coursera-SwiftKey/final/en_US/"
subDir <- "SampledFilesDir"
if(!file.exists(subDir))
{
    dir.create(file.path(mainDir, subDir))    
}
#Randomly sampling only 0.07% of each file, using the custom function
sampleCurrFile(paste(mainDir, "en_US.twitter.txt", sep=""), 0.07, paste(subDir, "twitter.txt",sep ="/"))
## [1] "Lines:"   "2360148"  "Words:"   "30373543"
sampleCurrFile(paste(mainDir, "en_US.news.txt", sep=""), 0.07, paste(subDir, "news.txt", sep ="/"))
## [1] "Lines:"  "1010242" "Words:"  "2643969"
sampleCurrFile(paste(mainDir, "en_US.blogs.txt", sep=""), 0.07, paste(subDir, "blogs.txt", sep ="/"))
## [1] "Lines:"   "899288"   "Words:"   "37334131"

We observe that the data is large, so randomly sampling it down to a smaller dataset is necessary given the apparent memory restrictions. More importantly, a random sample can still be used to infer facts about the full population.

Exploration & Preprocessing

We need to perform some pre-processing of the data for analysis. Functions from the 'tm' package are used to apply transformations to the corpus built from the three sampled documents. Removing numbers, punctuation and stopwords is important, as they carry no analytic value. Removing common word endings such as "ing" and "es", called "stemming", is also performed on the dataset. The transformations above leave a lot of extra white space, which is removed as well. After processing, the corpus is set as a plain text document.

mypath <- file.path("D:","Coursera","CapstoneProject", "Coursera-SwiftKey", "final", "en_US", subDir)
corpus <- Corpus(DirSource(mypath, encoding = "UTF-8", ignore.case = TRUE))
corpus <- tm_map(corpus, removePunctuation)     #remove punctuation
corpus <- tm_map(corpus, removeNumbers)         #remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))#remove stopwords 
#stemming, removing the suffixes such as "ing", "ed", "es", "s"
corpus <- tm_map(corpus, stemDocument) 
corpus <- tm_map(corpus, stripWhitespace)       #remove white spaces
#Setting the corpus as a text document
corpus <- tm_map(corpus, PlainTextDocument)

Text Analysis

To begin the text analysis, we create a Document Term Matrix: a matrix with documents as rows, terms as columns and term frequency counts as the cells.

docTM <- DocumentTermMatrix(corpus)
docTM
## <<DocumentTermMatrix (documents: 3, terms: 7019)>>
## Non-/sparse entries: 9363/11694
## Sparsity           : 56%
## Maximal term length: 32
## Weighting          : term frequency (tf)
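
As an illustration (not part of the original output), a small slice of the matrix can be inspected with tm's inspect function to confirm the documents-as-rows, terms-as-columns layout; the exact terms shown depend on the random sample.

inspect(docTM[1:3, 1:5])   #first 3 documents, first 5 terms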

From the term frequencies we can find the most frequent terms, those occurring 75 times or more, as shown below.

freq <- colSums(as.matrix(docTM))
findFreqTerms(docTM, lowfreq = 75)
##  [1] "can"   "come"  "day"   "dont"  "get"   "good"  "great" "just" 
##  [9] "know"  "like"  "look"  "love"  "make"  "new"   "now"   "one"  
## [17] "peopl" "right" "see"   "thank" "that"  "the"   "think" "time" 
## [25] "use"   "want"  "will"  "work"

We now generate the frequency count of all words in the corpus and plot the words whose frequency is greater than 75.

freq <- sort(colSums(as.matrix(docTM)), decreasing = TRUE)
wfreq <- data.frame(word = names(freq), freq = freq)
s <- subset(wfreq, freq > 75)

#Plot the findings in a bar graph
g <-  ggplot(s, aes(word, freq), fill = word ) +
ggtitle("Words with frequency greater than 75") +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = freq), vjust=-0.40, size = 3, color = "tan1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, color = "darkgreen"))+
theme(axis.text.y = element_text(color= "darkred"))
g

A word cloud of the most frequent words (limited here to 100 words) provides another useful visualization.

#Plot the findings as a word cloud
set.seed(999)   
colorb <- brewer.pal(9, "Spectral")   
wordcloud(names(freq), freq, max.words = 100, rot.per = 0.2, colors= colorb, 
          scale = c(4,0.1))  

We can also find out which other words a particular word correlates highly with; a correlation limit of 0.99 is used in the example below.

findAssocs(docTM, "tax", corlimit = 0.99)
## $tax
##   absolut       all      amaz    answer      area     avail     award 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##    behind      best    better       big     blind     blood     break 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##       bus    career       cat     chang   chicago      citi  complain 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##      crap    cultur      cute     david   definit    detail    dinner 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##      done      door everybodi   everyth       far    forget   general 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##       get      good       got     guess     handl      hang      head 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##      heat      help    higher      hold       ice      joke      late 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##      leav      less      life      like     liter      look      love 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##    market     match     media     never      news     night       now 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##      okay      page     panel    presid    random  reaction     readi 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##      real    realiz    releas     right      sell   septemb      shes 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##      show      shut      sick    social   sometim      song     stage 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##     still   student       sun    sunday    system      talk      team 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
## technolog       was      wast      wave      we'r   weekend     where 
##      1.00      1.00      1.00      1.00      1.00      1.00      1.00 
##     which       wit yesterday      babi     bring      care     check 
##      1.00      1.00      1.00      0.99      0.99      0.99      0.99 
##     coffe     email   forward      hell      hope      mind   perfect 
##      0.99      0.99      0.99      0.99      0.99      0.99      0.99 
##     pleas      quot       see   serious     sleep      tire   weather 
##      0.99      0.99      0.99      0.99      0.99      0.99      0.99

N-gram Models and Visualizations

Previously we looked at a unigram model, where only one term is considered at a time. The main idea behind an n-gram model is that the probability of a word is estimated from the preceding n-1 words: one preceding word for a 2-gram (bigram) model, two preceding words for a 3-gram (trigram) model, and so on; a small illustrative sketch of this estimation follows the bigram tables below. We first convert the corpus into a data frame for the tokenizer of the 'RWeka' package.

corpusdf <- data.frame(text = unlist( sapply( corpus, '[', "content")), 
                       stringsAsFactors = FALSE)
tokenizers <- " \\t\\r\\n.;:,?!\"()"

Using the RWeka tokenizer, we create the 2-gram (bigram) and 3-gram (trigram) sets.

2-grams

biGram <- data.frame(table(NGramTokenizer(corpusdf, Weka_control(min = 2, max = 2, delimiters = tokenizers))))
biGramSorted <- biGram[order(biGram$Freq, decreasing = TRUE), ]
s2 <- biGramSorted[1:25, ]
colnames(s2) <- c("Word", "Frequency")

g2 <- ggplot(s2, aes(Word, Frequency), fill = Word ) +
ggtitle("Top 25 bigrams and their frequency") +
geom_bar(stat = "identity", fill = "skyblue3") +
geom_text(aes(label = Frequency), vjust=-0.40, size = 3, color = "tan1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, color = "darkseagreen3")) +
theme(axis.text.y = element_text(color = "darkred")) +
xlab("2-Words")
g2

set.seed(999)   
colorb <- brewer.pal(6, "Dark2")   
wordcloud(words = s2[, 1], freq= s2[,2], max.words =25, colors = colorb, 
          scale = c(4,0.1))  
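
To connect these counts back to the prediction idea described earlier, here is a minimal, illustrative sketch (not part of the original analysis) of a maximum-likelihood estimate of the next word given one preceding word. It assumes the biGram table built above keeps the default table() column names Var1 and Freq; nextWordProb is a hypothetical helper, and count(w1) is approximated by the total count of bigrams starting with w1.

nextWordProb <- function(prevWord, bigramTable = biGram) {
    #split each bigram "w1 w2" into its two words
    pairs  <- strsplit(as.character(bigramTable$Var1), " ")
    first  <- sapply(pairs, `[`, 1)
    second <- sapply(pairs, `[`, 2)
    idx <- which(first == prevWord)
    if (length(idx) == 0) return(NULL)   #unseen history
    #P(w2 | w1) = count(w1 w2) / count(w1)
    probs <- bigramTable$Freq[idx] / sum(bigramTable$Freq[idx])
    data.frame(next_word = second[idx], prob = probs)[order(-probs), ]
}
#Example: candidate continuations of "love" (results depend on the sample)
head(nextWordProb("love"))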

3-grams

triGram <- data.frame(table(NGramTokenizer(corpusdf, Weka_control(min = 3, max = 3, delimiters = tokenizers))))
triGramSorted <- triGram[order(triGram$Freq, decreasing = TRUE), ]
s3 <- triGramSorted[1:25, ]
colnames(s3) <- c("Word", "Frequency")

g3 <- ggplot(s3, aes(Word, Frequency), fill = Word ) +
ggtitle("Top 25 trigrams and their frequency") +
geom_bar(stat = "identity", fill = "skyblue4") +
geom_text(aes(label = Frequency), vjust=-0.40, size = 3, color = "tan1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, color = "darkseagreen4")) +
theme(axis.text.y = element_text(color= "darkred")) +
        xlab("3-Words")
g3

set.seed(999)   
colorb <- brewer.pal(6, "Dark2")   
wordcloud(words = s3[, 1], freq= s3[,2],  max.words =25, colors = colorb, 
          scale = c(3,0.1))  

From the analysis and visualizations we observe that frequencies decrease as the 'n' in the n-gram models increases, and the sparsity of the matrices becomes severe. Context becomes important for deriving meaningful predictions. Unknown words need to be handled with smoothing methods, and the language model can be further improved with backoff and interpolation methods.
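
As a rough illustration of the backoff idea (a simplified "stupid backoff" sketch, not the final implementation), the trigram table can be queried first and the bigram table used as a fallback with a discounted score. Here predictNext and the fixed 0.4 backoff weight are assumptions for illustration, and the tables are assumed to keep the default Var1/Freq column names from table().

predictNext <- function(history2, history1, tri = triGram, bi = biGram, lambda = 0.4) {
    #try trigrams whose first two words match the two-word history
    hits <- tri[grepl(paste0("^", history2, " "), as.character(tri$Var1)), ]
    if (nrow(hits) > 0) {
        best  <- hits[which.max(hits$Freq), ]
        words <- strsplit(as.character(best$Var1), " ")[[1]]
        return(data.frame(prediction = words[3], score = best$Freq / sum(hits$Freq)))
    }
    #back off to bigrams whose first word matches, discounting the score by lambda
    hits <- bi[grepl(paste0("^", history1, " "), as.character(bi$Var1)), ]
    if (nrow(hits) > 0) {
        best  <- hits[which.max(hits$Freq), ]
        words <- strsplit(as.character(best$Var1), " ")[[1]]
        return(data.frame(prediction = words[2], score = lambda * best$Freq / sum(hits$Freq)))
    }
    NA   #unseen history: a full model would fall back to the unigram distribution
}
#Example: predict the word following "love the" (results depend on the sample)
predictNext("love the", "the")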

Outline for Developing the Final Prediction Application

This assignment gave insight into what real-life data look like. For the final capstone project, the following needs to be accomplished:

  • Understand and implement efficiency measures to balance algorithm speed and memory usage.
  • Understand and implement smoothing/back-off techniques.
  • Handle sparsity and unknown words.
  • Understand which machine learning algorithm gives the best results.
  • Create a user interface as a Shiny app, with the following functionality:
    * The user enters a phrase and the app predicts the next word
    * The user can turn a profanity filter on/off (see the sketch after this list)
  • Create a presentation
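
As an illustration of the profanity-filter item, one option (a minimal sketch, assuming a word list saved locally as a hypothetical file badwords.txt) is to reuse the removeWords transformation from the 'tm' package already used in the preprocessing step:

profanity <- readLines("badwords.txt", encoding = "UTF-8")  #hypothetical profanity word list
corpus <- tm_map(corpus, removeWords, profanity)            #drop profane terms from the corpus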

Appendix

library(tm)
library(R.utils)
library(knitr)
library(RWeka)
library(ggplot2)
library(RColorBrewer)
library(SnowballC)
library(wordcloud) 
#Custom function: randomly samples 'percentage' percent of the lines of 'filepath',
#writes them to 'outfilename' and returns the line and word counts of the source file
sampleCurrFile = function(filepath, percentage, outfilename) {
        con  <-  file(filepath, "r")
        wordCount <- 0
        lineCount <- as.integer(countLines(filepath))               #total lines, defines the sampling range
        percentCount <- as.integer((percentage / 100) * lineCount)  #how many lines to sample
        sampledIndices <- sample(1:lineCount, percentCount)         #indices of the lines to keep
        outCon  <- file(outfilename, "w", encoding = "UTF-8")
        while (length(input <- readLines(con, n = lineCount)) > 0){ 
                for (i in 1:length(input)){ 
                       #count the words on every line of the source file
                       s <- strsplit(input[i], " ")        
                       wordsperline <- length(s[[1]])
                       wordCount <- wordCount + wordsperline
                       #write out only the randomly sampled lines
                       if(is.element(i, sampledIndices))
                       {
                          cat(input[i], file = outCon, sep ="\n", append = TRUE)
                       }
                } 
        } 
        close(outCon)
        close(con)
        return(c("Lines:", lineCount, "Words:", wordCount))
 }