Milestone Report

Introduction

This report covers two tasks.

TASK #1 - EXPLORATORY DATA ANALYSIS ON TEXT DATA

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data and to prepare for building the first linguistic models.

Tasks to accomplish:

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.

Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

TASK #2 - MODELING

The goal here is to build our first simple model for the relationship between words. This is the first step in building a predictive text mining application. We will explore simple models and discover more complicated modeling techniques.

Tasks to accomplish:

Build basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.

Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
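
As a rough illustration of what a basic n-gram model estimates, the sketch below computes maximum-likelihood bigram probabilities in base R. The token vector is hypothetical toy data, not taken from the corpora, and the variable names are illustrative only. It also shows why unseen n-grams need special handling: a word pair that never occurs gets a probability of exactly zero.

```r
# Toy corpus, already lowercased and tokenised (hypothetical example data)
tokens <- c("i", "love", "you", "i", "love", "cake", "i", "know", "you")

# Pair each word with the word that follows it: rows = previous, columns = next
prev_word <- head(tokens, -1)
next_word <- tail(tokens, -1)
bigram_counts <- table(prev_word, next_word)

# Maximum-likelihood estimate:
# P(next | previous) = count(previous next) / count(previous followed by anything)
bigram_prob <- prop.table(bigram_counts, margin = 1)

bigram_prob["i", "love"]   # "i" is followed by "love" in 2 of its 3 occurrences
bigram_prob["love", "i"]   # never observed, so the plain estimate is 0
```

The zero in the last line is the case the second modeling task is meant to address, for example by falling back to a shorter context.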

General adjustments

The raw corpus data is downloaded and stored locally at:

Blog: ./final/en_US/en_US.blogs.txt

News: ./final/en_US/en_US.news.txt

Twitter: ./final/en_US/en_US.twitter.txt

I have saved the datasets in the directory named “Capstone” on my Desktop. The file path of the location is “C:.jyenis2”.

Also, let’s load all the libraries needed for the tasks above.

suppressMessages(library(NLP))
suppressMessages(library(tm))
suppressMessages(library(RColorBrewer))
suppressMessages(library(wordcloud))
suppressMessages(library(dplyr))
suppressMessages(library(stringi))
suppressMessages(library(RWeka))
suppressMessages(library(ggplot2))
suppressMessages(library(ngram))
suppressMessages(library(quanteda))
suppressMessages(library(gridExtra))

Load, sample and clean the data

Let’s first load the data and read the lines of each file into R variables.

# File path
file1 <- "./final/en_US/en_US.blogs.txt"
file2 <- "./final/en_US/en_US.news.txt"
file3 <- "./final/en_US/en_US.twitter.txt"
# Read blogs
connect <- file(file1, open="rb")
blogs <- readLines(connect, encoding="UTF-8"); close(connect)
# Read news
connect <- file(file2, open="rb")
news <- readLines(connect, encoding="UTF-8"); close(connect)
# Read twitter
connect <- file(file3, open="rb")
twitter <- readLines(connect, encoding="UTF-8"); close(connect)

## Warning in readLines(connect, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul

## Warning in readLines(connect, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul

## Warning in readLines(connect, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul

## Warning in readLines(connect, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul

rm(connect)
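
The readLines() warnings about embedded nul characters can be avoided with the skipNul argument. The sketch below is a minimal illustration on a small temporary file with made-up content, not the corpus files. By default an embedded nul ends the line being read and raises a warning, whereas skipNul = TRUE skips the nul byte and keeps the rest of the line. Applying it to the real files may change the character counts slightly.

```r
# Write a small temporary file containing an embedded nul byte (toy data)
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("good"), as.raw(0), charToRaw(" morning\n"),
           charToRaw("second line\n")), tmp)

# skipNul = TRUE drops nul bytes instead of cutting the line short with a warning
con <- file(tmp, open = "rb")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
unlink(tmp)

lines
```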

Let’s examine the data to get a sense of what we will be dealing with.

summaryData <- sapply(list(blogs,news,twitter),function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(summaryData) <- c('Min','Mean','Max')
stats <- data.frame(
  FileName=c("en_US.blogs","en_US.news","en_US.twitter"),      
  t(rbind(sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],  Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',], summaryData)))
head(stats)

##        FileName   Lines     Chars    Words Min     Mean  Max
## 1   en_US.blogs  899288 206824382 37570839   0 41.75107 6726
## 2    en_US.news 1010242 203223154 34494539   1 34.40997 1796
## 3 en_US.twitter 2360148 162096031 30451128   1 12.75063   47

# Get file sizes
blogs.size <- file.info(file1)$size / 1024 ^ 2
news.size <- file.info(file2)$size / 1024 ^ 2
twitter.size <- file.info(file3)$size / 1024 ^ 2
# Summary of dataset
df <- data.frame(
  Doc = c("blogs", "news", "twitter"),
  Size.MB = c(blogs.size, news.size, twitter.size),
  Num.Lines = c(length(blogs), length(news), length(twitter)),
  # sum(nchar(x)) totals characters, not words, so the column is named accordingly
  Num.Chars = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter))))
df

##       Doc  Size.MB Num.Lines Num.Chars
## 1   blogs 200.4242    899288 206824505
## 2    news 196.2775   1010242 203223159
## 3 twitter 159.3641   2360148 162096031
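
The last column of this table is computed with sum(nchar(...)), which totals characters rather than words. That is why its values are so close to the Chars column of the first table. The sketch below shows the difference on two made-up lines, using base R whitespace splitting as a simple stand-in for stri_count_words(), which may tokenise some inputs differently.

```r
# Two toy lines standing in for a corpus (hypothetical data)
x <- c("the quick brown fox", "hello world")

chars_per_line <- nchar(x)                              # characters, spaces included
words_per_line <- lengths(strsplit(x, "[[:space:]]+"))  # whitespace-separated tokens

sum(chars_per_line)  # total characters
sum(words_per_line)  # total words
```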

Since these data sets are large and the available memory is limited, we first sample the data and then clean it. I take 0.1% of the lines of each data set so that my machine has enough memory to process them. I tried a 1% sample, but my machine ran out of memory, so I went with the smaller sample. Cleaning at this stage means converting the text from UTF-8 to ASCII and dropping any characters that cannot be converted.

set.seed(123)
# Sampling
sampleBlogs <- blogs[sample(1:length(blogs), 0.001*length(blogs), replace=FALSE)]
sampleNews <- news[sample(1:length(news), 0.001*length(news), replace=FALSE)]
sampleTwitter <- twitter[sample(1:length(twitter), 0.001*length(twitter), replace=FALSE)]
# Cleaning
sampleBlogs <- iconv(sampleBlogs, "UTF-8", "ASCII", sub="")
sampleNews <- iconv(sampleNews, "UTF-8", "ASCII", sub="")
sampleTwitter <- iconv(sampleTwitter, "UTF-8", "ASCII", sub="")
data.sample <- c(sampleBlogs,sampleNews,sampleTwitter)
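
To make the sample size concrete, the short calculation below applies the 0.1% rate to the line counts reported in the summary table. It is only a worked example: floor() is used here to mirror the fact that sample() drops the fractional part of a non-integer size.

```r
# Line counts reported in the summary table above
n_lines <- c(blogs = 899288, news = 1010242, twitter = 2360148)

# 0.1% of each file, rounded down to a whole number of lines
sample_sizes <- floor(0.001 * n_lines)

sample_sizes       # lines sampled per source
sum(sample_sizes)  # total lines in the combined sample
```

With these counts the combined sample holds a little over four thousand lines, which is worth keeping in mind when reading the frequency thresholds used later.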

Build Corpus and more cleaning

Now that we have sampled our data and combined all three data sets into one, we will build the corpus that is used later to construct the term-document matrix. In this section, we also apply more cleaning: converting the text to lowercase, removing punctuation and numbers, and stripping extra whitespace.

build_corpus <- function (x = data.sample) {
  sample_c <- VCorpus(VectorSource(x)) # Create corpus dataset
  sample_c <- tm_map(sample_c, content_transformer(tolower)) # all lowercase
  sample_c <- tm_map(sample_c, removePunctuation) # Eliminate punctuation
  sample_c <- tm_map(sample_c, removeNumbers) # Eliminate numbers
  sample_c <- tm_map(sample_c, stripWhitespace) # Strip Whitespace
  sample_c # Return the cleaned corpus explicitly
}
corpusData <- build_corpus(data.sample)
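
To see what these cleaning steps do to a single line, here is a rough base R approximation. The helper clean_text() is hypothetical and uses regular expressions in place of the tm transformations, so edge cases may differ from what tm_map() produces.

```r
# clean_text() is a hypothetical helper that mimics the four cleaning steps
clean_text <- function(x) {
  x <- tolower(x)                    # all lowercase
  x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)   # remove numbers
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  x
}

cleaned <- clean_text("I can't wait: 3 days   to go!")
cleaned
```

Because apostrophes count as punctuation, contractions lose them, which is why terms such as “cant” and “dont” show up among the n-grams later in the report.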

Tokenize and build n-grams

getTermTable <- function(corpusData, ngrams = 1, lowfreq = 50) {
  # Create a term-document matrix tokenized on n-grams
  tokenizer <- function(x) { NGramTokenizer(x, Weka_control(min = ngrams, max = ngrams)) }
  tdm <- TermDocumentMatrix(corpusData, control = list(tokenize = tokenizer))
  # Find the terms that occur at least lowfreq times in the corpus
  top_terms <- findFreqTerms(tdm, lowfreq)
  top_terms_freq <- rowSums(as.matrix(tdm[top_terms,]))
  top_terms_freq <- data.frame(word = names(top_terms_freq), frequency = top_terms_freq)
  # Return the table sorted from most to least frequent
  arrange(top_terms_freq, desc(frequency))
}

# Pre-allocate a list of length 3: unigrams, bigrams, trigrams
tt.Data <- vector("list", 3)
for (i in 1:3) {
  tt.Data[[i]] <- getTermTable(corpusData, ngrams = i, lowfreq = 10)
}
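
For intuition about what each term table contains, the sketch below counts n-grams for one made-up sentence in base R. The helper ngram_counts() is hypothetical and is not how NGramTokenizer() works internally. It only illustrates the idea of sliding a window of n words over the tokens and tallying the results.

```r
# ngram_counts() is a hypothetical helper: slide a window of n words and count
ngram_counts <- function(tokens, n) {
  if (length(tokens) < n) return(integer(0))
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)  # most frequent n-gram first
}

# Toy sentence, already cleaned (hypothetical data)
tokens <- strsplit("the cat sat on the mat and the cat slept", " ", fixed = TRUE)[[1]]
bigrams <- ngram_counts(tokens, 2)
bigrams
```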

Build Wordcloud

Let’s plot word clouds of the unigrams, bi-grams and tri-grams to see which terms are most frequent.

# Set random seed for reproducibility
set.seed(123)
# Set Plotting in 1 row 3 columns
par(mfrow=c(1, 3))
for (i in 1:3) {
  wordcloud(tt.Data[[i]]$word, tt.Data[[i]]$frequency, scale = c(3,1), max.words=100, random.order=FALSE, rot.per=0, fixed.asp = TRUE, use.r.layout = FALSE, colors=brewer.pal(8, "Dark2"))
}

wordcloud() issued one warning for every term that could not be fit on the page, and those terms were left out of the plots. All of the affected terms are bi-grams or tri-grams, for example “would be”, “thanks for” and “looking forward to”.

Build n-gram models and histograms

In this section, I plot the 20 most frequent unigrams, bi-grams and tri-grams as bar charts to give a sense of how the term frequencies are distributed.

plot.Grams <- function (x = tt.Data, N = 10) {
  labels <- c("Unigrams", "Bigrams", "Trigrams")
  fills <- c("green", "blue", "darkgreen")
  # One bar chart per n-gram order, showing the N most frequent terms
  plots <- lapply(1:3, function(i) {
    ggplot(data = head(x[[i]], N), aes(x = reorder(word, -frequency), y = frequency)) +
      geom_bar(stat = "identity", fill = fills[i]) +
      ggtitle(labels[i]) +
      xlab(labels[i]) + ylab("Frequency") +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))
  })
  # Put three plots into 1 row 3 columns
  gridExtra::grid.arrange(grobs = plots, ncol = 3)
}
plot.Grams(x = tt.Data, N = 20)

Findings and next steps

The next step is to plan the prediction algorithm and the Shiny application.

To train the prediction model:

1. All 3 of the files are very large. Even with 0.1% of the data, the exploratory analysis and the n-gram tables took quite a bit of time to run, so I need to make better use of the available resources and improve performance.
2. Looking at the unigram frequencies, there is a lot of overlap between the most frequent words in these 3 files. As a next step, I need to perform more data cleaning to remove very common words and phrases such as “the”, “of the” and so on.
3. Review how to remove misspelled words so that the model does not predict them.
4. I have also looked into stemming words with Snowball stemmers and will apply it.
5. I have looked into Markov chain solutions for prediction and might use them in the next steps.
6. Finally, the application will be built in Shiny.
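
One possible shape for the prediction step is sketched below: a simple back-off lookup in base R, in the spirit of a Markov chain where the next word depends only on the last few words typed. This is an assumption about how the plan could be realised, not the algorithm the final application will necessarily use. The function predict_next() and the frequency tables are hypothetical, with made-up counts.

```r
# Toy n-gram frequency tables (hypothetical counts, not taken from the corpus)
trigrams <- c("looking forward to" = 12, "cant wait to" = 9, "thanks for the" = 7)
bigrams  <- c("thank you" = 20, "right now" = 15, "of my" = 10)
unigrams <- c("the" = 100, "to" = 80, "and" = 60)

# predict_next() is a hypothetical helper: it looks for the most frequent n-gram
# that starts with the last words typed, and backs off to a shorter context
# whenever the longer one was never observed.
predict_next <- function(phrase, tables = list(trigrams, bigrams)) {
  words <- strsplit(tolower(trimws(phrase)), "[[:space:]]+")[[1]]
  for (tbl in tables) {
    n <- lengths(strsplit(names(tbl)[1], " ", fixed = TRUE))  # order of this table
    if (length(words) < n - 1) next                           # not enough context
    prefix <- paste0(paste(tail(words, n - 1), collapse = " "), " ")
    hits <- tbl[startsWith(names(tbl), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(substring(best, nchar(prefix) + 1))              # keep only the new word
    }
  }
  names(unigrams)[which.max(unigrams)]  # last resort: most frequent single word
}

predict_next("I am looking forward")  # context found in the trigram table
predict_next("kind of")               # no trigram match, falls back to bigrams
predict_next("hello zebra")           # nothing matches, falls back to unigrams
```

Backing off like this is one simple way to handle unseen n-grams. Smoothing methods are an alternative that would need to be evaluated separately.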