This report covers two tasks.
TASK #1 - EXPLORATORY DATA ANALYSIS ON TEXT DATA
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data and to prepare for building our first linguistic models.
Tasks to accomplish:
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
TASK #2 - MODELING
The goal here is to build our first simple model for the relationship between words. This is the first step in building a predictive text mining application. We will explore simple models and discover more complicated modeling techniques.
Tasks to accomplish:
Build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora, so the model must handle cases where a particular n-gram isn’t observed.
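As a rough illustration of how the two modeling tasks fit together, the sketch below looks up the longest matching context first and backs off to shorter contexts when an n-gram was never observed. The counts, the `counts` list and the `predict_next` helper are hypothetical and only serve the example; a real back-off model (for example Katz or "stupid" back-off) would also discount scores rather than simply take the top count.

```r
# Hypothetical n-gram counts; in practice these would come from the corpus.
unigrams <- c("i" = 4, "love" = 3, "you" = 2)
bigrams  <- c("i love" = 3, "love you" = 2, "love coffee" = 1)
trigrams <- c("i love you" = 2, "i love coffee" = 1)
counts   <- list(unigrams, bigrams, trigrams)

# Predict the next word from the last 1 or 2 words of a phrase (assumed helper).
predict_next <- function(phrase, counts) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  max_context <- min(length(words), length(counts) - 1)
  # Try the longest context first, then back off to shorter ones.
  for (k in rev(seq_len(max_context))) {
    context <- paste(tail(words, k), collapse = " ")
    tab <- counts[[k + 1]]
    hits <- tab[startsWith(names(tab), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # No context matched: fall back to the most frequent unigram.
  names(counts[[1]])[which.max(counts[[1]])]
}

predict_next("i love", counts)     # trigram match
predict_next("they love", counts)  # unseen trigram, backs off to bigrams
predict_next("zebra", counts)      # unseen word, falls back to unigrams
```

With these toy counts the first two calls return "you" and the last returns "i".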
The raw corpus data is downloaded and stored locally at:
Blog: ./data/en_US.blogs.txt
News: ./data/en_US.news.txt
Twitter: ./data/en_US.twitter.txt
I have saved the datasets in the directory named “Capstone” on my Desktop. The file path of the location is “C:.jyenis2”.
Next, let’s load all the libraries needed for the tasks above.
suppressMessages(library(NLP))
suppressMessages(library(tm))
suppressMessages(library(RColorBrewer))
suppressMessages(library(wordcloud))
suppressMessages(library(dplyr))
suppressMessages(library(stringi))
suppressMessages(library(RWeka))
suppressMessages(library(ggplot2))
suppressMessages(library(ngram))
suppressMessages(library(quanteda))
suppressMessages(library(gridExtra))
# File path
file1 <- "./final/en_US/en_US.blogs.txt"
file2 <- "./final/en_US/en_US.news.txt"
file3 <- "./final/en_US/en_US.twitter.txt"
# Read blogs
connect <- file(file1, open="rb")
blogs <- readLines(connect, encoding="UTF-8"); close(connect)
# Read news
connect <- file(file2, open="rb")
news <- readLines(connect, encoding="UTF-8"); close(connect)
# Read twitter
connect <- file(file3, open="rb")
twitter <- readLines(connect, encoding="UTF-8"); close(connect)
## Warning in readLines(connect, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(connect, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(connect, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(connect, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul
rm(connect)
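The "embedded nul" warnings above mean that readLines() cut those Twitter lines short at the nul byte. One possible way to avoid this, sketched here on a small temporary file rather than the real data, is the `skipNul = TRUE` argument of readLines(). The file contents below are hypothetical.

```r
# Write a small hypothetical file that contains an embedded nul byte.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("good"), as.raw(0), charToRaw(" morning\nsecond line\n")), tmp)

# skipNul = TRUE drops nul bytes instead of ending the line early with a warning.
con <- file(tmp, open = "rb")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
unlink(tmp)

lines
```

Applied to the Twitter file, the same argument would keep the text that follows each nul byte, so the character and word totals could differ slightly from the ones reported here.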
summaryData <- sapply(list(blogs,news,twitter),function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(summaryData) <- c('Min','Mean','Max')
stats <- data.frame(
FileName=c("en_US.blogs","en_US.news","en_US.twitter"),
t(rbind(sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),], Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',], summaryData)))
head(stats)
##        FileName   Lines     Chars    Words Min     Mean  Max
## 1   en_US.blogs  899288 206824382 37570839   0 41.75107 6726
## 2    en_US.news 1010242 203223154 34494539   1 34.40997 1796
## 3 en_US.twitter 2360148 162096031 30451128   1 12.75063   47
# Get file sizes
blogs.size <- file.info(file1)$size / 1024 ^ 2
news.size <- file.info(file2)$size / 1024 ^ 2
twitter.size <- file.info(file3)$size / 1024 ^ 2
# Summary of dataset
# nchar() counts characters, so the last column is a character count, not a word count
df <- data.frame(Doc = c("blogs", "news", "twitter"), Size.MB = c(blogs.size, news.size, twitter.size), Num.Lines = c(length(blogs), length(news), length(twitter)), Num.Chars = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter))))
df
##       Doc  Size.MB Num.Lines Num.Chars
## 1   blogs 200.4242    899288 206824505
## 2    news 196.2775   1010242 203223159
## 3 twitter 159.3641   2360148 162096031
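Note that `sum(nchar(x))` totals characters rather than words, which is why the last column above matches the Chars column of the earlier table. A minimal base R sketch of the difference follows, using hypothetical lines and a simple whitespace split. The whitespace split will not match stri_count_words() exactly, since stringi applies its own word-boundary rules.

```r
# Hypothetical lines standing in for blogs, news or twitter.
x <- c("Thanks for the follow!", "", "See you at 5pm")

chars <- nchar(x)                              # characters per line
words <- lengths(strsplit(trimws(x), "\\s+"))  # whitespace-separated tokens per line

data.frame(Num.Lines = length(x), Num.Chars = sum(chars), Num.Words = sum(words))
```

For these three lines the sketch reports 36 characters and 8 words.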
set.seed(123)
# Sampling
sampleBlogs <- blogs[sample(1:length(blogs), 0.001*length(blogs), replace=FALSE)]
sampleNews <- news[sample(1:length(news), 0.001*length(news), replace=FALSE)]
sampleTwitter <- twitter[sample(1:length(twitter), 0.001*length(twitter), replace=FALSE)]
# Cleaning
sampleBlogs <- iconv(sampleBlogs, "UTF-8", "ASCII", sub="")
sampleNews <- iconv(sampleNews, "UTF-8", "ASCII", sub="")
sampleTwitter <- iconv(sampleTwitter, "UTF-8", "ASCII", sub="")
data.sample <- c(sampleBlogs,sampleNews,sampleTwitter)
Now that we have sampled the data and combined all three data sets into one, we will build the corpus that is used later to create the term-document matrix. In this section, we also apply further cleaning: converting the text to lowercase and removing punctuation, numbers and extra whitespace.
build_corpus <- function (x = data.sample) {
sample_c <- VCorpus(VectorSource(x)) # Create corpus dataset
sample_c <- tm_map(sample_c, content_transformer(tolower)) # all lowercase
sample_c <- tm_map(sample_c, removePunctuation) # Eliminate punctuation
sample_c <- tm_map(sample_c, removeNumbers) # Eliminate numbers
sample_c <- tm_map(sample_c, stripWhitespace) # Strip whitespace
sample_c # Return the cleaned corpus explicitly
}
corpusData <- build_corpus(data.sample)
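To see what these cleaning steps do to a single line, here is a rough base R approximation. The `clean_text` helper and the sample sentence are hypothetical, and the regular expressions only approximate the tm transformations used above.

```r
# Approximate the tm cleaning steps on a plain character vector (assumed helper).
clean_text <- function(x) {
  x <- tolower(x)                   # lowercase
  x <- gsub("[[:punct:]]+", "", x)  # drop punctuation, including apostrophes
  x <- gsub("[[:digit:]]+", "", x)  # drop digits
  gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace to one space
}

clean_text("I can't WAIT,  2 more days!")
```

The result is "i cant wait more days". Because apostrophes are removed along with other punctuation, contractions show up as tokens such as "cant" and "dont" in the n-gram tables.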
getTermTable <- function(corpusData, ngrams = 1, lowfreq = 50) {
#create term-document matrix tokenized on n-grams
tokenizer <- function(x) { NGramTokenizer(x, Weka_control(min = ngrams, max = ngrams)) }
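# wordLengths = c(1, Inf) keeps one- and two-letter terms such as "i", "a" and "to", which tm drops by default
tdm <- TermDocumentMatrix(corpusData, control = list(tokenize = tokenizer, wordLengths = c(1, Inf)))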
#find the top term grams with a minimum of occurrence in the corpus
top_terms <- findFreqTerms(tdm,lowfreq)
top_terms_freq <- rowSums(as.matrix(tdm[top_terms,]))
top_terms_freq <- data.frame(word = names(top_terms_freq), frequency = top_terms_freq)
# sort by decreasing frequency and return the table
arrange(top_terms_freq, desc(frequency))
}
tt.Data <- vector("list", 3) # pre-allocate one slot each for unigrams, bigrams and trigrams
for (i in 1:3) {
tt.Data[[i]] <- getTermTable(corpusData, ngrams = i, lowfreq = 10)
}
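To make the shape of these term tables concrete, the following base R sketch counts n-grams on a few hypothetical lines and applies the same kind of minimum-frequency filter. The `term_table` function is an illustrative stand-in, not the RWeka and tm pipeline used above.

```r
# Count n-grams of size n and keep those seen at least lowfreq times (assumed helper).
term_table <- function(lines, n = 1, lowfreq = 1) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
  }))
  freq <- table(grams)
  freq <- freq[freq >= lowfreq]
  out <- data.frame(word = names(freq), frequency = as.integer(freq),
                    stringsAsFactors = FALSE)
  out[order(-out$frequency), ]  # most frequent n-grams first
}

# Hypothetical cleaned lines.
toy <- c("i love you", "i love coffee", "you love coffee")
term_table(toy, n = 2, lowfreq = 2)
```

On the toy lines only the bigrams "i love" and "love coffee" reach the threshold of two occurrences.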
Let’s plot word clouds to see the most frequent unigrams, bigrams and trigrams.
# Set random seed for reproducibility
set.seed(123)
# Set Plotting in 1 row 3 columns
par(mfrow=c(1, 3))
for (i in 1:3) {
wordcloud(tt.Data[[i]]$word, tt.Data[[i]]$frequency, scale = c(3,1), max.words=100, random.order=FALSE, rot.per=0, fixed.asp = TRUE, use.r.layout = FALSE, colors=brewer.pal(8, "Dark2"))
}
## Warning in wordcloud(tt.Data[[i]]$word, tt.Data[[i]]$frequency, scale = c(3, :
## would be could not be fit on page. It will not be plotted.
## (the same warning was issued for 53 more bigrams and trigrams, from "have been" to "the fact that")
In this section, I plot the most frequent unigrams, bigrams and trigrams as bar charts to give a sense of how the n-gram frequencies are distributed.
plot.Grams <- function (x = tt.Data, N=10) {
g1 <- ggplot(data = head(x[[1]],N), aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "green") +
ggtitle(paste("Unigrams")) +
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2 <- ggplot(data = head(x[[2]],N), aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "blue") +
ggtitle(paste("Bigrams")) +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
g3 <- ggplot(data = head(x[[3]],N), aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "darkgreen") +
ggtitle(paste("Trigrams")) +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Put three plots into 1 row 3 columns
gridExtra::grid.arrange(g1, g2, g3, ncol = 3)
}
plot.Grams(x = tt.Data, N = 20)
The next step is to plan the prediction algorithm and the Shiny application.
To train the prediction model:
1. All three files are very large. Even with a 0.1% sample, the exploratory analysis and n-gram tables took quite a bit of time to run, so I need to make better use of resources and improve performance.
2. Looking at the unigram frequencies, there is a lot of overlap between the most frequent words in the three files. As a next step, I need to perform more data cleaning to remove stop words such as “the” and frequent phrases such as “of the”.
3. Review how to remove misspelled words so that the model does not predict them.
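The extra cleaning step for very common words could look roughly like the sketch below. The stop list and the `remove_stop_words` helper are hypothetical and deliberately tiny; tm ships a fuller list via stopwords("english").

```r
# Hypothetical, deliberately small stop list for illustration only.
stop_words <- c("the", "of", "and", "a", "to")

# Drop stop words from each line and rebuild the line (assumed helper).
remove_stop_words <- function(x, stop_words) {
  vapply(strsplit(x, "\\s+"), function(w) {
    paste(w[!w %in% stop_words], collapse = " ")
  }, character(1))
}

remove_stop_words(c("the rest of the day", "thanks for the follow"), stop_words)
```

This turns the two lines into "rest day" and "thanks for follow". Whether stop words should be removed at all is a design choice: a next-word predictor usually needs to suggest words like "the", so the filter may be better suited to the exploratory plots than to the final model.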
I have also looked into stemming words with Snowball stemmers and plan to apply it.
I have looked into Markov chain approaches to prediction and may use one in the next steps.
Finally, the application will be built in Shiny.
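As a small illustration of the Markov chain idea, the sketch below estimates first-order transition probabilities P(next word | current word) from a few hypothetical lines. The `next_word` helper and the sample text are assumptions for the example, not part of the planned application.

```r
# Hypothetical cleaned lines; the report would use the sampled corpus instead.
lines <- c("i love you", "i love coffee", "you love coffee", "i drink coffee")

# Collect (current word, next word) pairs from every line.
pairs <- do.call(rbind, lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(from = head(w, -1), to = tail(w, -1), stringsAsFactors = FALSE)
}))

# Row-normalise the counts so each row holds P(next word | current word).
transitions <- prop.table(table(pairs$from, pairs$to), margin = 1)

# Most probable next word for a given current word (assumed helper).
next_word <- function(word, transitions) {
  if (!word %in% rownames(transitions)) return(NA_character_)
  colnames(transitions)[which.max(transitions[word, ])]
}

next_word("i", transitions)
next_word("love", transitions)
```

Here "i" is followed by "love" in two of three cases and "love" by "coffee" in two of three cases, so those are the predictions. A word that never appears as a current word returns NA, which is where a back-off or smoothing step would be needed.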