We will perform two tasks in this report.
TASK #1 - EXPLORATORY DATA ANALYSIS ON TEXT DATA
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data and to prepare for building our first linguistic models.
Tasks to accomplish:
TASK #2 - MODELING
The goal here is to build our first simple model of the relationship between words. This is the first step in building a predictive text mining application. We will start with simple models and then explore more complicated modeling techniques.
Tasks to accomplish:
The raw corpus data is downloaded and stored locally at:
Blog: ./final/en_US/en_US.blogs.txt
News: ./final/en_US/en_US.news.txt
Twitter: ./final/en_US/en_US.twitter.txt
I have saved the datasets in a directory named “Capstone” on my Desktop, which serves as the working directory for this analysis.
Let’s also load all the libraries we need for the tasks described above.
suppressMessages(library(NLP))
suppressMessages(library(tm))
suppressMessages(library(RColorBrewer))
suppressMessages(library(wordcloud))
suppressMessages(library(dplyr))
suppressMessages(library(stringi))
suppressMessages(library(RWeka))
suppressMessages(library(ggplot2))
suppressMessages(library(ngram))
suppressMessages(library(quanteda))
suppressMessages(library(gridExtra))
# File path
file1 <- "./final/en_US/en_US.blogs.txt"
file2 <- "./final/en_US/en_US.news.txt"
file3 <- "./final/en_US/en_US.twitter.txt"
# Read blogs
connect <- file(file1, open="rb")
blogs <- readLines(connect, encoding="UTF-8"); close(connect)
# Read news
connect <- file(file2, open="rb")
news <- readLines(connect, encoding="UTF-8"); close(connect)
# Read twitter
connect <- file(file3, open="rb")
twitter <- readLines(connect, encoding="UTF-8"); close(connect)
rm(connect)
summaryData <- sapply(list(blogs,news,twitter),function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(summaryData) <- c('Min','Mean','Max')
stats <- data.frame(
  FileName = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  t(rbind(sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'), ],
          Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words', ],
          summaryData)))
head(stats)
## FileName Lines Chars Words Min Mean Max
## 1 en_US.blogs 899288 206824382 37570839 0 41.75 6726
## 2 en_US.news 1010242 203223154 34494539 1 34.41 1796
## 3 en_US.twitter 2360148 162096031 30451128 1 12.75 47
# Get file sizes
blogs.size <- file.info(file1)$size / 1024 ^ 2
news.size <- file.info(file2)$size / 1024 ^ 2
twitter.size <- file.info(file3)$size / 1024 ^ 2
# Summary of dataset
df <- data.frame(Doc = c("blogs", "news", "twitter"),
                 Size.MB = c(blogs.size, news.size, twitter.size),
                 Num.Lines = c(length(blogs), length(news), length(twitter)),
                 Num.Chars = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter))))
df
## Doc Size.MB Num.Lines Num.Chars
## 1 blogs 200.4242 899288 206824505
## 2 news 196.2775 1010242 203223159
## 3 twitter 159.3641 2360148 162096031
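Note that sum(nchar(...)) counts characters rather than words, which is why the last column is labeled Num.Chars. If true word counts are wanted, a quick alternative (an extra check, not part of the summary above) is to reuse stri_count_words() from stringi:
# Word counts per dataset using stringi (already loaded above)
num.words <- sapply(list(blogs, news, twitter), function(x) sum(stri_count_words(x)))
names(num.words) <- c("blogs", "news", "twitter")
num.words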
set.seed(123)
# Sample 0.1% of each dataset without replacement
sampleBlogs <- blogs[sample(1:length(blogs), 0.001*length(blogs), replace=FALSE)]
sampleNews <- news[sample(1:length(news), 0.001*length(news), replace=FALSE)]
sampleTwitter <- twitter[sample(1:length(twitter), 0.001*length(twitter), replace=FALSE)]
# Cleaning: drop characters that cannot be represented in ASCII
sampleBlogs <- iconv(sampleBlogs, "UTF-8", "ASCII", sub="")
sampleNews <- iconv(sampleNews, "UTF-8", "ASCII", sub="")
sampleTwitter <- iconv(sampleTwitter, "UTF-8", "ASCII", sub="")
data.sample <- c(sampleBlogs,sampleNews,sampleTwitter)
Now that we have sampled the data and combined all three datasets into one, we can build the corpus that will later be used to construct the term-document matrix. In this section we also apply some additional cleaning: converting the text to lowercase and removing punctuation, numbers, and extra whitespace.
build_corpus <- function (x = data.sample) {
sample_c <- VCorpus(VectorSource(x)) # Create corpus dataset
sample_c <- tm_map(sample_c, content_transformer(tolower)) # all lowercase
sample_c <- tm_map(sample_c, removePunctuation) # Eliminate punctuation
sample_c <- tm_map(sample_c, removeNumbers) # Eliminate numbers
sample_c <- tm_map(sample_c, stripWhitespace) # Strip whitespace
sample_c # Return the cleaned corpus
}
corpusData <- build_corpus(data.sample)
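To confirm the transformations worked as intended, we can peek at one of the cleaned documents (a quick sanity check, not required for the analysis):
# Inspect a single cleaned document to verify lowercasing and punctuation/number removal
writeLines(content(corpusData[[1]]))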
getTermTable <- function(corpusData, ngrams = 1, lowfreq = 50) {
#create term-document matrix tokenized on n-grams
tokenizer <- function(x) { NGramTokenizer(x, Weka_control(min = ngrams, max = ngrams)) }
tdm <- TermDocumentMatrix(corpusData, control = list(tokenize = tokenizer))
# Find the terms that occur at least 'lowfreq' times in the corpus
top_terms <- findFreqTerms(tdm, lowfreq)
top_terms_freq <- rowSums(as.matrix(tdm[top_terms,]))
top_terms_freq <- data.frame(word = names(top_terms_freq), frequency = top_terms_freq)
top_terms_freq <- arrange(top_terms_freq, desc(frequency))
top_terms_freq # Return the sorted frequency table
}
tt.Data <- vector("list", 3)
for (i in 1:3) {
tt.Data[[i]] <- getTermTable(corpusData, ngrams = i, lowfreq = 10)
}
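With the three tables built, the most frequent n-grams can be checked directly before plotting; for example, the top five bigrams:
# Top five most frequent bigrams in the sampled corpus
head(tt.Data[[2]], 5)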
Let’s plot word clouds to visualize the word frequencies.
# Set random seed for reproducibility
set.seed(123)
# Set Plotting in 1 row 3 columns
par(mfrow=c(1, 3))
for (i in 1:3) {
wordcloud(tt.Data[[i]]$word, tt.Data[[i]]$frequency, scale = c(3, 1),
          max.words = 100, random.order = FALSE, rot.per = 0,
          fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
}
In this section, we build unigram, bigram, and trigram frequency tables for the data and plot the most frequent n-grams to give a sense of how the words are distributed.
plot.Grams <- function (x = tt.Data, N=10) {
g1 <- ggplot(data = head(x[[1]],N), aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "green") +
ggtitle(paste("Unigrams")) +
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2 <- ggplot(data = head(x[[2]],N), aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "blue") +
ggtitle(paste("Bigrams")) +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
g3 <- ggplot(data = head(x[[3]],N), aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "darkgreen") +
ggtitle(paste("Trigrams")) +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Put three plots into 1 row 3 columns
gridExtra::grid.arrange(g1, g2, g3, ncol = 3)
}
plot.Grams(x = tt.Data, N = 20)
The next steps are to create the prediction algorithm and to build a Shiny application around it. To train the prediction model, the n-gram frequency tables built above can be used to predict the next word from the preceding words; a minimal sketch follows.
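As a rough sketch of the planned approach (a toy illustration under simple assumptions, not the final implementation), the frequency tables built above can already drive a basic back-off predictor: look for trigrams that start with the last two words typed, back off to bigrams that start with the last word, and finally fall back to the most frequent unigrams. The helper predict_next() below is a hypothetical function written only to illustrate the idea.
# Hypothetical next-word predictor built on the n-gram tables above.
# It tries trigrams first, backs off to bigrams, then to the top unigrams.
predict_next <- function(input, tt = tt.Data, n = 3) {
  # Basic cleaning to mirror the corpus preprocessing (lowercase, letters only)
  tokens <- tolower(unlist(strsplit(input, "\\s+")))
  tokens <- gsub("[^a-z]", "", tokens)
  tokens <- tokens[tokens != ""]
  # 1) Trigrams whose first two words match the last two words of the input
  if (length(tokens) >= 2) {
    prefix <- paste(tail(tokens, 2), collapse = " ")
    hits <- tt[[3]][grepl(paste0("^", prefix, " "), tt[[3]]$word), ]
    if (nrow(hits) > 0)
      return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))
  }
  # 2) Back off to bigrams whose first word matches the last word of the input
  if (length(tokens) >= 1) {
    prefix <- tail(tokens, 1)
    hits <- tt[[2]][grepl(paste0("^", prefix, " "), tt[[2]]$word), ]
    if (nrow(hits) > 0)
      return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))
  }
  # 3) Final fallback: the most frequent unigrams overall
  head(as.character(tt[[1]]$word), n)
}
# Example call; results depend on the sampled corpus
predict_next("thanks for the")
For the final model, this kind of lookup would be trained on a much larger sample, stored as compact lookup tables, and wrapped in a Shiny text input, with smoothing or a weighted back-off to better handle unseen word combinations.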