Introduction

The main purpose of this assignment is to conduct an exploratory data analysis of the provided text files.

There are two tasks to accomplish:
- Perform a thorough exploratory analysis of the data to understand the distribution of words and the relationships between words in the corpora;
- Illustrate the variation in the frequencies of words and word pairs in the data with figures and tables.

Reading the data into R

Let’s first load the needed R packages to process the data.

# The following lines of code load the required R packages
library(ggplot2)
library(knitr)
library(tm)
library(NLP)
library(RWeka)

The data to analyze was downloaded from the Coursera website and can be accessed here. The English data set consists of three .txt files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

# The following lines of code load the data into R and assign each .txt file
# to a corresponding variable.
blogs_data   <- readLines("./CapstoneData/en_US/en_US.blogs.txt")
news_data    <- readLines("./CapstoneData/en_US/en_US.news.txt")
twitter_data <- readLines("./CapstoneData/en_US/en_US.twitter.txt")
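
Note: on some systems readLines() warns about embedded nul characters or stops early when reading en_US.news.txt. If that happens, one workaround is to read the file through a binary connection and skip the nuls, for example:

# Workaround (only if needed): read en_US.news.txt through a binary
# connection and skip embedded nul characters
news_con  <- file("./CapstoneData/en_US/en_US.news.txt", open = "rb")
news_data <- readLines(news_con, encoding = "UTF-8", skipNul = TRUE)
close(news_con)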

Exploratory Data Analysis

Let’s first find out how many lines of text are in each file.

# Calculating the number of lines in each file
blogs_lines <- length(blogs_data)
news_lines <- length(news_data)
twitter_lines <- length(twitter_data)
# Showing the result
blogs_lines
## [1] 899288
news_lines
## [1] 77259
twitter_lines
## [1] 2360148
# Plotting the result
barplot(c(blogs_lines, news_lines, twitter_lines),
        names.arg = c("blogs", "news", "twitter"),
        xlab = "data files", ylab = "number of text lines",
        main = "Number of text lines in files",
        col = "lavender", ylim = c(0, 2500000))

Now let's calculate the number of words in each file. Before doing that, however, we need to eliminate all extra spaces and punctuation and split the text into words.
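
As a quick illustration of what the substitution used below does, here is its effect on a single made-up string (the result shown in the comment is approximate):

# Runs of punctuation and blanks collapse into a single space
gsub("[[:punct:][:blank:]]+", " ", "Hello,   world -- it's 2020!")
# roughly: "Hello world it s 2020 "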

# The following three lines of code eliminate extra spaces and punctuation and split the text into words.
blogs_split   <- strsplit(gsub("[[:punct:][:blank:]]+", " ", blogs_data), " ")
news_split    <- strsplit(gsub("[[:punct:][:blank:]]+", " ", news_data), " ")
twitter_split <- strsplit(gsub("[[:punct:][:blank:]]+", " ", twitter_data), " ")

# Calculating the number of words in each file
blogs_words   <- sapply(blogs_split, length) 
news_words    <- sapply(news_split, length) 
twitter_words <- sapply(twitter_split, length) 

# Plotting the results of the calculation
par(mfrow = c(1, 2))
barplot(c(sum(blogs_words), sum(news_words), sum(twitter_words)),
        names.arg = c("blogs", "news", "twitter"),
        xlab = "data files", ylab = "number of words",
        main = "Total number of words in each file",
        cex.main = 0.9, ylim = c(0, 40000000), col = "lavender")
barplot(c(mean(blogs_words), mean(news_words), mean(twitter_words)),
        names.arg = c("blogs", "news", "twitter"),
        xlab = "data files", ylab = "number of words",
        main = "Mean number of words per line in each file",
        cex.main = 0.9, ylim = c(0, 50), col = "mistyrose")

Now, let’s make a summary of our data files.

dataTable <- data.frame("Total number of lines" = c(blogs_lines, news_lines, twitter_lines),
                        "Total number of words" = c(sum(blogs_words), sum(news_words), sum(twitter_words)),
                        "Mean number of words per line" = c(mean(blogs_words), mean(news_words), mean(twitter_words)))
rownames(dataTable) <- c("Blogs", "News", "Twitter")
# The kable() function is used here to generate the summary table
kable(dataTable, format = "pandoc",
      col.names = c("Total number of lines", "Total number of words", "Mean number of words per line"),
      caption = "Summary of the data files")
Summary of the data files

          Total number of lines   Total number of words   Mean number of words per line
-------   ---------------------   ---------------------   -----------------------------
Blogs                    899288                38126070                        42.39584
News                      77259                 2752773                        35.63045
Twitter                 2360148                31193245                        13.21665

Analysis of the data files in terms of word prediction

Sampling the data and creating a corpus

Because the data volume is huge, this analysis will be performed on a sample of only 1% of each data file for the sake of code execution speed.

# Sampling the data
set.seed(09222020)
blogs_sample   <- sample(blogs_data, length(blogs_data)*0.01)
news_sample    <- sample(news_data, length(news_data)*0.01)
twitter_sample <- sample(twitter_data, length(twitter_data)*0.01)
data_sample    <- c(blogs_sample, news_sample, twitter_sample)

Now we will create a corpus and preprocess it: convert the text to lower case and remove extra white space, punctuation, numbers, and English stop words.

# The following lines of code create a corpus and clean the data
corpus <- VCorpus(VectorSource(data_sample))
# content_transformer() keeps each document a PlainTextDocument,
# so no separate conversion step is needed afterwards
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
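
As a quick sanity check that the cleaning behaved as expected, the content of a couple of documents in the cleaned corpus can be inspected:

# Look at the first two documents of the cleaned corpus
# (the exact text will vary with the random sample)
lapply(corpus[1:2], as.character)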

Creating n-gram terms

According to Wikipedia, in the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. An n-gram of size 1 is called a “unigram”, an n-gram of size 2 a “bigram” or “digram”, and an n-gram of size 3 a “trigram”.

The following n-grams will be created with the use of the RWeka R package.

For creating the n-gram terms I used the code written by Thet Paing Soe in his Milestone Report published on August 07, 2018.
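
As a quick illustration of what the RWeka tokenizer produces, applying it to a short made-up phrase with min = max = 2 returns that phrase's bigrams:

# Illustration only: the bigrams of a short phrase
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# expected: "thanks for" "for the" "the follow"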

# The following lines of code create functions that tokenize the sample and build the term-document matrices of unigrams, digrams, and trigrams.
unigram <- function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
digram  <- function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
trigram <- function(x) NGramTokenizer(x,Weka_control(min=3,max=3))

uniTable <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
diTable  <- TermDocumentMatrix(corpus, control = list(tokenize = digram))
triTable <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
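
Before filtering by frequency, it is worth checking the size of these matrices: each one has one row per distinct n-gram and one column per document in the sample.

# Dimensions of the term-document matrices (terms x documents);
# the exact numbers depend on the random sample
dim(uniTable)
dim(diTable)
dim(triTable)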

# The following lines of code select the frequent terms in each matrix and calculate their total frequencies
uniCorpus <- findFreqTerms(uniTable, lowfreq = 1000)
diCorpus  <- findFreqTerms(diTable,  lowfreq = 80)
triCorpus <- findFreqTerms(triTable, lowfreq = 10)

uniCorpus_num   <- rowSums(as.matrix(uniTable[uniCorpus, ]))
uniCorpus_table <- data.frame(Word = names(uniCorpus_num),frequency = uniCorpus_num)
uniCorpus_sort  <- uniCorpus_table[order(-uniCorpus_table$frequency), ]
head(uniCorpus_sort)
##      Word frequency
## just just      2599
## like like      2302
## will will      2245
## one   one      2165
## can   can      2025
## get   get      1832
diCorpus_num   <- rowSums(as.matrix(diTable[diCorpus, ]))
diCorpus_table <- data.frame(Word = names(diCorpus_num),frequency = diCorpus_num)
diCorpus_sort  <- diCorpus_table[order(-diCorpus_table$frequency), ]
head(diCorpus_sort)
##                            Word frequency
## right now             right now       222
## cant wait             cant wait       162
## dont know             dont know       155
## last night           last night       144
## looking forward looking forward       125
## feel like             feel like       107
triCorpus_num   <- rowSums(as.matrix(triTable[triCorpus, ]))
triCorpus_table <- data.frame(Word = names(triCorpus_num), frequency = triCorpus_num)
triCorpus_sort  <- triCorpus_table[order(-triCorpus_table$frequency), ]
head(triCorpus_sort)
##                                Word frequency
## happy mothers day happy mothers day        33
## cant wait see         cant wait see        29
## let us know             let us know        27
## damn damn damn       damn damn damn        24
## happy new year       happy new year        22
## love love love       love love love        17

Plotting the results

Now let's plot the ten most frequent terms of each n-gram type to show how their frequencies are distributed.

# plotting the unigrams distribution
ggplot(uniCorpus_sort[1:10, ], aes(x = reorder(Word,-frequency), y = frequency, fill = frequency)) + 
        geom_bar(stat = "identity") + 
        labs(title = "Unigrams",x = "words",y = "frequency") + 
        theme(axis.text.x = element_text(angle = 45))

# plotting the digrams distribution
ggplot(diCorpus_sort[1:10, ], aes(x = reorder(Word,-frequency), y = frequency, fill = frequency)) + 
        geom_bar(stat = "identity") + 
        labs(title = "Digrams",x = "words",y = "frequency") + 
        theme(axis.text.x = element_text(angle = 45))

# plotting the trigrams distribution
ggplot(triCorpus_sort[1:10, ], aes(x = reorder(Word,-frequency), y = frequency, fill = frequency)) + 
        geom_bar(stat = "identity") + 
        labs(title = "Trigrams",x = "words",y = "frequency") + 
        theme(axis.text.x = element_text(angle = 45))

Conclusion

We have thus conducted an exploratory data analysis of the three data files. The next step will be to create a predictive algorithm and deploy it as a Shiny application. The predictive algorithm will use an n-gram model similar to the one built above. A possible strategy is to use the trigram model to predict the next word; if no matching trigram can be found, the algorithm backs off to the digram model, and then to the unigram model.
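
As an illustration only (not the final model), here is a minimal sketch of this back-off idea, assuming the frequency tables built above (triCorpus_sort, diCorpus_sort, and uniCorpus_sort, each sorted by decreasing frequency):

# A minimal back-off sketch (illustration only, not the final model)
predict_next_word <- function(phrase) {
  # Clean the input roughly the way the corpus was cleaned:
  # lower case, drop punctuation, split on whitespace
  cleaned <- gsub("[[:punct:]]", "", tolower(phrase))
  words   <- strsplit(cleaned, "[[:blank:]]+")[[1]]
  words   <- tail(words[words != ""], 2)
  # 1. Look for trigrams starting with the last two words typed
  if (length(words) == 2) {
    hits <- grep(paste0("^", words[1], " ", words[2], " "),
                 triCorpus_sort$Word, value = TRUE)
    if (length(hits) > 0) return(tail(strsplit(hits[1], " ")[[1]], 1))
  }
  # 2. Back off to digrams starting with the last word typed
  if (length(words) >= 1) {
    hits <- grep(paste0("^", tail(words, 1), " "),
                 diCorpus_sort$Word, value = TRUE)
    if (length(hits) > 0) return(tail(strsplit(hits[1], " ")[[1]], 1))
  }
  # 3. Fall back to the most frequent unigram
  as.character(uniCorpus_sort$Word[1])
}

For example, with the sample tables shown above, predict_next_word("I can't wait") would most likely return "see", since "cant wait see" is the most frequent trigram beginning with "cant wait" among those displayed.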