The main purpose of this assignment is to conduct an exploratory data analysis of the text files.
There are two tasks to accomplish:
- Perform a thorough exploratory analysis of the data to understand the distribution of words and the relationships between words in the corpora;
- Understand variation in the frequencies of words and word pairs in the data by creating figures and tables.
Let’s first load the needed R packages to process the data.
# The following lines of code load the required R packages
library(ggplot2)
library(knitr)
library(tm)
library(NLP)
library(RWeka)
The data to analyze was downloaded from the Coursera website and can be accessed here. The data set consists of three .txt files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
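For reference, here is a minimal sketch of how the archive could be downloaded and unpacked in R. The URL and the unpacked folder layout are assumptions rather than part of this report, so the paths may need adjusting to match the en_US folder used below.
# A minimal sketch of the download step (the archive URL and the unpacked folder
# layout are assumptions, not taken from the original report; adjust the paths so
# that the en_US folder matches the one referenced below)
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("./CapstoneData")) {
  download.file(data_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip", exdir = "./CapstoneData")
}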
# The following lines of code load the data into R and assign each .txt file
# to a corresponding variable.
blogs_data <- readLines("./CapstoneData/en_US/en_US.blogs.txt")
news_data <- readLines("./CapstoneData/en_US/en_US.news.txt")
twitter_data <- readLines("./CapstoneData/en_US/en_US.twitter.txt")
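# Note: depending on the platform and R version, readLines() may warn about
# embedded nulls or an incomplete final line for en_US.news.txt, which can cause
# that file to be read only partially; passing skipNul = TRUE (not used here) is a
# common workaround.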
Let’s first find out how many lines of text are in each file.
# Calculating the number of lines in each file
blogs_lines <- length(blogs_data)
news_lines <- length(news_data)
twitter_lines <- length(twitter_data)
# Showing the result
blogs_lines
## [1] 899288
news_lines
## [1] 77259
twitter_lines
## [1] 2360148
# Plotting the result
barplot(c(blogs_lines, news_lines, twitter_lines), names.arg = c("blogs", "news", "twitter"), xlab="data files", ylab="number of text lines", main="Number of text lines in files", col="lavender", ylim = c(0, 2500000))
Now let’s calculate the number of words in each file. Before doing that, however, we need to remove punctuation, collapse extra spaces, and split the text into words.
# The following three lines of code eliminate extra spaces and punctuation and split the text into words.
blogs_split <- strsplit(gsub("[[:punct:][:blank:]]+", " ", blogs_data), " ")
news_split <- strsplit(gsub("[[:punct:][:blank:]]+", " ", news_data), " ")
twitter_split <- strsplit(gsub("[[:punct:][:blank:]]+", " ", twitter_data), " ")
# Calculating the number of words in each file
blogs_words <- sapply(blogs_split, length)
news_words <- sapply(news_split, length)
twitter_words <- sapply(twitter_split, length)
# Plotting the result of calculation
par(mfrow=c(1,2))
barplot(c(sum(blogs_words), sum(news_words), sum(twitter_words)), names.arg = c("blogs", "news", "twitter"), xlab="data files", ylab="number of words", main="Total number of words in each file", cex.main=0.9, ylim = c(0, 40000000), col = "lavender")
barplot(c(mean(blogs_words), mean(news_words), mean(twitter_words)), names.arg = c("blogs", "news", "twitter"), xlab="data files", ylab="number of words", main="Mean number of words per line in each file", cex.main=0.9, ylim = c(0, 50), col = "mistyrose")
Now, let’s make a summary of our data files.
dataTable <- data.frame("Total number of lines"=c(blogs_lines, news_lines, twitter_lines),
"Total number of words"=c(sum(blogs_words), sum(news_words), sum(twitter_words)),
"Mean number of words per line"=c(mean(blogs_words), mean(news_words), mean(twitter_words)))
rownames(dataTable)=c("Blogs", "News", "Twitter")
# the kable() function here is used to generate table
kable(dataTable, format="pandoc", col.names=c("Total number of lines", "Total number of words", "Mean number of words per line"), caption = "Summary of the data files")
|         | Total number of lines | Total number of words | Mean number of words per line |
|---------|----------------------:|----------------------:|------------------------------:|
| Blogs   | 899288                | 38126070              | 42.39584                      |
| News    | 77259                 | 2752773               | 35.63045                      |
| Twitter | 2360148               | 31193245              | 13.21665                      |
Because the data volume is huge, the rest of the analysis will be performed on a sample of only 1% of each file to keep the code execution time reasonable.
# Sampling the data
set.seed(09222020)
blogs_sample <- sample(blogs_data, length(blogs_data)*0.01)
news_sample <- sample(news_data, length(news_data)*0.01)
twitter_sample <- sample(twitter_data, length(twitter_data)*0.01)
data_sample <- c(blogs_sample, news_sample, twitter_sample)
Now we will create a corpus and preprocess it to remove extra white space, punctuation, numbers, English stop words, etc.
# The following lines of code create a corpus and clean the data.
# content_transformer() keeps each document a PlainTextDocument, so no
# re-wrapping with PlainTextDocument() is needed afterwards.
corpus <- VCorpus(VectorSource(data_sample))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
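As a quick illustrative check (not part of the original preprocessing), we can peek at the beginning of tm’s built-in English stop-word list to see what the last step removes:
# Illustrative check: the first few entries of tm's English stop-word list
head(stopwords("english"))
## e.g. "i" "me" "my" "myself" "we" "our"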
According to Wikipedia, in the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. An n-gram of size 1 is called a “unigram”, an n-gram of size 2 a “bigram” or “digram”, and an n-gram of size 3 a “trigram”.
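As a tiny illustration on a made-up sentence (not drawn from the data), RWeka’s NGramTokenizer shows what the bigrams of a short text look like:
# Illustrative example on a made-up sentence: extracting its bigrams with RWeka
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
## e.g. "thanks for" "for the" "the follow"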
The following n-grams will be created with the RWeka R package.
For creating the n-gram terms, I used the code written by Thet Paing Soe in his Milestone Report published on August 7, 2018.
# The following lines of code create functions that tokenize the sample and build the matrices of unigrams, digrams, and trigrams.
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
digram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
uniTable <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
diTable <- TermDocumentMatrix(corpus, control = list(tokenize = digram))
triTable <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
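# Note (an assumption about the runtime environment, not from the original report):
# on some systems tm's parallel evaluation interacts badly with RWeka's Java-based
# tokenizers; setting options(mc.cores = 1) before building these matrices is a
# commonly suggested workaround.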
# The following lines of code keep the frequent terms in each matrix and calculate their frequencies
uniCorpus <- findFreqTerms(uniTable, lowfreq = 1000)
diCorpus <- findFreqTerms(diTable, lowfreq = 80)
triCorpus <- findFreqTerms(triTable, lowfreq = 10)
uniCorpus_num <- rowSums(as.matrix(uniTable[uniCorpus, ]))
uniCorpus_table <- data.frame(Word = names(uniCorpus_num),frequency = uniCorpus_num)
uniCorpus_sort <- uniCorpus_table[order(-uniCorpus_table$frequency), ]
head(uniCorpus_sort)
## Word frequency
## just just 2599
## like like 2302
## will will 2245
## one one 2165
## can can 2025
## get get 1832
diCorpus_num <- rowSums(as.matrix(diTable[diCorpus, ]))
diCorpus_table <- data.frame(Word = names(diCorpus_num),frequency = diCorpus_num)
diCorpus_sort <- diCorpus_table[order(-diCorpus_table$frequency), ]
head(diCorpus_sort)
## Word frequency
## right now right now 222
## cant wait cant wait 162
## dont know dont know 155
## last night last night 144
## looking forward looking forward 125
## feel like feel like 107
triCorpus_num <- rowSums(as.matrix(triTable[triCorpus, ]))
triCorpus_table <- data.frame(Word = names(triCorpus_num), frequency = triCorpus_num)
triCorpus_sort <- triCorpus_table[order(-triCorpus_table$frequency), ]
head(triCorpus_sort)
## Word frequency
## happy mothers day happy mothers day 33
## cant wait see cant wait see 29
## let us know let us know 27
## damn damn damn damn damn damn 24
## happy new year happy new year 22
## love love love love love love 17
Now let’s graphically demonstrate how the frequencies of the unigrams, digrams, and trigrams are distributed.
# plotting the unigrams distribution
ggplot(uniCorpus_sort[1:10, ], aes(x = reorder(Word,-frequency), y = frequency, fill = frequency)) +
geom_bar(stat = "identity") +
labs(title = "Unigrams",x = "words",y = "frequency") +
theme(axis.text.x = element_text(angle = 45))
# plotting the digrams distribution
ggplot(diCorpus_sort[1:10, ], aes(x = reorder(Word,-frequency), y = frequency, fill = frequency)) +
geom_bar(stat = "identity") +
labs(title = "Digrams",x = "words",y = "frequency") +
theme(axis.text.x = element_text(angle = 45))
# plotting the trigrams distribution
ggplot(triCorpus_sort[1:10, ], aes(x = reorder(Word,-frequency), y = frequency, fill = frequency)) +
geom_bar(stat = "identity") +
labs(title = "Trigrams",x = "words",y = "frequency") +
theme(axis.text.x = element_text(angle = 45))
We have now completed the exploratory data analysis of the three data files. The next step is to create the predictive algorithm and deploy it as a Shiny application. The predictive algorithm will use an n-gram model similar to the one created above. A possible strategy would be to use the trigram model to predict the next word; if no matching trigram can be found, the algorithm would back off to the digram model and then to the unigram model.
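As a preview of that strategy, below is a minimal sketch of such a back-off lookup built directly on the frequency tables created above. The helper name predict_next_word and its simple string handling are illustrative assumptions, not the final algorithm.
# A minimal back-off sketch (illustrative only): look for the last two words of the
# input in the trigram table, then the last word in the digram table, and finally
# fall back to the most frequent unigram.
predict_next_word <- function(phrase) {
  # basic cleaning that loosely mirrors the preprocessing above
  words <- strsplit(tolower(gsub("[[:punct:]]+", "", phrase)), "\\s+")[[1]]
  words <- words[words != ""]
  n <- length(words)
  # try the trigram table: match "w1 w2" at the start of a stored trigram
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- triCorpus_sort[startsWith(as.character(triCorpus_sort$Word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(sub(paste0("^", prefix, " "), "", as.character(hits$Word[1])))
  }
  # back off to the digram table: match the last word only
  if (n >= 1) {
    prefix <- words[n]
    hits <- diCorpus_sort[startsWith(as.character(diCorpus_sort$Word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(sub(paste0("^", prefix, " "), "", as.character(hits$Word[1])))
  }
  # final fallback: the single most frequent unigram
  as.character(uniCorpus_sort$Word[1])
}
# Example usage (the result depends on the sampled data):
# predict_next_word("I cant wait")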