This is the first assignment in the Data Science Capstone course offered by Johns Hopkins University on Coursera. As stated in the instructions, the main goal of this report is to demonstrate that the data has been loaded into the R environment and is ready for exploratory data analysis, which will eventually culminate in an N-gram prediction model.
There are three files containing English-language blog, news and Twitter data. The following libraries were used to produce this report and are essential for reproducible research.
library(dplyr)
library(ggplot2)
library(tm)
library(wordcloud)
library(stringr)
library(DT)
library(RWeka)
us_news <- readLines("final/en_US/en_US.news.txt")
us_blogs <- readLines("final/en_US/en_US.blogs.txt")
us_twitter <- readLines("final/en_US/en_US.twitter.txt")
names <- c('US-News', 'US-Blogs', 'US-Twitter')
lengths <- c(length(us_news), length(us_blogs), length(us_twitter))
nwords <- c(sum(sapply(strsplit(us_news, " "), length)),
            sum(sapply(strsplit(us_blogs, " "), length)),
            sum(sapply(strsplit(us_twitter, " "), length)))
df <- data.frame(Files = names, "No. of Lines" = lengths, "No. of Words" = nwords,
                 check.names = FALSE)  # keep the readable column names
datatable(df)
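In addition to the line and word counts, the on-disk size of each file gives another sense of scale. A quick check, assuming the same final/en_US/ file paths used above:
files <- c("final/en_US/en_US.news.txt",
           "final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.twitter.txt")
round(file.size(files) / 1024^2, 1)  # file sizes in megabytes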
From the above data table we can see that the files are large and contain a substantial number of words. To keep the computation manageable, we will take a sample of these documents and perform the exploratory data analysis on it. This saves time while still giving a good picture of how the data sets are structured.
set.seed(1234)  # fix the random seed so the sample is reproducible
sample_news <- us_news[sample(1:length(us_news), 9000)]
sample_blogs <- us_blogs[sample(1:length(us_blogs), 9000)]
sample_twitter <- us_twitter[sample(1:length(us_twitter), 9000)]
sample_final <- c(sample_news, sample_blogs, sample_twitter)
cat('No. of lines in the sample are:', length(sample_final),
'\nNo. of words in the sample are:', sum(sapply(strsplit(sample_final, " "),
length)))
## No. of lines in the sample are: 27000
## No. of words in the sample are: 798453
The data set is fairly unclean. Therefore, text cleaning steps such as removing punctuation, numbers, URLs and English stop words, converting the text to lower case, stemming, and stripping extra whitespace are performed using the tm package in R. This gives us a cleaner version of the data set that can be used for feature extraction.
To do this we first need to convert the text data into a corpus, i.e. a collection of documents in R, which is created with the Corpus function from the tm package.
sample_corpa <- Corpus(VectorSource(sample_final))
sample_corpa <- tm_map(sample_corpa, content_transformer(removePunctuation))  # remove punctuation
sample_corpa <- tm_map(sample_corpa, content_transformer(removeNumbers))      # remove numbers
sample_corpa <- tm_map(sample_corpa, content_transformer(tolower))            # convert to lower case
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)                      # drop URLs
sample_corpa <- tm_map(sample_corpa, content_transformer(removeURL))
sample_corpa <- tm_map(sample_corpa, removeWords, stopwords('en'))            # remove English stop words
sample_corpa <- tm_map(sample_corpa, stemDocument)                            # stem words
sample_corpa <- tm_map(sample_corpa, stripWhitespace)                         # collapse extra whitespace
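A quick sanity check on a couple of documents from the cleaned corpus confirms that the transformations were applied:
inspect(sample_corpa[1:2])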
The Term Document Matrix (TDM) is a sparse matrix in which each row corresponds to a term and each column to a document, with the entries giving how often each term appears in each document. It helps us summarize and understand the documents better, so we construct a TDM as follows.
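For intuition, here is what a TDM looks like for a tiny made-up corpus of three short sentences (invented purely for illustration):
toy <- Corpus(VectorSource(c("the cat sat", "the cat ran", "dogs ran fast")))
inspect(TermDocumentMatrix(toy))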
sample_tdm <- TermDocumentMatrix(sample_corpa)
sample_m <- as.matrix(sample_tdm)
Term_freq <- rowSums(sample_m)
Term_freq <- sort(Term_freq, decreasing = TRUE)
barplot(Term_freq[1:20], las = 2, col = 'brown')
rm(sample_m, sample_tdm)
From the above graph we can see which terms occur most frequently in the sample. We can also visualize this with a word cloud.
wc <- data.frame(Words = names(Term_freq), Freq = Term_freq,
                 row.names = 1:length(Term_freq))
wordcloud(words = wc$Words, freq = wc$Freq, max.words = 75,
          colors = c('darkorange', 'firebrick', 'gray22'))
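Using the sorted Term_freq vector from above, we can also ask how many unique terms are needed to cover a given share of all word occurrences in the sample; a minimal sketch:
coverage <- cumsum(Term_freq) / sum(Term_freq)
c(terms_for_50pct = which(coverage >= 0.5)[1],
  terms_for_90pct = which(coverage >= 0.9)[1])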
An N-gram is a contiguous sequence of N words from a document. Tabulating the N-grams in a corpus therefore shows, for example, which pairs of adjacent words (2-grams) occur together most frequently. This gives us more detail about how words interact with each other in our data set and will eventually help us extract useful features for the prediction model.
We shall construct 2-Gram and 3-Gram models, visualize them and understand how they behave.
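As a quick illustration of what the RWeka tokenizer returns, applied to a made-up sentence:
NGramTokenizer("this is just a made up sentence", Weka_control(min = 2, max = 2))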
Bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
Trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
bigrams <- data.frame(table(unlist(lapply(sample_corpa, Bigram)))) %>%
arrange(desc(Freq))
trigrams <- data.frame(table(unlist(lapply(sample_corpa, Trigram)))) %>%
arrange(desc(Freq))
datatable(head(bigrams, 10))
datatable(head(trigrams, 10))
bigrams$Var1 <- factor(bigrams$Var1, levels = bigrams$Var1[order(bigrams$Freq)])
trigrams$Var1 <- factor(trigrams$Var1, levels = trigrams$Var1[order(trigrams$Freq)])
ggplot(bigrams[1:10, ], aes(Var1, Freq, fill = Freq)) + geom_bar(stat = 'identity') +
ggtitle("Most Frequent Bigrams") + theme(axis.text.x = element_text(angle = 45, hjust = 0.8))
ggplot(trigrams[1:10, ], aes(Var1, Freq, fill = Freq)) + geom_bar(stat = 'identity') +
ggtitle("Most Frequent Trigrams") + theme(axis.text.x = element_text(angle = 45, hjust = 0.8))
The above graphs show the 10 most frequent bigrams and trigrams. These will help us choose features for the prediction model.
Increasing the order of the N-grams yields a drastic decrease in the observed counts: trigrams are far sparser than bigrams.
The corpus is very large. Sampling it gives a useful overview for exploratory analysis, but we need a workaround to process the full data set in R so that the predictions are accurate.
Stemming has to be handled with care, as it produces truncated stems such as "citi" and "presid".
Next steps will include deciding how to build the prediction model and how to handle the full data set in R so that we get accurate predictions; a very rough first sketch of the bigram-based idea is given below.
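As a hypothetical sketch of that direction (not the final model), the bigram table computed above can already be used to suggest likely next words; the helper name and its behaviour are illustrative assumptions only:
# Illustrative helper (assumption, not the final model): return the most
# frequent words that follow a given word in the bigram table above.
predict_next <- function(word, bigram_df, n = 3) {
  pattern <- paste0("^", word, " ")
  matches <- bigram_df[grepl(pattern, bigram_df$Var1), ]
  head(sub(pattern, "", as.character(matches$Var1)), n)
}
predict_next("new", bigrams)  # e.g. suggestions following the word "new"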