This report documents the initial exploratory analysis of the Coursera Capstone dataset, which is part of the Data Science Specialization.
The report covers loading, cleaning, sampling, and initial modeling of the data, laying the groundwork for a predictive model. N-gram models are built as a first step. The final aim is to deploy the model as a data product that predicts the next word when a user inputs three or four words.
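To make that end goal concrete, here is a minimal sketch of how a frequency-based n-gram lookup could propose a next word. The trigramFreq table and its columns (w1, w2, w3, freq) are hypothetical placeholders for illustration; the actual model is built later in the project.
# illustrative sketch only: look up the last two input words in a
# hypothetical trigram frequency table and return the most frequent third word
predictNext <- function(input, trigramFreq) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)
  cand <- trigramFreq[trigramFreq$w1 == words[1] & trigramFreq$w2 == words[2], ]
  if (nrow(cand) == 0) return(NA_character_)   # a real model would back off to bigrams
  cand$w3[which.max(cand$freq)]
}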
We first load the required libraries and the data we will work with.
# loading all the libraries
library(stringr); library(knitr); library(stringi);
library(SnowballC); library(quanteda); library(ggplot2)
# loading the data
blog <- readLines("en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Summarizing the loaded data for better understanding:
# calculating stats
files <- c('en_US/en_US.blogs.txt', 'en_US/en_US.news.txt', 'en_US/en_US.twitter.txt')
fileData <- list(blog, news, tweets)
sizeMB <- sapply(files, function(x) file.size(x) / 1024^2)
lines <- sapply(fileData, length)
words <- sapply(fileData, function(x) summary(stri_count_words(x))[c('Min.', 'Max.')])
# combining into a table
stats <- cbind(sizeMB, lines, t(words))
stats <- as.data.frame(stats, row.names = c('Blogs','News','Tweets'))
colnames(stats) <- c('File.Size(MB)', 'Lines', 'Min.WordsPL', 'Max.WordsPL')
kable(stats,digits = 0)
| | File.Size(MB) | Lines | Min.WordsPL | Max.WordsPL |
|---|---|---|---|---|
| Blogs | 200 | 899288 | 0 | 6726 |
| News | 196 | 77259 | 1 | 1123 |
| Tweets | 159 | 2360148 | 1 | 47 |
We can infer two things straightaway: the combined files are large (roughly 555 MB and over three million lines), and line lengths vary widely, from tweets capped at 47 words per line to blog posts running to several thousand words.
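As a quick check on why sampling is worthwhile, the in-memory footprint of the loaded vectors can be inspected (optional step; output not shown here):
# rough in-memory size of each character vector
sapply(list(blog = blog, news = news, tweets = tweets),
       function(x) format(object.size(x), units = "Mb"))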
To keep the analysis manageable, we will work with a 1% sample of each file.
set.seed(99)
# sampling the three text files
sampleBlog <- sample(blog, length(blog)*0.01)
sampleNews <- sample(news, length(news)*0.01)
sampleTweets <- sample(tweets, length(tweets)*0.01)
# collective sample data
sampleData <- c(sampleBlog, sampleNews, sampleTweets)
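Since only the sample is used from this point on, the full corpora (over 500 MB combined) can optionally be dropped to free memory:
# optional: remove the full datasets now that only the sample is needed
rm(blog, news, tweets)
gc()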
For cleaning, we will build a corpus and perform the usual steps: converting all words to lowercase, removing numbers, punctuation, extra white space, and profanity ("bad words"), and stemming the text.
Using the quanteda package, we tokenize the text and then build the n-gram models.
# storing a list of bad words which will later be excluded.
# The list was obtained from here: https://www.cs.cmu.edu/~biglou/resources/bad-words.txt
bw <- scan("./bad-words.txt", what = "character", sep = "\n", encoding = "UTF-8")
# tokenize and clean
tData <- tokens(corpus(sampleData),
remove_punct = TRUE,
remove_numbers = TRUE,
remove_separators = TRUE,
verbose = TRUE,
remove_symbols = TRUE
)
tNostop <- tokens_remove(tData, pattern = stopwords('en'))
tNobadwords <- tokens_remove(tNostop, pattern = bw)
tLower <- tokens_tolower(tNobadwords, keep_acronyms = FALSE)
tStem <- tokens_wordstem(tLower, language = quanteda_options("language_stemmer"))
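# For reference, the stemmer maps inflected forms onto a common stem;
# e.g. SnowballC::wordStem(c("running", "cats")) returns c("run", "cat").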
# build ngram models
uniTokens <- tokens_ngrams(tStem, n = 1, concatenator = " ")
biTokens <- tokens_ngrams(tStem, n = 2, concatenator = " ")
triTokens <- tokens_ngrams(tStem, n = 3, concatenator = " ")
Next, we build document-feature matrices for each set of n-gram tokens.
unigram <- dfm(uniTokens, verbose = FALSE)
bigram <- dfm(biTokens, verbose = FALSE)
trigram <- dfm(triTokens, verbose = FALSE)
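For the eventual prediction model, these matrices will be collapsed into plain n-gram frequency tables. A minimal sketch of that step (the helper dfmToFreq and the biFreq/triFreq objects are illustrative names, not part of the analysis above):
# collapse a dfm into a data frame of n-gram counts, most frequent first
dfmToFreq <- function(d) {
  counts <- topfeatures(d, n = nfeat(d))   # all features, sorted by frequency
  data.frame(ngram = names(counts), freq = as.numeric(counts), row.names = NULL)
}
biFreq <- dfmToFreq(bigram)
triFreq <- dfmToFreq(trigram)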
Let's look at the word frequencies of the three n-gram models, starting with the most frequent terms.
# plot the 20 most frequent features of an n-gram dfm
wordFreq <- function(ngram, gram) {
  topVector <- topfeatures(ngram, 20)
  topVector <- sort(topVector, decreasing = TRUE)
  topdf <- data.frame(words = names(topVector), freq = topVector)
  ngramPlot <- ggplot(data = topdf,
                      aes(x = factor(words, levels = words), y = freq, fill = words)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    theme_minimal() +
    labs(x = gram, y = "Frequency", title = paste(gram, "Frequencies")) +
    coord_flip() +
    guides(fill = FALSE)
  return(ngramPlot)
}
plot(wordFreq(unigram, "Unigram"))
plot(wordFreq(bigram, "Bigram"))
plot(wordFreq(trigram, "Trigram"))
This concludes the basic exploratory analysis. My next steps will be: