Introduction

This report documents the initial exploratory analysis of the Coursera Capstone dataset, which is part of the Data Science Specialization.

The report covers loading, cleaning, sampling, and initial modeling of the data as groundwork for a predictive model. N-gram models are built as a first step. The final aim is to deploy the model as a data product that predicts the next word when a user inputs three to four words.

Loading and Exploring

We first load the required libraries and the three English text files (blogs, news, and tweets).

# loading all the libraries
library(stringr); library(knitr); library(stringi)
library(SnowballC); library(quanteda); library(ggplot2)
# loading the data
blog   <- readLines("en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news   <- readLines("en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Summarizing the loaded data for better understanding:

# calculating basic statistics for each file
files <- c('en_US/en_US.blogs.txt', 'en_US/en_US.news.txt', 'en_US/en_US.twitter.txt')
fileData <- list(blog, news, tweets)
sizeMB <- sapply(files, function(x) file.size(x) / 1024^2)
lines  <- sapply(fileData, length)
words  <- sapply(fileData, function(x) summary(stri_count_words(x))[c('Min.', 'Max.')])

# combining into a table
stats <- cbind(sizeMB, lines, t(words))
stats <- as.data.frame(stats, row.names = c('Blogs', 'News', 'Tweets'))
colnames(stats) <- c('File.Size(MB)', 'Lines', 'Min.WordsPerLine', 'Max.WordsPerLine')
kable(stats, digits = 0)
         File.Size(MB)    Lines  Min.WordsPerLine  Max.WordsPerLine
Blogs              200   899288                 0              6726
News               196    77259                 1              1123
Tweets             159  2360148                 1                47

We can infer two things straightaway:

  1. As the table above confirms, tweets have far fewer words per line than blogs and news.
  2. The files are too large to rebuild and initialize a model every time we run it, so sampling is required.

Sampling

To keep runtimes manageable, we work with a 1% random sample of each of the three files.

set.seed(99)

# sampling the three text files
sampleBlog <- sample(blog, length(blog)*0.01)
sampleNews <- sample(news, length(news)*0.01)
sampleTweets <- sample(tweets, length(tweets)*0.01)

# collective sample data
sampleData <- c(sampleBlog, sampleNews, sampleTweets)
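
Because the raw files are large (point 2 above), it can also help to cache the combined sample on disk so that later runs can skip re-reading the full corpora. Below is a minimal sketch using base R serialization; the file name sampleData.rds is my own choice, not part of the original analysis.

# cache the 1% sample so later runs can reload it directly
saveRDS(sampleData, "sampleData.rds")

# on subsequent runs, restore the cached sample instead of the raw files:
# sampleData <- readRDS("sampleData.rds")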

Cleaning and Modeling

For cleaning, we build a corpus and apply the usual steps: converting all words to lowercase; removing numbers, punctuation, extra white space, and profanity ("bad words"); and stemming the text.

Using the quanteda package, we tokenize and create subsequent n-gram models.

# storing a list of bad words which will later be excluded. 
# The list was obtained from here:  https://www.cs.cmu.edu/~biglou/resources/bad-words.txt
bw <- scan("./bad-words.txt", what = "character", sep = "\n", encoding = "UTF-8")

# tokenize and clean
tData <- tokens(corpus(sampleData),
                remove_punct = TRUE,
                remove_numbers = TRUE,
                remove_symbols = TRUE,
                remove_separators = TRUE,
                verbose = TRUE)
tNostop <- tokens_remove(tData, pattern = stopwords('en'))
tNobadwords <- tokens_remove(tNostop, pattern = bw)
tLower <- tokens_tolower(tNobadwords, keep_acronyms = FALSE)
tStem <- tokens_wordstem(tLower, language = quanteda_options("language_stemmer"))

# build ngram models
uniTokens <- tokens_ngrams(tStem, n = 1, concatenator = " ")
biTokens <- tokens_ngrams(tStem, n = 2, concatenator = " ")
triTokens <- tokens_ngrams(tStem, n = 3, concatenator = " ")

Building Document-feature matrices

unigram <- dfm(uniTokens, verbose = FALSE)
bigram <- dfm(biTokens, verbose = FALSE)
trigram <- dfm(triTokens, verbose = FALSE)

Concluding Analysis

Let's look at the word frequencies of the three n-gram models, starting with the most frequent terms.

# plot the 20 most frequent features of an n-gram document-feature matrix
wordFreq <- function(ngram, gram) {
  topVector <- topfeatures(ngram, 20)   # top 20 features, already sorted by frequency
  topdf <- data.frame(words = names(topVector), freq = topVector)
  ggplot(topdf, aes(x = factor(words, levels = words), y = freq, fill = words)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    theme_minimal() +
    labs(x = gram, y = "Frequency", title = paste(gram, "Frequencies")) +
    coord_flip() +
    guides(fill = "none")
}
plot(wordFreq(unigram, "Unigram"))

plot(wordFreq(bigram, "Bigram"))

plot(wordFreq(trigram, "Trigram"))

Next steps

This concludes the basic exploratory analysis. My next steps are:

  1. Generate predictive model(s) from the n-gram frequency tables (a rough back-off sketch follows this list).
  2. Build a final data product that predicts the next word based on user input.
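
As a rough illustration of the first step, the sketch below shows a simple frequency-based back-off lookup over the trigram and bigram tables built above (their feature names are space-separated because of the concatenator used earlier). The helper name predictNext and the back-off rules are my own assumptions for illustration only; a real model would still need smoothing, handling of unseen contexts, and the same stopword treatment that was applied to the training data.

# frequency counts of every trigram / bigram feature in the sample
triFreq <- colSums(trigram)
biFreq  <- colSums(bigram)

# assumed helper: predict the next word from the last one or two input words
predictNext <- function(input) {
  # apply the same lowercasing and stemming as the training pipeline
  w <- char_wordstem(tolower(unlist(strsplit(input, "\\s+"))))
  n <- length(w)

  # try the trigram table first: "last two words + ?"
  if (n >= 2) {
    prefix <- paste(w[n - 1], w[n])
    hits <- triFreq[startsWith(names(triFreq), paste0(prefix, " "))]
    if (length(hits) > 0) return(sub(".* ", "", names(which.max(hits))))
  }

  # back off to the bigram table: "last word + ?"
  hits <- biFreq[startsWith(names(biFreq), paste0(w[n], " "))]
  if (length(hits) > 0) return(sub(".* ", "", names(which.max(hits))))

  NA_character_  # nothing matched in the 1% sample
}

# example usage (the result depends on the sampled data)
predictNext("happy new")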