Executive Summary

This milestone report is the week-two assignment of the Data Science Capstone course, part of the Data Science Specialization offered by Johns Hopkins University on Coursera.

The overall goal of the capstone is to develop an algorithm that predicts the most likely next word in a sequence of words. The purpose of this milestone report is to present exploratory data analysis of the main features of the data, which will inform the eventual prediction algorithm and app.

Summary Statistics and Basic Information about the Corpus Dataset

To get a sense of what the data looks like, we report the file size in megabytes, the number of lines, the number of characters, and the number of words for each of the three datasets (blogs, news, and Twitter), as well as the minimum, mean, and maximum number of words per line (WPL).

# Libraries
library(knitr); library(dplyr); library(doParallel); library(tm); library(SnowballC)
library(stringi); library(ggplot2); library(wordcloud); library(kableExtra)
    
setwd("~/Coursera/8_Data_Science_Specialization/10 Data Science Capstone/final/en_US")
    
path1 <- "./en_US.blogs.txt"
path2 <- "./en_US.news.txt"
path3 <- "./en_US.twitter.txt"
    
# Read blogs data in binary mode
conn <- file(path1, open="rb")
blogs <- readLines(conn, encoding="UTF-8"); close(conn)
# Read news data in binary mode
conn <- file(path2, open="rb")
news <- readLines(conn, encoding="UTF-8"); close(conn)
# Read twitter data in binary mode
conn <- file(path3, open="rb")
twitter <- readLines(conn, encoding="UTF-8"); close(conn)
# Remove temporary variable
rm(conn)
    
# Summary info
WPL <- sapply(list(blogs,news,twitter),function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WPL) <- c('WPL_Min','WPL_Mean','WPL_Max')
stats <- data.frame(
    FileName=c("en_US.blogs","en_US.news","en_US.twitter"),     
    "File Size" = sapply(list(blogs, news, twitter), function(x){format(object.size(x),"MB")}),
    t(rbind(
        sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],
        Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',],
        WPL)
    ))
FileName        File Size  Lines    Chars      Words     WPL_Min  WPL_Mean  WPL_Max
en_US.blogs     255.4 Mb    899288  206824382  37570839        0  41.75107     6726
en_US.news      257.3 Mb   1010242  203223154  34494539        1  34.40997     1796
en_US.twitter   319 Mb     2360148  162096031  30451128        1  12.75063       47
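The knitr and kableExtra packages loaded above could be used to render this summary as a styled table; a minimal sketch (the caption and styling options are illustrative, not taken from the original report):

kable(stats, caption = "Summary statistics of the three datasets") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)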

Build Corpus and Clean the Data

The following clean-up steps are performed on a 1% random sample of the dataset; a sketch of how such a sample could be drawn is shown first, followed by the corpus-building function:
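The report does not show how sampleData is created. The code below is a minimal sketch, assuming 1% of each source read above is drawn at random; the seed and the sample_frac name are illustrative and not part of the original analysis.

set.seed(1234) # Illustrative seed for reproducibility
sample_frac <- 0.01 # Draw 1% of each source
sampleData <- c(sample(blogs,   round(length(blogs)   * sample_frac)),
                sample(news,    round(length(news)    * sample_frac)),
                sample(twitter, round(length(twitter) * sample_frac)))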

# Build Corpus
    
build_corpus <- function (x = sampleData) {
    sample_c <- VCorpus(VectorSource(x)) # Create corpus dataset
    sample_c <- tm_map(sample_c, content_transformer(tolower)) # Convert to lowercase
    sample_c <- tm_map(sample_c, removePunctuation) # Eliminate punctuation
    sample_c <- tm_map(sample_c, removeNumbers) # Eliminate numbers
    sample_c <- tm_map(sample_c, stripWhitespace) # Strip whitespace
    sample_c <- tm_map(sample_c, removeWords, stopwords("english")) # Eliminate English stop words
    sample_c <- tm_map(sample_c, stemDocument) # Stem the documents
    sample_c # Return the cleaned corpus; documents remain PlainTextDocument objects
}
corpusData <- build_corpus(sampleData)
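To sanity-check the cleaning, the content of a cleaned document can be inspected; a minimal sketch using tm's content() accessor (the document index is arbitrary):

content(corpusData[[1]]) # Show the first cleaned, stemmed document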
# Tokenize and Calculate Frequencies of N-Grams
    
library(RWeka)

getTermTable <- function(corpusData, ngrams = 1, lowfreq = 50) {
    # Create a term-document matrix tokenized on n-grams
    tokenizer <- function(x) { NGramTokenizer(x, Weka_control(min = ngrams, max = ngrams)) }
    tdm <- TermDocumentMatrix(corpusData, control = list(tokenize = tokenizer))
    # Find the terms that occur at least 'lowfreq' times in the corpus
    top_terms <- findFreqTerms(tdm, lowfreq)
    top_terms_freq <- rowSums(as.matrix(tdm[top_terms, ]))
    top_terms_freq <- data.frame(word = names(top_terms_freq), frequency = top_terms_freq)
    arrange(top_terms_freq, desc(frequency)) # Return the table sorted by descending frequency
}
    
tt.Data <- vector("list", 3) # One frequency table per n-gram order (1, 2, 3)
for (i in 1:3) {
    tt.Data[[i]] <- getTermTable(corpusData, ngrams = i, lowfreq = 10)
}

N-grams

We need to convert the dataset into a format that is most useful for Natural Language Processing (NLP). The format of choice is N-grams. The N-gram representation of a text lists all N-tuples of words that appear in it. The simplest case is the unigram, which is based on individual words; the bigram is based on pairs of two consecutive words, and the trigram on sequences of three consecutive words.
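As a small illustration (not part of the original analysis), RWeka's NGramTokenizer, loaded above, produces the following unigrams, bigrams, and trigrams for a short example sentence:

example <- "i love data science"
NGramTokenizer(example, Weka_control(min = 1, max = 1)) # "i" "love" "data" "science"
NGramTokenizer(example, Weka_control(min = 2, max = 2)) # "i love" "love data" "data science"
NGramTokenizer(example, Weka_control(min = 3, max = 3)) # "i love data" "love data science"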

Plot Sampled Corpus Data with Word Cloud

The wordcloud package is used to show what the corpus looks like in terms of word frequencies. The word clouds for unigrams, bigrams, and trigrams are shown from left to right, respectively.
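The plotting code is not reproduced in this report; a minimal sketch, assuming the frequency tables in tt.Data and the wordcloud package loaded earlier (the display parameters are illustrative):

library(RColorBrewer) # For brewer.pal color palettes
par(mfrow = c(1, 3)) # Unigrams, bigrams, trigrams side by side
for (i in 1:3) {
    wordcloud(words = tt.Data[[i]]$word, freq = tt.Data[[i]]$frequency,
              max.words = 50, random.order = FALSE, scale = c(3, 0.5),
              colors = brewer.pal(8, "Dark2"))
}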

N-Gram Frequency Plots
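The frequency plots themselves are not reproduced here; a minimal ggplot2 sketch of plotting the 20 most frequent terms from each table in tt.Data (the helper name and styling are illustrative):

plot_top_terms <- function(freq_table, title) {
    # Bar chart of the 20 most frequent n-grams, largest at the top
    ggplot(head(freq_table, 20), aes(x = reorder(word, frequency), y = frequency)) +
        geom_col(fill = "steelblue") +
        coord_flip() +
        labs(title = title, x = "N-gram", y = "Frequency")
}
plot_top_terms(tt.Data[[1]], "Top 20 Unigrams")
plot_top_terms(tt.Data[[2]], "Top 20 Bigrams")
plot_top_terms(tt.Data[[3]], "Top 20 Trigrams")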

Plans for Prediction Algorithm and Shiny App

After this exploratory analysis, we can continue toward the goal of the capstone: building the predictive model(s) and, eventually, the data product in the form of a Shiny app.