This milestone report is the Week 2 assignment of the Data Science Capstone course in the Data Science Specialization offered by Johns Hopkins University on Coursera.
The overall goal of the capstone is to develop an algorithm that predicts the most likely next word in a sequence of words. The purpose of this milestone report is to present some exploratory data analysis of the main features of the data, which will inform the eventual prediction algorithm and app.
To get a first impression of the data, we look at the size in megabytes, the number of lines, the number of characters, and the number of words for each of the three datasets (Blogs, News, and Twitter), as well as the minimum, maximum, and mean number of words per line (WPL).
# Libraries
library(knitr); library(dplyr); library(doParallel); library(tm); library(SnowballC)
library(stringi); library(ggplot2); library(wordcloud); library(kableExtra)
setwd("~/Coursera/8_Data_Science_Specialization/10 Data Science Capstone/final/en_US")
path1 <- "./en_US.blogs.txt"
path2 <- "./en_US.news.txt"
path3 <- "./en_US.twitter.txt"
# Read blogs data in binary mode
conn <- file(path1, open="rb")
blogs <- readLines(conn, encoding="UTF-8"); close(conn)
# Read news data in binary mode
conn <- file(path2, open="rb")
news <- readLines(conn, encoding="UTF-8"); close(conn)
# Read twitter data in binary mode
conn <- file(path3, open="rb")
twitter <- readLines(conn, encoding="UTF-8"); close(conn)
# Remove temporary variable
rm(conn)
# Summary info
WPL <- sapply(list(blogs,news,twitter),function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WPL) <- c('WPL_Min','WPL_Mean','WPL_Max')
stats <- data.frame(
  FileName = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  "File Size" = sapply(list(blogs, news, twitter), function(x) { format(object.size(x), "MB") }),
  t(rbind(
    sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words', ],
    WPL
  ))
)
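The summary data frame is then rendered as the table below. A minimal sketch of the rendering call, assuming the stats object built above and the kableExtra package loaded earlier:
# Render the summary statistics table (sketch)
kable(stats) %>% kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)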
| FileName | File.Size | Lines | Chars | Words | WPL_Min | WPL_Mean | WPL_Max |
|---|---|---|---|---|---|---|---|
| en_US.blogs | 255.4 Mb | 899288 | 206824382 | 37570839 | 0 | 41.75107 | 6726 |
| en_US.news | 257.3 Mb | 1010242 | 203223154 | 34494539 | 1 | 34.40997 | 1796 |
| en_US.twitter | 319 Mb | 2360148 | 162096031 | 30451128 | 1 | 12.75063 | 47 |
The following clean-up steps are performed on a 1% random sample of the three datasets; the sampling itself is sketched first.
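A minimal sketch of the sampling step, assuming the sampleData name used below (the seed value is an arbitrary choice):
# Draw roughly a 1% random sample from each dataset (sketch)
set.seed(123)
sampleData <- c(sample(blogs, round(length(blogs) * 0.01)),
                sample(news, round(length(news) * 0.01)),
                sample(twitter, round(length(twitter) * 0.01)))
rm(blogs, news, twitter)  # free memory; the full datasets are no longer needed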
# Build Corpus
build_corpus <- function (x = sampleData) {
  sample_c <- VCorpus(VectorSource(x))                             # Create corpus dataset
  sample_c <- tm_map(sample_c, content_transformer(tolower))       # Convert to lowercase
  sample_c <- tm_map(sample_c, removePunctuation)                  # Eliminate punctuation
  sample_c <- tm_map(sample_c, removeNumbers)                      # Eliminate numbers
  sample_c <- tm_map(sample_c, stripWhitespace)                    # Strip whitespace
  sample_c <- tm_map(sample_c, removeWords, stopwords("english"))  # Eliminate English stop words
  sample_c <- tm_map(sample_c, stemDocument)                       # Stem the documents
  return(sample_c)  # content_transformer keeps plain text documents, so no extra conversion is needed
}
corpusData <- build_corpus(sampleData)
# Tokenize and Calculate Frequencies of N-Grams
library(RWeka)
getTermTable <- function(corpusData, ngrams = 1, lowfreq = 50) {
  # Create a term-document matrix tokenized on n-grams
  tokenizer <- function(x) { NGramTokenizer(x, Weka_control(min = ngrams, max = ngrams)) }
  tdm <- TermDocumentMatrix(corpusData, control = list(tokenize = tokenizer))
  # Find the terms that occur at least 'lowfreq' times in the corpus
  top_terms <- findFreqTerms(tdm, lowfreq)
  top_terms_freq <- rowSums(as.matrix(tdm[top_terms, ]))
  top_terms_freq <- data.frame(word = names(top_terms_freq), frequency = top_terms_freq)
  # Return the terms sorted by decreasing frequency
  return(arrange(top_terms_freq, desc(frequency)))
}
tt.Data <- vector("list", 3)  # pre-allocate a list for the 1-, 2-, and 3-gram tables
for (i in 1:3) {
tt.Data[[i]] <- getTermTable(corpusData, ngrams = i, lowfreq = 10)
}
We need to convert the dataset into a format that is most useful for Natural Language Processing (NLP). The format of choice is the N-gram. The N-gram representation of a text lists all N-tuples of words that appear in it. The simplest case is the unigram, which is based on individual words; the bigram is based on pairs of words, and the trigram on sequences of three words.
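As a quick illustration, the same RWeka tokenizer used above can be applied to a made-up sentence (a sketch, not part of the analysis):
# Bigram tokenization of a short example sentence
NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
# yields the bigrams "the quick", "quick brown", "brown fox"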
The wordcloud package is used to illustrate what the corpus looks like in terms of word frequency. The word clouds for unigrams, bigrams, and trigrams are shown from left to right, respectively.
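A minimal sketch of the calls that produce these plots, assuming the tt.Data frequency tables built above (the panel layout, word limit, and scaling are illustrative choices):
# Word clouds for unigrams, bigrams, and trigrams, left to right (sketch)
par(mfrow = c(1, 3))
for (i in 1:3) {
  wordcloud(words = tt.Data[[i]]$word, freq = tt.Data[[i]]$frequency,
            max.words = 50, random.order = FALSE, scale = c(3, 0.5))
}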
After this exploratory analysis, we can continue with the goal of the capstone: building the predictive model(s) and, eventually, the data product.