Introduction

This document is the week 2 milestone report for the Data Science Capstone course on Coursera, part of the Data Science Specialization track by Johns Hopkins University. In this course we work on text data analysis and natural language processing. The final goal is a word-prediction application: given a word typed by the user, the application will predict the next one.

For this specific milestone report, the goals are:

  1. load the data and give a basic statistical overview of the three datasets;
  2. sample, clean and preprocess the data;
  3. perform an exploratory analysis of the most frequent unigrams, bigrams and trigrams;
  4. outline the next steps towards the predictive algorithm and the final Shiny application.


Setup environment


Load packages and data

library(knitr)
library(kableExtra)
library(dplyr)
library(data.table)
library(stringi)
library(tm)
library(NLP)
library(RWeka)
library(SnowballC)
library(stringr)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
library(wordcloud2)

Loading the data

The data have already been downloaded and unzipped from this link: Capstone Dataset. We will work only on the English data, using three text datasets containing blogs, news and Twitter posts respectively. Since we have to clean the datasets of profane words, we will use a profanity list suggested by a classmate in the course forum (week 2). You can find this dataset at this link: Profanity list.

workingDir <- getwd()
dataDir <- c("en_US")
profanityDir <- c("profanity")
list.files(file.path(workingDir,dataDir))
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
list.files(file.path(workingDir,profanityDir))
## [1] "Terms-to-Block.csv"
# Reading blogs file
blogsFile <- list.files(file.path(workingDir,dataDir))[1]
connBlogsFile <- file(file.path(workingDir, dataDir, blogsFile), "r")
ds_blogs <- readLines(connBlogsFile, encoding = "UTF-8", skipNul = TRUE)
close(connBlogsFile)

# Reading news file. NB: here we force a binary reading in order to avoid an error occurring during normal reading
# that is: "incomplete final line found on ..".
newsFile <- list.files(file.path(workingDir,dataDir))[2]
connNewsFile <- file(file.path(workingDir, dataDir, newsFile), "rb")
ds_news <- readLines(connNewsFile, encoding = "UTF-8", skipNul = TRUE)
close(connNewsFile)

# Reading twitter file
twitterFile <- list.files(file.path(workingDir,dataDir))[3]
connTwitterFile <- file(file.path(workingDir, dataDir, twitterFile), "r")
ds_twitter <- readLines(connTwitterFile, encoding = "UTF-8", skipNul = TRUE)
close(connTwitterFile)

# Reading profanity file. Its content must be converted to a character vector, as required by the tm_map function
profanityFile <- list.files(file.path(workingDir,profanityDir))[1]
df_profanity <- read.table(file.path(workingDir, profanityDir, profanityFile),
        header = FALSE, sep = "\t", quote = "\"", stringsAsFactors = F,
        col.names = c('ProfaneWord'))
ds_profanity <- as.character(df_profanity$ProfaneWord)

Data overview

Now let’s analyze the three datasets from a statistical point of view (number of lines, number of words, etc.).

| Name of file | Size of file (Mb) | No. of lines | No. of lines with at least one non-white space | No. of characters | No. of characters that are non-white space | No. of words | Max length in characters |
|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.4 | 899,288 | 899,288 | 206,824,382 | 170,389,539 | 37,570,839 | 40,833 |
| en_US.news.txt | 196.3 | 1,010,242 | 1,010,242 | 203,223,154 | 169,860,866 | 34,494,539 | 11,384 |
| en_US.twitter.txt | 159.4 | 2,360,148 | 2,360,148 | 162,096,241 | 134,082,806 | 30,451,170 | 140 |
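For reference, statistics like these can be computed with the stringi package loaded above; the code below is a minimal sketch of one possible approach (the helper name summaryStats is purely illustrative).

# One possible way to compute the summary statistics (minimal sketch; the helper
# name 'summaryStats' is illustrative)
summaryStats <- function(x, filePath) {
    general <- stri_stats_general(x)  # Lines, LinesNEmpty, Chars, CharsNWhite
    data.frame(File = basename(filePath),
               SizeMb = round(file.size(filePath) / 1024^2, 1),
               Lines = general[["Lines"]],
               LinesNonEmpty = general[["LinesNEmpty"]],
               Chars = general[["Chars"]],
               CharsNonWhite = general[["CharsNWhite"]],
               Words = sum(stri_count_words(x)),
               MaxLength = max(nchar(x)))
}
df_stats <- rbind(summaryStats(ds_blogs, file.path(workingDir, dataDir, blogsFile)),
                  summaryStats(ds_news, file.path(workingDir, dataDir, newsFile)),
                  summaryStats(ds_twitter, file.path(workingDir, dataDir, twitterFile)))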

The three datasets comprise a large number of lines and words (4,269,678 lines and 102,516,548 words in total). This could affect performance and development time, so, as suggested in the course instructions and taking into account the limitations of my development environment, we will use a subset (1%) of the total data for the analysis.

Sampling the data

set.seed(12345)
sampleRate <- 0.01
ds_blogsSample <- sample(ds_blogs, size = length(ds_blogs) * sampleRate, replace = FALSE)
ds_newsSample <- sample(ds_news, size = length(ds_news) * sampleRate, replace = FALSE)
ds_twitterSample <- sample(ds_twitter, size = length(ds_twitter) * sampleRate, replace = FALSE)
ds_finalSample <- c(ds_blogsSample, ds_newsSample, ds_twitterSample)
rm(ds_blogs, ds_news, ds_twitter)

Now we have 42,695 lines and 1,024,708 words.
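These counts can be obtained directly from the sample, for instance with stringi:

# Line and word counts of the sampled dataset (minimal sketch)
length(ds_finalSample)                 # number of lines
sum(stri_count_words(ds_finalSample))  # number of words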


Preparing the data (cleaning and preprocessing)

Currently, the sample dataset contains raw data such as URLs, profane words, punctuation, special characters and so on, which are useless for our purposes. So, we have to perform some steps to clean up the data. Moreover, we have to preprocess the data to make them ready for the exploratory analysis. In particular, we will apply stemming in order to reduce every word to its ‘base form’ (see Stanford).

Before proceeding, let’s turn the sample dataset into a ‘corpus’ object, that is a collection of documents containing some text. Then, we will clean the data in order to:

  1. convert some special characters so that the stopword step works correctly;
  2. convert the text from UTF-8 to ASCII (in order to remove the remaining special characters);
  3. convert all characters to lowercase;
  4. remove web URLs;
  5. remove profane words;
  6. remove English stopwords, for instance ‘I’, ‘me’, ‘my’, etc.;
  7. remove numbers;
  8. remove both the ‘ordinal’ suffixes left over from numbers (‘st’, ‘nd’, ‘rd’, ‘th’) and what remains of decades after the number-removal step (90’s becomes ’s);
  9. remove punctuation;
  10. strip whitespace.

Please note that the order of execution of these steps is important! After the cleaning steps we apply stemming.

# Define a function to convert characters from UTF-8 to ASCII. Non-convertible characters will be
# replaced with an empty string. The goal is to remove all special characters.
# Curly apostrophes/quotes (Windows-1252 hex 92 and 91, i.e. U+2019 and U+2018) to be normalized
# to a plain apostrophe before the stopword step
char1 <- "\u2019"
char2 <- "\u2018"

textToAscii <- function(x) (iconv(x, "UTF-8", "ASCII", sub = ""))

# Corpus creation
ds_corpusSample <- VCorpus(VectorSource(ds_finalSample))
# Substitute specific special characters (hexadecimal 0x92 and 0x91) with a plain apostrophe
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(gsub), pattern = char1, replacement = "'", ignore.case = TRUE)
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(gsub), pattern = char2, replacement = "'", ignore.case = TRUE)
# Remove special characters
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(textToAscii))
# Convert all characters into lower case
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(tolower))
# Remove web urls
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(gsub), pattern = "http\\S+\\s*", replacement = "", ignore.case = TRUE)
# Remove profane words. This is best done before removing numbers, as some profane words contain numbers (e.g. "a55" for "ass")
ds_corpusSample <- tm_map(ds_corpusSample, removeWords, ds_profanity)
# Remove English stopwords
ds_corpusSample <- tm_map(ds_corpusSample, removeWords, stopwords('english'))
# Remove numbers
ds_corpusSample <- tm_map(ds_corpusSample, removeNumbers)
# Remove:
#   a) what remains of ordinal numbers ('st', 'nd', 'rd', 'th') after the number-removal step
#   b) what remains of decades ('s), which use a special character as apostrophe, after the number-removal step
#   c) quotes used for citations
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(gsub), pattern = " st | nd | rd | th | 's", replacement = "", ignore.case = TRUE)
# Remove punctuation
ds_corpusSample <- tm_map(ds_corpusSample, removePunctuation)
# Strip whitespace
ds_corpusSample <- tm_map(ds_corpusSample, stripWhitespace)
# Apply stemming
ds_corpusSample <- tm_map(ds_corpusSample, stemDocument)
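As a quick sanity check, a few cleaned documents can be inspected, for example:

# Inspect a few cleaned documents to verify that lowercasing, stopword removal,
# punctuation removal and stemming behaved as expected (minimal sketch)
lapply(ds_corpusSample[1:3], as.character)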

Exploratory data analysis

Tokenization

Now we want to analyze the frequency of the words in the data, that is: ‘what are the most common words in the data?’. We will do this not only for single words (unigrams, 1-grams), but also for two words at a time (bigrams, 2-grams) and three words at a time (trigrams, 3-grams).

To do this we have to apply tokenization to the data; with tokenization we will be able to ‘split a character string into individual elements of interest’ (see Introduction to text mining).
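For instance, RWeka’s NGramTokenizer splits a short phrase into bigrams as follows (the phrase is purely illustrative):

# Illustrative example of bigram tokenization with RWeka
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# expected: "thanks for" "for the" "the follow"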

The output will be a Term Document Matrix. In this matrix each ‘corpus term’ is represented as a row and each document as a column. Every cell of the matrix contains the frequency of that specific term in that document (a tweet, a news item or a blog post). Using tokenization, we will create three different Term Document Matrices whose ‘corpus terms’ are, respectively, the unigrams, the bigrams and the trigrams detected by tokenization. In this way, we will be able to analyze their distribution.

Tokenization produces a very sparse matrix (most terms occur in only a few documents). We will control this with removeSparseTerms, using a sparsity threshold between 0 and 1.

# Unigram tokenization
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
mt_unigram <- TermDocumentMatrix(ds_corpusSample, control = list(tokenize = unigramTokenizer))
# Reduce sparsity
prmt_SparseTerm <- 0.99995
mt_unigram <- removeSparseTerms(mt_unigram, prmt_SparseTerm)
# Find the most frequent unigrams with a frequency >= 7
ls_mostFreqUnigram <- findFreqTerms(mt_unigram, lowfreq = 7)
ls_mostFreqUnigram <- rowSums(as.matrix(mt_unigram[ls_mostFreqUnigram, ]))
df_mostFreqUnigram <- data.frame(Word = names(ls_mostFreqUnigram), Frequency=ls_mostFreqUnigram)
df_mostFreqUnigram <- arrange(df_mostFreqUnigram, desc(Frequency))
df_mostFreqUnigram_sel <- df_mostFreqUnigram[1:20,]

# Bigram tokenization
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
mt_bigram <- TermDocumentMatrix(ds_corpusSample, control = list(tokenize = bigramTokenizer))
# Reduce sparsity
mt_bigram <- removeSparseTerms(mt_bigram, prmt_SparseTerm)
# Find the most frequent bigrams with a frequency >= 7
ls_mostFreqBigram <- findFreqTerms(mt_bigram, lowfreq = 7)
ls_mostFreqBigram <- rowSums(as.matrix(mt_bigram[ls_mostFreqBigram, ]))
df_mostFreqBigram <- data.frame(Word = names(ls_mostFreqBigram), Frequency = ls_mostFreqBigram)
df_mostFreqBigram <- arrange(df_mostFreqBigram, desc(Frequency))
df_mostFreqBigram_sel <- df_mostFreqBigram[1:20,]

# Trigram tokenization
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
mt_trigram <- TermDocumentMatrix(ds_corpusSample, control = list(tokenize = trigramTokenizer))
# Reduce sparsity
mt_trigram <- removeSparseTerms(mt_trigram, prmt_SparseTerm)
# Find the most frequent trigrams with a frequency >= 7
ls_mostFreqTrigram <- findFreqTerms(mt_trigram, lowfreq = 7)
ls_mostFreqTrigram <- rowSums(as.matrix(mt_trigram[ls_mostFreqTrigram, ]))
df_mostFreqTrigram <- data.frame(Word = names(ls_mostFreqTrigram), Frequency = ls_mostFreqTrigram)
df_mostFreqTrigram <- arrange(df_mostFreqTrigram, desc(Frequency))
df_mostFreqTrigram_sel <- df_mostFreqTrigram[1:20,]

Plotting the data

Unigram frequency
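The frequency bar chart can be drawn with ggplot2 from the data frame built above; the following is a minimal sketch for the unigrams (the same pattern, applied to df_mostFreqBigram_sel and df_mostFreqTrigram_sel, produces the bigram and trigram plots).

# Bar chart of the 20 most frequent unigrams (minimal sketch)
ggplot(df_mostFreqUnigram_sel, aes(x = reorder(Word, Frequency), y = Frequency)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(title = "Top 20 unigrams", x = "Unigram", y = "Frequency")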

Bigram frequency

Trigram frequency


Next steps

Looking at the results of the exploratory analysis, we can reasonably count on a reliable dataset to work with. So the next steps could be:

  1. find a larger sample in order to build the predictive algorithm;
  2. develop the predictive algorithm, paying attention to performance and code optimization;
  3. implement the final Shiny application.