Introduction

This document is the week 2 milestone report for the Data Science Capstone course on Coursera, part of the Data Science Specialization track by Johns Hopkins University. In this course we work on text data analysis and natural language processing. The final goal is a word-prediction application: given a word typed by the user, the application will predict the next one.

For this specific milestone report, the goals are:

  1. load the data and give a basic statistical overview of the three datasets;
  2. sample, clean and preprocess the data;
  3. perform an exploratory analysis of the most frequent unigrams, bigrams and trigrams;
  4. outline the next steps towards the predictive algorithm and the final Shiny application.


Setup environment


Load packages and data

library(knitr)
library(kableExtra)
library(dplyr)
library(data.table)
library(stringi)
library(tm)
library(NLP)
library(RWeka)
library(SnowballC)
library(stringr)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
library(wordcloud2)

Loading the data

The data have already been downloaded and unzipped from this link: Capstone Dataset. We will work only on the English data, using three text datasets containing blogs, news and Twitter posts respectively. Since we have to clean the datasets of profane words, we will use a profanity list suggested by a classmate in the course forum (week 2). You can find this dataset at this link: Profanity list.

workingDir <- getwd()
dataDir <- c("en_US")
profanityDir <- c("profanity")
list.files(file.path(workingDir,dataDir))
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
list.files(file.path(workingDir,profanityDir))
## [1] "Terms-to-Block.csv"
# Reading blogs file
blogsFile <- list.files(file.path(workingDir,dataDir))[1]
connBlogsFile <- file(file.path(workingDir, dataDir, blogsFile), "r")
ds_blogs <- readLines(connBlogsFile, encoding = "UTF-8", skipNul = TRUE)
close(connBlogsFile)

# Reading news file. NB: here we force a binary reading in order to avoid an error occurring during normal reading
# that is: "incomplete final line found on ..".
newsFile <- list.files(file.path(workingDir,dataDir))[2]
connNewsFile <- file(file.path(workingDir, dataDir, newsFile), "rb")
ds_news <- readLines(connNewsFile, encoding = "UTF-8", skipNul = TRUE)
close(connNewsFile)

# Reading twitter file
twitterFile <- list.files(file.path(workingDir,dataDir))[3]
connTwitterFile <- file(file.path(workingDir, dataDir, twitterFile), "r")
ds_twitter <- readLines(connTwitterFile, encoding = "UTF-8", skipNul = TRUE)
close(connTwitterFile)

# Reading profanity file. Its content must be converted to a character vector, as required by the tm_map function
profanityFile <- list.files(file.path(workingDir,profanityDir))[1]
df_profanity <- read.table(file.path(workingDir, profanityDir, profanityFile),
        header = FALSE, sep = "\t", quote = "\"", stringsAsFactors = F,
        col.names = c('ProfaneWord'))
ds_profanity <- as.character(df_profanity$ProfaneWord)

Data overview

Now let’s analyze the three datasets from a statistical point of view (number of lines, number of words, etc.).

| Name of file | Size of file (Mb) | No. of lines | No. of lines with at least one non-white space | No. of characters | No. of characters that are non-white space | No. of words | Max length in characters |
|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.4 | 899,288 | 899,288 | 206,824,382 | 170,389,539 | 37,570,839 | 40,833 |
| en_US.news.txt | 196.3 | 1,010,242 | 1,010,242 | 203,223,154 | 169,860,866 | 34,494,539 | 11,384 |
| en_US.twitter.txt | 159.4 | 2,360,148 | 2,360,148 | 162,096,241 | 134,082,806 | 30,451,170 | 140 |
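For reference, statistics like these can be computed with the stringi package loaded above; the code below is a minimal sketch of one possible approach (the helper name summaryStats is purely illustrative).

# One possible way to compute the summary statistics (minimal sketch; the helper
# name 'summaryStats' is illustrative)
summaryStats <- function(x, filePath) {
    general <- stri_stats_general(x)  # Lines, LinesNEmpty, Chars, CharsNWhite
    data.frame(File = basename(filePath),
               SizeMb = round(file.size(filePath) / 1024^2, 1),
               Lines = general[["Lines"]],
               LinesNonEmpty = general[["LinesNEmpty"]],
               Chars = general[["Chars"]],
               CharsNonWhite = general[["CharsNWhite"]],
               Words = sum(stri_count_words(x)),
               MaxLength = max(nchar(x)))
}
df_stats <- rbind(summaryStats(ds_blogs, file.path(workingDir, dataDir, blogsFile)),
                  summaryStats(ds_news, file.path(workingDir, dataDir, newsFile)),
                  summaryStats(ds_twitter, file.path(workingDir, dataDir, twitterFile)))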

The three datasets comprise a large number of lines and words (4,269,678 lines and 102,516,548 words in total). This could affect performance and development time, so, as suggested in the course instructions and taking into account the limitations of my development environment, we will use a subset (1%) of the total data for the analysis.

Sampling the data

set.seed(12345)
sampleRate <- 0.01
ds_blogsSample <- sample(ds_blogs, size = length(ds_blogs) * sampleRate, replace = FALSE)
ds_newsSample <- sample(ds_news, size = length(ds_news) * sampleRate, replace = FALSE)
ds_twitterSample <- sample(ds_twitter, size = length(ds_twitter) * sampleRate, replace = FALSE)
ds_finalSample <- c(ds_blogsSample, ds_newsSample, ds_twitterSample)
rm(ds_blogs, ds_news, ds_twitter)

Now we have 42,695 lines and 1,024,708 words.
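These counts can be obtained directly from the sample, for instance with stringi:

# Line and word counts of the sampled dataset (minimal sketch)
length(ds_finalSample)                 # number of lines
sum(stri_count_words(ds_finalSample))  # number of words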


Preparing the data (cleaning and preprocessing)

Currently, the sample dataset contains raw data such as URLs, profane words, punctuation, special characters and so on, which are useless for our purposes. So, we have to perform some steps to clean up the data. Moreover, we have to preprocess the data to make them ready for the exploratory analysis. In particular, we will apply stemming in order to reduce every word to its ‘base form’ (see Stanford).

Before proceeding, let’s turn the sample dataset into a ‘corpus’ object, that is a collection of documents containing some text. Then, we will clean the data in order to:

  1. convert some special characters so that the stopword step works correctly;
  2. convert the text from UTF-8 to ASCII (in order to remove the remaining special characters);
  3. convert all characters to lowercase;
  4. remove web URLs;
  5. remove profane words;
  6. remove English stopwords, for instance ‘I’, ‘me’, ‘my’, etc.;
  7. remove numbers;
  8. remove both the ‘ordinal’ suffixes left over from numbers (‘st’, ‘nd’, ‘rd’, ‘th’) and what remains of decades after the number-removal step (90’s becomes ’s);
  9. remove punctuation;
  10. strip whitespace.

Please note that the order of execution of these steps is important! After the cleaning steps we apply stemming.

# Define a function to convert characters from UTF-8 to ASCII. Non-convertible characters will be
# replaced with an empty string. The goal is to remove all special characters.
# Curly apostrophes/quotes (Windows-1252 hex 92 and 91, i.e. U+2019 and U+2018) to be normalized
# to a plain apostrophe before the stopword step
char1 <- "\u2019"
char2 <- "\u2018"

textToAscii <- function(x) (iconv(x, "UTF-8", "ASCII", sub = ""))

# Corpus creation
ds_corpusSample <- VCorpus(VectorSource(ds_finalSample))
# Substitute specific special characters (hexadecimal 0x92 and 0x91) with a plain apostrophe
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(gsub), pattern = char1, replacement = "'", ignore.case = TRUE)
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(gsub), pattern = char2, replacement = "'", ignore.case = TRUE)
# Remove special characters
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(textToAscii))
# Convert all characters into lower case
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(tolower))
# Remove web urls
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(gsub), pattern = "http\\S+\\s*", replacement = "", ignore.case = TRUE)
# Remove profane words. This is best done before removing numbers, as some profane words contain numbers (e.g. "a55" for "ass")
ds_corpusSample <- tm_map(ds_corpusSample, removeWords, ds_profanity)
# Remove English stopwords
ds_corpusSample <- tm_map(ds_corpusSample, removeWords, stopwords('english'))
# Remove numbers
ds_corpusSample <- tm_map(ds_corpusSample, removeNumbers)
# Remove:
#   a) what remains of ordinal numbers ('st', 'nd', 'rd', 'th') after the number-removal step
#   b) what remains of decades ('s), which use a special character as apostrophe, after the number-removal step
#   c) quotes used for citations
ds_corpusSample <- tm_map(ds_corpusSample, content_transformer(gsub), pattern = " st | nd | rd | th | 's", replacement = "", ignore.case = TRUE)
# Remove punctuation
ds_corpusSample <- tm_map(ds_corpusSample, removePunctuation)
# Strip whitespace
ds_corpusSample <- tm_map(ds_corpusSample, stripWhitespace)
# Apply stemming
ds_corpusSample <- tm_map(ds_corpusSample, stemDocument)
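As a quick sanity check, a few cleaned documents can be inspected, for example:

# Inspect a few cleaned documents to verify that lowercasing, stopword removal,
# punctuation removal and stemming behaved as expected (minimal sketch)
lapply(ds_corpusSample[1:3], as.character)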

Exploratory data analysis

Tokenization

Now we want to analyze the frequency of the words in the data, that is: ‘what are the most common words in the data?’. We will do this not only for single words (unigrams, 1-grams), but also for two words at a time (bigrams, 2-grams) and three words at a time (trigrams, 3-grams).

To do this we have to apply tokenization to the data; with tokenization we will be able to ‘split a character string into individual elements of interest’ (see Introduction to text mining).
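For instance, RWeka’s NGramTokenizer splits a short phrase into bigrams as follows (the phrase is purely illustrative):

# Illustrative example of bigram tokenization with RWeka
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# expected: "thanks for" "for the" "the follow"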

The output will be a Term Document Matrix. In this matrix each ‘corpus term’ is represented as a row and each document as a column. Every cell of the matrix contains the frequency of that specific term in that document (a tweet, a news item or a blog post). Using tokenization, we will create three different Term Document Matrices whose ‘corpus terms’ are, respectively, the unigrams, the bigrams and the trigrams detected by tokenization. In this way, we will be able to analyze their distribution.

Tokenization produces a very sparse matrix (most terms occur in only a few documents). We will control this with removeSparseTerms, using a sparsity threshold between 0 and 1.

# Unigram tokenization
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
mt_unigram <- TermDocumentMatrix(ds_corpusSample, control = list(tokenize = unigramTokenizer))
# Reduce sparsity
prmt_SparseTerm <- 0.99995
mt_unigram <- removeSparseTerms(mt_unigram, prmt_SparseTerm)
# Find the most frequent unigrams with a frequency >= 7
ls_mostFreqUnigram <- findFreqTerms(mt_unigram, lowfreq = 7)
ls_mostFreqUnigram <- rowSums(as.matrix(mt_unigram[ls_mostFreqUnigram, ]))
df_mostFreqUnigram <- data.frame(Word = names(ls_mostFreqUnigram), Frequency=ls_mostFreqUnigram)
df_mostFreqUnigram <- arrange(df_mostFreqUnigram, desc(Frequency))
df_mostFreqUnigram_sel <- df_mostFreqUnigram[1:20,]

# Bigram tokenization
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
mt_bigram <- TermDocumentMatrix(ds_corpusSample, control = list(tokenize = bigramTokenizer))
# Reduce sparsity
mt_bigram <- removeSparseTerms(mt_bigram, prmt_SparseTerm)
# Find the most frequent bigrams with a frequency >= 7
ls_mostFreqBigram <- findFreqTerms(mt_bigram, lowfreq = 7)
ls_mostFreqBigram <- rowSums(as.matrix(mt_bigram[ls_mostFreqBigram, ]))
df_mostFreqBigram <- data.frame(Word = names(ls_mostFreqBigram), Frequency = ls_mostFreqBigram)
df_mostFreqBigram <- arrange(df_mostFreqBigram, desc(Frequency))
df_mostFreqBigram_sel <- df_mostFreqBigram[1:20,]

# Trigram tokenization
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
mt_trigram <- TermDocumentMatrix(ds_corpusSample, control = list(tokenize = trigramTokenizer))
# Reduce sparsity
mt_trigram <- removeSparseTerms(mt_trigram, prmt_SparseTerm)
# Find the most frequent trigrams with a frequency >= 7
ls_mostFreqTrigram <- findFreqTerms(mt_trigram, lowfreq = 7)
ls_mostFreqTrigram <- rowSums(as.matrix(mt_trigram[ls_mostFreqTrigram, ]))
df_mostFreqTrigram <- data.frame(Word = names(ls_mostFreqTrigram), Frequency = ls_mostFreqTrigram)
df_mostFreqTrigram <- arrange(df_mostFreqTrigram, desc(Frequency))
df_mostFreqTrigram_sel <- df_mostFreqTrigram[1:20,]

Plotting the data

Unigram frequency
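The frequency bar chart can be drawn with ggplot2 from the data frame built above; the following is a minimal sketch for the unigrams (the same pattern, applied to df_mostFreqBigram_sel and df_mostFreqTrigram_sel, produces the bigram and trigram plots).

# Bar chart of the 20 most frequent unigrams (minimal sketch)
ggplot(df_mostFreqUnigram_sel, aes(x = reorder(Word, Frequency), y = Frequency)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(title = "Top 20 unigrams", x = "Unigram", y = "Frequency")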

Bigram frequency

Trigram frequency


Next steps

Looking at the results of the exploratory analysis, we can reasonably count on a reliable dataset to work with. So the next steps could be:

  1. find a larger sample in order to build the predictive algorithm;
  2. develop the predictive algorithm, paying attention to performance and code optimization;
  3. implement the final Shiny application.