This report presents an exploratory analysis of the data used to build a word prediction model. The goal is to build a basic n-gram model that predicts the next word from the previous 1, 2, or 3 words. The data come from three sources: 1. Blogs, 2. Twitter, 3. News. At this intermediate stage of the project, the data are sampled, preprocessed, and tokenized to generate n-grams. The report also outlines the plan for building the word prediction app.
| source | size_MB | line_count | word_count |
|---|---|---|---|
| blogs | 200.4242 | 899288 | 37334131 |
| twitter | 159.3641 | 2360148 | 30373583 |
| news | 196.2775 | 77259 | 2643969 |
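The raw files are read and summarized before sampling; a minimal sketch of how the table above could be produced is shown below. The file names, the use of stringi::stri_count_words for word counts, and the knitr::kable call are assumptions for illustration, not code taken from this report.
# read the raw files and summarize size, line count, and word count
library(stringi)
files <- c(blogs = "en_US.blogs.txt", twitter = "en_US.twitter.txt", news = "en_US.news.txt")
blogs   <- readLines(files["blogs"],   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(files["twitter"], encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(files["news"],    encoding = "UTF-8", skipNul = TRUE)
df_raw <- data.frame(
  source     = names(files),
  size_MB    = file.size(files) / 1024^2,
  line_count = c(length(blogs), length(twitter), length(news)),
  word_count = c(sum(stri_count_words(blogs)),
                 sum(stri_count_words(twitter)),
                 sum(stri_count_words(news))),
  row.names  = NULL
)
knitr::kable(df_raw)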
A random sample of 5% of each source is taken to represent the larger data set. The following table summarizes the samples.
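The chunk below relies on the helper functions fn_sample and fn_words, which are defined earlier in the analysis. Hedged sketches of what they are assumed to do (simple random sampling of lines and word counting, respectively) are given here for reference only.
# assumed helpers (illustrative; the actual definitions appear earlier in the analysis)
# fn_sample: draw a random subset of lines, given a sampling fraction
fn_sample <- function(x, percent) sample(x, size = round(length(x) * percent))
# fn_words: count the total number of words in a character vector
fn_words <- function(x) sum(stringi::stri_count_words(x))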
# set seed for reproducibility and sample 5% of each source
set.seed(123)
percent <- 0.05
blogs_sample <- fn_sample(blogs, percent)
twitter_sample <- fn_sample(twitter, percent)
news_sample <- fn_sample(news, percent)
# free memory used by the full data sets
rm(blogs, twitter, news)
# create a list to store the samples
list_sample <- list(blogs = blogs_sample, twitter = twitter_sample, news = news_sample)
# create a data frame summarizing the samples
df_sample <- data.frame(source = c("blogs", "twitter", "news"), line_count = NA, word_count = NA)
# get line and word counts for each sample
df_sample$line_count <- sapply(list_sample, length)
df_sample$word_count <- sapply(list_sample, fn_words)
rm(list_sample)
kable(df_sample)
| source | line_count | word_count |
|---|---|---|
| blogs | 44964 | 1873491 |
| twitter | 118007 | 1521049 |
| news | 3862 | 132905 |
Data preprocessing performs the following tasks: removes URLs, hashtags, Twitter handles, and non-ASCII characters; converts the text to lowercase; removes English stopwords, profanity, punctuation, and numbers; and normalizes whitespace.
# Preprocess Data
### helper functions
stringi_toLower <- function(x) stringi::stri_trans_tolower(x)
remove_URL <- function(x) gsub("http[s]?://\\S+", "", x)
remove_HashTags <- function(x) gsub("#\\S+", "", x)
remove_TwitterHandles <- function(x) gsub("@\\S+", "", x)
remove_nonAscii <- function(x) gsub("[^\x01-\x7F]", "", x)
fix_whitespaces <- function(x) qdapRegex::rm_white(x)
## List of Bad Words and Top Swear Words Banned by Google
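# fn_rlines is defined earlier in the analysis; it is assumed to be a thin
# wrapper around readLines(), e.g.:
# fn_rlines <- function(path) readLines(path, encoding = "UTF-8", skipNul = TRUE)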
profanity_words <- fn_rlines("list.txt")
# function to preprocess a character vector into a cleaned corpus
fn_preprocess <- function(list_text){
  corpus_text <- tm::Corpus(tm::VectorSource(list_text))
  corpus_text <- tm::tm_map(corpus_text, tm::content_transformer(remove_URL))
  corpus_text <- tm::tm_map(corpus_text, tm::content_transformer(remove_HashTags))
  corpus_text <- tm::tm_map(corpus_text, tm::content_transformer(remove_TwitterHandles))
  corpus_text <- tm::tm_map(corpus_text, tm::content_transformer(remove_nonAscii))
  corpus_text <- tm::tm_map(corpus_text, tm::content_transformer(stringi_toLower))
  corpus_text <- tm::tm_map(corpus_text, tm::removeWords, tm::stopwords("en"))
  corpus_text <- tm::tm_map(corpus_text, tm::removeWords, profanity_words)
  corpus_text <- tm::tm_map(corpus_text, tm::removePunctuation)
  corpus_text <- tm::tm_map(corpus_text, tm::removeNumbers)
  corpus_text <- tm::tm_map(corpus_text, tm::content_transformer(fix_whitespaces))
  return(corpus_text)
}
# ensure the samples are treated as UTF-8 before preprocessing
Encoding(blogs_sample) <- "UTF-8"
Encoding(twitter_sample) <- "UTF-8"
Encoding(news_sample) <- "UTF-8"
corpus_blogs <- fn_preprocess(blogs_sample)
corpus_twitter <- fn_preprocess(twitter_sample)
corpus_news <- fn_preprocess(news_sample)
# free memory used by the raw samples
rm(blogs_sample, twitter_sample, news_sample)
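Before tokenizing, it is worth spot-checking that the transformations behaved as expected. A quick, illustrative check (not part of the original chunk) is to print one cleaned document:
# preview the first cleaned blog entry
as.character(corpus_blogs[[1]])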
The R package ‘ngram’ is used to tokenize the data and build the n-gram phrase tables.
# collapse each cleaned corpus into a single string for the ngram package
str_blogs <- concatenate(lapply(corpus_blogs, "[", 1))
str_twitter <- concatenate(lapply(corpus_twitter, "[", 1))
str_news <- concatenate(lapply(corpus_news, "[", 1))
rm(corpus_blogs, corpus_twitter, corpus_news)
# combine the three sources into one string
full_string <- concatenate(str_blogs, str_twitter, str_news)
rm(str_blogs, str_twitter, str_news)
## 1-grams
ng1 <- ngram(full_string, n = 1)
df_1gram <- get.phrasetable(ng1)
df_1gram <- df_1gram[order(df_1gram$freq, decreasing = TRUE), ]
## 2-grams
ng2 <- ngram(full_string, n = 2)
df_2gram <- get.phrasetable(ng2)
## 3-grams
ng3 <- ngram(full_string, n = 3)
df_3gram <- get.phrasetable(ng3)
## 4-grams
ng4 <- ngram(full_string, n = 4)
df_4gram <- get.phrasetable(ng4)
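Although the prediction model itself is future work, the phrase tables above already contain what is needed for a simple frequency-based lookup. The sketch below is illustrative only: the fn_predict name and the backoff logic are assumptions, and the input is expected to be cleaned the same way as the corpus (lowercase, no punctuation).
# illustrative next-word lookup with simple backoff from 3-grams to 2-grams
fn_predict <- function(prev_words, df_3gram, df_2gram) {
  prev_words <- tolower(trimws(prev_words))
  # try trigrams that start with the last two words typed
  hits <- df_3gram[grepl(paste0("^", prev_words, " "), df_3gram$ngrams), ]
  if (nrow(hits) == 0) {
    # back off to bigrams keyed on the last word only
    last_word <- tail(strsplit(prev_words, " ")[[1]], 1)
    hits <- df_2gram[grepl(paste0("^", last_word, " "), df_2gram$ngrams), ]
  }
  if (nrow(hits) == 0) return(NA_character_)
  # pick the most frequent matching phrase and return its final word
  hits <- hits[order(hits$freq, decreasing = TRUE), ]
  tail(strsplit(trimws(hits$ngrams[1]), " ")[[1]], 1)
}
# example: fn_predict("happy new", df_3gram, df_2gram)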
The word frequencies for 1-grams, 2-grams, and 3-grams are shown in the following plots (the 30 highest-frequency phrases are plotted for each):
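The plots are generated from the phrase tables; a minimal example of how such a plot can be produced is shown below for the 1-grams (the use of ggplot2 here is an assumption about the plotting approach).
library(ggplot2)
# plot the 30 most frequent 1-grams (df_1gram is already sorted by frequency)
top30 <- head(df_1gram, 30)
ggplot(top30, aes(x = reorder(ngrams, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "1-gram", y = "frequency", title = "Top 30 1-grams")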
Some important points to consider: