1.0 Executive Summary.

The key partners for this project are SwiftKey and Coursera.
The project explores the Natural Language Processing facet of Data Science, in which a large corpus of text documents is used to predict the next word from a preceding input.
The goal of this milestone report is to present exploratory insights obtained while working with the provided Capstone Dataset and to show that the project is on track to create a prediction algorithm.
This report, published on R Pubs, explains the exploratory analysis and the goals for the eventual app / algorithm.
It covers only the major features of the data identified so far, together with a brief summary of the plans for creating the prediction algorithm and Shiny app.

1.1 Major findings.

Restricting exploration to the en_US corpus:

  1. The downloaded data is large and required manipulation, cleaning and sorting. Several functions were created for that purpose and are listed in the appendices.
  2. More data wrangling may be needed to create the prediction app.
  3. The summary statistics in this report indicate the direction to take for the predictive model, in particular the use of n-grams in the algorithm.
  4. Computing resource use was fairly large even on just the sample training set.

In conclusion, suggestions and comments from the upcoming peer review are welcome.

2.0 Libraries and Options.

# Libraries and options
library(stringi)    # For string processing and examination.
library(caTools)    # For splitting data into training and validation sets.
source('./R.script.sources/funs_dataprep.R')  # Data preparation functions (Appendix A1); loads quanteda.
source('./R.script.sources/funs_display.R')   # Visualization functions (Appendix A2).
quanteda_options(threads = 8)  # Use 8 threads for quanteda operations.

3.0 Data Corpus.

3.1 Data Load.

Only the en_US files were used for exploratory work and were downloaded and stored locally for convenience.

  • blogs: ./final/en_US.blogs.txt
  • news: ./final/en_US.news.txt
  • twitter: ./final/en_US.twitter.txt
# Read in data
blogs   <- readLines('./data/en_US/en_US.blogs.txt', encoding = "UTF-8", skipNul = TRUE)
news    <- readLines('./data/en_US/en_US.news.txt', encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines('./data/en_US/en_US.twitter.txt', encoding = "UTF-8", skipNul = TRUE)

3.2 Data Summary.

The data sets were examined and tabulated to give a sense of their contents.
The intent was to:

  1. Assess the object sizes.
  2. Assess the available number of lines, characters, white space characters and words.
  3. Calculate some stats on the number of words per line (WPL).
fun.sumtab(blogs, news, twitter)
Dataset  SizeMB      Lines        Chars  CharsWhite       Words  WPLmin  WPLmean  WPLmax
blogs    255.35    899,288  206,824,382  42,636,700  37,570,839       0    41.75   6,726
news      19.77     77,259   15,639,408   3,096,618   2,651,432       1    34.62   1,123
twitter  318.99  2,360,148  162,096,241  35,958,529  30,451,170       1    12.75      47

The initial examination showed that the data objects were very large (up to about 320 MB).
Surprisingly, the twitter data set was the largest in size but held fewer words than blogs.
The statistics also show a wide range of words per line, from 0 up to over 6,700 in the blogs data set.
The number of white space characters was also significant, at around 20% of the characters in each data set.
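
As a rough cross-check of that figure, the white-space proportion can be computed directly from the values in the summary table above (an illustrative calculation only, not part of the report pipeline):

# White-space characters as a share of all characters, values taken from the summary table.
chars       <- c(blogs = 206824382, news = 15639408, twitter = 162096241)
chars_white <- c(blogs = 42636700, news = 3096618, twitter = 35958529)
round(chars_white / chars * 100, 1)
# blogs: 20.6, news: 19.8, twitter: 22.2 (percent)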

3.3 Data Sampling.

Given the size of the data, further exploratory work was carried out on smaller samples (1%) to reduce computing resource requirements.

# Sample and combine data.
set.seed(1234)  # Repeatable sampling.
samplesize <- 0.01
blogs_smp   <- sample(blogs, length(blogs) * samplesize)
news_smp    <- sample(news, length(news) * samplesize)
twitter_smp <- sample(twitter, length(twitter) * samplesize)
combined <- c(blogs_smp, news_smp, twitter_smp)

### Summary of samples
fun.sumtab(blogs_smp, news_smp, twitter_smp)
Dataset      SizeMB   Lines      Chars  CharsWhite    Words  WPLmin  WPLmean  WPLmax
blogs_smp      2.55   8,992  2,068,110     428,117  377,158       1    41.79     677
news_smp       0.20     772    156,344      31,021   26,624       1    34.74     195
twitter_smp    3.23  23,601  1,620,028     359,400  303,982       1    12.73      35
rm(blogs, blogs_smp, news, news_smp, twitter, twitter_smp)

As can be seen in the table above, the 1% samples are much smaller and easier to work with.
Note, however, that the mean words per line in each sample remains close to that of the original data set.

The samples were further split into training and validation sets for later use to validate the prediction app.

# Split into training and validation sets (80% / 20%).
split <- sample.split(combined, 0.8)
train <- subset(combined, split == TRUE)
valid <- subset(combined, split == FALSE)
rm(combined)

3.4 Data Cleaning, Corpus Building and Tokenization.

The combined sample data was transferred into a quanteda corpus and cleaned. Tokenization (n-grams) and the subsequent frequency analysis were conducted using the quanteda package.
The cleaning steps were:

  1. Remove punctuation.
  2. Remove symbols.
  3. Remove numbers.
  4. Remove URLs.
  5. Remove separators.
  6. Convert characters to lowercase.
  7. Remove stop words.
  8. Remove profanity using the bad-words.txt list from www.cs.cmu.edu.
# Transfer to quanteda corpus format. 
train <- corpus(train)

# Set up vector of profanity words.
profanity <- readLines('./data/bad-words.txt')

# Clean & Tokenize ie uni grams
train1 <- tokens(train,
                 what= "word",
                 remove_punct = T,
                 remove_symbols = T,
                 remove_numbers = T,
                 remove_url = T,
                 remove_separators = T
                 ) %>%
    tokens_tolower() %>%
    tokens_remove(pattern= stopwords("en")) %>%
    tokens_remove(pattern= as.list(profanity))

# Create bi-grams and tri-grams from the cleaned uni-gram tokens.
train2 <- fun.ngram(train1, 2)
train3 <- fun.ngram(train1, 3)
rm(train, profanity)

# Frequency tables ####
freqtrain1 <- fun.freq(train1)
freqtrain2 <- fun.freq(train2)
freqtrain3 <- fun.freq(train3)
rm(train1, train2, train3)
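
For reference, the frequency tables produced by fun.freq() are plain data frames (via quanteda's textstat_frequency()); the columns used by the display function below are feature and frequency. A quick inspection, shown for illustration only:

# Peek at the most frequent uni-grams; columns include feature, frequency, rank and docfreq.
head(freqtrain1, 3)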

3.5 N-gram Visualizations and Analysis.

Visualizations were created using summary tables, word cloud diagrams, bar plots and coverage plots.

3.5.1 Uni-Grams.

fun.display(freqtrain1, "Uni-grams")

It was noted that 13,612 uni-gram phrases covered 90% of the uni-gram occurrences in the training sample.

3.5.2 Bi-Grams.

fun.display(freqtrain2, "Bi-grams")

It was noted that 201,515 bi-gram phrases covered 90% of the bi-gram occurrences in the training sample.

3.5.3 Tri-Grams.

fun.display(freqtrain3, "Tri-grams")

It was noted that 212,692 tri-gram phrases covered 90% of the tri-gram occurrences in the training sample.
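
The phrase counts quoted above come from the cumulative coverage calculation inside fun.display() (Appendix A2). In isolation, the same figure can be reproduced from a frequency table as follows (a minimal sketch using the uni-gram table):

# Cumulative share of all uni-gram occurrences covered by the most frequent phrases.
coverage <- cumsum(freqtrain1$frequency) / sum(freqtrain1$frequency)
# Number of phrases needed to reach 90% coverage.
which(coverage >= 0.9)[1]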

4.0 Analysis.

From section 3.5 it was noted that:

  • the tokenized n-grams cover the sampled en_US training corpus at the 90% level with the respective numbers of phrases quoted above; whether this is representative of the language as a whole is difficult to determine at this stage.
  • the n-gram frequency tables show that a large number of phrases occur with high frequency.
  • cleaning the data with the listed steps significantly reduces the size of the data set.
  • foreign language phrases have not yet been considered; handling them may require a foreign language filter dictionary, similar to the profanity filter used.
  • some phrases are repeated, such as the "happy mothers day" phrase in the tri-grams, suggesting further cleaning; stemming (conversion of words to their root form) may be necessary (a minimal sketch appears after the memory summary below).
  • the functions used should be transferable to larger sample data sets, provided some performance tuning is done.
gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  3061963 163.6    9845976 525.9 12778614 682.5
Vcells 13262602 101.2   49436284 377.2 96234401 734.3
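
As mentioned in the list above, stemming is one candidate for further cleaning. quanteda provides tokens_wordstem() for this; the minimal sketch below assumes it would slot into the tokenization pipeline of section 3.4, before the tokens objects are removed (it is not yet part of the pipeline, and the object name train1_stemmed is only illustrative):

# Possible additional cleaning step: reduce each token to its root form before building n-grams.
train1_stemmed <- tokens_wordstem(train1, language = "en")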

5.0 Plans for Prediction Model.

The next step in the project is text prediction modeling.
The following studies will need to be conducted:

  • Scaling the n-gram frequency approach to larger samples, with performance tuning of the sourced functions.
  • Investigating further cleaning steps such as stemming and a foreign language filter, as noted in section 4.0.
  • Building a prediction algorithm based on the uni-, bi- and tri-gram frequency tables (a rough illustrative sketch follows this list) and testing it against the validation set created in section 3.3.
  • Wrapping the final algorithm in a Shiny app for next-word prediction.
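
As a rough indication of the intended direction, a naive next-word lookup against the tri-gram frequency table might look like the sketch below. The function name predict_next and its logic are hypothetical illustrations only; the actual algorithm, including how unseen phrases are handled, is still to be designed:

# Hypothetical sketch: find the most frequent tri-gram starting with the given two-word
# phrase and return its last word as the predicted next word.
predict_next <- function(phrase, freq3) {
    candidates <- freq3[startsWith(freq3$feature, paste0(phrase, " ")), ]
    if (nrow(candidates) == 0) return(NA_character_)
    # textstat_frequency() output is sorted by frequency, so the first match is the most frequent.
    tail(strsplit(candidates$feature[1], " ")[[1]], 1)
}
predict_next("happy mothers", freqtrain3)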

Appendices: Functions sourced.

A1. Data Preparation Functions.

## functions dataprep.R

# Libraries
library(quanteda)   # Textual data analyzer.
library(kableExtra) # To display tables.

# Data summary table function
fun.sumtab <- function(t1, t2, t3) {
    # Capture row labels from the argument names.
    lbl1 <- substitute(t1); lbl2 <- substitute(t2); lbl3 <- substitute(t3)
    lbl <- sapply(c(lbl1, lbl2, lbl3), deparse)
    # Object sizes in MB.
    SizeMB <- c(object.size(t1), object.size(t2), object.size(t3)) / 1024^2
    # Words-per-line statistics.
    WPL <- sapply(list(t1, t2, t3), function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
    rownames(WPL) <- c('WPLmin', 'WPLmean', 'WPLmax')
    # Assemble the output table.
    stats <- data.frame(Dataset = lbl, SizeMB,
                        t(rbind(sapply(list(t1, t2, t3), stri_stats_general)[c('Lines', 'Chars'), ],
                                sapply(list(t1, t2, t3), stri_stats_latex)[c('CharsWhite', 'Words'), ], WPL)))
    # Print summary table.
    kbl(stats, digits = 2, format.args = list(big.mark = ",", scientific = FALSE)) %>%
        kable_minimal(full_width = TRUE, position = "float_right")
}

# n-grams: build n-grams of the requested order from a tokens object.
fun.ngram <- function(x, gram = 1) {
    tokens_ngrams(x,
                  n = gram,
                  concatenator = " "
                  )
}

# Frequency analysis
fun.freq <- function(x) {
    textstat_frequency(dfm(x))
}

A2. Visualization Functions.

## functions display.R

# Libraries
library(ggwordcloud)  # Word frequency visualizer.
library(ggplot2)      # Grammar of Graphics package for visualizations.
library(grid)         # To manipulate visualizations.
library(gridExtra)    # Supplement to grid.
library(RColorBrewer) # Color palette.

fun.display <- function(x, title) {
    
    # Create ngram frequency display table.
    table <- tableGrob(x[1:10,1:3], rows = NULL)

    # Create Wordcloud
    wc <- ggplot(x[1:10,], aes(label= feature, size= frequency, color= frequency)) +
        geom_text_wordcloud() +
        scale_size_area(max_size= 12) +
        scale_color_distiller(palette = "YlOrRd", direction= 1) 
    
    # Create BarPlot
    plot <- ggplot(x[1:10, 1:2], aes(y = reorder(feature, frequency), x = frequency, fill = frequency)) +
        geom_bar(stat = "identity") +
        scale_fill_distiller(palette = "YlOrRd", direction= 1) +
        theme(legend.position= "none") +
        labs(y="Phrase")

    # Create Coverage Plot
    p <- cumsum(x$frequency) / sum(x$frequency)
    p.5 <- which(p >= 0.5)[1]; p.9 <<- which(p >= 0.9)[1]; tit <<- title
    p <- data.frame(p)
    p$idx <- as.numeric(row.names(p))
    names(p) <- c("pr", "idx")
    cover <- ggplot(p, aes(y = idx, x = pr, fill = pr)) +
        geom_col() +
        scale_fill_distiller(palette = "YlOrRd", direction = 1) +
        theme(legend.position = "none") +
        labs(y = "Phrase Count", x = "Probability of Coverage") +
        geom_vline(xintercept = 0.5, color = "steelblue", size = 1) +
        geom_vline(xintercept = 0.9, color = "steelblue", size = 1) +
        annotate("text", x = .45, y = p.5 + 1025, label = paste(format(p.5, big.mark = ","), "Phrases", "\n", "@ 50% Coverage")) +
        annotate("text", x = .85, y = p.9 + 1025, label = paste(format(p.9, big.mark = ","), "Phrases", "\n", "@ 90% Coverage"))

    # Assemble Figure
    layout <- rbind(c(1, 2),
                    c(3, 4))
    grid.arrange(table, cover, plot, wc,
                 layout_matrix = layout,
                 top = textGrob(paste("Top 10", title, "out of",
                                      format(dim(x)[1], big.mark = ","), sep = " "),
                                gp = gpar(fontsize = 20)))
}