Executive Summary

This is the milestone report for the capstone project. The main tasks accomplished here are the following:

  1. Load the required packages for the analysis.

  2. Load the English training data (blogs, news and Twitter text) and clean it.

  3. Perform an exploratory analysis of word, bigram and trigram frequencies on a sample of the cleaned data.

The report ends with a section that outlines future work on the project.

Load the required packages

In this section we gather all the required packages for the analysis.

require(dplyr)    # data manipulation
require(stringi)  # string processing and word counts
require(tm)       # text mining: corpus creation and cleaning
require(RWeka)    # N-gram tokenization
require(ggplot2)  # plotting

Load and clean the data

The training data that forms the basis for the capstone project can be found at this Link. The zip file contains text from news, blogs and Twitter in four languages (English, German, Finnish and Russian). All work will be done in English, so only the datasets in the en_US folder will be used. First we read the three .txt files.

# Read each corpus line by line, preserving UTF-8 encoding and skipping embedded nuls
blgCon <- file("final/en_US/en_US.blogs.txt", open = "r")
blg <- readLines(blgCon, encoding = "UTF-8", skipNul = TRUE)
close(blgCon)

nwsCon <- file("final/en_US/en_US.news.txt", open = "r")
nws <- readLines(nwsCon, encoding = "UTF-8", skipNul = TRUE)
close(nwsCon)

twterCon <- file("final/en_US/en_US.twitter.txt", open = "r")
twter <- readLines(twterCon, encoding = "UTF-8", skipNul = TRUE)
close(twterCon)

# The connection objects are no longer needed
rm(blgCon, nwsCon, twterCon)

Using the ‘stringi’ package, we retrieve and display basic statistics for the three .txt files.

# File sizes in megabytes
blgSize <- file.info("final/en_US/en_US.blogs.txt")$size/1024^2
nwsSize <- file.info("final/en_US/en_US.news.txt")$size/1024^2
twterSize <- file.info("final/en_US/en_US.twitter.txt")$size/1024^2
# Number of lines per file
blgLines <- length(blg)
nwsLines <- length(nws)
twterLines <- length(twter)
# Total word counts per file
blgWords <- sum(stri_count_words(blg))
nwsWords <- sum(stri_count_words(nws))
twterWords <- sum(stri_count_words(twter))
          File Size (MB)   Number of Lines   Number of Words
Blogs          200.42            899288          37546239
News           196.28           1010242          34762395
Twitter        159.36           2360148          30093413
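
The summary table above can be assembled directly from the objects computed in the previous chunk; a minimal sketch (the object name statsTable is not part of the original code):

# Assemble the summary statistics into a single table (object name is arbitrary)
statsTable <- data.frame("File Size (MB)"  = c(blgSize, nwsSize, twterSize),
                         "Number of Lines" = c(blgLines, nwsLines, twterLines),
                         "Number of Words" = c(blgWords, nwsWords, twterWords),
                         row.names = c("Blogs", "News", "Twitter"),
                         check.names = FALSE)
statsTable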

As the table above shows, the three text files together exceed 500 MB in size and contain over 100 million words in total. Since the datasets are massive, we take only a 0.4% sample of each file (due to my computer's limited memory), roughly seventeen thousand lines in total, to carry out the exploratory analysis.
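
For reproducibility of the sample (and of the counts and plots below), the random seed can be fixed before sampling; a minimal, optional sketch with an arbitrary seed value:

set.seed(1234)  # arbitrary seed value; fixes the random sample so the report can be reproduced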

blgSample <- sample(blg, length(blg)*0.004)
nwsSample <- sample(nws, length(nws)*0.004)
twterSample <- sample(twter, length(twter)*0.004)
sampleText <- c(blgSample, nwsSample, twterSample)

On this sampled, merged text we now carry out the cleaning process, using the ‘tm’ package. We start by converting all text to lower case, then remove punctuation, extra white space, numbers, and English stopwords. We also remove any bad words included in the text: for this we use the list of bad words (lower case) banned by Google, found at this Link. The work being done is shown in the code chunk below.

sampleTextcorpus <- VCorpus(VectorSource(sampleText))
sampleTextcorpus <- tm_map(sampleTextcorpus, content_transformer(tolower))
sampleTextcorpus <- tm_map(sampleTextcorpus, removePunctuation, preserve_intra_word_dashes = TRUE)
sampleTextcorpus <- tm_map(sampleTextcorpus, stripWhitespace)
sampleTextcorpus <- tm_map(sampleTextcorpus, removeNumbers)
sampleTextcorpus <- tm_map(sampleTextcorpus, removeWords, stopwords("english"))
# removeWords expects a plain character vector, so the bad-word list is read with readLines only
badWordList <- readLines("final/en_US/full-list-of-bad-words_text-file_2018_07_30.txt")
sampleTextcorpus <- tm_map(sampleTextcorpus, removeWords, badWordList)
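
To sanity-check the cleaning steps, individual documents of the corpus can be inspected; a minimal sketch (the index 1 is arbitrary):

# Print the content of the first cleaned document as a quick sanity check
writeLines(as.character(sampleTextcorpus[[1]]))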

Exploratory Analysis

Having processed and cleaned the textual data as described above, we can now proceed to the data exploration, motivated by two questions mentioned in the course readings:

  1. Some words are more frequent than others - what are the distributions of word frequencies?

  2. What are the frequencies of 2-grams and 3-grams in the dataset?

Below we present the code chunk that computes the frequencies of N-grams for N = 1 (e.g., chief), N = 2 (e.g., chief executive) and N = 3 (e.g., chief executive officer).

# RWeka tokenizers that split text into unigrams, bigrams and trigrams, respectively
Token1 <- function(x){
        NGramTokenizer(x, control = Weka_control(min = 1, max = 1))}

Token2 <- function(x){
        NGramTokenizer(x, control = Weka_control(min = 2, max = 2))}

Token3 <- function(x){
        NGramTokenizer(x, control = Weka_control(min = 3, max = 3))}


# Document-term matrices for each N-gram size
gramN_1 <- DocumentTermMatrix(sampleTextcorpus, control = list(tokenize = Token1))
gramN_2 <- DocumentTermMatrix(sampleTextcorpus, control = list(tokenize = Token2))
gramN_3 <- DocumentTermMatrix(sampleTextcorpus, control = list(tokenize = Token3))
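
A note on memory: the frequency tables below convert each DocumentTermMatrix to a dense matrix with as.matrix(), which is fine for this small sample but can be prohibitive for larger ones. The ‘slam’ package (installed together with ‘tm’) can sum the sparse columns directly; a possible alternative, shown here only as a sketch and not used in this report:

# Lighter-weight alternative to colSums(as.matrix(dtm)) for larger samples:
# gramN_1FR <- sort(slam::col_sums(gramN_1), decreasing = TRUE)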

1. Unigrams

# Sum each term's frequency across all documents and sort in decreasing order
gramN_1FR <- sort(colSums(as.matrix(gramN_1)), decreasing = TRUE)
gramN_1FRdata <- data.frame(word = names(gramN_1FR), frequency = gramN_1FR)
head(gramN_1FRdata, 20)
##          word frequency
## just     just      1240
## will     will      1219
## one       one      1191
## said     said      1161
## like     like      1057
## can       can       983
## get       get       902
## time     time       852
## new       new       783
## now       now       701
## good     good       687
## day       day       650
## know     know       617
## people people       613
## love     love       612
## dont     dont       602
## back     back       591
## also     also       552
## see       see       545
## first   first       536
ggplot(data = gramN_1FRdata[1:20,], aes(reorder(word,-frequency), frequency)) +
        geom_bar(stat = "identity", fill = "#FF6666") +
        ggtitle("20 Most Frequent Unigrams") +
        xlab("Unigrams") + ylab("Frequency") +
        theme_classic() +
        theme(axis.text.x = element_text(angle = 60, hjust = 1))

2. Bigrams

gramN_2FR <- sort(colSums(as.matrix(gramN_2)),decreasing = TRUE)
gramN_2FRdata <- data.frame(word = names(gramN_2FR), frequency = gramN_2FR)
head(gramN_2FRdata, 20)
##                            word frequency
## right now             right now        97
## new york               new york        85
## last year             last year        80
## cant wait             cant wait        69
## last night           last night        65
## dont know             dont know        58
## high school         high school        54
## feel like             feel like        52
## good morning       good morning        50
## looking forward looking forward        50
## years ago             years ago        48
## let know               let know        47
## im going               im going        44
## last week             last week        43
## st louis               st louis        41
## dont think           dont think        39
## first time           first time        39
## come back             come back        38
## can get                 can get        36
## dont want             dont want        36
ggplot(data = gramN_2FRdata[1:20,], aes(reorder(word,-frequency), frequency)) +
        geom_bar(stat = "identity", fill = "#FF6666") +
        ggtitle("20 Most Frequent Bigrams") +
        xlab("Bigrams") + ylab("Frequency") +
        theme_classic() +
        theme(axis.text.x = element_text(angle = 60, hjust = 1))

3. Trigrams

gramN_3FR <- sort(colSums(as.matrix(gramN_3)),decreasing = TRUE)
gramN_3FRdata <- data.frame(word = names(gramN_3FR), frequency = gramN_3FR)
head(gramN_3FRdata, 20)
##                                                word frequency
## cant wait see                         cant wait see        19
## happy mothers day                 happy mothers day        17
## happy new year                       happy new year        11
## im pretty sure                       im pretty sure         8
## let us know                             let us know         8
## st louis county                     st louis county         8
## just got back                         just got back         7
## points per game                     points per game         7
## feel like im                           feel like im         6
## looking forward seeing       looking forward seeing         6
## new york city                         new york city         6
## according court documents according court documents         5
## bismarck north dakota         bismarck north dakota         5
## cant wait hear                       cant wait hear         5
## good morning everyone         good morning everyone         5
## grand theater bismarck       grand theater bismarck         5
## la la la                                   la la la         5
## new york times                       new york times         5
## president barack obama       president barack obama         5
## really need get                     really need get         5
ggplot(data = gramN_3FRdata[1:20,], aes(reorder(word,-frequency), frequency)) +
        geom_bar(stat = "identity", fill = "#FF6666") +
        ggtitle("20 Most Frequent Trigrams") +
        xlab("Trigrams") + ylab("Frequency") +
        theme_classic() +
        theme(axis.text.x = element_text(angle = 60, hjust = 1))

Future Work