The goal of this project is to demonstrate the ability to work with and explore the data, as a first step toward building a prediction algorithm and a data product.
The motivation for this project is to:
1. Demonstrate that data have been successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on plans for creating a prediction algorithm and Shiny app.
# Chosen packages, which may be of utility for the analysis:
library(dplyr); library(ggplot2); library(tibble); library(stringr); library(stringi); library(tm); library(SnowballC); library(RWeka);
library(RWekajars); library(quanteda)
Raw data, stored as text files, were downloaded to a local drive from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip (accessed 18 March 2022) and then read into R for further analysis.
The full data set includes blogs, tweets, and news articles written in English, Finnish, German, and Russian. For the purposes of this project, only the English text is analyzed.
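For reproducibility, a minimal sketch of how the archive could be fetched with base R is shown below; the download was actually performed manually, and the capstone/en_US/ folder layout used in the rest of this report reflects the local setup.
# Hypothetical download step (the archive was fetched manually for this report)
zipURL <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
if (!file.exists('Coursera-SwiftKey.zip')) {
  download.file(zipURL, destfile = 'Coursera-SwiftKey.zip', mode = 'wb')
  unzip('Coursera-SwiftKey.zip') # Extracted text files were then arranged under capstone/en_US/
}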
# Read each English-language source line by line (UTF-8, skipping embedded nuls)
blogs <- file('capstone/en_US/en_US.blogs.txt', 'r')
blogLines <- readLines(blogs, encoding = 'UTF-8', skipNul = TRUE, warn = FALSE)
close(blogs)
tweets <- file('capstone/en_US/en_US.twitter.txt', 'r')
tweetLines <- readLines(tweets, encoding = 'UTF-8', skipNul = TRUE, warn = FALSE)
close(tweets)
news <- file('capstone/en_US/en_US.news.txt', 'r')
newsLines <- readLines(news, encoding = 'UTF-8', skipNul = TRUE, warn = FALSE)
close(news)
For each source (blogs, tweets, news), the line count, word count, and character count are summarized in a simple table (file sizes are considered separately below).
# Summarize the three sources; named summaryTable to avoid masking base R's table()
summaryTable <- function(blogLines, tweetLines, newsLines) {
  data.frame(source = c('blogs', 'tweets', 'news'),
             lineCount = c(length(blogLines),
                           length(tweetLines),
                           length(newsLines)),
             words = c(sum(stri_count_words(blogLines)),
                       sum(stri_count_words(tweetLines)),
                       sum(stri_count_words(newsLines))),
             characters = c(stri_stats_general(blogLines)[3],
                            stri_stats_general(tweetLines)[3],
                            stri_stats_general(newsLines)[3]))
}
summaryTable(blogLines, tweetLines, newsLines)
## source lineCount words characters
## 1 blogs 899288 37546250 206824382
## 2 tweets 2360148 30093413 162096241
## 3 news 77259 2674536 15639408
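File sizes (in megabytes, MB) are not part of the table above; a small sketch of how they could be obtained with base R's file.size(), assuming the same local paths used earlier:
# Approximate on-disk sizes in MB, assuming the file paths used above
round(file.size(c('capstone/en_US/en_US.blogs.txt',
                  'capstone/en_US/en_US.twitter.txt',
                  'capstone/en_US/en_US.news.txt')) / 1024^2, 1)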
Given the sheer size of the data set, 20,000 lines of text from each of the three sources are sampled and then combined to form a corpus (a total of 60,000 lines of text).
set.seed(20220322)
sampling <- c(sample(blogLines, 20000), sample(tweetLines, 20000),
sample(newsLines, 20000))
length(sampling) # Should come up with 60000 as desired
## [1] 60000
stri_stats_general(sampling) # General statistics for the sample
## Lines LinesNEmpty Chars CharsNWhite
## 60000 60000 9968110 8264771
stri_stats_latex(sampling)
## CharsWord CharsCmdEnvir CharsWhite Words Cmds
## 7867964 24 2045495 1772188 7
## Envirs
## 0
This sample of the three sources yields just under 10 million characters and over 1.77 million words.
The sampled data are assembled into a corpus named textColl (short for text collection) and then cleaned; each cleaning step below carries a comment explaining its rationale.
textColl <- VCorpus(VectorSource(sampling))
spacing <- content_transformer(function(x, pattern) {
  gsub(pattern, " ", x) # Helper transformer: replace matches of a pattern with a space
})
removeURL <- function(u) {
  u <- stri_replace_all_regex(u, "(ht|f)tp\\S+\\s*", " ") # Drop http/https/ftp links
  stri_replace_all_regex(u, "www\\S+\\s*", " ") # Drop bare www links
}
removeHash <- function(h) {
  stri_replace_all_regex(h, "#\\S+", " ") # Drop Twitter hashtags
}
# URLs and hashtags are removed first, while '#', ':' and '/' are still present;
# running removePunctuation earlier would strip those characters and leave the patterns unmatched.
textColl <- tm_map(textColl, content_transformer(removeURL)) # To remove uniform resource locators, i.e. website addresses
textColl <- tm_map(textColl, content_transformer(removeHash)) # To remove Twitter hashtags
textColl <- tm_map(textColl, removePunctuation) # To remove punctuation
textColl <- tm_map(textColl, content_transformer(tolower)) # To convert all letters to lowercase for uniformity
textColl <- tm_map(textColl, stemDocument) # Stemming reduces, say, gerunds and past tenses to their stems, or root words (e.g. 'stemming' and 'stemmed' both become 'stem')
textColl <- tm_map(textColl, removeNumbers) # To remove numerals
textColl <- tm_map(textColl, removeWords, letters) # To remove any stray single letters - may take a moment
textColl <- tm_map(textColl, stripWhitespace) # To collapse any extraneous white space
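As a quick sanity check (not part of the original pipeline), one cleaned document can be printed to verify that the transformations took effect:
# Inspect the first cleaned document in the corpus
writeLines(as.character(textColl[[1]]))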
Part of the exercise is to build what are called n-grams to explore word-sequence frequencies. Specifically, these are unigrams (1-grams, single words), bigrams (2-grams, pairs of words), and trigrams (3-grams, strings of three words), created by 'tokenizing', i.e. extracting these n-gram patterns from the textColl corpus. An explanation can be found at http://en.wikipedia.org/wiki/N-gram.
# Note: Each tokenizing process may take a moment.
# For unigrams...
uniToken <- function(x) {
NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
unigrams <- DocumentTermMatrix(textColl, control = list(tokenize = uniToken))
unigrams <- removeSparseTerms(unigrams, 0.999)
# For bigrams...
biToken <- function(x) {
NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
bigrams <- DocumentTermMatrix(textColl, control = list(tokenize = biToken))
bigrams <- removeSparseTerms(bigrams, 0.999)
# For trigrams...
triToken <- function(x) {
NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
trigrams <- DocumentTermMatrix(textColl, control = list(tokenize = triToken))
trigrams <- removeSparseTerms(trigrams, 0.999)
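Given the empty-trigram issue described further below, a quick check of each matrix's dimensions is a useful guard (a sketch added here, not part of the original code):
# Number of documents (rows) and retained terms (columns) in each matrix
sapply(list(unigrams = unigrams, bigrams = bigrams, trigrams = trigrams), dim)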
# Will now measure n-grams' frequencies. Will also extract the top 25 from each word pattern.
freqs <- function(d) {
sort(colSums(as.matrix(d)), decreasing = T)
}
unifreqs <- freqs(unigrams)
unifreqs25 <- unifreqs[1:25] # Extracting the top 25
unifreqs25df <- data.frame(word = names(unifreqs25),
frequency = as.numeric(unifreqs25)
)
bifreqs <- freqs(bigrams)
bifreqs25 <- bifreqs[1:25] # Extracting the top 25
bifreqs25df <- data.frame(bigram = names(bifreqs25),
frequency = as.numeric(bifreqs25)
)
trifreqs <- freqs(trigrams)
trifreqs25 <- trifreqs[1:25] # Extracting the top 25
trifreqs25df <- data.frame(trigram = names(trifreqs25),
frequency = as.numeric(trifreqs25)
)
trifreqs50 <- trifreqs[26:50] # Extracting the next 25 trigram frequencies for curiosity
trifreqs50df <- data.frame(trigram50 = names(trifreqs50),
frequency = as.numeric(trifreqs50)
)
Extracting trigram frequencies was an exercise in trial and error. With stemming and stop-word removal applied, the unigram and bigram extractions ran seamlessly, but the trigram results came up empty. After some further attempts, I settled on leaving the stop words in, which produced the trigram frequencies I had hoped for.
From the literature I read, whether to stem or to remove stop words comes down to what works best for the task at hand. There are pros and cons to applying (or forgoing) stemming and stop-word removal. For example, stop words may be essential for phrase completeness, and stemming may unintentionally truncate a word that is already a stem. A case in point: note the phrases 'accord to the' and 'be abl to' in the trigram bar plot.
May re-explore ways to handle the stemming issue, because a cursory look at the top-25 trigrams, for example, reveals unintentional truncation of some words.
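One way to re-explore this, sketched here as a possible cross-check (quanteda is loaded above but has not been used so far), is to build trigrams from the raw sample without stemming and compare the phrasing:
# Unstemmed trigrams via quanteda, for comparison with the tm/RWeka pipeline
toks <- tokens(sampling, remove_punct = TRUE, remove_numbers = TRUE,
               remove_url = TRUE, remove_symbols = TRUE)
toks <- tokens_tolower(toks)
triDfm <- dfm(tokens_ngrams(toks, n = 3, concatenator = ' '))
topfeatures(triDfm, 25) # Top 25 unstemmed trigrams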
The quality of the trigrams, in terms of their syntax, looks promising. Will take the tokenizing process a bit further by increasing n-grams to four words.
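A sketch of that planned extension, following the same RWeka pattern used above (the object names are placeholders):
# For quadgrams (planned next step)...
quadToken <- function(x) {
  NGramTokenizer(x, Weka_control(min = 4, max = 4))
}
quadgrams <- DocumentTermMatrix(textColl, control = list(tokenize = quadToken))
quadgrams <- removeSparseTerms(quadgrams, 0.999)
quadfreqs <- freqs(quadgrams)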
Will then create a predictive algorithm based on the n-gram model.
From that predictive algorithm, will develop a data product (an app) that would predict words that are likely to follow a user’s manual entry.
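As a rough illustration of the direction (not the final algorithm), a simple back-off lookup over the frequency vectors built above might look like the following; predictNext is a hypothetical helper:
# Rough back-off sketch: try the last two words against the trigram table,
# then fall back to the last word against the bigram table.
# Note: trifreqs/bifreqs come from a stemmed, lowercased corpus, so user input
# may need the same preprocessing before lookup.
predictNext <- function(phrase, trifreqs, bifreqs) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trifreqs[startsWith(names(trifreqs), paste0(prefix, " "))]
    if (length(hits) > 0) {
      return(sub(".*\\s", "", names(hits)[which.max(hits)])) # Last word of best trigram
    }
  }
  hits <- bifreqs[startsWith(names(bifreqs), paste0(words[n], " "))]
  if (length(hits) > 0) {
    return(sub(".*\\s", "", names(hits)[which.max(hits)])) # Last word of best bigram
  }
  NA_character_ # No match in the sampled tables
}
predictNext("thanks for", trifreqs, bifreqs) # Result depends on the sampled corpus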