The goal of the Data Science Capstone project is to create a predictive text model. Natural Language Processing (NLP) techniques are used to perform the analysis and build the predictive model. A large text corpus of documents, collected from news, blogs and Twitter feeds, is provided to use as training data.
This report describes the exploratory data analysis steps conducted so far to explore the provided data, summarize it, clean it and make it ready for tokenization, a key step in NLP. The report also describes the n-gram lists generated to be used in the project's predictive model. Finally, the report lists the next steps to be taken to complete the capstone project and deliver the required application.
To keep the report brief and concise, all the code to read the data files, create a corpus from a training dataset, clean it and tokenize it, in addition to the code that generates the tables and plots of this report, is listed in the Appendix.
After loading all the R libraries required for the data analysis and plotting, the following steps were taken:
There are three data files, one for each data source (news, blogs, Twitter). Each file's lines were read and counted separately. In addition, the total number of words in each file, as well as the Words-per-Line (WPL) statistics (minimum, median, mean and maximum), were calculated. The following table presents the characteristics of each data file.
| File | Size | Lines_Num | Words_Total | WPL_min | WPL_med | WPL_mean | WPL_max |
|---|---|---|---|---|---|---|---|
| en_US.news.txt | 196.28 MB | 1010242 | 34762395 | 1 | 32 | 34.41 | 1796 |
| en_US.blogs.txt | 200.42 MB | 899288 | 37546250 | 0 | 28 | 41.75 | 6726 |
| en_US.twitter.txt | 159.36 MB | 2360148 | 30093413 | 1 | 12 | 12.75 | 47 |
A training dataset was generated by sampling all three data files. As the code provided in the Appendix for this section shows, a seed was set for reproducibility. After multiple attempts to choose a sample size/percentage for extracting lines from each of the files, constrained by the memory (16 GB) of the machine the analysis was run on and by how this affected later operations, a percentage of 2% was decided on. It is worth mentioning here, as the code in the Appendix shows, that every large object was removed from memory as soon as it was no longer needed, in order to free memory for subsequent operations and their objects.
The training dataset was built by taking a sample of 2% of each of the three data files. The created training_data has the following characteristics:
| Dataset | Lines_Num | Words_Total | WPL_min | WPL_med | WPL_mean | WPL_max |
|---|---|---|---|---|---|---|
| training_data | 85391 | 2038328 | 1 | 16 | 23.87 | 716 |
A corpus was created from the training dataset, using R's tm text-mining library. After removing the large training data from memory, a set of transformations was applied to clean the corpus: all web/email addresses, Twitter hashtags/mentions, profanity words, numbers, punctuation, stopwords and extra white-space were removed. All words in the corpus were converted to lowercase and then stemmed (reduced to their common root or base form). Finally, the corpus was transformed to plain-text documents to be used in the tokenization process.
Tokenization, in NLP, is the process of identifying tokens, such as words or pairs of words. These tokens are usually called n-grams. An n-gram is a contiguous sequence of n items (words, in this case) from a given sample of text. An n-gram of size 1 word is referred to as a unigram, size 2 words is a bigram and size 3 words is a trigram. In this project, R's RWeka library was used to tokenize the corpus and generate three lists: one for unigrams, a second for bigrams and a third for trigrams. The lists are then converted to data-frames, each with two columns. The first column holds the terms (a term is 1 word for a unigram, 2 words for a bigram and 3 words for a trigram). The second column holds the frequencies with which these terms appear in the corpus. The code to tokenize the corpus and create these data-frames is provided in the Appendix.
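As a quick illustration of how these tokenizers work, the snippet below applies RWeka's NGramTokenizer to a toy sentence (illustrative only, not taken from the project data):

## illustrative example: n-grams extracted from a toy sentence (not part of the project corpus)
library(RWeka)
toy_text <- "the quick brown fox jumps"
NGramTokenizer(toy_text, Weka_control(min = 1, max = 1)) ## unigrams: "the", "quick", "brown", "fox", "jumps"
NGramTokenizer(toy_text, Weka_control(min = 2, max = 2)) ## bigrams: "the quick", "quick brown", "brown fox", "fox jumps"
NGramTokenizer(toy_text, Weka_control(min = 3, max = 3)) ## trigrams: "the quick brown", "quick brown fox", "brown fox jumps"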
There were 8503 unigrams generated by the tokenization process. The following table shows statistics of these terms (their minimum, median, mean and maximum frequency values).
| ngram_type | terms_num | freq_min | freq_med | freq_mean | freq_max |
|---|---|---|---|---|---|
| unigram | 8503 | 9 | 29 | 112.76 | 6324 |
The top 10 unigrams are listed here, followed by a bar-plot showing the frequencies for the top 30 unigrams.
| term | will | said | just | get | one | like | time | can | day | year |
|---|---|---|---|---|---|---|---|---|---|---|
| freq | 6324 | 6196 | 6051 | 6038 | 5978 | 5848 | 5181 | 4993 | 4435 | 4210 |
There were 5484 bigrams generated by the tokenization process. The following table shows statistics of these terms (their minimum, median, mean and maximum frequency values).
| ngram_type | terms_num | freq_min | freq_med | freq_mean | freq_max |
|---|---|---|---|---|---|
| bigram | 5484 | 9 | 13 | 19.38 | 487 |
The top 10 bigrams are listed here, followed by a bar-plot showing the frequencies for the top 30 bigrams.
| term | right now | last year | look like | new york | feel like | year ago | last night | look forward | high school | make sure |
|---|---|---|---|---|---|---|---|---|---|---|
| freq | 487 | 433 | 416 | 386 | 339 | 339 | 338 | 314 | 287 | 267 |
There were 107 trigrams generated by the tokenization process. The following table shows statistics of these terms (their minimum, median, mean and maximum frequency values).
| ngram_type | terms_num | freq_min | freq_med | freq_mean | freq_max |
|---|---|---|---|---|---|
| trigram | 107 | 9 | 12 | 14.74 | 63 |
The top 10 trigrams are listed here, followed by a bar-plot showing the frequencies for the top 20 trigrams.
| term | freq |
|---|---|
| happi mother day | 63 |
| new york citi | 56 |
| let us know | 51 |
| happi new year | 38 |
| new york time | 32 |
| presid barack obama | 32 |
| look forward see | 31 |
| world war ii | 28 |
| two year ago | 27 |
| cinco de mayo | 26 |
This report documented the process of exploring the given data files, generating a training dataset by sampling these files, and then creating and cleaning/tidying a corpus from the training dataset. As a step toward building a predictive next-word model, the corpus was tokenized. Three n-gram data-frames were produced: one for unigrams, another for bigrams, and a third for trigrams. Each n-gram data-frame holds the terms and their frequencies.
The next steps for this project include:
- Building the predictive next-word model using the generated n-gram frequency data-frames (a minimal illustration is sketched below).
- Developing the required application that uses the model to suggest the next word for a phrase entered by the user, and evaluating its accuracy and performance.
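As a rough illustration of how the n-gram frequency data-frames could feed such a model, the sketch below implements a simple frequency-based back-off lookup. The function name predict_next_word and the back-off strategy are illustrative assumptions, not the project's final model, and the input phrase is assumed to be cleaned and stemmed the same way as the corpus.

## illustrative sketch only: frequency-based back-off using the Appendix's trigramDF/bigramDF
predict_next_word <- function(phrase, trigramDF, bigramDF) {
  ## keep the last two (cleaned, stemmed) words of the input phrase
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(words) == 2) {
    ## try trigrams first: match the last two words, take the most frequent completion
    hits <- trigramDF[grepl(paste0("^", words[1], " ", words[2], " "), trigramDF$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$term[which.max(hits$freq)]))
  }
  ## back off to bigrams: match only the last word
  hits <- bigramDF[grepl(paste0("^", tail(words, 1), " "), bigramDF$term), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$term[which.max(hits$freq)]))
  NA_character_  ## no prediction found
}
## example (based on the trigram table above): predict_next_word("new york", trigramDF, bigramDF) returns "citi"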
This appendix lists all R code chunks used to produce this report.
## load required libraries
library(kableExtra)
library(stringi)
library(tm)
library(jsonlite)
library(RWeka)
library(ggplot2)
## open data files**, read data lines, then close the files' connections
## **all data files are assumed to be downloaded and placed in R's working directory.
news_file_name <- "en_US.news.txt"
blogs_file_name <- "en_US.blogs.txt"
twitter_file_name <- "en_US.twitter.txt"
news_file_con <- file(news_file_name, "r")
blogs_file_con <- file(blogs_file_name, "r")
twitter_file_con <- file(twitter_file_name, "r")
news_lines <- readLines(news_file_con, skipNul = TRUE)
blogs_lines <- readLines(blogs_file_con, skipNul = TRUE)
twitter_lines <- readLines(twitter_file_con, skipNul = TRUE)
close(news_file_con) ; close(blogs_file_con) ; close(twitter_file_con)
## get the words counts and statistics for the data files
news_lines_words_count <- stri_count_words(news_lines)
blogs_lines_words_count <- stri_count_words(blogs_lines)
twitter_lines_words_count <- stri_count_words(twitter_lines)
news_lines_word_summary <- c (sum(news_lines_words_count), summary(news_lines_words_count))
blogs_lines_word_summary <- c (sum(blogs_lines_words_count), summary(blogs_lines_words_count))
twitter_lines_word_summary <- c (sum(twitter_lines_words_count), summary(twitter_lines_words_count))
## put all data files' info collected in a data-frame
files_summary <- data.frame(File = c(news_file_name, blogs_file_name, twitter_file_name),
Size = c( paste(round(file.info(news_file_name)$size/(1024^2), 2), " MB"),
paste(round(file.info(blogs_file_name)$size/(1024^2), 2), " MB"),
paste(round(file.info(twitter_file_name)$size/(1024^2), 2), " MB") ),
Lines_Num = c( length(news_lines), length(blogs_lines), length(twitter_lines) ),
Words_Total = c( news_lines_word_summary[1], blogs_lines_word_summary[1],
twitter_lines_word_summary[1]),
# word per line (WPL) stats
WPL_min = c( news_lines_word_summary[2], blogs_lines_word_summary[2],
twitter_lines_word_summary[2]),
WPL_med = c( news_lines_word_summary[4], blogs_lines_word_summary[4],
twitter_lines_word_summary[4]),
WPL_mean = c( round(news_lines_word_summary[5], 2), round(blogs_lines_word_summary[5], 2),
round(twitter_lines_word_summary[5], 2)),
WPL_max = c( news_lines_word_summary[7], blogs_lines_word_summary[7], twitter_lines_word_summary[7]) )
## show a summary of the three data files info as a table.
kable(files_summary, align = "lccccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
set.seed(12345) ## set for reproducibility
sample_size <- 0.02 ## larger values exhausted the machine's memory during later operations
## build training data set
training_data <- c(sample(news_lines, length(news_lines)*sample_size, replace = FALSE),
sample(blogs_lines, length(blogs_lines)*sample_size, replace = FALSE),
sample(twitter_lines, length(twitter_lines)*sample_size, replace = FALSE))
## remove the big objects that are no longer needed, to free memory
rm(news_lines, blogs_lines, twitter_lines)
## put all training dataset's info in a data-frame
training_lines_words_count <- stri_count_words(training_data)
training_lines_word_summary <- c(sum(training_lines_words_count), summary(training_lines_words_count))
training_data_summary <- data.frame( Dataset = "training_data",
Lines_Num = length(training_data),
Words_Total = training_lines_word_summary[1],
WPL_min = training_lines_word_summary[2],
WPL_med = training_lines_word_summary[4],
WPL_mean = round(training_lines_word_summary[5], 2),
WPL_max = training_lines_word_summary[7] )
## show a summary of the training dataset's info as a table.
kable(training_data_summary, row.names = FALSE, align = "lcccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
## build corpus from the training set
corpus <- VCorpus(VectorSource(training_data))
## after creating the corpus, the big training_data object is no longer needed and is removed to free memory
rm(training_data)
## clean corpus, from web and email addresses; and also from twitter hash tags and mentions
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")
corpus <- tm_map(corpus, toSpace, "#[[:alnum:]_]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
## clean corpus, from profanity words
profanity <- fromJSON("https://raw.githubusercontent.com/zautumnz/profane-words/refs/heads/master/words.json")
profanity <- profanity[!grepl(" ", profanity)] # removing words with spaces from the list
corpus <- tm_map(corpus, removeWords, profanity)
rm(profanity) ## remove from memory, as it is no longer needed
## clean and tidy corpus, by setting all words to lowercase and to their root-words;
## by removing stop-words, punctuation, numbers and white-spaces; then make plain-text
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
## tokenization functions
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# build the Term-Document-Matrices (TDM) with tokenization and removal of sparse terms
unigramTDM_freq <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = unigram)), 0.9999)
bigramTDM_freq <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999)
trigramTDM_freq <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999)
## remove the corpus object, which is no longer needed, to free memory
rm(corpus)
# get the terms' frequencies from the TDMs, and sort them
unigram_freq <- sort(rowSums(as.matrix(unigramTDM_freq)), decreasing=TRUE)
bigram_freq <- sort(rowSums(as.matrix(bigramTDM_freq)), decreasing=TRUE)
trigram_freq <- sort(rowSums(as.matrix(trigramTDM_freq)), decreasing=TRUE)
# create uni/bi/tri-gram data-frames holding each one's terms and their frequencies
unigramDF <- data.frame(term=names(unigram_freq), freq = unigram_freq)
bigramDF <- data.frame(term=names(bigram_freq), freq = bigram_freq)
trigramDF <- data.frame(term=names(trigram_freq), freq = trigram_freq)
## put the unigramDF summary in a data-frame
uni_summary <- c(nrow(unigramDF), summary(unigramDF$freq))
unigrams_summary <- data.frame( ngram_type = "unigram",
terms_num = uni_summary[1],
freq_min = uni_summary[2],
freq_med = uni_summary[4],
freq_mean = round(uni_summary[5], 2),
freq_max = uni_summary[7] )
## show the unigramsDF summary as a table.
kable(unigrams_summary, row.names = FALSE, align = "lccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
## present a table of the top 10 frequent unigrams
df_transposed <- as.data.frame(t(head(unigramDF,10)))
kable(df_transposed, col.names = NULL) %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = "striped") %>%
column_spec(1, bold = TRUE, background = "#ffcccb") %>%
column_spec(seq(3, 11, by = 2), color = "darkred")
## plot the top 30 frequent unigrams
head(unigramDF,30) %>%
ggplot(aes(reorder(term,-freq), freq)) +
geom_bar(stat = "identity", fill = "darkred") +
ggtitle("30 Most Frequent Unigrams") +
xlab("Unigrams") + ylab("Frequency") +
theme(panel.grid.major.x = element_blank()) +
theme(plot.title = element_text(hjust = 0.5, color = "darkred", face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, color = "darkred", face = "bold"))
## put the bigramDF summary in a data-frame
bi_summary <- c(nrow(bigramDF), summary(bigramDF$freq))
bigrams_summary <- data.frame( ngram_type = "bigram",
terms_num = bi_summary[1],
freq_min = bi_summary[2],
freq_med = bi_summary[4],
freq_mean = round(bi_summary[5], 2),
freq_max = bi_summary[7] )
## show the bigramsDF summary as a table.
kable(bigrams_summary, row.names = FALSE, align = "lccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
## present a table of the top 10 frequent bigrams
df_transposed <- as.data.frame(t(head(bigramDF,10)))
kable(df_transposed, col.names = NULL) %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = "striped") %>%
column_spec(1, bold = TRUE, background = "lightblue") %>%
column_spec(seq(3, 11, by = 2), color = "steelblue")
## plot the top 30 frequent bigrams
head(bigramDF,30) %>%
ggplot(aes(reorder(term,-freq), freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
ggtitle("30 Most Frequent Bigrams") +
xlab("Bigrams") + ylab("Frequency") +
theme(panel.grid.major.x = element_blank()) +
theme(plot.title = element_text(hjust = 0.5, color = "steelblue", face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, color = "steelblue", face = "bold"))
## put the trigramDF summary in a data-frame
tri_summary <- c(nrow(trigramDF), summary(trigramDF$freq))
trigrams_summary <- data.frame( ngram_type = "trigram",
terms_num = tri_summary[1],
freq_min = tri_summary[2],
freq_med = tri_summary[4],
freq_mean = round(tri_summary[5], 2),
freq_max = tri_summary[7] )
## show the trigramsDF summary as a table.
kable(trigrams_summary, row.names = FALSE, align = "lccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
## present a table of the top 10 frequent trigrams
kable(head(trigramDF,10), row.names = FALSE) %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = "striped") %>%
row_spec(0, bold = TRUE, background = "lightgreen") %>%
row_spec(seq(2, 10, by = 2), color = "darkgreen")
## plot the top 20 frequent trigrams
head(trigramDF,20) %>%
ggplot(aes(reorder(term,-freq), freq)) +
geom_bar(stat = "identity",fill = "darkgreen") +
ggtitle("20 Most Frequent Trigrams") +
xlab("Trigrams") + ylab("Frequency") +
theme(panel.grid.major.x = element_blank()) +
theme(plot.title = element_text(hjust = 0.5, color = "darkgreen", face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, color = "darkgreen", face = "bold"))