Overview

The Data Science Capstone project’s goal is to create a predictive text model. Natural Language Processing (NLP) techniques will be used to perform the analysis and build the predictive model. A large text corpus of documents, collected from news, blogs and Twitter feeds, is provided to use as training data.

This report describes the exploratory data analysis steps conducted so far to explore the data provided, summarize it, clean it and make it ready for tokenization, a key step in NLP. The report also describes the n-gram lists generated for use in the project’s predictive model. Finally, the report lists the next steps to be taken to complete the capstone project and deliver the required application.

To keep the report brief and concise, all the code used to read the data files, create a corpus from a training dataset, clean it and tokenize it, as well as the code used to generate the tables and plots of this report, is listed in the Appendix.

Exploratory Data Analysis

After loading all R libraries required to do the data analysis and plotting, the following steps were taken:

Reading and Analyzing Data Files

There are three data files, one for each data source (news, blogs, Twitter). Each file’s lines were read and counted separately. In addition, the total number of words in each file and the Words-per-Line (WPL) statistics (minimum, median, mean and maximum) were calculated. The following table presents the characteristics of each data file.

File               Size       Lines_Num  Words_Total  WPL_min  WPL_med  WPL_mean  WPL_max
en_US.news.txt     196.28 MB    1010242     34762395        1       32     34.41     1796
en_US.blogs.txt    200.42 MB     899288     37546250        0       28     41.75     6726
en_US.twitter.txt  159.36 MB   2360148      30093413        1       12     12.75       47
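
The Words-per-Line statistics above are derived from per-line word counts obtained with stringi’s stri_count_words(), as the Appendix code shows; a minimal illustration:

library(stringi)
stri_count_words(c("Short tweet.", "A slightly longer blog sentence with more words."))
## expected result: 2 8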


Training Dataset Creation

A training dataset was generated by sampling all three data files. As the code provided in the Appendix for this section shows, a seed was set for reproducibility. After multiple attempts to choose a sample size (the percentage of lines to extract from each file), constrained by the memory of the machine used for the analysis (16GB) and by how the sample size affected later operations, a percentage of 2% was settled on. It is worth mentioning here that, as the Appendix code shows, every large object was removed from memory as soon as it was no longer needed, in order to free memory for the subsequent operations and their objects.

The training dataset was built by taking a sample of 2% of the lines of each of the three data files. The resulting training_data has the following characteristics:

Dataset Lines_Num Words_Total WPL_min WPL_med WPL_mean WPL_max
training_data 85391 2038328 1 16 23.87 716


Corpus Creation and Cleaning

A corpus was created from the training dataset using R’s tm text-mining library. After removing the large training dataset from memory, a set of transformations was applied to clean the corpus: all web/email addresses, Twitter hashtags/mentions, profanity words, numbers, punctuation, stopwords and extra white-space were removed. All words in the corpus were converted to lowercase and then stemmed (reduced to their common root or base form). Finally, the corpus was converted to plain-text documents to be used in the tokenization process.
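
Stemming is why shortened forms such as “happi” and “citi” appear in the n-gram tables later in this report. A quick illustration (assuming the SnowballC package, which supplies the stemmer behind tm’s stemDocument(), is installed):

library(SnowballC)
wordStem(c("happiness", "cities", "looking"), language = "english")
## expected result: "happi" "citi" "look"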


N-gram Tokenization

Tokenization, in NLP, is the process of splitting text into tokens, such as words or sequences of words; these tokens are usually called n-grams. An n-gram is a contiguous sequence of n items (words, in this case) from a given text corpus. An n-gram of size 1 is referred to as a unigram, size 2 as a bigram and size 3 as a trigram. In this project, R’s RWeka library was used to tokenize the corpus and generate three lists: one for unigrams, a second for bigrams and a third for trigrams. The lists were then converted to data frames, each with two columns: the first column holds the terms (a term is 1 word for a unigram, 2 for a bigram and 3 for a trigram), and the second holds the frequencies with which these terms appear in the corpus. The code to tokenize the corpus and create these data frames is provided in the Appendix.
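
As a brief illustration of what the tokenizer produces (a sketch assuming RWeka and its Java backend are installed), splitting a short phrase into bigrams looks like this:

library(RWeka)
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
## expected result: "thanks for" "for the" "the follow"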


Unigrams

There were 8503 unigrams generated by the tokenization process. The following table shows statistics of these terms (their minimum, median, mean and maximum frequency values).

ngram_type terms_num freq_min freq_med freq_mean freq_max
unigram 8503 9 29 112.76 6324

The top 10 unigrams are listed here, followed by a bar-plot showing the frequencies for the top 30 unigrams.

term    will   said   just    get    one   like   time    can    day   year
freq    6324   6196   6051   6038   5978   5848   5181   4993   4435   4210

Bigrams

There were 5484 bigrams generated by the tokenization process. The following table shows statistics of these terms (their minimum, median, mean and maximum frequency values).

ngram_type terms_num freq_min freq_med freq_mean freq_max
bigram 5484 9 13 19.38 487

The top 10 bigrams are listed here, followed by a bar-plot showing the frequencies for the top 30 bigrams.

term  right now  last year  look like  new york  feel like  year ago  last night  look forward  high school  make sure
freq  487        433        416        386       339        339       338         314           287           267

Trigrams

There were 107 trigrams generated by the tokenization process. The following table shows statistics of these terms (their minimum, median, mean and maximum frequency values).

ngram_type terms_num freq_min freq_med freq_mean freq_max
trigram 107 9 12 14.74 63

The top 10 trigrams are listed here, followed by a bar-plot showing the frequencies for the top 20 trigrams.

term freq
happi mother day 63
new york citi 56
let us know 51
happi new year 38
new york time 32
presid barack obama 32
look forward see 31
world war ii 28
two year ago 27
cinco de mayo 26

Conclusion and Next Steps

This report documented the process of exploring the given data files, generating a training dataset by sampling those files, and then creating and cleaning a corpus from the training dataset. As a step toward building a predictive next-word model, the corpus was tokenized and three n-gram data frames were produced: one for unigrams, another for bigrams and a third for trigrams. Each n-gram data frame holds the terms and their frequencies.

The next steps for this project include:

- Building the predictive next-word model, using the n-gram (term and frequency) data frames produced here (a simple illustrative sketch is shown below).
- Building and delivering the application required for the capstone project, which will use the model to predict the next word of the text a user types.
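
To give an idea of the direction of that modelling work, the sketch below shows how the bigram data frame built in the Appendix could already back a simple frequency-based next-word lookup. This is only an illustration, not code used for this report; the predict_next_word() helper is hypothetical, and it assumes the bigramDF data frame from the Appendix, with space-separated terms. The final model will likely need a more refined approach (e.g. back-off between the trigram, bigram and unigram tables).

## hypothetical helper, for illustration only: rank candidate next words by bigram frequency
predict_next_word <- function(word, ngram_df, n = 3) {
  ## keep bigrams whose first word matches the (lowercased, stemmed) input word
  matches <- ngram_df[grepl(paste0("^", word, " "), ngram_df$term), ]
  ## sort by frequency and return the second word of the top-n bigrams
  matches <- matches[order(-matches$freq), ]
  head(sub("^\\S+\\s+", "", matches$term), n)
}
## e.g. predict_next_word("new", bigramDF) should rank "york" highly, given the bigram table above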


Appendix

This appendix lists all the R code chunks used to produce this report.

Code for: Reading and Analyzing Data Files

## load required libraries
library(kableExtra)
library(stringi)
library(tm)   
library(jsonlite)
library(RWeka)
library(ggplot2)
## open the data files, read their lines, then close the files' connections
##   (all data files are assumed to be downloaded and placed in R's working directory)

news_file_name <- "en_US.news.txt"
blogs_file_name <- "en_US.blogs.txt"
twitter_file_name <- "en_US.twitter.txt"

news_file_con <- file(news_file_name, "r")
blogs_file_con <- file(blogs_file_name, "r")
twitter_file_con <- file(twitter_file_name, "r")

news_lines <- readLines(news_file_con, skipNul = TRUE)
blogs_lines <- readLines(blogs_file_con, skipNul = TRUE)
twitter_lines <- readLines(twitter_file_con, skipNul = TRUE)

close(news_file_con) ; close(blogs_file_con) ; close(twitter_file_con)

## get the words counts and statistics for the data files 
news_lines_words_count <- stri_count_words(news_lines) 
blogs_lines_words_count <- stri_count_words(blogs_lines) 
twitter_lines_words_count <- stri_count_words(twitter_lines) 
news_lines_word_summary <- c (sum(news_lines_words_count), summary(news_lines_words_count))
blogs_lines_word_summary <- c (sum(blogs_lines_words_count), summary(blogs_lines_words_count))
twitter_lines_word_summary <- c (sum(twitter_lines_words_count), summary(twitter_lines_words_count))
## put all data files' info collected in a data-frame
files_summary <- data.frame(File = c(news_file_name, blogs_file_name, twitter_file_name),
           Size = c( paste(round(file.info(news_file_name)$size/(1024^2), 2), " MB"), 
                         paste(round(file.info(blogs_file_name)$size/(1024^2), 2), " MB"), 
                         paste(round(file.info(twitter_file_name)$size/(1024^2), 2), " MB") ),
           Lines_Num = c( length(news_lines), length(blogs_lines), length(twitter_lines) ),
           Words_Total = c( news_lines_word_summary[1], blogs_lines_word_summary[1],
                            twitter_lines_word_summary[1]),
           # word per line (WPL) stats
           WPL_min = c( news_lines_word_summary[2], blogs_lines_word_summary[2], 
                        twitter_lines_word_summary[2]),
           WPL_med = c( news_lines_word_summary[4], blogs_lines_word_summary[4], 
                        twitter_lines_word_summary[4]),
           WPL_mean = c( round(news_lines_word_summary[5], 2), round(blogs_lines_word_summary[5], 2),
                         round(twitter_lines_word_summary[5], 2)),
           WPL_max = c( news_lines_word_summary[7], blogs_lines_word_summary[7], twitter_lines_word_summary[7]) )

## show a summary of the three data files' info as a table.
kable(files_summary, align = "lccccccc") %>%
kable_styling(position = "center", full_width = F, bootstrap_options = c("striped", "bordered"))

Code for: Training Dataset Creation

set.seed(12345)       ## set for reproducibility
sample_size <- 0.02   ## with a larger sample, later operations exceeded the machine's memory limit (16GB)

## build training data set 
training_data <- c(sample(news_lines, length(news_lines)*sample_size, replace = FALSE),
              sample(blogs_lines, length(blogs_lines)*sample_size, replace = FALSE),
              sample(twitter_lines, length(twitter_lines)*sample_size, replace = FALSE))

## remove the not needed big objects, to free memory
rm(news_lines, blogs_lines, twitter_lines)
## put all training dataset's info in a data-frame
training_lines_words_count <- stri_count_words(training_data) 
training_lines_word_summary <- c(sum(training_lines_words_count), summary(training_lines_words_count))
training_data_summary <- data.frame( Dataset = "training_data", 
                                     Lines_Num = length(training_data),
                                     Words_Total = training_lines_word_summary[1], 
                                     WPL_min = training_lines_word_summary[2], 
                                     WPL_med = training_lines_word_summary[4],
                                     WPL_mean = round(training_lines_word_summary[5], 2),
                                     WPL_max = training_lines_word_summary[7] )

## show a summary of the training dataset's info as a table.
kable(training_data_summary, row.names = FALSE, align = "lcccccc") %>%
kable_styling(position = "center", full_width = F, bootstrap_options = c("striped", "bordered"))

Code for: Corpus Creation and Cleaning

## build corpus from the training set
corpus <- VCorpus(VectorSource(training_data))

## after creating the corpus, the big training_data object is no longer needed and is removed to free memory
rm(training_data)

## clean corpus, from web and email addresses; and also from twitter hash tags and mentions 
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+") 
corpus <- tm_map(corpus, toSpace, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}") 
corpus <- tm_map(corpus, toSpace, "#[[:alnum:]_]+") 
corpus <- tm_map(corpus, toSpace, "@[^\\s]+") 

## clean corpus, from profanity words
profanity <- fromJSON("https://raw.githubusercontent.com/zautumnz/profane-words/refs/heads/master/words.json")
profanity <- profanity[!grepl(" ", profanity)] # removing words with spaces from the list
corpus <- tm_map(corpus, removeWords, profanity)
rm(profanity)  ## remove from memory, as it is no longer needed

## clean and tidy corpus, by setting all words to lowercase and to their root-words; 
## by removing stop-words, punctuation, numbers and white-spaces; then make plain-text
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)

Code for: N-gram Tokenization

## tokenization functions
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# build the Term-Document-Matrices (TDM) with tokenization and removal of sparse terms
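# note: the 0.9999 sparsity threshold keeps only terms present in more than 0.01% of the documents
# (roughly 9 or more of the ~85k sampled lines), which keeps the matrices small enough for memory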
unigramTDM_freq <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = unigram)), 0.9999)
bigramTDM_freq <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999)
trigramTDM_freq <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999)

## remove the not needed corpus object, to free memory
rm(corpus) 
# get the terms' frequencies from the TDMs, and sort them
unigram_freq <- sort(rowSums(as.matrix(unigramTDM_freq)), decreasing=TRUE)
bigram_freq <- sort(rowSums(as.matrix(bigramTDM_freq)), decreasing=TRUE)
trigram_freq <- sort(rowSums(as.matrix(trigramTDM_freq)), decreasing=TRUE)

# create the uni/bi/tri-gram data frames of terms and their frequencies
unigramDF <- data.frame(term=names(unigram_freq), freq = unigram_freq)   
bigramDF <- data.frame(term=names(bigram_freq), freq = bigram_freq)   
trigramDF <- data.frame(term=names(trigram_freq), freq = trigram_freq)

Code for: Unigrams, Bigrams and Trigrams

## put the unigramDF summary in a data frame
uni_summary <- c(nrow(unigramDF), summary(unigramDF$freq))
unigrams_summary <- data.frame( ngram_type = "unigram", 
                                terms_num = uni_summary[1],
                                freq_min = uni_summary[2],
                                freq_med = uni_summary[4],
                                freq_mean = round(uni_summary[5], 2),
                                freq_max = uni_summary[7] )

## show the unigramsDF summary as a table.
kable(unigrams_summary, row.names = FALSE, align = "lccccc") %>%
kable_styling(position = "center", full_width = F, bootstrap_options = c("striped", "bordered"))
## present a table of the top 10 frequent unigrams
df_transposed <- as.data.frame(t(head(unigramDF,10)))
kable(df_transposed, col.names = NULL)%>%
  kable_styling(position = "center", full_width = FALSE, bootstrap_options = "striped") %>%
  column_spec(1, bold = TRUE, background = "#ffcccb") %>% 
  column_spec(seq(3, 11, by = 2), color = "darkred") 

## plot the top 30 frequent unigrams
head(unigramDF,30) %>% 
   ggplot(aes(reorder(term,-freq), freq)) +
  geom_bar(stat = "identity", fill = "darkred") +
  ggtitle("30 Most Frequent Unigrams") +
  xlab("Unigrams") + ylab("Frequency") +
  theme(panel.grid.major.x = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5, color = "darkred", face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1, color = "darkred", face = "bold"))
## put the bigramDF summary in a data frame
bi_summary <- c(nrow(bigramDF), summary(bigramDF$freq))
bigrams_summary <- data.frame( ngram_type = "bigram", 
                                terms_num = bi_summary[1],
                                freq_min = bi_summary[2],
                                freq_med = bi_summary[4],
                                freq_mean = round(bi_summary[5], 2),
                                freq_max = bi_summary[7] )

## show the bigramsDF summary as a table.
kable(bigrams_summary, row.names = FALSE, align = "lccccc") %>%
kable_styling(position = "center", full_width = F, bootstrap_options = c("striped", "bordered"))
## present a table of the top 10 frequent bigrams
df_transposed <- as.data.frame(t(head(bigramDF,10)))
kable(df_transposed, col.names = NULL)%>%
  kable_styling(position = "center", full_width = FALSE, bootstrap_options = "striped")  %>%
  column_spec(1, bold = TRUE, background = "lightblue") %>% 
  column_spec(seq(3, 11, by = 2), color = "steelblue") 

## plot the top 30 frequent bigrams
head(bigramDF,30) %>% 
   ggplot(aes(reorder(term,-freq), freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ggtitle("30 Most Frequent Bigrams") +
  xlab("Bigrams") + ylab("Frequency") +
  theme(panel.grid.major.x = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5, color = "steelblue", face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1, color = "steelblue", face = "bold"))
## put the trigramDF summary in a data frame
tri_summary <- c(nrow(trigramDF), summary(trigramDF$freq))
trigrams_summary <- data.frame( ngram_type = "trigram", 
                                terms_num = tri_summary[1],
                                freq_min = tri_summary[2],
                                freq_med = tri_summary[4],
                                freq_mean = round(tri_summary[5], 2),
                                freq_max = tri_summary[7] )

## show the trigramsDF summary as a table.
kable(trigrams_summary, row.names = FALSE, align = "lccccc") %>%
kable_styling(position = "center", full_width = F, bootstrap_options = c("striped", "bordered"))
## present a table of the top 10 frequent trigrams
kable(head(trigramDF,10), row.names = FALSE) %>%
  kable_styling(position = "center", full_width = FALSE, bootstrap_options = "striped")  %>%
  row_spec(0, bold = TRUE, background = "lightgreen") %>% 
  row_spec(seq(2, 10, by = 2), color = "darkgreen") 

## plot the top 20 frequent trigrams
head(trigramDF,20) %>% 
   ggplot(aes(reorder(term,-freq), freq)) +
  geom_bar(stat = "identity",fill = "darkgreen") +
  ggtitle("20 Most Frequent Trigrams") +
  xlab("Trigrams") + ylab("Frequency") +
  theme(panel.grid.major.x = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5, color = "darkgreen", face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1, color = "darkgreen", face = "bold"))