The goal of the Data Science Capstone project is to create a predictive text model. Natural Language Processing (NLP) techniques are used to perform the analysis and build the predictive model. A large text corpus of documents, collected from news, blogs and Twitter feeds, is provided to use as training data.
This report describes the exploratory data analysis steps conducted so far to explore the provided data, summarize it, clean it and make it ready for tokenization, a key step in NLP. The report also describes the n-gram lists generated to be used in the project's predictive model. Finally, the report lists the next steps to be taken to complete the capstone project and deliver the required application.
To keep the report brief and concise, all the code to read the data files, create a corpus from a training dataset, clean it and tokenize it, in addition to the code that generates the tables and plots of this report, is listed in the Appendix.
After loading all the R libraries required for the data analysis and plotting, the following steps were taken:
There are three data files, one for each data source (news, blogs, Twitter). Each file's lines were read and counted separately. In addition, the total number of words in each file, as well as the Words-per-Line (WPL) statistics (minimum, median, mean and maximum), were calculated. The following table presents the characteristics of each data file.
| File | Size | Lines_Num | Words_Total | WPL_min | WPL_med | WPL_mean | WPL_max |
|---|---|---|---|---|---|---|---|
| en_US.news.txt | 196.28 MB | 1010242 | 34762395 | 1 | 32 | 34.41 | 1796 |
| en_US.blogs.txt | 200.42 MB | 899288 | 37546250 | 0 | 28 | 41.75 | 6726 |
| en_US.twitter.txt | 159.36 MB | 2360148 | 30093413 | 1 | 12 | 12.75 | 47 |
A training dataset was generated by sampling all three data files. As the code provided in the Appendix for this section shows, a seed was set for reproducibility. After multiple attempts to choose a sample size/percentage for extracting lines from each of the files, constrained by the memory (16 GB) of the machine the analysis was run on and by how this affected later operations, a percentage of 2% was decided on. It is worth mentioning here, as the code in the Appendix shows, that every large object was removed from memory as soon as it was no longer needed, in order to free memory for subsequent operations and their objects.
The training dataset was built by taking a sample of 2% of each of the three data files. The created training_data has the following characteristics:
| Dataset | Lines_Num | Words_Total | WPL_min | WPL_med | WPL_mean | WPL_max |
|---|---|---|---|---|---|---|
| training_data | 85391 | 2038328 | 1 | 16 | 23.87 | 716 |
A corpus was created from the training dataset, using R's tm text-mining library. After removing the large training data from memory, a set of transformations was applied to clean the corpus: all web/email addresses, Twitter hashtags/mentions, profanity words, numbers, punctuation, stopwords and extra white-space were removed. All words in the corpus were converted to lowercase and then stemmed (reduced to their common root or base form). Finally, the corpus was transformed to plain-text documents to be used in the tokenization process.
Tokenization, in NLP, is the process of identifying tokens, such as words or pairs of words. These tokens are usually called n-grams. An n-gram is a contiguous sequence of n items (words, in this case) from a given sample of text. An n-gram of size 1 word is referred to as a unigram, size 2 words is a bigram and size 3 words is a trigram. In this project, R's RWeka library was used to tokenize the corpus and generate three lists: one for unigrams, a second for bigrams and a third for trigrams. The lists are then converted to data-frames, each with two columns. The first column holds the terms (a term is 1 word for a unigram, 2 words for a bigram and 3 words for a trigram). The second column holds the frequencies with which these terms appear in the corpus. The code to tokenize the corpus and create these data-frames is provided in the Appendix.
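As a quick illustration of how these tokenizers work, the snippet below applies RWeka's NGramTokenizer to a toy sentence (illustrative only, not taken from the project data):

## illustrative example: n-grams extracted from a toy sentence (not part of the project corpus)
library(RWeka)
toy_text <- "the quick brown fox jumps"
NGramTokenizer(toy_text, Weka_control(min = 1, max = 1)) ## unigrams: "the", "quick", "brown", "fox", "jumps"
NGramTokenizer(toy_text, Weka_control(min = 2, max = 2)) ## bigrams: "the quick", "quick brown", "brown fox", "fox jumps"
NGramTokenizer(toy_text, Weka_control(min = 3, max = 3)) ## trigrams: "the quick brown", "quick brown fox", "brown fox jumps"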
There were 8503 unigrams generated by the tokenization process. The following table shows statistics of these terms (their minimum, median, mean and maximum frequency values).
| ngram_type | terms_num | freq_min | freq_med | freq_mean | freq_max |
|---|---|---|---|---|---|
| unigram | 8503 | 9 | 29 | 112.76 | 6324 |
The top 10 unigrams are listed here, followed by a bar-plot showing the frequencies for the top 30 unigrams.
| term | will | said | just | get | one | like | time | can | day | year |
|---|---|---|---|---|---|---|---|---|---|---|
| freq | 6324 | 6196 | 6051 | 6038 | 5978 | 5848 | 5181 | 4993 | 4435 | 4210 |
There were 5484 bigrams generated by the tokenization process. The following table shows statistics of these terms (their minimum, median, mean and maximum frequency values).
| ngram_type | terms_num | freq_min | freq_med | freq_mean | freq_max |
|---|---|---|---|---|---|
| bigram | 5484 | 9 | 13 | 19.38 | 487 |
The top 10 bigrams are listed here, followed by a bar-plot showing the frequencies for the top 30 bigrams.
| term | right now | last year | look like | new york | feel like | year ago | last night | look forward | high school | make sure |
|---|---|---|---|---|---|---|---|---|---|---|
| freq | 487 | 433 | 416 | 386 | 339 | 339 | 338 | 314 | 287 | 267 |
There were 107 trigrams generated by the tokenization process. The following table shows statistics of these terms (their minimum, median, mean and maximum frequency values).
| ngram_type | terms_num | freq_min | freq_med | freq_mean | freq_max |
|---|---|---|---|---|---|
| trigram | 107 | 9 | 12 | 14.74 | 63 |
The top 10 trigrams are listed here, followed by a bar-plot showing the frequencies for the top 20 trigrams.
| term | freq |
|---|---|
| happi mother day | 63 |
| new york citi | 56 |
| let us know | 51 |
| happi new year | 38 |
| new york time | 32 |
| presid barack obama | 32 |
| look forward see | 31 |
| world war ii | 28 |
| two year ago | 27 |
| cinco de mayo | 26 |
This report documented the process of exploring the given data files, generating a training dataset by sampling these files, and then creating and cleaning/tidying a corpus from the training dataset. As a step toward building a predictive next-word model, the corpus was tokenized. Three n-gram data-frames were produced: one for unigrams, another for bigrams, and a third for trigrams. Each n-gram data-frame holds the terms and their frequencies.
The next steps for this project include:
- Building the predictive next-word model using the generated n-gram frequency data-frames (a minimal illustration is sketched below).
- Developing the required application that uses the model to suggest the next word for a phrase entered by the user, and evaluating its accuracy and performance.
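As a rough illustration of how the n-gram frequency data-frames could feed such a model, the sketch below implements a simple frequency-based back-off lookup. The function name predict_next_word and the back-off strategy are illustrative assumptions, not the project's final model, and the input phrase is assumed to be cleaned and stemmed the same way as the corpus.

## illustrative sketch only: frequency-based back-off using the Appendix's trigramDF/bigramDF
predict_next_word <- function(phrase, trigramDF, bigramDF) {
  ## keep the last two (cleaned, stemmed) words of the input phrase
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(words) == 2) {
    ## try trigrams first: match the last two words, take the most frequent completion
    hits <- trigramDF[grepl(paste0("^", words[1], " ", words[2], " "), trigramDF$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$term[which.max(hits$freq)]))
  }
  ## back off to bigrams: match only the last word
  hits <- bigramDF[grepl(paste0("^", tail(words, 1), " "), bigramDF$term), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$term[which.max(hits$freq)]))
  NA_character_  ## no prediction found
}
## example (based on the trigram table above): predict_next_word("new york", trigramDF, bigramDF) returns "citi"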
This appendix lists all R code chunks used to produce this report.
## load required libraries
library(kableExtra)
library(stringi)
library(tm)
library(jsonlite)
library(RWeka)
library(ggplot2)
## open data files**, read data lines, then close the files' connections
## **all data files are assumed to be downloaded and placed in R's working directory.
news_file_name <- "en_US.news.txt"
blogs_file_name <- "en_US.blogs.txt"
twitter_file_name <- "en_US.twitter.txt"
news_file_con <- file(news_file_name, "r")
blogs_file_con <- file(blogs_file_name, "r")
twitter_file_con <- file(twitter_file_name, "r")
news_lines <- readLines(news_file_con, skipNul = TRUE)
blogs_lines <- readLines(blogs_file_con, skipNul = TRUE)
twitter_lines <- readLines(twitter_file_con, skipNul = TRUE)
close(news_file_con) ; close(blogs_file_con) ; close(twitter_file_con)
## get the words counts and statistics for the data files
news_lines_words_count <- stri_count_words(news_lines)
blogs_lines_words_count <- stri_count_words(blogs_lines)
twitter_lines_words_count <- stri_count_words(twitter_lines)
news_lines_word_summary <- c (sum(news_lines_words_count), summary(news_lines_words_count))
blogs_lines_word_summary <- c (sum(blogs_lines_words_count), summary(blogs_lines_words_count))
twitter_lines_word_summary <- c (sum(twitter_lines_words_count), summary(twitter_lines_words_count))
## put all data files' info collected in a data-frame
files_summary <- data.frame(File = c(news_file_name, blogs_file_name, twitter_file_name),
Size = c( paste(round(file.info(news_file_name)$size/(1024^2), 2), " MB"),
paste(round(file.info(blogs_file_name)$size/(1024^2), 2), " MB"),
paste(round(file.info(twitter_file_name)$size/(1024^2), 2), " MB") ),
Lines_Num = c( length(news_lines), length(blogs_lines), length(twitter_lines) ),
Words_Total = c( news_lines_word_summary[1], blogs_lines_word_summary[1],
twitter_lines_word_summary[1]),
# word per line (WPL) stats
WPL_min = c( news_lines_word_summary[2], blogs_lines_word_summary[2],
twitter_lines_word_summary[2]),
WPL_med = c( news_lines_word_summary[4], blogs_lines_word_summary[4],
twitter_lines_word_summary[4]),
WPL_mean = c( round(news_lines_word_summary[5], 2), round(blogs_lines_word_summary[5], 2),
round(twitter_lines_word_summary[5], 2)),
WPL_max = c( news_lines_word_summary[7], blogs_lines_word_summary[7], twitter_lines_word_summary[7]) )
## show a summary of the three data files info as a table.
kable(files_summary, align = "lccccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
set.seed(12345) ## set for reproducibility
sample_size <- 0.02 ## larger values exhausted the machine's memory during later operations
## build training data set
training_data <- c(sample(news_lines, length(news_lines)*sample_size, replace = FALSE),
sample(blogs_lines, length(blogs_lines)*sample_size, replace = FALSE),
sample(twitter_lines, length(twitter_lines)*sample_size, replace = FALSE))
## remove the big objects that are no longer needed, to free memory
rm(news_lines, blogs_lines, twitter_lines)
## put all training dataset's info in a data-frame
training_lines_words_count <- stri_count_words(training_data)
training_lines_word_summary <- c(sum(training_lines_words_count), summary(training_lines_words_count))
training_data_summary <- data.frame( Dataset = "training_data",
Lines_Num = length(training_data),
Words_Total = training_lines_word_summary[1],
WPL_min = training_lines_word_summary[2],
WPL_med = training_lines_word_summary[4],
WPL_mean = round(training_lines_word_summary[5], 2),
WPL_max = training_lines_word_summary[7] )
## show a summary of the training dataset's info as a table.
kable(training_data_summary, row.names = FALSE, align = "lcccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
## build corpus from the training set
corpus <- VCorpus(VectorSource(training_data))
## after creating the corpus, the big training_data object is no longer needed and is removed to free memory
rm(training_data)
## clean corpus, from web and email addresses; and also from twitter hash tags and mentions
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")
corpus <- tm_map(corpus, toSpace, "#[[:alnum:]_]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
## clean corpus, from profanity words
profanity <- fromJSON("https://raw.githubusercontent.com/zautumnz/profane-words/refs/heads/master/words.json")
profanity <- profanity[!grepl(" ", profanity)] # removing words with spaces from the list
corpus <- tm_map(corpus, removeWords, profanity)
rm(profanity) ## remove from memory, as it is no longer needed
## clean and tidy corpus, by setting all words to lowercase and to their root-words;
## by removing stop-words, punctuation, numbers and white-spaces; then make plain-text
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
## tokenization functions
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# build the Term-Document-Matrices (TDM) with tokenization and removal of sparse terms
unigramTDM_freq <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = unigram)), 0.9999)
bigramTDM_freq <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999)
trigramTDM_freq <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999)
## remove the corpus object, which is no longer needed, to free memory
rm(corpus)
# get the terms' frequencies from the TDMs, and sort them
unigram_freq <- sort(rowSums(as.matrix(unigramTDM_freq)), decreasing=TRUE)
bigram_freq <- sort(rowSums(as.matrix(bigramTDM_freq)), decreasing=TRUE)
trigram_freq <- sort(rowSums(as.matrix(trigramTDM_freq)), decreasing=TRUE)
# create uni/bi/tri-gram data-frames holding each one's terms and their frequencies
unigramDF <- data.frame(term=names(unigram_freq), freq = unigram_freq)
bigramDF <- data.frame(term=names(bigram_freq), freq = bigram_freq)
trigramDF <- data.frame(term=names(trigram_freq), freq = trigram_freq)
## put the unigramDF summary in a data-frame
uni_summary <- c(nrow(unigramDF), summary(unigramDF$freq))
unigrams_summary <- data.frame( ngram_type = "unigram",
terms_num = uni_summary[1],
freq_min = uni_summary[2],
freq_med = uni_summary[4],
freq_mean = round(uni_summary[5], 2),
freq_max = uni_summary[7] )
## show the unigramsDF summary as a table.
kable(unigrams_summary, row.names = FALSE, align = "lccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
## present a table of the top 10 frequent unigrams
df_transposed <- as.data.frame(t(head(unigramDF,10)))
kable(df_transposed, col.names = NULL) %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = "striped") %>%
column_spec(1, bold = TRUE, background = "#ffcccb") %>%
column_spec(seq(3, 11, by = 2), color = "darkred")
## plot the top 30 frequent unigrams
head(unigramDF,30) %>%
ggplot(aes(reorder(term,-freq), freq)) +
geom_bar(stat = "identity", fill = "darkred") +
ggtitle("30 Most Frequent Unigrams") +
xlab("Unigrams") + ylab("Frequency") +
theme(panel.grid.major.x = element_blank()) +
theme(plot.title = element_text(hjust = 0.5, color = "darkred", face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, color = "darkred", face = "bold"))
## put the bigramDF summary in a data-frame
bi_summary <- c(nrow(bigramDF), summary(bigramDF$freq))
bigrams_summary <- data.frame( ngram_type = "bigram",
terms_num = bi_summary[1],
freq_min = bi_summary[2],
freq_med = bi_summary[4],
freq_mean = round(bi_summary[5], 2),
freq_max = bi_summary[7] )
## show the bigramsDF summary as a table.
kable(bigrams_summary, row.names = FALSE, align = "lccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
## present a table of the top 10 frequent bigrams
df_transposed <- as.data.frame(t(head(bigramDF,10)))
kable(df_transposed, col.names = NULL) %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = "striped") %>%
column_spec(1, bold = TRUE, background = "lightblue") %>%
column_spec(seq(3, 11, by = 2), color = "steelblue")
## plot the top 30 frequent bigrams
head(bigramDF,30) %>%
ggplot(aes(reorder(term,-freq), freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
ggtitle("30 Most Frequent Bigrams") +
xlab("Bigrams") + ylab("Frequency") +
theme(panel.grid.major.x = element_blank()) +
theme(plot.title = element_text(hjust = 0.5, color = "steelblue", face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, color = "steelblue", face = "bold"))
## put the trigramDF summary in a data-frame
tri_summary <- c(nrow(trigramDF), summary(trigramDF$freq))
trigrams_summary <- data.frame( ngram_type = "trigram",
terms_num = tri_summary[1],
freq_min = tri_summary[2],
freq_med = tri_summary[4],
freq_mean = round(tri_summary[5], 2),
freq_max = tri_summary[7] )
## show the trigramsDF summary as a table.
kable(trigrams_summary, row.names = FALSE, align = "lccccc") %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = c("striped", "bordered"))
## present a table of the top 10 frequent trigrams
kable(head(trigramDF,10), row.names = FALSE) %>%
kable_styling(position = "center", full_width = FALSE, bootstrap_options = "striped") %>%
row_spec(0, bold = TRUE, background = "lightgreen") %>%
row_spec(seq(2, 10, by = 2), color = "darkgreen")
## plot the top 20 frequent trigrams
head(trigramDF,20) %>%
ggplot(aes(reorder(term,-freq), freq)) +
geom_bar(stat = "identity",fill = "darkgreen") +
ggtitle("20 Most Frequent Trigrams") +
xlab("Trigrams") + ylab("Frequency") +
theme(panel.grid.major.x = element_blank()) +
theme(plot.title = element_text(hjust = 0.5, color = "darkgreen", face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, color = "darkgreen", face = "bold"))