The goal of this project is to introduce the process of getting and cleaning the data, present some exploratory analysis of the text data for the capstone project, and outline the goals for the prediction model and the final application.
To execute the R code, the following libraries must be loaded.
library(ggplot2)
library(gridExtra)
library(wordcloud)
library(RColorBrewer)
library(data.table)
library(stringr)
library(tm)
library(RWeka)
library(knitr)
In this section we configure some parameters related to the analysis. First we set the absolute paths to the documents.
home_dir <- "/home/jagprieto/Escritorio/COURSERA/DATA_SCIENCE/CAPSTONE_PROJECT/"
data_dir <- "/home/jagprieto/Escritorio/COURSERA/DATA_SCIENCE/CAPSTONE_PROJECT/data/corpus"
news_file_path <- paste(data_dir, '/en_US.news.txt', sep = '')
blogs_file_path <- paste(data_dir, '/en_US.blogs.txt', sep = '')
twitter_file_path <- paste(data_dir, '/en_US.twitter.txt', sep = '')
setwd(home_dir)
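A slightly more portable way to build such paths is `file.path()`, which inserts the separator itself. The sketch below is only an illustration: `base_dir`, `corpus_dir` and `corpus_files` are hypothetical names, and the directory is a placeholder rather than the path used in this report.

```r
# Hypothetical base directory, used only for illustration.
base_dir <- file.path("path", "to", "CAPSTONE_PROJECT")
corpus_dir <- file.path(base_dir, "data", "corpus")

# Build the three corpus paths in one vectorised call.
sources <- c("news", "blogs", "twitter")
corpus_files <- file.path(corpus_dir, paste0("en_US.", sources, ".txt"))
names(corpus_files) <- sources
print(corpus_files)
```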
Now we set the share of lines to read from each document for the basic exploratory data analysis. The value is a fraction, so 0.01 means 1% of the lines.
conf.train.percentage <- 0.01
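As a minimal sketch of how this fraction turns into a sample, the example below draws 1% of a toy vector of lines. The names `example_lines`, `sample_fraction` and `example_sample` are hypothetical, and the call to `set.seed()` is an assumption added here so that the same lines are drawn on every run; the report itself does not fix a seed.

```r
# Toy stand-in for the lines of one document (not the real corpus).
example_lines <- paste("line", 1:1000)

# Fraction of lines to keep; 0.01 corresponds to 1%.
sample_fraction <- 0.01

# Fixing the seed makes the random sample reproducible (assumption, not in the report).
set.seed(1234)
sample_size <- as.integer(sample_fraction * length(example_lines))
example_sample <- sample(example_lines, sample_size)
length(example_sample)
```

Because `as.integer()` truncates, the sample size is rounded down when the product is not a whole number.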
Set the number of words in the word cloud visualization.
conf.word.cloud.number <- 100
Set the number of terms shown in the frequency tables (the most frequent terms).
conf.terms.freq.number <- 25
Set the number of clusters.
conf.terms.clusters <- 10
We start by analyzing the document files and obtaining some general information about their size, number of lines, and number of words and characters per line.
# Extract blogs documents info.
blogs_file_connection <- file(blogs_file_path, "r", blocking = FALSE)
blogs_table <- readLines(blogs_file_connection)
close(blogs_file_connection)
blogs_lines_count <- length(blogs_table)
# Count blogs words (approximated by the number of runs of non-word characters per line) and chars.
blogs_lines_words_count <- lapply(blogs_table, function (line_text) sapply(gregexpr("\\W+", line_text), length))
blogs_words_count <- Reduce("+", blogs_lines_words_count)
blogs_lines_chars_count <- lapply(blogs_table, function (line_text) nchar(line_text))
blogs_chars_count <- Reduce("+", blogs_lines_chars_count)
blogs_size <- object.size(blogs_table)
# Extract blogs train data table.
blogs_train_lines <- as.integer(conf.train.percentage*blogs_lines_count)
blogs_table_train <- sample(blogs_table, blogs_train_lines)
# Clean blogs data table.
rm(blogs_table)
# Extract news documents info.
news_file_connection <- file(news_file_path, "r", blocking = FALSE)
news_table <- readLines(news_file_connection)
close(news_file_connection)
news_lines_count <- length(news_table)
# Count news words (approximated by the number of runs of non-word characters per line) and chars.
news_lines_words_count <- lapply(news_table, function (line_text) sapply(gregexpr("\\W+", line_text), length))
news_words_count <- Reduce("+", news_lines_words_count)
news_lines_chars_count <- lapply(news_table, function (line_text) nchar(line_text))
news_chars_count <- Reduce("+", news_lines_chars_count)
news_size <- object.size(news_table)
# Extract news train data table.
news_train_lines <- as.integer(conf.train.percentage*news_lines_count)
news_table_train <- sample(news_table, news_train_lines)
# Clean news data table.
rm(news_table)
# Extract twitter documents info.
twitter_file_connection <- file(twitter_file_path, "r", blocking = FALSE)
twitter_table <- readLines(twitter_file_connection)
close(twitter_file_connection)
twitter_lines_count <- length(twitter_table)
# Count twitter words (approximated by the number of runs of non-word characters per line) and chars.
twitter_lines_words_count <- lapply(twitter_table, function (line_text) sapply(gregexpr("\\W+", line_text), length))
twitter_words_count <- Reduce("+", twitter_lines_words_count)
twitter_lines_chars_count <- lapply(twitter_table, function (line_text) nchar(line_text))
twitter_chars_count <- Reduce("+", twitter_lines_chars_count)
twitter_size <- object.size(twitter_table)
# Extract twitter train data table.
twitter_train_lines <- as.integer(conf.train.percentage*twitter_lines_count)
twitter_table_train <- sample(twitter_table, twitter_train_lines)
# Clean twitter data table.
rm(twitter_table)
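The word counts above rely on `gregexpr("\\W+", ...)`, which counts runs of non-word characters rather than words. An empty or one-word line still yields 1, because `gregexpr()` reports a single "no match" entry. The sketch below compares that approximation with a direct token count on toy input. The names `toy_lines` and `count_words` are hypothetical, and treating runs of letters, digits and apostrophes as words is an assumption, not the definition used in this report.

```r
# Toy lines standing in for a document (illustrative only).
toy_lines <- c("Hello, world! It's fine.", "One", "")

# Approach used above: number of runs of non-word characters per line.
separator_runs <- sapply(gregexpr("\\W+", toy_lines), length)

# Alternative: count runs of letters, digits and apostrophes as words.
count_words <- function(lines) {
  lengths(regmatches(lines, gregexpr("[[:alnum:]']+", lines)))
}
word_counts <- count_words(toy_lines)

# Totals can be taken with vectorised calls instead of lapply() + Reduce().
total_words <- sum(word_counts)
total_chars <- sum(nchar(toy_lines))

print(data.frame(line = toy_lines, separator_runs, word_counts))
```

On large documents the vectorised `sum(nchar(...))` form also avoids building an intermediate list with one element per line.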
The next table shows the general information related to the document files.
| Document | Size (bytes) | Lines | Words |
| Blogs | 260564320 | 899288 | 38222304 |
| News | 261759048 | 1010242 | 35710849 |
| Twitter | 316037344 | 2360148 | 30433509 |
The number of lines sampled from each document (1% of the total) is:
| Document | Lines |
| Blogs | 8992 |
| News | 10102 |
| Twitter | 23601 |
At this point we start to apply the tm functions, which allow texts to be processed at the grammatical and semantic levels. We first initialize the functions used to create the n-gram models (n = 1 to 4).
# Creating the tokenizer functions.
one_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
two_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
three_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
four_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
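To make clear what these tokenizers produce, here is a minimal base R sketch of n-gram extraction from a single string. It is an illustration under simple assumptions (words are split on whitespace only) and is not how RWeka's `NGramTokenizer` is implemented. The name `make_ngrams` is hypothetical.

```r
# Return all runs of n consecutive words found in one string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # A text shorter than n words contains no n-grams.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

print(make_ngrams("the quick brown fox", 2))
print(make_ngrams("the quick brown fox", 3))
```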
Now let's create the blogs corpus and its n-gram model data.
################# Blogs corpus processing and analysis. #################
# Creating the corpus for n-gram analysis.
blogs_corpus <- Corpus(VectorSource(list(blogs_table_train)))
# Preprocessing blogs corpus with tm: whitespace stripping, number and punctuation removal, lower-casing, English stop word removal
# and finally stemming the document.
blogs_corpus <- tm_map(blogs_corpus, stripWhitespace)
blogs_corpus <- tm_map(blogs_corpus, removeNumbers)
blogs_corpus <- tm_map(blogs_corpus, removePunctuation)
blogs_corpus <- tm_map(blogs_corpus, content_transformer(tolower))
blogs_corpus <- tm_map(blogs_corpus, removeWords, stopwords("english"))
blogs_corpus <- tm_map(blogs_corpus, stemDocument, language = "english")
# Create 1 gram model.
blogs_one_gram_dtm <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = one_tokenizer))
blogs_one_gram_data <- data.frame(as.matrix(blogs_one_gram_dtm))
names(blogs_one_gram_data) <- c("Freq")
blogs_one_gram_data$Word <- row.names(blogs_one_gram_data)
blogs_one_gram_data <- blogs_one_gram_data[order(-blogs_one_gram_data$Freq),]
# Create 2 gram model.
blogs_two_gram_dtm <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = two_tokenizer))
blogs_two_gram_data <- data.frame(as.matrix(blogs_two_gram_dtm))
names(blogs_two_gram_data) <- c("Freq")
blogs_two_gram_data$Word <- row.names(blogs_two_gram_data)
blogs_two_gram_data <- blogs_two_gram_data[order(-blogs_two_gram_data$Freq),]
# Create 3 gram model.
blogs_three_gram_dtm <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = three_tokenizer))
blogs_three_gram_data <- data.frame(as.matrix(blogs_three_gram_dtm))
names(blogs_three_gram_data) <- c("Freq")
blogs_three_gram_data$Word <- row.names(blogs_three_gram_data)
blogs_three_gram_data <- blogs_three_gram_data[order(-blogs_three_gram_data$Freq),]
# Create 4 gram model.
blogs_four_gram_dtm <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = four_tokenizer))
blogs_four_gram_data <- data.frame(as.matrix(blogs_four_gram_dtm))
names(blogs_four_gram_data) <- c("Freq")
blogs_four_gram_data$Word <- row.names(blogs_four_gram_data)
blogs_four_gram_data <- blogs_four_gram_data[order(-blogs_four_gram_data$Freq),]
Now we repeat the same operations for the news and twitter corpora to obtain their n-gram model data (code is omitted for brevity).
We start the exploratory data analysis by creating the frequency distributions of the number of words and the number of characters per line, as obtained before the NLP data processing.
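Since the same steps are repeated for each corpus, they could be folded into one helper. The sketch below is a base R stand-in and not the tm/RWeka pipeline used above: it skips stop-word removal and stemming, and it builds n-grams line by line, whereas the code above treats the whole sample as a single document. The names `clean_lines` and `ngram_frequency` are hypothetical.

```r
# Rough base R equivalent of the cleaning steps (no stop words, no stemming).
clean_lines <- function(lines) {
  lines <- tolower(lines)
  lines <- gsub("[[:digit:]]+", "", lines)
  lines <- gsub("[[:punct:]]+", "", lines)
  trimws(gsub("\\s+", " ", lines))
}

# Build a frequency table (Freq, Word) of the n-grams found in the lines.
ngram_frequency <- function(lines, n) {
  grams <- unlist(lapply(clean_lines(lines), function(line) {
    words <- strsplit(line, " ", fixed = TRUE)[[1]]
    if (length(words) < n) return(character(0))
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(Freq = integer(0), Word = character(0), stringsAsFactors = FALSE))
  }
  counts <- table(grams)
  result <- data.frame(Freq = as.integer(counts), Word = names(counts),
                       stringsAsFactors = FALSE)
  # Most frequent first; ties are broken alphabetically.
  result <- result[order(-result$Freq, result$Word), ]
  rownames(result) <- NULL
  result
}

toy <- c("The cat sat.", "The cat ran!", "A dog sat")
bigrams <- ngram_frequency(toy, 2)
print(bigrams)
```

With a helper of this kind, each corpus needs only one call per n-gram order.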
# Create the data frame distributions
blogs_lines_table <- data.frame(type = 'blogs', words = rep(0, blogs_lines_count), chars = rep(0, blogs_lines_count))
blogs_lines_table$words <- matrix(unlist(blogs_lines_words_count))
blogs_lines_table$chars <- matrix(unlist(blogs_lines_chars_count))
news_lines_table <- data.frame(type = 'news', words = rep(0, news_lines_count), chars = rep(0, news_lines_count))
news_lines_table$words <- matrix(unlist(news_lines_words_count))
news_lines_table$chars <- matrix(unlist(news_lines_chars_count))
twitter_lines_table <- data.frame(type = 'twitter', words = rep(0, twitter_lines_count), chars = rep(0, twitter_lines_count))
twitter_lines_table$words <- matrix(unlist(twitter_lines_words_count))
twitter_lines_table$chars <- matrix(unlist(twitter_lines_chars_count))
documents_lines_table <- rbind(blogs_lines_table, news_lines_table, twitter_lines_table)
documents_lines_table$type <- as.factor(documents_lines_table$type)
The next figure shows the frequency distributions. Because of Twitter's limit on message length, the line lengths for twitter follow a unimodal, approximately normal distribution, whereas blogs and news follow unimodal distributions with long right tails.
Next we present some basic statistical summaries of the data displayed:
Words per line:
| Statistic | Blogs | News | Twitter |
| Min. | 1.0 | 1.00 | 1.00 |
| 1st Qu. | 9.0 | 20.00 | 7.00 |
| Median | 29.0 | 32.00 | 12.00 |
| Mean | 42.5 | 35.35 | 12.89 |
| 3rd Qu. | 61.0 | 47.00 | 19.00 |
| Max. | 6851.0 | 1928.00 | 46.00 |
Characters per line:
| Statistic | Blogs | News | Twitter |
| Min. | 1 | 1.0 | 2.00 |
| 1st Qu. | 47 | 110.0 | 37.00 |
| Median | 156 | 185.0 | 64.00 |
| Mean | 230 | 201.2 | 68.68 |
| 3rd Qu. | 329 | 268.0 | 100.00 |
| Max. | 40833 | 11384.0 | 140.00 |
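Summaries of this kind can be obtained by applying `summary()` to each document type. The sketch below shows one way to do so on a toy table with the same column names as `documents_lines_table`. The values are invented for illustration, and the report may have produced its summaries differently.

```r
# Toy table with the same columns as documents_lines_table (invented values).
toy_table <- data.frame(
  type  = factor(c("blogs", "blogs", "blogs", "twitter", "twitter", "twitter")),
  words = c(5, 40, 120, 3, 12, 20),
  chars = c(30, 210, 700, 15, 64, 110)
)

# One six-number summary per document type, for words and for characters.
words_summary <- tapply(toy_table$words, toy_table$type, summary)
chars_summary <- tapply(toy_table$chars, toy_table$type, summary)

print(words_summary)
print(chars_summary)
```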
The next section analyzes the n-gram model data produced by the NLP tools.
Let's create the data for the 1-gram model and obtain the most frequent words that appear in the documents.
# Aggregation and word cloud visualization of one gram data models.
one_gram_data <- merge(blogs_one_gram_data, news_one_gram_data, by='Word', all=TRUE)
one_gram_data <- merge(one_gram_data, twitter_one_gram_data, by='Word', all=TRUE)
one_gram_data[is.na(one_gram_data)] <- 0
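# Sum all three sources: after the two merges the columns are Freq.x (blogs), Freq.y (news) and Freq (twitter).
one_gram_data$Freq <- as.integer(one_gram_data$Freq.x + one_gram_data$Freq.y + one_gram_data$Freq)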
one_gram_data <- one_gram_data[order(-one_gram_data$Freq), c('Freq','Word')]
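Chained `merge()` calls produce suffixed columns (`Freq.x`, `Freq.y`) that are easy to lose track of. A possible alternative, sketched here on toy tables, is to stack the per-document tables and sum by word. The `*_toy` data frames, `stacked` and `combined` are hypothetical names with invented counts.

```r
# Toy frequency tables with the same layout (Freq, Word) as the n-gram data.
blogs_toy   <- data.frame(Freq = c(5, 2), Word = c("time", "day"),  stringsAsFactors = FALSE)
news_toy    <- data.frame(Freq = c(3, 4), Word = c("time", "year"), stringsAsFactors = FALSE)
twitter_toy <- data.frame(Freq = c(7, 1), Word = c("day", "year"),  stringsAsFactors = FALSE)

# Stack the tables and add up the counts of each word across sources.
stacked <- rbind(blogs_toy, news_toy, twitter_toy)
combined <- aggregate(Freq ~ Word, data = stacked, FUN = sum)
combined <- combined[order(-combined$Freq), c("Freq", "Word")]
print(combined)
```

Words missing from one source need no special handling here, since they simply contribute no rows to the sum.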
To show the most frequent words visually, we next make a bar plot. The distribution of the 25 most frequent words in the documents is:
After building a term-document matrix, we can show the importance of words with a word cloud. The 1-gram word cloud is as follows:
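A bar plot of this kind can be sketched as follows. The example uses base graphics and toy counts to stay dependency-free. The plotting code used in the report is not shown, so this is an assumption about the general approach rather than a reproduction. `term_counts`, `top_n` and `top_terms` are hypothetical names.

```r
# Toy frequency table with the same layout (Freq, Word) as one_gram_data.
term_counts <- data.frame(Freq = c(12, 30, 7, 21),
                          Word = c("day", "time", "love", "year"),
                          stringsAsFactors = FALSE)

# Keep the top_n most frequent terms (the report uses 25).
top_n <- 3
top_terms <- head(term_counts[order(-term_counts$Freq), ], top_n)

# Horizontal bars, most frequent term at the top.
barplot(rev(top_terms$Freq), names.arg = rev(top_terms$Word),
        horiz = TRUE, las = 1, xlab = "Frequency",
        main = "Most frequent terms")
```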
Finally we try to find clusters of words with hierarchical clustering. Sparse terms are removed first, so that the clustering plot is not crowded with words. Our approach is to build a separate clustering for each kind of document, so that we can analyze some differences between them by means of this grouping method. For each kind of document the distances between terms are calculated with dist() after scaling. The terms are clustered with hclust() and the dendrogram is cut into 10 clusters. The agglomeration method is Ward's method ("ward.D2" in hclust()), which at each step merges the two clusters whose union gives the smallest increase in within-cluster variance.
blogs_one_gram_dtm_sparse <- removeSparseTerms(blogs_one_gram_dtm, sparse=0.95)
blogs_one_gram_dtm_matrix <- as.matrix(blogs_one_gram_dtm_sparse)
blogs_one_gram_dtm_matrix_distance <- dist(scale(blogs_one_gram_dtm_matrix))
blogs_one_gram_dtm_matrix_distance_fit <- hclust(blogs_one_gram_dtm_matrix_distance, method="ward.D2")
blogs_one_gram_dtm_matrix_distance_dendogram <- as.dendrogram(blogs_one_gram_dtm_matrix_distance_fit)
# plot(cut(blogs_one_gram_dtm_matrix_distance_dendogram, h = as.integer(max(blogs_one_gram_data$Freq)))$lower[[2]], main="Most frequently words cluster dendogram tree")
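The same steps can be tried on a small matrix to see how terms are grouped. The sketch below uses an invented term-by-document matrix and cuts the tree into 3 groups with `cutree()` instead of the 10 used in the report. `toy_matrix`, `toy_fit` and `toy_groups` are hypothetical names.

```r
# Invented term-by-document counts: three pairs of terms with similar profiles.
toy_matrix <- rbind(
  term_a1 = c(10, 9, 1, 0),
  term_a2 = c(9, 10, 0, 1),
  term_b1 = c(1, 0, 10, 9),
  term_b2 = c(0, 1, 9, 10),
  term_c1 = c(5, 5, 5, 5),
  term_c2 = c(5, 4, 5, 6)
)
colnames(toy_matrix) <- paste0("doc", 1:4)

# Scale the columns, compute Euclidean distances between terms and cluster them.
toy_distance <- dist(scale(toy_matrix))
toy_fit <- hclust(toy_distance, method = "ward.D2")

# Assign every term to one of k groups.
toy_groups <- cutree(toy_fit, k = 3)
print(toy_groups)
```

Note that `scale()` returns NaN for any column with zero variance, so constant columns have to be dropped before the distances are computed.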
At this point we repeat the same operations to obtain the 2-gram model data and the distribution of the 25 most frequent word pairs in the documents:
Again, the word cloud:
The 3-gram model data and the distribution of the 25 most frequent word triples in the documents:
The word cloud:
The 4-gram model data and the distribution of the 25 most frequent word quadruples in the documents:
The word cloud:
Here we discuss our plan for building the prediction model. As suggested in the capstone description, for the moment we will focus on the data obtained from the 1-gram through 4-gram models, that is, on Markov chain models in which the conditional probability of a word depends only on a fixed number of preceding words. To build such an n-gram prediction model we use selected training sentences to compute the n-gram probability estimates (training) and assign non-zero probability to unseen n-grams (smoothing), so that word sequences that are merely absent from the training sample do not receive zero probability.
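In standard notation, the Markov assumption and the unsmoothed maximum-likelihood estimate described above can be written as follows, where $C(\cdot)$ denotes the number of times a word sequence occurs in the training data. The symbols $w_i$, $n$ and $C$ are introduced here for illustration and do not appear elsewhere in this report.

```latex
% Markov assumption: only the previous n-1 words matter.
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})

% Maximum-likelihood estimate from n-gram counts.
P_{\mathrm{MLE}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
  = \frac{C(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}
```

The estimate is zero whenever the n-gram in the numerator was never observed, which is the case that smoothing is meant to handle.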
In developing this prediction model we will analyze some of the following possible approaches:
- Good-Turing estimator: use n-grams seen once to estimate n-grams never seen, and so on.
- Linear MLE interpolation: mix n-gram models of different orders to offset sparsity.
- Backoff models: consult different models in order, depending on specificity.
We will also explore a different approach based on a deeper clustering analysis, using new statistical parameters that measure the differences between document types, and use this information to better adjust the prediction models constructed for each document type.
A Shiny app will be developed to show this prediction model working online.
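As a rough illustration of the backoff idea, the sketch below looks for the longest context that has been seen and falls back to shorter ones. It is a simplified lookup on toy counts with no discounting, so it is neither Katz backoff nor a Good-Turing estimator. The count vectors and the function `predict_next` are hypothetical.

```r
# Toy n-gram counts; names and values are illustrative only.
unigrams <- c("the" = 5, "cat" = 3, "sat" = 2)
bigrams  <- c("the cat" = 3, "cat sat" = 2, "the dog" = 1)
trigrams <- c("the cat sat" = 2)
ngram_tables <- list(unigrams, bigrams, trigrams)

# Predict the next word from the longest context that has been observed.
predict_next <- function(history, tables) {
  words <- strsplit(trimws(tolower(history)), "\\s+")[[1]]
  max_context <- min(length(words), length(tables) - 1)
  # Try the longest available context first, then back off to shorter ones.
  for (k in seq(max_context, 0)) {
    # With no usable context, fall back to the most frequent single word.
    if (k == 0) return(names(which.max(tables[[1]])))
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    candidates <- tables[[k + 1]]
    matches <- candidates[startsWith(names(candidates), prefix)]
    if (length(matches) > 0) {
      best <- names(which.max(matches))
      return(sub(prefix, "", best, fixed = TRUE))
    }
  }
}

print(predict_next("I saw the cat", ngram_tables))
print(predict_next("the", ngram_tables))
print(predict_next("a dog", ngram_tables))
```

A full model would also redistribute probability mass when backing off, which this lookup does not attempt.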