In this milestone report we briefly present the exploratory analysis
of the major features of the data that will be used to train the
model. For this report, we will work with the datasets
blogs_final.txt and news_final.txt. These
datasets are preprocessed subsets of the originally provided
en_US.blogs.txt and en_US.news.txt.
Specifically, the datasets have been sampled with a probability of
0.2 for efficiency, profanity has been filtered out, punctuation and
digits have been removed, and capital letters have been converted to
lowercase.
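As an illustration of the preprocessing steps listed above, here is a minimal sketch in base R. It is not the script that produced the two files: the helper `clean_text`, the toy `raw_lines`, the one-word profanity list, and the seed are all assumptions made for this example, and replacing punctuation with a space (rather than deleting it) is likewise an assumption.

```r
# Hypothetical helper, not the original preprocessing script.
clean_text <- function(x, profanity = character(0)) {
  x <- tolower(x)                            # capital letters -> lowercase
  x <- gsub("[[:punct:]]", " ", x)           # punctuation -> space (assumption)
  x <- gsub("[[:digit:]]", " ", x)           # digits -> space
  if (length(profanity) > 0) {
    # Remove whole words that appear in the (made-up) profanity list
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  gsub("\\s+", " ", trimws(x))               # squeeze repeated whitespace
}

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
raw_lines <- c("I'm 25, OK?", "Darn, it's 9 PM!", "Hello World")

# Keep each line independently with probability 0.2
keep <- runif(length(raw_lines)) < 0.2
sampled <- raw_lines[keep]
cleaned <- clean_text(sampled, profanity = c("darn"))

# Cleaning a single line directly:
clean_text("I'm 25, OK?")  # "i m ok"
```

Note that under these assumptions a contraction such as "I'm" comes out as the two words "i" and "m".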
Let’s now load the libraries we need, read in these datasets, and tokenize them to proceed with our exploratory data analysis.
library(dplyr)
library(tokenizers)
# Load the datasets
news_final <- readLines("final_sampled/news_final.txt")
blogs_final <- readLines("final_sampled/blogs_final.txt")
# Tokenize the data into words
news_tokens <- tokenize_words(news_final)
blogs_tokens <- tokenize_words(blogs_final)
Let’s now take a look at the 50 most common words:
library(ggplot2)
# Combine both sources and count how often each word occurs
# (named word_tokens to avoid confusion with quanteda's tokens() function)
word_tokens <- c(news_tokens, blogs_tokens)
word_counts <- table(unlist(word_tokens))
word_frequency <- as.data.frame(word_counts, stringsAsFactors = FALSE)
colnames(word_frequency) <- c("word", "frequency")
word_frequency <- word_frequency[order(-word_frequency$frequency), ]
top50 <- head(word_frequency, 50)
# Bar chart of the 50 most frequent words, in descending order
top50plot <- ggplot(top50, aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  xlab("Words") +
  ylab("Frequency") +
  ggtitle("Top 50 Most Common Words") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
top50plot
Next, let’s take a look at the frequency of the bigrams:
library(tidyr)
library(stringr)
library(tidytext)
library(quanteda)
# Flatten both token lists and rebuild a single quanteda tokens object
all_tokens <- unlist(c(news_tokens, blogs_tokens))
combined_tokens <- paste(all_tokens, collapse = " ")
tokens_object <- tokens(combined_tokens)
# Build the bigrams and cache them for later use
bigrams <- tokens_ngrams(tokens_object, n = 2)
saveRDS(bigrams, file = "bigrams.rds")
# Count each bigram and keep the 30 most frequent
bigrams_strings <- unlist(bigrams)
bigram_df <- as.data.frame(table(bigrams_strings), stringsAsFactors = FALSE)
colnames(bigram_df) <- c("bigram", "frequency")
bigram_df <- bigram_df %>% arrange(desc(frequency))
top30bigrams <- head(bigram_df, 30)
ggplot(top30bigrams, aes(x = reorder(bigram, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  xlab("Bigrams") +
  ylab("Frequency") +
  ggtitle("Top 30 Most Common Bigrams") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
And the same for trigrams:
# Build the trigrams from the same tokens object and cache them
trigrams <- tokens_ngrams(tokens_object, n = 3)
saveRDS(trigrams, file = "trigrams.rds")
# Count each trigram and keep the 30 most frequent
trigrams_strings <- unlist(trigrams)
trigram_df <- as.data.frame(table(trigrams_strings), stringsAsFactors = FALSE)
colnames(trigram_df) <- c("trigram", "frequency")
trigram_df <- trigram_df %>% arrange(desc(frequency))
top30trigrams <- head(trigram_df, 30)
ggplot(top30trigrams, aes(x = reorder(trigram, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  xlab("Trigrams") +
  ylab("Frequency") +
  ggtitle("Top 30 Most Common Trigrams") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
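To make clear what a bigram or trigram is in the counts above, here is a small base R sketch that slides a window of n words over a word vector. The helper `make_ngrams` and the toy `words` vector are hypothetical and are not how quanteda implements `tokens_ngrams`; the underscore separator is chosen only to resemble its output.

```r
# Hypothetical helper: build n-grams by sliding a window over a word vector.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = "_"),
    character(1)
  )
}

# Toy input: the words left after "I don't know" has lost its apostrophe
words <- c("i", "don", "t", "know")

make_ngrams(words, 2)  # "i_don" "don_t" "t_know"
make_ngrams(words, 3)  # "i_don_t" "don_t_know"

# Counting works the same way as in the report: tabulate and sort
sort(table(make_ngrams(words, 2)), decreasing = TRUE)
```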
Finally, let’s determine how many distinct words we need to cover 50% and 90% of all word instances:
# Word frequencies, sorted from most to least frequent
word_freq <- dfm(tokens_object) %>% colSums() %>% sort(decreasing = TRUE)
word_freq_df <- data.frame(word = names(word_freq), frequency = as.vector(word_freq))
# Cumulative share of all word instances covered by the top-k words
word_freq_df <- word_freq_df %>%
  mutate(cumulative_frequency = cumsum(frequency),
         total_words = sum(frequency),
         coverage = cumulative_frequency / total_words)
# Count the words whose cumulative coverage is at or below each threshold
num_words_50 <- nrow(word_freq_df %>% filter(coverage <= 0.5))
num_words_90 <- nrow(word_freq_df %>% filter(coverage <= 0.9))
cat("Number of words to cover 50% of all word instances:", num_words_50, "\n")
## Number of words to cover 50% of all word instances: 126
cat("Number of words to cover 90% of all word instances:", num_words_90, "\n")
## Number of words to cover 90% of all word instances: 7005
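The coverage calculation can be traced by hand on a small example. The sketch below uses made-up counts (`toy_freq`) and two hypothetical helpers. It also shows a subtlety: counting the words whose cumulative coverage is at or below a threshold, as done above, gives the number of words before the threshold is crossed, so the number of words needed to actually reach it may be one higher unless the coverage lands exactly on the threshold.

```r
# Made-up word counts, already sorted in decreasing order
toy_freq <- c(the = 4, of = 3, cat = 2, dog = 1)

# Cumulative share of all word instances: 0.4, 0.7, 0.9, 1.0
coverage <- cumsum(toy_freq) / sum(toy_freq)

# Hypothetical helper: words whose cumulative coverage is <= target
count_at_or_below <- function(coverage, target) {
  sum(coverage <= target)
}

# Hypothetical helper: smallest number of top words that reaches target
words_needed <- function(coverage, target) {
  unname(which(coverage >= target)[1])
}

count_at_or_below(coverage, 0.5)  # 1 ("the" alone covers only 40%)
words_needed(coverage, 0.5)       # 2 ("the" and "of" cover 70%)
```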
These plots and figures give us valuable insight into the structure of our dataset that we can use when building our model. There is still much left to be done. In particular, the bigrams need fixing: removing punctuation turned words like “I’m” and “don’t” into the bigrams “i” & “m” and “don” & “t”, and the trigrams are affected in the same way. For now, this concludes our exploratory data analysis.
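One possible way to address the contraction problem is to keep apostrophes during cleaning. The following is only a sketch under stated assumptions: the helper `clean_keep_apostrophes` is hypothetical, it handles only the right curly apostrophe and the straight one, and it drops every character outside a-z, so accented letters would be lost as well.

```r
# Hypothetical helper: clean text but keep apostrophes inside words.
clean_keep_apostrophes <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)  # curly -> straight apostrophe
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)              # drop everything except letters, ' and space
  gsub("\\s+", " ", trimws(x))               # squeeze repeated whitespace
}

cleaned <- clean_keep_apostrophes("I\u2019m sure you don\u2019t, OK?")
words <- strsplit(cleaned, " ", fixed = TRUE)[[1]]
words  # "i'm" "sure" "you" "don't" "ok"
```

With contractions kept as single words, "i'm" and "don't" would no longer show up as the spurious bigrams "i m" and "don t".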