Language Model Milestone Report

In this milestone report we briefly walk through an exploratory analysis of the major features of the data that will be used to train the model. For this report, we will work with the datasets blogs_final.txt and news_final.txt. These datasets are preprocessed subsets of the originally provided en_US.blogs.txt and en_US.news.txt. Specifically, the datasets have been sampled with a probability of 0.2 for efficiency, profanity has been filtered out, punctuation and digits have been removed, and capital letters have been converted to lowercase.
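The preprocessing script itself is not part of this report. As a rough sketch of the steps described above, the cleaning and sampling could look like the following. The toy input `raw_lines`, the placeholder profanity list `bad_words`, the helper `clean_lines`, the seed, the order of the steps, and the choice to replace removed characters with a space are all assumptions made for illustration, not the original code.

```r
set.seed(123)  # assumed seed, only so the toy sample is reproducible

# Hypothetical stand-ins for the raw file contents and the profanity list
raw_lines <- c("Hello, World! 123", "It's 5 o'clock.",
               "Some darn text...", "Plain line")
bad_words <- c("darn")

# Hypothetical helper: lowercase, drop listed words, strip punctuation/digits
clean_lines <- function(x, bad_words) {
  x <- tolower(x)
  pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)        # remove listed profanity
  x <- gsub("[[:punct:][:digit:]]+", " ", x)     # punctuation and digits -> space
  trimws(gsub(" +", " ", x))                     # collapse repeated spaces
}

cleaned <- clean_lines(raw_lines, bad_words)

# Keep each line with probability 0.2 (one Bernoulli draw per line)
keep <- rbinom(length(cleaned), size = 1, prob = 0.2) == 1
sampled <- cleaned[keep]
```

In this sketch the second toy line becomes "it s o clock", which shows how replacing apostrophes with spaces breaks contractions into separate tokens.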

Let’s now load the libraries we need, read in these datasets, and tokenize them to proceed with our exploratory data analysis.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tokenizers)

# Load the preprocessed datasets, one element per line of text
news_final <- readLines("final_sampled/news_final.txt")
blogs_final <- readLines("final_sampled/blogs_final.txt")

# Tokenize each line into a vector of words
news_tokens <- tokenize_words(news_final)
blogs_tokens <- tokenize_words(blogs_final)

Let’s now take a look at the 50 most common words:

library(ggplot2)

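# Named word_tokens rather than tokens to avoid clashing with quanteda's tokens() later on
word_tokens <- c(news_tokens, blogs_tokens)
word_counts <- table(unlist(word_tokens))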

word_frequency <- as.data.frame(word_counts, stringsAsFactors = FALSE)
colnames(word_frequency) <- c("word","frequency")
word_frequency <- word_frequency[order(-word_frequency$frequency), ]
top50 <- head(word_frequency, 50)

top50plot <- ggplot(top50, aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  xlab("Words") +
  ylab("Frequency") +
  ggtitle("Top 50 Most Common Words") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

top50plot

Next, let’s take a look at the frequency of the bigrams:

library(tidyr)
library(stringr)
library(tidytext)

## Warning: package 'tidytext' was built under R version 4.3.2

library(quanteda)

## Warning: package 'quanteda' was built under R version 4.3.2

## Package version: 4.0.2
## Unicode version: 14.0
## ICU version: 71.1

## Parallel computing: disabled

## See https://quanteda.io for tutorials and examples.

all_tokens <- unlist(c(news_tokens, blogs_tokens))
combined_tokens <- paste(all_tokens, collapse = " ")
tokens_object <- tokens(combined_tokens)

bigrams <- tokens_ngrams(tokens_object, n = 2)
saveRDS(bigrams, file = "bigrams.rds")
bigrams_strings <- unlist(bigrams)
bigram_df <- as.data.frame(table(bigrams_strings), stringsAsFactors = FALSE)
colnames(bigram_df) <- c("bigram","frequency")

bigram_df <- bigram_df %>% arrange(desc(frequency))
top30bigrams <- head(bigram_df, 30)

ggplot(top30bigrams, aes(x = reorder(bigram, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  xlab("Bigrams") +
  ylab("Frequency") +
  ggtitle("Top 30 Most Common Bigrams") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
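Because the code above collapses every line into one long string before building n-grams, some bigrams join the last word of one line with the first word of the next. As a minimal base-R sketch of how bigrams could instead be formed line by line, consider the following. The helper `line_bigrams` and the toy input are hypothetical, and the `_` separator is chosen to mirror quanteda's default concatenator.

```r
# Hypothetical helper: adjacent word pairs from one vector of words
line_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))   # a single word has no bigram
  paste(head(words, -1), tail(words, -1), sep = "_")
}

# Toy stand-in for a tokenized corpus: one character vector per line
toy_tokens <- list(c("i", "love", "r"), c("r", "is", "fun"), "ok")

# Build bigrams within each line, then count them across all lines
toy_bigrams <- unlist(lapply(toy_tokens, line_bigrams))
toy_counts <- sort(table(toy_bigrams), decreasing = TRUE)
```

Here no "r_r" bigram appears, even though the first line ends with "r" and the second starts with it. With quanteda, a similar effect should be achievable by passing the lines to `tokens()` as separate documents instead of one collapsed string, since n-grams are built within each document.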

We then repeat the same analysis for trigrams:

trigrams <- tokens_ngrams(tokens_object, n = 3)
saveRDS(trigrams, file = "trigrams.rds")
trigrams_strings <- unlist(trigrams)
trigram_df <- as.data.frame(table(trigrams_strings), stringsAsFactors = FALSE)
colnames(trigram_df) <- c("trigram","frequency")

trigram_df <- trigram_df %>% arrange(desc(frequency))
top30trigrams <- head(trigram_df, 30)

ggplot(top30trigrams, aes(x = reorder(trigram, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  xlab("Trigrams") +
  ylab("Frequency") +
  ggtitle("Top 30 Most Common Trigrams") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Finally, let’s determine how many distinct words we need to cover 50% and 90% of all word instances:

word_freq <- dfm(tokens_object) %>% colSums() %>% sort(decreasing = TRUE)
word_freq_df <- data.frame(word = names(word_freq), frequency = as.vector(word_freq))
word_freq_df <- word_freq_df %>% mutate(cumulative_frequency = cumsum(frequency),
                                        total_words = sum(frequency),
                                        coverage = cumulative_frequency/total_words)

# Count the words whose cumulative coverage is still at or below each threshold
num_words_50 <- nrow(word_freq_df %>% filter(coverage <= 0.5))
num_words_90 <- nrow(word_freq_df %>% filter(coverage <= 0.9))

cat("Number of words to cover 50% of all word instances:", num_words_50, "\n")

## Number of words to cover 50% of all word instances: 126

cat("Number of words to cover 90% of all word instances:", num_words_90, "\n")

## Number of words to cover 90% of all word instances: 7005
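Note that the filter `coverage <= 0.5` counts the words whose cumulative coverage has not yet passed the threshold, which can be one short of the number of words needed to actually reach it. A minimal sketch of the alternative, using a hypothetical helper `words_to_cover` and made-up counts, could look like this:

```r
# Hypothetical helper: smallest number of top-ranked words whose
# cumulative share of all word instances reaches at least p
words_to_cover <- function(freq, p) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)
  unname(which(coverage >= p)[1])   # index of the first word reaching p
}

# Made-up frequencies, for illustration only
toy_freq <- c(the = 50, of = 30, and = 15, zebra = 5)

n50 <- words_to_cover(toy_freq, 0.5)
n90 <- words_to_cover(toy_freq, 0.9)

# For comparison: the "<=" style of counting on the same toy data
n90_at_or_below <- sum(cumsum(toy_freq) / sum(toy_freq) <= 0.9)
```

On this toy data the cumulative coverage is 0.5, 0.8, 0.95 and 1, so three words are needed to reach 90%, while the `<=` count gives two. On a large corpus the difference is at most one word, so the figures reported above remain a good approximation.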

These plots and summary figures give us valuable insight into the structure of our dataset, which we can use when building our model. There is still much left to do. For example, the bigrams need fixing: removing punctuation split contractions like “I’m” and “don’t” into the bigrams “i” & “m” and “don” & “t”, and the trigrams are affected in a similar way. For now, this concludes our exploratory data analysis.
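One possible way to keep contractions intact in a later iteration is to exclude the apostrophe from the characters that get removed. The sketch below assumes that both straight and curly apostrophes should be preserved and that only the letters a to z are kept, which would also drop accented letters. The helper `strip_punct_keep_apostrophes` is hypothetical and is not part of the preprocessing used for this report.

```r
# Hypothetical helper: remove punctuation and digits but keep apostrophes
strip_punct_keep_apostrophes <- function(x) {
  x <- tolower(x)
  x <- gsub("\u2019", "'", x, fixed = TRUE)   # normalise curly apostrophes
  x <- gsub("[^a-z' ]+", " ", x)              # keep only letters, ' and spaces
  trimws(gsub(" +", " ", x))                  # collapse repeated spaces
}

examples <- c("I'm sure you don't mind!", "It's 5 o'clock...")
kept <- strip_punct_keep_apostrophes(examples)
```

With this approach "i'm" and "don't" survive the cleaning step as single strings. Whether they also stay single tokens depends on the tokenizer used afterwards, so that would need to be checked before regenerating the n-grams.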

2024-07-31