1. Summary of Corpus Files

The text datasets come from three sources: blogs, news articles, and Twitter. Each line in each dataset represents a distinct entry (e.g. a sentence, a paragraph, or a tweet). Table 1 presents descriptive statistics for each file, and Figure 1 shows the same summary statistics visually. Two observations stand out: (1) the Twitter dataset is the smallest by character and word count, yet has by far the most lines; (2) the news dataset has the longest mean word length, while the Twitter dataset has the shortest.

Table 1: Summary Statistics Calculated for Each US English Corpus File
| Document | Character Count | Word Count | Line Count | Mean Line Length (Characters) | Mean Line Length (Words) | Mean Word Length |
| blogs    | 206,824,505     | 37,570,839 | 899,288    | 229.98695                     | 41.77843                 | 5.504921         |
| news     | 203,223,159     | 34,494,539 | 1,010,242  | 201.16285                     | 34.14483                 | 5.891459         |
| twitter  | 162,096,031     | 30,451,128 | 2,360,148  | 68.68045                      | 12.90221                 | 5.323154         |

2. Data Processing and Exploratory Data Analysis

Because of resource constraints, the exploratory analysis uses random 1% samples of each corpus file. The samples are combined into a single corpus, which is then split into tokens (words), with punctuation, numbers, URLs, and profanities removed. All tokens are converted to lowercase, and any token that appears fewer than five times in the corpus is discarded. From this tokenized corpus, two further tokenized versions are derived by grouping tokens into 2-grams (overlapping word pairs) and 3-grams (overlapping word triplets). For each tokenized dataset, two frequency tables listing each unique token and its count are created: one including and one excluding stopwords.
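
To illustrate what the 2-gram and 3-gram grouping produces, the short sketch below applies the same quanteda functions used in Appendix A to a single made-up sentence; the sentence and object names are purely illustrative.

library(quanteda)
# a one-sentence example, tokenized and lowercased as in the main analysis
toy_tokens <- tokens("The quick brown fox jumps over the lazy dog",
                     remove_punct = TRUE)
toy_tokens <- tokens_tolower(toy_tokens)
# drop English stopwords ("the", "over") to mirror the no-stopwords variant
toy_tokens_ns <- tokens_select(toy_tokens, stopwords("english"),
                               selection = "remove")
# overlapping word pairs and triplets
tokens_ngrams(toy_tokens_ns, n = 2)
# "quick_brown" "brown_fox" "fox_jumps" "jumps_lazy" "lazy_dog"
tokens_ngrams(toy_tokens_ns, n = 3)
# "quick_brown_fox" "brown_fox_jumps" "fox_jumps_lazy" "jumps_lazy_dog"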

Table 2 illustrates the effect of removing stopwords. Although stopwords account for only a small number of unique tokens, removing them reduces the total token count considerably: stopwords occur very frequently but carry little meaning on their own.

Table 2: Total and Unique Token Counts
| Stopwords Included | Total Tokens | Unique Tokens |
| Yes                | 931,129      | 12,702        |
| No                 | 481,677      | 12,532        |

3. N-Gram Token Frequency (Stopwords Removed)

Figure 2 shows a wordcloud generated from the 1-gram no-stopwords frequency matrix. Figures 3, 4 and 5 display the top 30 no-stopwords 1-, 2- and 3-grams by frequency, respectively.

Figure 2: Wordcloud (Stopwords Removed)

4. N-Gram Token Frequency (Stopwords Included)

Figure 6 shows a wordcloud generated from the 1-gram stopword-inclusive frequency matrix. Figures 7, 8 and 9 display the top 30 stopword-inclusive 1-, 2- and 3-grams by frequency, respectively. It can be seen from these figures that stopwords are indeed used very frequently.

Figure 6: Wordcloud (Stopwords Included)

5. Relationship Between Unique Features and Corpus Coverage Levels

Table 3 lists the proportion of unique tokens needed to cover 50%, 90%, 98% and 99% of the corpus with stopwords removed. Figure 10 shows the relationship between the coverage of the no-stopwords corpus and the proportion of unique tokens needed.

Table 3: Proportions of Unique Features Required for Certain Coverage Levels of the Corpus (Stopwords Removed)
| Proportion of Corpus Covered | Proportion of Unique Features Required |
| 0.50                         | 0.0573731                              |
| 0.90                         | 0.4838813                              |
| 0.98                         | 0.8503032                              |
| 0.99                         | 0.9231567                              |

Table 4 lists the proportion of unique tokens needed to cover 50%, 90%, 98% and 99% of the corpus with stopwords included. Figure 11 shows the relationship between the coverage of the stopwords-inclusive corpus and the proportion of unique tokens needed. Compared to the no-stopwords corpus, fewer unique features are required to meet each coverage level, due to the very high frequency of many stopwords.

Table 4: Proportions of Unique Features Required for Certain Coverage Levels of the Corpus (Stopwords Included)
| Proportion of Corpus Covered | Proportion of Unique Features Required |
| 0.50                         | 0.0083451                              |
| 0.90                         | 0.2902692                              |
| 0.98                         | 0.7410644                              |
| 0.99                         | 0.8565580                              |
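
For reference, a coverage curve like those shown in Figures 10 and 11 can also be computed directly from a frequency table sorted in decreasing order, using a cumulative sum instead of counting features one at a time. The sketch below illustrates the calculation on a small hypothetical frequency vector (the numbers are made up); the appendix uses an equivalent loop-based function over the real corpus frequencies.

# hypothetical frequencies of five features, sorted in decreasing order
freqs <- c(50, 25, 15, 7, 3)
# proportion of the corpus covered by the top k features, for each k
cum_coverage <- cumsum(freqs) / sum(freqs)
# smallest number of features whose combined frequency reaches each target level
targets <- c(0.50, 0.90, 0.98, 0.99)
features_required <- sapply(targets, function(p) which(cum_coverage >= p)[1])
# express as a proportion of all unique features
features_required / length(freqs)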

6. Next Steps: Predictive Modelling

The final goal of this analysis is to build a model that accurately predicts the next word from the preceding words while minimising resource usage, since the final product must run as a web app. Based on research and prior experience, three methods are being considered for the modelling stage. Stupid Backoff is the simplest, and with sufficient data it is known to approach the accuracy of more sophisticated smoothing methods. Kneser-Ney Smoothing is a related but more complex method that may yield higher accuracy. A Recurrent Neural Network (RNN)-based language model is a more advanced option that could plausibly outperform both, but may also require considerably more resources. Experimentation will determine which method, or combination of methods, best meets these goals.
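
To make the Stupid Backoff option concrete, the sketch below scores candidate next words for the context "a great" using small hypothetical n-gram count tables. The counts, the context, and the score_candidate helper are illustrative assumptions only; what follows the published method is the scoring rule itself: the observed relative frequency of the highest-order n-gram, backing off to lower orders with a fixed factor of 0.4 when a count is missing.

# hypothetical n-gram counts (illustrative only), stored as named numeric vectors
unigram_counts <- c("great" = 200, "day" = 120, "time" = 95, "weekend" = 30)
bigram_counts  <- c("a_great" = 40, "great_day" = 12, "great_time" = 18)
trigram_counts <- c("a_great_day" = 5, "a_great_time" = 2)
total_unigrams <- sum(unigram_counts)
# Stupid Backoff score of a candidate word given the context "a great":
# relative trigram frequency if the trigram was seen, otherwise back off to
# lambda * bigram score, then lambda^2 * unigram score
score_candidate <- function(word, lambda = 0.4) {
  tri <- paste("a", "great", word, sep = "_")
  bi  <- paste("great", word, sep = "_")
  if (!is.na(trigram_counts[tri])) {
    unname(trigram_counts[tri] / bigram_counts["a_great"])
  } else if (!is.na(bigram_counts[bi])) {
    lambda * unname(bigram_counts[bi] / unigram_counts["great"])
  } else {
    lambda^2 * unname(unigram_counts[word] / total_unigrams)
  }
}
# rank some candidate next words by their back-off scores
candidates <- c("day", "time", "weekend")
sort(sapply(candidates, score_candidate), decreasing = TRUE)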

Appendix A: Source Code Listings

1. Summary of Corpus Files

# define path and generate full filenames
setwd("~/Dropbox/@Next/AI/JH_Capstone")
path <- "./data/final/en_US/"
docs <- c("blogs", "news", "twitter")
filenames <- sapply(docs, (function(f) { paste0(path, "en_US.", f, ".txt") }))
# read lines of each file
data <- sapply(filenames, readLines)
library(stringi)
# calculate statistics for each file
char_counts <- sapply(data, (function(c) { sum(nchar(c)) }))
word_counts <- sapply(data, (function(c) { sum(stri_stats_latex(c)[["Words"]]) }))
line_counts <- sapply(data, length)
mean_line_chars <- char_counts / line_counts
mean_line_words <- word_counts / line_counts
mean_word_length <- char_counts / word_counts
# generate data frame
file_stats <- data.frame(document = docs,
                         char.count = char_counts, 
                         word.count = word_counts, 
                         line.count = line_counts,
                         mean.line.chars = mean_line_chars,
                         mean.line.words = mean_line_words, 
                         mean.word.length = mean_word_length,
                         row.names = NULL)
library(knitr)
# generate column names for table
file_stats_table_col_names <- c("Document",
                                "Character Count",
                               "Word Count",
                               "Line Count",
                               "Mean Line Length (Characters)",
                               "Mean Line Length (Words)",
                               "Mean Word Length")
# generate table
kable(file_stats, 
      caption = "Table 1: Summary Statistics Calculated for Each US English Corpus File", 
      format.args = list(big.mark=","), 
      col.names = file_stats_table_col_names)
library(ggplot2)
library(grid)
library(gridExtra)
# barplot of character counts
char_count_plot <- ggplot(file_stats) + 
  aes(x = document, y = char.count, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Character Counts") +
  theme(legend.position="none")
# barplot of word counts
word_count_plot <- ggplot(file_stats) + 
  aes(x = document, y = word.count, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Word Counts") +
  theme(legend.position="none")
# barplot of line counts
line_count_plot <- ggplot(file_stats) + 
  aes(x = document, y = line.count, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Line Counts") +
  theme(legend.position="none")
# barplot of mean line lengths (characters)
mean_line_chars_plot <- ggplot(file_stats) + 
  aes(x = document, y = mean.line.chars, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Mean Line Lengths\n(Characters)") +
  theme(legend.position="none")
# barplot of mean line lengths (words)
mean_line_words_plot <- ggplot(file_stats) + 
  aes(x = document, y = mean.line.words, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Mean Line Lengths\n(Words)") +
  theme(legend.position="none")
# barplot of mean word lengths
mean_word_length_plot <- ggplot(file_stats) + 
  aes(x = document, y = mean.word.length, fill = document) + 
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("") + 
  ggtitle("Mean Word Lengths") +
  theme(legend.position="none")
# arrange plots into a grid
grid.arrange(char_count_plot, 
             word_count_plot, 
             line_count_plot, 
             mean_line_chars_plot, 
             mean_line_words_plot, 
             mean_word_length_plot, 
             ncol = 3, 
             bottom="Figure 1: Plots of Summary Statistics Calculated for Each US English Corpus File")

2. Data Processing and Exploratory Data Analysis

# fix the random seed so the 1% sample drawn below is reproducible
set.seed(20190101)
path <- "~/Dropbox/@Next/AI/JH_Capstone/data/"
for (doc in docs) {
  # get filename
  filename <- paste0(path, "final/en_US/en_US.", doc, ".txt")
  sample_filename <- paste0(path, "sample_", doc, ".txt")
  # read data
  data <- readLines(filename)
  # generate logical vector marking a random 1% of lines to keep
  keep <- as.logical(rbinom(length(data), 1, 0.01))
  # subset using this vector
  sample <- data[keep]
  # write the sample to a new file
  writeLines(sample, sample_filename)
}
library(readtext)
library(quanteda)
library(dplyr)
# load sample data
path <- "~/Dropbox/@Next/AI/JH_Capstone/"
sample_data <- readtext(paste(path, "data/*.txt", sep=""))
# create corpus
sample_corpus <- corpus(sample_data)
# load profanity dictionary
profanity <- read.csv(paste(path, "profanity.csv", sep=""), sep=";", header = FALSE, stringsAsFactors = FALSE)$V1
# generate tokens object
sample_tokens <- sample_corpus %>% 
  tokens(remove_punct=TRUE, remove_numbers=TRUE, remove_twitter=TRUE, remove_url=TRUE) %>%
  tokens_select(profanity, selection='remove') %>%
  tokens_tolower()
# generate dfm object
sample_dfm <- dfm(sample_tokens)
# generate list of features occurring 4 or fewer times
features_le4 <- textstat_frequency(sample_dfm) %>% filter(frequency <= 4) %>% .$feature
# remove these features from tokens and dfm objects
sample_tokens <- sample_tokens %>% tokens_select(features_le4, selection='remove')
sample_dfm <- dfm(sample_tokens)
# generate 2- and 3-gram tokens objects
sample_tokens_2gram <- tokens_ngrams(sample_tokens, 2)
sample_tokens_3gram <- tokens_ngrams(sample_tokens, 3)
# generate 2- and 3-gram dfm objects
sample_dfm_2gram <- dfm(sample_tokens_2gram)
sample_dfm_3gram <- dfm(sample_tokens_3gram)
# generate feature frequency matrices for 1-, 2- and 3-grams
freq_df <- textstat_frequency(sample_dfm) %>% mutate(feature = as.factor(feature))
freq_df_2gram <- textstat_frequency(sample_dfm_2gram) %>% mutate(feature = as.factor(feature))
freq_df_3gram <- textstat_frequency(sample_dfm_3gram) %>% mutate(feature = as.factor(feature))
# create token and dfm objects as above but with stopwords removed
sample_tokens_ns <- sample_tokens %>% tokens_select(stopwords("english"), selection = 'remove')
sample_dfm_ns <- dfm(sample_tokens_ns)
sample_tokens_2gram_ns <- tokens_ngrams(sample_tokens_ns, 2)
sample_tokens_3gram_ns <- tokens_ngrams(sample_tokens_ns, 3)
sample_dfm_2gram_ns <- dfm(sample_tokens_2gram_ns)
sample_dfm_3gram_ns <- dfm(sample_tokens_3gram_ns)
freq_df_ns <- textstat_frequency(sample_dfm_ns) %>% mutate(feature = as.factor(feature))
freq_df_2gram_ns <- textstat_frequency(sample_dfm_2gram_ns) %>% mutate(feature = as.factor(feature))
freq_df_3gram_ns <- textstat_frequency(sample_dfm_3gram_ns) %>% mutate(feature = as.factor(feature))
# calculate total tokens and features w/ and w/o stopwords
total_tokens <- sum(ntoken(sample_dfm))
total_features <- nfeat(sample_dfm)
total_tokens_ns <- sum(ntoken(sample_dfm_ns))
total_features_ns <- nfeat(sample_dfm_ns)
# generate and display table of total tokens and features w/ and w/o stopwords
fcount <- data.frame(stopwords.included = c("Yes", "No"), 
                     totals = c(total_tokens, total_tokens_ns), 
                     unique = c(total_features, total_features_ns))
kable(fcount, 
      col.names = c("Stopwords Included", "Total Tokens", "Unique Tokens"),
      format.args = list(big.mark=","),
      caption = "Table 2: Total and Unique Token Counts")

3. N-Gram Token Frequency (Stopwords Removed)

textplot_wordcloud(sample_dfm_ns, max_words = 100)
# get top 30 1-, 2- and 3-grams by frequency (no stopwords)
freq_top30_ns <- freq_df_ns %>% slice(1:30)
freq_2gram_top30_ns <- freq_df_2gram_ns %>% slice(1:30)
freq_3gram_top30_ns <- freq_df_3gram_ns %>% slice(1:30)
# plot top 30 1-grams
ggplot(freq_top30_ns) + 
  aes(x = reorder(feature, frequency), y = frequency) + 
  ggtitle("Figure 3: Top 30 1-Grams by Frequency (Stopwords Removed)") +
  xlab("feature (1-gram)") +
  geom_col(fill='darkred') +
  coord_flip()
# plot top 30 2-grams
ggplot(freq_2gram_top30_ns) + 
  aes(x = reorder(feature, frequency), y = frequency) +
  ggtitle("Figure 4: Top 30 2-Grams by Frequency (Stopwords Removed)") +
  xlab("feature (2-gram)") +
  geom_col(fill='darkred') +
  coord_flip()
# plot top 30 3-grams
ggplot(freq_3gram_top30_ns) + 
  aes(x = reorder(feature, frequency), y = frequency) + 
  ggtitle("Figure 5: Top 30 3-Grams by Frequency (Stopwords Removed)") +
  xlab("feature (3-gram)") +
  geom_col(fill='darkred') + 
  coord_flip()

4. N-Gram Token Frequency (Stopwords Included)

textplot_wordcloud(sample_dfm, max_words = 100, min_size=1)
# get top 30 1-, 2- and 3-grams by frequency (stopwords included)
freq_top30 <- freq_df %>% slice(1:30)
freq_2gram_top30 <- freq_df_2gram %>% slice(1:30)
freq_3gram_top30 <- freq_df_3gram %>% slice(1:30)
# plot top 30 1-grams
ggplot(freq_top30) + 
  aes(x = reorder(feature, frequency), y = frequency) + 
  ggtitle("Figure 7: Top 30 1-grams by Frequency (Stopwords Included)") +
  xlab("feature (1-Gram)") +
  geom_col(fill='darkred') +
  coord_flip()
# plot top 30 2-grams
ggplot(freq_2gram_top30) + 
  aes(x = reorder(feature, frequency), y = frequency) +
  ggtitle("Figure 8: Top 30 2-Grams by Frequency (Stopwords Included)") +
  xlab("Feature (2-gram)") +
  geom_col(fill='darkred') +
  coord_flip()
# plot top 30 3-grams
ggplot(freq_3gram_top30) + 
  aes(x = reorder(feature, frequency), y = frequency) + 
  ggtitle("Figure 9: Top 30 3-Grams by Frequency (Stopwords Included)") +
  xlab("feature (3-gram)") +
  geom_col(fill='darkred') + 
  coord_flip()

5. Relationship Between Unique Features and Corpus Coverage Levels

# calculate minimum no. of unique features required to meet required corpus coverage level
features_needed <- function(required_coverage, freq, total)
{
  current_feature <- 0
  current_tokens <- 0
  
  # while tokens counted is less than required amount
  while (current_tokens < total * required_coverage) {
    # increment feature counter/index
    current_feature <- current_feature + 1
    # increment count of tokens covered based on frequency of current feature
    current_tokens <- current_tokens + freq[current_feature, ]$frequency
  }
  
  # return no. of features needed
  return(current_feature)
}
# generate vector of coverage proportions in 0.01 increments
coverage <- seq(0, 1, 0.01)
# calculate required features
features_proportion <- sapply(coverage, function(n) { features_needed(n, freq_df_ns, total_tokens_ns) }) / total_features_ns
# generate data frame from these
coverage_table <- data.frame(coverage, features_proportion)
# display table for 50, 90, 98, 99%
kable(coverage_table[c(51, 91, 99, 100),],
      col.names = c("Proportion of Corpus Covered", "Proportion of Unique Features Required"),
      caption = "Table 3: Proportions of Unique Featured Required for Certain Coverage Levels of the Corpus (Stopwords Removed)",
      row.names = FALSE)
# plot features needed against coverage level
ggplot(coverage_table) + 
  aes(x = coverage, y = features_proportion) + 
  geom_line() +
  xlab("Proportion of Corpus Covered") +
  ylab("Proportion of Unique Features Required") +
  ggtitle("Figure 10: Graph of Proportion of Unique Features Required\nAgainst Corpus Coverage Level (Stopwords Removed)")
# same as previously, but with stopwords included
coverage <- seq(0, 1, 0.01)
features_proportion <- sapply(coverage, function(n) { features_needed(n, freq_df, total_tokens) }) / total_features
coverage_table <- data.frame(coverage, features_proportion)
kable(coverage_table[c(51, 91, 99, 100),],
      col.names = c("Proportion of Corpus Covered", "Proportion of Unique Features Required"),
      caption = "Table 4: Proportions of Unique Featured Required for Certain Coverage Levels of the Corpus (Stopwords Included)",
      row.names = FALSE)
ggplot(coverage_table) + 
  aes(x = coverage, y = features_proportion) + 
  geom_line() +
  xlab("Proportion of Corpus Covered") +
  ylab("Proportion of Unique Features Required") +
  ggtitle("Figure 11: Graph of Proportion of Unique Features Required\nAgainst Corpus Coverage Level (Stopwords Included)")