Prep the data and find average word length.
# Find the average word length of the above documents.
# A for loop could work, but it is awkward with unnest_tokens() because the
# input column cannot be pulled from a list. It makes more sense to add an
# author column to each table, row-bind the tables, tokenize with
# unnest_tokens(), and then group by author.
Hamilton <- Hamilton %>% mutate(author = "Hamilton")
Madison <- Madison %>% mutate(author = "Madison")
Jay <- Jay %>% mutate(author = "Jay")
HamiltonMadison <- HamiltonMadison %>% mutate(author = "Hamilton and Madison")
Unknown <- Unknown %>% mutate(author = "Unknown")
papers <- rbind(Hamilton, Madison, Jay, HamiltonMadison, Unknown)
word_counts <- papers %>%
  unnest_tokens(
    input = "body_text",
    output = "words",
    token = "words"
  ) %>%
  group_by(author) %>%
  count(words, sort = TRUE) %>%
  mutate(
    word_length = nchar(words),
    total_length = n * word_length
  )
# Group by author and calculate the average word length.
author_avg_word_length <- word_counts %>%
  group_by(author) %>%
  summarize(`average word length` = sum(total_length) / sum(n))
# Tokenize Federalist Paper No. 51 (author unknown) and write to a CSV.
unknown_tokenized <- Unknown %>%
  unnest_tokens(
    input = "body_text",
    output = "words",
    token = "words"
  )
# write.csv(unknown_tokenized, file = "Unknown_Tokenized.csv")
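The summarize() step above divides the total number of characters by the total number of tokens, which is a count-weighted average. As a quick check of that arithmetic, the following minimal sketch in base R shows that the weighted form matches the plain mean over every token. The `tokens` vector is a hypothetical toy example and is not taken from the Federalist Papers data.

```r
# Toy tokens (hypothetical; not taken from the Federalist Papers data).
tokens <- c("the", "people", "of", "the", "state", "of", "the", "union")

# Count each distinct word, mirroring count(words) in the pipeline above.
counts <- table(tokens)
word_length <- nchar(names(counts))
n <- as.vector(counts)

# Weighted average: total characters divided by total tokens.
weighted_avg <- sum(n * word_length) / sum(n)

# Plain mean over every token occurrence.
plain_avg <- mean(nchar(tokens))

# Both calculations should agree.
stopifnot(isTRUE(all.equal(weighted_avg, plain_avg)))
weighted_avg
```

Counting first and weighting afterwards keeps the intermediate table small, since each distinct word appears once per author.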
Bootstrap samples from Federalist Paper No. 51 and place
the results in a histogram to estimate authorship.
### Bootstrap samples from Federalist Paper No. 51.
# Take samples of 5 words and calculate their average word length. Place the
# results on a histogram. On the histogram, place vertical lines showing the
# average word length of the known authors and author combination. Use this to
# predict a potential author for Federalist Paper No. 51.
# Use replicate() to take 1,000 samples and store the results in a table.
# replicate() simplifies the one-cell tibbles into a list of numbers, so
# as_tibble() names the resulting column `value`.
averages_list <- replicate(
  1000,
  unknown_tokenized %>%
    slice_sample(n = 5) %>%
    summarize(avg_length = sum(nchar(words)) / 5)
)
sample_averages <- averages_list %>% map_df(as_tibble)
# Make a histogram using sample_averages.
# Stylistic note: Initially, I wanted to have a legend displayed showing the
# line color and its matching author. Doing so would require the use of the
# function scale_color_manual(), which necessitates a column to reference.
# Since I am using an x-intercept, this is not the most practical choice.
ggplot(sample_averages, aes(x = value)) +
  geom_histogram(binwidth = 0.1) +
  geom_vline(xintercept = 4.923508, color = "black") +
  geom_vline(xintercept = 5.013823, color = "red") +
  geom_vline(xintercept = 4.903823, color = "darkgreen") +
  geom_vline(xintercept = 4.834774, color = "darkorange1") +
  geom_vline(
    aes(xintercept = mean(value)),
    color = "blue",
    linetype = "dashed",
    linewidth = 1.2
  ) +
  scale_x_continuous(breaks = seq(2, 10, 1)) +
  labs(
    title = "Bootstrapped Sample of Average Word Length from Federalist Paper No. 51",
    x = "Average Word Length",
    y = "Frequency",
    caption = "Black - Hamilton, Red - Hamilton and Madison, Green - Jay, Orange - Madison, Blue - Unknown"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Zoom in to get a better view. Setting limits on the x scale drops the
# values outside the range, which produces the warnings below.
ggplot(sample_averages, aes(x = value)) +
  geom_histogram(binwidth = 0.1) +
  geom_vline(xintercept = 4.923508, color = "black", linewidth = 1.2) +
  geom_vline(xintercept = 5.013823, color = "red", linewidth = 1.2) +
  geom_vline(xintercept = 4.903823, color = "darkgreen", linewidth = 1.2) +
  geom_vline(xintercept = 4.834774, color = "darkorange1", linewidth = 1.2) +
  geom_vline(
    aes(xintercept = mean(value)),
    color = "blue",
    linetype = "dashed",
    linewidth = 1.2
  ) +
  scale_x_continuous(limits = c(4.7, 5.2)) +
  labs(
    title = "Bootstrapped Sample of Average Word Length from Federalist Paper No. 51",
    subtitle = "(Zoomed in for an Enhanced View)",
    x = "Average Word Length",
    y = "Frequency",
    caption = "Black - Hamilton, Red - Hamilton and Madison, Green - Jay, Orange - Madison, Blue - Unknown"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )
## Warning: Removed 832 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
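
The stylistic note above mentions wanting a legend that matches each line color to its author. One possible way to get one, sketched here as an assumption rather than the approach used in this analysis, is to put the author averages in their own small table and map `color` to the author inside geom_vline(). The `author_means` table is a hypothetical helper built from the x-intercepts used above, and the `sample_averages` values here are simulated placeholders so the sketch runs on its own; in practice the table built earlier would be used. The `linewidth` argument assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

# Hypothetical helper table: author averages copied from the geom_vline()
# calls above.
author_means <- data.frame(
  author = c("Hamilton", "Hamilton and Madison", "Jay", "Madison"),
  avg_length = c(4.923508, 5.013823, 4.903823, 4.834774)
)

# Simulated placeholder values (NOT the real bootstrap results) so that the
# sketch is self-contained.
set.seed(51)
sample_averages <- data.frame(value = rnorm(1000, mean = 4.9, sd = 1.2))

legend_plot <- ggplot(sample_averages, aes(x = value)) +
  geom_histogram(binwidth = 0.1) +
  # Mapping color inside aes() is what makes ggplot2 draw a legend.
  geom_vline(
    data = author_means,
    aes(xintercept = avg_length, color = author),
    linewidth = 1.2
  ) +
  scale_color_manual(
    name = "Author",
    values = c(
      "Hamilton" = "black",
      "Hamilton and Madison" = "red",
      "Jay" = "darkgreen",
      "Madison" = "darkorange1"
    )
  ) +
  theme_classic()

legend_plot
```

With a legend in place, the color key in the caption would no longer be needed.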

Observe that the bootstrapped average word length and the actual average
word length for Federalist Paper No. 51 are almost identical. The blue
dashed line (unknown author) and the black line (Hamilton) are practically
the same. Based on the average word length of the bootstrapped samples,
Alexander Hamilton is most likely the author of Federalist Paper No. 51.
(Keep in mind that average word length alone is a very rough estimate of
authorship.)
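
The close agreement between the bootstrapped average and the paper's actual average is expected, because the mean of many small-sample means settles near the overall mean. The following minimal sketch in base R illustrates this, and also shows how widely individual 5-word samples vary. The `word_lengths` vector is a hypothetical stand-in for the word lengths of the tokenized paper, and sampling with replacement is an assumption of this sketch.

```r
# Hypothetical word lengths standing in for nchar(unknown_tokenized$words).
word_lengths <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 10)

set.seed(51)

# Draw 1,000 samples of 5 word lengths (with replacement) and average each.
sample_means <- replicate(
  1000,
  mean(sample(word_lengths, size = 5, replace = TRUE))
)

overall_mean <- mean(word_lengths)
mean_of_samples <- mean(sample_means)

# Compare the two averages and report the spread of the sample means.
c(
  overall = overall_mean,
  bootstrapped = mean_of_samples,
  spread = sd(sample_means)
)
```

If the spread of the sample means is large compared with the gaps between the author averages, the position of the blue line mostly reflects the paper's overall average, which is one reason to treat the attribution as tentative.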