Below, we load the packages required for data analysis, predictive document classification, and visualization.
library(knitr)
library(tidyverse)
library(tidytext)
library(textdata)
library(rpart)
library(rpart.plot)
library(DT)
We will be predicting whether a text document is a poem or a short story based on some of its features, so we load a number of poetry and short story collections in the public domain that we gathered from Project Gutenberg into a dataframe. We manually added delimiters to the text files to mark the beginning of each unique poem or short story in the collection. We also named the files in such a way as to make extracting metadata like category (P for poetry, SS for short story), collection name, and author name easier. Once we have captured the lines of each text and numbered them sequentially, we pivot the data into a wider format so that each row observation is a single poem or short story, and that row’s numbered lines are its columns.
completed <- readLines("completed.txt")
if (length(completed) == 0){
files <- list.files(pattern = "^P_.+\\.txt$|^SS_.+\\.txt$")
txt_df <- as.data.frame(matrix(nrow = 0, ncol = 6))
cols <- c("category", "collection", "author", "title", "author_title", "lines")
colnames(txt_df) <- cols
for (i in 1:length(files)){
extraction <- str_replace_all(files[i], "_", " ")
p <- "(?<cat>P|SS)[- ](?<coll>.+)(?<by> by )(?<auth>.+)(?<ftype> Altered\\.txt)"
extraction <- str_match(extraction, p)
category <- extraction[1, 2]
collection <- extraction[1, 3]
author <- extraction[1, 5]
txt <- trimws(readLines(files[i]), which="left")
txt <- as.data.frame(txt)
write(files[i], file = "completed.txt", append = TRUE)
dlim <- txt[1, 1]
if (dlim == "+"){
starting_line <- str_detect(txt[, 1], "^\\+$")
}else if (dlim == "="){
starting_line <- str_detect(txt[, 1], "^=$")
}
for (j in nrow(txt):1){
if (j == nrow(txt)){
end <- j
}
if (starting_line[j]){
start <- j + 1
content <- txt[start:end, 1]
title <- content[1]
author_title <- paste(author, title, sep = "_")
lines <- content[2:length(content)]
#we want to preserve meaningful line breaks throughout the text
#but eliminate leading or tailing sequences of them; process follows
non_empty <- lines != ""
#find index of first TRUE in vector of whether lines are not empty
x <- which.max(non_empty)
lines <- lines[x:length(lines)] #trim empty lines from front only
rev_lines <- rev(lines) #reverse lines so we can trim from back
y <- which.max(rev(non_empty))
rev_lines <- rev_lines[y:length(rev_lines)] #trim from back
lines <- rev(rev_lines) #put lines back in proper order
addition <- cbind(category, collection, author, title,
author_title, lines)
txt_df <- rbind(txt_df, addition)
end <- j - 1
}else{
next
}
}
}
txt_df$line_num <- as.integer(ave(txt_df$lines, txt_df$author_title, FUN = seq_along))
txt_pivot <- txt_df %>%
pivot_wider(names_from = line_num, names_prefix = "line_" ,
values_from = lines)
write.csv(txt_df, file = "txt_df.csv", row.names = FALSE)
write.csv(txt_pivot, file = "txt_pivot.csv", row.names = FALSE)
}else{
my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/txt_df.csv"
txt_df <- read.csv(my_url1)
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/txt_pivot.csv"
txt_pivot <- read.csv(my_url2)
show <- c("category", "collection", "author", "title", "line_1")
kable(head(txt_pivot[, colnames(txt_pivot) %in% show]), format = "simple")
}
| category | collection | author | title | line_1 |
|---|---|---|---|---|
| P | Look We Have Come Through | DH Lawrence | CRAVING FOR SPRING | I WISH it were spring in the world. |
| P | Look We Have Come Through | DH Lawrence | FROST FLOWERS | IT is not long since, here among all these folk |
| P | Look We Have Come Through | DH Lawrence | AUTUMN RAIN | THE plane leaves |
| P | Look We Have Come Through | DH Lawrence | MANIFESTO | I |
| P | Look We Have Come Through | DH Lawrence | ELYSIUM | I HAVE found a place of loneliness |
| P | Look We Have Come Through | DH Lawrence | NEW HEAVEN AND EARTH | I |
We load two sentiment lexicons so that we can analyze both emotional and polarity indicators for the words in each line. Since the nrc lexicon includes both emotional (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) indicators and polarity (negative, positive) indicators, we will prioritize the indicators from the nrc lexicon and consider it the primary. We will only use the secondary sentiment lexicon bing to fill in polarity indicators for words the primary lexicon hasn’t labeled positive or negative.
nrc <- get_sentiments("nrc")
cols <- c("word", "sentiment_nrc")
colnames(nrc) <- cols
bing <- get_sentiments("bing")
cols <- c("word", "sentiment_bing")
colnames(bing) <- cols
We set a seed so that when we shuffle our data and later create train and test datasets from it, we get the same results each time.
set.seed(6001)
We shuffle the data so that the poems are no longer sorted next to all the other poems, and the short stories are no longer sorted next to all the other short stories. We also change the P (poetry) vs. SS (short story) category to a numeric variable with 0 indicating P and 1 indicating SS, and we emove some columns we don’t need for the numeric analysis we’ll be doing later.
txt_shuffle <- txt_pivot
shuffle <- sample(1:nrow(txt_shuffle))
txt_shuffle <- txt_shuffle[shuffle, ]
txt_shuffle$category <- ifelse(txt_shuffle$category == "SS", 1, 0)
remove <- c("collection", "author", "title")
txt_shuffle <- txt_shuffle[, !colnames(txt_shuffle) %in% remove]
show <- c("category", "author_title", "line_1")
kable(head(txt_shuffle[, colnames(txt_shuffle) %in% show]), format = "simple")
| category | author_title | line_1 | |
|---|---|---|---|
| 108 | 0 | Ella Wheeler Wilcox_EARTH BOUND | New paradise, and groom and bride; |
| 17 | 0 | DH Lawrence_BIRTH NIGHT | THIS fireglow is a red womb |
| 175 | 0 | Dora Sigerson_BEWARE | I closed my hands upon a moth |
| 371 | 1 | Hans Christian Andersen_THE MONEY-BOX | In a nursery where a number of toys lay scattered about, a |
| 185 | 0 | Dora Sigerson_SANCTUARY | Neighbour! for pity a hound cries on your steps |
| 296 | 0 | William Butler Yeats_INTO THE TWILIGHT | Out-worn heart, in a time out-worn, |
We create numeric variables that will be able to capture some features of our text data. The first numeric variables we create are a count of the punctuation in a text, as well as a count of the non-empty lines in a text, i.e. lines that do not represent intentional breaks separating stanzas or paragraphs within that text.
txt_punct <- txt_shuffle
txt_punct[txt_punct == ""] <- NA
txt_punct <- txt_punct %>%
mutate(across(starts_with("line_"), \(x) str_count(x, "[[:punct:]]"))) %>%
mutate(total_punct = rowSums(select(., starts_with("line_")), na.rm = TRUE)) %>%
mutate(total_non_empty_lines = rowSums(!is.na(select(.,
starts_with("line_"))))) %>%
select(!starts_with("line_"))
kable(head(txt_punct), format = "simple")
| category | author_title | total_punct | total_non_empty_lines | |
|---|---|---|---|---|
| 108 | 0 | Ella Wheeler Wilcox_EARTH BOUND | 71 | 30 |
| 17 | 0 | DH Lawrence_BIRTH NIGHT | 22 | 24 |
| 175 | 0 | Dora Sigerson_BEWARE | 5 | 8 |
| 371 | 1 | Hans Christian Andersen_THE MONEY-BOX | 162 | 77 |
| 185 | 0 | Dora Sigerson_SANCTUARY | 58 | 24 |
| 296 | 0 | William Butler Yeats_INTO THE TWILIGHT | 21 | 16 |
We also create a count of the total lines in a text, including the previously mentioned intentional breaks.
txt_words <- txt_df
remove <- c("category", "collection", "author", "title")
txt_lines <- txt_words[, !colnames(txt_words) %in% remove]
txt_lines <- txt_lines %>%
group_by(author_title) %>%
summarize(total_lines = max(line_num))
kable(head(txt_lines), format = "simple")
| author_title | total_lines |
|---|---|
| Anton Chekhov_A MALEFACTOR | 198 |
| Anton Chekhov_AGAFYA | 461 |
| Anton Chekhov_AT CHRISTMAS TIME | 222 |
| Anton Chekhov_DREAMS | 340 |
| Anton Chekhov_GUSEV | 607 |
| Anton Chekhov_HAPPINESS | 411 |
We then separate the lines into their individual word components.
txt_words <- txt_words[, !colnames(txt_words) %in% remove]
txt_words <- txt_words %>%
unnest_tokens(word, lines)
kable(head(txt_words), format = "simple")
| author_title | line_num | word |
|---|---|---|
| DH Lawrence_CRAVING FOR SPRING | 1 | i |
| DH Lawrence_CRAVING FOR SPRING | 1 | wish |
| DH Lawrence_CRAVING FOR SPRING | 1 | it |
| DH Lawrence_CRAVING FOR SPRING | 1 | were |
| DH Lawrence_CRAVING FOR SPRING | 1 | spring |
| DH Lawrence_CRAVING FOR SPRING | 1 | in |
We join the words to the emotional and polarity indicators we previously discussed. Then we create numeric variables that indicate which portion of the labeled words in each text belong to each emotional indicator, as well as which portion of the labeled words in each text belong to each emotional indicator. A text might have only one labeled word, and it might have two labels: joy and positive. That text would get a score of 1 for percent_joy and 1 for percent_pos, which is not to indicate every word in the text is associated with both joy and positivity, but that every labeled word in the text is. We are not attempting to create a measure that accounts for the many unlabeled words.
txt_sentiments <- txt_words %>%
left_join(nrc, by = join_by(word), multiple = "all") %>%
left_join(bing, by = join_by(word), multiple = "all") %>%
mutate(sentiment = coalesce(sentiment_nrc, sentiment_bing))
remove <- c("sentiment_nrc", "sentiment_bing")
txt_sentiments <- txt_sentiments[, !colnames(txt_sentiments) %in% remove]
txt_emotions <- txt_sentiments %>%
filter(!sentiment %in% c("positive", "negative") & !is.na(sentiment)) %>%
group_by(author_title, sentiment) %>%
summarize(sentiment_count = n())
emotions_pivot <- txt_emotions %>%
pivot_wider(names_from = sentiment, values_from = sentiment_count)
emotion_totals <- txt_emotions %>%
group_by(author_title) %>%
summarize(emotion_total = sum(sentiment_count)) %>%
left_join(emotions_pivot, by = join_by(author_title)) %>%
mutate(percent_anger = round(anger / emotion_total, 2),
percent_anticipation = round(anticipation / emotion_total, 2),
percent_disgust = round(disgust / emotion_total, 2),
percent_fear = round(fear / emotion_total, 2),
percent_joy = round(joy / emotion_total, 2),
percent_sadness = round(sadness / emotion_total, 2),
percent_surprise = round(surprise / emotion_total, 2),
percent_trust = round(trust / emotion_total, 2))
emotion_totals[is.na(emotion_totals)] <- 0
show <- c("author_title", "percent_anger", "percent_anticipation",
"percent_disgust", "percent_fear", "percent_joy", "percent_sadness",
"percent_surprise", "percent_trust")
kable(head(emotion_totals[, colnames(emotion_totals) %in% show]), format = "simple")
| author_title | percent_anger | percent_anticipation | percent_disgust | percent_fear | percent_joy | percent_sadness | percent_surprise | percent_trust |
|---|---|---|---|---|---|---|---|---|
| Anton Chekhov_A MALEFACTOR | 0.10 | 0.11 | 0.10 | 0.15 | 0.10 | 0.16 | 0.08 | 0.20 |
| Anton Chekhov_AGAFYA | 0.10 | 0.17 | 0.06 | 0.13 | 0.13 | 0.15 | 0.09 | 0.16 |
| Anton Chekhov_AT CHRISTMAS TIME | 0.08 | 0.20 | 0.04 | 0.13 | 0.18 | 0.09 | 0.04 | 0.24 |
| Anton Chekhov_DREAMS | 0.06 | 0.12 | 0.11 | 0.16 | 0.12 | 0.14 | 0.08 | 0.21 |
| Anton Chekhov_GUSEV | 0.15 | 0.12 | 0.11 | 0.17 | 0.08 | 0.18 | 0.06 | 0.15 |
| Anton Chekhov_HAPPINESS | 0.11 | 0.15 | 0.08 | 0.15 | 0.10 | 0.15 | 0.06 | 0.20 |
txt_polarity <- txt_sentiments %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(author_title, sentiment) %>%
summarize(sentiment_count = n())
polarity_pivot <- txt_polarity %>%
pivot_wider(names_from = sentiment, values_from = sentiment_count)
polarity_totals <- txt_polarity %>%
group_by(author_title) %>%
summarize(polarity_total = sum(sentiment_count)) %>%
left_join(polarity_pivot, by = join_by(author_title)) %>%
mutate(percent_neg = round(negative / polarity_total, 2),
percent_pos = round(positive / polarity_total, 2))
polarity_totals[is.na(polarity_totals)] <- 0
show <- show <- c("author_title", "percent_pos", "percent_neg")
kable(head(polarity_totals[, colnames(polarity_totals) %in% show]), format = "simple")
| author_title | percent_neg | percent_pos |
|---|---|---|
| Anton Chekhov_A MALEFACTOR | 0.48 | 0.52 |
| Anton Chekhov_AGAFYA | 0.47 | 0.53 |
| Anton Chekhov_AT CHRISTMAS TIME | 0.37 | 0.63 |
| Anton Chekhov_DREAMS | 0.47 | 0.53 |
| Anton Chekhov_GUSEV | 0.54 | 0.46 |
| Anton Chekhov_HAPPINESS | 0.48 | 0.52 |
We load a pronunciation dictionary so that we can analyze the last words in each line and check whether they rhyme with either the last word of the previous line or the last word of the line before that. Poems might feature either of these rhyming structures, or none at all, but it’s unlikely that the same poem would feature both, so we give each text a rhyming score based on the larger of its count of lines that rhyme with the previous line and its count of lines that rhyme with the line two lines prior.
my_url3 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/cmudict_0_7b.txt"
cmudict <- readLines(my_url3)
cmudict <- as.data.frame(cmudict[-c(1:56)])
colnames(cmudict) <- "word"
cmudict <- cmudict %>%
separate_wider_delim(cols = word, delim = " ", names_sep = "_",
too_few = "align_start")
colnames(cmudict) <- c("word", "pronunciation")
cmudict$word <- str_to_lower(str_replace_all(cmudict$word, "[[:punct:]]", ""))
cmudict$start_index <- sapply(gregexpr("[A-Z]{2}1", cmudict$pronunciation),
function(x) rev(x)[1])
cmudict <- cmudict %>%
mutate(rhyming_phoneme = substr(.$pronunciation, start_index, 1000000L))
txt_words$author_title_line_num <- paste(txt_words$author_title,
txt_words$line_num, sep = "_")
txt_words$word_num <- as.integer(ave(txt_words$word, txt_words$author_title_line_num,
FUN = seq_along))
remove <- "author_title_line_num"
txt_words <- txt_words[, !colnames(txt_words) %in% remove]
if (!"txt_rhymes.csv" %in% completed){
txt_rhymes <- txt_words %>%
group_by(author_title, line_num) %>%
top_n(1, word_num) %>%
left_join(cmudict, by = join_by(word), multiple = "first") %>%
mutate(rhymes_1_line_prev = 0, rhymes_2_line_prev = 0)
remove <- c("pronunciation", "start_index")
txt_rhymes <- txt_rhymes[, !colnames(txt_rhymes) %in% remove]
for (i in 2:nrow(txt_rhymes)){
if (is.na(txt_rhymes[i, 5]) |
is.na(txt_rhymes[(i-1), 5]) |
txt_rhymes[i, 5] != txt_rhymes[(i-1), 5]){
next
}else{
if (txt_rhymes[i, 1] != txt_rhymes[(i-1), 1]){
next
}else{
txt_rhymes[i, 6] <- 1
}
}
}
for (i in 3:nrow(txt_rhymes)){
if (is.na(txt_rhymes[i, 5]) |
is.na(txt_rhymes[(i-2), 5]) |
txt_rhymes[i, 5] != txt_rhymes[(i-2), 5]){
next
}else{
if (txt_rhymes[i, 1] != txt_rhymes[(i-2), 1]){
next
}else{
txt_rhymes[i, 7] <- 1
}
}
}
write.csv(txt_rhymes, "txt_rhymes.csv", row.names = FALSE)
}else{
my_url4 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/txt_rhymes.csv"
txt_rhymes <- read.csv(my_url4)
kable(head(txt_rhymes), format = "simple")
}
| author_title | line_num | word | word_num | rhyming_phoneme | rhymes_1_line_prev | rhymes_2_line_prev |
|---|---|---|---|---|---|---|
| DH Lawrence_CRAVING FOR SPRING | 1 | world | 8 | ER1 L D | 0 | 0 |
| DH Lawrence_CRAVING FOR SPRING | 3 | spring | 4 | IH1 NG | 0 | 0 |
| DH Lawrence_CRAVING FOR SPRING | 4 | sap | 6 | AE1 P | 0 | 0 |
| DH Lawrence_CRAVING FOR SPRING | 5 | creation | 4 | EY1 SH AH0 N | 0 | 0 |
| DH Lawrence_CRAVING FOR SPRING | 6 | mortifica | 8 | NA | 0 | 0 |
| DH Lawrence_CRAVING FOR SPRING | 7 | tion | 1 | NA | 0 | 0 |
rhymes_summary <- txt_rhymes %>%
group_by(author_title) %>%
summarize(rhyming_score = max(sum(rhymes_1_line_prev), sum(rhymes_2_line_prev))) %>%
arrange(desc(rhyming_score))
kable(head(rhymes_summary), format = "simple")
| author_title | rhyming_score |
|---|---|
| Phillis Wheatley_G O L I A T H O F G A T H. | 76 |
| Phillis Wheatley_NIOBE in Distress for her Children slain by APOLLO, from Ovid’s Metamorphoses, Book VI. and from a view of the Painting of Mr. Richard Wilson. | 76 |
| Phillis Wheatley_Thoughts on the WORKS OF PROVIDENCE. | 52 |
| Louisa May Alcott_CLOVER-BLOSSOM. | 45 |
| Hans Christian Andersen_A STORY FROM THE SAND-HILLS | 38 |
| Dora Sigerson_THE SUICIDE’S GRAVE | 36 |
Now that we’ve created many of the numeric variables that could possibly distinguish a poem from a short story, we create a summary data frame and add one more numeric variable, as well as a few ratios between our numeric variables that could potentially distinguish these texts further: the word count, the words per non-empty lines, the punctuation per non-empty lines, and the rhymes per non-empty lines.
txt_summary <- txt_words %>%
group_by(author_title) %>%
summarize(word_count = n()) %>%
left_join(txt_lines, by = join_by(author_title)) %>%
left_join(txt_punct, by = join_by(author_title)) %>%
left_join(emotion_totals, by = join_by(author_title)) %>%
left_join(polarity_totals, by = join_by(author_title)) %>%
left_join(rhymes_summary, by = join_by(author_title))
txt_summary$total_lines <- as.integer(txt_summary$total_lines)
remove <- c("emotion_total", "polarity_total", "anger", "anticipation", "disgust",
"fear", "joy", "sadness", "surprise", "trust", "positive", "negative")
txt_summary <- txt_summary[, !colnames(txt_summary) %in% remove]
txt_summary <- txt_summary %>%
mutate(words_per_non_empty_lines = round(word_count / total_non_empty_lines, 2),
punct_per_non_empty_lines = round(total_punct / total_non_empty_lines, 2),
rhymes_per_non_empty_lines = round(rhyming_score / total_non_empty_lines, 2))
datatable(head(txt_summary), options = list(scrollX = TRUE))
Now we can split our entirely numeric summary data into train and test datasets. Then, we fit a decision tree model.
# process for this chunk and the next adapted from
# https://www.guru99.com/r-decision-trees.html
split_data_into_test_train <- function(data, size = 0.7, train = TRUE){
n_row <- nrow(data)
split_n_row <- size * n_row
split <- 1:split_n_row
if (train == TRUE){
return (data[split, ])
} else {
return (data[-split, ])
}
}
train <- split_data_into_test_train(txt_summary, size = 0.7, train = TRUE)
train <- train %>%
column_to_rownames(var="author_title")
test <- split_data_into_test_train(txt_summary, size = 0.7, train = FALSE)
test <- test %>%
column_to_rownames(var="author_title")
fit <- rpart(category~., data = train, method = "class")
rpart.plot(fit, extra = 106)
We see that the model has only incorporated three of the variables we created into its decision tree: word count, words per non-empty lines, and rhymes per non-empty lines. We’ll discuss ways to improve the data that could make our other numeric variables more meaningful to the model later, as word count is so indicative of whether something is a poem or short story that we believe it clouds the influence of other variables. We also know some of our other variables could be measured more accurately as well. First, let’s see how the model does at predictive classification.
predict_unseen <- predict(fit, test, type = "class")
table_mat <- table(test$category, predict_unseen)
table_mat
## predict_unseen
## 0 1
## 0 88 14
## 1 18 63
accuracy_test <- round(sum(diag(table_mat)) / sum(table_mat), 4) * 100
The model correctly labeled 88 poems, but mislabeled 14 of them as short stories. It also correctly labeled 63 short stories, but mislabled 18 of them as poems.
The model is 82.51% accurate.
The model accuracy is great, but we wanted text features other than word count to stand out when predicting poetry vs. short stories. We could probably improve the metrics we developed if:
the line length of the short stories was fixed like they would be in a physical book, making punctuation per non-empty line more meaningful
we took text samples that were all within range of the same word count
we excluded short stories that have poetry or poetic features within them
we expanded the number of words for which we had emotional/polarity sentiment indicators