Load the Required Packages:

Below, we load the packages required for data analysis, predictive document classification, and visualization.

library(knitr)
library(tidyverse)
library(tidytext)
library(textdata)
library(rpart)
library(rpart.plot)
library(DT)

Load the Poetry and Short Story Data:

We will be predicting whether a text document is a poem or a short story based on some of its features, so we load a number of poetry and short story collections in the public domain that we gathered from Project Gutenberg into a dataframe. We manually added delimiters to the text files to mark the beginning of each unique poem or short story in the collection. We also named the files in such a way as to make extracting metadata like category (P for poetry, SS for short story), collection name, and author name easier. Once we have captured the lines of each text and numbered them sequentially, we pivot the data into a wider format so that each row observation is a single poem or short story, and that row’s numbered lines are its columns.

completed <- readLines("completed.txt") 
if (length(completed) == 0){
    files <- list.files(pattern = "^P_.+\\.txt$|^SS_.+\\.txt$")
    txt_df <- as.data.frame(matrix(nrow = 0, ncol = 6))
    cols <- c("category", "collection", "author", "title", "author_title", "lines")
    colnames(txt_df) <- cols
    for (i in 1:length(files)){
        extraction <- str_replace_all(files[i], "_", " ")
        p <- "(?<cat>P|SS)[- ](?<coll>.+)(?<by> by )(?<auth>.+)(?<ftype> Altered\\.txt)"
        extraction <- str_match(extraction, p)
        category <- extraction[1, 2]
        collection <- extraction[1, 3]
        author <- extraction[1, 5]
        txt <- trimws(readLines(files[i]), which="left")
        txt <- as.data.frame(txt)
        write(files[i], file = "completed.txt", append = TRUE)
        dlim <- txt[1, 1]
        if (dlim == "+"){
            starting_line <- str_detect(txt[, 1], "^\\+$")
        }else if (dlim == "="){
            starting_line <- str_detect(txt[, 1], "^=$")
        }
        for (j in nrow(txt):1){
            if (j == nrow(txt)){
                end <- j
            }
            if (starting_line[j]){
                start <- j + 1
                content <- txt[start:end, 1]
                title <- content[1]
                author_title <- paste(author, title, sep = "_")
                lines <- content[2:length(content)]
                #we want to preserve meaningful line breaks throughout the text
                #but eliminate leading or tailing sequences of them; process follows
                non_empty <- lines != ""
                #find index of first TRUE in vector of whether lines are not empty
                x <- which.max(non_empty) 
                lines <- lines[x:length(lines)] #trim empty lines from front only
                rev_lines <- rev(lines) #reverse lines so we can trim from back
                y <- which.max(rev(non_empty)) 
                rev_lines <- rev_lines[y:length(rev_lines)] #trim from back
                lines <- rev(rev_lines) #put lines back in proper order
                addition <- cbind(category, collection, author, title,
                                  author_title, lines)
                txt_df <- rbind(txt_df, addition)
                end <- j - 1
            }else{
                next
            }
        }
    }
    txt_df$line_num <- as.integer(ave(txt_df$lines, txt_df$author_title, FUN = seq_along))
    txt_pivot <- txt_df %>%
        pivot_wider(names_from = line_num, names_prefix = "line_" ,
                values_from = lines)
    write.csv(txt_df, file = "txt_df.csv", row.names = FALSE)
    write.csv(txt_pivot, file = "txt_pivot.csv", row.names = FALSE)
}else{
    my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/txt_df.csv"
    txt_df <- read.csv(my_url1)
    my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/txt_pivot.csv"
    txt_pivot <- read.csv(my_url2)
    show <- c("category", "collection", "author", "title", "line_1")
    kable(head(txt_pivot[, colnames(txt_pivot) %in% show]), format = "simple")
}
category collection author title line_1
P Look We Have Come Through DH Lawrence CRAVING FOR SPRING I WISH it were spring in the world.
P Look We Have Come Through DH Lawrence FROST FLOWERS IT is not long since, here among all these folk
P Look We Have Come Through DH Lawrence AUTUMN RAIN THE plane leaves
P Look We Have Come Through DH Lawrence MANIFESTO I
P Look We Have Come Through DH Lawrence ELYSIUM I HAVE found a place of loneliness
P Look We Have Come Through DH Lawrence NEW HEAVEN AND EARTH I

Load Sentiment Analysis Lexicons:

We load two sentiment lexicons so that we can analyze both emotional and polarity indicators for the words in each line. Since the nrc lexicon includes both emotional (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) indicators and polarity (negative, positive) indicators, we will prioritize the indicators from the nrc lexicon and consider it the primary. We will only use the secondary sentiment lexicon bing to fill in polarity indicators for words the primary lexicon hasn’t labeled positive or negative.

nrc <- get_sentiments("nrc")
cols <- c("word", "sentiment_nrc")
colnames(nrc) <- cols
bing <- get_sentiments("bing")
cols <- c("word", "sentiment_bing")
colnames(bing) <- cols

Set a Seed:

We set a seed so that when we shuffle our data and later create train and test datasets from it, we get the same results each time.

set.seed(6001)

Shuffle the Data:

We shuffle the data so that the poems are no longer sorted next to all the other poems, and the short stories are no longer sorted next to all the other short stories. We also change the P (poetry) vs. SS (short story) category to a numeric variable with 0 indicating P and 1 indicating SS, and we emove some columns we don’t need for the numeric analysis we’ll be doing later.

txt_shuffle <- txt_pivot
shuffle <- sample(1:nrow(txt_shuffle))
txt_shuffle <- txt_shuffle[shuffle, ]
txt_shuffle$category <- ifelse(txt_shuffle$category == "SS", 1, 0)
remove <- c("collection", "author", "title")
txt_shuffle <- txt_shuffle[, !colnames(txt_shuffle) %in% remove]
show <- c("category", "author_title", "line_1")
kable(head(txt_shuffle[, colnames(txt_shuffle) %in% show]), format = "simple")
category author_title line_1
108 0 Ella Wheeler Wilcox_EARTH BOUND New paradise, and groom and bride;
17 0 DH Lawrence_BIRTH NIGHT THIS fireglow is a red womb
175 0 Dora Sigerson_BEWARE I closed my hands upon a moth
371 1 Hans Christian Andersen_THE MONEY-BOX In a nursery where a number of toys lay scattered about, a
185 0 Dora Sigerson_SANCTUARY Neighbour! for pity a hound cries on your steps
296 0 William Butler Yeats_INTO THE TWILIGHT Out-worn heart, in a time out-worn,

Create Numeric Variables:

We create numeric variables that will be able to capture some features of our text data. The first numeric variables we create are a count of the punctuation in a text, as well as a count of the non-empty lines in a text, i.e. lines that do not represent intentional breaks separating stanzas or paragraphs within that text.

txt_punct <- txt_shuffle
txt_punct[txt_punct == ""] <- NA
txt_punct <- txt_punct %>%
    mutate(across(starts_with("line_"), \(x) str_count(x, "[[:punct:]]"))) %>%
    mutate(total_punct = rowSums(select(., starts_with("line_")), na.rm = TRUE)) %>%
    mutate(total_non_empty_lines = rowSums(!is.na(select(.,
                                                         starts_with("line_"))))) %>%
    select(!starts_with("line_"))
kable(head(txt_punct), format = "simple")
category author_title total_punct total_non_empty_lines
108 0 Ella Wheeler Wilcox_EARTH BOUND 71 30
17 0 DH Lawrence_BIRTH NIGHT 22 24
175 0 Dora Sigerson_BEWARE 5 8
371 1 Hans Christian Andersen_THE MONEY-BOX 162 77
185 0 Dora Sigerson_SANCTUARY 58 24
296 0 William Butler Yeats_INTO THE TWILIGHT 21 16

We also create a count of the total lines in a text, including the previously mentioned intentional breaks.

txt_words <- txt_df
remove <- c("category", "collection", "author", "title")
txt_lines <- txt_words[, !colnames(txt_words) %in% remove]
txt_lines <- txt_lines %>%
    group_by(author_title) %>%
    summarize(total_lines = max(line_num))
kable(head(txt_lines), format = "simple")
author_title total_lines
Anton Chekhov_A MALEFACTOR 198
Anton Chekhov_AGAFYA 461
Anton Chekhov_AT CHRISTMAS TIME 222
Anton Chekhov_DREAMS 340
Anton Chekhov_GUSEV 607
Anton Chekhov_HAPPINESS 411

We then separate the lines into their individual word components.

txt_words <- txt_words[, !colnames(txt_words) %in% remove]
txt_words <- txt_words %>%
    unnest_tokens(word, lines)
kable(head(txt_words), format = "simple")
author_title line_num word
DH Lawrence_CRAVING FOR SPRING 1 i
DH Lawrence_CRAVING FOR SPRING 1 wish
DH Lawrence_CRAVING FOR SPRING 1 it
DH Lawrence_CRAVING FOR SPRING 1 were
DH Lawrence_CRAVING FOR SPRING 1 spring
DH Lawrence_CRAVING FOR SPRING 1 in

We join the words to the emotional and polarity indicators we previously discussed. Then we create numeric variables that indicate which portion of the labeled words in each text belong to each emotional indicator, as well as which portion of the labeled words in each text belong to each emotional indicator. A text might have only one labeled word, and it might have two labels: joy and positive. That text would get a score of 1 for percent_joy and 1 for percent_pos, which is not to indicate every word in the text is associated with both joy and positivity, but that every labeled word in the text is. We are not attempting to create a measure that accounts for the many unlabeled words.

txt_sentiments <- txt_words %>%
    left_join(nrc, by = join_by(word), multiple = "all") %>%
    left_join(bing, by = join_by(word), multiple = "all") %>%
    mutate(sentiment = coalesce(sentiment_nrc, sentiment_bing))
remove <- c("sentiment_nrc", "sentiment_bing")
txt_sentiments <- txt_sentiments[, !colnames(txt_sentiments) %in% remove]
txt_emotions <- txt_sentiments %>%
    filter(!sentiment %in% c("positive", "negative") & !is.na(sentiment)) %>% 
    group_by(author_title, sentiment) %>%
    summarize(sentiment_count = n())
emotions_pivot <- txt_emotions %>%
    pivot_wider(names_from = sentiment, values_from = sentiment_count)
emotion_totals <- txt_emotions %>%
    group_by(author_title) %>%
    summarize(emotion_total = sum(sentiment_count)) %>%
    left_join(emotions_pivot, by = join_by(author_title)) %>%
    mutate(percent_anger = round(anger / emotion_total, 2),
           percent_anticipation = round(anticipation / emotion_total, 2),
           percent_disgust = round(disgust / emotion_total, 2),
           percent_fear = round(fear / emotion_total, 2),
           percent_joy = round(joy / emotion_total, 2),
           percent_sadness = round(sadness / emotion_total, 2),
           percent_surprise = round(surprise / emotion_total, 2),
           percent_trust = round(trust / emotion_total, 2))
emotion_totals[is.na(emotion_totals)] <- 0
show <- c("author_title", "percent_anger", "percent_anticipation",
          "percent_disgust", "percent_fear", "percent_joy", "percent_sadness",
          "percent_surprise", "percent_trust")
kable(head(emotion_totals[, colnames(emotion_totals) %in% show]), format = "simple")
author_title percent_anger percent_anticipation percent_disgust percent_fear percent_joy percent_sadness percent_surprise percent_trust
Anton Chekhov_A MALEFACTOR 0.10 0.11 0.10 0.15 0.10 0.16 0.08 0.20
Anton Chekhov_AGAFYA 0.10 0.17 0.06 0.13 0.13 0.15 0.09 0.16
Anton Chekhov_AT CHRISTMAS TIME 0.08 0.20 0.04 0.13 0.18 0.09 0.04 0.24
Anton Chekhov_DREAMS 0.06 0.12 0.11 0.16 0.12 0.14 0.08 0.21
Anton Chekhov_GUSEV 0.15 0.12 0.11 0.17 0.08 0.18 0.06 0.15
Anton Chekhov_HAPPINESS 0.11 0.15 0.08 0.15 0.10 0.15 0.06 0.20
txt_polarity <- txt_sentiments %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    group_by(author_title, sentiment) %>%
    summarize(sentiment_count = n())
polarity_pivot <- txt_polarity %>%
    pivot_wider(names_from = sentiment, values_from = sentiment_count)
polarity_totals <- txt_polarity %>%
    group_by(author_title) %>%
    summarize(polarity_total = sum(sentiment_count)) %>%
    left_join(polarity_pivot, by = join_by(author_title)) %>%
    mutate(percent_neg = round(negative / polarity_total, 2),
           percent_pos = round(positive / polarity_total, 2))
polarity_totals[is.na(polarity_totals)] <- 0
show <- show <- c("author_title", "percent_pos", "percent_neg")
kable(head(polarity_totals[, colnames(polarity_totals) %in% show]), format = "simple")
author_title percent_neg percent_pos
Anton Chekhov_A MALEFACTOR 0.48 0.52
Anton Chekhov_AGAFYA 0.47 0.53
Anton Chekhov_AT CHRISTMAS TIME 0.37 0.63
Anton Chekhov_DREAMS 0.47 0.53
Anton Chekhov_GUSEV 0.54 0.46
Anton Chekhov_HAPPINESS 0.48 0.52

We load a pronunciation dictionary so that we can analyze the last words in each line and check whether they rhyme with either the last word of the previous line or the last word of the line before that. Poems might feature either of these rhyming structures, or none at all, but it’s unlikely that the same poem would feature both, so we give each text a rhyming score based on the larger of its count of lines that rhyme with the previous line and its count of lines that rhyme with the line two lines prior.

my_url3 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/cmudict_0_7b.txt"
cmudict <- readLines(my_url3)
cmudict <- as.data.frame(cmudict[-c(1:56)])
colnames(cmudict) <- "word"
cmudict <- cmudict %>%
    separate_wider_delim(cols = word, delim = "  ", names_sep = "_",
                         too_few = "align_start")
colnames(cmudict) <- c("word", "pronunciation")
cmudict$word <- str_to_lower(str_replace_all(cmudict$word, "[[:punct:]]", ""))
cmudict$start_index <- sapply(gregexpr("[A-Z]{2}1", cmudict$pronunciation),
                                      function(x) rev(x)[1])
cmudict <- cmudict %>%
    mutate(rhyming_phoneme = substr(.$pronunciation, start_index, 1000000L))
txt_words$author_title_line_num <- paste(txt_words$author_title,
                                         txt_words$line_num, sep = "_")
txt_words$word_num <- as.integer(ave(txt_words$word, txt_words$author_title_line_num,
                                     FUN = seq_along))
remove <- "author_title_line_num"
txt_words <- txt_words[, !colnames(txt_words) %in% remove]
if (!"txt_rhymes.csv" %in% completed){
    txt_rhymes <- txt_words %>%
        group_by(author_title, line_num) %>%
        top_n(1, word_num) %>%
        left_join(cmudict, by = join_by(word), multiple = "first") %>%
        mutate(rhymes_1_line_prev = 0, rhymes_2_line_prev = 0)
    remove <- c("pronunciation", "start_index")
    txt_rhymes <- txt_rhymes[, !colnames(txt_rhymes) %in% remove]
    for (i in 2:nrow(txt_rhymes)){
        if (is.na(txt_rhymes[i, 5]) | 
            is.na(txt_rhymes[(i-1), 5]) | 
            txt_rhymes[i, 5] != txt_rhymes[(i-1), 5]){
            next
        }else{
            if (txt_rhymes[i, 1] != txt_rhymes[(i-1), 1]){
                next
            }else{
                txt_rhymes[i, 6] <- 1
            }
        }
    }
    for (i in 3:nrow(txt_rhymes)){
        if (is.na(txt_rhymes[i, 5]) | 
            is.na(txt_rhymes[(i-2), 5]) | 
            txt_rhymes[i, 5] != txt_rhymes[(i-2), 5]){
            next
        }else{
            if (txt_rhymes[i, 1] != txt_rhymes[(i-2), 1]){
                next
            }else{
                txt_rhymes[i, 7] <- 1
            }
        }
    }
    write.csv(txt_rhymes, "txt_rhymes.csv", row.names = FALSE)
}else{
    my_url4 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/txt_rhymes.csv"
    txt_rhymes <- read.csv(my_url4)
    kable(head(txt_rhymes), format = "simple")
}
author_title line_num word word_num rhyming_phoneme rhymes_1_line_prev rhymes_2_line_prev
DH Lawrence_CRAVING FOR SPRING 1 world 8 ER1 L D 0 0
DH Lawrence_CRAVING FOR SPRING 3 spring 4 IH1 NG 0 0
DH Lawrence_CRAVING FOR SPRING 4 sap 6 AE1 P 0 0
DH Lawrence_CRAVING FOR SPRING 5 creation 4 EY1 SH AH0 N 0 0
DH Lawrence_CRAVING FOR SPRING 6 mortifica 8 NA 0 0
DH Lawrence_CRAVING FOR SPRING 7 tion 1 NA 0 0
rhymes_summary <- txt_rhymes %>%
    group_by(author_title) %>%
    summarize(rhyming_score = max(sum(rhymes_1_line_prev), sum(rhymes_2_line_prev))) %>%
    arrange(desc(rhyming_score))
kable(head(rhymes_summary), format = "simple")
author_title rhyming_score
Phillis Wheatley_G O L I A T H O F G A T H. 76
Phillis Wheatley_NIOBE in Distress for her Children slain by APOLLO, from Ovid’s Metamorphoses, Book VI. and from a view of the Painting of Mr. Richard Wilson. 76
Phillis Wheatley_Thoughts on the WORKS OF PROVIDENCE. 52
Louisa May Alcott_CLOVER-BLOSSOM. 45
Hans Christian Andersen_A STORY FROM THE SAND-HILLS 38
Dora Sigerson_THE SUICIDE’S GRAVE 36

Now that we’ve created many of the numeric variables that could possibly distinguish a poem from a short story, we create a summary data frame and add one more numeric variable, as well as a few ratios between our numeric variables that could potentially distinguish these texts further: the word count, the words per non-empty lines, the punctuation per non-empty lines, and the rhymes per non-empty lines.

txt_summary <- txt_words %>%
    group_by(author_title) %>%
    summarize(word_count = n()) %>%
    left_join(txt_lines, by = join_by(author_title)) %>%
    left_join(txt_punct, by = join_by(author_title)) %>%
    left_join(emotion_totals, by = join_by(author_title)) %>%
    left_join(polarity_totals, by = join_by(author_title)) %>%
    left_join(rhymes_summary, by = join_by(author_title))
txt_summary$total_lines <- as.integer(txt_summary$total_lines)
remove <- c("emotion_total", "polarity_total", "anger", "anticipation", "disgust",
            "fear", "joy", "sadness", "surprise", "trust", "positive", "negative")
txt_summary <- txt_summary[, !colnames(txt_summary) %in% remove]
txt_summary <- txt_summary %>%
    mutate(words_per_non_empty_lines = round(word_count / total_non_empty_lines, 2),
           punct_per_non_empty_lines = round(total_punct / total_non_empty_lines, 2),
           rhymes_per_non_empty_lines = round(rhyming_score / total_non_empty_lines, 2))
datatable(head(txt_summary), options = list(scrollX = TRUE))

Create Train and Test Datasets:

Now we can split our entirely numeric summary data into train and test datasets. Then, we fit a decision tree model.

# process for this chunk and the next adapted from
# https://www.guru99.com/r-decision-trees.html
split_data_into_test_train <- function(data, size = 0.7, train = TRUE){
    n_row <- nrow(data)
    split_n_row <- size * n_row
    split <- 1:split_n_row
    if (train == TRUE){
        return (data[split, ])
    } else {
        return (data[-split, ])
    }
}
train <- split_data_into_test_train(txt_summary, size = 0.7, train = TRUE)
train <- train %>%
    column_to_rownames(var="author_title")
test <- split_data_into_test_train(txt_summary, size = 0.7, train = FALSE)
test <- test %>%
    column_to_rownames(var="author_title")
fit <- rpart(category~., data = train, method = "class")
rpart.plot(fit, extra = 106)

We see that the model has only incorporated three of the variables we created into its decision tree: word count, words per non-empty lines, and rhymes per non-empty lines. We’ll discuss ways to improve the data that could make our other numeric variables more meaningful to the model later, as word count is so indicative of whether something is a poem or short story that we believe it clouds the influence of other variables. We also know some of our other variables could be measured more accurately as well. First, let’s see how the model does at predictive classification.

Predict Document Class Using Decision Tree:

predict_unseen <- predict(fit, test, type = "class")
table_mat <- table(test$category, predict_unseen)
table_mat
##    predict_unseen
##      0  1
##   0 88 14
##   1 18 63
accuracy_test <- round(sum(diag(table_mat)) / sum(table_mat), 4) * 100

Assess Accuracy of the Model:

The model correctly labeled 88 poems, but mislabeled 14 of them as short stories. It also correctly labeled 63 short stories, but mislabled 18 of them as poems.

The model is 82.51% accurate.

Conclusions:

The model accuracy is great, but we wanted text features other than word count to stand out when predicting poetry vs. short stories. We could probably improve the metrics we developed if: