Data 607 - Project 4

Load the Required Packages:

Below, we load the packages required for data analysis, predictive document classification, and visualization.

library(knitr)
library(tidyverse)
library(tidytext)
library(textdata)
library(rpart)
library(rpart.plot)
library(DT)

Load the Poetry and Short Story Data:

We will be predicting whether a text document is a poem or a short story based on some of its features, so we load a number of poetry and short story collections in the public domain that we gathered from Project Gutenberg into a dataframe. We manually added delimiters to the text files to mark the beginning of each unique poem or short story in the collection. We also named the files in such a way as to make extracting metadata like category (P for poetry, SS for short story), collection name, and author name easier. Once we have captured the lines of each text and numbered them sequentially, we pivot the data into a wider format so that each row observation is a single poem or short story, and that row’s numbered lines are its columns.

completed <- readLines("completed.txt") 
if (length(completed) == 0){
    files <- list.files(pattern = "^P_.+\\.txt$|^SS_.+\\.txt$")
    txt_df <- as.data.frame(matrix(nrow = 0, ncol = 6))
    cols <- c("category", "collection", "author", "title", "author_title", "lines")
    colnames(txt_df) <- cols
    for (i in 1:length(files)){
        extraction <- str_replace_all(files[i], "_", " ")
        p <- "(?<cat>P|SS)[- ](?<coll>.+)(?<by> by )(?<auth>.+)(?<ftype> Altered\\.txt)"
        extraction <- str_match(extraction, p)
        category <- extraction[1, 2]
        collection <- extraction[1, 3]
        author <- extraction[1, 5]
        txt <- trimws(readLines(files[i]), which="left")
        txt <- as.data.frame(txt)
        write(files[i], file = "completed.txt", append = TRUE)
        dlim <- txt[1, 1]
        if (dlim == "+"){
            starting_line <- str_detect(txt[, 1], "^\\+$")
        }else if (dlim == "="){
            starting_line <- str_detect(txt[, 1], "^=$")
        }
        for (j in nrow(txt):1){
            if (j == nrow(txt)){
                end <- j
            }
            if (starting_line[j]){
                start <- j + 1
                content <- txt[start:end, 1]
                title <- content[1]
                author_title <- paste(author, title, sep = "_")
                lines <- content[2:length(content)]
                #we want to preserve meaningful line breaks throughout the text
                #but eliminate leading or tailing sequences of them; process follows
                non_empty <- lines != ""
                #find index of first TRUE in vector of whether lines are not empty
                x <- which.max(non_empty) 
                lines <- lines[x:length(lines)] #trim empty lines from front only
                rev_lines <- rev(lines) #reverse lines so we can trim from back
                y <- which.max(rev(non_empty)) 
                rev_lines <- rev_lines[y:length(rev_lines)] #trim from back
                lines <- rev(rev_lines) #put lines back in proper order
                addition <- cbind(category, collection, author, title,
                                  author_title, lines)
                txt_df <- rbind(txt_df, addition)
                end <- j - 1
            }else{
                next
            }
        }
    }
    txt_df$line_num <- as.integer(ave(txt_df$lines, txt_df$author_title, FUN = seq_along))
    txt_pivot <- txt_df %>%
        pivot_wider(names_from = line_num, names_prefix = "line_" ,
                values_from = lines)
    write.csv(txt_df, file = "txt_df.csv", row.names = FALSE)
    write.csv(txt_pivot, file = "txt_pivot.csv", row.names = FALSE)
}else{
    my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/txt_df.csv"
    txt_df <- read.csv(my_url1)
    my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/txt_pivot.csv"
    txt_pivot <- read.csv(my_url2)
    show <- c("category", "collection", "author", "title", "line_1")
    kable(head(txt_pivot[, colnames(txt_pivot) %in% show]), format = "simple")
}

category	collection	author	title	line_1
P	Look We Have Come Through	DH Lawrence	CRAVING FOR SPRING	I WISH it were spring in the world.
P	Look We Have Come Through	DH Lawrence	FROST FLOWERS	IT is not long since, here among all these folk
P	Look We Have Come Through	DH Lawrence	AUTUMN RAIN	THE plane leaves
P	Look We Have Come Through	DH Lawrence	MANIFESTO	I
P	Look We Have Come Through	DH Lawrence	ELYSIUM	I HAVE found a place of loneliness
P	Look We Have Come Through	DH Lawrence	NEW HEAVEN AND EARTH	I

Load Sentiment Analysis Lexicons:

We load two sentiment lexicons so that we can analyze both emotional and polarity indicators for the words in each line. Since the nrc lexicon includes both emotional (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) indicators and polarity (negative, positive) indicators, we will prioritize the indicators from the nrc lexicon and consider it the primary. We will only use the secondary sentiment lexicon bing to fill in polarity indicators for words the primary lexicon hasn’t labeled positive or negative.

nrc <- get_sentiments("nrc")
cols <- c("word", "sentiment_nrc")
colnames(nrc) <- cols
bing <- get_sentiments("bing")
cols <- c("word", "sentiment_bing")
colnames(bing) <- cols

Set a Seed:

We set a seed so that when we shuffle our data and later create train and test datasets from it, we get the same results each time.

set.seed(6001)

Shuffle the Data:

We shuffle the data so that the poems are no longer sorted next to all the other poems, and the short stories are no longer sorted next to all the other short stories. We also change the P (poetry) vs. SS (short story) category to a numeric variable with 0 indicating P and 1 indicating SS, and we emove some columns we don’t need for the numeric analysis we’ll be doing later.

txt_shuffle <- txt_pivot
shuffle <- sample(1:nrow(txt_shuffle))
txt_shuffle <- txt_shuffle[shuffle, ]
txt_shuffle$category <- ifelse(txt_shuffle$category == "SS", 1, 0)
remove <- c("collection", "author", "title")
txt_shuffle <- txt_shuffle[, !colnames(txt_shuffle) %in% remove]
show <- c("category", "author_title", "line_1")
kable(head(txt_shuffle[, colnames(txt_shuffle) %in% show]), format = "simple")

	category	author_title	line_1
108	0	Ella Wheeler Wilcox_EARTH BOUND	New paradise, and groom and bride;
17	0	DH Lawrence_BIRTH NIGHT	THIS fireglow is a red womb
175	0	Dora Sigerson_BEWARE	I closed my hands upon a moth
371	1	Hans Christian Andersen_THE MONEY-BOX	In a nursery where a number of toys lay scattered about, a
185	0	Dora Sigerson_SANCTUARY	Neighbour! for pity a hound cries on your steps
296	0	William Butler Yeats_INTO THE TWILIGHT	Out-worn heart, in a time out-worn,

Create Numeric Variables:

We create numeric variables that will be able to capture some features of our text data. The first numeric variables we create are a count of the punctuation in a text, as well as a count of the non-empty lines in a text, i.e. lines that do not represent intentional breaks separating stanzas or paragraphs within that text.

txt_punct <- txt_shuffle
txt_punct[txt_punct == ""] <- NA
txt_punct <- txt_punct %>%
    mutate(across(starts_with("line_"), \(x) str_count(x, "[[:punct:]]"))) %>%
    mutate(total_punct = rowSums(select(., starts_with("line_")), na.rm = TRUE)) %>%
    mutate(total_non_empty_lines = rowSums(!is.na(select(.,
                                                         starts_with("line_"))))) %>%
    select(!starts_with("line_"))
kable(head(txt_punct), format = "simple")

	category	author_title	total_punct	total_non_empty_lines
108	0	Ella Wheeler Wilcox_EARTH BOUND	71	30
17	0	DH Lawrence_BIRTH NIGHT	22	24
175	0	Dora Sigerson_BEWARE	5	8
371	1	Hans Christian Andersen_THE MONEY-BOX	162	77
185	0	Dora Sigerson_SANCTUARY	58	24
296	0	William Butler Yeats_INTO THE TWILIGHT	21	16

We also create a count of the total lines in a text, including the previously mentioned intentional breaks.

txt_words <- txt_df
remove <- c("category", "collection", "author", "title")
txt_lines <- txt_words[, !colnames(txt_words) %in% remove]
txt_lines <- txt_lines %>%
    group_by(author_title) %>%
    summarize(total_lines = max(line_num))
kable(head(txt_lines), format = "simple")

author_title	total_lines
Anton Chekhov_A MALEFACTOR	198
Anton Chekhov_AGAFYA	461
Anton Chekhov_AT CHRISTMAS TIME	222
Anton Chekhov_DREAMS	340
Anton Chekhov_GUSEV	607
Anton Chekhov_HAPPINESS	411

We then separate the lines into their individual word components.

txt_words <- txt_words[, !colnames(txt_words) %in% remove]
txt_words <- txt_words %>%
    unnest_tokens(word, lines)
kable(head(txt_words), format = "simple")

author_title	line_num	word
DH Lawrence_CRAVING FOR SPRING	1	i
DH Lawrence_CRAVING FOR SPRING	1	wish
DH Lawrence_CRAVING FOR SPRING	1	it
DH Lawrence_CRAVING FOR SPRING	1	were
DH Lawrence_CRAVING FOR SPRING	1	spring
DH Lawrence_CRAVING FOR SPRING	1	in

We join the words to the emotional and polarity indicators we previously discussed. Then we create numeric variables that indicate which portion of the labeled words in each text belong to each emotional indicator, as well as which portion of the labeled words in each text belong to each emotional indicator. A text might have only one labeled word, and it might have two labels: joy and positive. That text would get a score of 1 for percent_joy and 1 for percent_pos, which is not to indicate every word in the text is associated with both joy and positivity, but that every labeled word in the text is. We are not attempting to create a measure that accounts for the many unlabeled words.

txt_sentiments <- txt_words %>%
    left_join(nrc, by = join_by(word), multiple = "all") %>%
    left_join(bing, by = join_by(word), multiple = "all") %>%
    mutate(sentiment = coalesce(sentiment_nrc, sentiment_bing))
remove <- c("sentiment_nrc", "sentiment_bing")
txt_sentiments <- txt_sentiments[, !colnames(txt_sentiments) %in% remove]
txt_emotions <- txt_sentiments %>%
    filter(!sentiment %in% c("positive", "negative") & !is.na(sentiment)) %>% 
    group_by(author_title, sentiment) %>%
    summarize(sentiment_count = n())
emotions_pivot <- txt_emotions %>%
    pivot_wider(names_from = sentiment, values_from = sentiment_count)
emotion_totals <- txt_emotions %>%
    group_by(author_title) %>%
    summarize(emotion_total = sum(sentiment_count)) %>%
    left_join(emotions_pivot, by = join_by(author_title)) %>%
    mutate(percent_anger = round(anger / emotion_total, 2),
           percent_anticipation = round(anticipation / emotion_total, 2),
           percent_disgust = round(disgust / emotion_total, 2),
           percent_fear = round(fear / emotion_total, 2),
           percent_joy = round(joy / emotion_total, 2),
           percent_sadness = round(sadness / emotion_total, 2),
           percent_surprise = round(surprise / emotion_total, 2),
           percent_trust = round(trust / emotion_total, 2))
emotion_totals[is.na(emotion_totals)] <- 0
show <- c("author_title", "percent_anger", "percent_anticipation",
          "percent_disgust", "percent_fear", "percent_joy", "percent_sadness",
          "percent_surprise", "percent_trust")
kable(head(emotion_totals[, colnames(emotion_totals) %in% show]), format = "simple")

author_title	percent_anger	percent_anticipation	percent_disgust	percent_fear	percent_joy	percent_sadness	percent_surprise	percent_trust
Anton Chekhov_A MALEFACTOR	0.10	0.11	0.10	0.15	0.10	0.16	0.08	0.20
Anton Chekhov_AGAFYA	0.10	0.17	0.06	0.13	0.13	0.15	0.09	0.16
Anton Chekhov_AT CHRISTMAS TIME	0.08	0.20	0.04	0.13	0.18	0.09	0.04	0.24
Anton Chekhov_DREAMS	0.06	0.12	0.11	0.16	0.12	0.14	0.08	0.21
Anton Chekhov_GUSEV	0.15	0.12	0.11	0.17	0.08	0.18	0.06	0.15
Anton Chekhov_HAPPINESS	0.11	0.15	0.08	0.15	0.10	0.15	0.06	0.20

txt_polarity <- txt_sentiments %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    group_by(author_title, sentiment) %>%
    summarize(sentiment_count = n())
polarity_pivot <- txt_polarity %>%
    pivot_wider(names_from = sentiment, values_from = sentiment_count)
polarity_totals <- txt_polarity %>%
    group_by(author_title) %>%
    summarize(polarity_total = sum(sentiment_count)) %>%
    left_join(polarity_pivot, by = join_by(author_title)) %>%
    mutate(percent_neg = round(negative / polarity_total, 2),
           percent_pos = round(positive / polarity_total, 2))
polarity_totals[is.na(polarity_totals)] <- 0
show <- show <- c("author_title", "percent_pos", "percent_neg")
kable(head(polarity_totals[, colnames(polarity_totals) %in% show]), format = "simple")

author_title	percent_neg	percent_pos
Anton Chekhov_A MALEFACTOR	0.48	0.52
Anton Chekhov_AGAFYA	0.47	0.53
Anton Chekhov_AT CHRISTMAS TIME	0.37	0.63
Anton Chekhov_DREAMS	0.47	0.53
Anton Chekhov_GUSEV	0.54	0.46
Anton Chekhov_HAPPINESS	0.48	0.52

We load a pronunciation dictionary so that we can analyze the last words in each line and check whether they rhyme with either the last word of the previous line or the last word of the line before that. Poems might feature either of these rhyming structures, or none at all, but it’s unlikely that the same poem would feature both, so we give each text a rhyming score based on the larger of its count of lines that rhyme with the previous line and its count of lines that rhyme with the line two lines prior.

my_url3 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/cmudict_0_7b.txt"
cmudict <- readLines(my_url3)
cmudict <- as.data.frame(cmudict[-c(1:56)])
colnames(cmudict) <- "word"
cmudict <- cmudict %>%
    separate_wider_delim(cols = word, delim = "  ", names_sep = "_",
                         too_few = "align_start")
colnames(cmudict) <- c("word", "pronunciation")
cmudict$word <- str_to_lower(str_replace_all(cmudict$word, "[[:punct:]]", ""))
cmudict$start_index <- sapply(gregexpr("[A-Z]{2}1", cmudict$pronunciation),
                                      function(x) rev(x)[1])
cmudict <- cmudict %>%
    mutate(rhyming_phoneme = substr(.$pronunciation, start_index, 1000000L))
txt_words$author_title_line_num <- paste(txt_words$author_title,
                                         txt_words$line_num, sep = "_")
txt_words$word_num <- as.integer(ave(txt_words$word, txt_words$author_title_line_num,
                                     FUN = seq_along))
remove <- "author_title_line_num"
txt_words <- txt_words[, !colnames(txt_words) %in% remove]
if (!"txt_rhymes.csv" %in% completed){
    txt_rhymes <- txt_words %>%
        group_by(author_title, line_num) %>%
        top_n(1, word_num) %>%
        left_join(cmudict, by = join_by(word), multiple = "first") %>%
        mutate(rhymes_1_line_prev = 0, rhymes_2_line_prev = 0)
    remove <- c("pronunciation", "start_index")
    txt_rhymes <- txt_rhymes[, !colnames(txt_rhymes) %in% remove]
    for (i in 2:nrow(txt_rhymes)){
        if (is.na(txt_rhymes[i, 5]) | 
            is.na(txt_rhymes[(i-1), 5]) | 
            txt_rhymes[i, 5] != txt_rhymes[(i-1), 5]){
            next
        }else{
            if (txt_rhymes[i, 1] != txt_rhymes[(i-1), 1]){
                next
            }else{
                txt_rhymes[i, 6] <- 1
            }
        }
    }
    for (i in 3:nrow(txt_rhymes)){
        if (is.na(txt_rhymes[i, 5]) | 
            is.na(txt_rhymes[(i-2), 5]) | 
            txt_rhymes[i, 5] != txt_rhymes[(i-2), 5]){
            next
        }else{
            if (txt_rhymes[i, 1] != txt_rhymes[(i-2), 1]){
                next
            }else{
                txt_rhymes[i, 7] <- 1
            }
        }
    }
    write.csv(txt_rhymes, "txt_rhymes.csv", row.names = FALSE)
}else{
    my_url4 <- "https://raw.githubusercontent.com/geedoubledee/data607_project4/main/txt_rhymes.csv"
    txt_rhymes <- read.csv(my_url4)
    kable(head(txt_rhymes), format = "simple")
}

author_title	line_num	word	word_num	rhyming_phoneme
DH Lawrence_CRAVING FOR SPRING	1	world	8	ER1 L D
DH Lawrence_CRAVING FOR SPRING	3	spring	4	IH1 NG
DH Lawrence_CRAVING FOR SPRING	4	sap	6	AE1 P
DH Lawrence_CRAVING FOR SPRING	5	creation	4	EY1 SH AH0 N
DH Lawrence_CRAVING FOR SPRING	6	mortifica	8	NA
DH Lawrence_CRAVING FOR SPRING	7	tion	1	NA

rhymes_summary <- txt_rhymes %>%
    group_by(author_title) %>%
    summarize(rhyming_score = max(sum(rhymes_1_line_prev), sum(rhymes_2_line_prev))) %>%
    arrange(desc(rhyming_score))
kable(head(rhymes_summary), format = "simple")

author_title	rhyming_score
Phillis Wheatley_G O L I A T H O F G A T H.	76
Phillis Wheatley_NIOBE in Distress for her Children slain by APOLLO, from Ovid’s Metamorphoses, Book VI. and from a view of the Painting of Mr. Richard Wilson.	76
Phillis Wheatley_Thoughts on the WORKS OF PROVIDENCE.	52
Louisa May Alcott_CLOVER-BLOSSOM.	45
Hans Christian Andersen_A STORY FROM THE SAND-HILLS	38
Dora Sigerson_THE SUICIDE’S GRAVE	36

Now that we’ve created many of the numeric variables that could possibly distinguish a poem from a short story, we create a summary data frame and add one more numeric variable, as well as a few ratios between our numeric variables that could potentially distinguish these texts further: the word count, the words per non-empty lines, the punctuation per non-empty lines, and the rhymes per non-empty lines.

txt_summary <- txt_words %>%
    group_by(author_title) %>%
    summarize(word_count = n()) %>%
    left_join(txt_lines, by = join_by(author_title)) %>%
    left_join(txt_punct, by = join_by(author_title)) %>%
    left_join(emotion_totals, by = join_by(author_title)) %>%
    left_join(polarity_totals, by = join_by(author_title)) %>%
    left_join(rhymes_summary, by = join_by(author_title))
txt_summary$total_lines <- as.integer(txt_summary$total_lines)
remove <- c("emotion_total", "polarity_total", "anger", "anticipation", "disgust",
            "fear", "joy", "sadness", "surprise", "trust", "positive", "negative")
txt_summary <- txt_summary[, !colnames(txt_summary) %in% remove]
txt_summary <- txt_summary %>%
    mutate(words_per_non_empty_lines = round(word_count / total_non_empty_lines, 2),
           punct_per_non_empty_lines = round(total_punct / total_non_empty_lines, 2),
           rhymes_per_non_empty_lines = round(rhyming_score / total_non_empty_lines, 2))
datatable(head(txt_summary), options = list(scrollX = TRUE))

Create Train and Test Datasets:

Now we can split our entirely numeric summary data into train and test datasets. Then, we fit a decision tree model.

# process for this chunk and the next adapted from
# https://www.guru99.com/r-decision-trees.html
split_data_into_test_train <- function(data, size = 0.7, train = TRUE){
    n_row <- nrow(data)
    split_n_row <- size * n_row
    split <- 1:split_n_row
    if (train == TRUE){
        return (data[split, ])
    } else {
        return (data[-split, ])
    }
}
train <- split_data_into_test_train(txt_summary, size = 0.7, train = TRUE)
train <- train %>%
    column_to_rownames(var="author_title")
test <- split_data_into_test_train(txt_summary, size = 0.7, train = FALSE)
test <- test %>%
    column_to_rownames(var="author_title")
fit <- rpart(category~., data = train, method = "class")
rpart.plot(fit, extra = 106)

We see that the model has only incorporated three of the variables we created into its decision tree: word count, words per non-empty lines, and rhymes per non-empty lines. We’ll discuss ways to improve the data that could make our other numeric variables more meaningful to the model later, as word count is so indicative of whether something is a poem or short story that we believe it clouds the influence of other variables. We also know some of our other variables could be measured more accurately as well. First, let’s see how the model does at predictive classification.

Predict Document Class Using Decision Tree:

predict_unseen <- predict(fit, test, type = "class")
table_mat <- table(test$category, predict_unseen)
table_mat

##    predict_unseen
##      0  1
##   0 88 14
##   1 18 63

accuracy_test <- round(sum(diag(table_mat)) / sum(table_mat), 4) * 100

Assess Accuracy of the Model:

The model correctly labeled 88 poems, but mislabeled 14 of them as short stories. It also correctly labeled 63 short stories, but mislabled 18 of them as poems.

The model is 82.51% accurate.

Conclusions:

The model accuracy is great, but we wanted text features other than word count to stand out when predicting poetry vs. short stories. We could probably improve the metrics we developed if:

the line length of the short stories was fixed like they would be in a physical book, making punctuation per non-empty line more meaningful
we took text samples that were all within range of the same word count
we excluded short stories that have poetry or poetic features within them
we expanded the number of words for which we had emotional/polarity sentiment indicators