Text Analysis Project

Author

Monique Grew

Text Analysis of Classic and Contemporary Novels

For this text project, I am analyzing young adult novels from the 1950s–60s and the 2010s. I chose to examine To Kill a Mockingbird by Harper Lee and The Catcher in the Rye by J.D. Salinger as classic novels, and The Fault in Our Stars by John Green and It Ends With Us by Colleen Hoover as contemporary novels.

I will investigate language and word choice in the novels, and display and compare positive and negative sentiments between the classic and contemporary book pairs.

My hypothesis is that the overlap in word choice and usage between the two time periods will be very slim or non-existent. I also think that the contemporary novels will have more positive sentiment overall than the classic novels.

First, I loaded all necessary packages into the R file.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(textdata)
library(wordcloud2)

Next, I loaded four different books into the R file from txt files. I separated each book into data frames and named them.

catcher_in_the_rye <- read_csv("catcher_in_the_rye.txt", col_names = FALSE, show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
it_ends_with_us <- read_csv("it_ends_with_us.txt", col_names = FALSE, show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
mockingbird <- read_csv("mockingbird.txt", col_names = FALSE, show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
the_fault_in_our_stars <- read_csv("the_fault_in_our_stars.txt", col_names = FALSE, show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
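
The parsing warnings above appear because read_csv() treats commas inside the prose as column delimiters. Since only the raw text matters here, reading each file as plain lines would sidestep the warnings entirely; a minimal sketch of that alternative (the analysis below keeps the read_csv() data frames):

# Hypothetical alternative: one line of text per row, no delimiter parsing.
catcher_lines <- tibble(X1 = read_lines("catcher_in_the_rye.txt"))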

Next, I unnested tokens for Catcher in the Rye, splitting the text into one word per row, and placed the book in the “Old” time period category.

catcher_in_the_rye <- catcher_in_the_rye |> 
  unnest_tokens(word, X1) |> 
  mutate(Book = "Catcher in the Rye") |> 
  mutate(Period = 'Old')
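
As a quick sanity check (hypothetical, not part of the original run), each row should now hold a single lowercase token along with the Book and Period labels:

# Peek at the first few tokens to confirm the one-word-per-row structure.
catcher_in_the_rye |> 
  slice_head(n = 5)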

Then, I found the top positive sentiment words in Catcher in the Rye using the AFINN lexicon and plotted them on a graph.

catcher_in_the_rye |> 
  inner_join(get_sentiments('afinn'), by = 'word') |> 
  distinct(word, value) |>   # keep each word once so repeated tokens don't stack in the bars
  arrange(desc(value)) |> 
  head(12) |> 
  ggplot(aes(x = reorder(word, value), y = value, fill = word)) + geom_col() + coord_flip() +
  labs(x = "Word",
       y = "AFINN Sentiment Score", 
       title = "Top Positive Sentiments in 'Catcher in the Rye'") + 
  theme_classic()

Then, I found the most frequently appearing words in Catcher in the Rye.

catcher_in_the_rye |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  knitr::kable() 
Joining with `by = join_by(word)`
word n
didn’t 323
goddam 243
hell 223
don’t 200
i’d 184
time 181
sort 178
guy 169
started 151
boy 142
told 132
damn 122
pretty 117
feel 102
phoebe 102
stuff 102
wouldn’t 101
lot 92
couldn’t 89
wasn’t 89

Afterwards, I found the 20 most frequently appearing words and visualized them in a word cloud.

Doing so provides a comparison with the positive sentiments: none of the top three positive sentiment words appear among the 20 most frequent words. The positive sentiment words also appeared at most 24 times each, whereas even the 20th most frequent word appeared 89 times.
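
That figure of 24 can be checked directly by counting occurrences of the AFINN-positive words (a hypothetical verification step, not in the original run):

# Most frequent positive-sentiment word and its count.
catcher_in_the_rye |> 
  inner_join(get_sentiments('afinn'), by = 'word') |> 
  filter(value > 0) |> 
  count(word, sort = TRUE) |> 
  slice_head(n = 1)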

For the word cloud, I filtered out character names and other common words.

catcher_in_the_rye |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "i'm", "it's", "he's",
                      "atticus", "jem", "ryle", "lily", "phoebe")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`
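
One quirk worth noting: the source texts use curly apostrophes (didn’t, not didn't), so the filter terms must match that form exactly. A hypothetical helper that normalizes the quotes once, so every later filter could use plain ASCII apostrophes instead:

# Hypothetical helper: replace curly apostrophes (U+2019) with ASCII ones.
normalize_quotes <- function(df) {
  df |> 
    mutate(word = str_replace_all(word, "\u2019", "'"))
}
# e.g. catcher_in_the_rye <- normalize_quotes(catcher_in_the_rye)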

I repeated this process for It Ends With Us, unnesting its tokens and then visualizing the top positive sentiment words on a graph.

Since this is a contemporary novel, I labeled it as part of the “New” period.

it_ends_with_us <- it_ends_with_us |> 
  unnest_tokens(word, X1) |> 
  mutate(Book = "It Ends With Us") |> 
  mutate(Period = 'New')

it_ends_with_us |> 
  inner_join(get_sentiments('afinn'), by = 'word') |> 
  distinct(word, value) |>   # keep each word once so repeated tokens don't stack in the bars
  arrange(desc(value)) |> 
  head(30) |> 
  ggplot(aes(x = reorder(word, value), y = value, fill = word)) + geom_col() + coord_flip() +
  labs(x = "Word",
       y = "AFINN Sentiment Score", 
       title = "Top Positive Sentiments in 'It Ends With Us'") + 
  theme_classic()

I repeated the process of finding the most common words and again visualized them in a word cloud.

it_ends_with_us |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  knitr::kable() 
Joining with `by = join_by(word)`
word n
i’m 658
ryle 445
it’s 367
he’s 341
lily 286
don’t 285
head 276
door 267
hand 252
allysa 246
atlas 246
eyes 240
time 232
feel 221
hands 202
i’ve 152
mother 137
that’s 129
bed 127
can’t 127
it_ends_with_us |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "i'm", "it's", "he's",
                      "atticus", "jem", "ryle", "lily", "allysa")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`

I followed the same steps for categorizing and visualizing words in To Kill a Mockingbird, filtering out character names and other unnecessary words and displaying the rest in a word cloud.

library(dplyr)

mockingbird <- mockingbird |> 
  unnest_tokens(word, X1) |> 
  mutate(Book = "To Kill A Mockingbird") |> 
  mutate(Period = 'Old')

data("stop_words")

mockingbird |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  knitr::kable() 
Joining with `by = join_by(word)`
word n
jem 975
atticus 823
miss 394
dill 254
time 209
looked 182
finch 160
house 160
radley 158
scout 155
maycomb 154
head 146
front 144
calpurnia 142
home 141
em 138
maudie 124
aunt 119
ewell 119
heard 118
mockingbird |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "i'm", "it's", "he's","atticus", "jem", "ryle", "lily", "calpurnia", "radley", 
"finch", "scout", "maudie", "dill", "aunt", "alexandra", "maycomb", "front", "jems", "em", "ewell", "tate", "tom")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`

I continued this process once more for The Fault in Our Stars.

the_fault_in_our_stars <- the_fault_in_our_stars |> 
  unnest_tokens(word, X1) |> 
  mutate(Book = "The Fault in Our Stars") |> 
  mutate(Period = 'New')

the_fault_in_our_stars |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  knitr::kable() 
Joining with `by = join_by(word)`
word n
augustus 29
isaac 21
cancer 18
time 18
mom 17
patrick 17
support 16
eye 12
heart 12
waters 12
green 11
hazel 11
boy 9
fault 9
john 9
lungs 9
stars 9
life 8
walked 8
blind 7
the_fault_in_our_stars |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "i'm", "it's", "he's",
"atticus", "jem", "ryle", "lily", "isaac", "patrick", "mom",
                      "john", "augustus", "lungs", "blind")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`
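
The same count–filter–cloud pipeline runs four times above; a small helper would remove the repetition (hypothetical refactor, with my own function and argument names):

# Build a word cloud of the 20 most frequent non-stop words, minus drop_words.
make_cloud <- function(tokens, drop_words) {
  tokens |> 
    anti_join(stop_words, by = "word") |> 
    count(word) |> 
    filter(!word %in% drop_words) |> 
    arrange(desc(n)) |> 
    head(20) |> 
    wordcloud2()
}
# e.g. make_cloud(mockingbird, c("jem", "atticus", "scout"))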

After that, I merged the text from all four books together using the full_join() function.

catcher_in_the_rye |> 
  full_join(it_ends_with_us) |> 
  full_join(mockingbird) |> 
  full_join(the_fault_in_our_stars) -> merged 
Joining with `by = join_by(word, Book, Period)`
Joining with `by = join_by(word, Book, Period)`
Joining with `by = join_by(word, Book, Period)`
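
Because the four data frames share the same columns and no rows match across different Book values, full_join() here simply stacks them; bind_rows() would express that intent more directly (a hypothetical equivalent):

# Equivalent stacking of the four token data frames.
merged_alt <- bind_rows(catcher_in_the_rye, it_ends_with_us,
                        mockingbird, the_fault_in_our_stars)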

Just for reference, I noted the most commonly appearing words across all four books as well, filtering out only common character names. (Because the name filter runs after head(20), the table below shows 16 rows rather than 20.)

merged |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  filter(!word %in% c("jem", "atticus", "ryle", "lily")) |>
  knitr::kable() 
Joining with `by = join_by(word)`
word n
i’m 679
time 640
don’t 485
head 468
door 425
hand 421
miss 417
didn’t 411
it’s 380
feel 352
he’s 349
looked 342
told 333
eyes 323
home 307
night 285

I visualized these words in a graph showing the most common words across all four texts, first stripping punctuation from the tokens.

merged |> 
  anti_join(stop_words) |> 
  mutate(clean_text = gsub(pattern = '[[:punct:]]', replacement='', word)) -> merged
Joining with `by = join_by(word)`
merged |> 
  count(clean_text, sort = TRUE) |> 
  head(20) |> 
  ggplot(aes(reorder(clean_text, n), n)) + 
  geom_col() + 
  coord_flip() +
  labs(title = "Most Common Words Across Texts", 
       x = "Word",
       y = "Frequency of Word")

Now, I went back and computed AFINN sentiments for To Kill a Mockingbird, The Fault in Our Stars, and It Ends With Us, since I hadn’t done so for these texts yet. (AFINN assigns each word an integer score from -5, very negative, to +5, very positive.)

merged |> 
  inner_join(get_sentiments('afinn')) -> merged_sentiment
Joining with `by = join_by(word)`
merged_sentiment |> 
  filter(Book %in% "To Kill A Mockingbird") |> 
  count(word, value, sort = TRUE) 
# A tibble: 762 × 3
   word    value     n
   <chr>   <dbl> <int>
 1 miss       -2   394
 2 stopped    -1    59
 3 scared     -2    40
 4 nigger     -5    37
 5 matter      1    36
 6 stop       -1    36
 7 dead       -3    35
 8 hard       -1    32
 9 reached     1    29
10 god         1    28
# ℹ 752 more rows
merged_sentiment |> 
  filter(Book %in% "The Fault in Our Stars") |> 
  count(word, value, sort = TRUE)
# A tibble: 130 × 3
   word    value     n
   <chr>   <dbl> <int>
 1 cancer     -1    18
 2 support     2    16
 3 blind      -1     7
 4 jesus       1     7
 5 smile       2     6
 6 smiled      2     6
 7 god         1     5
 8 grace       1     4
 9 love        3     4
10 yeah        1     4
# ℹ 120 more rows
merged_sentiment |> 
  filter(Book %in% "It Ends With Us") |> 
  count(word, value, sort = TRUE) 
# A tibble: 666 × 3
   word   value     n
   <chr>  <dbl> <int>
 1 love       3   115
 2 smile      2   101
 3 leave     -1    78
 4 hard      -1    63
 5 laugh      1    63
 6 stop      -1    63
 7 top        2    61
 8 hurt      -2    59
 9 bad       -3    57
10 scared    -2    49
# ℹ 656 more rows
book_sentiment <- merged_sentiment |>
  group_by(Book) |>
  summarise(sentiment_score = sum(value)) |>
  arrange(desc(sentiment_score))
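
Since the hypothesis compares eras rather than individual books, the same summary can be grouped by Period instead (a hypothetical extension, not part of the original analysis):

# Total AFINN score per time period.
merged_sentiment |> 
  group_by(Period) |> 
  summarise(sentiment_score = sum(value)) |> 
  arrange(desc(sentiment_score))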

data("stop_words")

I then created a word cloud of the 20 most frequent words across the four texts, which included both positively and negatively charged words.

merged |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "it's", "he's",
                      "atticus", "jem", "ryle", "lily")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`

Finally, I completed a sentiment analysis of the four books by plotting each book’s total “sentiment score” on a graph.

ggplot(book_sentiment, aes(x = reorder(Book, sentiment_score), y = sentiment_score)) +
  geom_col(fill = "pink") +
  labs(title = "Sentiment Analysis of Books",
       x = "Book",
       y = "Sentiment Score") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As it turns out, the two “old” books had starkly negative sentiment scores. The two “new” novels, on the other hand, scored less negatively.

The Fault in Our Stars had a sentiment score so close to 0 that it was not visible on the graph.
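
Raw sums also scale with book length, and The Fault in Our Stars contributed far fewer tokens than the other books, which partly explains its near-zero bar. A length-adjusted comparison would average the AFINN value per sentiment-bearing word instead; a minimal sketch of that hypothetical refinement:

# Mean AFINN score per sentiment-bearing word, adjusted for book length.
merged_sentiment |> 
  group_by(Book, Period) |> 
  summarise(mean_score = mean(value), n_words = n(), .groups = "drop") |> 
  arrange(desc(mean_score))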

Based on this, it is clear that all four novels leaned strongly toward negative sentiment over positive sentiment. Still, my hypothesis was partly correct: the contemporary novels were overall more positive than the classic novels.