Text Analysis Project

Author

Monique Grew

Text Analysis of Classic and Contemporary Novels

For this text project, I am analyzing young adult novels from the 1950s–60s and the 2010s. I chose to examine To Kill a Mockingbird by Harper Lee and The Catcher in the Rye by J.D. Salinger as classic novels, and The Fault in Our Stars by John Green and It Ends With Us by Colleen Hoover as contemporary novels.

I will investigate language and word choice in the novels, and display and compare positive and negative sentiments between the classic and contemporary book pairs.

My hypothesis is that the overlap in word choice and usage between the two time periods will be very slim or non-existent. I also think that the contemporary novels will have more positive sentiment overall than the classic novels.

First, I loaded all necessary packages into the R file.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(textdata)
library(wordcloud2)

Next, I loaded four different books into the R file from txt files. I separated each book into data frames and named them.

catcher_in_the_rye <- read_csv("catcher_in_the_rye.txt", col_names = FALSE, show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
it_ends_with_us <- read_csv("it_ends_with_us.txt", col_names = FALSE, show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
mockingbird <- read_csv("mockingbird.txt", col_names = FALSE, show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
the_fault_in_our_stars <- read_csv("the_fault_in_our_stars.txt", col_names = FALSE, show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
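
The parsing warnings above appear because read_csv() treats commas inside the prose as column delimiters. Since only the raw text matters here, reading each file as plain lines would sidestep the warnings entirely; a minimal sketch of that alternative (the analysis below keeps the read_csv() data frames):

# Hypothetical alternative: one line of text per row, no delimiter parsing.
catcher_lines <- tibble(X1 = read_lines("catcher_in_the_rye.txt"))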

Next, I unnested tokens for Catcher in the Rye, splitting the text into one word per row, and placed the book in the “Old” time period category.

catcher_in_the_rye <- catcher_in_the_rye |> 
  unnest_tokens(word, X1) |> 
  mutate(Book = "Catcher in the Rye") |> 
  mutate(Period = 'Old')
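
As a quick sanity check (hypothetical, not part of the original run), each row should now hold a single lowercase token along with the Book and Period labels:

# Peek at the first few tokens to confirm the one-word-per-row structure.
catcher_in_the_rye |> 
  slice_head(n = 5)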

Then, I found the top positive sentiment words in Catcher in the Rye using the AFINN lexicon and plotted them on a graph.

catcher_in_the_rye |> 
  inner_join(get_sentiments('afinn'), by = 'word') |> 
  distinct(word, value) |>   # keep each word once so repeated tokens don't stack in the bars
  arrange(desc(value)) |> 
  head(12) |> 
  ggplot(aes(x = reorder(word, value), y = value, fill = word)) + geom_col() + coord_flip() +
  labs(x = "Word",
       y = "AFINN Sentiment Score", 
       title = "Top Positive Sentiments in 'Catcher in the Rye'") + 
  theme_classic()

Then, I found the most frequently appearing words in Catcher in the Rye.

catcher_in_the_rye |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  knitr::kable() 
Joining with `by = join_by(word)`
word n
didn’t 323
goddam 243
hell 223
don’t 200
i’d 184
time 181
sort 178
guy 169
started 151
boy 142
told 132
damn 122
pretty 117
feel 102
phoebe 102
stuff 102
wouldn’t 101
lot 92
couldn’t 89
wasn’t 89

Afterwards, I found the 20 most frequently appearing words and visualized them in a word cloud.

Doing so provides a comparison with the positive sentiments: none of the top three positive sentiment words appear among the 20 most frequent words. The positive sentiment words also appeared at most 24 times each, whereas even the 20th most frequent word appeared 89 times.
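
That figure of 24 can be checked directly by counting occurrences of the AFINN-positive words (a hypothetical verification step, not in the original run):

# Most frequent positive-sentiment word and its count.
catcher_in_the_rye |> 
  inner_join(get_sentiments('afinn'), by = 'word') |> 
  filter(value > 0) |> 
  count(word, sort = TRUE) |> 
  slice_head(n = 1)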

For the word cloud, I filtered out character names and other common words.

catcher_in_the_rye |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "i'm", "it's", "he's",
                      "atticus", "jem", "ryle", "lily", "phoebe")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`
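
One quirk worth noting: the source texts use curly apostrophes (didn’t, not didn't), so the filter terms must match that form exactly. A hypothetical helper that normalizes the quotes once, so every later filter could use plain ASCII apostrophes instead:

# Hypothetical helper: replace curly apostrophes (U+2019) with ASCII ones.
normalize_quotes <- function(df) {
  df |> 
    mutate(word = str_replace_all(word, "\u2019", "'"))
}
# e.g. catcher_in_the_rye <- normalize_quotes(catcher_in_the_rye)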

I repeated this process for It Ends With Us, unnesting its tokens and then visualizing the top positive sentiment words on a graph.

Since this is a contemporary novel, I labeled it as part of the “New” period.

it_ends_with_us <- it_ends_with_us |> 
  unnest_tokens(word, X1) |> 
  mutate(Book = "It Ends With Us") |> 
  mutate(Period = 'New')

it_ends_with_us |> 
  inner_join(get_sentiments('afinn'), by = 'word') |> 
  distinct(word, value) |>   # keep each word once so repeated tokens don't stack in the bars
  arrange(desc(value)) |> 
  head(30) |> 
  ggplot(aes(x = reorder(word, value), y = value, fill = word)) + geom_col() + coord_flip() +
  labs(x = "Word",
       y = "AFINN Sentiment Score", 
       title = "Top Positive Sentiments in 'It Ends With Us'") + 
  theme_classic()

I repeated the process of finding the most common words and again visualized them in a word cloud.

it_ends_with_us |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  knitr::kable() 
Joining with `by = join_by(word)`
word n
i’m 658
ryle 445
it’s 367
he’s 341
lily 286
don’t 285
head 276
door 267
hand 252
allysa 246
atlas 246
eyes 240
time 232
feel 221
hands 202
i’ve 152
mother 137
that’s 129
bed 127
can’t 127
it_ends_with_us |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "i'm", "it's", "he's",
                      "atticus", "jem", "ryle", "lily", "allysa")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`

I followed the same steps for categorizing and visualizing words in To Kill a Mockingbird, filtering out character names and other unnecessary words and displaying the rest in a word cloud.

library(dplyr)

mockingbird <- mockingbird |> 
  unnest_tokens(word, X1) |> 
  mutate(Book = "To Kill A Mockingbird") |> 
  mutate(Period = 'Old')

data("stop_words")

mockingbird |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  knitr::kable() 
Joining with `by = join_by(word)`
word n
jem 975
atticus 823
miss 394
dill 254
time 209
looked 182
finch 160
house 160
radley 158
scout 155
maycomb 154
head 146
front 144
calpurnia 142
home 141
em 138
maudie 124
aunt 119
ewell 119
heard 118
mockingbird |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "i'm", "it's", "he's","atticus", "jem", "ryle", "lily", "calpurnia", "radley", 
"finch", "scout", "maudie", "dill", "aunt", "alexandra", "maycomb", "front", "jems", "em", "ewell", "tate", "tom")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`

I continued this process once more for The Fault in Our Stars.

the_fault_in_our_stars <- the_fault_in_our_stars |> 
  unnest_tokens(word, X1) |> 
  mutate(Book = "The Fault in Our Stars") |> 
  mutate(Period = 'New')

the_fault_in_our_stars |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  knitr::kable() 
Joining with `by = join_by(word)`
word n
augustus 29
isaac 21
cancer 18
time 18
mom 17
patrick 17
support 16
eye 12
heart 12
waters 12
green 11
hazel 11
boy 9
fault 9
john 9
lungs 9
stars 9
life 8
walked 8
blind 7
the_fault_in_our_stars |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "i'm", "it's", "he's",
"atticus", "jem", "ryle", "lily", "isaac", "patrick", "mom",
                      "john", "augustus", "lungs", "blind")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`
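
The same count–filter–cloud pipeline runs four times above; a small helper would remove the repetition (hypothetical refactor, with my own function and argument names):

# Build a word cloud of the 20 most frequent non-stop words, minus drop_words.
make_cloud <- function(tokens, drop_words) {
  tokens |> 
    anti_join(stop_words, by = "word") |> 
    count(word) |> 
    filter(!word %in% drop_words) |> 
    arrange(desc(n)) |> 
    head(20) |> 
    wordcloud2()
}
# e.g. make_cloud(mockingbird, c("jem", "atticus", "scout"))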

After that, I merged the text from all four books together using the full_join() function.

catcher_in_the_rye |> 
  full_join(it_ends_with_us) |> 
  full_join(mockingbird) |> 
  full_join(the_fault_in_our_stars) -> merged 
Joining with `by = join_by(word, Book, Period)`
Joining with `by = join_by(word, Book, Period)`
Joining with `by = join_by(word, Book, Period)`
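
Because the four data frames share the same columns and no rows match across different Book values, full_join() here simply stacks them; bind_rows() would express that intent more directly (a hypothetical equivalent):

# Equivalent stacking of the four token data frames.
merged_alt <- bind_rows(catcher_in_the_rye, it_ends_with_us,
                        mockingbird, the_fault_in_our_stars)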

Just for reference, I noted the most commonly appearing words across all four books as well, filtering out only common character names. (Because the name filter runs after head(20), the table below shows 16 rows rather than 20.)

merged |> 
  anti_join(stop_words) |> 
  count(word, sort = TRUE) |> 
  head(20) |> 
  filter(!word %in% c("jem", "atticus", "ryle", "lily")) |>
  knitr::kable() 
Joining with `by = join_by(word)`
word n
i’m 679
time 640
don’t 485
head 468
door 425
hand 421
miss 417
didn’t 411
it’s 380
feel 352
he’s 349
looked 342
told 333
eyes 323
home 307
night 285

I visualized these words in a graph showing the most common words across all four texts, first stripping punctuation from the tokens.

merged |> 
  anti_join(stop_words) |> 
  mutate(clean_text = gsub(pattern = '[[:punct:]]', replacement='', word)) -> merged
Joining with `by = join_by(word)`
merged |> 
  count(clean_text, sort = TRUE) |> 
  head(20) |> 
  ggplot(aes(reorder(clean_text, n), n)) + 
  geom_col() + 
  coord_flip() +
  labs(title = "Most Common Words Across Texts", 
       x = "Word",
       y = "Frequency of Word")

Now, I went back and computed AFINN sentiments for To Kill a Mockingbird, The Fault in Our Stars, and It Ends With Us, since I hadn’t done so for these texts yet. (AFINN assigns each word an integer score from -5, very negative, to +5, very positive.)

merged |> 
  inner_join(get_sentiments('afinn')) -> merged_sentiment
Joining with `by = join_by(word)`
merged_sentiment |> 
  filter(Book %in% "To Kill A Mockingbird") |> 
  count(word, value, sort = TRUE) 
# A tibble: 762 × 3
   word    value     n
   <chr>   <dbl> <int>
 1 miss       -2   394
 2 stopped    -1    59
 3 scared     -2    40
 4 nigger     -5    37
 5 matter      1    36
 6 stop       -1    36
 7 dead       -3    35
 8 hard       -1    32
 9 reached     1    29
10 god         1    28
# ℹ 752 more rows
merged_sentiment |> 
  filter(Book %in% "The Fault in Our Stars") |> 
  count(word, value, sort = TRUE)
# A tibble: 130 × 3
   word    value     n
   <chr>   <dbl> <int>
 1 cancer     -1    18
 2 support     2    16
 3 blind      -1     7
 4 jesus       1     7
 5 smile       2     6
 6 smiled      2     6
 7 god         1     5
 8 grace       1     4
 9 love        3     4
10 yeah        1     4
# ℹ 120 more rows
merged_sentiment |> 
  filter(Book %in% "It Ends With Us") |> 
  count(word, value, sort = TRUE) 
# A tibble: 666 × 3
   word   value     n
   <chr>  <dbl> <int>
 1 love       3   115
 2 smile      2   101
 3 leave     -1    78
 4 hard      -1    63
 5 laugh      1    63
 6 stop      -1    63
 7 top        2    61
 8 hurt      -2    59
 9 bad       -3    57
10 scared    -2    49
# ℹ 656 more rows
book_sentiment <- merged_sentiment |>
  group_by(Book) |>
  summarise(sentiment_score = sum(value)) |>
  arrange(desc(sentiment_score))
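
Since the hypothesis compares eras rather than individual books, the same summary can be grouped by Period instead (a hypothetical extension, not part of the original analysis):

# Total AFINN score per time period.
merged_sentiment |> 
  group_by(Period) |> 
  summarise(sentiment_score = sum(value)) |> 
  arrange(desc(sentiment_score))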

data("stop_words")

I then created a word cloud of the 20 most frequent words across the four texts, which included both positively and negatively charged words.

merged |>
  anti_join(stop_words) |>
  count(word) |> 
  filter(!word %in% c("didn't", "don't", "door", "it's", "he's",
                      "atticus", "jem", "ryle", "lily")) |>
  arrange(desc(n)) |>
  head(20) |> 
  wordcloud2()
Joining with `by = join_by(word)`

Finally, I completed a sentiment analysis of the four books by plotting each book’s total “sentiment score” on a graph.

ggplot(book_sentiment, aes(x = reorder(Book, sentiment_score), y = sentiment_score)) +
  geom_col(fill = "pink") +
  labs(title = "Sentiment Analysis of Books",
       x = "Book",
       y = "Sentiment Score") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As it turns out, the two “old” books had starkly negative sentiment scores. The two “new” novels, on the other hand, scored less negatively.

The Fault in Our Stars had a sentiment score so close to 0 that it was not visible on the graph.
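
Raw sums also scale with book length, and The Fault in Our Stars contributed far fewer tokens than the other books, which partly explains its near-zero bar. A length-adjusted comparison would average the AFINN value per sentiment-bearing word instead; a minimal sketch of that hypothetical refinement:

# Mean AFINN score per sentiment-bearing word, adjusted for book length.
merged_sentiment |> 
  group_by(Book, Period) |> 
  summarise(mean_score = mean(value), n_words = n(), .groups = "drop") |> 
  arrange(desc(mean_score))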

Based on this, it is clear that all four novels leaned strongly toward negative sentiment over positive sentiment. Still, my hypothesis was partly correct: the contemporary novels were overall more positive than the classic novels.