For this text project, I am analyzing young adult novels from the 1950s and the 2010s. I chose To Kill a Mockingbird by Harper Lee and The Catcher in the Rye by J.D. Salinger as the classic novels, and The Fault in Our Stars by John Green and It Ends With Us by Colleen Hoover as the contemporary novels.
I will investigate language and word choice in the novels, and display and compare positive and negative sentiments between the classic and contemporary book pairs.
My hypothesis is that the overlap in word choice and usage between the two time periods will be very slim or non-existent. I also expect the contemporary novels to have more positive sentiment overall than the classic novels.
First, I loaded all necessary packages into the R file.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Next, I unnested the tokens for Catcher in the Rye, splitting the text into one word per row, and labeled the book as belonging to the “Old” time period.
catcher_in_the_rye <- catcher_in_the_rye |>
  unnest_tokens(word, X1) |>
  mutate(Book = "Catcher in the Rye", Period = "Old")
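To illustrate what this step produces, here is a minimal sketch on two made-up lines of text. The `raw_lines` tibble is a hypothetical stand-in for the loaded book, and I am assuming, as in the code above, that the text sits in a column named `X1`.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the raw book text: one line of text per row in X1
raw_lines <- tibble(X1 = c("The quick brown fox.",
                           "It jumps over the lazy dog!"))

tokens <- raw_lines |>
  unnest_tokens(word, X1) |>   # lowercases, strips punctuation, one word per row
  mutate(Book = "Example Book", Period = "Old")

tokens
```

The result has one row per word, with the original `X1` column replaced by `word`, so every later count or join works at the word level.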
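Then, I found the words with the highest positive sentiment values in Catcher in the Rye, and plotted them on a graph.
# The join keeps one row per word occurrence, so repeated words stack into one bar
catcher_in_the_rye |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  arrange(desc(value)) |>   # highest AFINN values first
  head(12) |>               # keep the 12 highest-scoring occurrences
  ggplot(aes(x = reorder(word, value), y = value, fill = word)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "AFINN value (summed over plotted rows)", title = "Top Positive Sentiments in 'Catcher in the Rye'") +
  theme_classic()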
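Because the join keeps one row per word occurrence, an alternative is to count how often each positive word appears and plot that count instead. The sketch below is only an illustration under assumptions: `tokens` and `toy_lexicon` are made-up stand-ins for the book's tokens and the AFINN lexicon, not the data used in this project.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tokens and a toy lexicon with AFINN-style integer values
tokens <- tibble(word = c("funny", "funny", "nice", "terrible",
                          "funny", "nice", "the"))
toy_lexicon <- tibble(word = c("funny", "nice", "terrible"),
                      value = c(4, 3, -3))

top_positive <- tokens |>
  inner_join(toy_lexicon, by = "word") |>  # keep only words found in the lexicon
  filter(value > 0) |>                     # keep positive words only
  count(word, value, sort = TRUE)          # n = number of occurrences per word

# Plot the frequency of each positive word (y = n, not the sentiment value)
p <- ggplot(top_positive, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency")
```

With this version, the axis labeled "Frequency" shows how many times each word occurs, which makes it directly comparable to an overall word count.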
Then, I found the 20 most frequently appearing words in Catcher in the Rye and visualized them in a word cloud.
This makes for a useful comparison with the positive sentiment words, since none of the three appears among the top 20 words. The positive sentiment words also reached a maximum of only 24 in that plot, whereas even the 20th most frequent word appeared 89 times.
For the word cloud, I filtered out character names and other common words.
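The counting and filtering described here could look roughly like the sketch below. It assumes the common words are removed with the `stop_words` table that ships with tidytext, and that character names come from a hand-written list; `tokens` and `custom_words` are hypothetical and only show the shape of the data.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens: two character names, two common words, two content words
tokens <- tibble(word = c("holden", "phoebe", "the", "and", "carousel",
                          "carousel", "museum", "carousel", "museum", "holden"))

# Hypothetical hand-written list of character names to exclude
custom_words <- tibble(word = c("holden", "phoebe"))

top_words <- tokens |>
  anti_join(stop_words, by = "word") |>    # drop common English words
  anti_join(custom_words, by = "word") |>  # drop character names
  count(word, sort = TRUE) |>              # frequency of each remaining word
  head(20)                                 # keep the 20 most frequent

top_words
```

A table like `top_words` can then be handed to a word cloud function, for example `wordcloud::wordcloud(words = top_words$word, freq = top_words$n)` if the wordcloud package is the one in use.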
I followed the same steps for categorizing and visualizing words in To Kill a Mockingbird, filtering out character names and unnecessary words and displaying the result in a word cloud.
# Count each cleaned word across all texts and plot the 20 most common
merged |>
  count(clean_text, sort = TRUE) |>
  head(20) |>
  ggplot(aes(x = reorder(clean_text, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most Common Words Across Texts", x = "Word", y = "Frequency of Word")
Next, I went back and got sentiments for To Kill a Mockingbird, The Fault in Our Stars, and It Ends With Us, since I hadn’t done so for these texts yet.
Finally, I completed a sentiment analysis of the four books by plotting each book's “sentiment score” on a graph.
ggplot(book_sentiment, aes(x = reorder(Book, sentiment_score), y = sentiment_score)) +
  geom_bar(stat = "identity", fill = "pink") +
  labs(title = "Sentiment Analysis of Books", x = "Book", y = "Sentiment Score") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
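The construction of `book_sentiment` is not shown above, so the following is only a sketch of one way such a table could be built. It assumes the score is the sum of lexicon values per book; `all_tokens` and `toy_lexicon` are made-up stand-ins for the combined book tokens and the AFINN lexicon.

```r
library(dplyr)

# Hypothetical combined tokens for two books
all_tokens <- tibble(
  Book = c("Book A", "Book A", "Book A", "Book B", "Book B"),
  word = c("sad", "happy", "awful", "happy", "sad")
)

# Toy lexicon with AFINN-style integer values
toy_lexicon <- tibble(word = c("sad", "happy", "awful"),
                      value = c(-2, 3, -3))

book_sentiment <- all_tokens |>
  inner_join(toy_lexicon, by = "word") |>   # attach a value to each scored word
  group_by(Book) |>
  summarise(sentiment_score = sum(value), .groups = "drop")  # one score per book

book_sentiment
```

A summed score grows with the length of the book, so replacing `sum(value)` with `mean(value)` would be one option if the books differ a lot in length.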
As it turns out, the two “old” books had starkly negative sentiment scores. The two “new” novels, on the other hand, had less negative sentiment scores.
The Fault in Our Stars had a sentiment score so close to 0 that it was not visible on the graph.
Based on this, all four novels leaned toward negative rather than positive sentiment, most strongly the two classics. Still, my hypothesis was partly correct: the contemporary novels were more positive overall than the classic novels.