Fairytales

Author

Katelyn Litvan

Fairy Tale Insights: A Look Into the Patterns Found in the Words of the Grimm Brothers

For my project, I chose to use a dataset that contained 62 classic fairytales by the Grimm Brothers. I loved reading fairytales growing up, and it is no secret that some fairytales have darker undertones than others. For this reason, I think that running a sentiment analysis and analyzing the words in these fairytales will help us appreciate and understand the stories on a whole new level! I also believe that my findings from this project will help me identify common themes in fairytales.

Loading and Cleaning Data

To begin, I loaded the necessary libraries and imported the dataset from Kaggle, which was saved as a CSV, into a dataframe called “grimms.”

Code
library(tidyverse)
library(tidytext)
library(textdata)
library(ggplot2) 
library(dplyr)
library(readr)
grimms <- read_csv("~/Desktop/Applied Media Analytics/grimms_fairytales.csv")
View(grimms)

I then began to clean my data. I noticed that the Red Riding Hood story had gotten cut off somehow and continued mid-story into a new row. After a lot of research, I found that the paste function would help me grab the story bits that had gotten placed in the Title and Text columns of row 23 and place them where they belonged, in the Text column of row 22, which was where the story was stored. I also noticed that the text in the stories sometimes had a \n (a newline marker) between words. To get rid of that, I used the gsub function, which I found is used for matching and replacing in character strings. I replaced each \n with a single space, “ ”.

Code
grimms$Text[22] <- paste(grimms$Text[22], grimms$Title[23], grimms$Text[23])

grimms$Title[23] <- ""
grimms$Text[23] <- ""

grimms <- grimms[-23, ]

grimms$Text <- gsub("\\n", " ", grimms$Text)
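To see what paste and gsub are doing in the cleaning step above, here is a minimal sketch on a tiny made-up dataframe. The name toy and every title and sentence in it are invented for illustration; only the function calls mirror the cleaning code.

```r
# Toy dataframe standing in for grimms (titles and text are invented)
toy <- data.frame(
  Title = c("Story A", "spilled title text", "Story B"),
  Text  = c("Once upon\na time", "spilled story text", "The end"),
  stringsAsFactors = FALSE
)

# paste() joins its arguments with a single space by default,
# so the spilled pieces of row 2 are appended to the Text of row 1
toy$Text[1] <- paste(toy$Text[1], toy$Title[2], toy$Text[2])

# Drop the leftover row, then swap every newline for a space
toy <- toy[-2, ]
toy$Text <- gsub("\\n", " ", toy$Text)

print(toy$Text)
```

Because the leftover row is removed entirely, blanking its Title and Text first is optional; the sketch skips that step.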

Unnesting the Words and Calculating Sentiments

Finally, I was ready to start analyzing the data. To begin, I unnested each word from the Text column, then eliminated stop words and attached the sentiment value of each remaining word. I know that there is code to remove the apostrophes from the words, but I chose to leave them in, as the only place I noticed them was in possessive nouns, where they distinguish the possessive from the plural (ex: kings vs king’s), and I think that those two words have different meanings.

Code
grimms|> 
  unnest_tokens(word, Text)-> grimms2

grimms2 |> 
  anti_join(stop_words) |> 
  inner_join(get_sentiments('afinn'))-> grimms3
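The three steps above (unnest, drop stop words, join sentiment scores) can be sketched on a single made-up sentence. This is only an illustration: toy, toy_lexicon, and the scores in it are hypothetical stand-ins so the example runs without downloading AFINN. The stop_words table is the one that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# One invented "story" standing in for a row of grimms
toy <- tibble(
  Title = "Story A",
  Text  = "The poor daughter cried, and the kings were dead."
)

# Hypothetical mini-lexicon with the same word/value columns as AFINN
# (the scores here are made up for illustration)
toy_lexicon <- tibble(
  word  = c("poor", "cried", "dead"),
  value = c(-2, -2, -3)
)

toy_words <- toy |>
  unnest_tokens(word, Text) |>           # one lowercased word per row
  anti_join(stop_words, by = "word") |>  # drop common words like "the"
  inner_join(toy_lexicon, by = "word")   # keep only words that have a score

print(toy_words)
```

Note that the inner join keeps only words found in the lexicon, so any word without a score (such as "daughter" here) drops out of the result.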

Then, I wanted to start analyzing the sentiment by story. I calculated the average sentiment value for each story.

Code
grimms3 |> 
  group_by(Title) |> 
  summarize(avg_sentiment=mean(value)) ->grimms_story

Visualization #1- Top Ten Positive and Negative Stories

It is time to create my first visualization. I wanted to display the top 10 stories with the highest average sentiment value by word (or, the happiest stories), as well as the top 10 stories with the lowest average sentiment value by word (or, the saddest/scariest stories). I did this by using the order function to sort the stories by average sentiment and the head function to pull the first 10 rows of each sorted list. Then, I used the rbind function, which combines by rows (or stacks vertically), to create one big dataset so that it could all be one visualization.

Code
top_10_high <- head(grimms_story[order(grimms_story$avg_sentiment, decreasing = TRUE), ], 10)

top_10_low <- head(grimms_story[order(grimms_story$avg_sentiment), ], 10)

top_10 <- rbind(top_10_high, top_10_low)
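For comparison, the same sort-and-pull idea can be written with dplyr's slice_max and slice_min. This is a sketch of an alternative, not the code used for the plot, and toy_story with its titles and averages is invented.

```r
library(dplyr)

# Invented titles and averages standing in for grimms_story
toy_story <- tibble(
  Title = paste("Story", LETTERS[1:6]),
  avg_sentiment = c(0.5, -1.2, 0.1, -0.3, 0.4, -0.8)
)

# Two highest and two lowest averages (the real analysis takes 10 of each)
toy_high <- slice_max(toy_story, avg_sentiment, n = 2)
toy_low  <- slice_min(toy_story, avg_sentiment, n = 2)

# bind_rows() stacks the two pieces vertically, like rbind()
toy_top <- bind_rows(toy_high, toy_low)

print(toy_top)
```

One difference to keep in mind: by default slice_max and slice_min keep tied rows, so they can return more than n rows, while head always returns exactly 10.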

As for building the plot, I knew I wanted to create a bar graph because, for me, bars are easy to read and to compare directly. This was a hard plot to make, so I want to point out some of the features I added to make it as easy to understand as possible. I used fill= to color the bars based on whether their average sentiment was over 0 (positive) or not (negative). The stat = “identity” argument in my geom_bar line tells ggplot to use the avg_sentiment values I mapped to y as the bar heights, instead of counting rows, which is what geom_bar does by default. I chose red to fill the negative sentiment bars and green to fill the positive ones, as that made sense in my head. I flipped the graph using coord_flip so the story titles are listed along the vertical axis. Finally, guide = “none” removes the legend, which I didn’t think was necessary.

Code
ggplot(top_10, aes(x = reorder(Title, avg_sentiment), y = avg_sentiment, fill = avg_sentiment > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "green"), guide= "none") +
  coord_flip() +
  labs(x = "Story Title", y = "Average Sentiment", title = "Top 10 Stories by Average Sentiment")
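The plot above relies on scale_fill_manual matching colors to FALSE and TRUE by position. A slightly more explicit variant, sketched below on invented data, names each color and uses geom_col, which is ggplot2's shorthand for geom_bar(stat = "identity"). The toy_top dataframe here is hypothetical.

```r
library(ggplot2)

# Invented titles and averages standing in for top_10
toy_top <- data.frame(
  Title = c("Story A", "Story B", "Story C", "Story D"),
  avg_sentiment = c(0.5, -1.2, 0.4, -0.8)
)

p <- ggplot(toy_top, aes(x = reorder(Title, avg_sentiment),
                         y = avg_sentiment,
                         fill = avg_sentiment > 0)) +
  geom_col() +  # bar height comes straight from avg_sentiment
  # Naming the values ties each color to TRUE or FALSE explicitly
  scale_fill_manual(values = c("TRUE" = "green", "FALSE" = "red"),
                    guide = "none") +
  coord_flip() +
  labs(x = "Story Title", y = "Average Sentiment")

print(p)
```

Naming the colors means the plot stays correct even if every bar happens to fall on one side of zero, where a positional match would assign the first color regardless of sign.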

Looking at this graph, the first thing I noticed is the range of average sentiment values in the two categories. If you look at the higher sentiments, or the “Happy” category, the range between the top spot and the bottom spot is not that large. If you direct your attention to the bottom category, or the “Sad” section, the range is much more extreme. “The Old Man and His Grandson” has double the negative average sentiment compared to “Fundevogel.” It is also important to notice that the happiest story, “Hans in Luck,” has an average sentiment value of +0.59, while “The Old Man and His Grandson” has an average sentiment value of -1.83. This suggests that there are more extreme emotions on the negative side compared to the positive side of the sentiment spectrum among these stories. Considering the Grimm Brothers are known for writing darker stories, this is not surprising.

Visualization #2- Word Frequency and Word Cloud

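Now for my second plot, I wanted to create a word cloud looking at the most common words in these fairytales. Because I built it from the sentiment-scored data, it only counts words that appear in the AFINN lexicon, after stop words were removed. I also included a table of word frequency to look closer at the numbers.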

Code
grimms3 |> 
  count(word, sort = TRUE)->word_frequency 
print(word_frequency)
# A tibble: 540 × 2
   word          n
   <chr>     <int>
 1 cried       153
 2 beautiful   123
 3 poor         91
 4 cut          78
 5 fire         70
 6 dead         58
 7 dear         57
 8 care         56
 9 fine         54
10 leave        49
# ℹ 530 more rows

Code
library(wordcloud2)
grimms3 |> 
  count(word, sort=TRUE) |> 
  wordcloud2(size=0.75)

Wow! It is incredibly interesting (and sad?) that “cried” is the most used sentiment-scored word, with 153 uses, among these 62 stories! I found that a lot of the usage of this word was in the context of speaking, and not actually shedding tears. Here is an example of that from “Little Red Cap” (better known as “Little Red Riding Hood”): “Soon afterwards the wolf knocked, and cried: ‘Open the door, grandmother, I am Little Red-Cap, and am bringing you some cakes.’”
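One way to check how a word like “cried” is being used is to pull out the sentences that contain it and read them. Here is a minimal sketch of that idea. The sentences in toy_text are invented, and this is not necessarily how the examples in this post were found.

```r
# Invented sentences standing in for lines of story text
toy_text <- c(
  "The fox came to the gate and cried: 'Let me in!'",
  "She sat by the well and cried all night.",
  "The king rode home."
)

# Keep only the sentences containing the whole word "cried"
# (\\b marks a word boundary, so "decried" would not match)
hits <- grep("\\bcried\\b", toy_text, value = TRUE, perl = TRUE)

print(hits)
```

Reading the matches side by side makes it easier to tell the "called out" sense apart from the "wept" sense, which a sentiment lexicon scores the same way.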

Luckily, there is a bit of redemption: “beautiful” is the second most used word, with 123 uses. Here is an example of the usage of the word from one of my all-time favorite stories, “The Twelve Dancing Princesses”: “There was a king who had twelve beautiful daughters.”