Fairytales

Author

Katelyn Litvan

Fairy Tale Insights: A Look Into the Patterns Found in the Words of the Grimm Brothers

For my project, I chose a dataset containing 62 classic fairytales by the Grimm Brothers. I loved reading fairytales growing up, and it is no secret that some fairytales have darker undertones than others. For this reason, I think that running a sentiment analysis and analyzing the words in these fairytales will help us appreciate and understand the stories on a whole new level! I also believe that my findings from this project will help me identify common themes in fairytales.

Loading and Cleaning Data

To begin, I loaded the libraries I needed and imported the dataset, which I had downloaded from Kaggle as a CSV file, into a dataframe called “grimms.”

Code

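# tidyverse already attaches ggplot2, dplyr, and readr; they are loaded explicitly here for clarity
library(tidyverse)
library(tidytext)
library(textdata)
library(ggplot2)
library(dplyr)
library(readr)

# Read the Kaggle CSV into a dataframe
grimms <- read_csv("~/Desktop/Applied Media Analytics/grimms_fairytales.csv")
View(grimms)  # opens the data viewer in an interactive session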

I then began to clean my data. I noticed that the Red Riding Hood story had somehow gotten cut off and spilled mid-story into the next row. After a lot of research, I found that the paste function would help me grab the story bits that had landed in the Title and Text columns of row 23 and place them where they belonged, at the end of the Text column of row 22, where the story was stored. I also noticed that the text in the stories sometimes had a \n (a line break) between words. To get rid of that, I used the gsub function, which matches and replaces patterns in character strings. I replaced each \n with a single space, “ ”.

Code

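# Append the spilled pieces from row 23 (Title, then Text) to the story in row 22
grimms$Text[22] <- paste(grimms$Text[22], grimms$Title[23], grimms$Text[23])

# Blank out row 23, then drop it entirely
grimms$Title[23] <- ""
grimms$Text[23] <- ""

grimms <- grimms[-23, ]

# Replace every \n with a single space
grimms$Text <- gsub("\\n", " ", grimms$Text)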
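To make the two cleaning steps easier to follow, here is a minimal sketch on a made-up two-row data frame. The name `toy` and its contents are hypothetical and are not part of the Kaggle file; only `paste` and `gsub` are used the same way as above.

```r
# Toy data frame (made up for illustration; not the real dataset)
toy <- data.frame(
  Title = c("Story A", "spilled middle,"),
  Text  = c("Once upon a time\nthere was", "and the end."),
  stringsAsFactors = FALSE
)

# Stitch the spilled pieces in row 2 back onto the text in row 1.
# paste() joins its arguments with a single space by default.
toy$Text[1] <- paste(toy$Text[1], toy$Title[2], toy$Text[2])

# Drop the now-redundant row
toy <- toy[-2, ]

# Replace each line break with a single space
toy$Text <- gsub("\\n", " ", toy$Text)

print(toy$Text)
```

In this sketch the pattern `"\\n"` matches a real line break. If a file instead contained a literal backslash followed by the letter n, the pattern would need to be `"\\\\n"`.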

Unnesting the Words and Calculating Sentiments

Finally, I was ready to start analyzing the data. To begin, I unnested each word from the Text column, then eliminated stop words and looked up the sentiment value of each remaining word in the AFINN lexicon. Words without an AFINN score are dropped at this step. I know that there is code to remove the apostrophes from the words, but I chose to leave them in, as the only place I noticed them was in possessive nouns (ex: kings vs. king’s), and I think the plural and the possessive have different meanings.

Code

grimms |>
  unnest_tokens(word, Text) -> grimms2

grimms2 |>
  anti_join(stop_words) |>
  inner_join(get_sentiments('afinn')) -> grimms3
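The pipeline is easier to see on a single sentence. The sketch below is only an illustration: `toy_story` and `toy_lexicon` are hypothetical, and the scores in `toy_lexicon` are hand-typed stand-ins rather than values pulled from the real AFINN lexicon (which `get_sentiments('afinn')` downloads through textdata).

```r
library(dplyr)
library(tidytext)

# Toy story and a tiny hand-made lexicon (hypothetical values, not real AFINN data)
toy_story <- tibble(
  Title = "Toy Tale",
  Text  = "The poor king cried, but the beautiful bird sang."
)
toy_lexicon <- tibble(
  word  = c("poor", "cried", "beautiful"),
  value = c(-2, -2, 3)
)

toy_scored <- toy_story |>
  unnest_tokens(word, Text) |>           # one lowercase word per row, punctuation dropped
  anti_join(stop_words, by = "word") |>  # remove common words such as "the" and "but"
  inner_join(toy_lexicon, by = "word")   # keep only words that have a score

print(toy_scored)
```

Only the three scored words remain at the end; "king," "bird," and "sang" pass the stop-word filter but are dropped by the inner join because the toy lexicon has no entry for them.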

Then, I wanted to start analyzing the sentiment by story. I calculated the average sentiment value for each story.

Code

grimms3 |>
  group_by(Title) |>
  summarize(avg_sentiment = mean(value)) -> grimms_story
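As a quick worked example of what this average means, here is a sketch with made-up scores for two hypothetical stories. The extra `n_scored_words` column is an addition for illustration and is not in the code above.

```r
library(dplyr)

# Hypothetical scored words for two made-up stories
toy_scores <- tibble(
  Title = c("Tale A", "Tale A", "Tale A", "Tale B", "Tale B"),
  value = c(3, -2, 2, -3, -1)
)

toy_avg <- toy_scores |>
  group_by(Title) |>
  summarize(
    avg_sentiment  = mean(value),  # Tale A: (3 - 2 + 2) / 3; Tale B: (-3 - 1) / 2
    n_scored_words = n()           # how many scored words the average is based on
  )

print(toy_avg)
```

Keeping the word count next to the average can be useful, since an average built from only a handful of scored words can swing to an extreme more easily than one built from hundreds.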

Visualization #1- Top Ten Positive and Negative Stories

It is time to create my first visualization. I wanted to display the 10 stories with the highest average sentiment value per word (the happiest stories), as well as the 10 stories with the lowest average sentiment value per word (the saddest/scariest stories). I did this by using the order function to sort the stories by average sentiment and the head function to pull the first 10 rows from each end. Then, I used the rbind function, which combines data by rows (stacking vertically), to create one dataset so that everything could go into one visualization.

Code

top_10_high <- head(grimms_story[order(grimms_story$avg_sentiment, decreasing = TRUE), ], 10)

top_10_low <- head(grimms_story[order(grimms_story$avg_sentiment), ], 10)

top_10 <- rbind(top_10_high, top_10_low)
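The same sort, slice, and stack pattern can be tried on a tiny table. This is a minimal sketch with six made-up stories (the titles and averages are hypothetical), pulling the top 2 from each end instead of the top 10.

```r
# Made-up averages for six hypothetical stories
toy_story <- data.frame(
  Title = c("A", "B", "C", "D", "E", "F"),
  avg_sentiment = c(0.4, -1.2, 0.1, -0.3, 0.6, -0.8),
  stringsAsFactors = FALSE
)

# order() returns the row positions sorted by the column; head() keeps the first n rows
top_2_high <- head(toy_story[order(toy_story$avg_sentiment, decreasing = TRUE), ], 2)
top_2_low  <- head(toy_story[order(toy_story$avg_sentiment), ], 2)

# rbind() stacks the two pieces into one data frame
top_2 <- rbind(top_2_high, top_2_low)

print(top_2)
```

The two highest rows (E and A) come first, followed by the two lowest (B and F).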

As for building the plot, I knew I wanted to create a bar graph, since for me that is the easiest format to read and to compare values with directly. This was a hard plot to make, so I want to point out some of the features I added to make it as easy to understand as possible. I used fill = to color the bars based on whether their average sentiment was over 0 (positive) or under 0 (negative). The stat = “identity” argument in geom_bar tells ggplot to use the avg_sentiment values I mapped to y as the bar heights, rather than counting rows. I chose red to fill the negative sentiment bars and green to fill the positive ones, as that made sense in my head. I flipped the graph using coord_flip so the story titles are easier to read. Finally, guide = “none” hides the legend, which I didn’t think was necessary.

Code

ggplot(top_10, aes(x = reorder(Title, avg_sentiment), y = avg_sentiment, fill = avg_sentiment > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "green"), guide= "none") +
  coord_flip() +
  labs(x = "Story Title", y = "Average Sentiment", title = "Top 10 Stories by Average Sentiment")
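One optional variation, shown here as a sketch on made-up data rather than as a change to the plot above: naming the colors in `scale_fill_manual` ties red and green to `FALSE` and `TRUE` explicitly instead of relying on their order, and `geom_col()` is ggplot2's shorthand for `geom_bar(stat = "identity")`. The data frame `toy_plot_data` is hypothetical.

```r
library(ggplot2)

# Made-up averages for three hypothetical stories
toy_plot_data <- data.frame(
  Title = c("Tale A", "Tale B", "Tale C"),
  avg_sentiment = c(0.5, -1.0, -0.2)
)

p <- ggplot(toy_plot_data,
            aes(x = reorder(Title, avg_sentiment),
                y = avg_sentiment,
                fill = avg_sentiment > 0)) +
  geom_col() +  # same as geom_bar(stat = "identity")
  scale_fill_manual(values = c("FALSE" = "red", "TRUE" = "green"), guide = "none") +
  coord_flip() +
  labs(x = "Story Title", y = "Average Sentiment")

print(p)
```

With named values, the colors stay correct even if every bar happens to fall on the same side of zero.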

Looking at this graph, the first thing I noticed is the range of average sentiment values in the two categories. If you look at the higher sentiments, or the “Happy” category, the range between the top spot and the bottom spot is not that large. If you direct your attention to the bottom category, or the “Sad” section, the range is much more extreme. “The Old Man and His Grandson” has double the negative average sentiment of “Fundevogel.” It is also important to notice that the happiest story, “Hans in Luck,” has an average sentiment value of +0.59, while “The Old Man and His Grandson” has an average sentiment value of -1.83. This suggests that there are more extreme emotions on the negative side of the sentiment spectrum than on the positive side among these stories. Considering the Grimm Brothers are known for writing darker stories, this is not surprising.

Visualization #2- Word Frequency and Word Cloud

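Now for my second plot, I wanted to create a word cloud of the most common words in these fairytales. Because I built it from grimms3, it only includes words that survived the stop-word filter and have a sentiment score. I also included a table of word frequency to look closer at the numbers.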

Code

grimms3 |> 
  count(word, sort = TRUE)->word_frequency 
print(word_frequency)

# A tibble: 540 × 2
   word          n
   <chr>     <int>
 1 cried       153
 2 beautiful   123
 3 poor         91
 4 cut          78
 5 fire         70
 6 dead         58
 7 dear         57
 8 care         56
 9 fine         54
10 leave        49
# ℹ 530 more rows

Code

library(wordcloud2)
grimms3 |> 
  count(word, sort=TRUE) |> 
  wordcloud2(size=0.75)

Wow! It is incredibly interesting (and sad?) that “cried” is the most used sentiment-scored word, with 153 uses, among these 62 stories! I found that a lot of the usage of this word was in the context of speaking, and not actually shedding tears. Here is an example of that from “Little Red Cap” (better known as “Little Red Riding Hood”): “Soon afterwards the wolf knocked, and cried: ‘Open the door, grandmother, I am Little Red-Cap, and am bringing you some cakes.’”

Luckily, there is a bit of redemption: “beautiful” is the second most used word, with 123 uses. Here is an example of the usage of the word from one of my all-time favorite stories, “The Twelve Dancing Princesses”: “There was a king who had twelve beautiful daughters.”

Visualization #3- Bigrams and Most Popular Preceding and Following Words

This led me to want to explore more of the context in which these words are used, so I used bigrams to look into this. I went back to my original grimms dataframe, but this time I unnested two words at a time to create bigrams. I then split each bigram into word1 and word2 and filtered out the stop words from both, so I could look more in depth at the pairings without being flooded with a bunch of “a” and “the”. I saved the separated, filtered, and counted bigrams as grimms_bigrams2.

Code

grimms |> 
  unnest_tokens(bigram, Text, token="ngrams", n=2) -> grimms_bigrams

grimms_bigrams |> 
  separate(bigram, c('word1', 'word2'), sep =" ") |> 
  filter(!word1 %in% stop_words$word) |> 
  filter(!word2 %in% stop_words$word) |> 
  count(word1, word2, sort = TRUE) -> grimms_bigrams2

print(grimms_bigrams2)

# A tibble: 4,277 × 3
   word1   word2        n
   <chr>   <chr>    <int>
 1 king’s  daughter    28
 2 king’s  son         28
 3 red     cap         27
 4 snow    white       21
 5 fell    asleep      20
 6 rose    red         17
 7 juniper tree        16
 8 fast    asleep      15
 9 cat     skin        14
10 clever  elsie       14
# ℹ 4,267 more rows

The most common bigram, “king’s daughter,” can be found in many stories, but I particularly enjoyed this usage of the phrase in “Briar Rose” (better known as “Sleeping Beauty”): “So she cried out, ‘The king’s daughter shall, in her fifteenth year, be wounded by a spindle, and fall down dead.’”
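For anyone who wants to see the bigram steps on something small, here is a minimal sketch that runs the same sequence on one made-up sentence. The data frame `toy` and its text are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# One made-up sentence standing in for a story
toy <- tibble(
  Title = "Toy Tale",
  Text  = "The king's daughter saw the beautiful bird and the beautiful bird sang"
)

toy_bigrams <- toy |>
  unnest_tokens(bigram, Text, token = "ngrams", n = 2) |>  # overlapping two-word windows
  separate(bigram, c("word1", "word2"), sep = " ") |>     # split each pair into two columns
  filter(!word1 %in% stop_words$word,                      # drop pairs where either word
         !word2 %in% stop_words$word) |>                   # is a stop word
  count(word1, word2, sort = TRUE)                         # tally each remaining pair

print(toy_bigrams)
```

Pairs such as “the king's” and “bird and” are filtered out because one of the two words is a stop word, so “beautiful bird” ends up at the top with two occurrences.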

I wanted to explore a little further with the bigrams and find out which first word appeared in the most bigram pairings.

Code

grimms_bigrams2 |> 
  count(word1, sort = TRUE)->most_common_word1

The most common word1 was “beautiful,” meaning it begins more distinct pairings in grimms_bigrams2 than any other word. I then wanted to see what the most common word following “beautiful” was:

Code

grimms_bigrams2 |> 
  filter(word1 == "beautiful") -> beautiful_bigrams 
print(beautiful_bigrams)

# A tibble: 54 × 3
   word1     word2        n
   <chr>     <chr>    <int>
 1 beautiful bird        11
 2 beautiful princess     6
 3 beautiful daughter     3
 4 beautiful flower       3
 5 beautiful maiden       3
 6 beautiful white        3
 7 beautiful body         2
 8 beautiful castle       2
 9 beautiful child        2
10 beautiful clothes      2
# ℹ 44 more rows

It was in fact “bird,” with 11 uses. This was a little surprising to me, so I looked into the stories to see when “beautiful bird” was occurring. It turns out in “The Juniper Tree,” a character sings a song many times with the lyric “Kywitt, Kywitt, what a beautiful bird am I!”

I then did the same thing for the second word in the bigram pairing:

Code

grimms_bigrams2 |> 
  count(word2, sort = TRUE)->most_common_word2

The most common word2 was “till,” meaning it ends more distinct pairings than any other word. Again, I wanted to see what the most common word preceding “till” was:

Code

till_bigrams <- grimms_bigrams2 |> 
  filter(word2 == "till")
print(till_bigrams)

# A tibble: 62 × 3
   word1   word2     n
   <chr>   <chr> <int>
 1 waited  till      6
 2 night   till      3
 3 stone   till      3
 4 wait    till      3
 5 whirl’d till      3
 6 peace   till      2
 7 time    till      2
 8 abide   till      1
 9 awake   till      1
10 bellow  till      1
# ℹ 52 more rows

It turned out to be “waited,” with 6 uses. This made sense, as a lot of characters in fairy tales wait until a certain time or instant to go through with their plan or idea. Here’s an example of that from “The Water of Life”: “Then they waited till he was fast asleep, and poured the Water of Life out of the cup, and took it for themselves, giving him bitter sea-water instead.”
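One detail about the word1 and word2 tallies above is worth spelling out. Since grimms_bigrams2 already holds one row per distinct word pair, calling count on it counts how many different partners a word has, not how many times the word appears. The sketch below uses a made-up table in the same shape (the words, the numbers, and the column names `pairs` and `uses` are all hypothetical) to show how that differs from a count weighted by n.

```r
library(dplyr)

# Hypothetical bigram counts in the same shape as grimms_bigrams2
toy_counts <- tibble(
  word1 = c("golden", "golden", "golden", "old", "old"),
  word2 = c("apple", "bird", "ring", "woman", "king"),
  n     = c(1, 1, 1, 5, 4)
)

# Number of distinct partners per first word (each row counts once)
distinct_pairs <- toy_counts |>
  count(word1, name = "pairs", sort = TRUE)

# Total occurrences per first word (each row counts n times)
total_uses <- toy_counts |>
  count(word1, wt = n, name = "uses", sort = TRUE)

print(distinct_pairs)
print(total_uses)
```

In this toy table “golden” wins by distinct partners (3 versus 2), while “old” wins by total uses (9 versus 3). Whether the weighted version would change the top words in the real data is not something this sketch can say; it only shows that the two questions can have different answers.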

In conclusion, I think this project offered a fresh perspective on these age-old fairy tales, and helped me to better understand some of the common themes as well as the writing styles behind them. I hope that this project inspires you to revisit some of your favorite fairy tales, and see if you can pick up on some of the patterns that were discovered in this project!