For my project, I chose a dataset containing 62 classic fairytales by the Brothers Grimm. I loved reading fairytales growing up, and it is no secret that some fairytales have darker undertones than others. For this reason, I think that running a sentiment analysis and looking closely at the words in these fairytales will help us appreciate and understand the stories on a whole new level! I also expect my findings to help me identify common themes in fairytales.
To begin, I loaded the libraries I needed and imported the dataset, which I had downloaded from Kaggle as a CSV file, into a data frame called “grimms.”
library(tidyverse)  # also attaches ggplot2, dplyr, and readr
library(tidytext)
library(textdata)
library(ggplot2)
library(dplyr)
library(readr)
grimms <- read_csv("~/Desktop/Applied Media Analytics/grimms_fairytales.csv")
View(grimms)
I then began to clean my data. I noticed that the Red Riding Hood story had somehow been cut off in row 22, with the rest of the story spilling into the Title and Text columns of row 23. After a lot of research, I found that the paste function would let me grab the story pieces that had landed in row 23 and append them where they belonged, in the Text column of row 22. I also noticed that the text in the stories sometimes had a newline character (\n) between words. To get rid of those, I used the gsub function, which matches and replaces patterns in character strings. I replaced each \n with a single space, " ".
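Before touching the real data, the repair described above can be tried on a small stand-in. This is only a sketch: the `toy` data frame and its contents are made up for illustration, and the row numbers differ from the real dataset.

```r
# Toy stand-in for the real data: the story in row 2 spills into row 3,
# with part of the text stranded in the Title column.
toy <- data.frame(
  Title = c("Story A", "Story B", "and cried out"),
  Text  = c("Once upon\na time.", "A wolf knocked,", "to the\ngrandmother."),
  stringsAsFactors = FALSE
)

# Glue the stranded pieces back onto the row where the story belongs.
# paste() separates its arguments with a single space by default.
toy$Text[2] <- paste(toy$Text[2], toy$Title[3], toy$Text[3])

# Drop the row that is no longer needed.
toy <- toy[-3, ]

# Replace every newline character with a space.
toy$Text <- gsub("\\n", " ", toy$Text)

print(toy$Text)
```

Checking the toy result by eye first makes it easier to trust the same three steps on the full dataset.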
# Append the pieces stranded in row 23 to the story text in row 22
grimms$Text[22] <- paste(grimms$Text[22], grimms$Title[23], grimms$Text[23])
grimms$Title[23] <- ""
grimms$Text[23] <- ""
# Drop the now-empty row 23
grimms <- grimms[-23, ]
# Replace every newline with a space
grimms$Text <- gsub("\\n", " ", grimms$Text)
Finally, I was ready to start analyzing the data. To begin, I unnested each word from the Text column, eliminated stop words, and then looked up the sentiment value of each remaining word in the AFINN lexicon. Words that do not appear in AFINN are dropped by the inner join, so only scored words remain. I know that there is code to remove the apostrophes from the words, but I chose to leave them in: the only place I noticed them was in possessive nouns (e.g., kings vs. king’s), and I think the plural and the possessive have different meanings.
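To see what tokenizing, stop-word removal, and scoring each do, here is a minimal sketch on two made-up sentences. The `toy_lexicon` table is a hypothetical stand-in for AFINN with illustrative scores, used so the example runs without downloading the real lexicon.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy data; the real analysis uses the grimms data frame.
toy <- tibble(
  Title = c("Story A", "Story B"),
  Text  = c("The poor king cried", "A beautiful bird sang")
)

# Hypothetical three-word lexicon standing in for get_sentiments("afinn").
toy_lexicon <- tibble(
  word  = c("poor", "cried", "beautiful"),
  value = c(-2, -2, 3)
)

scored <- toy |>
  unnest_tokens(word, Text) |>           # one lowercased word per row
  anti_join(stop_words, by = "word") |>  # drop words such as "the" and "a"
  inner_join(toy_lexicon, by = "word")   # keep only words the lexicon scores

print(scored)
```

Note that words like "king" and "bird" vanish at the last step: they are not stop words, but the lexicon has no score for them.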
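# One row per word
grimms |>
  unnest_tokens(word, Text) -> grimms2
# Drop stop words, then attach each remaining word's AFINN score
grimms2 |>
  anti_join(stop_words, by = "word") |>
  inner_join(get_sentiments('afinn'), by = "word") -> grimms3
Next, I wanted to analyze the sentiment by story, so I calculated the average sentiment value of the scored words in each story.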
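As a small illustration of the per-story average, the sketch below uses made-up scores shaped like grimms3. The extra `n_words` column is my own addition, not part of the original analysis; it shows how many scored words stand behind each mean, which can help when judging how much weight an average deserves.

```r
library(dplyr)
library(tibble)

# Hypothetical scored words, shaped like grimms3 (one row per scored word).
scored <- tibble(
  Title = c("Story A", "Story A", "Story A", "Story B"),
  value = c(-2, -3, 2, 3)
)

by_story <- scored |>
  group_by(Title) |>
  summarize(
    avg_sentiment = mean(value),  # mean score of the matched words
    n_words = n()                 # number of scored words behind that mean
  )

print(by_story)
```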
grimms3 |>
  group_by(Title) |>
  summarize(avg_sentiment = mean(value)) -> grimms_story
It was time to create my first visualization. I wanted to display the 10 stories with the highest average sentiment value per word (the happiest stories), as well as the 10 stories with the lowest (the saddest or scariest stories). I did this by sorting the rows with the order function and pulling the first 10 with the head function, once in descending and once in ascending order. Then I used the rbind function, which combines data frames by rows (stacking them vertically), to create one dataset so that everything could go into a single visualization.
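The sort, slice, and stack pattern can be checked on a handful of made-up values first. In this sketch the `stories` data frame is hypothetical, and it keeps the top and bottom 2 rather than 10.

```r
# Hypothetical per-story averages, shaped like grimms_story.
stories <- data.frame(
  Title = c("A", "B", "C", "D", "E"),
  avg_sentiment = c(0.5, -1.2, 0.1, -0.3, 0.9),
  stringsAsFactors = FALSE
)

# order() returns the row positions that sort the column;
# head() then keeps the first n of the sorted rows.
top_high <- head(stories[order(stories$avg_sentiment, decreasing = TRUE), ], 2)
top_low  <- head(stories[order(stories$avg_sentiment), ], 2)

# rbind() stacks the two data frames on top of each other.
top_both <- rbind(top_high, top_low)
print(top_both)
```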
top_10_high <- head(grimms_story[order(grimms_story$avg_sentiment, decreasing = TRUE), ], 10)
top_10_low <- head(grimms_story[order(grimms_story$avg_sentiment), ], 10)
top_10 <- rbind(top_10_high, top_10_low)
As for building the plot, I knew I wanted a bar graph, since I find bars easy to read and compare directly. This was a hard plot to make, so I want to point out some of the features I added to make it as easy to understand as possible. I used fill = to color the bars based on whether their average sentiment was over 0 (positive) or under 0 (negative). The stat = "identity" argument in geom_bar means that each bar’s length is taken directly from avg_sentiment, the y value I specified in aes, rather than from a count of rows. I chose red for the negative bars and green for the positive ones, as that made sense in my head. I flipped the graph using coord_flip. Finally, guide = "none" removes the legend, which I didn’t think was necessary.
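The same plot structure can be sketched on four made-up stories. Two details here are my own choices rather than the original code: geom_col(), which ggplot2 provides as the equivalent of geom_bar(stat = "identity"), and a named color vector, which spells out which color goes with FALSE and which with TRUE instead of relying on their order.

```r
library(ggplot2)

# Hypothetical averages for illustration only.
toy <- data.frame(
  Title = c("A", "B", "C", "D"),
  avg_sentiment = c(0.6, 0.2, -0.4, -1.8),
  stringsAsFactors = FALSE
)

p <- ggplot(toy, aes(x = reorder(Title, avg_sentiment),
                     y = avg_sentiment,
                     fill = avg_sentiment > 0)) +
  geom_col() +  # bar length comes straight from y, like stat = "identity"
  scale_fill_manual(values = c("FALSE" = "red", "TRUE" = "green"),
                    guide = "none") +
  coord_flip() +
  labs(x = "Story Title", y = "Average Sentiment")

# print(p) draws the chart in an interactive session.
```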
ggplot(top_10, aes(x = reorder(Title, avg_sentiment), y = avg_sentiment, fill = avg_sentiment > 0)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "green"), guide = "none") +
  coord_flip() +
  labs(x = "Story Title", y = "Average Sentiment", title = "Top 10 Stories by Average Sentiment")
Looking at this graph, the first thing I noticed was the range of average sentiment values in the two groups. Among the higher sentiments, the “Happy” group, the gap between the top spot and the bottom spot is not that large. In the “Sad” group, the range is much more extreme: “The Old Man and His Grandson” has double the negative average sentiment of “Fundevogel.” It is also worth noticing that the happiest story, “Hans in Luck,” has an average sentiment value of +0.59, while “The Old Man and His Grandson” has an average sentiment value of -1.83. This suggests that, among these stories, emotions run more extreme on the negative side of the sentiment spectrum than on the positive side. Considering the Grimm Brothers are known for darker stories, this is not surprising.
For my second plot, I wanted to create a word cloud of the most common sentiment-scored words in these fairytales. I also included a table of word frequencies to look more closely at the numbers.
grimms3 |>
  count(word, sort = TRUE) -> word_frequency
print(word_frequency)
# A tibble: 540 × 2
word n
<chr> <int>
1 cried 153
2 beautiful 123
3 poor 91
4 cut 78
5 fire 70
6 dead 58
7 dear 57
8 care 56
9 fine 54
10 leave 49
# ℹ 530 more rows
library(wordcloud2)
grimms3 |>
  count(word, sort = TRUE) |>
  wordcloud2(size = 0.75)
Wow! It is incredibly interesting (and sad?) that “cried” is the most used sentiment-scored word, with 153 uses across these 62 stories! I found that much of its usage was in the context of speaking rather than actually shedding tears. Here is an example from “Little Red Cap” (better known as “Little Red Riding Hood”): “Soon afterwards the wolf knocked, and cried: ‘Open the door, grandmother, I am Little Red-Cap, and am bringing you some cakes.’”
Luckily, there is a bit of redemption: “beautiful” is the second most used word, with 123 uses. Here is an example from one of my all-time favorite stories, “The Twelve Dancing Princesses”: “There was a king who had twelve beautiful daughters.”
This made me want to explore the context in which these words are used, so I turned to bigrams. I went back to my original grimms data frame, but this time I unnested two words at a time to create bigrams. I then filtered out the stop words from both word1 and word2 so I could look at the pairings in more depth without being flooded with a bunch of “a” and “the.” I saved the separated, filtered, and counted bigrams as grimms_bigrams2.
# Two-word tokens instead of single words
grimms |>
  unnest_tokens(bigram, Text, token = "ngrams", n = 2) -> grimms_bigrams
# Split each bigram, drop pairs containing a stop word, and count the rest
grimms_bigrams |>
  separate(bigram, c('word1', 'word2'), sep = " ") |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word) |>
  count(word1, word2, sort = TRUE) -> grimms_bigrams2
print(grimms_bigrams2)
# A tibble: 4,277 × 3
word1 word2 n
<chr> <chr> <int>
1 king’s daughter 28
2 king’s son 28
3 red cap 27
4 snow white 21
5 fell asleep 20
6 rose red 17
7 juniper tree 16
8 fast asleep 15
9 cat skin 14
10 clever elsie 14
# ℹ 4,267 more rows
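To make the bigram steps concrete, here is a sketch of the same pipeline on one made-up sentence. The `toy` data is hypothetical; the functions are the ones used in the analysis.

```r
library(dplyr)
library(tibble)
library(tidyr)
library(tidytext)

# One made-up sentence standing in for a story.
toy <- tibble(
  Title = "Story A",
  Text  = "the red cap and the red cap fell fast asleep"
)

pairs <- toy |>
  unnest_tokens(bigram, Text, token = "ngrams", n = 2) |>  # overlapping word pairs
  separate(bigram, c("word1", "word2"), sep = " ") |>      # one column per word
  filter(!word1 %in% stop_words$word,                      # drop a pair if either
         !word2 %in% stop_words$word) |>                   # word is a stop word
  count(word1, word2, sort = TRUE)                         # one row per distinct pair

print(pairs)
```

Pairs such as "the red" and "cap and" are removed because one of their words is a stop word, which leaves "red cap" as the only pair that occurs twice.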
The bigram at the top of the list, “king’s daughter” (tied with “king’s son” at 28 uses), can be found in many stories, but I particularly enjoyed this usage of the phrase in “Briar Rose” (better known as “Sleeping Beauty”): “So she cried out, ‘The king’s daughter shall, in her fifteenth year, be wounded by a spindle, and fall down dead.’”
I wanted to explore the bigrams a little further and find out what the most common first word in a pairing was.
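grimms_bigrams2 |>
  count(word1, sort = TRUE) -> most_common_word1
The most common word1 was “beautiful.” Because grimms_bigrams2 has one row per distinct pair, this count (in dplyr 1.0.0 and later) reflects how many different words “beautiful” is paired with, not its total number of uses. I then wanted to see which words most often follow “beautiful”: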
grimms_bigrams2 |>
  filter(word1 == "beautiful") -> beautiful_bigrams
print(beautiful_bigrams)
# A tibble: 54 × 3
word1 word2 n
<chr> <chr> <int>
1 beautiful bird 11
2 beautiful princess 6
3 beautiful daughter 3
4 beautiful flower 3
5 beautiful maiden 3
6 beautiful white 3
7 beautiful body 2
8 beautiful castle 2
9 beautiful child 2
10 beautiful clothes 2
# ℹ 44 more rows
It was in fact “bird,” with 11 uses. This was a little surprising to me, so I looked into the stories to see when “beautiful bird” was occurring. It turns out in “The Juniper Tree,” a character sings a song many times with the lyric “Kywitt, Kywitt, what a beautiful bird am I!”
I then did the same thing for the second word in the bigram pairing:
grimms_bigrams2 |>
  count(word2, sort = TRUE) -> most_common_word2
The most common word2 was “till.” Again, I wanted to see which words most often precede “till”:
till_bigrams <- grimms_bigrams2 |>
  filter(word2 == "till")
print(till_bigrams)
# A tibble: 62 × 3
word1 word2 n
<chr> <chr> <int>
1 waited till 6
2 night till 3
3 stone till 3
4 wait till 3
5 whirl’d till 3
6 peace till 2
7 time till 2
8 abide till 1
9 awake till 1
10 bellow till 1
# ℹ 52 more rows
It turned out to be “waited,” with 6 uses. This made sense, as a lot of characters in fairy tales wait until a certain time or instant to go through with their plan or idea. Here’s an example of that from “The Water of Life”: “Then they waited till he was fast asleep, and poured the Water of Life out of the cup, and took it for themselves, giving him bitter sea-water instead.”
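One detail about the word1 and word2 rankings is worth spelling out. Since grimms_bigrams2 already holds one row per distinct pair, calling count() on it tallies rows (in dplyr 1.0.0 and later), which measures how many different partners a word has. Passing the existing n column as a weight gives total occurrences instead. The sketch below uses made-up counts to show that the two rankings can disagree; which one counts as "most common" depends on the question being asked.

```r
library(dplyr)
library(tibble)

# Hypothetical bigram counts, shaped like grimms_bigrams2 (one row per pair).
pairs <- tibble(
  word1 = c("red", "beautiful", "beautiful", "beautiful"),
  word2 = c("cap", "bird", "princess", "castle"),
  n     = c(27, 3, 2, 1)
)

# Unweighted: counts rows, i.e. how many distinct partners each word1 has.
distinct_partners <- pairs |> count(word1, sort = TRUE)

# Weighted by n: total occurrences of each word1 across all of its pairs.
total_uses <- pairs |> count(word1, wt = n, sort = TRUE)

print(distinct_partners)
print(total_uses)
```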
In conclusion, I think this project offered a fresh perspective on these age-old fairy tales, and helped me to better understand some of the common themes as well as the writing styles behind them. I hope that this project inspires you to revisit some of your favorite fairy tales, and see if you can pick up on some of the patterns that were discovered in this project!