I have set up an R file that will recreate the corpus every time it is called, as seen below:
source("Corpus_Create.R")
This creates the following objects:
-complete_corpus (the collated corpus of all works)
-fic_[#] (the individual corpus for each work, numbered by popularity)
-summary_[#] (the metadata for each fic)
The extracted metadata variables for the corpus are:
-title
-author
-tagged_fandom
-tagged_relationships
-published_date
-completion_date
-chapter_count
Here is a link to a post that shows the complete code of Corpus_Create.R
Currently, there are 50 works of fanfiction in this corpus.
library(quanteda)
library(quanteda.textplots)  # provides textplot_wordcloud() in quanteda 3.0 and later

# tokenize, remove punctuation, numbers, and English stopwords,
# then draw a word cloud of the remaining words
wordplot_from_corpus <- function(corpus_name) {
  corpus_words <- tokens(corpus_name,
                         remove_punct = TRUE,
                         remove_numbers = TRUE)
  corpus_words <- tokens_select(corpus_words,
                                pattern = stopwords("en"),
                                selection = "remove")
  # make all lowercase
  corpus_words <- tokens_tolower(corpus_words)
  corpus_words_dfm <- dfm(corpus_words)
  textplot_wordcloud(corpus_words_dfm)
}
wordplot_from_corpus(complete_corpus)
Already, this word plot tells me a lot. It contains many character names, and most of them come from two of the most popular fandoms: Harry Potter and My Hero Academia. ‘Harry’ and ‘Izuku’ are the main characters of those fandoms and are clearly written about often. Other names point to other popular fandoms: ‘Zuko’ from Avatar: The Last Airbender; ‘Tony’ and ‘Bucky’, likely from the Marvel Cinematic Universe; ‘Stiles’, ‘Derek’, and ‘Scott’, all from Teen Wolf; and ‘Keith’ from Voltron. However, most of the names I can spot do seem to belong to Harry Potter or My Hero Academia.
This is very interesting in terms of what it means about the development of fanfiction for these fandoms. Harry Potter is an older fandom that has always been popular, so it has consistently produced fanworks since before the inception of this fanfiction site, and those works tend to be among the longer works available. It makes sense that many Harry Potter works are ranked in the top 50: they are numerous and more likely to have been updated consistently over a long period of time, garnering more traffic and therefore more likes. My Hero Academia, however, only became popular around 2015, yet it appears to have matched or overtaken Harry Potter in this corpus. This suggests that it produced as many highly ranked works as Harry Potter, or more, in a shorter period of time.
Moving on from names and fandoms to the actual words: ‘like’ is a big one, and this could be for several reasons. It can be a sign that many similes are being used, or that casual language is more common, or that the characters often express that they like each other, or a combination of all of the above. ‘just’ also implies a level of casual language, and it could have been combined with ‘like’ in places: x character looks ‘just like’ y character. The frequency of ‘said’ shows that dialogue is very common in fanworks. ‘eyes’ might be an interesting one to investigate: I wonder whether they are more often performing actions (‘rolled her eyes’, ‘his eyes followed the movement’, etc.) or are frequently being described.
#find out most common fandoms using frequency table
fandoms_table<-table(docvars(complete_corpus)$tagged_fandom)
fandoms_table<-sort(fandoms_table, decreasing=TRUE)
barplot(fandoms_table, las=2)
fandoms_table
504  僕のヒーローアカデミア | Boku no Hero Academia | My Hero Academia
104  Harry Potter - J. K. Rowling
100  Marvel Cinematic Universe
 25  Spider-Man: Homecoming (2017), Marvel Cinematic Universe, Captain America (Movies)
 25  Teen Wolf (TV)
 20  Avatar: The Last Airbender
 19  SK8 the Infinity (Anime)
 15  Voltron: Legendary Defender
 10  Miraculous Ladybug
  6  Spider-Man: Homecoming (2017), Spider-Man - All Media Types
  5  Good Omens (TV)
  3  Haikyuu!!
  2  Captain America (Movies)
  1  Merlin (TV)
  1  Minecraft (Video Game)
  1  She-Ra and the Princesses of Power (2018)
  1  Sherlock (TV), Sherlock Holmes & Related Fandoms
  1  Spider-Man - All Media Types, Daredevil (TV), Daredevil (Comics), Marvel Cinematic Universe
  1  Spider-Man - All Media Types, The Avengers (Marvel Movies), Marvel Cinematic Universe
  1  Star Wars - All Media Types, Star Wars Episode VII: The Force Awakens (2015)
By this list, My Hero Academia, Harry Potter, and the Marvel Cinematic Universe are the three most common fandoms in the corpus. Note that the counts sum to 845, far more than 50, so they appear to count individual corpus documents (most likely chapters) rather than works. Many of the other fandoms that I predicted to be popular, based on the frequency of character names in the word cloud, are also pretty high on this list.
fandoms <- docvars(complete_corpus)$tagged_fandom
# find works in the 3 most common fandoms
# fandom 1: MHA (index of its first instance, used as comparator)
fandom_1_corpus <- corpus_subset(complete_corpus, docvars(complete_corpus)$tagged_fandom == fandoms[1])
# fandom 2: HP (index of its first instance, used as comparator)
fandom_2_corpus <- corpus_subset(complete_corpus, docvars(complete_corpus)$tagged_fandom == fandoms[344])
# fandom 3: MCU (index of its first instance, used as comparator)
fandom_3_corpus <- corpus_subset(complete_corpus, docvars(complete_corpus)$tagged_fandom == fandoms[353])
#create wordplot for each fandom
wordplot_from_corpus(fandom_1_corpus)#mha
wordplot_from_corpus(fandom_2_corpus)#hp
wordplot_from_corpus(fandom_3_corpus)#mcu
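The hard-coded positions above (1, 344, 353) depend on document order and would shift if the corpus were rebuilt with different works. One possible alternative, sketched here in base R only, is to take the fandom names from the sorted frequency table instead. The `fandoms` vector below is a toy stand-in with made-up short labels, not the real metadata.

```r
# Toy stand-in for docvars(complete_corpus)$tagged_fandom (hypothetical values)
fandoms <- c("MHA", "MHA", "MHA", "MHA", "HP", "HP", "HP", "MCU", "MCU", "Teen Wolf")

# Sort fandoms by frequency and keep the names of the three most common
fandoms_table <- sort(table(fandoms), decreasing = TRUE)
top_fandoms <- names(fandoms_table)[1:3]

# Logical index of the documents that belong to the most common fandom;
# with the real corpus, the same comparison could be passed to corpus_subset()
in_top_fandom <- fandoms == top_fandoms[1]
print(top_fandoms)
```

With the real corpus, `top_fandoms[1]`, `top_fandoms[2]`, and `top_fandoms[3]` would take the place of `fandoms[1]`, `fandoms[344]`, and `fandoms[353]`.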
The most frequent word in each of these fandoms is a character name. It is interesting that for the Marvel Cinematic Universe, ‘tony’ is clearly the most written-about character by miles, since this is a somewhat vague category that spans several different movies and TV shows, many of which are not about him. So, unlike in the Harry Potter or My Hero Academia fandoms, ‘tony’ is far from the only contender for main character, and yet the authors write about him as frequently as if he were.
‘like’ still appears on all of these word clouds, but seems most popular in the My Hero Academia fandom. ‘just’ seems to be very popular in the My Hero Academia fandom and not in the other two, leading me to believe that its appearance in the overall word cloud is mostly the work of this fandom. ‘eyes’ is not apparent in the Harry Potter fandom, though it is in the other two. ‘said’ is in all three fandoms, but is most apparent in the Harry Potter fandom.
I would like to extract mentioned characters, possibly by adding a tagged_characters variable to the corpus metadata, in order to see which common words are NOT character names and to measure the frequency of character names within the works.
I would also extract relationships between characters from tagged_relationships and then look for places where both characters are mentioned near each other (maybe within a sentence or two) to see whether there are common phrases.
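As a rough sketch of how that search might start, the base R snippet below finds sentences that mention both characters of one relationship. It assumes the two names in a relationship tag are separated by "/", that matching on first names is good enough, and that splitting on sentence-ending punctuation is an acceptable approximation. The tag and text are invented for illustration and are not taken from the corpus.

```r
# Hypothetical relationship tag and toy text (not from the corpus)
relationship <- "Derek Hale/Stiles Stilinski"
text <- "Stiles grinned at Derek. Scott laughed. Derek looked away. Derek and Stiles left together."

# Assumption: characters are separated by "/"; keep only their first names
characters <- strsplit(relationship, "/", fixed = TRUE)[[1]]
first_names <- sub(" .*", "", characters)

# Split into sentences at whitespace that follows ., ! or ?
sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Keep the sentences that mention both first names
has_both <- grepl(first_names[1], sentences, fixed = TRUE) &
  grepl(first_names[2], sentences, fixed = TRUE)
shared_sentences <- sentences[has_both]
print(shared_sentences)
```

Widening the window to "a sentence or two" would mean also checking each sentence together with its neighbour, which this sketch does not do.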
I would also like to analyse the sentences containing ‘eyes’ to answer my question about the context of its frequent use.
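A first pass at this could look like the base R sketch below, which pulls out every sentence containing the word ‘eyes’ so the contexts can be read side by side. The text is a made-up example, and the sentence splitting is a simple approximation. quanteda's kwic() function, which shows a keyword with a window of surrounding tokens, may be a better fit on the real corpus.

```r
# Toy text standing in for one work (hypothetical, not from the corpus)
text <- "She rolled her eyes. His eyes were a bright green! The door closed. Eyes followed him."

# Split into sentences at whitespace that follows ., ! or ?
sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Keep sentences containing the whole word "eyes", ignoring case
has_eyes <- grepl("\\beyes\\b", sentences, ignore.case = TRUE, perl = TRUE)
eyes_sentences <- sentences[has_eyes]
print(eyes_sentences)
```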
I also need to re-sort all of the Spider-Man works as Marvel Cinematic Universe, since they fall under that category.
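One way this recoding might be done, sketched in base R: replace any fandom tag that contains "Spider-Man" with the single label "Marvel Cinematic Universe". The example tags are copied from the frequency table above. Treating every tag that mentions "Spider-Man" as MCU is an assumption carried over from the plan, and `fandoms_recoded` is a name introduced here for illustration.

```r
# A few fandom tags as they appear in the frequency table
fandoms <- c(
  "Marvel Cinematic Universe",
  "Spider-Man: Homecoming (2017), Spider-Man - All Media Types",
  "Harry Potter - J. K. Rowling",
  "Spider-Man - All Media Types, Daredevil (TV), Daredevil (Comics), Marvel Cinematic Universe"
)

# Flag every tag that mentions Spider-Man (literal match, no regex)
is_spiderman <- grepl("Spider-Man", fandoms, fixed = TRUE)

# Recode flagged tags as MCU and leave all others unchanged
fandoms_recoded <- ifelse(is_spiderman, "Marvel Cinematic Universe", fandoms)
print(table(fandoms_recoded))
```

The recoded vector could then be written back to the corpus metadata before rebuilding the frequency table.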