Blog Post 2

Zoe Bean
2/20/2022

The Data Source

I have decided that I would like to work with fanfiction for my project. The site Archive of Our Own is a hosting site for fanfiction, with a robust system for categorization. Since I would like to work with popular fanfiction, and those works tend to be on the longer side, I do not believe I will need a huge number of works in order to get useful data.

The works stored on the archive are easily downloaded as HTML or PDF files, so I do not believe it is necessary to scrape the archive itself. I will simply download the works that I need and then process them.

Part of the archive’s robust system for categorization involves a good amount of metadata input by the authors of the fanfiction, which will be at the beginning of every document I download. This metadata includes the fandom the work is in (the piece of media that the fanfiction is based on), the main characters involved in the work, the main relationships in the work, and an “additional tags” section where the author can put any other information they wish. This last category can range from tags describing genre to the author rambling about the work. I am uncertain how much of the metadata I will keep, if I keep it at all, but either way it will certainly factor into how I process the data.
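To get a feel for how that metadata could be pulled out of a downloaded file, here is a minimal sketch using only Python's standard library. It assumes the metadata sits in a definition list of `<dt>` label and `<dd>` value pairs; that layout, the sample snippet, and the names TagBlockParser and parse_metadata are my own assumptions for illustration and would need checking against a real download.

```python
from html.parser import HTMLParser


class TagBlockParser(HTMLParser):
    """Collect <dt>label</dt><dd>values</dd> pairs into a dict of lists."""

    def __init__(self):
        super().__init__()
        self.metadata = {}   # label -> list of values
        self._label = None   # text of the most recent <dt>
        self._inside = None  # "dt", "dd", or None
        self._buffer = []    # text pieces seen inside the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._inside = tag
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._inside:
            return  # ignore nested tags such as </a>
        text = "".join(self._buffer).strip()
        if tag == "dt":
            self._label = text.rstrip(":")
            self.metadata.setdefault(self._label, [])
        elif self._label is not None:
            # Assumption: values are comma-separated and contain no commas.
            values = [v.strip() for v in text.split(",") if v.strip()]
            self.metadata[self._label].extend(values)
        self._inside = None


def parse_metadata(html_text):
    parser = TagBlockParser()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


# Hypothetical snippet standing in for the top of a downloaded work.
SAMPLE = """
<dl class="tags">
  <dt>Fandom:</dt>
  <dd><a href="#">Example Fandom</a></dd>
  <dt>Character:</dt>
  <dd><a href="#">Character A</a>, <a href="#">Character B</a></dd>
  <dt>Additional Tags:</dt>
  <dd><a href="#">Fluff</a>, <a href="#">Slow Burn</a></dd>
</dl>
"""

metadata = parse_metadata(SAMPLE)
print(metadata)
```

Keeping the metadata as a dictionary of lists would make it easy either to drop it before analysis or to keep selected fields alongside the text.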

Example of a fanfiction HTML file

However, the metadata is useful for filtering the works on the website itself. To narrow down which works I will collect, I have decided on the following criteria: the most popular works, as determined by ‘Kudos’ (the archive’s version of likes), that are marked as ‘Complete’, are written in English, do not draw on more than one piece of source media, and are not rated ‘Mature’ or ‘Explicit’. This last requirement is because I am not interested in the patterns found in fan-written erotica.
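The same criteria could also be checked after downloading, as a sanity pass over whatever metadata I keep. The sketch below is only an illustration: the Work record, its field names, and the sample entries are hypothetical, and the number of works to keep is left as a parameter because I have not settled on it.

```python
from dataclasses import dataclass


@dataclass
class Work:
    # Hypothetical record; field names are illustrative, not the archive's.
    title: str
    kudos: int
    complete: bool
    language: str
    fandoms: list
    rating: str


EXCLUDED_RATINGS = {"Mature", "Explicit"}


def meets_criteria(work):
    """Complete, in English, one source fandom, not Mature or Explicit."""
    return (
        work.complete
        and work.language == "English"
        and len(work.fandoms) == 1
        and work.rating not in EXCLUDED_RATINGS
    )


def select_top_works(works, n):
    """Return the n eligible works with the most kudos, highest first."""
    eligible = [w for w in works if meets_criteria(w)]
    return sorted(eligible, key=lambda w: w.kudos, reverse=True)[:n]


# Made-up entries purely to exercise the filter.
works = [
    Work("A", 900, True, "English", ["Fandom X"], "General Audiences"),
    Work("B", 1500, False, "English", ["Fandom X"], "Teen And Up Audiences"),
    Work("C", 1200, True, "English", ["Fandom X", "Fandom Y"], "General Audiences"),
    Work("D", 2000, True, "English", ["Fandom Y"], "Explicit"),
    Work("E", 1100, True, "English", ["Fandom Y"], "Teen And Up Audiences"),
]

selected_titles = [w.title for w in select_top_works(works, 2)]
print(selected_titles)
```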

Useful Packages

While I mentioned that I do not need to scrape the archive website itself, scraping methods can still be useful, such as read_html() and maybe css_selector, since one of the available download formats is HTML.

RegEx will probably be more useful in preprocessing than in the analysis itself, as analysing fanfiction will probably not involve many set expressions.
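As a rough idea of what regex-based preprocessing might look like, the sketch below cleans up text that has already been extracted from a work. The specific steps (straightening curly quotes, dropping scene-break lines such as "* * *", collapsing whitespace) are my guesses at what will be needed, and clean_text is a hypothetical helper name.

```python
import re

# Lines made up only of scene-break symbols, e.g. "* * *" or "-----".
SCENE_BREAK = re.compile(r"^[ \t]*(?:[*~#=_-][ \t]*){3,}$", flags=re.MULTILINE)


def clean_text(raw):
    # Straighten curly quotes so later tokenisation treats them uniformly.
    text = raw.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Remove scene-break lines, which carry no linguistic content.
    text = SCENE_BREAK.sub("", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "\u201cHello,\u201d   she said.\n\n\n\n* * *\n\nIt\u2019s   late.\n"
cleaned = clean_text(raw)
print(cleaned)
```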

Analysing fanfiction will involve NLP, along with the packages that support it.

Research

The research paper that I am presenting is Beyond Canonical Texts: A Computational Analysis of Fanfiction by Milli & Bamman.

Data

This paper uses data from fanfiction.net, which is a site very similar to archiveofourown.org, but a little older and with far fewer features for categorization. It would be somewhat accurate to call fanfiction.net the predecessor to archiveofourown.org, but fanfiction.net still operates today and the sites are run by different organizations. Their dataset has 55 billion tokens, and is 88% English, with 44 languages represented.

Analytic Strategy

They looked into differences between fanworks and their source media. They used BookNLP to see if there are differences in what characters are emphasized, as well as differences in the mentioned genders.

The authors also analysed the site's review function, which allows anyone, authors included, to comment on individual chapters of a work, and how it resembles a social network.

Findings

They found that fanworks paid more attention to characters other than the ones emphasized in the source media. They found that about 2% more of the writing is about women than in the source media. They also found a social network structure with three different types of interactions: asking for updates, encouragement of the author, and emotional reviews of the story.

Citation

Milli, S., & Bamman, D. (2016). Beyond Canonical Texts: A Computational Analysis of Fanfiction. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2048–2053. https://doi.org/10.18653/v1/D16-1218