In the course of writing the book That Type of Girl: Notes on Takako Shimura’s Sweet Blue Flowers, I created a complete index of all characters in the manga and every page on which they appeared. In this document I use that character index to create a social graph showing the relative prominence of the characters and their relationships as featured in the manga’s pages.
For those readers unfamiliar with the R statistical software and the additional Tidyverse and tidygraph software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information, check out the various ways to learn more about the Tidyverse.
I use a Python program to parse the character index, so I first go to the terminal window and set up a Python virtual environment:
virtualenv .venv
source .venv/bin/activate
I also create a file .Rprofile containing the following line, to force the use of the version of Python set for the virtual environment:
Sys.setenv("RETICULATE_PYTHON" = ".venv/bin/python")
(This requires restarting the R session.)
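As a quick sanity check (not part of the original analysis), the following sketch reports which interpreter is running and whether it appears to live inside a virtual environment. The helper name in_virtual_environment is my own, and the check assumes a reasonably recent venv or virtualenv, where sys.prefix and sys.base_prefix differ inside the environment.

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment, while
    # sys.base_prefix points at the interpreter the environment was built from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


# Show the interpreter path and the result of the check.
print(sys.executable)
print(in_virtual_environment())
```

Running this through reticulate after restarting the R session should show a path under .venv if the .Rprofile setting took effect.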
I then load the following R libraries (gtools, for example, provides the quantcut() function):
library("tidyverse")
library("reticulate")
library("tidygraph")
library("ggraph")
library("flextable")
library("gtools")
library("tools")
I use a local copy of the unabridged character index (which is in Markdown format). (The original file will be in the public source repository for That Type of Girl. That repository did not exist at the time this analysis was created.)
I check the MD5 hash value of the file, and stop if its contents are not what is expected.
stopifnot(md5sum("char-index.md") == "7d00bec7a2e00c4496b915e84f935391")
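The same kind of integrity check could also be done on the Python side. The sketch below is only an illustration: the function name md5_of_file and the chunk size are my own choices, and it uses the standard hashlib module rather than anything from the original analysis.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at the given path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read the file in chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example use (assumes char-index.md is in the working directory):
# assert md5_of_file("char-index.md") == "7d00bec7a2e00c4496b915e84f935391"
```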
This file is formatted to have one line for each character, giving their name (or role) and perhaps a parenthetical comment. Following each such line there are multiple lines, one line for each page or consecutive range of pages on which the character appears.
I use Python code to convert the character index Markdown file into a CSV file containing rows matching characters to the pages (or ranges of pages) on which they appear.
I start by importing the needed Python modules.
import re
import math
import csv
I then define a Python function to take a line from the index that references the character’s name, and extract the actual name itself.
NOTE: I do not include any blank lines in the function definition, so that I can step through this file a chunk at a time when debugging the analysis. Leaving out the blank lines enables the entire chunk to be properly treated as a Python function definition by the Python console.
def get_character_name(index_entry):
    """Extract character name from index entry."""
    #
    # We have three possible cases, with a regular expression for each.
    # 1. The character has a full name in the format family, given.
    full_re = r'^- +\**([A-Za-z]+), +([A-Za-z]+).*$'
    #
    # 2. The character has a single name (which may be either their
    #    given name or family name) followed by a comment in parentheses.
    single_re = r'^- +([A-Za-z]+) *\(([^)]+)\).*$'
    #
    # 3. The character is identified in some other way.
    other_re = r'^- +(.+)[,:].*$'
    #
    # We attempt to match the line against all three possibilities.
    full_m = re.match(full_re, index_entry)
    single_m = re.match(single_re, index_entry)
    other_m = re.match(other_re, index_entry)
    #
    # Check for each case and set the character name accordingly.
    if full_m is not None:  # Family name, given name
        if full_m.group(2) is None:  # Shouldn't happen, but..
            character = full_m.group(1)
        else:
            character = f"{full_m.group(2)} {full_m.group(1)}"
    elif single_m is not None:  # Single name with comment
        if single_m.group(2) is None:  # Shouldn't happen, but...
            character = single_m.group(1)
        else:
            character = f"{single_m.group(1)} ({single_m.group(2)})"
    elif other_m is not None:  # Some other format
        character = other_m.group(1)
    else:  # Unknown format
        character = ''
    #
    # Return the name to be used for this character.
    return get_display_name(character)
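To make the three cases concrete, here is a condensed, self-contained sketch that applies the same three regular expressions and reports which case matched. The helper classify_entry and the sample lines are hypothetical (I made the lines up to resemble index entries), and the sketch skips the display-name lookup.

```python
import re

# The three patterns from the function above.
FULL_RE = r'^- +\**([A-Za-z]+), +([A-Za-z]+).*$'
SINGLE_RE = r'^- +([A-Za-z]+) *\(([^)]+)\).*$'
OTHER_RE = r'^- +(.+)[,:].*$'


def classify_entry(line):
    """Return (case, name) for an index entry, trying the patterns in order."""
    m = re.match(FULL_RE, line)
    if m:
        # "Family, Given" becomes "Given Family".
        return ('full', f"{m.group(2)} {m.group(1)}")
    m = re.match(SINGLE_RE, line)
    if m:
        return ('single', f"{m.group(1)} ({m.group(2)})")
    m = re.match(OTHER_RE, line)
    if m:
        return ('other', m.group(1))
    return ('unknown', '')


# Hypothetical sample lines, one per case.
for sample in ['- Manjome, Fumi', '- Kagami (husband of Kazusa)', '- Fujigaya students:']:
    print(classify_entry(sample))
```

Note that the order of the checks matters: the third pattern is loose enough to match lines that the first two also match, so it is tried last.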
Some names we want to display specially (e.g., “Akira” instead of “Akira Okudaira”), so I define a function to modify names as desired.
def get_display_name(name):
    """Return given names, nicknames, or other special designators."""
    #
    display_names = {
        'Akira Okudaira': 'Akira',
        'Chizu Hanashiro': 'Chizu',
        'Fumi Manjome': 'Fumi',
        'Miwa Motegi': 'Mogi',
        'Shinobu Okudaira': 'Shinobu',
        'Haruka Ono': 'Haruka',
        'Hinako Yamashina': 'Hinako',
        'Kagami (husband of Kazusa)': 'Mr. Kagami',
        'Kayoko Ikumi': 'Kyoko’s mother',
        'Kazusa Sugimoto': 'Kazusa',
        'Ko Sawanoi': 'Ko',
        'Kuri Sugimoto': 'Kuri',
        'Kyoko Ikumi': 'Kyoko',
        'Manjome (mother of Fumi)': 'Fumi’s mother',
        'Misako Yasuda': 'Yassan',
        'Orie Ono': 'Orie',
        'Ryoko Ueda': 'Ueda',
        'Sakiko Okudaira': 'Akira’s mother',
        'Shinako Sugimoto': 'Shinako',
        'Yasuko Sugimoto': 'Yasuko',
        'Yoko Honatsugi': 'Pon',
    }
    return display_names.get(name, name)
Each character will have one or more page references. I therefore need a function to take the page references from the index (which include a volume number) and extract the triplet volume number, first page number, and last number (which per the Chicago Manual of Style scheme is not necessarily a full page number).
def get_volume_pages(ref):
    """Given a page reference, return volume plus page range."""
    #
    # Page references are of the form x:yyy or x:yyy--zzz where x is
    # the volume number, yyy is the first page number in the reference,
    # and zzz is the end of the page number range (in CMOS format).
    # References may also have a preceding or succeeding underscore,
    # and may be followed by additional punctuation or other material
    # we can ignore.
    ref_re = '^ *_?([1-9][0-9]*):([1-9][0-9]*)(--*)?([0-9]+)?_?[,; ]?.*$'
    #
    # Look for a page reference.
    ref_m = re.match(ref_re, ref)
    if ref_m is None:
        return (0, 0, 0)
    #
    # Pull out the volume, starting page, and ending page (CMOS format).
    volume = int(ref_m.group(1))
    if ref_m.group(2) is None:  # Shouldn't happen, but...
        return (0, 0, 0)
    if ref_m.group(4) is None:  # Single page
        first = int(ref_m.group(2))
        lst = first
    else:  # Range of pages
        first = int(ref_m.group(2))
        lst = int(ref_m.group(4))
    return (volume, first, lst)
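The following condensed sketch shows what the page-reference pattern extracts from a few inputs. The helper name parse_ref and the sample strings are hypothetical. Only the regular expression is taken from the function above.

```python
import re

# The page-reference pattern from the function above.
REF_RE = r'^ *_?([1-9][0-9]*):([1-9][0-9]*)(--*)?([0-9]+)?_?[,; ]?.*$'


def parse_ref(ref):
    """Return (volume, first, lst), or (0, 0, 0) if ref is not a page reference."""
    m = re.match(REF_RE, ref)
    if m is None:
        return (0, 0, 0)
    volume = int(m.group(1))
    first = int(m.group(2))
    # With no range part, the reference is a single page.
    lst = int(m.group(4)) if m.group(4) is not None else first
    return (volume, first, lst)


# Hypothetical inputs: a single page, a condensed range, an underscored range.
for sample in ['    1:23', '2:101--8,', '_3:357--60_', 'not a reference']:
    print(parse_ref(sample))
```

The third element is still in condensed form at this point (8 rather than 108, 60 rather than 360); expanding it is a separate step.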
Finally, I define a function to take a page range condensed according to the scheme recommended by the Chicago Manual of Style (CMOS 9.61) and convert that to the full first and last page numbers.
def get_first_last_pages(first, lst):
    """Convert CMOS-style page range to first and last page numbers."""
    #
    assert isinstance(first, int) and first > 0
    assert isinstance(lst, int) and lst > 0
    if lst >= first:  # Second part is the actual page number
        last = lst
    else:  # Second part contains only changed digits from first part
        # Compute the number of digits in the second part. (Counting
        # characters also handles exact powers of ten such as 10.)
        n_digits = len(str(lst))
        #
        # Compute base digits of first part, ignoring last n digits.
        first_base = first // 10**n_digits
        #
        # The last page number is that value plus the changed digits.
        last = first_base * 10**n_digits + lst
    return (first, last)
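A few worked cases may make the condensed-range arithmetic easier to follow. This is a minimal sketch of the same idea under a name of my own (expand_cmos_range). It counts digits by converting the number to a string, which is one possible way to do it and not necessarily how the function above does.

```python
def expand_cmos_range(first, lst):
    """Expand a condensed page range such as 321-28 into (321, 328)."""
    if lst >= first:
        # The second number is already a full page number.
        return (first, lst)
    # Otherwise the second number holds only the digits that changed:
    # replace the last n digits of the first number with it.
    n_digits = len(str(lst))
    base = first // 10**n_digits
    return (first, base * 10**n_digits + lst)


# 321-28: keep the leading "3", replace the last two digits with "28".
print(expand_cmos_range(321, 28))
# 101-8: keep "10", replace the last digit with "8".
print(expand_cmos_range(101, 8))
# 498-532: the second number is already complete.
print(expand_cmos_range(498, 532))
```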
I now run Python code to open the index file, extract the information about the characters and their appearances, and create a CSV file char-appear.csv that lists each page or range of pages on which a given character appears.
index_path = 'char-index.md'
with open(index_path, 'r', encoding='UTF-8') as index_f:
    lines = index_f.readlines()
# Initialize appearances dict. Each entry is a list of ranges.
appearances = {}
character = ''
# Look for characters and record their appearances.
for line in lines:
    # Look for character entries.
    if line.startswith('- '):
        # Skip cross-reference ("See") entries.
        if '*See*' in line:
            continue
        character = get_character_name(line)
    elif character == '':  # Haven't seen a character entry yet
        continue
    else:
        volume, first, lst = get_volume_pages(line)
        if volume <= 0:
            continue
        if character not in appearances:
            appearances[character] = []
        first, last = get_first_last_pages(first, lst)
        appearances[character].append((volume, first, last))
with open('char-appear.csv', 'w', encoding='UTF-8') as appearances_f:
    appearances_csv = csv.writer(
        appearances_f,
        delimiter=',',
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
    )
    for appearance in appearances.items():
        for page_range in appearance[1]:
            appearances_csv.writerow(
                (
                    appearance[0],
                    page_range[0],
                    page_range[1],
                    page_range[2],
                ),
            )
However, this CSV file, although useful for other purposes, is not ideal for analysis since each row may correspond to either a page or a range of pages.
I thus use more Python code to create a CSV file char-page.csv in which each row marks an appearance by a given character on a given page.
with open('char-appear.csv', 'r', encoding='UTF-8') as appearance_f:
    appearance_csv = csv.reader(
        appearance_f,
        delimiter=',',
        quotechar='"',
    )
    with open('char-page.csv', 'w', encoding='UTF-8') as page_f:
        page_csv = csv.writer(
            page_f,
            delimiter=',',
            quotechar='"',
            quoting=csv.QUOTE_MINIMAL,
        )
        for row in appearance_csv:
            for page in range(int(row[2]), int(row[3]) + 1):
                page_csv.writerow((row[0], f"{row[1]}:{page}"))
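The expansion from page ranges to individual pages can be illustrated without any files. In this sketch the function name expand_to_pages is my own, and the single input row is based on the appearance of Maeda on 3:357–60 mentioned later in this document.

```python
def expand_to_pages(rows):
    """Yield (character, 'volume:page') for every page in each range."""
    for character, volume, first, last in rows:
        # range() excludes its end point, so add 1 to include the last page.
        for page in range(int(first), int(last) + 1):
            yield (character, f"{volume}:{page}")


# One row in the (character, volume, first, last) layout of char-appear.csv.
rows = [('Maeda', 3, 357, 360)]
for record in expand_to_pages(rows):
    print(record)
```

Prefixing the page number with the volume number keeps pages from different volumes distinct, which matters later when counting unique pages.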
I then read the char-page.csv file into a data table char_page_tb:
char_page_tb <- read_csv(
"char-page.csv",
col_names=c('Character', 'Page'),
col_types="cc",
)
I now have a table showing each and every page on which a given character appears, for every character depicted in the manga. This table will serve as the basis for all further analyses.
The first question I look at is: how prominent is each character? For example, does Fumi appear more often than Akira, or vice versa? And what about Kyoko? How frequently does she appear relative to Fumi and Akira?
To answer this I create a table char_pct_tb containing the percentage of all pages in Sweet Blue Flowers on which each character appears. I sort the table in descending order by the percentage value, so the most prominent characters will be at the top.
N_Appearances <- length(char_page_tb$Page)
N_Pages <- char_page_tb %>%
select(Page) %>%
unique() %>%
summarize(count = n()) %>%
as.integer()
char_pct_tb <- char_page_tb %>%
group_by(Character) %>%
summarize(Page_Count = n()) %>%
mutate(Page_Pct = round((100.0 * Page_Count) / N_Pages, 1)) %>%
arrange(desc(Page_Pct))
N_Chars <- length(char_pct_tb$Page_Count)
In total there are 82 characters listed in the character index, which covers 1319 pages of the manga. The total number of character appearances is 3399, so the average number of characters depicted per page is 2.6. The median character appears on 7 pages.
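For readers more comfortable with Python than with the Tidyverse, the prominence calculation can be sketched as follows. The function name prominence and the toy data are hypothetical. The sketch assumes each (character, page) pair occurs at most once, as in char-page.csv.

```python
from collections import Counter


def prominence(char_pages):
    """Return (character, page count, percent of pages) rows, most prominent first."""
    # The denominator is the number of distinct pages with any character on them.
    n_pages = len({page for _, page in char_pages})
    counts = Counter(character for character, _ in char_pages)
    table = [(c, n, round(100.0 * n / n_pages, 1)) for c, n in counts.items()]
    return sorted(table, key=lambda row: row[1], reverse=True)


# Toy data: three characters across four pages.
toy = [('A', '1:1'), ('B', '1:1'), ('A', '1:2'),
       ('A', '1:3'), ('C', '1:3'), ('B', '1:4')]
for row in prominence(toy):
    print(row)

# The average quoted above is total appearances divided by distinct pages.
print(round(3399 / 1319, 1))
```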
I graph the percentage of pages in which the top 20 characters appear:
char_pct_tb %>%
head(20) %>%
mutate(Character = fct_reorder(Character, -Page_Pct)) %>%
ggplot() +
geom_col(aes(x = Character, y = Page_Pct)) +
scale_y_continuous(breaks = seq(0, 100, 10)) +
ylab("Percentage of Pages") +
labs(
title = "Prominence of Characters in Sweet Blue Flowers",
subtitle = "Based on the Percentage of Pages in which the Character Appears",
caption = "Data source: \n That Type of Girl, unabridged character index"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.title.x = element_text(margin = margin(t = 5))) +
theme(axis.title.y = element_text(margin = margin(r = 10))) +
theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
Here is the same data in tabular form, again, for the top twenty characters only:
char_pct_tb %>%
head(20) %>%
select(Character, Page_Pct) %>%
flextable(col_keys = c("Character", "Page_Pct")) %>%
set_header_labels(Page_Pct = "% of Pages") %>%
set_caption("Prominence of Sweet Blue Flowers Characters") %>%
autofit()
Character | % of Pages |
Fumi | 52.1 |
Akira | 49.4 |
Kyoko | 23.1 |
Yasuko | 16.5 |
Yassan | 11.1 |
Haruka | 9.5 |
Pon | 8.8 |
Mogi | 8.0 |
Hinako | 7.7 |
Ko | 6.9 |
Shinobu | 6.2 |
Akira’s mother | 4.5 |
Mr. Kagami | 4.5 |
Ueda | 4.5 |
Fumi’s mother | 3.6 |
Kazusa | 3.3 |
Shinako | 3.0 |
Orie | 2.7 |
Kuri | 2.3 |
Kyoko’s mother | 2.0 |
The graph and table answer both of the questions asked earlier:
Fumi and Akira are almost equally prominent in Sweet Blue Flowers, with Fumi having a slight edge. Each of the two girls appears in around half of the manga’s pages.
Kyoko is the next most prominent character, appearing on about a quarter of all pages, a little less than half as many as Fumi or Akira. Yasuko appears on about one sixth of all pages.
I now look at the relationships between characters. I consider two characters to have some sort of relationship to each other if they both appear on a page together.
This criterion is not foolproof—for example, a given page may have some panels with one group of characters and then transition to other panels with other characters. However, if those instances are relatively few (and I believe they are) then this criterion is a reasonable one.
In addition to knowing which characters have a relationship to others, I would also like to know how prominently those relationships are featured in the manga. To determine this I use the number of pages on which two characters appear together as a proxy measure.
I now build a table joint_pct_tb of characters’ joint appearances as follows:
1. Join char_page_tb to itself, using the Page column as a common field. This produces a large table with rows for every page on which a given character A appears with another character B (or vice versa).
2. The joined table counts each pairing twice: for example, there is one row with Akira in the Character.x column and Fumi in the Character.y column, and a second row where their positions are reversed. There are also rows in which a character appears twice. To eliminate this double counting and self-counting, I retain only rows where the first character name is strictly before the second character name in alphabetical order.
3. Drop the Page field, since it’s no longer needed.
4. Group the rows by character pair and count the rows in each group, storing the result in Page_Count.
5. Add a column Page_Pct expressing the page count as a percentage of all pages in the manga.
6. Add a column Pair that combines the two characters’ names in a single field.
7. Sort the table in descending order by Page_Count.
joint_pct_tb <- full_join(char_page_tb, char_page_tb, by = 'Page') %>%
filter(Character.x < Character.y) %>%
select(!Page) %>%
group_by(Character.x, Character.y) %>%
summarize(Page_Count = n()) %>%
ungroup() %>%
mutate(Page_Pct = round(100.0 * Page_Count / N_Pages, 1)) %>%
mutate(Pair = paste(Character.x, Character.y, sep=' / ')) %>%
arrange(desc(Page_Count))
N_Jt_Appear <- length(joint_pct_tb$Pair)
N_Max_Jt_Appear <- (N_Chars * (N_Chars - 1)) / 2
There are 350 unique joint appearances of characters. This compares to the theoretical maximum of 3321 joint appearances if every character appeared at least once on a page with every other character. Thus only 11% of all possible character relationships are actually realized in the manga.
The median number of joint appearances by a pair of characters is 3 pages.
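The joint-appearance count can also be sketched in Python. Instead of a self-join followed by a filter, this version groups characters by page and enumerates each unordered pair once. The function name joint_counts and the toy data are hypothetical. The last two lines reproduce the arithmetic behind the figures quoted above.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import comb


def joint_counts(char_pages):
    """Count the pages on which each unordered pair of characters appears together."""
    by_page = defaultdict(set)
    for character, page in char_pages:
        by_page[page].add(character)
    counts = Counter()
    for characters in by_page.values():
        # Sorting puts each pair in alphabetical order, so (A, B) and (B, A)
        # are never counted separately and no one is paired with themselves.
        for pair in combinations(sorted(characters), 2):
            counts[pair] += 1
    return counts


toy = [('A', '1:1'), ('B', '1:1'), ('A', '1:2'),
       ('A', '1:3'), ('C', '1:3'), ('B', '1:4')]
print(joint_counts(toy))

# Maximum number of pairs among 82 characters, and the share actually realized.
print(comb(82, 2))
print(round(100 * 350 / comb(82, 2)))
```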
I then plot the top 20 character pairs as measured by the percentage of all pages in Sweet Blue Flowers in which the two characters in each pair both appear.
joint_pct_tb %>%
head(20) %>%
mutate(Pair = fct_reorder(Pair, -Page_Count)) %>%
ggplot() +
geom_col(aes(x = Pair, y = Page_Pct)) +
scale_y_continuous(breaks = seq(0, 100, 10)) +
xlab("Character Pair") +
ylab("Percentage of Pages") +
labs(
title = "Most Prominent Relationships in Sweet Blue Flowers",
subtitle = "Based on Percentage of Pages in which Characters Appear Together",
caption = "Data source: \n That Type of Girl, unabridged character index"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.title.x = element_text(margin = margin(t = 5))) +
theme(axis.title.y = element_text(margin = margin(r = 10))) +
theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
Here is the same data in tabular form, again for the top twenty pairs only:
joint_pct_tb %>%
head(20) %>%
select(Pair, Page_Pct) %>%
flextable() %>%
set_header_labels(Page_Pct = "% of Pages") %>%
set_caption("Prominence of Sweet Blue Flowers Relationships") %>%
autofit()
Pair | % of Pages |
Akira / Fumi | 30.3 |
Akira / Kyoko | 9.7 |
Fumi / Yasuko | 8.8 |
Pon / Yassan | 8.4 |
Mogi / Yassan | 7.5 |
Fumi / Yassan | 7.1 |
Mogi / Pon | 6.6 |
Fumi / Pon | 5.8 |
Akira / Yassan | 5.3 |
Fumi / Mogi | 5.3 |
Ko / Kyoko | 5.3 |
Akira / Shinobu | 4.6 |
Akira / Akira’s mother | 4.2 |
Fumi / Kyoko | 4.0 |
Akira / Pon | 3.8 |
Akira / Yasuko | 3.8 |
Kyoko / Yasuko | 3.6 |
Akira / Mogi | 3.5 |
Fumi / Fumi’s mother | 3.5 |
Akira / Haruka | 3.4 |
As one would expect, the relationship between Fumi and Akira is the most prominent one, featured on almost a third of the manga’s pages. No other relationship is featured on more than a tenth of the manga’s pages, with the relationships between Akira and Kyoko and Fumi and Yasuko being the next most prominent.
A brief digression: as noted above, what I am measuring in the analysis of relationships is their prominence, i.e., how often they are depicted in the manga. One could construct a separate measure of relationship strength by looking at how often two characters are depicted on the same page versus how often either was depicted on a page without the other.
For example, Maeda and her friend Nakajima are depicted on only four pages of Sweet Blue Flowers (3:357–60). Their relationship is thus not very prominent, but it can be construed as very strong, since on those four pages they always appear together. In contrast, the relationship between Fumi and Akira is much more prominent but not as strong by this measure, since one of them is often depicted on a page without the other.
For my purposes I am more interested in how prominently various relationships are featured, so I do not bother to construct an alternate measure of relationship strength.
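For anyone who does want such a measure, one possible formalization (my assumption, not something used in this analysis) is the Jaccard index: the number of pages on which both characters appear, divided by the number of pages on which at least one of them appears. A minimal sketch, with a hypothetical function name:

```python
def strength(pages_a, pages_b):
    """Jaccard index: pages shared by both characters over pages with either one."""
    a, b = set(pages_a), set(pages_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Maeda and Nakajima appear only on 3:357-60, and always together.
maeda = ['3:357', '3:358', '3:359', '3:360']
nakajima = ['3:357', '3:358', '3:359', '3:360']
print(strength(maeda, nakajima))

# For Fumi and Akira the same ratio can be estimated from the percentages in
# the tables above: pages with both over pages with either.
both, fumi, akira = 30.3, 52.1, 49.4
print(round(both / (fumi + akira - both), 2))
```

By this measure the Maeda and Nakajima relationship scores 1.0, while Fumi and Akira score well under one half, which matches the qualitative point made above.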
Using the appearance of two characters on the same page to infer a relationship between them can sometimes produce spurious results, as discussed above. Correcting this would require editing the joint appearances list to remove false positives.
The character index used for this analysis was originally compiled by me in the course of writing the book That Type of Girl. The character index in the book is abridged somewhat to improve readability, but the unabridged version will be available in the public repository for That Type of Girl. (See the section “Source Code” below.)
Try re-doing the analysis of relationships using a measure of relationship strength based on the number of times two characters are depicted together on the same page versus being depicted without the other.
Try plotting different types of social graphs. See the introduction to ggraph layouts for ideas.
I used the following R environment in doing the analysis above:
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] tools stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gtools_3.9.2 flextable_0.6.10 ggraph_2.0.5 tidygraph_1.2.0
## [5] reticulate_1.24 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8
## [9] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.6
## [13] ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.2 bit64_4.0.5 lubridate_1.8.0 httr_1.4.2
## [5] backports_1.4.1 utf8_1.2.2 R6_2.5.1 DBI_1.1.2
## [9] colorspace_2.0-2 withr_2.4.3 tidyselect_1.1.1 gridExtra_2.3
## [13] bit_4.0.4 compiler_4.1.2 cli_3.2.0 rvest_1.0.2
## [17] xml2_1.3.3 officer_0.4.1 labeling_0.4.2 scales_1.1.1
## [21] systemfonts_1.0.4 digest_0.6.29 rmarkdown_2.11 base64enc_0.1-3
## [25] pkgconfig_2.0.3 htmltools_0.5.2 highr_0.9 dbplyr_2.1.1
## [29] fastmap_1.1.0 rlang_1.0.1 readxl_1.3.1 rstudioapi_0.13
## [33] jquerylib_0.1.4 farver_2.1.0 generics_0.1.2 jsonlite_1.7.3
## [37] vroom_1.5.7 zip_2.2.0 magrittr_2.0.2 Matrix_1.3-4
## [41] Rcpp_1.0.8 munsell_0.5.0 fansi_1.0.2 gdtools_0.2.4
## [45] viridis_0.6.2 lifecycle_1.0.1 stringi_1.7.6 yaml_2.2.2
## [49] MASS_7.3-54 grid_4.1.2 parallel_4.1.2 ggrepel_0.9.1
## [53] crayon_1.5.0 lattice_0.20-45 graphlayouts_0.8.0 haven_2.4.3
## [57] hms_1.1.1 knitr_1.37 pillar_1.7.0 igraph_1.2.11
## [61] uuid_1.0-3 reprex_2.0.1 glue_1.6.1 evaluate_0.14
## [65] data.table_1.14.2 modelr_0.1.8 png_0.1-7 vctrs_0.3.8
## [69] tzdb_0.2.0 tweenr_1.0.2 cellranger_1.1.0 gtable_0.3.0
## [73] polyclip_1.10-0 assertthat_0.2.1 xfun_0.29 ggforce_0.3.3
## [77] broom_0.7.12 viridisLite_0.4.0 ellipsis_0.3.2
You will be able to find the source code for this analysis at my public repository for That Type of Girl once the book is published.
This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.