Isaac Djabate — April 2026
Exploring 10,000 comics across 21 countries
This report explores a Kaggle dataset of comic books released between 2000 and 2026 across 21 countries. It analyzes the comic book industry to answer the following questions:
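# Load the packages used throughout the report
# (theme_comic() is a custom theme assumed to be defined elsewhere in the report)
library(tidyverse)      # dplyr, tidyr, stringr, ggplot2
library(plotly)         # ggplotly()
library(rnaturalearth)  # ne_countries()
library(sf)             # geom_sf() support
library(gganimate)      # transition_reveal(), animate(), anim_save()
library(waffle)         # waffle()
library(ggalluvial)     # geom_stratum(), geom_alluvium()
# Load the data set
comics <- read.csv("comic_books_10000_dataset.csv")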
comics <- comics%>%
# Split Genre into two separate columns (primary and secondary)
separate(Genre, into = c("Primary.Genre", "Secondary.Genre"), sep = " / ", extra = "merge", fill = "right") %>%
# Convert numeric column from char to int/num type
mutate(Release.Year = as.numeric(Release.Year),
Page.Count = as.numeric(Page.Count),
Rating..out.of.10. = as.numeric(Rating..out.of.10.),
Volume.Count = as.numeric(Volume.Count)) %>%
# Simplify the Awards column into Winner, Nominee, or None
mutate(Award.Status = case_when(
str_detect(Awards, "Winner") ~ "Winner",
str_detect(Awards, "Nominee") ~ "Nominee",
TRUE ~ "None")) %>%
# Group Year of release into decade Eras(2000s, 2010s, 2020s)
mutate(Era = case_when(
Release.Year < 2010 ~ "2000s",
Release.Year < 2020 ~ "2010s",
TRUE ~ "2020s"
)) %>%
# Order age ratings from least to most mature
mutate(Age.Rating = factor(Age.Rating, levels = c("All Ages", "Teen+", "Young Adult", "Mature", "Mature 17+"))) %>%
# Drop rows with a missing rating, page count, or release year
filter(!is.na(Rating..out.of.10.), !is.na(Page.Count), !is.na(Release.Year))
# count comic from Genre and keep top 10
genre_counts <- comics %>%
count(Primary.Genre) %>%
slice_max(n, n = 10)
#Horizontal bar charts
bar_plot <- ggplot(genre_counts, aes(x = n, y = reorder(Primary.Genre, n),
text = paste("Genre:", Primary.Genre,
"<br>Count:", n))) +
geom_col(fill = "#2E5FA3") + #blue bars
labs(title = "Top 10 Comic Book Genres",
x = "Number of Comics",
y = "Genre") +
scale_x_continuous(expand = expansion(mult= c(0,0.1))) + # Make space for labels
theme_comic()
ggplotly(bar_plot, tooltip = "text") %>%
layout(
paper_bgcolor = "white",
plot_bgcolor = "white"
)
The Superhero and Action genres account for the most comics. The Golden and Silver Ages of comics played a massive role in their rise. However, which countries are responsible for producing these comics?
# Load the world map from rnaturalearth as an sf object
world <- ne_countries(scale = "medium", returnclass = "sf")
# Count the comics per country and recode country names to match the map's name_en column
country_counts <- comics %>%
count(Country.of.Origin) %>%
mutate(Country.of.Origin = recode(Country.of.Origin,
"USA" = "United States of America",
"UK" = "United Kingdom",
"South Korea" = "South Korea"))
# Join comics onto the map by left join
world_comics <- world %>%
left_join(country_counts, by = c("name_en" = "Country.of.Origin"))
ggplot(world_comics) +
geom_sf(aes(fill = n), color = "white", linewidth = 0.1) + # linewidth sets the border width in ggplot2 >= 3.4
scale_fill_gradient(low = "#B8D4F5", high = "#1B2A4A",
na.value = "#E2E8F0",
trans = "log10",
name = "Number of Comics") +
labs(title = "Global Comic Book Production by Country",
subtitle = "Countries shaded by number of comics produced (log scale)") +
theme_comic()
Japan leads with the most comics, followed closely by the USA. Comic production is significant across Asia, while the rest of the world produces little. Given the global popularity of Manga, the industry appears to have undergone a massive shift toward it. However, when exactly did that shift occur?
# Count comics per year, keeping each year's era label
year_counts <- comics %>%
count(Release.Year, Era)
year_counts$Era <- factor(year_counts$Era, levels = c("2000s", "2010s", "2020s"))
#Stacked Area chart showing comic production over time in each era
area_plot <- ggplot(year_counts, aes(x = Release.Year, y = n , fill = Era)) +
geom_area(alpha = 0.8, stat = "identity") +
scale_fill_manual(values = c("2000s" = "#89CFF0",
"2010s" = "#2E5FA3",
"2020s" = "#7C5CBF"
),
drop = FALSE)+ # Unique Color per era
scale_x_continuous(breaks = seq(2000, 2026, by = 5)) + # Year axis break
labs(title = "Comic Production Over Time",
subtitle = "Number of comics published per year by era",
x = "Year",
y = "Number of Comics") +
theme_comic() +
transition_reveal(Release.Year)
animate(area_plot, duration = 12, fps = 60, renderer = gifski_renderer())
anim_save("area_chart.gif")
Production increased steadily from the 2000s to the 2010s. It then grew significantly during the COVID-19 pandemic and peaked in 2023. With such growth, which demographic invested more time in comics?
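The peak year and the year-over-year growth described above can be read off the yearly counts directly. The following is a minimal sketch in base R, assuming a small made-up table `toy_years` in place of the real `year_counts`; the numbers are invented for illustration and are not taken from the dataset.

```r
# Hypothetical yearly counts; values are invented, not from the dataset
toy_years <- data.frame(
  Release.Year = 2019:2024,
  n            = c(400, 440, 480, 520, 560, 500)
)

# Year with the highest number of comics
peak_year <- toy_years$Release.Year[which.max(toy_years$n)]

# Year-over-year change in percent (NA for the first year, which has no predecessor)
toy_years$YoY.Pct <- c(NA, 100 * diff(toy_years$n) / head(toy_years$n, -1))

print(peak_year)
print(toy_years)
```

Applied to the real data, the same two steps would show whether growth accelerated in 2020 and 2021 or simply continued an existing trend.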
# Violin Plot showing rating distribution by age rating
ggplot(comics, aes(x = Age.Rating, y= Rating..out.of.10., fill = Age.Rating))+
geom_violin(alpha = 0.7) + # Violin showing distribution shape
geom_boxplot(width=0.1, alpha = 0.5) + # Boxplot showing median and quartiles
scale_fill_manual(values = c("All Ages" = "#B8D4F5",
"Teen+" = "#5B8DD9",
"Young Adult" = "#2E5FA3",
"Mature" = "#7C5CBF",
"Mature 17+" = "#1B2A4A")) + # Manual palette, one color per age rating
labs(title = "Comic Rating Distribution by Age Rating",
x = "Age Rating",
y = "Rating (out of 10)") +
theme_comic()
Ratings do not differ noticeably across age ratings, which suggests that a comic's age rating does not affect its perceived quality. “Teen+” comics have the most consistent ratings.
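The violin plot supports these claims visually; a numeric summary of centre and spread per group is one way to back them up. The sketch below uses a small invented data frame `toy` as a stand-in for `comics` (with a shortened column name `Rating`), so the values are illustrative only.

```r
# Hypothetical stand-in for `comics`; values are invented for illustration
toy <- data.frame(
  Age.Rating = rep(c("All Ages", "Teen+", "Mature"), each = 4),
  Rating     = c(7.0, 8.0, 9.0, 8.0,
                 7.9, 8.0, 8.1, 8.0,
                 6.5, 8.0, 9.5, 8.0),
  stringsAsFactors = FALSE
)

# Split ratings by age rating, then summarise centre and spread per group
groups <- split(toy$Rating, toy$Age.Rating)
rating_spread <- data.frame(
  Age.Rating = names(groups),
  Median     = sapply(groups, median),
  SD         = sapply(groups, sd),
  IQR        = sapply(groups, IQR),
  row.names  = NULL,
  stringsAsFactors = FALSE
)

# The group with the smallest standard deviation is the most consistent
most_consistent <- rating_spread$Age.Rating[which.min(rating_spread$SD)]

print(rating_spread)
print(most_consistent)
```

Similar medians across groups would support "no difference", while the smallest SD or IQR identifies the most consistent group.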
# Defining the top countries and genre to avoid data errors
top_countries <- c("Japan", "USA", "South Korea", "UK", "Canada", "Australia")
top_genres <- c("Superhero", "Action", "Horror", "Romance", "Fantasy", "Sci-Fi", "Comedy")
# filter and count comics per country and genre combination
heatmap_data <- comics %>%
filter(Country.of.Origin %in% top_countries,
Primary.Genre %in% top_genres) %>%
count(Country.of.Origin, Primary.Genre)
# Heatmap shows genre production by country
heatmap_plot <- ggplot(heatmap_data, aes(x = Country.of.Origin, y = Primary.Genre, fill = n,
text = paste("Country:", Country.of.Origin,
"<br>Genre:", Primary.Genre,
"<br>Count:", n))) +
geom_tile(color = "white")+ # White border between tiles
geom_text(aes(label = n), color = "white", size = 3)+ # Count labels on tiles
scale_fill_gradient(low = "#B8D4F5",
high = "#1B2A4A",
name = "Number of Comics") + # Color scale
labs(title= "Comic Genre by Country",
x = "Country of Origin",
y = "Primary Genre") +
theme_comic()
ggplotly(heatmap_plot,tooltip = "text") %>%
layout(
paper_bgcolor ="white",
plot_bgcolor = "white"
)
The USA’s main comic genre is superheroes, which is unsurprising given how that genre pioneered comics. While the USA has massive success in one genre, Japan offers a wider variety. The West put all its eggs in one basket, and Japan focused more on genre diversity, thereby influencing the rise of Manga worldwide.
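The "eggs in one basket" contrast can be expressed as a number. One possible approach, sketched here under stated assumptions, is to compute each country's largest genre share and the Shannon entropy of its genre mix. The table `toy_counts` and the helpers `top_share()` and `shannon()` are hypothetical and not part of the original analysis; the counts are invented.

```r
# Hypothetical genre counts per country; numbers are invented for illustration
toy_counts <- data.frame(
  Country = c("USA", "USA", "USA", "Japan", "Japan", "Japan"),
  Genre   = c("Superhero", "Action", "Horror", "Superhero", "Action", "Horror"),
  n       = c(80, 10, 10, 30, 40, 30),
  stringsAsFactors = FALSE
)

# Share of a country's comics that belong to its single largest genre
top_share <- function(n) max(n) / sum(n)

# Shannon entropy of the genre mix (higher means more evenly spread)
shannon <- function(n) {
  p <- n / sum(n)
  p <- p[p > 0]  # drop empty genres so that 0 * log(0) is treated as 0
  -sum(p * log(p))
}

counts_by_country <- split(toy_counts$n, toy_counts$Country)
diversity <- data.frame(
  Country   = names(counts_by_country),
  Top.Share = sapply(counts_by_country, top_share),
  Entropy   = sapply(counts_by_country, shannon),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(diversity)
```

With the real `heatmap_data`, a high top share and low entropy for the USA alongside the opposite pattern for Japan would support the diversity argument.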
# Calculating average rating per country(excluding combined entries)
# keeping only top 10 rated countries
country_ratings <- comics %>%
filter(!str_detect(Country.of.Origin, "/")) %>% #Remove combined entries like France/Iran
group_by(Country.of.Origin) %>%
summarize(Avg.Rating = mean(Rating..out.of.10., na.rm = TRUE)) %>%
arrange(desc(Avg.Rating)) %>%
slice_max(Avg.Rating, n = 10)
# Lollipop chart showing average rating per country
lolipop_chart <- ggplot(country_ratings, aes(x = Avg.Rating, y = reorder(Country.of.Origin, Avg.Rating),
text = paste("Country:", Country.of.Origin,
"<br>Avg Rating:", round(Avg.Rating, 2)))) +
geom_segment(aes (x = 7.8, xend = Avg.Rating, #drawing the stick
y = reorder(Country.of.Origin, Avg.Rating),
yend = reorder(Country.of.Origin, Avg.Rating)),
color = "#5B8DD9", linewidth = 1) +
geom_point(color = "#1B2A4A", size = 4)+ # Draw the dot
scale_x_continuous(limits = c(7.8,8.3),
breaks = seq(7.8,8.3, by = 0.1)) + # Custom Axis Range
labs(title = "Average Comic Rating by Country",
x = "Average Rating (out of 10)",
y = "Country")+
theme_comic()
ggplotly(lolipop_chart, tooltip = "text") %>%
layout(
paper_bgcolor = "white",
plot_bgcolor = "white"
)
In Japan’s case, high production volume goes hand in hand with high average ratings, whereas the USA lags in ratings despite its large output.
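Whether volume and rating move together across countries can be checked with a rank correlation between each country's comic count and its average rating. This is a minimal sketch on an invented data frame `toy` with placeholder country names; it is not the original analysis, and Spearman correlation is only one reasonable choice here.

```r
# Hypothetical per-comic records; countries and ratings are invented
toy <- data.frame(
  Country = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D"),
  Rating  = c(8.4, 8.2, 8.3, 8.1, 8.0, 8.1, 7.9, 7.8, 8.0, 7.7),
  stringsAsFactors = FALSE
)

# Number of comics and average rating per country
ratings_by_country <- split(toy$Rating, toy$Country)
country_summary <- data.frame(
  Country    = names(ratings_by_country),
  Count      = sapply(ratings_by_country, length),
  Avg.Rating = sapply(ratings_by_country, mean),
  row.names  = NULL,
  stringsAsFactors = FALSE
)

# Spearman rank correlation between volume and average rating
rho <- cor(country_summary$Count, country_summary$Avg.Rating, method = "spearman")

print(country_summary)
print(rho)
```

A value near 1 would mean that countries producing more comics also tend to have higher average ratings; a value near 0 would mean the pattern seen for Japan does not generalise.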
# Counting Comics per coloring style, sorted by frequency
color_counts <- comics %>%
count(Theme..Color.Style.) %>%
arrange(desc(n))
#Convert to named vector and scale to waffle rendering
#( Dividing by 50 so each square represent 50 comics)
color_vec <- setNames(round(color_counts$n/ 50), color_counts$Theme..Color.Style.)
#Waffle charts showing proportion of colors
waffle(color_vec, rows = 8,
title = "Comic Book Color Styles",
xlab = "1 square = ~50 comics",
colors = c("#1B2A4A", "#2E5FA3", "#5B8DD9",
"#7C5CBF", "#A78BDB", "#3ABFA8",
"#E8C547", "#E8543A"))
The black-and-white colour scheme dominates the medium, which is no surprise given the growth of Manga.
# Filtering to top 6 countries and their comic counts per publication status
sankey_data <- comics %>%
filter(Country.of.Origin %in% c("Japan", "USA", "South Korea", "UK", "Canada", "Australia")) %>%
count(Country.of.Origin, Status)
# Alluvial (Sankey-style) diagram showing flow from country to publication status
ggplot(sankey_data, aes(axis1 = Country.of.Origin, axis2= Status, y = n)) +
geom_stratum() + # Draw the blocks
geom_alluvium(aes(fill = Country.of.Origin), alpha = 0.7) + # Draw the flow
geom_text(stat = "stratum", aes(label= after_stat(stratum))) + # label blocks
scale_x_discrete(limits = c("Country", "Status")) +# Labeling the axes
scale_fill_manual(values = c("Japan" = "#1B2A4A", # Ink Black
"USA" = "#2E5FA3", # Hero Blue
"South Korea" = "#7C5CBF", # Villain Purple
"UK" = "#3ABFA8", # Power Teal
"Canada" = "#E8C547", # Action Gold
"Australia" = "#5B8DD9")) + # Sky Blue
labs(title= "Comic Publication Status By Country",
y = "Number of Comics") +
theme_comic()
The USA has plenty of ongoing comics. However, Japan still has many comics that are either cancelled or on hiatus, which is likely due to the harsh working conditions and standards in Manga magazines.
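Because the countries differ in total output, comparing shares within each country is more informative than comparing raw counts. The sketch below assumes a made-up table `toy_status` shaped like `sankey_data`; the status labels ("Ongoing", "Cancelled", "Hiatus") and all counts are assumptions for illustration.

```r
# Hypothetical counts per country and status; labels and numbers are invented
toy_status <- data.frame(
  Country = c("Japan", "Japan", "Japan", "USA", "USA", "USA"),
  Status  = c("Ongoing", "Cancelled", "Hiatus", "Ongoing", "Cancelled", "Hiatus"),
  n       = c(50, 30, 20, 70, 20, 10),
  stringsAsFactors = FALSE
)

# Cross-tabulate counts: one row per country, one column per status
status_tab <- xtabs(n ~ Country + Status, data = toy_status)

# Convert counts to within-country shares (each row sums to 1)
status_share <- prop.table(status_tab, margin = 1)

print(round(status_share, 2))
```

Row-wise shares make it possible to say, for example, what fraction of a country's comics is cancelled or on hiatus regardless of how many comics it publishes.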
# Scatterplot showing relationship between page count and rating
# colored by age rating with a linear trend line
set.seed(42)
comics_sample <- comics %>% slice_sample(n = 500)
# Calculating Linear Trend
lm_fit <- lm(Rating..out.of.10. ~ Page.Count, data = comics_sample)
trend_line <- data.frame(
Page.Count= seq(0, 1000, length.out = 100)
)
trend_line$Rating..out.of.10. <- predict(lm_fit, newdata = trend_line)
scatter_plot <- ggplot(comics_sample, aes(x = Page.Count, y = Rating..out.of.10. , color = Age.Rating,
text = paste("Page Count:", Page.Count,
"<br>Rating:", Rating..out.of.10.,
"<br>Age Rating:", Age.Rating))) +
geom_point(alpha = 0.6, size = 2.5) + # Transparent points for overplotting
coord_cartesian(xlim = c(0, 1000)) +
scale_color_manual(values = c("All Ages" = "#3ABFA8",
"Teen+" = "#2E5FA3",
"Young Adult" = "#E8C547",
"Mature" = "#7C5CBF",
"Mature 17+" = "#E8543A"),
name = "Age Rating") +
geom_line(data = trend_line, aes(x = Page.Count, y = Rating..out.of.10.), color = "black", linewidth = 1, inherit.aes = FALSE) + # Fitted linear trend line
labs(title = "Page Count vs Rating by Age Rating",
x = "Page Count",
y = "Rating(out of 10)") +
theme_comic()
ggplotly(scatter_plot, tooltip = "text") %>%
layout(
paper_bgcolor = "white",
plot_bgcolor = "white",
font = list(family = "nunito")
)
The graph demonstrates a positive relationship between page count and comic rating: comics with more pages tend to be rated higher. However, the slope is gentle, indicating that page count is not a particularly strong predictor of quality.
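The strength of this relationship can be judged from the fitted model itself rather than from the plot alone: the slope gives the expected rating change per page, and R-squared gives the share of rating variance that page count explains. The sketch below fits the same kind of model on simulated data (`toy`, with a made-up relationship), so its numbers say nothing about the real dataset.

```r
# Simulated stand-in for `comics_sample`; the relationship is invented
set.seed(1)
toy <- data.frame(Page.Count = seq(50, 1000, by = 50))
toy$Rating <- 7.5 + 0.0005 * toy$Page.Count + rnorm(nrow(toy), sd = 0.5)

# Same model form as lm_fit: rating as a linear function of page count
fit <- lm(Rating ~ Page.Count, data = toy)
fit_summary <- summary(fit)

slope     <- unname(coef(fit)["Page.Count"])                    # rating change per extra page
r_squared <- fit_summary$r.squared                              # share of variance explained
p_value   <- fit_summary$coefficients["Page.Count", "Pr(>|t|)"] # test of slope = 0

cat("Slope per 100 pages:", round(100 * slope, 3), "\n")
cat("R-squared:", round(r_squared, 3), "\n")
cat("p-value:", signif(p_value, 3), "\n")
```

Running `summary(lm_fit)` on the report's own model would give the corresponding values; a small R-squared would confirm that page count is a weak predictor even if the slope is positive.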
# Calculate Average rating per country and award status
dumbbell_data <- comics %>%
filter(Country.of.Origin %in% c("Japan", "USA", "South Korea", "UK", "Canada", "Australia")) %>%
group_by(Country.of.Origin, Award.Status) %>%
summarize(Avg.Rating = mean(Rating..out.of.10., na.rm = TRUE), .groups = "drop")
# Pivot to wide format so Winner and Nominee are separate columns
dumbbell_wide <- dumbbell_data %>%
filter(Award.Status != "None") %>%
pivot_wider(names_from = Award.Status, values_from = Avg.Rating)
# Dumbbell chart comparing Winner vs Nominee ratings by country
dumbbell_plot <- ggplot(dumbbell_wide, aes(y = reorder(Country.of.Origin, Winner),
text = paste(
"Country:", Country.of.Origin,
"<br>Winner Avg:", round(Winner,2),
"<br>Nominee Avg:", round(Nominee, 2))
)) +
geom_segment(aes(x = Nominee, xend = Winner, # Draw line between points
yend = reorder(Country.of.Origin, Winner)),
color = "grey50", linewidth = 1) +
geom_point(aes(x = Nominee, color = "Nominee"), size = 4) + # Nominee dot
geom_point(aes(x = Winner, color = "Winner"), size = 4) + # Winner dot
scale_color_manual(values = c("Winner" = "#1B2A4A", "Nominee" = "#5B8DD9"),
name = "Award Status") +
labs(title = "Average Rating: Award Winners vs Nominees by Country",
x = "Average Rating",
y = "Country")+
theme_comic()
ggplotly(dumbbell_plot, tooltip ="text") %>%
layout(
paper_bgcolor = "white",
plot_bgcolor = "white"
)
Winners and nominees have similar ratings, except in Canada.
This analysis shows that comic production has grown significantly, that Japan and the USA dominate the industry, and that neither age rating nor page count significantly affects comic quality.
The following prompts were used for code optimization and problem solving: