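#BIOPICS
#Load the fivethirtyeight package (source of the biopics data), dplyr for wrangling, and ggplot2 for plotting
library(fivethirtyeight)
library(dplyr)
library(ggplot2)
#Create data frame using dplyr that shows year, count of males and females in biopics for that year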
df <- biopics %>%
  select(year_release, subject_sex) %>%
  filter(year_release >= 1915) %>%
  group_by(year_release) %>%
  summarise(female_count = sum(subject_sex == "Female"),
            male_count = sum(subject_sex == "Male")) %>%
  arrange(year_release)
#Make a new data frame that has a column specifying the sex for graphing purposes
df2 <- rbind(
  data.frame(year_release = df$year_release, count = df$male_count, type = "Male"),
  data.frame(year_release = df$year_release, count = df$female_count, type = "Female"))
#Plot the data
ggplot(df2, aes(x = year_release, y = count, fill = type)) +
geom_bar(position = "identity", stat = "identity", width = .5) +
xlab(NULL) + ylab(NULL) + ggtitle("Biopic Subjects Are Mostly Male", subtitle =
"Number of male and female subjects in 676 biopics since 1915") +
scale_fill_manual(values = c("#634299","#FFBA6D")) +
theme(legend.position = c(.22, .77), legend.title = element_blank(),
plot.title = element_text(color = "#3C3C3C", size = 20, face = "bold", family =
"Helvetica"),
plot.title.position = "plot",
plot.subtitle = element_text(color = "#3C3C3C", size = 15.5, family = "Helvetica"),
plot.background=element_rect(fill="#F0F0F0"),
plot.margin = unit(c(.4, .4, .4, .4), "cm"),
axis.text = element_text(size = 12, family = "Courier"),
axis.ticks = element_blank(),
legend.background = element_rect(colour = "black", fill = "#F0F0F0", linetype="solid"),
legend.key.size = unit(4, 'line'),
legend.key.height = unit(.35, 'line'),
legend.key.width = unit(1, 'line'),
legend.text = element_text(face = "italic", size = 10),
legend.spacing.x = unit(.3, 'cm'),
panel.grid.major=element_line(colour="#D7D7D7"),
panel.grid.minor=element_line(colour="#D7D7D7"),
panel.background = element_rect(fill = "#F0F0F0")) +
scale_y_continuous(breaks = seq(0, 35, 5), minor_breaks = seq(0, 35, 10)) +
scale_x_continuous(breaks = seq(1920, 2020, 10), minor_breaks = seq(1920, 2020, 10),
labels=c("1920" = "1920", "1930" = "'30", "1940" = "'40", "1950" = "'50",
"1960" = "'60", "1970" = "'70", "1980" = "'80", "1990" = "'90",
"2000" = "2000", "2010" = "'10", "2020" = "'20")) +
guides(fill = guide_legend(reverse = TRUE, byrow = TRUE)) +
geom_hline(aes(yintercept = 0))
First, I loaded the fivethirtyeight package, which contains the biopics dataset, along with dplyr for data wrangling and ggplot2 for visualization. For the wrangling, I created a new data frame with three columns (year, female_count, male_count): I kept releases from 1915 onward, grouped by year, used summarise to count the subjects of each sex per year, and arranged the rows by year in ascending order. I then built a second data frame in long form, holding the release year, the count of subjects, and the sex in a separate type column, so that I could map fill = type when graphing. Finally, I used ggplot2 to create the biopics graph, with year on the x axis and count on the y axis. Mapping fill to type colors the bars by sex, and because the bars use position = "identity", the male and female bars for each year are drawn over one another rather than stacked. The rest of the code handles the design, color, and positioning of the individual elements so that the graph looks as close as possible to the original.
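The reshaping step described above could also be done in a single call rather than with rbind. The following is a minimal sketch, assuming the tidyr package is installed (the original script does not load it) and using a small made-up `toy` data frame in place of the real per-year summary; the numbers are purely illustrative and are not taken from the biopics data.

```r
library(tidyr)

# Hypothetical stand-in for the per-year summary (values are made up)
toy <- data.frame(
  year_release = c(1990, 1991),
  female_count = c(2, 3),
  male_count   = c(7, 9)
)

# Stack the two count columns into (type, count) pairs: one row per year and sex
toy_long <- pivot_longer(
  toy,
  cols = c(female_count, male_count),
  names_to = "type",
  values_to = "count"
)

# Turn "female_count"/"male_count" into the labels used for the fill legend
toy_long$type <- ifelse(toy_long$type == "female_count", "Female", "Male")

print(toy_long)
```

The result has the same long layout as the rbind version (year, count, type), so it could be passed to ggplot in the same way.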
The graph shows the number of male and female subjects in 676 biopics for each year from 1915 to 2014. As the graph shows, the counts for both sexes were relatively low and roughly equal until about 1970. After 1970, however, male subjects began to dominate, appearing in many more biopics than female subjects through 2014.
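One way to check the before-and-after-1970 reading numerically is to total the counts on each side of the cutoff and compare the male share. This is only a sketch: the `toy` data frame and its values are made up for illustration, and the 1970 cutoff is taken from the description above rather than derived from the data.

```r
library(dplyr)

# Hypothetical stand-in for the per-year summary (values are made up)
toy <- data.frame(
  year_release = c(1950, 1960, 1980, 2000),
  female_count = c(2, 1, 3, 5),
  male_count   = c(2, 3, 9, 20)
)

# Label each year relative to the 1970 cutoff, then total the counts per period
share_by_period <- toy %>%
  mutate(period = ifelse(year_release < 1970, "before 1970", "1970 and later")) %>%
  group_by(period) %>%
  summarise(
    female = sum(female_count),
    male = sum(male_count),
    male_share = male / (female + male)  # fraction of subjects who are male
  )

print(share_by_period)
```

Running the same pipeline on the real per-year summary would give the actual shares for the two periods.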
Title of Article: ‘Straight Outta Compton’ Is The Rare Biopic Not About White Dudes
Link to Article: Link