Marianna Zhang

This homework is due by Tuesday, January 15th, 8:00pm.

Upload a zipped folder to Canvas called 1-visualization_homework.zip which contains three files:
• 1-visualization_homework.Rmd
• 1-visualization_homework.html
• 1-visualization_homework.pdf

Instructions

In this homework, you’ll write a short blog post about a data set. Your goal is to tell us something interesting using a well-crafted, thoughtfully-prepared data graphic. One data graphic should suffice, but you may include more if you choose (not more than 3 though). Feel free to make plots with multiple panels by using the patchwork package we’ve discussed (or one of the alternatives such as cowplot).

Your blog post should be short (between 100 and 500 words). We envision an introductory paragraph that explains your findings and provides some context to your data, the data graphic(s), and then a caption-like paragraph providing more detail about what to look for in the data graphic and how to interpret it. That is it. You will not earn more points by including more words or data graphics. What we are looking for is something that is insightful and well-crafted.

Here are some examples of articles that are similar in spirit to yours. Most of these are much longer than yours will be, but the idea is similar: use a good data graphic to tell us something we don’t already know.

Data

You are free to use whatever data you want. However, the purpose of this exercise is to learn how to make good plots – not to wrangle data (we’ll do that next). So we don’t want you to spend much time wrangling data. There are perfectly good data sets available through R packages that are already well-curated. Here is a list of packages with data sets.

fivethirtyeight: provides access to data sets that drive many articles on FiveThirtyEight
nycflights13: data about flights leaving from the three major NYC airports in 2013
NHANES: data from the US National Health and Nutrition Examination Survey
Lahman: comprehensive historical archive of major league baseball data
fueleconomy: fuel economy data from the EPA, 1985–2015
datasets: package that contains a large number of data sets

For example, to take a look at the datasets in the fivethirtyeight package, you can do the following:

# install the package 
install.packages("fivethirtyeight") 

# load the package 
library("fivethirtyeight")

# take a look at the data sets that come with the package
data(package = "fivethirtyeight")

# take a look at the help file to get more information about the different data sets (not 
# all packages have help files)
help("fivethirtyeight")

# the "fivethirtyeight" vignette provides a detailed overview of the different data sets; 
# open it with this command
vignette("fivethirtyeight", package = "fivethirtyeight")

# to load a particular data set into your environment (e.g. bechdel; replace with the name 
# of the data set you'd like to load), run the following 
df.data <- bechdel

Note that I’ve set the code chunk option for the code block above to eval=FALSE. This way, the code is not evaluated. You can find out more about the different chunk options here.

Women to Watch (Out) For: Visualizations of the Representation of Women in Film

Load packages

Add the package with the data set that you’d like to load below.

library("knitr")
library("tidyverse")
library("fivethirtyeight")

Load the data set

df.data <- bechdel

Description

The representation of women on screen has been historically limited and troubled. A rule-of-thumb measure for the representation of women in films was developed by Alison Bechdel (author of the amazing graphic memoir Fun Home) in her comic strip Dykes to Watch Out For. The Bechdel test assesses whether the film: 1) has at least 2 female characters, 2) who talk to each other, 3) about something other than a man. If a film meets all three criteria, it passes the Bechdel test. The Bechdel test is by no means a perfect measure of female representation on screen (for example, films that focus on a solitary female character are a threat to the test’s specificity), but it is a useful rule of thumb that can be easily applied to a variety of movies.
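
The three criteria are applied in order, so a film is labelled by the first step it fails. A minimal sketch of that logic follows; the function and argument names are hypothetical, and the short codes are the clean_test values that the recoding step later in this post translates into labels. The "dubious" code is left out because it marks unclear cases rather than a step of the test.

```r
# Hypothetical helper, not part of the original analysis: apply the three
# Bechdel criteria in order and report the first one a film fails.
bechdel_result <- function(n_women, women_talk, topic_not_a_man) {
  if (n_women < 2) return("nowomen")      # step 1: at least 2 female characters
  if (!women_talk) return("notalk")       # step 2: they talk to each other
  if (!topic_not_a_man) return("men")     # step 3: about something other than a man
  "ok"                                    # all three criteria met
}

# A film with three female characters who talk, but only about a man
bechdel_result(n_women = 3, women_talk = TRUE, topic_not_a_man = FALSE)
```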

The bechdel dataset is a dataset from the fivethirtyeight package that contains information on various films from the 1970s to the present, their Bechdel test status (as drawn from the crowdsourced Bechdel Test website), and their financial performance (budget and gross). The bechdel dataset was used in the FiveThirtyEight article “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”.

Here, I generate two visualizations of the bechdel dataset, which suggest that the representation of women in films (as measured by the Bechdel test) has been improving over time, and that improved representation of women in films may actually be associated with higher rather than lower rates of return on investment.

Data Preparation

#### Hide warnings and messages
knitr::opts_chunk$set(warning=FALSE, message=FALSE)

#### calculate total international + domestic gross (2013 adjusted) and return on investment per dollar spent (2013 adjusted) for each film
df.data <- df.data %>% 
  mutate(totalgross_2013 = intgross_2013 + domgross_2013,
         return = totalgross_2013 / budget_2013)

#### recode and order Bechdel test results
df.data$clean_test <- df.data$clean_test %>% 
  recode("nowomen" = "Fails - <2 female characters", 
         "notalk" = "Fails - female characters don't talk to each other",
         "men" = "Fails - female characters only talk to each other about men",
         "dubious" = "Dubious - unclear whether it passes",
         "ok" = "Passes") %>% 
  factor(levels = c("Fails - <2 female characters", 
                    "Fails - female characters don't talk to each other", 
                    "Fails - female characters only talk to each other about men", 
                    "Dubious - unclear whether it passes", 
                    "Passes"))

#### add decade as categorical variable
df.data <- df.data %>% 
  mutate(decade = case_when(
    year %in% 1970:1979 ~ "1970s",
    year %in% 1980:1989 ~ "1980s",
    year %in% 1990:1999 ~ "1990s",
    year %in% 2000:2009 ~ "2000s",
    year %in% 2010:2019 ~ "2010s"
  ))
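
The decade labels above could also be derived arithmetically rather than by listing each range. The sketch below is an alternative, not what the analysis uses, and decade_label is a hypothetical helper name. It rounds each year down to the nearest multiple of ten and appends "s".

```r
# Hypothetical helper: derive a decade label from a numeric year using
# integer division. Missing years stay missing.
decade_label <- function(year) {
  ifelse(is.na(year),
         NA_character_,
         paste0((year %/% 10) * 10, "s"))
}

decade_label(c(1970, 1985, 1999, 2013))
```

Unlike the case_when() version, this labels any year, including ones outside 1970 to 2019, so the explicit ranges are the safer choice if out-of-range years should show up as NA.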

Figures

#### Set visualizations theme
theme_set(
  theme_classic() + 
    theme(text = element_text(size = 14)) 
)

#### stacked bar plot of how films do on the Bechdel test by decade
ggplot(df.data, aes(x = decade, fill = clean_test)) + 
  geom_bar(position = "fill") +
  scale_fill_brewer(type = "seq", palette = 1, direction = 1, aesthetics = "fill", name = "Bechdel result") +
  ylab("Proportion of films released") + 
  theme(legend.position = "bottom") + 
  guides(fill = guide_legend(nrow = 5))

A stacked bar plot of Bechdel test results by decade, with each bar showing the proportion of that decade’s films in each result category. The proportion of films that pass the Bechdel test has grown decade by decade, which is great news for the representation of women on screen! The stacked bars also show that fewer films fail at step 2 of the test: fewer films have 2 or more female characters who never talk to each other.
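
To make the bar heights concrete, here is a small sketch of the calculation that position = "fill" performs: count films per decade and result, then divide by each decade’s total. The toy data frame is made up for illustration and is not the bechdel data.

```r
# Made-up toy data standing in for the decade and clean_test columns
toy <- data.frame(
  decade = c("1970s", "1970s", "1970s", "1970s",
             "2010s", "2010s", "2010s", "2010s"),
  result = c("Fails", "Fails", "Fails", "Passes",
             "Fails", "Fails", "Passes", "Passes")
)

# Count films per decade and result, then divide each row by its row total
# so that every decade sums to 1
counts <- table(toy$decade, toy$result)
shares <- prop.table(counts, margin = 1)
shares
```

Each row of shares corresponds to one stacked bar, and each cell to the height of one coloured segment.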

#### bar plot of median return on investment, by Bechdel test result
ggplot(df.data, aes(x = fct_rev(clean_test), y = return, fill = fct_rev(clean_test))) +
  stat_summary(fun.y = "median", 
               geom = "bar") + 
  stat_summary(fun.ymin = function(z) { quantile(z,0.25) },
               fun.ymax = function(z) { quantile(z,0.75) },
               geom = "linerange",
               size = 1) +
  xlab("Bechdel result") +
  ylab("Rate of return on investment (2013 adjusted)") +
  scale_fill_brewer(type = "seq", palette = 1, direction = -1, name = "Bechdel result", guide = "none") +
  coord_flip()

A bar plot of the median rate of return on investment (how much is earned per dollar spent, 2013 adjusted) by Bechdel test result, with lines spanning the 25th to 75th percentiles. The bars show that the rate of return appears to grow as films pass successive steps of the Bechdel test. The visualization suggests that films with better representation of women are not a worse investment than films with worse representation; in fact, they may even be a better one.
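
As a rough sketch of what the plot summarizes, the snippet below computes the rate of return (gross divided by budget) and then the median and quartiles per group, which are the quantities drawn by the two stat_summary() layers. The figures are invented for illustration and are not taken from the bechdel data.

```r
# Made-up figures in dollars (not the real data)
toy <- data.frame(
  result = c("Fails", "Fails", "Fails", "Passes", "Passes", "Passes"),
  gross  = c(20e6, 60e6, 90e6, 30e6, 80e6, 150e6),
  budget = c(10e6, 40e6, 30e6, 10e6, 20e6, 50e6)
)

# Rate of return: dollars earned per dollar spent
toy$return <- toy$gross / toy$budget

# Median (bar height) and 25th/75th percentiles (line range) per group
medians <- tapply(toy$return, toy$result, median)
q25 <- tapply(toy$return, toy$result, quantile, probs = 0.25, names = FALSE)
q75 <- tapply(toy$return, toy$result, quantile, probs = 0.75, names = FALSE)

data.frame(median = medians, q25 = q25, q75 = q75)
```

The median is used rather than the mean because a few films with very high returns would otherwise dominate the comparison.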

Session info

Information about this R session including which version of R was used, and what packages were loaded.

sessionInfo()

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2        fivethirtyeight_0.4.0 forcats_0.3.0        
##  [4] stringr_1.3.1         dplyr_0.7.8           purrr_0.2.5          
##  [7] readr_1.2.1           tidyr_0.8.2           tibble_2.0.0         
## [10] ggplot2_3.1.0         tidyverse_1.2.1       knitr_1.21           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5   xfun_0.4           haven_2.0.0       
##  [4] lattice_0.20-38    colorspace_1.3-2   htmltools_0.3.6   
##  [7] yaml_2.2.0         rlang_0.3.1        pillar_1.3.1      
## [10] glue_1.3.0         withr_2.1.2        RColorBrewer_1.1-2
## [13] modelr_0.1.2       readxl_1.1.0       bindr_0.1.1       
## [16] plyr_1.8.4         munsell_0.5.0      gtable_0.2.0      
## [19] cellranger_1.1.0   rvest_0.3.2        evaluate_0.12     
## [22] labeling_0.3       broom_0.5.0        Rcpp_1.0.0        
## [25] scales_1.0.0       backports_1.1.3    jsonlite_1.6      
## [28] hms_0.4.2          digest_0.6.18      stringi_1.2.4     
## [31] grid_3.5.2         cli_1.0.1          tools_3.5.2       
## [34] magrittr_1.5       lazyeval_0.2.1     crayon_1.3.4      
## [37] pkgconfig_2.0.2    xml2_1.2.0         lubridate_1.7.4   
## [40] assertthat_0.2.0   rmarkdown_1.11     httr_1.3.1        
## [43] rstudioapi_0.8     R6_2.3.0           nlme_3.1-137      
## [46] compiler_3.5.2

Grading Rubric

There are 15 possible points for this homework.

Baseline

+1 for an .Rmd file that compiles without errors
+1 for describing the dataset
+1 for having a plot
+1 for including the code that generated the plot
+1 for describing the visual mapping (i.e. a key)

Average

+1 for hiding unnecessary messages from R so that they are not displayed in the HTML
+1 for including a catchy and/or engaging title
+1 for having at least 100 words and no more than 500 words
+1 for explaining in a single coherent sentence what we can learn from this graphic
+1 for explaining the choice of geometric mapping

Advanced

+1 for blog post text that provides context or background useful in interpreting the graphic
+0-4 WOW factor: awarded at the grader’s discretion for submissions that are exceptionally compelling