loading libraries

library(tidyverse)

## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.6.2

## Warning: package 'stringr' was built under R version 3.6.3

## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dslabs)

## Warning: package 'dslabs' was built under R version 3.6.3

library(ggsci)

## Warning: package 'ggsci' was built under R version 3.6.3

reading in data

data("research_funding_rates")
str(research_funding_rates)

## 'data.frame':    9 obs. of  10 variables:
##  $ discipline         : chr  "Chemical sciences" "Physical sciences" "Physics" "Humanities" ...
##  $ applications_total : num  122 174 76 396 251 183 282 834 505
##  $ applications_men   : num  83 135 67 230 189 105 156 425 245
##  $ applications_women : num  39 39 9 166 62 78 126 409 260
##  $ awards_total       : num  32 35 20 65 43 29 56 112 75
##  $ awards_men         : num  22 26 18 33 30 12 38 65 46
##  $ awards_women       : num  10 9 2 32 13 17 18 47 29
##  $ success_rates_total: num  26.2 20.1 26.3 16.4 17.1 15.8 19.9 13.4 14.9
##  $ success_rates_men  : num  26.5 19.3 26.9 14.3 15.9 11.4 24.4 15.3 18.8
##  $ success_rates_women: num  25.6 23.1 22.2 19.3 21 21.8 14.3 11.5 11.2

summary(research_funding_rates)

##   discipline        applications_total applications_men applications_women
##  Length:9           Min.   : 76.0      Min.   : 67.0    Min.   :  9       
##  Class :character   1st Qu.:174.0      1st Qu.:105.0    1st Qu.: 39       
##  Mode  :character   Median :251.0      Median :156.0    Median : 78       
##                     Mean   :313.7      Mean   :181.7    Mean   :132       
##                     3rd Qu.:396.0      3rd Qu.:230.0    3rd Qu.:166       
##                     Max.   :834.0      Max.   :425.0    Max.   :409       
##   awards_total      awards_men     awards_women   success_rates_total
##  Min.   : 20.00   Min.   :12.00   Min.   : 2.00   Min.   :13.4       
##  1st Qu.: 32.00   1st Qu.:22.00   1st Qu.:10.00   1st Qu.:15.8       
##  Median : 43.00   Median :30.00   Median :17.00   Median :17.1       
##  Mean   : 51.89   Mean   :32.22   Mean   :19.67   Mean   :18.9       
##  3rd Qu.: 65.00   3rd Qu.:38.00   3rd Qu.:29.00   3rd Qu.:20.1       
##  Max.   :112.00   Max.   :65.00   Max.   :47.00   Max.   :26.3       
##  success_rates_men success_rates_women
##  Min.   :11.4      Min.   :11.20      
##  1st Qu.:15.3      1st Qu.:14.30      
##  Median :18.8      Median :21.00      
##  Mean   :19.2      Mean   :18.89      
##  3rd Qu.:24.4      3rd Qu.:22.20      
##  Max.   :26.9      Max.   :25.60

Research Funding Rate Analysis

tidying data

long <- research_funding_rates %>%
  gather(applications,totals,applications_total:success_rates_women)

success_rates <- long %>% 
  filter(applications %in% c("success_rates_men","success_rates_women"))

Funding Rates Plot

funding_rate_plot <- success_rates %>%
  ggplot(aes(x=discipline,y=totals,fill=applications)) +
  geom_bar(stat="identity",position="dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  scale_fill_lancet() +
  labs(x="Discipline",y="Success Rate (%)",title="Funding Success Rates by Gender and Discipline")
funding_rate_plot

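This chart shows rough parity between the success rates of men and women across disciplines: the mean success rate is 19.2% for men and 18.9% for women, although in individual disciplines the two rates differ by up to about 10 percentage points in either direction. There is certainly room for improvement, but no field excludes either gender.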
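The parity claim can be checked numerically. The following is a minimal sketch, assuming the same `research_funding_rates` data loaded above; the column name `gap_women_minus_men` is an illustrative name introduced here, not part of the dataset.

```r
library(dplyr)
library(dslabs)

data("research_funding_rates")

# Gap in success rate per discipline, in percentage points.
# Positive values mean women were funded at a higher rate than men.
# gap_women_minus_men is a hypothetical column name used for illustration.
rate_gap <- research_funding_rates %>%
  transmute(
    discipline,
    success_rates_men,
    success_rates_women,
    gap_women_minus_men = success_rates_women - success_rates_men
  ) %>%
  arrange(gap_women_minus_men)

rate_gap
```

Sorting by the gap puts the disciplines that favor men at the top and those that favor women at the bottom, which makes it easy to see that the differences run in both directions.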

However, funding rates don’t tell the whole story.

Awards vs Submissions Analysis

tidying data

application_rates <- long %>% 
  filter(applications %in% c("applications_men","applications_women","awards_men","awards_women")) %>%
  mutate(gender = if_else(applications %in% c("applications_women","awards_women"),"Women","Men")) %>%
  mutate(type = if_else(applications %in% c("applications_women","applications_men"),"Applications","Awards"))

plot

application_rate_plot <- application_rates %>%
  ggplot(aes(x=gender,y=totals,fill=type)) +
  facet_wrap(~discipline) +
  geom_bar(stat="identity",position="identity",alpha=0.5) +
  theme(
    strip.text.x = element_text(size = 10),
    legend.title = element_blank()) +
  scale_fill_lancet() +
  labs(x="Gender",y="Number of Applications/Awards",title="Applications and Awards by Gender and Discipline")
application_rate_plot

This shows that while award rates are fairly balanced by gender, women submit fewer applications than men in every discipline except the medical sciences, and in some cases (Physics, Chemical Sciences, Technical Sciences) the gap is extreme. This suggests a “pipeline” issue at one or more stages: women are not being recruited into the sciences, are not completing higher education in STEM fields, are not pursuing academic or research careers after graduation, and/or are not being given the opportunity to lead projects as Principal Investigator.
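The participation gap described above can also be expressed as the share of applications and awards that went to women in each discipline. This is a rough sketch under the assumption that the same `research_funding_rates` data is used; `pct_applications_women` and `pct_awards_women` are illustrative column names, not columns of the dataset.

```r
library(dplyr)
library(dslabs)

data("research_funding_rates")

# Share of applications and awards attributed to women, per discipline.
# Both pct_* columns are hypothetical names introduced for this example.
women_share <- research_funding_rates %>%
  transmute(
    discipline,
    pct_applications_women = 100 * applications_women / applications_total,
    pct_awards_women = 100 * awards_women / awards_total
  ) %>%
  arrange(pct_applications_women)

women_share
```

A value below 50 in `pct_applications_women` means men submitted more applications than women in that discipline, so the sorted table lists the most unbalanced disciplines first.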

Notes on plot design:

Used facet_wrap() instead of facet_grid() to get the discipline titles to show.
Overlaid Awards on Applications (position="identity" with alpha=0.5) rather than stacking them, so the Awards bar sits inside the Applications bar and shows the portion of Applications that became Awards.
Used stat="identity" because the underlying data is already aggregated, so each bar height is taken directly from the totals column. With one row per award, which is what I typically work with, I would use the default stat="count" instead.
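The difference between the two stats can be sketched with a small made-up example. The data frames `awards_raw` and `awards_agg` below are hypothetical and exist only for illustration; they are not part of the analysis above.

```r
library(ggplot2)

# Hypothetical un-aggregated data: one row per award.
awards_raw <- data.frame(gender = c("Men", "Men", "Men", "Women", "Women"))

# The same information, pre-aggregated into one row per gender.
awards_agg <- as.data.frame(table(gender = awards_raw$gender),
                            responseName = "n")

# stat = "count" (the geom_bar() default) counts rows for each x value.
p_count <- ggplot(awards_raw, aes(x = gender)) +
  geom_bar()

# stat = "identity" uses the y values as given, so the data must
# already contain the bar heights.
p_identity <- ggplot(awards_agg, aes(x = gender, y = n)) +
  geom_bar(stat = "identity")
```

Both plots draw bars of the same height; the only difference is whether ggplot2 or the analyst does the counting.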

Data 110 Midterm

Katelyn Schreyer
