Setting Up

We will be exploring a dataset called “admissions”, which has the description: “Gender bias among graduate school admissions to UC Berkeley.” The description does not say which gender the bias favors, so let us rejoice in the name of Data Analysis, because now we will try to find that out ourselves.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("dslabs")
data("admissions")

Understanding the Data

From what we are able to see using the str (structure) function, there are 4 columns, with very condensed information. We are given two numeric columns, admitted and applicants, stratified by two character variables: major and gender.

str(admissions)
## 'data.frame':    12 obs. of  4 variables:
##  $ major     : chr  "A" "B" "C" "D" ...
##  $ gender    : chr  "men" "men" "men" "men" ...
##  $ admitted  : num  62 63 37 33 28 6 82 68 34 35 ...
##  $ applicants: num  825 560 325 417 191 373 108 25 593 375 ...

Preparing the Data

We ultimately want to answer the question: “What exactly is the gender disparity?” To do this, we will add a column that holds the acceptance rate for each gender and major, and create a visualization based on our findings.

newadmin <- admissions %>% mutate(acceptance_rate = round((admitted / applicants) * 100, digits = 2)) 

Visualizing the Data

acc_plot <- ggplot(data = newadmin, aes(x = major, y = acceptance_rate, color = gender, size = acceptance_rate)) +
  geom_point() +
  labs(x = "Major", y = "Acceptance Rate", title = "UC Berkeley Graduate School Acceptance Rates based on Major and Gender", color = "Gender") +
  guides(size = "none") +
  theme_linedraw()
acc_plot

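Notice the confusing outlier at major “B”: the computed rate for women is 272%, which no acceptance rate can be. A likely cause is that the dslabs documentation describes admitted as the percent of applicants admitted rather than a count, so dividing it by the 25 women who applied to major B inflates the value. The same caveat applies to every value of acceptance_rate, so the numbers should be read with caution. Rather than untangle that here, we revisualize, putting more emphasis on majors A, C, D, E, and F.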
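
The size of that outlier can be reproduced by hand. The sketch below is a minimal check, not part of the original analysis: it rebuilds only the major A and B rows shown in the str() output (the data frame name check is made up for illustration), applies the same formula, and flags any rate above 100. It rests on one assumption worth stating plainly: the dslabs help page for admissions describes admitted as the percent of applicants admitted, which would explain why dividing by applicants gives values that are not real rates.

```r
# Rows for majors A and B only, copied from the str(admissions) output above.
# `check` is a hypothetical name used just for this illustration.
check <- data.frame(
  major      = c("A", "B", "A", "B"),
  gender     = c("men", "men", "women", "women"),
  admitted   = c(62, 63, 82, 68),
  applicants = c(825, 560, 108, 25)
)

# Same formula as the mutate() call that builds newadmin.
check$acceptance_rate <- round((check$admitted / check$applicants) * 100, digits = 2)

# A real acceptance rate must fall between 0 and 100, so anything above is suspect.
impossible <- check[check$acceptance_rate > 100, ]
print(impossible)
```

If the documented meaning of admitted holds, the column can be compared between genders directly, without dividing by applicants.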

adminnob <- newadmin %>% filter(major != "B")
acc_plot2 <- ggplot(data = adminnob, aes(x = major, y = acceptance_rate, color = gender, size = acceptance_rate)) +
  geom_point() +
  labs(x = "Major", y = "Acceptance Rate", title = "UC Berkeley Graduate School Acceptance Rates based on Major and Gender", color = "Gender") +
  guides(size = "none") +
  theme_light() +
  scale_y_continuous()
acc_plot2

The disparity in major A is obvious, so let us take a closer look at the rest.

adminnoab <- newadmin %>% filter(!major %in% c("A", "B"))
acc_plot3 <- ggplot(data = adminnoab, aes(x = major, y = acceptance_rate, color = gender)) +
  geom_point() +
  labs(x = "Major", y = "Acceptance Rate", title = "UC Berkeley Graduate School Acceptance Rates based on Major and Gender", color = "Gender") +
  guides(size = "none") +
  theme_light() +
  scale_y_continuous()
acc_plot3

Interpreting Our Findings

The graph illustrates that women have a higher acceptance rate at the UC Berkeley Graduate School for majors A, D, and F, while men have higher rates for majors C and E. The reading for major B is unreliable, since the computed rate for women exceeds 100%. It would be nice to know exactly which fields the major letters stand for (a graduate school has far more than six majors). This would give us some insight into the disparities within each field. That information could then be connected to other datasets to shed more light on the issue.
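
The per-major comparison can also be read off numerically rather than from the plot. The sketch below is illustrative only and makes two assumptions: that admitted is already a percentage, as the dslabs help page for admissions states, and that rows 7 to 10 of the data are women in majors A to D, which the str() output and the row indices used earlier suggest. It uses only values visible in that output, so majors E and F are left out, and the names shown and gap are hypothetical.

```r
# Majors A to D for both genders, copied from the str(admissions) output above.
# Women's values for majors E and F are truncated there, so they are omitted.
shown <- data.frame(
  major    = rep(c("A", "B", "C", "D"), times = 2),
  gender   = rep(c("men", "women"), each = 4),
  admitted = c(62, 63, 37, 33, 82, 68, 34, 35)
)

# Assumption: admitted is the percent admitted, so it is compared directly.
men   <- shown[shown$gender == "men", ]
women <- shown[shown$gender == "women", ]

# Both subsets list the majors in the same order, so the rows line up.
gap <- data.frame(
  major           = men$major,
  women_minus_men = women$admitted - men$admitted
)
print(gap)
```

A positive value means the women's percentage is higher for that major, a negative value means the men's is higher.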