First, we need to load the dslabs package, which contains the data set we will use, along with the tidyverse.
We will use the data set named “admissions”. It disguises the names of the majors, and it lets us review gender bias in graduate school admissions at UC Berkeley.
library(dslabs)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  major gender admitted applicants acceptance_rate
1     A    men       62        825      0.07515152
2     B    men       63        560      0.11250000
3     C    men       37        325      0.11384615
4     D    men       33        417      0.07913669
5     E    men       28        191      0.14659686
6     F    men        6        373      0.01608579
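The code that produced this preview is not shown, so here is a minimal sketch of how the acceptance_rate column could be computed. It assumes the column is simply admitted divided by applicants, which reproduces the printed values (for example, 62 / 825 = 0.0752). The men_rows data frame and the admissions1_sketch name are hypothetical stand-ins built from the six rows printed above; the post itself works on the full admissions data from dslabs. One caveat: the dslabs help page describes admitted as the percent of students admitted, so it is worth checking whether admitted / 100 is the quantity you actually want.

```r
library(dplyr)

# Hypothetical stand-in for the rows printed above (men only);
# the post uses the full `admissions` data set from dslabs.
men_rows <- data.frame(
  major      = c("A", "B", "C", "D", "E", "F"),
  gender     = "men",
  admitted   = c(62, 63, 37, 33, 28, 6),
  applicants = c(825, 560, 325, 417, 191, 373)
)

# Assumption: acceptance_rate is admitted divided by applicants,
# which matches the printed column.
admissions1_sketch <- men_rows |>
  mutate(acceptance_rate = admitted / applicants)

head(admissions1_sketch)
```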
In the first two majors, women have much higher acceptance rates than men (the preview above shows only the men's rows), but the comparison may be skewed by how many applicants of each gender applied to each major.
Total Applicants
Let's combine both applicant pools within each major so that we can compute the acceptance rate of men and women relative to the major's total applicants.
# A tibble: 6 × 2
  major total_applicants
  <chr>            <dbl>
1 A                  933
2 B                  585
3 C                  918
4 D                  792
5 E                  584
6 F                  714
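The step that built total_applicants_by_major is not shown either. Here is a minimal sketch, assuming it groups by major and sums applicants across both genders. The applicants_sketch data frame is hypothetical: the men's counts come from the earlier preview, and the women's counts are back-calculated from the totals printed above (for example, 933 - 825 = 108 for major A).

```r
library(dplyr)

# Hypothetical stand-in for the applicant counts in `admissions`.
applicants_sketch <- data.frame(
  major      = rep(c("A", "B", "C", "D", "E", "F"), times = 2),
  gender     = rep(c("men", "women"), each = 6),
  applicants = c(825, 560, 325, 417, 191, 373,   # men (printed earlier)
                 108,  25, 593, 375, 393, 341)   # women (total minus men)
)

# Assumption: the totals are the sum of applicants within each major.
total_applicants_by_major_sketch <- applicants_sketch |>
  group_by(major) |>
  summarize(total_applicants = sum(applicants))

total_applicants_by_major_sketch
```

The back-calculated counts also show how uneven the pools are: major B, for instance, has 560 men and only 25 women applying.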
Bring the new column back into the data set:
admissions_with_total <- admissions1 |>
  left_join(total_applicants_by_major, by = "major")
head(admissions_with_total)
  major gender admitted applicants acceptance_rate total_applicants
1     A    men       62        825      0.07515152              933
2     B    men       63        560      0.11250000              585
3     C    men       37        325      0.11384615              918
4     D    men       33        417      0.07913669              792
5     E    men       28        191      0.14659686              584
6     F    men        6        373      0.01608579              714
  major gender admitted applicants acceptance_rate total_applicants acceptance_rate_with_total
1     A    men       62        825      0.07515152              933                0.066452304
2     B    men       63        560      0.11250000              585                0.107692308
3     C    men       37        325      0.11384615              918                0.040305011
4     D    men       33        417      0.07913669              792                0.041666667
5     E    men       28        191      0.14659686              584                0.047945205
6     F    men        6        373      0.01608579              714                0.008403361
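The extra acceptance_rate_with_total column in this preview appears to be admitted divided by total_applicants (for example, 62 / 933 = 0.0665). Here is a minimal sketch under that assumption. The joined_rows data frame and the admissions3_sketch name are hypothetical stand-ins built from the printed men's rows, not the post's own objects.

```r
library(dplyr)

# Hypothetical stand-in for the joined rows printed above (men only).
joined_rows <- data.frame(
  major            = c("A", "B", "C", "D", "E", "F"),
  gender           = "men",
  admitted         = c(62, 63, 37, 33, 28, 6),
  total_applicants = c(933, 585, 918, 792, 584, 714)
)

# Assumption: the new rate divides admitted by the major's total
# applicants (both genders), which matches the printed column.
admissions3_sketch <- joined_rows |>
  mutate(acceptance_rate_with_total = admitted / total_applicants)

head(admissions3_sketch)
```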
Now let's plot the data and see what we find:
library(ggrepel)   # provides geom_text_repel()
library(ggthemes)  # provides theme_calc()

ggplot(admissions3, aes(x = total_applicants, y = acceptance_rate_with_total, label = gender)) +
  geom_point(aes(color = major), size = 6) +
  geom_text_repel(nudge_x = 0.001) +
  labs(
    x = "Total Applicants",
    y = "Acceptance Rate",
    title = "Comparing Gender Acceptance Rates by Major"
  ) +
  scale_color_brewer(palette = "Set2") +
  guides(color = guide_legend(title = "Major")) +
  theme_calc()
Results
Using the UC Berkeley admissions data set, we can compare the acceptance rates of men and women within each major. I used a scatter plot (geom_point) with the acceptance rate out of total applicants on the y-axis and the total applicants on the x-axis. Each point is labeled by gender and colored by major, so the two genders can be compared within the same major. The plot shows that which gender has the higher acceptance rate depends on the major. In most majors the two rates are similar; major “A” is the exception.