ABOUT THE DATASET:
This dataset was created by Seung Min Song, who took the information
from the following website:
The dataset represents admissions data at the University of
California, Berkeley in 1973 according to the variables department (A,
B, C, D, E, F), gender (male, female), and outcome (admitted or rejected).
Was there gender bias during the application process?
# Load the libraries
library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ dplyr 1.0.9
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
# Import the data from github.
# Link is provided to the csv file below:
# https://github.com/enidroman/data_607_data_aquisition_and_management_project/blob/main/Seung%20Min%20Song%20Untidy%20Dataset%20Admit.csv
urlfile <- "https://raw.githubusercontent.com/enidroman/data_607_data_aquisition_and_management_project/main/Seung%20Min%20Song%20Untidy%20Dataset%20Admit.csv"
admit_reject <- read.csv(urlfile)
admit_reject
##    Gender Dept Admitted Rejected
## 1    Male    A      512      313
## 2  Female    A       89       19
## 3    Male    B      353      207
## 4  Female    B       17        8
## 5    Male    C      120      205
## 6  Female    C      202      391
## 7    Male    D      138      279
## 8  Female    D      131      244
## 9    Male    E       53      138
## 10 Female    E       94      299
## 11   Male    F       22      351
## 12 Female    F       24      317
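The analysis that follows works on a long-format data frame named `outcome`, but the step that builds it is not shown. A minimal sketch of how it could be produced, assuming tidyr's `pivot_longer()` and the column names that the later code uses (`Outcome`, `Number_of_Applicants`). The wide data are retyped from the table above so the example runs without a network connection; this is an illustration, not necessarily the original tidying step:

```r
library(tidyr)

# Wide data retyped from the printed table (stands in for read.csv(urlfile)).
admit_reject <- data.frame(
  Gender   = rep(c("Male", "Female"), times = 6),
  Dept     = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  Admitted = c(512, 89, 353, 17, 120, 202, 138, 131, 53, 94, 22, 24),
  Rejected = c(313, 19, 207, 8, 205, 391, 279, 244, 138, 299, 351, 317)
)

# Stack the two count columns into one Outcome / Number_of_Applicants pair.
outcome <- pivot_longer(
  admit_reject,
  cols      = c(Admitted, Rejected),
  names_to  = "Outcome",
  values_to = "Number_of_Applicants"
)

nrow(outcome)  # 24 rows: 12 gender/department pairs x 2 outcomes
```

Each row of `outcome` then holds one count for one combination of gender, department, and outcome, which is the shape `aggregate()` expects in the analysis.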
ANALYSIS
No analysis was requested in the discussion, but I created my own
analysis.
On average, more applicants were rejected than admitted.
# Use aggregate() to compute a summary statistic for subsets of the data: the mean number of applicants by Outcome (Admitted, Rejected).
group_mean <- aggregate(x = outcome$Number_of_Applicants,
by = list(outcome$Outcome),
FUN = mean)
colnames(group_mean) <- c("Outcome", "Mean")
group_mean
## Outcome Mean
## 1 Admitted 146.2500
## 2 Rejected 230.9167
More applicants were male than female.
Please note: I don’t know why Female appears twice in this
data frame. For some reason the Female count comes out to 10. I
checked everything and it seems fine.
# Use aggregate() to compute a summary statistic for subsets of the data: the mean number of applicants by Gender (Female, Male).
group_mean <- aggregate(x = outcome$Number_of_Applicants,
by = list(outcome$Gender),
FUN = mean)
colnames(group_mean) <- c("Gender", "Mean")
group_mean
## Gender Mean
## 1 Female 181.00
## 2 Female 12.50
## 3 Male 224.25
Please note: I don’t know why the counts below are wrong.
sum(outcome$Gender=='Female')
## [1] 10
sum(outcome$Gender=='Male')
## [1] 0
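One plausible explanation for the duplicate Female group and the zero Male count is stray whitespace in the Gender column of the CSV, so that labels which look the same on screen do not compare as equal. This has not been checked against the file itself. The sketch below uses a hypothetical vector that reproduces the symptom and shows how base R's `trimws()` would clean it:

```r
# Hypothetical values that mimic the symptom; the real CSV may differ.
gender <- c("Male ", "Female", "Male ", "Female ", "Male ", "Female")

# Visually identical labels fail an exact comparison when padded.
sum(gender == "Male")    # 0: every "Male " carries a trailing space
sum(gender == "Female")  # 2: the padded "Female " is not counted

# unique() with quotes makes the stray whitespace visible.
print(unique(gender), quote = TRUE)

# trimws() strips leading and trailing whitespace.
gender_clean <- trimws(gender)
table(gender_clean)
```

If this is the cause, applying `outcome$Gender <- trimws(outcome$Gender)` before the aggregations should merge the two Female groups into one.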
Dept A had the most applicants and Dept E the fewest.
# Use aggregate() to compute a summary statistic for subsets of the data: the mean number of applicants by Dept (A, B, C, D, E, F).
group_mean <- aggregate(x = outcome$Number_of_Applicants,
by = list(outcome$Dept),
FUN = mean)
colnames(group_mean) <- c("Dept", "Mean")
group_mean
## Dept Mean
## 1 A 233.25
## 2 B 146.25
## 3 C 229.50
## 4 D 198.00
## 5 E 146.00
## 6 F 178.50
Please note: again I have an extra set of Female Admitted and Rejected
rows, and I don’t know why.
There were 557 female applicants admitted and 1278 female applicants
rejected.
There were 1198 male applicants admitted and 1493 male applicants
rejected.
# Use aggregate() to sum the number of applicants by two variables, Outcome and Gender.
list_aggregate <- aggregate(outcome$Number_of_Applicants, by = list(outcome$Outcome, outcome$Gender), FUN = sum)
colnames(list_aggregate) <- c("Outcome", "Gender", "Number_of_Applicants")
list_aggregate
## Outcome Gender Number_of_Applicants
## 1 Admitted Female 540
## 2 Rejected Female 1270
## 3 Admitted Female 17
## 4 Rejected Female 8
## 5 Admitted Male 1198
## 6 Rejected Male 1493
As the graph below shows, more male applicants than female applicants
were admitted.
# Bar graph showing the number of applicants by Outcome and Gender.
graph <- ggplot(outcome, aes(x = Outcome, y = Number_of_Applicants, fill = Gender)) +
geom_col(position = "dodge")
graph

CONCLUSION
In my analysis I observed what looks like gender bias during the
application process, since more men were admitted than women. But as
the University of California, Berkeley states, more men than women
applied. As for whether the women were applying for admission to
harder departments, I have yet to see, since the data do not name the
departments that the applicants applied to; they are only listed as
A, B, C, D, E, and F. Further investigation has to be conducted to see
whether this application process was actually gender biased.
According to the University of California, Berkeley, an analysis of just
the variables gender and admissions shows a correlation that suggests
gender bias: the proportion of women admitted was significantly lower
than the proportion of men admitted. However, when the department
variable is taken into account, the gender bias disappears. Generally,
the women were applying for admission in the harder departments, those
with low admission rates.
A data set in which a correlation between two variables disappears,
or even reverses, when a third variable is taken into account is known
as Simpson’s paradox.
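The effect described above can be checked directly from the counts in the imported table by comparing the overall admission rate per gender with the rate inside each department. The following is a minimal base R sketch; the data are retyped from the printed table, and the names `Rate`, `totals`, and `by_dept` are introduced here for illustration only:

```r
# Counts retyped from the printed admit_reject table.
admit_reject <- data.frame(
  Gender   = rep(c("Male", "Female"), times = 6),
  Dept     = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  Admitted = c(512, 89, 353, 17, 120, 202, 138, 131, 53, 94, 22, 24),
  Rejected = c(313, 19, 207, 8, 205, 391, 279, 244, 138, 299, 351, 317)
)

# Admission rate (percent) for each gender/department pair.
admit_reject$Rate <- 100 * admit_reject$Admitted /
  (admit_reject$Admitted + admit_reject$Rejected)

# Overall rate per gender, ignoring department.
totals <- aggregate(cbind(Admitted, Rejected) ~ Gender,
                    data = admit_reject, FUN = sum)
totals$Rate <- 100 * totals$Admitted / (totals$Admitted + totals$Rejected)
totals

# Rate per department, one column per gender (one row feeds each cell).
by_dept <- tapply(admit_reject$Rate,
                  list(admit_reject$Dept, admit_reject$Gender), mean)
round(by_dept, 1)

# Number of departments where the female rate exceeds the male rate.
sum(by_dept[, "Female"] > by_dept[, "Male"])
```

With these counts the overall rate is roughly 30% for women and 45% for men, yet women have the higher rate in four of the six departments (A, B, D, and F). That pattern is consistent with the paradox described above, although it does not by itself establish why applicants chose the departments they did.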