UCBAdmissionReport

library(pivottabler)  # PivotTable
library(ggplot2)      # ggplot(), geom_bar()
library(scales)       # percent axis labels
admission_report <- read.csv("../UCBAdmissions.csv")
head(admission_report)

##     Status Gender Department
## 1 Admitted   Male          A
## 2 Rejected   Male          A
## 3 Admitted Female          C
## 4 Rejected Female          C
## 5 Admitted   Male          D
## 6 Rejected   Male          D

Introduction:

The problem at hand is that UC Berkeley has been accused of admitting applicants on the basis of gender: more men than women were accepted into its programs. However, we need to analyze the data further to see whether there is another explanation. Did UC Berkeley admit students by gender, or is there an underlying factor that needs to be brought to light?

Below is a table of the overall acceptance/rejection by gender:

pt <- PivotTable$new()
pt$addData(admission_report)
pt$addColumnDataGroups("Status")
pt$addRowDataGroups("Gender")
pt$defineCalculation(calculationName = "sums", summariseExpression = "n()")
pt$renderPivot()

A plot that represents Acceptance/Rejections by gender:

plot_graph <- ggplot(data=admission_report, aes(x=Status, fill=Gender))
plot_graph <- plot_graph + geom_bar() + labs(title="Acceptance/Rejection by Gender")
plot_graph

At first glance, it looks as though far fewer women than men were accepted into the programs, which may suggest gender discrimination. But this is only an overall view. To understand why, we need to see which departments these students applied to and check whether the pattern within departments matches the overall view.
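To put numbers on this overall view, here is a minimal sketch of the acceptance rate by gender. It assumes that R's built-in UCBAdmissions table (from the datasets package) holds the same counts as our CSV file, so it runs without admission_report. The names overall and overall_rate are only for illustration.

```r
# Assumption: the built-in datasets::UCBAdmissions table (Admit x Gender x Dept)
# contains the same counts as the CSV read into admission_report.
overall <- apply(UCBAdmissions, c(1, 2), sum)   # Admit x Gender counts

# Share of each gender's applicants who were admitted
overall_rate <- overall["Admitted", ] / colSums(overall)

overall
round(overall_rate, 3)
```

Under that assumption, roughly 45% of male applicants and 30% of female applicants were admitted, which is the gap the bar chart suggests.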

Some Questions:

How many men and women applied to each department?
Did each department accept more men than women?
What is the lurking variable?
From these trends, did UC Berkeley use gender discrimination in their admission process?

How many men and women applied to each department?

plot_graph <- ggplot(data=admission_report, aes(x=Department, fill=Gender))
plot_graph <- plot_graph + geom_bar(position="fill") + theme_classic() + 
  scale_y_continuous(labels=scales::percent) +
  labs(title="Normalized Graph representing Gender within each department", x="Department", y="Percentage")
plot_graph

This graph gives context for the statistical findings of Peter Bickel: “…there was a statistically significant gender bias in favour of women for 4 out of the 6 departments” (Grigg). However, the gender balance of applicants is far from even across departments, and the imbalance is largest in Departments A and B, where men make up most of the applicants.
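The proportions in the graph can also be read off as counts. The sketch below tabulates applicants by gender and department, again assuming that the built-in UCBAdmissions table matches our CSV file. The names applicants and share_women are illustrative.

```r
# Assumption: datasets::UCBAdmissions holds the same counts as our CSV file.
applicants <- apply(UCBAdmissions, c(2, 3), sum)   # Gender x Dept counts

# Women as a share of all applicants within each department
share_women <- applicants["Female", ] / colSums(applicants)

applicants
round(share_women, 2)
```

This answers the question directly: the table gives the number of men and women who applied to each department, and share_women corresponds to the Female portion of each bar.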

Let’s take a look at how many acceptances and rejections there were for each gender by department:

Table of acceptances and rejections for each gender within departments:

pt <- PivotTable$new()
pt$addData(admission_report)
pt$addColumnDataGroups("Department")
pt$addColumnDataGroups("Gender")
pt$addRowDataGroups("Status")
pt$defineCalculation(calculationName = "sums", summariseExpression = "n()")
pt$renderPivot()

When we break this data down further (by department, status and gender), we see a new and clearer picture.

Departments A and B did not receive many applications from women, but both had more acceptances than rejections for men and women alike. Though very few women applied, most were accepted. For example, of the 108 women who applied to Dept. A, 89 were accepted. Although the Normalized Graph representing Gender within each department shows women as a small share of Departments A and B, that is because very few women applied to those departments. Departments C-F, by contrast, appear to be more selective: for both genders, rejections clearly outnumber acceptances. Perhaps these departments had certain criteria that needed to be met, only accepted a certain number of students into the program, or simply had a very low acceptance rate.
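The counts in the pivot table can be turned into acceptance rates for each gender within each department. This is a minimal sketch that assumes the built-in UCBAdmissions table matches our CSV file. It compares raw rates only and is not a significance test. The names admit_rate and women_higher are illustrative.

```r
# Assumption: datasets::UCBAdmissions holds the same counts as our CSV file.
# Normalize over Admit within each Gender x Dept cell, then keep the admitted share.
admit_rate <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]

round(admit_rate, 3)

# Departments where women's acceptance rate exceeds men's
women_higher <- admit_rate["Female", ] > admit_rate["Male", ]
names(which(women_higher))
```

For Dept. A, the Female entry is 89/108, matching the example above.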

In summary, splitting the graph by Department lets us see that the trend reflects not gender discrimination but departments that may have criteria to be met or lower acceptance rates, since rejections outnumber acceptances for both genders in Departments C-F.

After analyzing this data, we can confirm the findings of Bickel’s team and see Simpson’s paradox at work. “Bickel’s team discovered that women tended to apply to departments that admitted a smaller percentage of applicants overall, and that this hidden variable affected the marginal values for the percentage of accepted applicants in such a way as to reverse the trend that existed in the data as a whole. Essentially, the conclusion flipped when Bickel’s team changed their data-viewpoint to account for the school being divided into departments!” (Grigg). Because some departments accepted only a very small percentage of their applicants, and women tended to apply to those departments, the rejections there weighed heavily on the overall acceptance rate for women and produced what looked like gender discrimination.

Normalized graph of Status and Gender by each Department:

plot_data <- ggplot(data=admission_report, aes(x=Department, fill=Gender))
plot_data <- plot_data + geom_bar(position="fill") + theme_classic() +
  facet_grid(cols=vars(Status)) + 
  labs(title="Graph Representing Status and Gender by Each Dept.", x="Department")
plot_data

Non-Normalized graph of Status and Gender by each Department:

plot_data <- ggplot(data=admission_report, aes(x=Department, fill=Gender))
plot_data <- plot_data + geom_bar() + theme_classic() +
  facet_grid(cols=vars(Status)) + 
  labs(title="Graph Representing Status and Gender by Each Dept.", x="Department")
plot_data

Using a normalized plot allows us to compare proportions, while a non-normalized plot allows us to compare actual numbers by Department. Both are important ways to clarify information, but the normalized graph makes the comparison between genders easier to see because departments of different sizes are put on the same scale.

What is Simpson’s paradox?

Simpson’s paradox happens when the data as a whole tell one story, but accounting for a hidden variable reveals a different one. “Simpson’s paradox arises when there are hidden variables that split data into multiple separate distributions” (Grigg). Here, the hidden variable is Department. Organizing the data only by status and gender gives a partial summary of what is happening. Organizing the data by department takes into account that each department may have certain criteria, a lower acceptance rate, or a limit on the number of students it can take in. By adding this hidden variable, we obtain more information than we had originally, and we can make better comparisons and give better explanations.
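One way to see how the hidden variable works is to write each gender's overall acceptance rate as a weighted average of the department rates, weighted by where that gender applied. The sketch below does this under the assumption that the built-in UCBAdmissions table matches our CSV file. The names dept_share, admit_rate and overall_rate are illustrative.

```r
# Assumption: datasets::UCBAdmissions holds the same counts as our CSV file.
applicants <- apply(UCBAdmissions, c(2, 3), sum)                 # Gender x Dept counts
admit_rate <- prop.table(UCBAdmissions, c(2, 3))["Admitted", , ] # rate per Gender x Dept

# How each gender's applications are spread across departments (rows sum to 1)
dept_share <- prop.table(applicants, margin = 1)

# Overall rate = sum over departments of (share of applications * acceptance rate)
overall_rate <- rowSums(dept_share * admit_rate)

round(dept_share, 2)
round(overall_rate, 3)

# Share of each gender's applications that went to Departments C-F
round(rowSums(dept_share[, c("C", "D", "E", "F")]), 2)
```

Because most of the women's weight falls on Departments C-F, where acceptance rates are low for everyone, their overall rate comes out lower even though their rate is higher in four of the six departments.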

A lurking variable is a hidden but important variable that can change the conclusions drawn from the data. Once it is found, it offers a new perspective and a deeper understanding of what the data mean. When we include Department in our graphs, we learn more about the admission status of both genders within each department. Had we not organized the data by Department, we would be left with a graph that leads us to infer gender discrimination. So, Department is our lurking variable.

The UC Berkeley admissions case is an example of Simpson’s paradox because, if we look only at the overall acceptance/rejection results by gender, it is clear that more men than women were accepted into Berkeley, but there is no clear answer as to why. Finding the lurking variable and deriving information from the plots that include it is what makes this scenario a Simpson’s paradox: when we add the Department variable, we unveil another side of the story, which shows us there is no gender discrimination.

Summary:

Looking at Acceptance/Rejection by Gender alone does not give enough information to declare gender discrimination
We need to look at which departments the students applied to, and consider the criteria those departments have
It’s important to analyze the data by gender, status and department to get a more detailed view before making inferences

From this we can conclude:

Simpson’s paradox is shown in this scenario
The lurking variable is the Department variable
Departments A and B accepted more men than women because more men than women applied to those programs
Those who applied to Depts. A and B (regardless of gender) had more acceptances than rejections, and those who applied to Depts. C-F had more rejections than acceptances
Departments C-F we can assume are more selective, as more students were rejected than accepted
- For these departments, we can also infer that there may be certain criteria that need to be met, or that they can only take in a certain number of students, which results in a lower acceptance rate

In conclusion, UC Berkeley did not discriminate by gender. In fact, as Grigg notes, 4 of the 6 departments favored women over men.

Work Cited:

Grigg, Tom. “Simpson’s Paradox and Interpreting Data.” Towards Data Science, 9 December, 2018, https://towardsdatascience.com/simpsons-paradox-and-interpreting-data-6a0443516765. Accessed 4 September, 2023

UCBAdmissionReport_SaiSalian

Sai Salian

2023-09-13