In the fall of 1973, the University of California, Berkeley received 12,763 applications to its graduate programs. Looking only at the overall acceptance rates in a smaller sample of that data, it appears that there was a strong bias toward admitting males over females. In the figure below, we can see that the acceptance rate for males was about 14 percentage points higher than the rate for females.
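The overall gap can be checked with a few lines of arithmetic. The sketch below assumes the "smaller sample" is the widely reproduced six-department table from this case (the one distributed as R's `UCBAdmissions` dataset); the article's own figure may be based on slightly different numbers, and the names `totals` and `acceptance_rate` are illustrative only.

```python
# Aggregate counts over the six largest departments.
# ASSUMPTION: taken from the commonly reproduced six-department table
# (R's UCBAdmissions), not from the article itself.
totals = {
    "male": {"admitted": 1198, "applicants": 2691},
    "female": {"admitted": 557, "applicants": 1835},
}


def acceptance_rate(group):
    """Fraction of applicants in the group who were admitted."""
    return group["admitted"] / group["applicants"]


rates = {sex: acceptance_rate(counts) for sex, counts in totals.items()}

# Difference between the two rates, expressed in percentage points.
gap_points = (rates["male"] - rates["female"]) * 100

for sex, rate in rates.items():
    print(f"{sex:>6}: {rate:.1%}")
print(f"gap: {gap_points:.1f} percentage points")
```

Under that assumption the rates come out to roughly 44.5% for males and 30.4% for females, a gap of about 14 percentage points.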
While it did look like UC-Berkeley had a clear bias toward male applicants, that is not the full story. To better understand the bigger picture, let's consider how males and females differed in their applications. Using the figure below, we can see that the majority of males applied to departments A and B, while the majority of females applied to departments C, D, E, and F.
When we divide our data into groups to account for confounding variables, we can better understand what the data actually means in the real world. By dividing the admission rates of males and females by department, as seen in the figure below, we can see that females were more likely to be accepted in 4 out of the 6 departments. The real story of the data is that women were more likely to apply to departments with lower acceptance rates. Since the majority of males chose departments A and B, which had the highest acceptance rates, male applicants as a group were accepted at a comparatively high rate. This is not the case for the female applicants: the majority of female applicants were not accepted because of the low overall acceptance rates in the departments they applied to.
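The per-department breakdown can be reproduced with a short script. This is a minimal sketch, again assuming the commonly reproduced six-department counts (as in R's `UCBAdmissions` dataset) rather than the article's own figures; `data`, `rate`, and `applicants` are hypothetical names chosen for illustration.

```python
# (admitted, rejected) counts per department and sex.
# ASSUMPTION: the widely reproduced six-department table for this case.
data = {
    "A": {"male": (512, 313), "female": (89, 19)},
    "B": {"male": (353, 207), "female": (17, 8)},
    "C": {"male": (120, 205), "female": (202, 391)},
    "D": {"male": (138, 279), "female": (131, 244)},
    "E": {"male": (53, 138), "female": (94, 299)},
    "F": {"male": (22, 351), "female": (24, 317)},
}


def rate(admitted, rejected):
    """Acceptance rate for one department/sex cell."""
    return admitted / (admitted + rejected)


def applicants(sex):
    """Total applicants of the given sex across all departments."""
    return sum(sum(dept[sex]) for dept in data.values())


female_ahead = []  # departments where the female rate exceeds the male rate
print("dept  male_rate  female_rate  male_share  female_share")
for name, dept in data.items():
    m_rate, f_rate = rate(*dept["male"]), rate(*dept["female"])
    # Share of each sex's applications that went to this department.
    m_share = sum(dept["male"]) / applicants("male")
    f_share = sum(dept["female"]) / applicants("female")
    if f_rate > m_rate:
        female_ahead.append(name)
    print(f"{name:<4}  {m_rate:>9.0%}  {f_rate:>11.0%}  {m_share:>10.0%}  {f_share:>12.0%}")

print("Departments where the female rate is higher:", female_ahead)
```

With these counts, the female rate is higher in departments A, B, D, and F, while the share columns show male applications concentrated in A and B and female applications concentrated in C through F.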
The UC-Berkeley graduate program acceptance rate controversy is a great example of Simpson’s paradox. Simpson’s paradox occurs when a trend that appears in several groups of data reverses or disappears when the groups are combined. In short, the paradox happens when misleading trends appear in data because not all relevant variables were considered in the analysis.
In the UC-Berkeley controversy, only the overall male and female acceptance rates were viewed and analyzed. However, as shown above, that is not the full story. When we considered acceptance rates by department and compared them to the distribution of male and female applicants across departments, the trend almost completely reversed. In this case, the department an applicant applied to, with its own number of applicants and its own acceptance rate, is known as a lurking or confounding variable. Confounding variables are variables that influence the relationship being studied but are not measured or controlled for in a study or analysis. When these variables are considered, a clearer picture of the real-world trends can be observed.
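One way to see why the department acts as a confounder is to write each overall acceptance rate as a weighted average of department rates. The notation below is introduced here for illustration and is not from the original analysis: $r_{g,d}$ is the acceptance rate of group $g$ (male $m$ or female $f$) in department $d$, and $w_{g,d}$ is the share of group $g$'s applications that went to department $d$.

```latex
\begin{align}
R_g &= \sum_{d} w_{g,d}\, r_{g,d}, \qquad \sum_{d} w_{g,d} = 1 \\
R_m - R_f &= \underbrace{\sum_{d} w_{m,d}\,\bigl(r_{m,d} - r_{f,d}\bigr)}_{\text{within-department differences}}
           + \underbrace{\sum_{d} \bigl(w_{m,d} - w_{f,d}\bigr)\, r_{f,d}}_{\text{differences in where each group applied}}
\end{align}
```

The overall gap $R_m - R_f$ can therefore be positive even when most within-department differences favor women, provided the second term is large enough. That happens when men's applications are weighted toward departments with high acceptance rates and women's toward departments with low ones, which is the pattern described above.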
Science and data analysis are often messy and noisy. When we want to explain how the world works, we must consider all relevant variables and control for any confounding variables. If we do not, our analysis may be oversimplified and its conclusions false. By understanding the UC-Berkeley controversy and Simpson’s paradox, we can avoid making false claims about the real world in our data analysis.
Tom Grigg. “Simpson’s Paradox and Interpreting Data.” Medium, December 2018. https://medium.com/data-science/simpsons-paradox-and-interpreting-data-6a0443516765