Exam 5

library(mosaic)

The following table shows the total number of admitted and rejected applicants to the six largest departments in Berkeley in 1973.

	Admitted	Rejected
Male	1198	1493
Female	557	1278

Use a \(\chi^2\)-test to check whether the admission statistics for Berkeley show any sign of gender discrimination. To enter the table in R you can do:

admit <- matrix(c(1198,557,1493,1278),2,2)
rownames(admit) <- c("Male", "Female")
colnames(admit) <- c("Admitted", "Rejected")
admit <- as.table(admit)

Your analysis should as a minimum contain:

1. Statement of hypotheses:

H0- There is no discrimination between genders being admitted Ha - There is discrimination between genders being admitted

(3)Calculation of expected frequencies

Xsq = chisq.test(admit)
Xsq$expected

##         Admitted Rejected
## Male   1043.4611 1647.539
## Female  711.5389 1123.461

This is the numbers there should be admitted and rejected if males and females had the same rates of admitance.

1. Calculation of test statistic

expected <- matrix(c(1043.461,711.5389,1647.5389,1123.4611),2,2)
rownames(expected) <- c("Male", "Female")
colnames(expected) <- c("Admitted", "Rejected")
expected <- as.table(expected)

chi = (admit[1,1] - expected[1,1])^2/expected[1,1] + (admit[2,1] - expected[2,1])^2/expected[2,1] + (admit[1,2] - expected[1,2])^2/expected[1,2] + (admit[2,2] - expected[2,2])^2/expected[2,2]
'Discrimination'

## [1] "Discrimination"

chi

## [1] 92.20533

We want to measure the discrimination via the Chi-squared statistic.

1. Calculation and interpretation of p-value.

degree of freedom = (row-1)(col-1) = 1

pchisq(chi, 1, lower.tail = FALSE)

## [1] 7.813412e-22

This p value is very small. This means that we can reject the H0, and conclude that there is a gender discrimination.

A more detailed data set with the admissions for each department is available on the course web page. The variables are:

Admit (admitted/rejected)
Gender (male/female)
Dept (department A, B, C, D, E, F)
Freq (freqency of each combination)

Load the data into RStudio:

admission <- read.table("http://asta.math.aau.dk/dan/static/datasets?file=admission.dat", header=TRUE)

Using this data set you have to:

1. Make the saturated model in RStudio: A saturated model is one in which there are as many estimated parameters as data points. By definition, this will lead to a perfect fit, but will be of little use statistically, as you have no data left to estimate variance.

model1 = glm(Freq ~ Dept*Gender*Admit, family=poisson, data = admission)
model1

## 
## Call:  glm(formula = Freq ~ Dept * Gender * Admit, family = poisson, 
##     data = admission)
## 
## Coefficients:
##                    (Intercept)                           DeptB  
##                        4.48864                        -1.65542  
##                          DeptC                           DeptD  
##                        0.81963                         0.38656  
##                          DeptE                           DeptF  
##                        0.05466                        -1.31058  
##                     GenderMale                   AdmitRejected  
##                        1.74969                        -1.54420  
##               DeptB:GenderMale                DeptC:GenderMale  
##                        1.28357                        -2.27046  
##               DeptD:GenderMale                DeptE:GenderMale  
##                       -1.69763                        -2.32269  
##               DeptF:GenderMale             DeptB:AdmitRejected  
##                       -1.83670                         0.79043  
##            DeptC:AdmitRejected             DeptD:AdmitRejected  
##                        2.20464                         2.16617  
##            DeptE:AdmitRejected             DeptF:AdmitRejected  
##                        2.70135                         4.12505  
##       GenderMale:AdmitRejected  DeptB:GenderMale:AdmitRejected  
##                        1.05208                        -0.83205  
## DeptC:GenderMale:AdmitRejected  DeptD:GenderMale:AdmitRejected  
##                       -1.17700                        -0.97009  
## DeptE:GenderMale:AdmitRejected  DeptF:GenderMale:AdmitRejected  
##                       -1.25226                        -0.86318  
## 
## Degrees of Freedom: 23 Total (i.e. Null);  0 Residual
## Null Deviance:       2650 
## Residual Deviance: 1.874e-13     AIC: 207.1

1. Use drop1 to test whether the model can be simplified.

drop1(model1)

## Single term deletions
## 
## Model:
## Freq ~ Dept * Gender * Admit
##                   Df Deviance    AIC
## <none>                  0.000 207.06
## Dept:Gender:Admit  5   20.204 217.26

It has a deviance of zeroa because it does not deviate at all. The drop1 test shows that we can leave out dept:gender:admit, as it has an insignificant p-value

1. Even if the model cannot be simplified; make the simpler model Dept*Gender + Dept*Admit.

model2 = glm(Freq ~ Dept*Gender + Dept*Admit, family=poisson, data = admission)
model2

## 
## Call:  glm(formula = Freq ~ Dept * Gender + Dept * Admit, family = poisson, 
##     data = admission)
## 
## Coefficients:
##         (Intercept)                DeptB                DeptC  
##             4.24232             -1.48155              1.09523  
##               DeptD                DeptE                DeptF  
##             0.60476              0.35202             -1.15268  
##          GenderMale        AdmitRejected     DeptB:GenderMale  
##             2.03325             -0.59346              1.07581  
##    DeptC:GenderMale     DeptD:GenderMale     DeptE:GenderMale  
##            -2.63462             -1.92709             -2.75479  
##    DeptF:GenderMale  DeptB:AdmitRejected  DeptC:AdmitRejected  
##            -1.94356              0.05059              1.20915  
## DeptD:AdmitRejected  DeptE:AdmitRejected  DeptF:AdmitRejected  
##             1.25833              1.68296              3.26911  
## 
## Degrees of Freedom: 23 Total (i.e. Null);  6 Residual
## Null Deviance:       2650 
## Residual Deviance: 21.74     AIC: 216.8

1. Calculate the expected table and residuals for this simpler model. IF model is the fitted model object, the expected values and standardized residuals may be added by (assuming the data is in a data.frame called admission)
1. Find out which department deviates most from this model. Model A deviates the most from this model, because its standardized residuals values are the highest and observed values differ the most from expected ones.
1. Which kind of gender discrimination appears to be present? There is gender discrimination in department A, because there are much less females than males. There is a female discrimination, because expected value of females accepted is greater than observed and expected values of females rejected is less than observed. For males, those numbers are different and therefore there is no male discrimination.

Now we remove department A by removing the rows 1,2,3,4 (assuming the data is in a data.frame called admission and that department A is in the first four rows):

noA <- admission[-(1:4),]
satModel = glm(Freq ~ Dept*Gender*Admit, family=poisson, data = noA)

Use this reduced dataset to conduct a new analysis which as a minimum contains. - (13) Succesive removal of model terms using drop1 and update starting from the saturated model.

1. A graphical representation and interpretation of the final model for this dataset.

In the end, the Intercept is Females in Dept. B and that are admitted. Intercept is our Alpha. We know this because this is a comparison of factors based on a base (Females, Dept.B, admitted). That is why we are left with the rest of the departments, males and rejected.