Math 664 - Methods for Statistical Consulting

HW-04

Student Name: Yao Zhang (31250772)

Read in the data set and take a look at the head of the data.

  IQ Parent SES CollegeYes CollegeNo
1  L      L   L          4       349
2  L      L  LM          2       232
3  L      L  UM          8       166
4  L      L   H          4        48
5  L      H   L         13        64
6  L      H  LM         27        84

Question 1: Load the data into R. Use the tapply and barplot functions to make 3 barplots of student counts by SES

(a) Counts of students that intend to go to college, grouped by SES (b) Counts of all students, grouped by SES

(c) Counts of students, grouped by SES and College plans, i.e. pairs of side-by-side bars for Yes and No, one pair for each of Low, Lower Middle, Upper Middle and High SES

Question 2: From the barplots you made above, what can you say about how students’ college plans vary with socio-economic status (SES)?

From the plots above, we can observe that the number of students that have college plans increase when their socioeconomic status increases, and vice versa.

Question 3: Fit logistic models to the data, with “College plans” as the binary response variable, and IQ, parental encouragement and SES as explanatory variables.

(a) Fit a logistic model with only the main effects, i.e. with the 3 explanatory variables without interactions. Report the results of the model fit. Explain any differences (not including rounding errors) you find between your table and Table X3 of the text.


Call:
glm(formula = cbind(CollegeYes, CollegeNo) ~ IQ + Parent + SES, 
    family = binomial(link = "logit"), data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8509  -0.6815  -0.2569   0.6676   1.6122  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -4.0255     0.1501 -26.819  < 2e-16 ***
IQLM          0.5941     0.1235   4.810 1.51e-06 ***
IQUM          1.3332     0.1194  11.162  < 2e-16 ***
IQH           1.9663     0.1210  16.255  < 2e-16 ***
ParentH       2.4554     0.1014  24.215  < 2e-16 ***
SESLM         0.3560     0.1232   2.889  0.00386 ** 
SESUM         0.6624     0.1196   5.537 3.08e-08 ***
SESH          1.4140     0.1210  11.690  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2262.607  on 31  degrees of freedom
Residual deviance:   25.236  on 24  degrees of freedom
AIC: 186.93

Number of Fisher Scoring iterations: 4
From the regression results above, we can see that all the main effects are siginificant. I extracted a table shown below which is similar to Table X3 of the text. There is no difference between the summary results and Table X3 of the text.
Estimate Std. Error
(Intercept) -4.03 0.15
IQLM 0.59 0.12
IQUM 1.33 0.12
IQH 1.97 0.12
ParentH 2.46 0.10
SESLM 0.36 0.12
SESUM 0.66 0.12
SESH 1.41 0.12

This is because I set the levels for each variable in ascending order before I ran the regression model. By default, R takes “H” as the lowest level for each variable (in alphabetical order). If that’s the case, the results would be different.

# set levels for each factor in ascending order
data$SES <- factor(data$SES, levels = c("L", "LM", "UM", "H"))
data$IQ <- factor(data$IQ, levels = c("L", "LM", "UM", "H"))
data$Parent <- factor(data$Parent, levels = c("L", "H"))

(b) Fit the full logistic model with all the interactions. Report (only) the interactions that are “significant” (p value less than 0.05) and include their estimates and standard errors.

glm(formula = cbind(CollegeYes, CollegeNo) ~ IQ + Parent + SES + 
    IQ:Parent + IQ:SES + Parent:SES + IQ:Parent:SES, family = binomial(link = "logit"), 
    data = data)
Estimate Std. Error Pr(>|z|)
IQLM:ParentH -0.52 0.71 0.46
IQUM:ParentH -0.87 0.69 0.21
IQH:ParentH -0.84 0.71 0.24
IQLM:SESLM 0.06 1.01 0.95
IQUM:SESLM 0.38 0.97 0.70
IQH:SESLM 0.65 0.97 0.50
IQLM:SESUM -1.30 0.82 0.11
IQUM:SESUM -0.77 0.74 0.29
IQH:SESUM -1.48 0.83 0.07
IQLM:SESH -1.09 0.93 0.24
IQUM:SESH -1.15 0.87 0.18
IQH:SESH -0.84 0.91 0.36
ParentH:SESLM 0.74 0.95 0.43
ParentH:SESUM -0.50 0.71 0.48
ParentH:SESH -0.77 0.81 0.34
IQLM:ParentH:SESLM -0.14 1.11 0.90
IQUM:ParentH:SESLM -0.47 1.07 0.66
IQH:ParentH:SESLM -0.54 1.07 0.62
IQLM:ParentH:SESUM 0.75 0.93 0.42
IQUM:ParentH:SESUM 0.58 0.86 0.49
IQH:ParentH:SESUM 1.41 0.94 0.13
IQLM:ParentH:SESH 0.97 1.03 0.35
IQUM:ParentH:SESH 1.52 0.98 0.12
IQH:ParentH:SESH 1.53 1.01 0.13
As shown in the table above, all p-values of the interactions are greater than 0.05, so no interaction is significant. However, if I didn’t set the levels for each variable in ascending order before I ran the regression, there would be some significant interactions, as shown in the table below.
Estimate Std. Error Pr(>|z|)
(Intercept) 2.04 0.14 0.00
IQL -2.42 0.25 0.00
IQLM -1.72 0.20 0.00
IQUM -0.80 0.20 0.00
ParentL -2.79 0.45 0.00
SESL -1.91 0.25 0.00
SESLM -1.34 0.22 0.00
SESUM -1.04 0.20 0.00
IQLM:SESL 0.81 0.36 0.02
IQLM:SESLM 0.63 0.30 0.04
IQL:SESUM 0.76 0.34 0.03

(c) The text says that the difference in 2 log(maximized likelihood) (also called “Deviance”) between the full model and the main effects model is 25.24 with 24 d.f. Use the anova function with your models in (a) and (b) above to confirm this finding. What is the p value for testing if the full model is a better model?

Analysis of Deviance Table

Model 1: cbind(CollegeYes, CollegeNo) ~ IQ + Parent + SES
Model 2: cbind(CollegeYes, CollegeNo) ~ IQ + Parent + SES + IQ:Parent + 
    IQ:SES + Parent:SES + IQ:Parent:SES
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        24     25.236                     
2         0      0.000 24   25.236    0.393

The deviance and degrees of freedom of the two models are shown below:

dev.diff <- deviance(model) - deviance(fullModel)
round(dev.diff, 2)
[1] 25.24
df.diff <- df.residual(model) - df.residual(fullModel)
df.diff
[1] 24

The difference in deviance between the two models is indeed 25.24 with 24 degrees of freedom.

p <- pchisq(dev.diff, df.diff, lower.tail = FALSE)
round(p, 3)
[1] 0.393

The p-value of Chi-squre test is 0.393 > 0.05. So the improvement in the full model is not significant, which means there is probably no interactions.

(d) Using the main effects model, estimate the factor by which the `Yes/No’ ratio is increased for high IQ compared with a low IQ. Give also a 95% confidence interval.

The estimate coefficient for high IQ is:

     IQH 
1.966348 

Calculate the factor as follow:

f <- exp(model$coefficients[["IQH"]])
round(f, 2)
[1] 7.14

And the 95% confidence interval is:

CI <- confint(model, "IQH", 0.95)
Waiting for profiling to be done...
round(exp(CI), 2)
 2.5 % 97.5 % 
  5.65   9.08 

The odds of having a college plan for high IQ students is 7.14 times of the odds for students with a low IQ. The 95% confidence interval is (5.65, 9.08).

Question 4: Explain briefly how you might assess model fit for the main effects model.

The regression equation given by model with main effects is:
logit(p[CollegeYes]) = -4.025 + 0.594*IQLM + 1.333*IQUM + 1.966*IQH + 2.455*ParentH + 0.356*SESLM + 0.662*SESUM + 1.414*SESH

All of the parameters’ coefficients are possitive. This means the odds ratio of having a college plan increases along with the increase of each level of each parameter.

The first parameter is IQ. The coefficient of IQH is the largest, which means the students with high IQ are most likely to have a college plan.

The second parameter is parental encouragement. The result shows students are likely to have a college plan when their parental encouragement is high.

The third parameter is socioeconomic status. The coefficient of SESH is the largest, which means the students in high socioeconomic status are most likely to have a college plan.

This model explains most of the variation of the response variable. Compared to the model with interactions, it is simpler but not simplified.