Question 1. For the Crabs data file http://users.stat.ufl.edu/~aa/cat/data/Crabs.dat, fit the logistic regression model for the probability of a satellite (y= 1) using weight as a quantitative predictor and color as a categorical predictor.

  a. Fit the logistic regression model using both predictors, with no interaction; report the prediction equations relating weight to the probability of a satellite, for each color. Plot them, and interpret.
filename3 <- "http://users.stat.ufl.edu/~aa/cat/data/Crabs.dat"
Crabs <- read.table(file=filename3, header=TRUE)
head(Crabs)
##   crab sat y weight width color spine
## 1    1   8 1   3.05  28.3     2     3
## 2    2   0 0   1.55  22.5     3     3
## 3    3   9 1   2.30  26.0     1     1
## 4    4   0 0   2.10  24.8     3     3
## 5    5   4 1   2.60  26.0     3     3
## 6    6   0 0   2.10  23.8     2     3
fit2=glm(y ~ weight + factor(color), family=binomial(link=logit),data=Crabs)
summary(fit2)
## 
## Call:
## glm(formula = y ~ weight + factor(color), family = binomial(link = logit), 
##     data = Crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1908  -1.0144   0.5101   0.8683   2.0751  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.2572     1.1985  -2.718  0.00657 ** 
## weight           1.6928     0.3888   4.354 1.34e-05 ***
## factor(color)2   0.1448     0.7365   0.197  0.84410    
## factor(color)3  -0.1861     0.7750  -0.240  0.81019    
## factor(color)4  -1.2694     0.8488  -1.495  0.13479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 188.54  on 168  degrees of freedom
## AIC: 198.54
## 
## Number of Fisher Scoring iterations: 4
## Plot of probabilities
cols <- rainbow(3)
# Reference level (color 1, medium light) drawn in black
curve(plogis(coef(fit2)[1] + coef(fit2)[2]*x), from=1, to=5.5, lwd=2,
      ylab="probability", xlab="weight", main="Color as Categories, probability")
for (i in 1:3)
    curve(plogis(coef(fit2)[1] + coef(fit2)[2+i] + coef(fit2)[2]*x),
          from=1, to=5.5, lwd=2, col=cols[i], add=TRUE)
# Legend matches the draw order: black first, then colors 2-4
legend(3.5, .6, col=c("black", cols), lwd=2,
       legend=c("Medium Light", "Medium", "Medium Dark", "Dark"), bg="light gray")

The estimated log odds of a satellite, for a crab of weight x kg, are:

Medium light: logit[P(Y=1)] = -3.2572 + 1.6928x

Medium: logit[P(Y=1)] = -3.1124 + 1.6928x

Medium dark: logit[P(Y=1)] = -3.4433 + 1.6928x

Dark: logit[P(Y=1)] = -4.5266 + 1.6928x

Weight has the same coefficient, 1.6928, for all colors, so the four curves relating weight to P(Y=1) are identical in shape. For each color, a 1 kg increase in weight has a multiplicative effect of exp(1.6928) = 5.43 on the odds that Y=1. At any fixed weight, the estimated odds of a satellite for a medium-light crab are exp(0.1861) = 1.20 times those for a medium-dark crab and exp(1.2694) = 3.56 times those for a dark crab; a medium crab's odds are exp(0.1448) = 1.16 times those of a medium-light crab.
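As a quick check, the same odds ratios can be read off the fitted coefficients (a minimal sketch, assuming fit2 is still in the workspace):

# Odds ratios for colors 2-4 relative to medium light (the reference level)
exp(coef(fit2)[c("factor(color)2", "factor(color)3", "factor(color)4")])
# Multiplicative effect on the odds of a 1 kg increase in weight
exp(coef(fit2)["weight"])  # approximately 5.43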

  b. Add three terms to the model in part (a) to permit interaction between color and weight. Report the prediction equations relating weight to the probability of a satellite, for each color. Plot them, and interpret.
m2 <- update(fit2, ~ weight*factor(color))
m2
## 
## Call:  glm(formula = y ~ weight + factor(color) + weight:factor(color), 
##     family = binomial(link = logit), data = Crabs)
## 
## Coefficients:
##           (Intercept)                 weight         factor(color)2  
##               -1.6203                 1.0483                -0.8320  
##        factor(color)3         factor(color)4  weight:factor(color)2  
##               -6.2964                 0.4335                 0.3613  
## weight:factor(color)3  weight:factor(color)4  
##                2.7065                -0.8536  
## 
## Degrees of Freedom: 172 Total (i.e. Null);  165 Residual
## Null Deviance:       225.8 
## Residual Deviance: 181.7     AIC: 197.7
plot(y ~ weight, data=Crabs, type="n")

for(k in 1:4){curve(predict(m2, data.frame(weight=x, color=k),type="response"), col=k, lty=k, add=TRUE)}
legend("bottomright", inset=.05, lty=1:4, col=1:4,legend=paste("color ", 1:4))

Medium light: logit[P(Y=1)] = -1.6203 + 1.0483x

Medium: logit[P(Y=1)] = -2.4523 + 1.4096x

Medium dark: logit[P(Y=1)] = -7.9167 + 3.7548x

Dark: logit[P(Y=1)] = -1.1868 + 0.1947x

With the interaction terms, weight has a different effect for each color, so the four curves relating weight to P(Y=1) differ in shape. A 1 kg increase in weight now has a color-specific multiplicative effect on the odds that Y=1: for example, exp(0.1947) = 1.21 for dark crabs versus exp(1.0483) = 2.85 for medium-light crabs.
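The color-specific weight slopes can be assembled from the interaction fit (a minimal sketch, assuming m2 from above):

b <- coef(m2)
slopes <- c(b["weight"],
            b["weight"] + b["weight:factor(color)2"],
            b["weight"] + b["weight:factor(color)3"],
            b["weight"] + b["weight:factor(color)4"])
names(slopes) <- c("medium light", "medium", "medium dark", "dark")
exp(slopes)  # multiplicative effect on the odds per 1 kg, by color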

  c. Test whether the interaction model gives a better fit than the simpler model without interaction terms. Report a p-value, and interpret. Which model has the lower AIC?
anova(fit2, m2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: y ~ weight + factor(color)
## Model 2: y ~ weight + factor(color) + weight:factor(color)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       168     188.54                       
## 2       165     181.66  3    6.886  0.07562 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-value: 0.07562. At the 0.05 level we fail to reject H0, so the evidence that the weight effect varies by color is weak (significant only at the 0.10 level). The interaction model nonetheless has the lower AIC, 197.66 versus 198.54 for the no-interaction model.
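The p-value can be reproduced from the two residual deviances reported above (a minimal sketch):

pchisq(188.54 - 181.66, df = 3, lower.tail = FALSE)  # approximately 0.0756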

Question 2. Problem 5 of Homework 4 introduced the four scales of the Myers-Briggs personality test. For part (b) of that problem you fit a model using the four scales as predictors of whether a subject drinks alcohol frequently.

  a. Conduct a model goodness-of-fit test, and interpret.
df3 <- read.table(file = "http://users.stat.ufl.edu/~aa/intro-cda/data/MBTI.dat", header = TRUE)
no <- df3$n - df3$drink; yes <- df3$drink
fit4 <- glm(cbind (yes, no) ~ EI + SN + TF + JP, family = binomial(link = "logit"), data = df3);
fit5 <- glm(cbind (yes, no) ~ EI * SN * TF * JP, family = binomial(link = "logit"), data = df3);
anova(fit4,fit5,test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: cbind(yes, no) ~ EI + SN + TF + JP
## Model 2: cbind(yes, no) ~ EI * SN * TF * JP
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        11     11.149                     
## 2         0      0.000 11   11.149   0.4309

The model with all interactions is the saturated model, so this LR test is a goodness-of-fit test of the main-effects model: D0 - D1 = 11.149 - 0 = 11.149 on df = 11, with p-value = 0.4309. We fail to reject H0, so the main-effects model fits adequately.
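Equivalently, the goodness-of-fit p-value comes straight from the residual deviance of the main-effects model (a minimal sketch, assuming fit4 from above):

pchisq(deviance(fit4), df.residual(fit4), lower.tail = FALSE)  # approximately 0.4309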

  b. If you were to simplify the model by removing a predictor, which would it be? Why?

Since JP has the largest p-value (0.372 for JPp in the summary of the main-effects fit shown in part (d)), I would remove it; it is not statistically significant.
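This can be checked with a likelihood-ratio test of the model without JP (a minimal sketch; fit4.noJP is a name introduced here):

fit4.noJP <- update(fit4, . ~ . - JP)
anova(fit4.noJP, fit4, test = "Chisq")  # a large p-value supports dropping JP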

  c. Consider the model of main effects and all two-way interactions. Choose between the main-effects model and the two-way interaction model based on a likelihood-ratio test.

fit6 <- glm(cbind (yes, no) ~ (EI + SN + TF + JP)^2, family = binomial(link = "logit"), data = df3);
anova(fit4,fit6,test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: cbind(yes, no) ~ EI + SN + TF + JP
## Model 2: cbind(yes, no) ~ (EI + SN + TF + JP)^2
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        11    11.1491                     
## 2         5     3.7409  6   7.4082   0.2847

The LR statistic is 7.4082 on 6 df with p-value = 0.2847, so we fail to reject the main-effects model; based on the LR test, the main-effects model is chosen.

  d. Choose between the main-effects model and the two-way interaction model based on AIC.
summary(fit4)
## 
## Call:
## glm(formula = cbind(yes, no) ~ EI + SN + TF + JP, family = binomial(link = "logit"), 
##     data = df3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2712  -0.8062  -0.1063   0.1124   1.5807  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.1140     0.2715  -7.788 6.82e-15 ***
## EIi          -0.5550     0.2170  -2.558  0.01053 *  
## SNs          -0.4292     0.2340  -1.834  0.06664 .  
## TFt           0.6873     0.2206   3.116  0.00184 ** 
## JPp           0.2022     0.2266   0.893  0.37209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 30.488  on 15  degrees of freedom
## Residual deviance: 11.149  on 11  degrees of freedom
## AIC: 73.99
## 
## Number of Fisher Scoring iterations: 4
summary(fit6)
## 
## Call:
## glm(formula = cbind(yes, no) ~ (EI + SN + TF + JP)^2, family = binomial(link = "logit"), 
##     data = df3)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7         8  
## -0.09316   0.56452  -0.43696  -0.03168   0.13900  -0.84856   0.62519   0.00661  
##        9        10        11        12        13        14        15        16  
##  0.11129  -0.79249   0.37773   0.11962  -0.34909   0.77692  -0.75044  -0.07286  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.25933    0.47184  -4.788 1.68e-06 ***
## EIi         -0.45894    0.55219  -0.831    0.406    
## SNs         -0.55200    0.50880  -1.085    0.278    
## TFt          0.27522    0.56135   0.490    0.624    
## JPp          0.79110    0.49089   1.612    0.107    
## EIi:SNs      0.01767    0.50769   0.035    0.972    
## EIi:TFt      0.30405    0.46929   0.648    0.517    
## EIi:JPp     -0.54072    0.48426  -1.117    0.264    
## SNs:TFt      0.66547    0.50842   1.309    0.191    
## SNs:JPp     -0.29800    0.51578  -0.578    0.563    
## TFt:JPp     -0.29654    0.48354  -0.613    0.540    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 30.4880  on 15  degrees of freedom
## Residual deviance:  3.7409  on  5  degrees of freedom
## AIC: 78.582
## 
## Number of Fisher Scoring iterations: 4

AIC for the main-effects model: 73.99

AIC for the two-way interaction model: 78.582

Based on AIC, the main-effects model is chosen.
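AIC() compares the two fits directly (a minimal sketch; the values are as reported in the summaries above):

AIC(fit4, fit6)  # 73.99 vs 78.58, so the main-effects model is preferred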

  e. Use model-building methods to select a model for predicting whether a person drinks alcohol frequently, based on their Myers-Briggs personality type.
fit7 <- glm(cbind (yes, no) ~ 1, family = binomial(link = "logit"), data = df3);
scope <- list(upper=formula(fit6), lower=formula(fit7))
library(MASS)
stepAIC(fit6, direction="backward", scope=scope)
## Start:  AIC=78.58
## cbind(yes, no) ~ (EI + SN + TF + JP)^2
## 
##         Df Deviance    AIC
## - EI:SN  1   3.7421 76.583
## - SN:JP  1   4.0796 76.920
## - TF:JP  1   4.1179 76.959
## - EI:TF  1   4.1617 77.003
## - EI:JP  1   4.9884 77.829
## - SN:TF  1   5.4760 78.317
## <none>       3.7409 78.582
## 
## Step:  AIC=76.58
## cbind(yes, no) ~ EI + SN + TF + JP + EI:TF + EI:JP + SN:TF + 
##     SN:JP + TF:JP
## 
##         Df Deviance    AIC
## - SN:JP  1   4.0805 74.921
## - TF:JP  1   4.1211 74.962
## - EI:TF  1   4.1925 75.033
## - EI:JP  1   5.1500 75.991
## - SN:TF  1   5.4928 76.334
## <none>       3.7421 76.583
## 
## Step:  AIC=74.92
## cbind(yes, no) ~ EI + SN + TF + JP + EI:TF + EI:JP + SN:TF + 
##     TF:JP
## 
##         Df Deviance    AIC
## - EI:TF  1   4.5804 73.421
## - TF:JP  1   4.6241 73.465
## - EI:JP  1   5.5449 74.386
## <none>       4.0805 74.921
## - SN:TF  1   6.1747 75.016
## 
## Step:  AIC=73.42
## cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF + TF:JP
## 
##         Df Deviance    AIC
## - TF:JP  1   5.2274 72.068
## <none>       4.5804 73.421
## - EI:JP  1   6.7014 73.542
## - SN:TF  1   6.7546 73.595
## 
## Step:  AIC=72.07
## cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF
## 
##         Df Deviance    AIC
## <none>       5.2274 72.068
## - EI:JP  1   7.4797 72.321
## - SN:TF  1   8.2375 73.078
## 
## Call:  glm(formula = cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF, 
##     family = binomial(link = "logit"), data = df3)
## 
## Coefficients:
## (Intercept)          EIi          SNs          TFt          JPp      EIi:JPp  
##     -2.0821      -0.2243      -0.8034       0.1659       0.4958      -0.6589  
##     SNs:TFt  
##      0.8185  
## 
## Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
## Null Deviance:       30.49 
## Residual Deviance: 5.227     AIC: 72.07
stepAIC(fit7, direction="forward", scope=scope)
## Start:  AIC=85.33
## cbind(yes, no) ~ 1
## 
##        Df Deviance    AIC
## + TF    1   23.683 80.523
## + EI    1   24.036 80.877
## + SN    1   26.832 83.673
## <none>      30.488 85.329
## + JP    1   29.508 86.348
## 
## Step:  AIC=80.52
## cbind(yes, no) ~ TF
## 
##        Df Deviance    AIC
## + EI    1   16.398 75.239
## + SN    1   18.469 77.310
## + JP    1   21.631 80.472
## <none>      23.683 80.523
## 
## Step:  AIC=75.24
## cbind(yes, no) ~ TF + EI
## 
##         Df Deviance    AIC
## + SN     1   11.945 72.786
## <none>       16.398 75.239
## + JP     1   14.436 75.277
## + EI:TF  1   14.984 75.825
## 
## Step:  AIC=72.79
## cbind(yes, no) ~ TF + EI + SN
## 
##         Df Deviance    AIC
## + SN:TF  1   8.2328 71.074
## <none>      11.9455 72.786
## + EI:TF  1  10.5461 73.387
## + JP     1  11.1491 73.990
## + EI:SN  1  11.3814 74.222
## 
## Step:  AIC=71.07
## cbind(yes, no) ~ TF + EI + SN + TF:SN
## 
##         Df Deviance    AIC
## <none>       8.2328 71.074
## + EI:TF  1   7.0895 71.930
## + JP     1   7.4797 72.321
## + EI:SN  1   7.8198 72.661
## 
## Call:  glm(formula = cbind(yes, no) ~ TF + EI + SN + TF:SN, family = binomial(link = "logit"), 
##     data = df3)
## 
## Coefficients:
## (Intercept)          TFt          EIi          SNs      TFt:SNs  
##    -1.76795      0.07959     -0.55499     -0.86844      0.89962  
## 
## Degrees of Freedom: 15 Total (i.e. Null);  11 Residual
## Null Deviance:       30.49 
## Residual Deviance: 8.233     AIC: 71.07

Using stepwise selection algorithms: backward selection chose the model

glm(formula = cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF, family = binomial(link = "logit"), data = df3)

while forward selection chose the model

glm(formula = cbind(yes, no) ~ TF + EI + SN + TF:SN, family = binomial(link = "logit"), data = df3)
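The two selections can be refit and compared directly on AIC (a minimal sketch; back and forw are names introduced here, with the AIC values taken from the traces above):

back <- glm(cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF,
            family = binomial, data = df3)
forw <- glm(cbind(yes, no) ~ TF + EI + SN + TF:SN,
            family = binomial, data = df3)
AIC(back, forw)  # 72.07 vs 71.07: forward selection's model is slightly preferred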

Question 3. The following table shows a 2×2×6 contingency table for Y = whether admitted to graduate school at the University of California, Berkeley, for fall 1973, by gender of applicant, for the six largest graduate departments.

rm(list=ls())
Dept <- rep(1:6, rep(2,6))
Gender <- rep(c("Male","Female"), 6)
Yes <- c(512,89,353,17,120,202,138,131,53,94,22,24)
No <- c(313,19,207,8,205,391,279,244,138,299,351,317)
Data <- data.frame(Dept=Dept, Gender=Gender, Yes=Yes, No=No)
rm(Dept, Gender, Yes, No)
Data
##    Dept Gender Yes  No
## 1     1   Male 512 313
## 2     1 Female  89  19
## 3     2   Male 353 207
## 4     2 Female  17   8
## 5     3   Male 120 205
## 6     3 Female 202 391
## 7     4   Male 138 279
## 8     4 Female 131 244
## 9     5   Male  53 138
## 10    5 Female  94 299
## 11    6   Male  22 351
## 12    6 Female  24 317
  a. Conduct a model goodness-of-fit test, and interpret.
m1 <- glm(cbind(Yes,No) ~ factor(Dept), data=Data, family=binomial)
summary(m1)
## 
## Call:
## glm(formula = cbind(Yes, No) ~ factor(Dept), family = binomial, 
##     data = Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4064  -0.4550   0.1456   0.5471   4.1323  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.59346    0.06838   8.679   <2e-16 ***
## factor(Dept)2 -0.05059    0.10968  -0.461    0.645    
## factor(Dept)3 -1.20915    0.09726 -12.432   <2e-16 ***
## factor(Dept)4 -1.25833    0.10152 -12.395   <2e-16 ***
## factor(Dept)5 -1.68296    0.11733 -14.343   <2e-16 ***
## factor(Dept)6 -3.26911    0.16707 -19.567   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 877.056  on 11  degrees of freedom
## Residual deviance:  21.736  on  6  degrees of freedom
## AIC: 102.68
## 
## Number of Fisher Scoring iterations: 4
sat <- update(m1, ~ factor(Dept)*Gender)
anova(m1, sat, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: cbind(Yes, No) ~ factor(Dept)
## Model 2: cbind(Yes, No) ~ factor(Dept) + Gender + factor(Dept):Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1         6     21.735                        
## 2         0      0.000  6   21.735 0.001352 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-value: 0.001352. We reject the simpler model: given department, admission is not independent of gender; in at least one department the admission rate depends on gender.

  b. Use the standardized residuals to describe the lack of fit.
Data$y.hat <- (Data$Yes + Data$No) * fitted(m1)
Data$resid <- rstandard(m1, type="pearson")
Data
##    Dept Gender Yes  No     y.hat      resid
## 1     1   Male 512 313 531.43087 -4.1530728
## 2     1 Female  89  19  69.56913  4.1530728
## 3     2   Male 353 207 354.18803 -0.5037077
## 4     2 Female  17   8  15.81197  0.5037077
## 5     3   Male 120 205 113.99782  0.8680662
## 6     3 Female 202 391 208.00218 -0.8680662
## 7     4   Male 138 279 141.63258 -0.5458732
## 8     4 Female 131 244 127.36742  0.5458732
## 9     5   Male  53 138  48.07705  1.0005342
## 10    5 Female  94 299  98.92295 -1.0005342
## 11    6   Male  22 351  24.03081 -0.6197526
## 12    6 Female  24 317  21.96919  0.6197526

The lack of fit is concentrated in department 1: its standardized residuals are ±4.15, far beyond 2 in absolute value. There, females were admitted at a much higher rate than the no-gender-effect model predicts (89 observed vs. 69.6 fitted) and males at a lower rate. Every other department has standardized residuals of about 1 or less in absolute value.
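A convenient way to isolate the problem cells (a minimal sketch, using the resid column computed above):

subset(Data, abs(resid) > 2)  # only the two department 1 cells appear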

  c. Demonstrate that when we add a gender effect, the estimated conditional odds ratio between admissions and gender (1 = male, 0 = female) is 0.90.
rm(list=ls())
Dept <- rep(1:6, rep(2,6))
Gender <- rep(c("Male","Female"), 6)
Yes <- c(512,89,353,17,120,202,138,131,53,94,22,24)
No <- c(313,19,207,8,205,391,279,244,138,299,351,317)
Data <- data.frame(Dept=Dept, Gender=Gender, Yes=Yes, No=No)
rm(Dept, Gender, Yes, No)
m1 <- glm(cbind(Yes,No) ~ factor(Dept), data=Data, family=binomial)
m2 <- update(m1, ~ . + Gender)
exp(coef(m2))
##   (Intercept) factor(Dept)2 factor(Dept)3 factor(Dept)4 factor(Dept)5 
##    1.97767415    0.95753028    0.28291804    0.27400567    0.17564230 
## factor(Dept)6    GenderMale 
##    0.03664494    0.90495497
The exponentiated GenderMale coefficient is 0.905: controlling for department, the estimated odds of admission for a male are about 0.90 times those for a female.

  d. Demonstrate that the marginal table, collapsed over department, has odds ratio 1.84.
Count <- c(sum(Data$Yes[Data$Gender=="Male"]),sum(Data$Yes[Data$Gender=="Female"]),sum(Data$No[ Data$Gender=="Male"]),sum(Data$No[ Data$Gender=="Female"]) )
Table <- matrix(Count, 2, 2)
rownames(Table) <- c("Male", "Female")
colnames(Table) <- c("Yes", "No")
Table
##         Yes   No
## Male   1198 1493
## Female  557 1278
Table[1,1] * Table[2,2]/(Table[1,2] * Table[2,1])
## [1] 1.84108
m1a <- update(m1, ~ Gender)
exp(coef(m1a))
## (Intercept)  GenderMale 
##   0.4358372   1.8410800
  e. Explain what causes the associations in parts (c) and (d) to differ so much.

This is Simpson's paradox. Conditionally on department (part c), the estimated odds of admission are slightly lower for males than for females (odds ratio 0.90), yet marginally (part d) males appear to do much better (odds ratio 1.84). The reversal happens because males applied disproportionately to the departments that admit a high fraction of applicants, while females applied mostly to the departments that admit few.
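The reversal is visible in the department-specific admission rates and in where each gender applied (a minimal sketch; rate and total are columns added here):

Data$rate  <- round(Data$Yes / (Data$Yes + Data$No), 2)
Data$total <- Data$Yes + Data$No
xtabs(rate ~ Dept + Gender, data = Data)   # departments 1-2 admit at much higher rates
xtabs(total ~ Dept + Gender, data = Data)  # males applied mostly to departments 1-2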

Question 4. The following table displays primary food choice for a sample of alligators, classified by length (≤2.3 meters, >2.3 meters) and by the lake in Florida in which they were caught.

  a. Fit a model to describe the effects of length and lake on primary food choice, using Fish as the baseline category. Report the prediction equations.
library(nnet)
gator <- read.table(file = "http://users.stat.ufl.edu/~aa/cat/data/Alligators2.dat", header = TRUE)
head(gator)
##   lake size y1 y2 y3 y4 y5
## 1    1    1 23  4  2  2  8
## 2    1    0  7  0  1  3  5
## 3    2    1  5 11  1  0  3
## 4    2    0 13  8  6  1  0
## 5    3    1  5 11  2  1  5
## 6    3    0  8  7  6  3  5
names(gator)
## [1] "lake" "size" "y1"   "y2"   "y3"   "y4"   "y5"
names(gator) = c("lake", "size", "F", "I", "R", "B", "O")
gator$Lake <- factor(gator$lake, levels = c(4,1,2,3))
Alligatorfit <- multinom(cbind(F, I, R, B, O) ~ Lake + size, data = gator)
## # weights:  30 (20 variable)
## initial  value 352.466903 
## iter  10 value 271.607785
## iter  20 value 270.046051
## final  value 270.040140 
## converged
summary(Alligatorfit)
## Call:
## multinom(formula = cbind(F, I, R, B, O) ~ Lake + size, data = gator)
## 
## Coefficients:
##   (Intercept)      Lake1        Lake2    Lake3       size
## I   -1.549021 -1.6581178  0.937237973 1.122002  1.4581457
## R   -3.314512  1.2428408  2.458913302 2.935262 -0.3512702
## B   -2.093358  0.6954256 -0.652622721 1.088098 -0.6306329
## O   -1.904343  0.8263115  0.005792737 1.516461  0.3315514
## 
## Std. Errors:
##   (Intercept)     Lake1     Lake2     Lake3      size
## I   0.4249185 0.6128466 0.4719035 0.4905122 0.3959418
## R   1.0530577 1.1854031 1.1181000 1.1163844 0.5800207
## B   0.6622972 0.7813123 1.2020025 0.8417085 0.6424863
## O   0.5258313 0.5575446 0.7765655 0.6214371 0.4482504
## 
## Residual Deviance: 540.0803 
## AIC: 580.0803

Prediction equations, with fish as the baseline category (lake 4 is the reference level, and I(lake=k) indicates lake k):

Invertebrate: log(π̂I/π̂F) = -1.5490 - 1.6581 I(lake=1) + 0.9372 I(lake=2) + 1.1220 I(lake=3) + 1.4581 size

Reptile: log(π̂R/π̂F) = -3.3145 + 1.2428 I(lake=1) + 2.4589 I(lake=2) + 2.9353 I(lake=3) - 0.3513 size

Bird: log(π̂B/π̂F) = -2.0934 + 0.6954 I(lake=1) - 0.6526 I(lake=2) + 1.0881 I(lake=3) - 0.6306 size

Other: log(π̂O/π̂F) = -1.9043 + 0.8263 I(lake=1) + 0.0058 I(lake=2) + 1.5165 I(lake=3) + 0.3316 size

  b. Interpret the effect of length on the choice between invertebrates and fish.

For a given lake, the estimated odds that the primary food choice is invertebrates rather than fish for small alligators (size = 1, length ≤2.3 m) are exp(1.4581) = 4.30 times the corresponding odds for large alligators (size = 0).
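The same number can be pulled out of the coefficient matrix (a minimal sketch, assuming Alligatorfit from above):

exp(coef(Alligatorfit)["I", "size"])  # approximately 4.30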

  c. Estimate the probability that the primary food choice is fish, for each length, in Lake Oklawaha.
predict(Alligatorfit, data.frame(size=0,Lake=factor(2)), type="probs")
##          F          I          R          B          O 
## 0.45842485 0.24864187 0.19484367 0.02942414 0.06866547
predict(Alligatorfit, data.frame(size=1,Lake=factor(2)), type="probs")
##           F           I           R           B           O 
## 0.258189860 0.601880037 0.077232953 0.008820525 0.053876625

Estimated probability that the primary food choice is fish, in Lake Oklawaha:

Large alligators (size = 0): 0.458

Small alligators (size = 1): 0.258

Question 5. A model fit predicting preference for President (Democrat, Republican, Independent) using x = annual income (in $10,000s) is

log[π̂R(x)/π̂I(x)] = 1.0 + 0.3x

log[π̂D(x)/π̂I(x)] = 3.3 - 0.2x

  a. State the prediction equation for log[π̂R(x)/π̂D(x)]. Interpret its slope. Note on formula: log[π̂R(x)/π̂D(x)] = log[π̂R(x)/π̂I(x)] - log[π̂D(x)/π̂I(x)].
1.0 - 3.3  # intercept: alpha_R - alpha_D
## [1] -2.3
0.3 - (-0.2)  # slope: beta_R - beta_D
## [1] 0.5
exp(0.5)  # multiplicative effect on the odds per $10,000 of income
## [1] 1.648721

Answer: log[π̂R(x)/π̂D(x)] = -2.3 + 0.5x. The estimated odds of preferring the Republican over the Democrat multiply by exp(0.5) = 1.65 for each additional $10,000 of annual income.

  b. Find the range of x for which π̂R(x) > π̂D(x).
2.3 / 0.5  # solve -2.3 + 0.5x = 0 for x
## [1] 4.6

π̂R(x) > π̂D(x) when -2.3 + 0.5x > 0, i.e., x > 4.6: annual incomes above 4.6 × $10,000 = $46,000.

  c. State the prediction equation for π̂I(x).

Since I (Independent) is the baseline category, α̂I = β̂I = 0. Substituting into the multinomial probability formula:

π̂I(x) = e^(α̂I + β̂I x) / (e^(α̂R + β̂R x) + e^(α̂D + β̂D x) + e^(α̂I + β̂I x))

= e^(0 + 0x) / (e^(1.0 + 0.3x) + e^(3.3 - 0.2x) + e^(0 + 0x))

= 1 / (e^(1.0 + 0.3x) + e^(3.3 - 0.2x) + 1)
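A small function makes the three estimated preference probabilities easy to evaluate (a minimal sketch; pihat is a name introduced here):

pihat <- function(x) {  # x = annual income in $10,000s
  den <- exp(1.0 + 0.3 * x) + exp(3.3 - 0.2 * x) + 1
  c(R = exp(1.0 + 0.3 * x) / den, D = exp(3.3 - 0.2 * x) / den, I = 1 / den)
}
pihat(4.6)  # at x = 4.6 ($46,000) the R and D probabilities are equal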

Question 6. Is marital happiness associated with family income? For a General Social Survey, counts in the happiness categories (not, pretty, very) were (6,43,75) for below average income,(6,113,178) for an average income, and (6,57,117) for above average income. Fit a baseline-category logit model with not happy as the baseline category and scores (1,2,3) for the income categories (below average, average, above average).

  a. Report the prediction equations.
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
Not <- c(6,6,6); 
Pretty <- c(43,113,57); 
Very <- c(75,178,117);
Income <- c("Below", "Average", "Above")
data.frame(Income, Not, Pretty, Very)
##    Income Not Pretty Very
## 1   Below   6     43   75
## 2 Average   6    113  178
## 3   Above   6     57  117
score = c(1,2,3)
jobsat.ind <- vglm(cbind(Pretty,Very,Not) ~ score, family=multinomial)
jobsat.ind
## 
## Call:
## vglm(formula = cbind(Pretty, Very, Not) ~ score, family = multinomial)
## 
## 
## Coefficients:
## (Intercept):1 (Intercept):2       score:1       score:2 
##     2.2038939     2.5551795     0.1313533     0.2275057 
## 
## Degrees of Freedom: 6 Total; 2 Residual
## Residual deviance: 3.190889 
## Log-likelihood: -15.38641 
## 
## This is a multinomial logit model with 3 levels

log(π̂1/π̂3) = 2.2039 + 0.1314x (pretty happy vs. not happy)

log(π̂2/π̂3) = 2.5552 + 0.2275x (very happy vs. not happy)

where x is the income score (1 = below average, 2 = average, 3 = above average) and not happy is the baseline category.

  b. Carefully interpret the income effect in the first equation.

A one-category increase in income score multiplies the estimated odds that a person is pretty happy rather than not happy by exp(0.1314) = 1.14.

  c. Test the null hypothesis that marital happiness is independent of family income. Interpret the p-value, and clearly state your conclusion.
jobsat.ind <- vglm(cbind(Pretty,Very,Not) ~ score, multinomial)
summary(jobsat.ind)
## 
## Call:
## vglm(formula = cbind(Pretty, Very, Not) ~ score, family = multinomial)
## 
## Pearson residuals:
##   log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
## 1            -0.9011            -0.1378
## 2             1.2327             0.2369
## 3            -0.8408            -0.1936
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1   2.2039     0.7376   2.988 0.002807 ** 
## (Intercept):2   2.5552     0.7256   3.521 0.000429 ***
## score:1         0.1314     0.3468   0.379 0.704906    
## score:2         0.2275     0.3412   0.667 0.504907    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 3.1909 on 2 degrees of freedom
## 
## Log-likelihood: -15.3864 on 2 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Reference group is level  3  of the response
jobsat.ind1 <- vglm(cbind(Pretty,Very,Not) ~ 1, multinomial)
summary(jobsat.ind1)
## 
## Call:
## vglm(formula = cbind(Pretty, Very, Not) ~ 1, family = multinomial)
## 
## Pearson residuals:
##   log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
## 1            -0.8316            -0.8716
## 2             1.2691             0.1709
## 3            -0.9400             0.5038
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1   2.4709     0.2455   10.07   <2e-16 ***
## (Intercept):2   3.0231     0.2414   12.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 4.1348 on 4 degrees of freedom
## 
## Log-likelihood: -15.8583 on 4 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Reference group is level  3  of the response
lrtest(jobsat.ind,jobsat.ind1)
## Likelihood ratio test
## 
## Model 1: cbind(Pretty, Very, Not) ~ score
## Model 2: cbind(Pretty, Very, Not) ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   2 -15.386                     
## 2   4 -15.858  2 0.9439     0.6238

H0: marital happiness is independent of family income (both score coefficients equal 0); HA: they are associated. The LR statistic is 0.944 on 2 df with p-value = 0.6238, so we fail to reject H0: there is no evidence of an association.

  d. Does the model fit adequately? Justify your answer.

Yes, the model fits adequately. Relative to the saturated model, the residual deviance is 3.1909 on 2 df, well below the 0.05 critical value of 5.99 (p-value = 0.2028), so we fail to reject the null hypothesis that the model fits.
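The quoted p-value is the upper tail of a chi-squared distribution evaluated at the residual deviance (a minimal sketch):

pchisq(3.1909, df = 2, lower.tail = FALSE)  # approximately 0.2028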

  e. Estimate the probability that a person with average family income reports a very happy marriage.
jobsat.ind <- vglm(cbind(Pretty,Very,Not) ~ score, family=multinomial)
predict(jobsat.ind,data.frame(score=2), type="response")
##      Pretty      Very       Not
## 1 0.3562457 0.6135187 0.0302356

Probability: 0.6135187

Question 7. For a recent General Social Survey, a prediction equation relating Y = job satisfaction (four ordered categories; 1 = least satisfied) to the subject's report of x1 = earnings compared to others with similar positions (four ordered categories; 1 = much less, 4 = much more), x2 = freedom to make decisions about how to do the job (four ordered categories; 1 = very true, 4 = not true at all), and x3 = work environment allows productivity (four ordered categories; 1 = strongly agree, 4 = strongly disagree), was

logit[P̂(Y ≤ j)] = α̂j - 0.54x1 + 0.60x2 + 1.19x3.

  a. Summarize each partial effect by indicating whether subjects tend to be more or less satisfied as (i) x1, (ii) x2, (iii) x3 increases.

(i) Since the coefficient of x1 is -0.54 < 0, P̂(Y ≤ j) decreases as x1 increases, so higher relative earnings go with higher job satisfaction.

(ii) Since the coefficient of x2 is 0.60 > 0, P̂(Y ≤ j) increases with x2, so satisfaction is higher for smaller x2, i.e., for subjects with more freedom to make decisions (1 = very true).

(iii) Since the coefficient of x3 is 1.19 > 0, P̂(Y ≤ j) increases with x3, so satisfaction is higher for smaller x3, i.e., when the subject agrees the work environment allows productivity (1 = strongly agree).

  b. Report the settings for x1, x2, x3 at which a subject is most likely to have the highest job satisfaction.

x1 = 4 (earns much more), x2 = 1 (very true), x3 = 1 (strongly agree).
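This can be verified by minimizing the linear predictor over all 4 × 4 × 4 settings, since a smaller logit[P̂(Y ≤ j)] means higher satisfaction (a minimal sketch; g is a name introduced here):

g <- expand.grid(x1 = 1:4, x2 = 1:4, x3 = 1:4)
g$lp <- -0.54 * g$x1 + 0.60 * g$x2 + 1.19 * g$x3
g[which.min(g$lp), ]  # x1 = 4, x2 = 1, x3 = 1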

  c. What can you say about the prediction equation for logit[P̂(Y ≥ j)]?

Since P(Y ≥ j) = 1 - P(Y ≤ j-1), logit[P̂(Y ≥ j)] = -logit[P̂(Y ≤ j-1)]: all the coefficients change sign. The prediction equation is logit[P̂(Y ≥ j)] = -α̂(j-1) + 0.54x1 - 0.60x2 - 1.19x3, so the x1 coefficient becomes positive while the x2 and x3 coefficients become negative.

Question 8. Refer to the marital happiness data from Problem 6. Fit a proportional odds logit model for predicting marital happiness from family income, using scores (1,2,3) for the income categories.

  a. Report the prediction equations.
library(VGAM)
Not <- c(6,6,6); 
Pretty <- c(43,113,57); 
Very <- c(75,178,117);
Income <- c("Below", "Average", "Above")
data.frame(Income, Not, Pretty, Very)
##    Income Not Pretty Very
## 1   Below   6     43   75
## 2 Average   6    113  178
## 3   Above   6     57  117
scores<-c(1,2,3)
logitmodel <- vglm(cbind(Not,Pretty,Very) ~ scores, family=propodds)
logitmodel
## 
## Call:
## vglm(formula = cbind(Not, Pretty, Very) ~ scores, family = propodds)
## 
## 
## Coefficients:
## (Intercept):1 (Intercept):2        scores 
##     3.2466448     0.2377548     0.1117423 
## 
## Degrees of Freedom: 6 Total; 3 Residual
## Residual deviance: 3.247165 
## Log-likelihood: -15.41455

Prediction equations (VGAM's propodds family parameterizes the cumulative logits as logit[P̂(Y ≥ j)], with Y ordered not < pretty < very):

logit[P̂(Y ≥ 2)] = 3.2466 + 0.1117x

logit[P̂(Y ≥ 3)] = 0.2378 + 0.1117x

where x is the income score.

  b. Carefully interpret the estimated income effect.

A one-category increase in income score multiplies the estimated odds of being above any fixed happiness level (rather than at or below it) by exp(0.1117) = 1.118, the same factor for both cumulative logits; higher income is weakly associated with greater marital happiness.

  c. Test the null hypothesis that marital happiness is independent of family income. Interpret the p-value, and clearly state your conclusion.
logitmodel <- vglm(cbind(Not,Pretty,Very) ~ scores, family=propodds)
logitmodel1 <- vglm(cbind(Not,Pretty,Very) ~ 1, family=propodds)
summary(logitmodel)
## 
## Call:
## vglm(formula = cbind(Not, Pretty, Very) ~ scores, family = propodds)
## 
## Pearson residuals:
##   logitlink(P[Y>=2]) logitlink(P[Y>=3])
## 1            -0.9484             0.5775
## 2             1.0457            -0.6768
## 3            -0.5406             0.3904
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1   3.2466     0.3404   9.537   <2e-16 ***
## (Intercept):2   0.2378     0.2592   0.917    0.359    
## scores          0.1117     0.1179   0.948    0.343    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y>=2]), logitlink(P[Y>=3])
## 
## Residual deviance: 3.2472 on 3 degrees of freedom
## 
## Log-likelihood: -15.4146 on 3 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Exponentiated coefficients:
##   scores 
## 1.118225
lrtest(logitmodel,logitmodel1)
## Likelihood ratio test
## 
## Model 1: cbind(Not, Pretty, Very) ~ scores
## Model 2: cbind(Not, Pretty, Very) ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   3 -15.415                     
## 2   4 -15.858  1 0.8876     0.3461

The LR statistic is 0.888 on 1 df with p-value = 0.3461, so we fail to reject the null hypothesis that marital happiness is independent of family income.

  d. Does the model fit adequately? Justify your answer.

Yes, the model fits adequately. Relative to the saturated model, the residual deviance is 3.2472 on 3 df, well below the 0.05 critical value of 7.81 (p-value = 0.3551), so we fail to reject the null hypothesis that the model fits.

  e. Estimate the probability that a person with average family income reports a very happy marriage.
predict(logitmodel, data.frame(scores=2), type="response")  # the model variable is scores, not score
##          Not    Pretty      Very
## 1 0.03017419 0.3565176 0.6133082

Probability = 0.6133, essentially the same as the estimate 0.6135 from the baseline-category model in Problem 6.