Question 1. For the Crabs data file http://users.stat.ufl.edu/~aa/cat/data/Crabs.dat, fit the logistic regression model for the probability of a satellite (y= 1) using weight as a quantitative predictor and color as a categorical predictor.
filename3 <- "http://users.stat.ufl.edu/~aa/cat/data/Crabs.dat"
Crabs <- read.table(file=filename3, header=TRUE)
head(Crabs)
## crab sat y weight width color spine
## 1 1 8 1 3.05 28.3 2 3
## 2 2 0 0 1.55 22.5 3 3
## 3 3 9 1 2.30 26.0 1 1
## 4 4 0 0 2.10 24.8 3 3
## 5 5 4 1 2.60 26.0 3 3
## 6 6 0 0 2.10 23.8 2 3
fit2=glm(y ~ weight + factor(color), family=binomial(link=logit),data=Crabs)
summary(fit2)
##
## Call:
## glm(formula = y ~ weight + factor(color), family = binomial(link = logit),
## data = Crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1908 -1.0144 0.5101 0.8683 2.0751
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.2572 1.1985 -2.718 0.00657 **
## weight 1.6928 0.3888 4.354 1.34e-05 ***
## factor(color)2 0.1448 0.7365 0.197 0.84410
## factor(color)3 -0.1861 0.7750 -0.240 0.81019
## factor(color)4 -1.2694 0.8488 -1.495 0.13479
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 188.54 on 168 degrees of freedom
## AIC: 198.54
##
## Number of Fisher Scoring iterations: 4
## Plot of probabilities
cols=rainbow(3)
curve(plogis(fit2$coefficients[1]+fit2$coefficients[2]*x), from=1,to=5.5,lwd=2,ylab="probability",xlab="weight",main="Color as Categories, probability")
for (i in 1:3)
curve(plogis(fit2$coefficients[1]+fit2$coefficients[2+i]+fit2$coefficients[2]*x), from=1,to=5.5,lwd=2,col=cols[i],add=TRUE)
legend(3.5, .6, col=c("black", cols), lwd=2, legend=c("Medium Light", "Medium", "Medium Dark", "Dark"), bg="light gray")
The estimated log odds of a satellite for a crab of weight x kg are:
Medium light: logit[P(Y=1)] = -3.2572 + 1.6928x
Medium: logit[P(Y=1)] = -3.1124 + 1.6928x
Medium dark: logit[P(Y=1)] = -3.4433 + 1.6928x
Dark: logit[P(Y=1)] = -4.5266 + 1.6928x
Weight has the same effect, with coefficient 1.6928, for all colors, so the shapes of the four curves relating weight to P(Y=1) are identical. For each color, a 1 kg increase in weight has a multiplicative effect of exp(1.6928) = 5.434677 on the odds that Y=1. At any fixed weight, the estimated odds of a satellite for a medium-light crab are exp(-0.1448) = 0.87 times those for a medium crab, exp(0.1861) = 1.20 times those for a medium-dark crab, and exp(1.2694) = 3.56 times those for a dark crab.
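As a check, the weight effect and these pairwise color odds ratios can be computed directly from the fitted coefficients of fit2 (a small sketch, not part of the original output; the approximate values in the comments follow from the estimates above):
b <- coef(fit2)
exp(b["weight"])            # multiplicative effect of a 1 kg increase, about 5.43
exp(-b["factor(color)2"])   # medium light vs medium,      about 0.87
exp(-b["factor(color)3"])   # medium light vs medium dark, about 1.20
exp(-b["factor(color)4"])   # medium light vs dark,        about 3.56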
m2 <- update(fit2, ~ weight*factor(color))
m2
##
## Call: glm(formula = y ~ weight + factor(color) + weight:factor(color),
## family = binomial(link = logit), data = Crabs)
##
## Coefficients:
## (Intercept) weight factor(color)2
## -1.6203 1.0483 -0.8320
## factor(color)3 factor(color)4 weight:factor(color)2
## -6.2964 0.4335 0.3613
## weight:factor(color)3 weight:factor(color)4
## 2.7065 -0.8536
##
## Degrees of Freedom: 172 Total (i.e. Null); 165 Residual
## Null Deviance: 225.8
## Residual Deviance: 181.7 AIC: 197.7
plot(y ~ weight, data=Crabs, type="n")
for(k in 1:4){curve(predict(m2, data.frame(weight=x, color=k),type="response"), col=k, lty=k, add=TRUE)}
legend("bottomright", inset=.05, lty=1:4, col=1:4,legend=paste("color ", 1:4))
Medium light: logit[P(Y=1)] = -1.6203 + 1.0483x
Medium: logit[P(Y=1)] = -2.4523 + 1.4096x
Medium dark: logit[P(Y=1)] = -7.9167 + 3.7548x
Dark: logit[P(Y=1)] = -1.1868 + 0.1947x
Weight now has a different effect for each color, so the shapes of the four curves relating weight to P(Y=1) differ. For each color, a 1 kg increase in weight has its own multiplicative effect on the odds that Y=1: for example, for dark crabs the estimated odds of a satellite are multiplied by exp(0.1947) = 1.21 per kg, compared with exp(1.0483) = 2.85 per kg for medium-light crabs.
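The color-specific intercepts and slopes quoted above can be recovered directly from the interaction fit (a small sketch using the m2 object; the intermediate object names are only illustrative):
b2 <- coef(m2)
intercepts <- b2["(Intercept)"] + c(0, b2["factor(color)2"], b2["factor(color)3"], b2["factor(color)4"])
slopes <- b2["weight"] + c(0, b2["weight:factor(color)2"], b2["weight:factor(color)3"], b2["weight:factor(color)4"])
cbind(color = 1:4, intercept = intercepts, slope = slopes)  # rows: medium light, medium, medium dark, dark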
anova(fit2, m2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: y ~ weight + factor(color)
## Model 2: y ~ weight + factor(color) + weight:factor(color)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 168 188.54
## 2 165 181.66 3 6.886 0.07562 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-value: 0.07562. The evidence that the effect of weight differs by color is marginal: the interaction is significant at the 0.10 level but not at the 0.05 level. The interaction model (Model 2) also has a slightly smaller AIC (197.66) than the main-effects model (198.54), so on these grounds the model with interaction gives the better fit.
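Both quantities can be reproduced directly from the two fits (a quick check, not part of the original output):
pchisq(188.54 - 181.66, df = 168 - 165, lower.tail = FALSE)  # LR test p-value, about 0.076
AIC(fit2, m2)  # 198.54 for the main-effects model vs 197.66 for the interaction model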
Question 2. Problem 5 of Homework 4 introduced the four scales of the Myers-Briggs personality test. For part (b) of that problem you fit a model using the four scales as predictors of whether a subject drinks alcohol frequently.
df3 <- read.table(file = "http://users.stat.ufl.edu/~aa/intro-cda/data/MBTI.dat", header = TRUE)
no <- df3$n - df3$drink; yes <- df3$drink
fit4 <- glm(cbind (yes, no) ~ EI + SN + TF + JP, family = binomial(link = "logit"), data = df3);
fit5 <- glm(cbind (yes, no) ~ EI * SN * TF * JP, family = binomial(link = "logit"), data = df3);
anova(fit4,fit5,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: cbind(yes, no) ~ EI + SN + TF + JP
## Model 2: cbind(yes, no) ~ EI * SN * TF * JP
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 11 11.149
## 2 0 0.000 11 11.149 0.4309
The model with all interactions is the saturated model, so this LR test is a goodness-of-fit test of the main-effects model: G^2 = D0 - D1 = 11.149 - 0 = 11.149 on df = 11, with p-value = 0.4309. We fail to reject H0, so the main-effects model fits adequately and no interaction terms among the four scales are needed.
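The p-value of this deviance goodness-of-fit test can also be computed directly (a quick check):
pchisq(11.149, df = 11, lower.tail = FALSE)  # about 0.43, matching the anova() output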
Since JPp has the largest p-value of the four predictors (0.372), it is not statistically significant, and I would consider removing JP from the model.
c i. Consider the model with main effects and all two-way interactions. Choose between the main-effects model and the two-way-interaction model based on a likelihood ratio test.
fit6 <- glm(cbind (yes, no) ~ (EI + SN + TF + JP)^2, family = binomial(link = "logit"), data = df3);
anova(fit4,fit6,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: cbind(yes, no) ~ EI + SN + TF + JP
## Model 2: cbind(yes, no) ~ (EI + SN + TF + JP)^2
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 11 11.1491
## 2 5 3.7409 6 7.4082 0.2847
Based on the LR test (G^2 = 7.4082 on 6 df, p-value = 0.2847), the two-way interactions do not significantly improve the fit, so the main-effects model is chosen.
summary(fit4)
##
## Call:
## glm(formula = cbind(yes, no) ~ EI + SN + TF + JP, family = binomial(link = "logit"),
## data = df3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2712 -0.8062 -0.1063 0.1124 1.5807
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1140 0.2715 -7.788 6.82e-15 ***
## EIi -0.5550 0.2170 -2.558 0.01053 *
## SNs -0.4292 0.2340 -1.834 0.06664 .
## TFt 0.6873 0.2206 3.116 0.00184 **
## JPp 0.2022 0.2266 0.893 0.37209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30.488 on 15 degrees of freedom
## Residual deviance: 11.149 on 11 degrees of freedom
## AIC: 73.99
##
## Number of Fisher Scoring iterations: 4
summary(fit6)
##
## Call:
## glm(formula = cbind(yes, no) ~ (EI + SN + TF + JP)^2, family = binomial(link = "logit"),
## data = df3)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## -0.09316 0.56452 -0.43696 -0.03168 0.13900 -0.84856 0.62519 0.00661
## 9 10 11 12 13 14 15 16
## 0.11129 -0.79249 0.37773 0.11962 -0.34909 0.77692 -0.75044 -0.07286
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.25933 0.47184 -4.788 1.68e-06 ***
## EIi -0.45894 0.55219 -0.831 0.406
## SNs -0.55200 0.50880 -1.085 0.278
## TFt 0.27522 0.56135 0.490 0.624
## JPp 0.79110 0.49089 1.612 0.107
## EIi:SNs 0.01767 0.50769 0.035 0.972
## EIi:TFt 0.30405 0.46929 0.648 0.517
## EIi:JPp -0.54072 0.48426 -1.117 0.264
## SNs:TFt 0.66547 0.50842 1.309 0.191
## SNs:JPp -0.29800 0.51578 -0.578 0.563
## TFt:JPp -0.29654 0.48354 -0.613 0.540
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30.4880 on 15 degrees of freedom
## Residual deviance: 3.7409 on 5 degrees of freedom
## AIC: 78.582
##
## Number of Fisher Scoring iterations: 4
AIC for the main-effects model: 73.99. AIC for the two-way-interaction model: 78.58. Based on AIC, the main-effects model is chosen.
fit7 <- glm(cbind (yes, no) ~ 1, family = binomial(link = "logit"), data = df3);
scope <- list(upper=formula(fit6), lower=formula(fit7))
library(MASS)
stepAIC(fit6, direction="backward", scope=scope)
## Start: AIC=78.58
## cbind(yes, no) ~ (EI + SN + TF + JP)^2
##
## Df Deviance AIC
## - EI:SN 1 3.7421 76.583
## - SN:JP 1 4.0796 76.920
## - TF:JP 1 4.1179 76.959
## - EI:TF 1 4.1617 77.003
## - EI:JP 1 4.9884 77.829
## - SN:TF 1 5.4760 78.317
## <none> 3.7409 78.582
##
## Step: AIC=76.58
## cbind(yes, no) ~ EI + SN + TF + JP + EI:TF + EI:JP + SN:TF +
## SN:JP + TF:JP
##
## Df Deviance AIC
## - SN:JP 1 4.0805 74.921
## - TF:JP 1 4.1211 74.962
## - EI:TF 1 4.1925 75.033
## - EI:JP 1 5.1500 75.991
## - SN:TF 1 5.4928 76.334
## <none> 3.7421 76.583
##
## Step: AIC=74.92
## cbind(yes, no) ~ EI + SN + TF + JP + EI:TF + EI:JP + SN:TF +
## TF:JP
##
## Df Deviance AIC
## - EI:TF 1 4.5804 73.421
## - TF:JP 1 4.6241 73.465
## - EI:JP 1 5.5449 74.386
## <none> 4.0805 74.921
## - SN:TF 1 6.1747 75.016
##
## Step: AIC=73.42
## cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF + TF:JP
##
## Df Deviance AIC
## - TF:JP 1 5.2274 72.068
## <none> 4.5804 73.421
## - EI:JP 1 6.7014 73.542
## - SN:TF 1 6.7546 73.595
##
## Step: AIC=72.07
## cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF
##
## Df Deviance AIC
## <none> 5.2274 72.068
## - EI:JP 1 7.4797 72.321
## - SN:TF 1 8.2375 73.078
##
## Call: glm(formula = cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF,
## family = binomial(link = "logit"), data = df3)
##
## Coefficients:
## (Intercept) EIi SNs TFt JPp EIi:JPp
## -2.0821 -0.2243 -0.8034 0.1659 0.4958 -0.6589
## SNs:TFt
## 0.8185
##
## Degrees of Freedom: 15 Total (i.e. Null); 9 Residual
## Null Deviance: 30.49
## Residual Deviance: 5.227 AIC: 72.07
stepAIC(fit7, direction="forward", scope=scope)
## Start: AIC=85.33
## cbind(yes, no) ~ 1
##
## Df Deviance AIC
## + TF 1 23.683 80.523
## + EI 1 24.036 80.877
## + SN 1 26.832 83.673
## <none> 30.488 85.329
## + JP 1 29.508 86.348
##
## Step: AIC=80.52
## cbind(yes, no) ~ TF
##
## Df Deviance AIC
## + EI 1 16.398 75.239
## + SN 1 18.469 77.310
## + JP 1 21.631 80.472
## <none> 23.683 80.523
##
## Step: AIC=75.24
## cbind(yes, no) ~ TF + EI
##
## Df Deviance AIC
## + SN 1 11.945 72.786
## <none> 16.398 75.239
## + JP 1 14.436 75.277
## + EI:TF 1 14.984 75.825
##
## Step: AIC=72.79
## cbind(yes, no) ~ TF + EI + SN
##
## Df Deviance AIC
## + SN:TF 1 8.2328 71.074
## <none> 11.9455 72.786
## + EI:TF 1 10.5461 73.387
## + JP 1 11.1491 73.990
## + EI:SN 1 11.3814 74.222
##
## Step: AIC=71.07
## cbind(yes, no) ~ TF + EI + SN + TF:SN
##
## Df Deviance AIC
## <none> 8.2328 71.074
## + EI:TF 1 7.0895 71.930
## + JP 1 7.4797 72.321
## + EI:SN 1 7.8198 72.661
##
## Call: glm(formula = cbind(yes, no) ~ TF + EI + SN + TF:SN, family = binomial(link = "logit"),
## data = df3)
##
## Coefficients:
## (Intercept) TFt EIi SNs TFt:SNs
## -1.76795 0.07959 -0.55499 -0.86844 0.89962
##
## Degrees of Freedom: 15 Total (i.e. Null); 11 Residual
## Null Deviance: 30.49
## Residual Deviance: 8.233 AIC: 71.07
Using stepwise selection algorithms: backward selection chose the model glm(formula = cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF, family = binomial(link = "logit"), data = df3),
while forward selection chose the model glm(formula = cbind(yes, no) ~ TF + EI + SN + TF:SN, family = binomial(link = "logit"), data = df3).
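To compare the two selected models head to head, one can refit each and compare AICs; consistent with the stepAIC traces above, the forward-selection model has the smaller AIC (71.07 vs. 72.07). A small sketch (the object names fit.back and fit.fwd are illustrative):
fit.back <- glm(cbind(yes, no) ~ EI + SN + TF + JP + EI:JP + SN:TF, family = binomial(link = "logit"), data = df3)
fit.fwd <- glm(cbind(yes, no) ~ TF + EI + SN + TF:SN, family = binomial(link = "logit"), data = df3)
AIC(fit.back, fit.fwd)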
Question 3. The following table shows a 2×2×6 contingency table for Y= whether admitted to graduate school at the University of California, Berkeley, for fall 1973, by gender of applicant for the six largest graduate departments.
rm(list=ls())
Dept <- rep(1:6, rep(2,6))
Gender <- rep(c("Male","Female"), 6)
Yes <- c(512,89,353,17,120,202,138,131,53,94,22,24)
No <- c(313,19,207,8,205,391,279,244,138,299,351,317)
Data <- data.frame(Dept=Dept, Gender=Gender, Yes=Yes, No=No)
rm(Dept, Gender, Yes, No)
Data
## Dept Gender Yes No
## 1 1 Male 512 313
## 2 1 Female 89 19
## 3 2 Male 353 207
## 4 2 Female 17 8
## 5 3 Male 120 205
## 6 3 Female 202 391
## 7 4 Male 138 279
## 8 4 Female 131 244
## 9 5 Male 53 138
## 10 5 Female 94 299
## 11 6 Male 22 351
## 12 6 Female 24 317
m1 <- glm(cbind(Yes,No) ~ factor(Dept), data=Data, family=binomial)
summary(m1)
##
## Call:
## glm(formula = cbind(Yes, No) ~ factor(Dept), family = binomial,
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4064 -0.4550 0.1456 0.5471 4.1323
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.59346 0.06838 8.679 <2e-16 ***
## factor(Dept)2 -0.05059 0.10968 -0.461 0.645
## factor(Dept)3 -1.20915 0.09726 -12.432 <2e-16 ***
## factor(Dept)4 -1.25833 0.10152 -12.395 <2e-16 ***
## factor(Dept)5 -1.68296 0.11733 -14.343 <2e-16 ***
## factor(Dept)6 -3.26911 0.16707 -19.567 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 877.056 on 11 degrees of freedom
## Residual deviance: 21.736 on 6 degrees of freedom
## AIC: 102.68
##
## Number of Fisher Scoring iterations: 4
sat <- update(m1, ~ factor(Dept)*Gender)
anova(m1, sat, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: cbind(Yes, No) ~ factor(Dept)
## Model 2: cbind(Yes, No) ~ factor(Dept) + Gender + factor(Dept):Gender
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 6 21.735
## 2 0 0.000 6 21.735 0.001352 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-value: 0.001352. Reject the simpler model: admission is not independent of gender given department, so the model that ignores gender does not fit and the gender and department-by-gender terms are needed.
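The same conclusion follows from treating the residual deviance of the department-only model as a goodness-of-fit statistic (a quick check):
pchisq(21.735, df = 6, lower.tail = FALSE)  # about 0.0014, matching the anova() output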
Data$y.hat <- (Data$Yes + Data$No) * fitted(m1)
Data$resid <- rstandard(m1, type="pearson")
Data
## Dept Gender Yes No y.hat resid
## 1 1 Male 512 313 531.43087 -4.1530728
## 2 1 Female 89 19 69.56913 4.1530728
## 3 2 Male 353 207 354.18803 -0.5037077
## 4 2 Female 17 8 15.81197 0.5037077
## 5 3 Male 120 205 113.99782 0.8680662
## 6 3 Female 202 391 208.00218 -0.8680662
## 7 4 Male 138 279 141.63258 -0.5458732
## 8 4 Female 131 244 127.36742 0.5458732
## 9 5 Male 53 138 48.07705 1.0005342
## 10 5 Female 94 299 98.92295 -1.0005342
## 11 6 Male 22 351 24.03081 -0.6197526
## 12 6 Female 24 317 21.96919 0.6197526
The standardized residuals identify department 1 as the outlier: its residuals (4.15 in absolute value) are well above 2, so the model without a gender effect fits poorly there. In department 1, more women (and fewer men) were admitted than the model predicts.
rm(list=ls())
Dept <- rep(1:6, rep(2,6))
Gender <- rep(c("Male","Female"), 6)
Yes <- c(512,89,353,17,120,202,138,131,53,94,22,24)
No <- c(313,19,207,8,205,391,279,244,138,299,351,317)
Data <- data.frame(Dept=Dept, Gender=Gender, Yes=Yes, No=No)
rm(Dept, Gender, Yes, No)
m1 <- glm(cbind(Yes,No) ~ factor(Dept), data=Data, family=binomial)
m2 <- update(m1, ~ . + Gender)
exp(coef(m2))
## (Intercept) factor(Dept)2 factor(Dept)3 factor(Dept)4 factor(Dept)5
## 1.97767415 0.95753028 0.28291804 0.27400567 0.17564230
## factor(Dept)6 GenderMale
## 0.03664494 0.90495497
Count <- c(sum(Data$Yes[Data$Gender=="Male"]),sum(Data$Yes[Data$Gender=="Female"]),sum(Data$No[ Data$Gender=="Male"]),sum(Data$No[ Data$Gender=="Female"]) )
Table <- matrix(Count, 2, 2)
rownames(Table) <- c("Male", "Female")
colnames(Table) <- c("Yes", "No")
Table
## Yes No
## Male 1198 1493
## Female 557 1278
Table[1,1] * Table[2,2]/(Table[1,2] * Table[2,1])
## [1] 1.84108
m1a <- update(m1, ~ Gender)
exp(coef(m1a))
## (Intercept) GenderMale
## 0.4358372 1.8410800
This is Simpson's paradox: conditioning on department (part c), the estimated odds of admission for males are slightly lower than for females (conditional odds ratio 0.905 from the model with Dept + Gender), but marginally, ignoring department, the odds of admission for males are 1.84 times those for females.
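The paradox can also be seen in the raw data by comparing the department-specific sample odds ratios (male vs. female admission) with the marginal odds ratio of 1.84 computed above (a small sketch using the Data frame constructed earlier; or.by.dept is an illustrative name):
odds <- with(Data, Yes / No)
or.by.dept <- odds[Data$Gender == "Male"] / odds[Data$Gender == "Female"]
names(or.by.dept) <- paste("Dept", 1:6)
round(or.by.dept, 2)  # most department-specific odds ratios are near or below 1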
Question 4. The following table displays primary food choice for a sample of alligators, classified by length (≤ 2.3 meters, > 2.3 meters) and by the lake in Florida in which they were caught.
library(nnet)
gator <- read.table(file = "http://users.stat.ufl.edu/~aa/cat/data/Alligators2.dat", header = TRUE)
head(gator)
## lake size y1 y2 y3 y4 y5
## 1 1 1 23 4 2 2 8
## 2 1 0 7 0 1 3 5
## 3 2 1 5 11 1 0 3
## 4 2 0 13 8 6 1 0
## 5 3 1 5 11 2 1 5
## 6 3 0 8 7 6 3 5
names(gator)
## [1] "lake" "size" "y1" "y2" "y3" "y4" "y5"
names(gator) = c("lake", "size", "F", "I", "R", "B", "O")
gator$Lake <- factor(gator$lake, levels = c(4,1,2,3))
Alligatorfit=multinom(cbind(F,I,R,B,O)~Lake+size, data=gator, family=multinomial)
## # weights: 30 (20 variable)
## initial value 352.466903
## iter 10 value 271.607785
## iter 20 value 270.046051
## final value 270.040140
## converged
summary(Alligatorfit)
## Call:
## multinom(formula = cbind(F, I, R, B, O) ~ Lake + size, data = gator,
## family = multinomial)
##
## Coefficients:
## (Intercept) Lake1 Lake2 Lake3 size
## I -1.549021 -1.6581178 0.937237973 1.122002 1.4581457
## R -3.314512 1.2428408 2.458913302 2.935262 -0.3512702
## B -2.093358 0.6954256 -0.652622721 1.088098 -0.6306329
## O -1.904343 0.8263115 0.005792737 1.516461 0.3315514
##
## Std. Errors:
## (Intercept) Lake1 Lake2 Lake3 size
## I 0.4249185 0.6128466 0.4719035 0.4905122 0.3959418
## R 1.0530577 1.1854031 1.1181000 1.1163844 0.5800207
## B 0.6622972 0.7813123 1.2020025 0.8417085 0.6424863
## O 0.5258313 0.5575446 0.7765655 0.6214371 0.4482504
##
## Residual Deviance: 540.0803
## AIC: 580.0803
Prediction equations (F = fish is the baseline category, and lake 4 is the reference level of Lake):
Invertebrate: log(π̂_I/π̂_F) = -1.549021 - 1.6581178 I(lake=1) + 0.937237973 I(lake=2) + 1.122002 I(lake=3) + 1.4581457 size
Reptile: log(π̂_R/π̂_F) = -3.314512 + 1.2428408 I(lake=1) + 2.458913302 I(lake=2) + 2.935262 I(lake=3) - 0.3512702 size
Bird: log(π̂_B/π̂_F) = -2.093358 + 0.6954256 I(lake=1) - 0.652622721 I(lake=2) + 1.088098 I(lake=3) - 0.6306329 size
Other: log(π̂_O/π̂_F) = -1.904343 + 0.8263115 I(lake=1) + 0.005792737 I(lake=2) + 1.516461 I(lake=3) + 0.3315514 size
For a given lake, the estimated odds that the primary food choice is invertebrate rather than fish are exp(1.4581457) = 4.297982 times as high for small alligators (size = 1, length ≤ 2.3 meters) as for large ones (size = 0).
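The multiplicative (odds ratio) interpretation of every coefficient can be read off by exponentiating the coefficient matrix (a quick check using the Alligatorfit object):
round(exp(coef(Alligatorfit)), 3)  # e.g. the size entry in the I row is exp(1.458) = 4.298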
predict(Alligatorfit, data.frame(size=0,Lake=factor(2)), type="probs")
## F I R B O
## 0.45842485 0.24864187 0.19484367 0.02942414 0.06866547
predict(Alligatorfit, data.frame(size=1,Lake=factor(2)), type="probs")
## F I R B O
## 0.258189860 0.601880037 0.077232953 0.008820525 0.053876625
Estimated probabilities that the primary food choice is fish in Lake Oklawaha (lake 2): 0.45842485 for size = 0 (length > 2.3 m) and 0.25818986 for size = 1 (length ≤ 2.3 m).
Question 5. A model fit predicting preference for President (Democrat, Republican, Independent) using x = annual income (in $10,000s) is
log[π̂_R(x)/π̂_I(x)] = 1.0 + 0.3x
log[π̂_D(x)/π̂_I(x)] = 3.3 - 0.2x
1.0-3.3
## [1] -2.3
0.3+0.2
## [1] 0.5
exp(.5)
## [1] 1.648721
Answer: log[π̂_R(x)/π̂_D(x)] = log[π̂_R(x)/π̂_I(x)] - log[π̂_D(x)/π̂_I(x)] = (1.0 - 3.3) + (0.3 + 0.2)x = -2.3 + 0.5x. The estimated odds of choosing the Republican over the Democrat are multiplied by exp(0.5) = 1.648721 for each additional $10,000 of annual income.
2.3/0.5
## [1] 4.6
The Republican is the more likely choice relative to the Democrat when -2.3 + 0.5x > 0, i.e. x > 2.3/0.5 = 4.6, which corresponds to annual incomes above 4.6 × $10,000 = $46,000.
Since I (Independent) is the baseline category, α_I = β_I = 0, so
π̂_I(x) = exp(α_I + β_I x) / [exp(α_R + β_R x) + exp(α_D + β_D x) + exp(α_I + β_I x)]
       = exp(0) / [exp(1.0 + 0.3x) + exp(3.3 - 0.2x) + exp(0)]
       = 1 / [exp(1.0 + 0.3x) + exp(3.3 - 0.2x) + 1].
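A small R sketch of these formulas (the function name pi.I is illustrative, not from a fitted object): pi.I evaluates the estimated probability of preferring the Independent at a given income, and the root of the Republican-vs-Democrat logit confirms the $46,000 crossover.
pi.I <- function(x) 1 / (exp(1.0 + 0.3 * x) + exp(3.3 - 0.2 * x) + 1)
pi.I(c(0, 4.6, 10))  # estimated P(Independent) at incomes of $0, $46,000, $100,000
uniroot(function(x) -2.3 + 0.5 * x, c(0, 10))$root  # x = 4.6, i.e. income of $46,000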
Question 6. Is marital happiness associated with family income? For a General Social Survey, counts in the happiness categories (not, pretty, very) were (6,43,75) for below-average income, (6,113,178) for average income, and (6,57,117) for above-average income. Fit a baseline-category logit model with not happy as the baseline category and scores (1,2,3) for the income categories (below average, average, above average).
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
Not <- c(6,6,6);
Pretty <- c(43,113,57);
Very <- c(75,178,117);
Income <- c("Below", "Average", "Above")
data.frame(Income, Not, Pretty, Very)
## Income Not Pretty Very
## 1 Below 6 43 75
## 2 Average 6 113 178
## 3 Above 6 57 117
score = c(1,2,3)
jobsat.ind <- vglm(cbind(Pretty,Very,Not) ~ score, family=multinomial)
jobsat.ind
##
## Call:
## vglm(formula = cbind(Pretty, Very, Not) ~ score, family = multinomial)
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 score:1 score:2
## 2.2038939 2.5551795 0.1313533 0.2275057
##
## Degrees of Freedom: 6 Total; 2 Residual
## Residual deviance: 3.190889
## Log-likelihood: -15.38641
##
## This is a multinomial logit model with 3 levels
Prediction equations (with x = income score and Not as the baseline category):
log(π̂_Pretty/π̂_Not) = 2.2038939 + 0.1313533x
log(π̂_Very/π̂_Not) = 2.5551795 + 0.2275057x
A one-unit increase in income score multiplies the estimated odds of being pretty happy rather than not happy by exp(0.1313533) = 1.140371 (and the odds of being very happy rather than not happy by exp(0.2275057) = 1.2555).
jobsat.ind <- vglm(cbind(Pretty,Very,Not) ~ score, multinomial)
summary(jobsat.ind)
##
## Call:
## vglm(formula = cbind(Pretty, Very, Not) ~ score, family = multinomial)
##
## Pearson residuals:
## log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
## 1 -0.9011 -0.1378
## 2 1.2327 0.2369
## 3 -0.8408 -0.1936
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 2.2039 0.7376 2.988 0.002807 **
## (Intercept):2 2.5552 0.7256 3.521 0.000429 ***
## score:1 0.1314 0.3468 0.379 0.704906
## score:2 0.2275 0.3412 0.667 0.504907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
##
## Residual deviance: 3.1909 on 2 degrees of freedom
##
## Log-likelihood: -15.3864 on 2 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Reference group is level 3 of the response
jobsat.ind1 <- vglm(cbind(Pretty,Very,Not) ~ 1, multinomial)
summary(jobsat.ind1)
##
## Call:
## vglm(formula = cbind(Pretty, Very, Not) ~ 1, family = multinomial)
##
## Pearson residuals:
## log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
## 1 -0.8316 -0.8716
## 2 1.2691 0.1709
## 3 -0.9400 0.5038
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 2.4709 0.2455 10.07 <2e-16 ***
## (Intercept):2 3.0231 0.2414 12.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
##
## Residual deviance: 4.1348 on 4 degrees of freedom
##
## Log-likelihood: -15.8583 on 4 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Reference group is level 3 of the response
lrtest(jobsat.ind,jobsat.ind1)
## Likelihood ratio test
##
## Model 1: cbind(Pretty, Very, Not) ~ score
## Model 2: cbind(Pretty, Very, Not) ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -15.386
## 2 4 -15.858 2 0.9439 0.6238
H0: no association between income and marital happiness (both score coefficients are zero); HA: there is an association. The p-value is 0.6238, so we fail to reject H0: there is no evidence of an association.
The model fits adequately. Compared with the saturated model, the deviance is G^2 = 3.1909 on 2 df; the 0.05 critical value is 5.991465 (equivalently, the p-value is 0.2028), so we fail to reject the null hypothesis that the model fits.
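The critical value and p-value quoted here come from the chi-squared distribution with 2 degrees of freedom (a quick check):
qchisq(0.95, df = 2)  # critical value, about 5.99
pchisq(3.1909, df = 2, lower.tail = FALSE)  # p-value, about 0.203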
jobsat.ind <- vglm(cbind(Pretty,Very,Not) ~ score, family=multinomial)
predict(jobsat.ind,data.frame(score=2), type="response")
## Pretty Very Not
## 1 0.3562457 0.6135187 0.0302356
The estimated probability that a person with average family income (score = 2) reports a very happy marriage is 0.6135187.
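These fitted probabilities can be reproduced by hand from the baseline-category coefficients (a small sketch; eta1 and eta2 are illustrative names):
eta1 <- 2.2038939 + 0.1313533 * 2  # log(pi_Pretty / pi_Not) at score = 2
eta2 <- 2.5551795 + 0.2275057 * 2  # log(pi_Very / pi_Not) at score = 2
c(Pretty = exp(eta1), Very = exp(eta2), Not = 1) / (1 + exp(eta1) + exp(eta2))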
Question 7. For a recent General Social Survey, a prediction equation relating Y = job satisfaction (four ordered categories; 1 = least satisfied) to the subject's report of x1 = earnings compared to others with similar positions (four ordered categories; 1 = much less, 4 = much more), x2 = freedom to make decisions about how to do the job (four ordered categories; 1 = very true, 4 = not true at all), and x3 = work environment allows productivity (four ordered categories; 1 = strongly agree, 4 = strongly disagree), was
logit[P̂(Y ≤ j)] = α̂_j - 0.54x1 + 0.60x2 + 1.19x3.
For x1: since β̂1 = -0.54 < 0, P(Y ≤ j) is decreasing in x1, so larger values of Y (greater satisfaction) tend to occur with larger values of x1.
For x2: since β̂2 = 0.60 > 0, P(Y ≤ j) is increasing in x2, so larger values of Y tend to occur with smaller values of x2.
For x3: since β̂3 = 1.19 > 0, P(Y ≤ j) is increasing in x3, so larger values of Y tend to occur with smaller values of x3.
Job satisfaction is therefore predicted to be highest at x1 = 4, x2 = 1, x3 = 1.
If the model were expressed in terms of logit[P̂(Y ≥ j)] instead, every sign would reverse: the coefficient of x1 would be +0.54, and those of x2 and x3 would be -0.60 and -1.19, because P(Y ≥ j) = 1 - P(Y ≤ j-1), so logit[P̂(Y ≥ j)] = -logit[P̂(Y ≤ j-1)].
Question 8. Refer to the marital happiness data from Problem 6. Fit a proportional odds logit model for predicting marital happiness from family income, using scores (1,2,3) for the income categories.
library(VGAM)
Not <- c(6,6,6);
Pretty <- c(43,113,57);
Very <- c(75,178,117);
Income <- c("Below", "Average", "Above")
data.frame(Income, Not, Pretty, Very)
## Income Not Pretty Very
## 1 Below 6 43 75
## 2 Average 6 113 178
## 3 Above 6 57 117
scores<-c(1,2,3)
logitmodel <- vglm(cbind(Not,Pretty,Very) ~ scores, family=propodds)
logitmodel
##
## Call:
## vglm(formula = cbind(Not, Pretty, Very) ~ scores, family = propodds)
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 scores
## 3.2466448 0.2377548 0.1117423
##
## Degrees of Freedom: 6 Total; 3 Residual
## Residual deviance: 3.247165
## Log-likelihood: -15.41455
Prediction equations (cumulative logits with the categories ordered Not < Pretty < Very and x = income score; VGAM's propodds family models P(Y ≥ j)):
logit[P̂(Y ≥ Pretty)] = 3.2466448 + 0.1117423x
logit[P̂(Y ≥ Very)] = 0.2377548 + 0.1117423x
A one-unit increase in income score multiplies the estimated odds of being above any fixed happiness cutpoint (i.e., of being happier) by exp(0.1117423) = 1.118225.
logitmodel <- vglm(cbind(Not,Pretty,Very) ~ scores, family=propodds)
logitmodel1 <- vglm(cbind(Not,Pretty,Very) ~ 1, family=propodds)
summary(logitmodel)
##
## Call:
## vglm(formula = cbind(Not, Pretty, Very) ~ scores, family = propodds)
##
## Pearson residuals:
## logitlink(P[Y>=2]) logitlink(P[Y>=3])
## 1 -0.9484 0.5775
## 2 1.0457 -0.6768
## 3 -0.5406 0.3904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 3.2466 0.3404 9.537 <2e-16 ***
## (Intercept):2 0.2378 0.2592 0.917 0.359
## scores 0.1117 0.1179 0.948 0.343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y>=2]), logitlink(P[Y>=3])
##
## Residual deviance: 3.2472 on 3 degrees of freedom
##
## Log-likelihood: -15.4146 on 3 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Exponentiated coefficients:
## scores
## 1.118225
lrtest(logitmodel,logitmodel1)
## Likelihood ratio test
##
## Model 1: cbind(Not, Pretty, Very) ~ scores
## Model 2: cbind(Not, Pretty, Very) ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -15.415
## 2 4 -15.858 1 0.8876 0.3461
P-value = 0.3461, so we fail to reject the null hypothesis that there is no association between income and marital happiness.
The model fits adequately. Compared with the saturated model, the deviance is G^2 = 3.2472 on 3 df; the 0.05 critical value is 7.814728 (equivalently, the p-value is 0.3551), so we fail to reject the null hypothesis that the model fits.
predict(logitmodel,data.frame(score=2), type="response")
## Not Pretty Very
## 1 0.03362159 0.3798828 0.5864956
## 2 0.03017419 0.3565176 0.6133082
## 3 0.02707037 0.3334787 0.6394509
Because the newdata variable score does not match the model variable scores, predict() fell back on the scores vector in the workspace and returned fitted probabilities for all three income scores (rows 1-3 correspond to scores 1, 2, 3). The estimated probability of a very happy marriage at average family income (scores = 2) is therefore 0.6133082.
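As a check, the row for the average income score can be reproduced from the cumulative-logit coefficients (a small sketch; p.ge2 and p.ge3 are illustrative names):
p.ge2 <- plogis(3.2466448 + 0.1117423 * 2)  # P(Y >= Pretty) at scores = 2
p.ge3 <- plogis(0.2377548 + 0.1117423 * 2)  # P(Y >= Very) at scores = 2
c(Not = 1 - p.ge2, Pretty = p.ge2 - p.ge3, Very = p.ge3)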