Figure 10.25 shows a plot of empirical logits for a dataset with two predictors: X, a continuous variable, and Group, a categorical variable with two levels, 1 and 2. The circles are for Group=1 and the triangles are for Group=2. What model is suggested by this plot?
\[ log(\frac{\pi}{1-\pi})=\beta_0 +\beta_1X +\beta_2Group_1 \] where \(Group_1\) if an indicator variable for Group (either 1 or 2)
Figure 10.27 shows a plot of empirical logits for a dataset with two predictors: X, a continuous variable, and Group, a categorical variable with two levels, 1 and 2. The circles are for Group=1 and the triangles are for Group=2. What model is suggested by this plot?
\[ log(\frac{\pi}{1-\pi})=\beta_0 +\beta_1X +\beta_2Group_1+\beta_3X\cdot Group_1 \] where \(Group_1\) if an indicator variable for Group (either 1 or 2)
Consider the medical school data presented in this chapter. For each of the questions listed below, a nested likelihood ratio test could be used. In each case, state the reduced and full models for the likelihood ratio test.
data("MedGPA")
logm1 = glm(Acceptance ~ GPA+MCAT+Sex, data=MedGPA)
summary(logm1)
##
## Call:
## glm(formula = Acceptance ~ GPA + MCAT + Sex, data = MedGPA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.76317 -0.42660 0.08646 0.31157 0.81582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.73079 0.72380 -3.773 0.000422 ***
## GPA 0.75408 0.23896 3.156 0.002686 **
## MCAT 0.01863 0.01416 1.316 0.194108
## SexM -0.16070 0.11407 -1.409 0.164946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1770009)
##
## Null deviance: 13.636 on 54 degrees of freedom
## Residual deviance: 9.027 on 51 degrees of freedom
## AIC: 66.692
##
## Number of Fisher Scoring iterations: 2
\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex \]
and the full model for this is
\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex+\beta_3GPA \]
we see that the p-value for GPA is more significant than MCAT and Sex, therefore, GPA does contribute to predicting Acceptance beyond the effects of MCAT and Sex.
logm2 = glm(Acceptance ~ MCAT + Sex + MCAT*Sex, data=MedGPA)
summary(logm2)
##
## Call:
## glm(formula = Acceptance ~ MCAT + Sex + MCAT * Sex, data = MedGPA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7791 -0.4324 0.1843 0.4043 0.7536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.687694 0.782229 -0.879 0.3834
## MCAT 0.036669 0.021424 1.712 0.0931 .
## SexM -0.554181 0.984609 -0.563 0.5760
## MCAT:SexM 0.009839 0.026926 0.365 0.7163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.2110099)
##
## Null deviance: 13.636 on 54 degrees of freedom
## Residual deviance: 10.762 on 51 degrees of freedom
## AIC: 76.359
##
## Number of Fisher Scoring iterations: 2
\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex \]
and the full model for this is
\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex+\beta_3Sex\cdot MCAT \]
we see that the interaction term generated is not significant, therefore, the relationship between logit(Acceptance) and MCAT is the same for men as for women.
The Corporate Average Fuel Economy (CAFE) bill was proposed by Senators John McCain and John Kerry to improve the fuel economy of cars and light trucks sold in the United States. However, the bill was, in effectm indefinitely postponed when an amendment was passed, by a vote of 62-38, that charged the National Highway Traffic Safety Administration to develop a new standard. The file CAFE contains data on this vote, including the vote of each senator (Vote, which is 1 for Yes and 0 for No), whether or not each senator is a Democrat (Dem, which is 1 for Democrats and 0 for Republicans),* monetary contributions that each of the 100 senators received over his or her lifetime from car manufacturers (Contribution), and lofContr, which is the natural log of (1+Contribution). As we might expect, there is a strong positive association between contributions from car manufacturers and voting Yes on the CAFE amendment. But is the effect the same for Democrats as for Republicans? That is, we wonder whether the slope between logit(Vote) and logContr is the same for the two parties.
data(CAFE)
logm3 = glm(Vote ~ LogContr + Dem + LogContr*Dem, data=CAFE, family=binomial)
summary(logm3)
##
## Call:
## glm(formula = Vote ~ LogContr + Dem + LogContr * Dem, family = binomial,
## data = CAFE)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5711 -0.6419 0.3023 0.5631 2.2532
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.164 5.401 -1.882 0.0599 .
## LogContr 3.002 1.357 2.212 0.0270 *
## Dem 2.544 5.974 0.426 0.6703
## LogContr:Dem -1.088 1.515 -0.719 0.4724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.813 on 99 degrees of freedom
## Residual deviance: 86.781 on 96 degrees of freedom
## AIC: 94.781
##
## Number of Fisher Scoring iterations: 5
lrtest(logm3)
## Likelihood ratio test
##
## Model 1: Vote ~ LogContr + Dem + LogContr * Dem
## Model 2: Vote ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 4 -43.391
## 2 1 -66.406 -3 46.031 5.585e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In Exercise 9.27-9.32, we considered the data on the passengerswho survived and those who died when the oceanliner Titanic sank on it maiden voyage in 1912. The dataset in Titanic includes the following variables:
| Name | Description |
|---|---|
| Age | which gives the passenger’s age in years |
| Sex | which gives the passenger’s sex (male or female) |
| Survived | A binary variable, where 1 indicates the passenger survived and 0 indicates death |
| SexCode | which numerically codes male as 0 and female as 1 |
data(Titanic)
logm5 <- glm(Survived ~ Age+Sex, data=Titanic, family=binomial)
summary(logm5)
##
## Call:
## glm(formula = Survived ~ Age + Sex, family = binomial, data = Titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7541 -0.6905 -0.6504 0.7576 1.8628
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.306157 0.231052 5.653 1.58e-08 ***
## Age -0.006352 0.006187 -1.027 0.305
## Sexmale -2.465996 0.178455 -13.819 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1025.57 on 755 degrees of freedom
## Residual deviance: 795.59 on 753 degrees of freedom
## (557 observations deleted due to missingness)
## AIC: 801.59
##
## Number of Fisher Scoring iterations: 4
\[ logit(\hat{\pi})=1.306157-0.00635Age+2.466Sex\\ \hat{\pi}=\frac{e^{1.306157-0.00635Age+2.466Sex}}{1+e^{1.306157-0.00635Age+2.466Sex}}\\ logit(\hat{\pi})=-1.160-0.00635Age+2.466Sex\\ \hat{\pi}=\frac{e^{-1.160-0.00635Age+2.466Sex}}{1+e^{-1.160-0.00635Age+2.466Sex}} \]
exp(-1.160-0.00635*(18))
## [1] 0.2796266
0.2796266/1.279627
## [1] 0.218522
\[ \widehat{odds}=0.2796266\\ \hat{\pi}=0.218522 \]
exp(1.306157-0.00635*(18))
## [1] 3.293191
3.293191/4.293191
## [1] 0.767073
\[ \widehat{odds}=3.293191\\ \hat{\pi}=0.767073 \]
3.293191/0.2796266
## [1] 11.7771
exp(-1.160-0.00635*(50))
## [1] 0.2282075
exp(-1.160-0.00635*(50))/(1+exp(-1.160-0.00635*(50)))
## [1] 0.1858053
\[ \widehat{odds_m}=0.2282075\\ \hat{\pi_m}=0.1858053 \]
exp(1.306157-0.00635*(50))
## [1] 2.687623
exp(1.306157-0.00635*(50))/(1+exp(1.306157-0.00635*(50)))
## [1] 0.7288226
\[ \widehat{odds_f}=2.687623\\ \hat{\pi_f}=0.7288226 \]
2.687623/0.2282075
## [1] 11.7771
The file Election16 contains data from 50 states related to the 2016 U.S. presidential election. One variable is TrumpWin, which is 1 for the states won by Donald Trump, the Republican candidate, and 0 for states that Trump lost. Among the potential predictors are the per capita Income, percentages of adults with high school (HS), or college (BS) degrees, and a measure (Dem.Rep) of the %Democrat-%Republican leaning in the state.
data("Election16")
logm6 <- glm(TrumpWin ~ Dem.Rep + HS + BA + Income, data=Election16, family=binomial)
summary(logm6)
##
## Call:
## glm(formula = TrumpWin ~ Dem.Rep + HS + BA + Income, family = binomial,
## data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.44654 -0.05377 0.06772 0.25970 1.35391
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.8785550 19.3235173 0.615 0.5387
## Dem.Rep -0.2520207 0.1125109 -2.240 0.0251 *
## HS 0.0663105 0.2509851 0.264 0.7916
## BA -0.2753913 0.2733136 -1.008 0.3136
## Income -0.0001785 0.0001436 -1.243 0.2139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 19.555 on 45 degrees of freedom
## AIC: 29.555
##
## Number of Fisher Scoring iterations: 7
coeftest(logm6)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.87855497 19.32351728 0.6147 0.53874
## Dem.Rep -0.25202068 0.11251085 -2.2400 0.02509 *
## HS 0.06631048 0.25098508 0.2642 0.79163
## BA -0.27539127 0.27331357 -1.0076 0.31365
## Income -0.00017849 0.00014362 -1.2428 0.21394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data("Election16")
logm6 <- glm(TrumpWin ~ Dem.Rep + HS + BA + Income, data=Election16, family=binomial)
summary(logm6)
##
## Call:
## glm(formula = TrumpWin ~ Dem.Rep + HS + BA + Income, family = binomial,
## data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.44654 -0.05377 0.06772 0.25970 1.35391
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.8785550 19.3235173 0.615 0.5387
## Dem.Rep -0.2520207 0.1125109 -2.240 0.0251 *
## HS 0.0663105 0.2509851 0.264 0.7916
## BA -0.2753913 0.2733136 -1.008 0.3136
## Income -0.0001785 0.0001436 -1.243 0.2139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 19.555 on 45 degrees of freedom
## AIC: 29.555
##
## Number of Fisher Scoring iterations: 7
Election16_logm6 = augment(logm6, data = Election16)
Election16_logm6 = Election16_logm6 %>%
mutate(odds = exp(.resid), probability=odds/(1+odds))
library(leaps)
back = regsubsets(TrumpWin~Dem.Rep+HS+BA+Income, data=Election16, nbest=1,nvmax=4,method="backward")
summary(back)
## Subset selection object
## Call: regsubsets.formula(TrumpWin ~ Dem.Rep + HS + BA + Income, data = Election16,
## nbest = 1, nvmax = 4, method = "backward")
## 4 Variables (and intercept)
## Forced in Forced out
## Dem.Rep FALSE FALSE
## HS FALSE FALSE
## BA FALSE FALSE
## Income FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
## Dem.Rep HS BA Income
## 1 ( 1 ) " " " " "*" " "
## 2 ( 1 ) "*" " " "*" " "
## 3 ( 1 ) "*" " " "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
with(summary(back),data.frame(cp,outmat))
## cp Dem.Rep HS BA Income
## 1 ( 1 ) 21.019066 *
## 2 ( 1 ) 2.892274 * *
## 3 ( 1 ) 3.137647 * * *
## 4 ( 1 ) 5.000000 * * * *
log7 <- glm(TrumpWin ~ Dem.Rep+BA, data=Election16, family=binomial)
summary(log7)
##
## Call:
## glm(formula = TrumpWin ~ Dem.Rep + BA, family = binomial, data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93446 -0.10841 0.07757 0.30645 1.22311
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.34796 6.13130 2.503 0.0123 *
## Dem.Rep -0.21406 0.08778 -2.439 0.0147 *
## BA -0.51792 0.21181 -2.445 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 21.080 on 47 degrees of freedom
## AIC: 27.08
##
## Number of Fisher Scoring iterations: 7
\[ logit(\hat{\pi})=15.34796-0.21406\cdot Dem.Rep-0.51792\cdot BA \]
Figure 12.46 shows autocorrelation plots for five different series. Based on just the plots, which series do you think would need at least one regular difference to help achieve stationary? Note: The next exercist asks a similar question about seasonal differences.
As in Exercise 12.7, we are interested in guessing whether differencing might be needed to help with stationary for the time series that are plotted in Figure 12.46. In this exercise, comment on whether you think a seasonal difference might be required.
The five time series (Series A through Series E) shown in the time series plot of Figure 12.45 for exercises 12.5 and 12.6 are the same as the five series (Series 1 through Series 5) shown in the ACF plots of Figure 12.46 for Exercises 12.7 and 12.8, but the order has been scrambled. Match each series with its ACF and explain your reasoning.
In Example 12.16 on page 562 we consider an MA(1) model for the percentage differences of the of the monthly CPI data (CPIPctDiff in the dataset Inflation). In this exercise we consider moving average models
data("Inflation")
CPI <- ts(Inflation$CPI, start = c(2009,1), frequency = 12)
m1 <- Arima(CPI, order=c(0,1,1), include.constant=TRUE)
m1
## Series: CPI
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.4278 0.3223
## s.e. 0.0778 0.0915
##
## sigma^2 estimated as 0.4016: log likelihood=-90.49
## AIC=186.99 AICc=187.25 BIC=194.65
coeftest(m1)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.427811 0.077798 5.4990 3.82e-08 ***
## drift 0.322258 0.091508 3.5216 0.0004289 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ \hat{y}=0.322+0.428\cdot \hat{\epsilon}_{t-1} \]
Acf(residuals(m1))
m2 <- Arima(CPI, order = c(0,1,2), include.constant = TRUE)
m2
## Series: CPI
## ARIMA(0,1,2) with drift
##
## Coefficients:
## ma1 ma2 drift
## 0.5794 0.3444 0.3254
## s.e. 0.0969 0.1137 0.1179
##
## sigma^2 estimated as 0.3739: log likelihood=-86.72
## AIC=181.43 AICc=181.88 BIC=191.65
coeftest(m2)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.579443 0.096943 5.9772 2.271e-09 ***
## ma2 0.344422 0.113709 3.0290 0.002454 **
## drift 0.325427 0.117908 2.7600 0.005780 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data("CO2Hawaii")
CO2Hawaii <- CO2Hawaii %>%
mutate(Xcos = cos(2*pi*t/12), Xsin = sin(2*pi*t/12))
m3 <- lm(CO2~t+I(t^2)+Xcos+Xsin, data=CO2Hawaii)
summary(m3)
##
## Call:
## lm(formula = CO2 ~ t + I(t^2) + Xcos + Xsin, data = CO2Hawaii)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0962 -0.5816 -0.0353 0.6347 1.7666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.505e+02 1.308e-01 2679.51 <2e-16 ***
## t 1.052e-01 1.673e-03 62.89 <2e-16 ***
## I(t^2) 1.473e-04 4.489e-06 32.82 <2e-16 ***
## Xcos -1.617e+00 6.132e-02 -26.37 <2e-16 ***
## Xsin 2.466e+00 6.134e-02 40.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8227 on 355 degrees of freedom
## Multiple R-squared: 0.9976, Adjusted R-squared: 0.9976
## F-statistic: 3.666e+04 on 4 and 355 DF, p-value: < 2.2e-16
m4 <- lm(CO2~t+I(t^2)+factor(Month), data=CO2Hawaii)
summary(m4)
##
## Call:
## lm(formula = CO2 ~ t + I(t^2) + factor(Month), data = CO2Hawaii)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.46022 -0.38223 0.00451 0.44292 1.35235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.507e+02 1.349e-01 2600.647 < 2e-16 ***
## t 1.052e-01 1.166e-03 90.200 < 2e-16 ***
## I(t^2) 1.473e-04 3.128e-06 47.090 < 2e-16 ***
## factor(Month)2 6.408e-01 1.480e-01 4.328 1.97e-05 ***
## factor(Month)3 1.377e+00 1.480e-01 9.303 < 2e-16 ***
## factor(Month)4 2.561e+00 1.480e-01 17.297 < 2e-16 ***
## factor(Month)5 2.941e+00 1.480e-01 19.863 < 2e-16 ***
## factor(Month)6 2.129e+00 1.481e-01 14.379 < 2e-16 ***
## factor(Month)7 3.702e-01 1.481e-01 2.500 0.0129 *
## factor(Month)8 -1.879e+00 1.481e-01 -12.692 < 2e-16 ***
## factor(Month)9 -3.558e+00 1.481e-01 -24.034 < 2e-16 ***
## factor(Month)10 -3.515e+00 1.481e-01 -23.740 < 2e-16 ***
## factor(Month)11 -2.254e+00 1.481e-01 -15.224 < 2e-16 ***
## factor(Month)12 -9.761e-01 1.481e-01 -6.592 1.62e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5734 on 346 degrees of freedom
## Multiple R-squared: 0.9989, Adjusted R-squared: 0.9988
## F-statistic: 2.325e+04 on 13 and 346 DF, p-value: < 2.2e-16
predict(m3, newdata=data.frame(t=370,
Xcos = cos(2*pi*370/12),
Xsin = sin(2*pi*370/12)))
## 1
## 406.6855
predict(m4, newdata=data.frame(t=370, Month = 7))
## 1
## 410.1708
m5 <- lm(CO2 ~ factor(Month), data=CO2Hawaii)
summary(m5)
##
## Call:
## lm(formula = CO2 ~ factor(Month), data = CO2Hawaii)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.831 -15.099 -1.349 14.038 30.909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 375.2207 3.0678 122.308 <2e-16 ***
## factor(Month)2 0.7977 4.3386 0.184 0.854
## factor(Month)3 1.6913 4.3386 0.390 0.697
## factor(Month)4 3.0323 4.3386 0.699 0.485
## factor(Month)5 3.5700 4.3386 0.823 0.411
## factor(Month)6 2.9163 4.3386 0.672 0.502
## factor(Month)7 1.3160 4.3386 0.303 0.762
## factor(Month)8 -0.7747 4.3386 -0.179 0.858
## factor(Month)9 -2.2950 4.3386 -0.529 0.597
## factor(Month)10 -2.0923 4.3386 -0.482 0.630
## factor(Month)11 -0.6720 4.3386 -0.155 0.877
## factor(Month)12 0.7660 4.3386 0.177 0.860
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.8 on 348 degrees of freedom
## Multiple R-squared: 0.01245, Adjusted R-squared: -0.01877
## F-statistic: 0.3987 on 11 and 348 DF, p-value: 0.9562
ts_m5 <- ts(fitted.values(m5), start = c(1998,1), frequency = 12)
plot(ts_m5, col = "blue")
ts_CO2 <- ts(CO2Hawaii$CO2, start = c(1998, 1), frequency = 12)
plot(ts_CO2, col = "red")
plot(ts_CO2, col = "red")
lines(ts_m5, col = "blue")