10.3 Empirical logits

Figure 10.25 shows a plot of empirical logits for a dataset with two predictors: X, a continuous variable, and Group, a categorical variable with two levels, 1 and 2. The circles are for Group=1 and the triangles are for Group=2. What model is suggested by this plot?

The roughly parallel trends for the two groups suggest a model with no interaction term:

\[ log(\frac{\pi}{1-\pi})=\beta_0 +\beta_1X +\beta_2Group_1 \] where \(Group_1\) is an indicator variable for Group (1 for Group 1, 0 for Group 2)
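For reference, empirical logits like these can be computed by slicing X into intervals within each group. A minimal sketch, assuming a data frame dat with a continuous X, a binary response Y, and the two-level Group (all names here are illustrative, not from the exercise):

library(dplyr)
emp <- dat %>%
  mutate(bin = cut(X, breaks = 8)) %>%        # slice X into intervals
  group_by(Group, bin) %>%
  summarise(midX = mean(X),
            p = (sum(Y) + 0.5)/(n() + 1),     # adjusted proportion in each slice
            logit = log(p/(1 - p)), .groups = "drop")
plot(logit ~ midX, data = emp, pch = ifelse(emp$Group == 1, 1, 2))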

10.5 Empirical logits again

Figure 10.27 shows a plot of empirical logits for a dataset with two predictors: X, a continuous variable, and Group, a categorical variable with two levels, 1 and 2. The circles are for Group=1 and the triangles are for Group=2. What model is suggested by this plot?

The two groups show clearly different slopes, suggesting a model with an interaction term:

\[ log(\frac{\pi}{1-\pi})=\beta_0 +\beta_1X +\beta_2Group_1+\beta_3X\cdot Group_1 \] where \(Group_1\) is an indicator variable for Group (1 for Group 1, 0 for Group 2)

10.11 Medical school acceptance: Nested likelihood ratio tests

Consider the medical school data presented in this chapter. For each of the questions listed below, a nested likelihood ratio test could be used. In each case, state the reduced and full models for the likelihood ratio test.

  a. Does GPA contribute to predicting Acceptance beyond the effects of MCAT and Sex?
data("MedGPA")
logm1 = glm(Acceptance ~ GPA + MCAT + Sex, data = MedGPA, family = binomial)   # family = binomial for logistic regression
summary(logm1)
## 
## Call:
## glm(formula = Acceptance ~ GPA + MCAT + Sex, data = MedGPA)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.76317  -0.42660   0.08646   0.31157   0.81582  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.73079    0.72380  -3.773 0.000422 ***
## GPA          0.75408    0.23896   3.156 0.002686 ** 
## MCAT         0.01863    0.01416   1.316 0.194108    
## SexM        -0.16070    0.11407  -1.409 0.164946    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1770009)
## 
##     Null deviance: 13.636  on 54  degrees of freedom
## Residual deviance:  9.027  on 51  degrees of freedom
## AIC: 66.692
## 
## Number of Fisher Scoring iterations: 2

The reduced model for this test is

\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex \]

and the full model for this is

\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex+\beta_3GPA \]

The P-value for GPA is much smaller than those for MCAT and Sex, so GPA does contribute to predicting Acceptance beyond the effects of MCAT and Sex.
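A sketch of the corresponding drop-in-deviance test for this question (output not shown):

reduced1 <- glm(Acceptance ~ MCAT + Sex, data = MedGPA, family = binomial)
full1 <- glm(Acceptance ~ MCAT + Sex + GPA, data = MedGPA, family = binomial)
anova(reduced1, full1, test = "Chisq")   # nested likelihood ratio test for GPA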

  b. Is the relationship between logit(Acceptance) and MCAT the same for men as for women?
logm2 = glm(Acceptance ~ MCAT + Sex + MCAT*Sex, data = MedGPA, family = binomial)   # family = binomial for logistic regression
summary(logm2)
## 
## Call:
## glm(formula = Acceptance ~ MCAT + Sex + MCAT * Sex, data = MedGPA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7791  -0.4324   0.1843   0.4043   0.7536  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.687694   0.782229  -0.879   0.3834  
## MCAT         0.036669   0.021424   1.712   0.0931 .
## SexM        -0.554181   0.984609  -0.563   0.5760  
## MCAT:SexM    0.009839   0.026926   0.365   0.7163  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2110099)
## 
##     Null deviance: 13.636  on 54  degrees of freedom
## Residual deviance: 10.762  on 51  degrees of freedom
## AIC: 76.359
## 
## Number of Fisher Scoring iterations: 2

The reduced model for this test is

\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex \]

and the full model for this is

\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex+\beta_3Sex\cdot MCAT \]

The interaction term is not significant, so we have no evidence that the relationship between logit(Acceptance) and MCAT differs for men and women.
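A sketch of the corresponding drop-in-deviance test for the interaction (output not shown):

reduced2 <- glm(Acceptance ~ MCAT + Sex, data = MedGPA, family = binomial)
full2 <- glm(Acceptance ~ MCAT * Sex, data = MedGPA, family = binomial)
anova(reduced2, full2, test = "Chisq")   # nested likelihood ratio test for MCAT:Sex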

10.17 CAFE

The Corporate Average Fuel Economy (CAFE) bill was proposed by Senators John McCain and John Kerry to improve the fuel economy of cars and light trucks sold in the United States. However, the bill was, in effect, indefinitely postponed when an amendment was passed, by a vote of 62-38, that charged the National Highway Traffic Safety Administration to develop a new standard. The file CAFE contains data on this vote, including the vote of each senator (Vote, which is 1 for Yes and 0 for No), whether or not each senator is a Democrat (Dem, which is 1 for Democrats and 0 for Republicans), monetary contributions that each of the 100 senators received over his or her lifetime from car manufacturers (Contribution), and LogContr, which is the natural log of (1 + Contribution). As we might expect, there is a strong positive association between contributions from car manufacturers and voting Yes on the CAFE amendment. But is the effect the same for Democrats as for Republicans? That is, we wonder whether the slope between logit(Vote) and LogContr is the same for the two parties.

  a. Fit the model in which logit(Vote) depends on LogContr and on party, allowing for different coefficients of LogContr for Democrats (Dem) and for Republicans. Use a Wald z-test to check for a difference. What is the P-value?
data(CAFE)
logm3 = glm(Vote ~ LogContr + Dem + LogContr*Dem, data=CAFE, family=binomial)
summary(logm3)
## 
## Call:
## glm(formula = Vote ~ LogContr + Dem + LogContr * Dem, family = binomial, 
##     data = CAFE)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5711  -0.6419   0.3023   0.5631   2.2532  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -10.164      5.401  -1.882   0.0599 .
## LogContr        3.002      1.357   2.212   0.0270 *
## Dem             2.544      5.974   0.426   0.6703  
## LogContr:Dem   -1.088      1.515  -0.719   0.4724  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.813  on 99  degrees of freedom
## Residual deviance:  86.781  on 96  degrees of freedom
## AIC: 94.781
## 
## Number of Fisher Scoring iterations: 5
  b. Repeat part (a) but this time use nested models and the drop-in-deviance test (the nested likelihood ratio test). What is the P-value?
library(lmtest)   # for lrtest() and coeftest()
lrtest(logm3)
## Likelihood ratio test
## 
## Model 1: Vote ~ LogContr + Dem + LogContr * Dem
## Model 2: Vote ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   4 -43.391                         
## 2   1 -66.406 -3 46.031  5.585e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
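Note that lrtest(logm3) with a single model compares the full model to the intercept-only model. The drop-in-deviance test this part asks for compares the no-interaction model to the full model; a sketch (output not shown):

cafe_reduced <- glm(Vote ~ LogContr + Dem, data = CAFE, family = binomial)
lrtest(cafe_reduced, logm3)   # drop-in-deviance test for the LogContr:Dem term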
  c. How do the P-values from parts (a) and (b) compare? If they are the same, why is that? If they are different, why are they different? The Wald z-test in part (a) gives P = 0.4724 for the interaction term. The two P-values are typically close but not identical: the Wald test relies on a normal approximation to the coefficient estimate, while the likelihood ratio test compares the maximized log-likelihoods of the nested models directly.

10.23 Sinking of the Titanic

In Exercises 9.27-9.32, we considered the data on the passengers who survived and those who died when the ocean liner Titanic sank on its maiden voyage in 1912. The dataset in Titanic includes the following variables:

Name | Description
---- | -----------
Age | the passenger's age in years
Sex | the passenger's sex (male or female)
Survived | a binary variable: 1 if the passenger survived, 0 if the passenger died
SexCode | Sex coded numerically: 0 for male, 1 for female
  a. In Exercises 9.27-9.31, you fit separate logistic regression models for the binary response Survived using Age and then SexCode. Now fit a multiple logistic model using these two predictors. Write down both the logit and probability forms for the fitted model.
data(Titanic)
logm5 <- glm(Survived ~ Age+Sex, data=Titanic, family=binomial)
summary(logm5)
## 
## Call:
## glm(formula = Survived ~ Age + Sex, family = binomial, data = Titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7541  -0.6905  -0.6504   0.7576   1.8628  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.306157   0.231052   5.653 1.58e-08 ***
## Age         -0.006352   0.006187  -1.027    0.305    
## Sexmale     -2.465996   0.178455 -13.819  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1025.57  on 755  degrees of freedom
## Residual deviance:  795.59  on 753  degrees of freedom
##   (557 observations deleted due to missingness)
## AIC: 801.59
## 
## Number of Fisher Scoring iterations: 4

\[ logit(\hat{\pi})=1.306157-0.006352Age-2.465996SexMale\\ \hat{\pi}=\frac{e^{1.306157-0.006352Age-2.465996SexMale}}{1+e^{1.306157-0.006352Age-2.465996SexMale}} \]

where \(SexMale\) is 1 for males and 0 for females. For males the intercept collapses to \(1.306157-2.465996=-1.159839\), so \(logit(\hat{\pi})=-1.160-0.006352Age\).

  b. Comment on the effectiveness of each of the predictors in the two-predictor model. Sex is a very effective predictor (z = -13.8, P < 2e-16), while Age adds little once Sex is in the model (z = -1.03, P = 0.305).
  c. According to the fitted model, estimate the probability and odds that an 18-year-old man would survive the Titanic sinking.
exp(-1.160 - 0.00635*18)   # odds of survival for an 18-year-old man
## [1] 0.2796266
0.2796266/1.2796266        # probability = odds/(1 + odds)
## [1] 0.218522

\[ \widehat{odds_m}=0.2796266\\ \hat{\pi_m}=0.218522 \]

  d. Repeat the calculations for an 18-year-old woman and find the odds ratio compared to a man of the same age.
exp(1.306157 - 0.00635*18)   # odds of survival for an 18-year-old woman
## [1] 3.293191
3.293191/4.293191            # probability = odds/(1 + odds)
## [1] 0.767073

\[ \widehat{odds_f}=3.293191\\ \hat{\pi_f}=0.767073 \]

3.293191/0.2796266           # odds ratio, female to male
## [1] 11.7771
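These hand calculations can be cross-checked with predict() on the fitted model; a minimal sketch (output not shown):

predict(logm5, newdata = data.frame(Age = 18, Sex = c("male", "female")),
        type = "response")   # estimated survival probabilities at age 18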
  e. Redo both (c) and (d) for a man and a woman of age 50.
exp(-1.160 - 0.00635*50)     # odds for a 50-year-old man
## [1] 0.2282075
exp(-1.160 - 0.00635*50)/(1 + exp(-1.160 - 0.00635*50))   # probability
## [1] 0.1858053

\[ \widehat{odds_m}=0.2282075\\ \hat{\pi_m}=0.1858053 \]

exp(1.306157 - 0.00635*50)   # odds for a 50-year-old woman
## [1] 2.687623
exp(1.306157 - 0.00635*50)/(1 + exp(1.306157 - 0.00635*50))   # probability
## [1] 0.7288226

\[ \widehat{odds_f}=2.687623\\ \hat{\pi_f}=0.7288226 \]

2.687623/0.2282075           # odds ratio, female to male
## [1] 11.7771
  f. What happens to the odds ratio (female to male of the same age) when the age increases in the Titanic data? Will this always be the case? Because the fitted model contains no Age:Sex interaction, the estimated odds ratio stays constant (about 11.78) at every age; that constancy is a consequence of this particular model form, not a general fact.
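We can confirm the constant odds ratio directly from the fitted coefficients (a quick check; output not shown):

exp(-coef(logm5)["Sexmale"])   # female-to-male odds ratio, about 11.78 at any age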

10.26 Red states or blue states in 2016

The file Election16 contains data from 50 states related to the 2016 U.S. presidential election. One variable is TrumpWin, which is 1 for the states won by Donald Trump, the Republican candidate, and 0 for states that Trump lost. Among the potential predictors are the per capita Income, percentages of adults with high school (HS) or college (BA) degrees, and a measure (Dem.Rep) of the %Democrat-%Republican leaning in the state.

  a. Fit a logistic model with TrumpWin as the response and Dem.Rep, HS, BA, and Income as the predictors. Which predictor has the strongest relationship with the response in this model?
data("Election16")
logm6 <- glm(TrumpWin ~ Dem.Rep + HS + BA + Income, data=Election16, family=binomial)
summary(logm6)
## 
## Call:
## glm(formula = TrumpWin ~ Dem.Rep + HS + BA + Income, family = binomial, 
##     data = Election16)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44654  -0.05377   0.06772   0.25970   1.35391  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 11.8785550 19.3235173   0.615   0.5387  
## Dem.Rep     -0.2520207  0.1125109  -2.240   0.0251 *
## HS           0.0663105  0.2509851   0.264   0.7916  
## BA          -0.2753913  0.2733136  -1.008   0.3136  
## Income      -0.0001785  0.0001436  -1.243   0.2139  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 19.555  on 45  degrees of freedom
## AIC: 29.555
## 
## Number of Fisher Scoring iterations: 7
coeftest(logm6)
## 
## z test of coefficients:
## 
##                Estimate  Std. Error z value Pr(>|z|)  
## (Intercept) 11.87855497 19.32351728  0.6147  0.53874  
## Dem.Rep     -0.25202068  0.11251085 -2.2400  0.02509 *
## HS           0.06631048  0.25098508  0.2642  0.79163  
## BA          -0.27539127  0.27331357 -1.0076  0.31365  
## Income      -0.00017849  0.00014362 -1.2428  0.21394  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dem.Rep has the strongest relationship with TrumpWin in this model; it is the only predictor with a P-value below 0.05 (z = -2.24, P = 0.025).

  b. Consider the model from part (a). Which predictors (if any) are not significantly related to TrumpWin in that model? HS, BA, and Income all have P-values well above 0.10, so none of them is significantly related to TrumpWin in this model.
  c. Identify the state that has the largest positive deviance (residual) and the state that has the largest negative deviance (residual). (Note: Deviance residuals are available in computer output. They can be treated similarly to residuals in the regression setting; we discuss them further in the next chapter.)
data("Election16")
logm6 <- glm(TrumpWin ~ Dem.Rep + HS + BA + Income, data=Election16, family=binomial)
summary(logm6)
## 
## Call:
## glm(formula = TrumpWin ~ Dem.Rep + HS + BA + Income, family = binomial, 
##     data = Election16)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44654  -0.05377   0.06772   0.25970   1.35391  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 11.8785550 19.3235173   0.615   0.5387  
## Dem.Rep     -0.2520207  0.1125109  -2.240   0.0251 *
## HS           0.0663105  0.2509851   0.264   0.7916  
## BA          -0.2753913  0.2733136  -1.008   0.3136  
## Income      -0.0001785  0.0001436  -1.243   0.2139  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 19.555  on 45  degrees of freedom
## AIC: 29.555
## 
## Number of Fisher Scoring iterations: 7
library(broom)   # for augment()
Election16_logm6 <- augment(logm6, data = Election16)
Election16_logm6 %>%
  select(State, TrumpWin, .resid) %>%   # .resid holds the deviance residuals
  arrange(desc(.resid)) %>%
  slice(c(1, n()))                      # largest positive and largest negative residuals
  d. Consider applying a backward elimination process, starting with the four-predictor model in part (a). At each "step," find the least significant predictor; eliminate it and refit with the smaller model, unless the worst predictor is significant, say, at a 10% level, in which case you stop and call that your final model. Describe what happens at each step in this situation.
library(leaps)
back = regsubsets(TrumpWin~Dem.Rep+HS+BA+Income, data=Election16, nbest=1,nvmax=4,method="backward")
summary(back)
## Subset selection object
## Call: regsubsets.formula(TrumpWin ~ Dem.Rep + HS + BA + Income, data = Election16, 
##     nbest = 1, nvmax = 4, method = "backward")
## 4 Variables  (and intercept)
##         Forced in Forced out
## Dem.Rep     FALSE      FALSE
## HS          FALSE      FALSE
## BA          FALSE      FALSE
## Income      FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
##          Dem.Rep HS  BA  Income
## 1  ( 1 ) " "     " " "*" " "   
## 2  ( 1 ) "*"     " " "*" " "   
## 3  ( 1 ) "*"     " " "*" "*"   
## 4  ( 1 ) "*"     "*" "*" "*"
with(summary(back),data.frame(cp,outmat))
##                 cp Dem.Rep HS BA Income
## 1  ( 1 ) 21.019066             *       
## 2  ( 1 )  2.892274       *     *       
## 3  ( 1 )  3.137647       *     *      *
## 4  ( 1 )  5.000000       *  *  *      *
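The Cp values above already point to the Dem.Rep + BA model. Following the p-value-based steps the exercise describes (a sketch of the refits; output not shown): HS has the largest P-value (0.79) in the four-predictor model, so it is dropped first, and if Income then remains non-significant at the 10% level, it is dropped next, leaving the two-predictor model fit below.

step1 <- glm(TrumpWin ~ Dem.Rep + BA + Income, data = Election16,
             family = binomial)   # step 1: drop HS
summary(step1)                    # check whether Income survives at the 10% level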
log7 <- glm(TrumpWin ~ Dem.Rep+BA, data=Election16, family=binomial)
summary(log7)
## 
## Call:
## glm(formula = TrumpWin ~ Dem.Rep + BA, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.93446  -0.10841   0.07757   0.30645   1.22311  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 15.34796    6.13130   2.503   0.0123 *
## Dem.Rep     -0.21406    0.08778  -2.439   0.0147 *
## BA          -0.51792    0.21181  -2.445   0.0145 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 21.080  on 47  degrees of freedom
## AIC: 27.08
## 
## Number of Fisher Scoring iterations: 7

\[ logit(\hat{\pi})=15.34796-0.21406\cdot Dem.Rep-0.51792\cdot BA \]
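As a hypothetical illustration (the predictor values here are made up), a state with a neutral partisan lean (Dem.Rep = 0) and BA = 30 has estimated logit \(15.348 - 0.518 \cdot 30 \approx -0.19\), which corresponds to a Trump-win probability of roughly 0.45:

predict(log7, newdata = data.frame(Dem.Rep = 0, BA = 30), type = "response")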

12.7 Guessing regular seasonal differencing from an ACF

Figure 12.46 shows autocorrelation plots for five different series. Based on just the plots, which series do you think would need at least one regular difference to help achieve stationarity? Note: The next exercise asks a similar question about seasonal differences.

12.8 Guessing seasonal differencing from an ACF

As in Exercise 12.7, we are interested in guessing whether differencing might be needed to help with stationarity for the time series that are plotted in Figure 12.46. In this exercise, comment on whether you think a seasonal difference might be required.

12.9 Matching time series plot to its ACF

The five time series (Series A through Series E) shown in the time series plot of Figure 12.45 for Exercises 12.5 and 12.6 are the same as the five series (Series 1 through Series 5) shown in the ACF plots of Figure 12.46 for Exercises 12.7 and 12.8, but the order has been scrambled. Match each series with its ACF and explain your reasoning.

12.24 Consumer Price Index: moving average models

In Example 12.16 on page 562 we consider an MA(1) model for the percentage differences of the monthly CPI data (CPIPctDiff in the dataset Inflation). In this exercise we consider moving average models.

  a. Fit a first-order moving average model to the differences of the CPI series (i.e., an ARIMA(0,1,1) model). Check if the constant term is important to keep in the model. If not, drop it and refit. Write down the fitted ARIMA(0,1,1) model.
data("Inflation")

CPI <- ts(Inflation$CPI, start = c(2009,1), frequency = 12)

m1 <- Arima(CPI, order=c(0,1,1), include.constant=TRUE)
m1
## Series: CPI 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.4278  0.3223
## s.e.  0.0778  0.0915
## 
## sigma^2 estimated as 0.4016:  log likelihood=-90.49
## AIC=186.99   AICc=187.25   BIC=194.65
coeftest(m1)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1   0.427811   0.077798  5.4990  3.82e-08 ***
## drift 0.322258   0.091508  3.5216 0.0004289 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The drift (constant) term is highly significant (z = 3.52, P = 0.0004), so we keep it. In terms of the original series, the fitted ARIMA(0,1,1) model is

\[ y_t - y_{t-1} = 0.322 + \epsilon_t + 0.428\cdot \epsilon_{t-1} \]

  b. Is the first-order moving average term needed in this model? Justify your answer. Yes: the ma1 coefficient is highly significant (z = 5.50, P < 0.0001), so the first-order moving average term belongs in the model.
  c. Comment on anything interesting you see in the ACF of the residuals for this model.
Acf(residuals(m1))
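Beyond inspecting the ACF visually, a Ljung-Box test is one way to check the residuals for leftover autocorrelation; a sketch (output not shown):

Box.test(residuals(m1), lag = 24, type = "Ljung-Box", fitdf = 1)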

  d. Now add a second moving average term to fit an ARIMA(0,1,2) model for the CPI series. Is the second-order moving average term helpful in the model?
m2 <- Arima(CPI, order = c(0,1,2), include.constant = TRUE)
m2
## Series: CPI 
## ARIMA(0,1,2) with drift 
## 
## Coefficients:
##          ma1     ma2   drift
##       0.5794  0.3444  0.3254
## s.e.  0.0969  0.1137  0.1179
## 
## sigma^2 estimated as 0.3739:  log likelihood=-86.72
## AIC=181.43   AICc=181.88   BIC=191.65
coeftest(m2)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1   0.579443   0.096943  5.9772 2.271e-09 ***
## ma2   0.344422   0.113709  3.0290  0.002454 ** 
## drift 0.325427   0.117908  2.7600  0.005780 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
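Both the significant ma2 coefficient (z = 3.03, P = 0.0025) and the drop in AIC (from 187.0 to 181.4) indicate that the second moving average term is helpful. A quick side-by-side comparison:

c(ARIMA011 = m1$aic, ARIMA012 = m2$aic)   # smaller AIC favors the ARIMA(0,1,2)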

12.29 Hawaii CO2

  a. Fit a cosine trend model, along with terms for a quadratic trend in t, for the CO2 data in CO2Hawaii. Record the adjusted \(R^2\) value.
data("CO2Hawaii")

CO2Hawaii <- CO2Hawaii %>%
  mutate(Xcos = cos(2*pi*t/12), Xsin = sin(2*pi*t/12))   # cosine trend terms with period 12 months

m3 <- lm(CO2 ~ t + I(t^2) + Xcos + Xsin, data = CO2Hawaii)
summary(m3)
## 
## Call:
## lm(formula = CO2 ~ t + I(t^2) + Xcos + Xsin, data = CO2Hawaii)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0962 -0.5816 -0.0353  0.6347  1.7666 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.505e+02  1.308e-01 2679.51   <2e-16 ***
## t            1.052e-01  1.673e-03   62.89   <2e-16 ***
## I(t^2)       1.473e-04  4.489e-06   32.82   <2e-16 ***
## Xcos        -1.617e+00  6.132e-02  -26.37   <2e-16 ***
## Xsin         2.466e+00  6.134e-02   40.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8227 on 355 degrees of freedom
## Multiple R-squared:  0.9976, Adjusted R-squared:  0.9976 
## F-statistic: 3.666e+04 on 4 and 355 DF,  p-value: < 2.2e-16
The adjusted \(R^2\) for this cosine trend model is 0.9976.

  b. Fit a seasonal means model, along with terms for a quadratic trend in t, for the CO2 data in CO2Hawaii. Record the adjusted \(R^2\) value.
m4 <- lm(CO2~t+I(t^2)+factor(Month), data=CO2Hawaii)
summary(m4)
## 
## Call:
## lm(formula = CO2 ~ t + I(t^2) + factor(Month), data = CO2Hawaii)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46022 -0.38223  0.00451  0.44292  1.35235 
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      3.507e+02  1.349e-01 2600.647  < 2e-16 ***
## t                1.052e-01  1.166e-03   90.200  < 2e-16 ***
## I(t^2)           1.473e-04  3.128e-06   47.090  < 2e-16 ***
## factor(Month)2   6.408e-01  1.480e-01    4.328 1.97e-05 ***
## factor(Month)3   1.377e+00  1.480e-01    9.303  < 2e-16 ***
## factor(Month)4   2.561e+00  1.480e-01   17.297  < 2e-16 ***
## factor(Month)5   2.941e+00  1.480e-01   19.863  < 2e-16 ***
## factor(Month)6   2.129e+00  1.481e-01   14.379  < 2e-16 ***
## factor(Month)7   3.702e-01  1.481e-01    2.500   0.0129 *  
## factor(Month)8  -1.879e+00  1.481e-01  -12.692  < 2e-16 ***
## factor(Month)9  -3.558e+00  1.481e-01  -24.034  < 2e-16 ***
## factor(Month)10 -3.515e+00  1.481e-01  -23.740  < 2e-16 ***
## factor(Month)11 -2.254e+00  1.481e-01  -15.224  < 2e-16 ***
## factor(Month)12 -9.761e-01  1.481e-01   -6.592 1.62e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5734 on 346 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9988 
## F-statistic: 2.325e+04 on 13 and 346 DF,  p-value: < 2.2e-16
The adjusted \(R^2\) for the seasonal means model is 0.9988.

  c. What CO2 level does each model predict for July 2018 (t = 370)?
predict(m3, newdata=data.frame(t=370,
                               Xcos = cos(2*pi*370/12), 
                               Xsin = sin(2*pi*370/12)))   # cosine trend model
##        1 
## 406.6855
predict(m4, newdata=data.frame(t=370, Month = 7))          # seasonal means model
##        1 
## 410.1708
  d. Do you prefer the model in (a) or (b)? Give some justification for your answer. The seasonal means model in (b) is preferable here: it has the higher adjusted \(R^2\) (0.9988 vs. 0.9976) and a smaller residual standard error (0.57 vs. 0.82).
  e. Do we really need the terms for a quadratic trend in this model? Take whichever model you chose in (d) and delete the quadratic trend terms, then refit and see how the fit changes.
m5 <- lm(CO2 ~ factor(Month), data=CO2Hawaii)
summary(m5)
## 
## Call:
## lm(formula = CO2 ~ factor(Month), data = CO2Hawaii)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.831 -15.099  -1.349  14.038  30.909 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     375.2207     3.0678 122.308   <2e-16 ***
## factor(Month)2    0.7977     4.3386   0.184    0.854    
## factor(Month)3    1.6913     4.3386   0.390    0.697    
## factor(Month)4    3.0323     4.3386   0.699    0.485    
## factor(Month)5    3.5700     4.3386   0.823    0.411    
## factor(Month)6    2.9163     4.3386   0.672    0.502    
## factor(Month)7    1.3160     4.3386   0.303    0.762    
## factor(Month)8   -0.7747     4.3386  -0.179    0.858    
## factor(Month)9   -2.2950     4.3386  -0.529    0.597    
## factor(Month)10  -2.0923     4.3386  -0.482    0.630    
## factor(Month)11  -0.6720     4.3386  -0.155    0.877    
## factor(Month)12   0.7660     4.3386   0.177    0.860    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 348 degrees of freedom
## Multiple R-squared:  0.01245,    Adjusted R-squared:  -0.01877 
## F-statistic: 0.3987 on 11 and 348 DF,  p-value: 0.9562
The fit collapses without the trend terms (the adjusted \(R^2\) drops from 0.9988 to about -0.02), so the quadratic trend is clearly needed. Overlaying the no-trend fit on the data makes this plain:

ts_m5 <- ts(fitted.values(m5), start = c(1998, 1), frequency = 12)   # fitted seasonal means as a monthly series
ts_CO2 <- ts(CO2Hawaii$CO2, start = c(1998, 1), frequency = 12)      # observed CO2 series

plot(ts_CO2, col = "red")    # observed series
lines(ts_m5, col = "blue")   # seasonal means fit, no trend