10.3 Empirical logits

Figure 10.25 shows a plot of empirical logits for a dataset with two predictors: X, a continuous variable, and Group, a categorical variable with two levels, 1 and 2. The circles are for Group=1 and the triangles are for Group=2. What model is suggested by this plot?

The roughly parallel trends for the two groups suggest a model with no interaction term:

\[ log(\frac{\pi}{1-\pi})=\beta_0 +\beta_1X +\beta_2Group_1 \] where \(Group_1\) is an indicator variable for Group (1 for Group 1, 0 for Group 2)
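For reference, empirical logits like these can be computed by slicing X into intervals within each group. A minimal sketch, assuming a data frame dat with a continuous X, a binary response Y, and the two-level Group (all names here are illustrative, not from the exercise):

library(dplyr)
emp <- dat %>%
  mutate(bin = cut(X, breaks = 8)) %>%        # slice X into intervals
  group_by(Group, bin) %>%
  summarise(midX = mean(X),
            p = (sum(Y) + 0.5)/(n() + 1),     # adjusted proportion in each slice
            logit = log(p/(1 - p)), .groups = "drop")
plot(logit ~ midX, data = emp, pch = ifelse(emp$Group == 1, 1, 2))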

10.5 Empirical logits again

Figure 10.27 shows a plot of empirical logits for a dataset with two predictors: X, a continuous variable, and Group, a categorical variable with two levels, 1 and 2. The circles are for Group=1 and the triangles are for Group=2. What model is suggested by this plot?

The two groups show clearly different slopes, suggesting a model with an interaction term:

\[ log(\frac{\pi}{1-\pi})=\beta_0 +\beta_1X +\beta_2Group_1+\beta_3X\cdot Group_1 \] where \(Group_1\) is an indicator variable for Group (1 for Group 1, 0 for Group 2)

10.11 Medical school acceptance: Nested likelihood ratio tests

Consider the medical school data presented in this chapter. For each of the questions listed below, a nested likelihood ratio test could be used. In each case, state the reduced and full models for the likelihood ratio test.

  a. Does GPA contribute to predicting Acceptance beyond the effects of MCAT and Sex?
data("MedGPA")
logm1 = glm(Acceptance ~ GPA + MCAT + Sex, data = MedGPA, family = binomial)   # family = binomial for logistic regression
summary(logm1)
## 
## Call:
## glm(formula = Acceptance ~ GPA + MCAT + Sex, data = MedGPA)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.76317  -0.42660   0.08646   0.31157   0.81582  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.73079    0.72380  -3.773 0.000422 ***
## GPA          0.75408    0.23896   3.156 0.002686 ** 
## MCAT         0.01863    0.01416   1.316 0.194108    
## SexM        -0.16070    0.11407  -1.409 0.164946    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1770009)
## 
##     Null deviance: 13.636  on 54  degrees of freedom
## Residual deviance:  9.027  on 51  degrees of freedom
## AIC: 66.692
## 
## Number of Fisher Scoring iterations: 2

The reduced model for this test is

\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex \]

and the full model for this is

\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex+\beta_3GPA \]

The P-value for GPA is much smaller than those for MCAT and Sex, so GPA does contribute to predicting Acceptance beyond the effects of MCAT and Sex.
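A sketch of the corresponding drop-in-deviance test for this question (output not shown):

reduced1 <- glm(Acceptance ~ MCAT + Sex, data = MedGPA, family = binomial)
full1 <- glm(Acceptance ~ MCAT + Sex + GPA, data = MedGPA, family = binomial)
anova(reduced1, full1, test = "Chisq")   # nested likelihood ratio test for GPA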

  b. Is the relationship between logit(Acceptance) and MCAT the same for men as for women?
logm2 = glm(Acceptance ~ MCAT + Sex + MCAT*Sex, data = MedGPA, family = binomial)   # family = binomial for logistic regression
summary(logm2)
## 
## Call:
## glm(formula = Acceptance ~ MCAT + Sex + MCAT * Sex, data = MedGPA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7791  -0.4324   0.1843   0.4043   0.7536  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.687694   0.782229  -0.879   0.3834  
## MCAT         0.036669   0.021424   1.712   0.0931 .
## SexM        -0.554181   0.984609  -0.563   0.5760  
## MCAT:SexM    0.009839   0.026926   0.365   0.7163  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2110099)
## 
##     Null deviance: 13.636  on 54  degrees of freedom
## Residual deviance: 10.762  on 51  degrees of freedom
## AIC: 76.359
## 
## Number of Fisher Scoring iterations: 2

The reduced model for this test is

\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex \]

and the full model for this is

\[ logit(\pi)=\beta_0+\beta_1MCAT+\beta_2Sex+\beta_3Sex\cdot MCAT \]

The interaction term is not significant, so we have no evidence that the relationship between logit(Acceptance) and MCAT differs for men and women.
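A sketch of the corresponding drop-in-deviance test for the interaction (output not shown):

reduced2 <- glm(Acceptance ~ MCAT + Sex, data = MedGPA, family = binomial)
full2 <- glm(Acceptance ~ MCAT * Sex, data = MedGPA, family = binomial)
anova(reduced2, full2, test = "Chisq")   # nested likelihood ratio test for MCAT:Sex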

10.17 CAFE

The Corporate Average Fuel Economy (CAFE) bill was proposed by Senators John McCain and John Kerry to improve the fuel economy of cars and light trucks sold in the United States. However, the bill was, in effect, indefinitely postponed when an amendment was passed, by a vote of 62-38, that charged the National Highway Traffic Safety Administration to develop a new standard. The file CAFE contains data on this vote, including the vote of each senator (Vote, which is 1 for Yes and 0 for No), whether or not each senator is a Democrat (Dem, which is 1 for Democrats and 0 for Republicans), monetary contributions that each of the 100 senators received over his or her lifetime from car manufacturers (Contribution), and LogContr, which is the natural log of (1 + Contribution). As we might expect, there is a strong positive association between contributions from car manufacturers and voting Yes on the CAFE amendment. But is the effect the same for Democrats as for Republicans? That is, we wonder whether the slope between logit(Vote) and LogContr is the same for the two parties.

  a. Fit the model in which logit(Vote) depends on LogContr and on party, allowing for different coefficients of LogContr for Democrats (Dem) and for Republicans. Use a Wald z-test to check for a difference. What is the P-value?
data(CAFE)
logm3 = glm(Vote ~ LogContr + Dem + LogContr*Dem, data=CAFE, family=binomial)
summary(logm3)
## 
## Call:
## glm(formula = Vote ~ LogContr + Dem + LogContr * Dem, family = binomial, 
##     data = CAFE)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5711  -0.6419   0.3023   0.5631   2.2532  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -10.164      5.401  -1.882   0.0599 .
## LogContr        3.002      1.357   2.212   0.0270 *
## Dem             2.544      5.974   0.426   0.6703  
## LogContr:Dem   -1.088      1.515  -0.719   0.4724  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.813  on 99  degrees of freedom
## Residual deviance:  86.781  on 96  degrees of freedom
## AIC: 94.781
## 
## Number of Fisher Scoring iterations: 5
  b. Repeat part (a) but this time use nested models and the drop-in-deviance test (the nested likelihood ratio test). What is the P-value?
library(lmtest)   # for lrtest() and coeftest()
lrtest(logm3)
## Likelihood ratio test
## 
## Model 1: Vote ~ LogContr + Dem + LogContr * Dem
## Model 2: Vote ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   4 -43.391                         
## 2   1 -66.406 -3 46.031  5.585e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
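Note that lrtest(logm3) with a single model compares the full model to the intercept-only model. The drop-in-deviance test this part asks for compares the no-interaction model to the full model; a sketch (output not shown):

cafe_reduced <- glm(Vote ~ LogContr + Dem, data = CAFE, family = binomial)
lrtest(cafe_reduced, logm3)   # drop-in-deviance test for the LogContr:Dem term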
  c. How do the P-values from parts (a) and (b) compare? If they are the same, why is that? If they are different, why are they different? The Wald z-test in part (a) gives P = 0.4724 for the interaction term. The two P-values are typically close but not identical: the Wald test relies on a normal approximation to the coefficient estimate, while the likelihood ratio test compares the maximized log-likelihoods of the nested models directly.

10.23 Sinking of the Titanic

In Exercises 9.27-9.32, we considered the data on the passengers who survived and those who died when the ocean liner Titanic sank on its maiden voyage in 1912. The dataset in Titanic includes the following variables:

Name | Description
---- | -----------
Age | the passenger's age in years
Sex | the passenger's sex (male or female)
Survived | a binary variable: 1 if the passenger survived, 0 if the passenger died
SexCode | Sex coded numerically: 0 for male, 1 for female
  a. In Exercises 9.27-9.31, you fit separate logistic regression models for the binary response Survived using Age and then SexCode. Now fit a multiple logistic model using these two predictors. Write down both the logit and probability forms for the fitted model.
data(Titanic)
logm5 <- glm(Survived ~ Age+Sex, data=Titanic, family=binomial)
summary(logm5)
## 
## Call:
## glm(formula = Survived ~ Age + Sex, family = binomial, data = Titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7541  -0.6905  -0.6504   0.7576   1.8628  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.306157   0.231052   5.653 1.58e-08 ***
## Age         -0.006352   0.006187  -1.027    0.305    
## Sexmale     -2.465996   0.178455 -13.819  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1025.57  on 755  degrees of freedom
## Residual deviance:  795.59  on 753  degrees of freedom
##   (557 observations deleted due to missingness)
## AIC: 801.59
## 
## Number of Fisher Scoring iterations: 4

\[ logit(\hat{\pi})=1.306157-0.006352Age-2.465996SexMale\\ \hat{\pi}=\frac{e^{1.306157-0.006352Age-2.465996SexMale}}{1+e^{1.306157-0.006352Age-2.465996SexMale}} \]

where \(SexMale\) is 1 for males and 0 for females. For males the intercept collapses to \(1.306157-2.465996=-1.159839\), so \(logit(\hat{\pi})=-1.160-0.006352Age\).

  b. Comment on the effectiveness of each of the predictors in the two-predictor model. Sex is a very effective predictor (z = -13.8, P < 2e-16), while Age adds little once Sex is in the model (z = -1.03, P = 0.305).
  c. According to the fitted model, estimate the probability and odds that an 18-year-old man would survive the Titanic sinking.
exp(-1.160 - 0.00635*18)   # odds of survival for an 18-year-old man
## [1] 0.2796266
0.2796266/1.2796266        # probability = odds/(1 + odds)
## [1] 0.218522

\[ \widehat{odds_m}=0.2796266\\ \hat{\pi_m}=0.218522 \]

  d. Repeat the calculations for an 18-year-old woman and find the odds ratio compared to a man of the same age.
exp(1.306157 - 0.00635*18)   # odds of survival for an 18-year-old woman
## [1] 3.293191
3.293191/4.293191            # probability = odds/(1 + odds)
## [1] 0.767073

\[ \widehat{odds_f}=3.293191\\ \hat{\pi_f}=0.767073 \]

3.293191/0.2796266           # odds ratio, female to male
## [1] 11.7771
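These hand calculations can be cross-checked with predict() on the fitted model; a minimal sketch (output not shown):

predict(logm5, newdata = data.frame(Age = 18, Sex = c("male", "female")),
        type = "response")   # estimated survival probabilities at age 18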
  e. Redo both (c) and (d) for a man and a woman of age 50.
exp(-1.160 - 0.00635*50)     # odds for a 50-year-old man
## [1] 0.2282075
exp(-1.160 - 0.00635*50)/(1 + exp(-1.160 - 0.00635*50))   # probability
## [1] 0.1858053

\[ \widehat{odds_m}=0.2282075\\ \hat{\pi_m}=0.1858053 \]

exp(1.306157 - 0.00635*50)   # odds for a 50-year-old woman
## [1] 2.687623
exp(1.306157 - 0.00635*50)/(1 + exp(1.306157 - 0.00635*50))   # probability
## [1] 0.7288226

\[ \widehat{odds_f}=2.687623\\ \hat{\pi_f}=0.7288226 \]

2.687623/0.2282075           # odds ratio, female to male
## [1] 11.7771
  f. What happens to the odds ratio (female to male of the same age) when the age increases in the Titanic data? Will this always be the case? Because the fitted model contains no Age:Sex interaction, the estimated odds ratio stays constant (about 11.78) at every age; that constancy is a consequence of this particular model form, not a general fact.
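We can confirm the constant odds ratio directly from the fitted coefficients (a quick check; output not shown):

exp(-coef(logm5)["Sexmale"])   # female-to-male odds ratio, about 11.78 at any age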

10.26 Red states or blue states in 2016

The file Election16 contains data from 50 states related to the 2016 U.S. presidential election. One variable is TrumpWin, which is 1 for the states won by Donald Trump, the Republican candidate, and 0 for states that Trump lost. Among the potential predictors are the per capita Income, percentages of adults with high school (HS) or college (BA) degrees, and a measure (Dem.Rep) of the %Democrat-%Republican leaning in the state.

  a. Fit a logistic model with TrumpWin as the response and Dem.Rep, HS, BA, and Income as the predictors. Which predictor has the strongest relationship with the response in this model?
data("Election16")
logm6 <- glm(TrumpWin ~ Dem.Rep + HS + BA + Income, data=Election16, family=binomial)
summary(logm6)
## 
## Call:
## glm(formula = TrumpWin ~ Dem.Rep + HS + BA + Income, family = binomial, 
##     data = Election16)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44654  -0.05377   0.06772   0.25970   1.35391  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 11.8785550 19.3235173   0.615   0.5387  
## Dem.Rep     -0.2520207  0.1125109  -2.240   0.0251 *
## HS           0.0663105  0.2509851   0.264   0.7916  
## BA          -0.2753913  0.2733136  -1.008   0.3136  
## Income      -0.0001785  0.0001436  -1.243   0.2139  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 19.555  on 45  degrees of freedom
## AIC: 29.555
## 
## Number of Fisher Scoring iterations: 7
coeftest(logm6)
## 
## z test of coefficients:
## 
##                Estimate  Std. Error z value Pr(>|z|)  
## (Intercept) 11.87855497 19.32351728  0.6147  0.53874  
## Dem.Rep     -0.25202068  0.11251085 -2.2400  0.02509 *
## HS           0.06631048  0.25098508  0.2642  0.79163  
## BA          -0.27539127  0.27331357 -1.0076  0.31365  
## Income      -0.00017849  0.00014362 -1.2428  0.21394  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dem.Rep has the strongest relationship with TrumpWin in this model; it is the only predictor with a P-value below 0.05 (z = -2.24, P = 0.025).

  b. Consider the model from part (a). Which predictors (if any) are not significantly related to TrumpWin in that model? HS, BA, and Income all have P-values well above 0.10, so none of them is significantly related to TrumpWin in this model.
  c. Identify the state that has the largest positive deviance (residual) and the state that has the largest negative deviance (residual). (Note: Deviance residuals are available in computer output. They can be treated similarly to residuals in the regression setting; we discuss them further in the next chapter.)
data("Election16")
logm6 <- glm(TrumpWin ~ Dem.Rep + HS + BA + Income, data=Election16, family=binomial)
summary(logm6)
## 
## Call:
## glm(formula = TrumpWin ~ Dem.Rep + HS + BA + Income, family = binomial, 
##     data = Election16)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44654  -0.05377   0.06772   0.25970   1.35391  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 11.8785550 19.3235173   0.615   0.5387  
## Dem.Rep     -0.2520207  0.1125109  -2.240   0.0251 *
## HS           0.0663105  0.2509851   0.264   0.7916  
## BA          -0.2753913  0.2733136  -1.008   0.3136  
## Income      -0.0001785  0.0001436  -1.243   0.2139  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 19.555  on 45  degrees of freedom
## AIC: 29.555
## 
## Number of Fisher Scoring iterations: 7
library(broom)   # for augment()
Election16_logm6 <- augment(logm6, data = Election16)
Election16_logm6 %>%
  select(State, TrumpWin, .resid) %>%   # .resid holds the deviance residuals
  arrange(desc(.resid)) %>%
  slice(c(1, n()))                      # largest positive and largest negative residuals
  d. Consider applying a backward elimination process, starting with the four-predictor model in part (a). At each "step," find the least significant predictor; eliminate it and refit with the smaller model, unless the worst predictor is significant, say, at a 10% level, in which case you stop and call that your final model. Describe what happens at each step in this situation.
library(leaps)
back = regsubsets(TrumpWin~Dem.Rep+HS+BA+Income, data=Election16, nbest=1,nvmax=4,method="backward")
summary(back)
## Subset selection object
## Call: regsubsets.formula(TrumpWin ~ Dem.Rep + HS + BA + Income, data = Election16, 
##     nbest = 1, nvmax = 4, method = "backward")
## 4 Variables  (and intercept)
##         Forced in Forced out
## Dem.Rep     FALSE      FALSE
## HS          FALSE      FALSE
## BA          FALSE      FALSE
## Income      FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
##          Dem.Rep HS  BA  Income
## 1  ( 1 ) " "     " " "*" " "   
## 2  ( 1 ) "*"     " " "*" " "   
## 3  ( 1 ) "*"     " " "*" "*"   
## 4  ( 1 ) "*"     "*" "*" "*"
with(summary(back),data.frame(cp,outmat))
##                 cp Dem.Rep HS BA Income
## 1  ( 1 ) 21.019066             *       
## 2  ( 1 )  2.892274       *     *       
## 3  ( 1 )  3.137647       *     *      *
## 4  ( 1 )  5.000000       *  *  *      *
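The Cp values above already point to the Dem.Rep + BA model. Following the p-value-based steps the exercise describes (a sketch of the refits; output not shown): HS has the largest P-value (0.79) in the four-predictor model, so it is dropped first, and if Income then remains non-significant at the 10% level, it is dropped next, leaving the two-predictor model fit below.

step1 <- glm(TrumpWin ~ Dem.Rep + BA + Income, data = Election16,
             family = binomial)   # step 1: drop HS
summary(step1)                    # check whether Income survives at the 10% level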
log7 <- glm(TrumpWin ~ Dem.Rep+BA, data=Election16, family=binomial)
summary(log7)
## 
## Call:
## glm(formula = TrumpWin ~ Dem.Rep + BA, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.93446  -0.10841   0.07757   0.30645   1.22311  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 15.34796    6.13130   2.503   0.0123 *
## Dem.Rep     -0.21406    0.08778  -2.439   0.0147 *
## BA          -0.51792    0.21181  -2.445   0.0145 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 21.080  on 47  degrees of freedom
## AIC: 27.08
## 
## Number of Fisher Scoring iterations: 7

\[ logit(\hat{\pi})=15.34796-0.21406\cdot Dem.Rep-0.51792\cdot BA \]
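As a hypothetical illustration (the predictor values here are made up), a state with a neutral partisan lean (Dem.Rep = 0) and BA = 30 has estimated logit \(15.348 - 0.518 \cdot 30 \approx -0.19\), which corresponds to a Trump-win probability of roughly 0.45:

predict(log7, newdata = data.frame(Dem.Rep = 0, BA = 30), type = "response")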

12.7 Guessing regular seasonal differencing from an ACF

Figure 12.46 shows autocorrelation plots for five different series. Based on just the plots, which series do you think would need at least one regular difference to help achieve stationarity? Note: The next exercise asks a similar question about seasonal differences.

12.8 Guessing seasonal differencing from an ACF

As in Exercise 12.7, we are interested in guessing whether differencing might be needed to help with stationarity for the time series that are plotted in Figure 12.46. In this exercise, comment on whether you think a seasonal difference might be required.

12.9 Matching time series plot to its ACF

The five time series (Series A through Series E) shown in the time series plot of Figure 12.45 for Exercises 12.5 and 12.6 are the same as the five series (Series 1 through Series 5) shown in the ACF plots of Figure 12.46 for Exercises 12.7 and 12.8, but the order has been scrambled. Match each series with its ACF and explain your reasoning.

12.24 Consumer Price Index: moving average models

In Example 12.16 on page 562 we consider an MA(1) model for the percentage differences of the monthly CPI data (CPIPctDiff in the dataset Inflation). In this exercise we consider moving average models.

  a. Fit a first-order moving average model to the differences of the CPI series (i.e., an ARIMA(0,1,1) model). Check if the constant term is important to keep in the model. If not, drop it and refit. Write down the fitted ARIMA(0,1,1) model.
data("Inflation")

CPI <- ts(Inflation$CPI, start = c(2009,1), frequency = 12)

m1 <- Arima(CPI, order=c(0,1,1), include.constant=TRUE)
m1
## Series: CPI 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.4278  0.3223
## s.e.  0.0778  0.0915
## 
## sigma^2 estimated as 0.4016:  log likelihood=-90.49
## AIC=186.99   AICc=187.25   BIC=194.65
coeftest(m1)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1   0.427811   0.077798  5.4990  3.82e-08 ***
## drift 0.322258   0.091508  3.5216 0.0004289 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The drift (constant) term is highly significant (z = 3.52, P = 0.0004), so we keep it. In terms of the original series, the fitted ARIMA(0,1,1) model is

\[ y_t - y_{t-1} = 0.322 + \epsilon_t + 0.428\cdot \epsilon_{t-1} \]

  b. Is the first-order moving average term needed in this model? Justify your answer. Yes: the ma1 coefficient is highly significant (z = 5.50, P < 0.0001), so the first-order moving average term belongs in the model.
  c. Comment on anything interesting you see in the ACF of the residuals for this model.
Acf(residuals(m1))
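Beyond inspecting the ACF visually, a Ljung-Box test is one way to check the residuals for leftover autocorrelation; a sketch (output not shown):

Box.test(residuals(m1), lag = 24, type = "Ljung-Box", fitdf = 1)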

  d. Now add a second moving average term to fit an ARIMA(0,1,2) model for the CPI series. Is the second-order moving average term helpful in the model?
m2 <- Arima(CPI, order = c(0,1,2), include.constant = TRUE)
m2
## Series: CPI 
## ARIMA(0,1,2) with drift 
## 
## Coefficients:
##          ma1     ma2   drift
##       0.5794  0.3444  0.3254
## s.e.  0.0969  0.1137  0.1179
## 
## sigma^2 estimated as 0.3739:  log likelihood=-86.72
## AIC=181.43   AICc=181.88   BIC=191.65
coeftest(m2)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1   0.579443   0.096943  5.9772 2.271e-09 ***
## ma2   0.344422   0.113709  3.0290  0.002454 ** 
## drift 0.325427   0.117908  2.7600  0.005780 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
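Both the significant ma2 coefficient (z = 3.03, P = 0.0025) and the drop in AIC (from 187.0 to 181.4) indicate that the second moving average term is helpful. A quick side-by-side comparison:

c(ARIMA011 = m1$aic, ARIMA012 = m2$aic)   # smaller AIC favors the ARIMA(0,1,2)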

12.29 Hawaii CO2

  a. Fit a cosine trend model, along with terms for a quadratic trend in t, for the CO2 data in CO2Hawaii. Record the adjusted \(R^2\) value.
data("CO2Hawaii")

CO2Hawaii <- CO2Hawaii %>%
  mutate(Xcos = cos(2*pi*t/12), Xsin = sin(2*pi*t/12))   # cosine trend terms with period 12 months

m3 <- lm(CO2 ~ t + I(t^2) + Xcos + Xsin, data = CO2Hawaii)
summary(m3)
## 
## Call:
## lm(formula = CO2 ~ t + I(t^2) + Xcos + Xsin, data = CO2Hawaii)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0962 -0.5816 -0.0353  0.6347  1.7666 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.505e+02  1.308e-01 2679.51   <2e-16 ***
## t            1.052e-01  1.673e-03   62.89   <2e-16 ***
## I(t^2)       1.473e-04  4.489e-06   32.82   <2e-16 ***
## Xcos        -1.617e+00  6.132e-02  -26.37   <2e-16 ***
## Xsin         2.466e+00  6.134e-02   40.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8227 on 355 degrees of freedom
## Multiple R-squared:  0.9976, Adjusted R-squared:  0.9976 
## F-statistic: 3.666e+04 on 4 and 355 DF,  p-value: < 2.2e-16
The adjusted \(R^2\) for this cosine trend model is 0.9976.

  b. Fit a seasonal means model, along with terms for a quadratic trend in t, for the CO2 data in CO2Hawaii. Record the adjusted \(R^2\) value.
m4 <- lm(CO2~t+I(t^2)+factor(Month), data=CO2Hawaii)
summary(m4)
## 
## Call:
## lm(formula = CO2 ~ t + I(t^2) + factor(Month), data = CO2Hawaii)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46022 -0.38223  0.00451  0.44292  1.35235 
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      3.507e+02  1.349e-01 2600.647  < 2e-16 ***
## t                1.052e-01  1.166e-03   90.200  < 2e-16 ***
## I(t^2)           1.473e-04  3.128e-06   47.090  < 2e-16 ***
## factor(Month)2   6.408e-01  1.480e-01    4.328 1.97e-05 ***
## factor(Month)3   1.377e+00  1.480e-01    9.303  < 2e-16 ***
## factor(Month)4   2.561e+00  1.480e-01   17.297  < 2e-16 ***
## factor(Month)5   2.941e+00  1.480e-01   19.863  < 2e-16 ***
## factor(Month)6   2.129e+00  1.481e-01   14.379  < 2e-16 ***
## factor(Month)7   3.702e-01  1.481e-01    2.500   0.0129 *  
## factor(Month)8  -1.879e+00  1.481e-01  -12.692  < 2e-16 ***
## factor(Month)9  -3.558e+00  1.481e-01  -24.034  < 2e-16 ***
## factor(Month)10 -3.515e+00  1.481e-01  -23.740  < 2e-16 ***
## factor(Month)11 -2.254e+00  1.481e-01  -15.224  < 2e-16 ***
## factor(Month)12 -9.761e-01  1.481e-01   -6.592 1.62e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5734 on 346 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9988 
## F-statistic: 2.325e+04 on 13 and 346 DF,  p-value: < 2.2e-16
The adjusted \(R^2\) for the seasonal means model is 0.9988.

  c. What CO2 level does each model predict for July 2018 (t = 370)?
predict(m3, newdata=data.frame(t=370,
                               Xcos = cos(2*pi*370/12), 
                               Xsin = sin(2*pi*370/12)))   # cosine trend model
##        1 
## 406.6855
predict(m4, newdata=data.frame(t=370, Month = 7))          # seasonal means model
##        1 
## 410.1708
  d. Do you prefer the model in (a) or (b)? Give some justification for your answer. The seasonal means model in (b) is preferable here: it has the higher adjusted \(R^2\) (0.9988 vs. 0.9976) and a smaller residual standard error (0.57 vs. 0.82).
  e. Do we really need the terms for a quadratic trend in this model? Take whichever model you chose in (d) and delete the quadratic trend terms, then refit and see how the fit changes.
m5 <- lm(CO2 ~ factor(Month), data=CO2Hawaii)
summary(m5)
## 
## Call:
## lm(formula = CO2 ~ factor(Month), data = CO2Hawaii)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.831 -15.099  -1.349  14.038  30.909 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     375.2207     3.0678 122.308   <2e-16 ***
## factor(Month)2    0.7977     4.3386   0.184    0.854    
## factor(Month)3    1.6913     4.3386   0.390    0.697    
## factor(Month)4    3.0323     4.3386   0.699    0.485    
## factor(Month)5    3.5700     4.3386   0.823    0.411    
## factor(Month)6    2.9163     4.3386   0.672    0.502    
## factor(Month)7    1.3160     4.3386   0.303    0.762    
## factor(Month)8   -0.7747     4.3386  -0.179    0.858    
## factor(Month)9   -2.2950     4.3386  -0.529    0.597    
## factor(Month)10  -2.0923     4.3386  -0.482    0.630    
## factor(Month)11  -0.6720     4.3386  -0.155    0.877    
## factor(Month)12   0.7660     4.3386   0.177    0.860    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 348 degrees of freedom
## Multiple R-squared:  0.01245,    Adjusted R-squared:  -0.01877 
## F-statistic: 0.3987 on 11 and 348 DF,  p-value: 0.9562
The fit collapses without the trend terms (the adjusted \(R^2\) drops from 0.9988 to about -0.02), so the quadratic trend is clearly needed. Overlaying the no-trend fit on the data makes this plain:

ts_m5 <- ts(fitted.values(m5), start = c(1998, 1), frequency = 12)   # fitted seasonal means as a monthly series
ts_CO2 <- ts(CO2Hawaii$CO2, start = c(1998, 1), frequency = 12)      # observed CO2 series

plot(ts_CO2, col = "red")    # observed series
lines(ts_m5, col = "blue")   # seasonal means fit, no trend