climate <- read.csv("climate_change.csv")
str(climate)
## 'data.frame': 308 obs. of 11 variables:
## $ Year : int 1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
## $ Month : int 5 6 7 8 9 10 11 12 1 2 ...
## $ MEI : num 2.556 2.167 1.741 1.13 0.428 ...
## $ CO2 : num 346 346 344 342 340 ...
## $ CH4 : num 1639 1634 1633 1631 1648 ...
## $ N2O : num 304 304 304 304 304 ...
## $ CFC.11 : num 191 192 193 194 194 ...
## $ CFC.12 : num 350 352 354 356 357 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
## $ Temp : num 0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
Create the training and testing sets, splitting on Year
training <- subset(climate, Year <= 2006)
testing <- subset(climate, Year > 2006)
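As a quick sanity check on the split: training should hold the 284 observations through 2006 and testing the remaining 24 of the 308 rows (284 is consistent with the 275 residual degrees of freedom plus 9 estimated coefficients reported in the model summary below).
nrow(training)  # 284
nrow(testing)   # 24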
Create a model that uses all variables except Year and Month
fit <- lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols,
data=training)
summary(fit)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 +
## TSI + Aerosols, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25888 -0.05913 -0.00082 0.05649 0.32433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.246e+02 1.989e+01 -6.265 1.43e-09 ***
## MEI 6.421e-02 6.470e-03 9.923 < 2e-16 ***
## CO2 6.457e-03 2.285e-03 2.826 0.00505 **
## CH4 1.240e-04 5.158e-04 0.240 0.81015
## N2O -1.653e-02 8.565e-03 -1.930 0.05467 .
## CFC.11 -6.631e-03 1.626e-03 -4.078 5.96e-05 ***
## CFC.12 3.808e-03 1.014e-03 3.757 0.00021 ***
## TSI 9.314e-02 1.475e-02 6.313 1.10e-09 ***
## Aerosols -1.538e+00 2.133e-01 -7.210 5.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7436
## F-statistic: 103.6 on 8 and 275 DF, p-value: < 2.2e-16
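As an aside, the same model can be written more compactly with R's formula shorthand; fit.alt is a name introduced here for illustration:
fit.alt <- lm(Temp ~ . - Year - Month, data=training)  # '.' expands to all other columns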
Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? Select all that apply.
abs(cor(training)) > 0.7  # abs() so that strong negative correlations also count
## Year Month MEI CO2 CH4 N2O CFC.11 CFC.12 TSI Aerosols
## Year TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
## Month FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## MEI FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## CO2 TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
## CH4 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## N2O TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
## CFC.11 FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## CFC.12 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## TSI FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## Aerosols FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## Temp TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## Temp
## Year TRUE
## Month FALSE
## MEI FALSE
## CO2 TRUE
## CH4 TRUE
## N2O TRUE
## CFC.11 FALSE
## CFC.12 FALSE
## TSI FALSE
## Aerosols FALSE
## Temp TRUE
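Reading off the N2O row of the table above (and setting aside N2O itself, Year, and the dependent variable Temp), the independent variables highly correlated with N2O are CO2, CH4, and CFC.12. A sketch that extracts that row directly:
which(abs(cor(training))["N2O", ] > 0.7)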
Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.
fit.n2o <- lm(Temp ~ MEI + TSI + Aerosols + N2O, data=training)
summary(fit.n2o)
##
## Call:
## lm(formula = Temp ~ MEI + TSI + Aerosols + N2O, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27916 -0.05975 -0.00595 0.05672 0.34195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.162e+02 2.022e+01 -5.747 2.37e-08 ***
## MEI 6.419e-02 6.652e-03 9.649 < 2e-16 ***
## TSI 7.949e-02 1.487e-02 5.344 1.89e-07 ***
## Aerosols -1.702e+00 2.180e-01 -7.806 1.19e-13 ***
## N2O 2.532e-02 1.311e-03 19.307 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09547 on 279 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7222
## F-statistic: 184.9 on 4 and 279 DF, p-value: < 2.2e-16
We have many variables in this problem, and as we saw above, dropping some of them barely decreases model quality. R provides a function, step, that automates the procedure of trying different combinations of variables to find a good compromise between model simplicity and R-squared. This trade-off is formalized by the Akaike information criterion (AIC): informally, it scores the quality of the model with a penalty for the number of variables in the model.
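Under the hood, step() ranks candidate models by the AIC reported by extractAIC(); for lm objects this is n*log(RSS/n) + 2k (up to an additive constant), where k is the number of estimated coefficients. A sketch reproducing the starting AIC printed by step() below:
n <- nrow(training)
rss <- sum(residuals(fit)^2)  # 2.3130, the RSS of the full model
k <- length(coef(fit))        # 9 coefficients: intercept + 8 predictors
n * log(rss / n) + 2 * k      # about -1348.16, matching step()'s "Start" AIC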
The step function takes the initial model as its argument and returns a simplified model. Use the step function in R to derive a new model, with the full model as the initial model (HINT: If your initial full model was called “climateLM”, you could create a new model with the step function by typing step(climateLM). Be sure to save your new model to a variable name so that you can look at the summary. For more information about the step function, type ?step in your R console.)
fit.new <- step(fit)
## Start: AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## - CH4 1 0.00049 2.3135 -1350.1
## <none> 2.3130 -1348.2
## - N2O 1 0.03132 2.3443 -1346.3
## - CO2 1 0.06719 2.3802 -1342.0
## - CFC.12 1 0.11874 2.4318 -1335.9
## - CFC.11 1 0.13986 2.4529 -1333.5
## - TSI 1 0.33516 2.6482 -1311.7
## - Aerosols 1 0.43727 2.7503 -1301.0
## - MEI 1 0.82823 3.1412 -1263.2
##
## Step: AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## <none> 2.3135 -1350.1
## - N2O 1 0.03133 2.3448 -1348.3
## - CO2 1 0.06672 2.3802 -1344.0
## - CFC.12 1 0.13023 2.4437 -1336.5
## - CFC.11 1 0.13938 2.4529 -1335.5
## - TSI 1 0.33500 2.6485 -1313.7
## - Aerosols 1 0.43987 2.7534 -1302.7
## - MEI 1 0.83118 3.1447 -1264.9
summary(fit.new)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI +
## Aerosols, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25770 -0.05994 -0.00104 0.05588 0.32203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.245e+02 1.985e+01 -6.273 1.37e-09 ***
## MEI 6.407e-02 6.434e-03 9.958 < 2e-16 ***
## CO2 6.402e-03 2.269e-03 2.821 0.005129 **
## N2O -1.602e-02 8.287e-03 -1.933 0.054234 .
## CFC.11 -6.609e-03 1.621e-03 -4.078 5.95e-05 ***
## CFC.12 3.868e-03 9.812e-04 3.942 0.000103 ***
## TSI 9.312e-02 1.473e-02 6.322 1.04e-09 ***
## Aerosols -1.540e+00 2.126e-01 -7.244 4.36e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09155 on 276 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7445
## F-statistic: 118.8 on 7 and 276 DF, p-value: < 2.2e-16
We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?
Using the model produced from the step function, calculate temperature predictions for the testing data set, using the predict function.
Enter the testing-set R-squared:
temp.hat <- predict(fit.new, newdata=testing)
sse <- sum((testing$Temp - temp.hat)^2)
sst <- sum((testing$Temp - mean(training$Temp))^2)
1 - sse / sst
## [1] 0.6286051
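Since this SSE/SST pattern recurs in the later problems, it can be wrapped in a small helper; osr2 is a name introduced here for illustration, not part of the assignment:
osr2 <- function(actual, predicted, baseline) {
  # out-of-sample R-squared: 1 - SSE/SST, where SST is computed against
  # the training-set mean supplied as 'baseline'
  1 - sum((actual - predicted)^2) / sum((actual - baseline)^2)
}
osr2(testing$Temp, temp.hat, mean(training$Temp))  # 0.6286051, as above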
pisaTrain <- read.csv("pisa2009train.csv")
pisaTest <- read.csv("pisa2009test.csv")
summary(pisaTrain)
## grade male raceeth
## Min. : 8.00 Min. :0.0000 White :2015
## 1st Qu.:10.00 1st Qu.:0.0000 Hispanic : 834
## Median :10.00 Median :1.0000 Black : 444
## Mean :10.09 Mean :0.5111 Asian : 143
## 3rd Qu.:10.00 3rd Qu.:1.0000 More than one race: 124
## Max. :12.00 Max. :1.0000 (Other) : 68
## NA's : 35
## preschool expectBachelors motherHS motherBachelors
## Min. :0.0000 Min. :0.0000 Min. :0.00 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.00 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.00 Median :0.0000
## Mean :0.7228 Mean :0.7859 Mean :0.88 Mean :0.3481
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.00 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00 Max. :1.0000
## NA's :56 NA's :62 NA's :97 NA's :397
## motherWork fatherHS fatherBachelors fatherWork
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :1.0000
## Mean :0.7345 Mean :0.8593 Mean :0.3319 Mean :0.8531
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :93 NA's :245 NA's :569 NA's :233
## selfBornUS motherBornUS fatherBornUS englishAtHome
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.9313 Mean :0.7725 Mean :0.7668 Mean :0.8717
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :69 NA's :71 NA's :113 NA's :71
## computerForSchoolwork read30MinsADay minutesPerWeekEnglish
## Min. :0.0000 Min. :0.0000 Min. : 0.0
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 225.0
## Median :1.0000 Median :0.0000 Median : 250.0
## Mean :0.8994 Mean :0.2899 Mean : 266.2
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 300.0
## Max. :1.0000 Max. :1.0000 Max. :2400.0
## NA's :65 NA's :34 NA's :186
## studentsInEnglish schoolHasLibrary publicSchool urban
## Min. : 1.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:20.0 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :25.0 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :24.5 Mean :0.9676 Mean :0.9339 Mean :0.3849
## 3rd Qu.:30.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :75.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :249 NA's :143
## schoolSize readingScore
## Min. : 100 Min. :168.6
## 1st Qu.: 712 1st Qu.:431.7
## Median :1212 Median :499.7
## Mean :1369 Mean :497.9
## 3rd Qu.:1900 3rd Qu.:566.2
## Max. :6694 Max. :746.0
## NA's :162
Average reading scores of males and females
tapply(pisaTrain$readingScore, pisaTrain$male==1, mean)
## FALSE TRUE
## 512.9406 483.5325
Find all columns with missing data
sapply(pisaTrain, function(c) any(is.na(c)))
## grade male raceeth
## FALSE FALSE TRUE
## preschool expectBachelors motherHS
## TRUE TRUE TRUE
## motherBachelors motherWork fatherHS
## TRUE TRUE TRUE
## fatherBachelors fatherWork selfBornUS
## TRUE TRUE TRUE
## motherBornUS fatherBornUS englishAtHome
## TRUE TRUE TRUE
## computerForSchoolwork read30MinsADay minutesPerWeekEnglish
## TRUE TRUE TRUE
## studentsInEnglish schoolHasLibrary publicSchool
## TRUE TRUE FALSE
## urban schoolSize readingScore
## FALSE TRUE FALSE
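sapply() flags which columns contain any NAs; for per-column counts instead of a yes/no flag, colSums() over the NA indicator matrix works too (a sketch, output omitted):
colSums(is.na(pisaTrain))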
Remove all observations with missing values from the training and testing sets
pisaTrain <- na.omit(pisaTrain)
pisaTest <- na.omit(pisaTest)
nrow(pisaTrain)
## [1] 2414
nrow(pisaTest)
## [1] 990
Which unordered and ordered factors have at least 3 levels?
str(pisaTrain)
## 'data.frame': 2414 obs. of 24 variables:
## $ grade : int 11 10 10 10 10 10 10 10 11 9 ...
## $ male : int 1 0 1 0 1 0 0 0 1 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 3 4 7 5 4 7 4 7 7 ...
## $ preschool : int 0 1 1 1 1 1 1 1 1 1 ...
## $ expectBachelors : int 0 1 0 1 1 1 1 0 1 1 ...
## $ motherHS : int 1 0 1 1 1 1 1 0 1 1 ...
## $ motherBachelors : int 1 0 0 0 1 0 0 0 0 1 ...
## $ motherWork : int 1 1 1 0 1 1 1 0 0 1 ...
## $ fatherHS : int 1 1 1 1 0 1 1 0 1 1 ...
## $ fatherBachelors : int 0 0 0 0 0 0 1 0 1 1 ...
## $ fatherWork : int 1 1 0 1 1 0 1 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 0 1 0 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 0 1 0 1 1 ...
## $ fatherBornUS : int 1 1 0 1 1 0 1 0 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 0 1 0 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 0 1 1 1 1 ...
## $ read30MinsADay : int 1 1 1 1 0 1 1 1 0 0 ...
## $ minutesPerWeekEnglish: int 450 200 250 300 294 232 225 270 275 225 ...
## $ studentsInEnglish : int 25 23 35 30 24 14 20 25 30 15 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 0 ...
## $ urban : int 0 1 1 0 0 0 0 1 1 1 ...
## $ schoolSize : int 1173 2640 1095 1913 899 1733 149 1400 1988 915 ...
## $ readingScore : num 575 458 614 439 466 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1249] 1 3 6 7 9 11 13 21 29 30 ...
## .. ..- attr(*, "names")= chr [1:1249] "1" "3" "6" "7" ...
Because the race variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv() – you can see this when you run str(pisaTrain) or str(pisaTest). However, by default R selects the first level alphabetically (“American Indian/Alaska Native”) as the reference level of our factor instead of the most common level (“White”). Set the reference level of the factor by typing the following two lines in your R console:
pisaTrain$raceeth <- relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth <- relevel(pisaTest$raceeth, "White")
What is the Multiple R-squared value of lmScore on the training set?
lm.score <- lm(readingScore ~ ., data=pisaTrain)
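The Multiple R-squared appears in the output of summary(lm.score); it can also be extracted directly (a sketch, output omitted here):
summary(lm.score)$r.squared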
What is the training-set root mean squared error (RMSE) of lmScore?
sse <- sum(residuals(lm.score)^2)
sqrt(sse / nrow(pisaTrain))
## [1] 73.36555
Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?
m.grade <- coef(lm.score)["grade"]
(11 - 9) * m.grade
## grade
## 59.08541
What is the range between the maximum and minimum predicted reading score on the test set?
score.hat <- predict(lm.score, newdata=pisaTest)
summary(score.hat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 353.2 482.0 524.0 516.7 555.7 637.7
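The range then follows from the extremes of the predictions:
max(score.hat) - min(score.hat)  # approximately 637.7 - 353.2 = 284.5, per the summary above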
What is the sum of squared errors (SSE) and RMSE of lmScore on the testing set?
(sse <- sum((pisaTest$readingScore - score.hat)^2))
## [1] 5762082
sqrt(sse / nrow(pisaTest))
## [1] 76.29079
What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.
(score.baseline <- mean(pisaTrain$readingScore))
## [1] 517.9629
What is the sum of squared errors of the baseline model on the testing set?
(sst <- sum((pisaTest$readingScore - score.baseline)^2))
## [1] 7802354
What is the test-set R-squared value of lmScore?
1 - sse / sst
## [1] 0.2614944
fluTrain <- read.csv("FluTrain.csv")
str(fluTrain)
## 'data.frame': 417 obs. of 3 variables:
## $ Week : Factor w/ 417 levels "2004-01-04 - 2004-01-10",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ ILI : num 2.42 1.81 1.71 1.54 1.44 ...
## $ Queries: num 0.238 0.22 0.226 0.238 0.224 ...
Week with highest percentage of ILI-related physician visits:
fluTrain[which.max(fluTrain$ILI),]
## Week ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892 1
Week with highest percentage of ILI-related queries:
fluTrain[which.max(fluTrain$Queries),]
## Week ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892 1
The ILI data are right-skewed:
hist(fluTrain$ILI)
When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself – this prevents the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models. In this problem, we will predict the natural log of the ILI variable, which can be computed in R using the log() function.
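To see the effect of the transform, compare the histogram of log(ILI) with the one above; the long right tail is pulled in (a sketch):
hist(log(fluTrain$ILI))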
Plot the natural log of ILI against Queries. What does the plot suggest?
plot(x=fluTrain$Queries, y=log(fluTrain$ILI))
What is the training set R-squared value for the FluTrend1 model (the “Multiple R-squared”)?
fluTrend1 <- lm(log(ILI) ~ Queries, data=fluTrain)
summary(fluTrend1)
##
## Call:
## lm(formula = log(ILI) ~ Queries, data = fluTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76003 -0.19696 -0.01657 0.18685 1.06450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.49934 0.03041 -16.42 <2e-16 ***
## Queries 2.96129 0.09312 31.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared: 0.709, Adjusted R-squared: 0.7083
## F-statistic: 1011 on 1 and 415 DF, p-value: < 2.2e-16
For a single-variable linear regression model, there is a direct relationship between the R-squared and the correlation between the independent and dependent variables: R-squared is the square of the correlation. The calculation below reproduces the Multiple R-squared reported above.
cor(fluTrain$Queries, log(fluTrain$ILI))^2
## [1] 0.7090201
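Equivalently, the R-squared can be pulled straight from the fitted model; for a one-variable regression it equals the squared correlation just computed:
summary(fluTrend1)$r.squared  # 0.7090201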
What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012?
fluTest <- read.csv("FluTest.csv")
ili.hat <- exp(predict(fluTrend1, newdata=fluTest))
idx.0311 <- which(fluTest$Week == "2012-03-11 - 2012-03-17")
ili.hat[idx.0311]
## 11
## 2.187378
What is the relative error between the estimate (our prediction) and the observed value for the week of March 11, 2012?
q <- fluTest$ILI[idx.0311]
(q - ili.hat[idx.0311]) / q
## 11
## 0.04623827
What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?
sse <- sum((fluTest$ILI - ili.hat)^2)
sqrt(sse / nrow(fluTest))
## [1] 0.7490645
The observations in this dataset are consecutive weekly measurements of the dependent and independent variables. This sort of dataset is called a “time series.” Often, statistical models can be improved by predicting the current value of the dependent variable using the value of the dependent variable from earlier weeks. In our models, this means we will predict the ILI variable in the current week using values of the ILI variable from previous weeks.
First, we need to decide the amount of time to lag the observations. Because the ILI variable is reported with a 1- or 2-week lag, a decision maker cannot rely on the previous week’s ILI value to predict the current week’s value. Instead, the decision maker will only have data available from 2 or more weeks ago. We will build a variable called ILILag2 (ILI.lag2 in the code below) that contains the ILI value from 2 weeks before the current observation.
library(zoo)
## Warning: package 'zoo' was built under R version 3.1.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
ILI.lag2 <- lag(zoo(fluTrain$ILI), -2, na.pad=TRUE)  # shift the series back 2 weeks; pad the first two entries with NA
fluTrain$ILI.lag2 <- coredata(ILI.lag2)              # drop the zoo wrapper, keep the plain vector
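For reference, the same lagged variable can be constructed in base R without zoo (a sketch):
ILI.lag2.base <- c(NA, NA, head(fluTrain$ILI, -2))  # pad the first two weeks with NA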
sum(is.na(fluTrain$ILI.lag2))
## [1] 2
Use the plot() function to plot the log of ILILag2 against the log of ILI. Which best describes the relationship between these two variables?
plot(x=log(fluTrain$ILI), y=log(fluTrain$ILI.lag2))
Train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable. Call this model FluTrend2.
fluTrend2 <- lm(log(ILI) ~ Queries + log(ILI.lag2), data=fluTrain)
Add ILI.lag2 to fluTest in the same way:
ILI.lag2 <- lag(zoo(fluTest$ILI), -2, na.pad=TRUE)
fluTest$ILI.lag2 <- coredata(ILI.lag2)
sum(is.na(fluTest$ILI.lag2))
## [1] 2
Fill the two missing ILI.lag2 values at the start of fluTest with the last two ILI observations from fluTrain:
fluTest$ILI.lag2[1] <- fluTrain$ILI[416]  # second-to-last training week
fluTest$ILI.lag2[2] <- fluTrain$ILI[417]  # last training week (fluTrain has 417 rows)
Obtain test set predictions of the ILI variable from the FluTrend2 model, again remembering to call the exp() function on the result of the predict() function to obtain predictions for ILI instead of log(ILI).
What is the test-set RMSE of the FluTrend2 model?
ILI.hat <- exp(predict(fluTrend2, newdata=fluTest))
sse <- sum((fluTest$ILI - ILI.hat)^2)
sqrt(sse / nrow(fluTest))
## [1] 0.2942029
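This is a substantial improvement over FluTrend1, whose test-set RMSE was 0.7490645: adding the two-week lag of ILI cuts the error by more than half.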