Climate Change

climate <- read.csv("climate_change.csv")
str(climate)
## 'data.frame':    308 obs. of  11 variables:
##  $ Year    : int  1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
##  $ Month   : int  5 6 7 8 9 10 11 12 1 2 ...
##  $ MEI     : num  2.556 2.167 1.741 1.13 0.428 ...
##  $ CO2     : num  346 346 344 342 340 ...
##  $ CH4     : num  1639 1634 1633 1631 1648 ...
##  $ N2O     : num  304 304 304 304 304 ...
##  $ CFC.11  : num  191 192 193 194 194 ...
##  $ CFC.12  : num  350 352 354 356 357 ...
##  $ TSI     : num  1366 1366 1366 1366 1366 ...
##  $ Aerosols: num  0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
##  $ Temp    : num  0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...

Creating testing and training sets

training <- subset(climate, climate$Year <= 2006)
testing <- subset(climate, climate$Year > 2006)

Create model which uses all parameters except Year and Month

fit <- lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, 
          data=training)
summary(fit)
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + 
##     TSI + Aerosols, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25888 -0.05913 -0.00082  0.05649  0.32433 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.246e+02  1.989e+01  -6.265 1.43e-09 ***
## MEI          6.421e-02  6.470e-03   9.923  < 2e-16 ***
## CO2          6.457e-03  2.285e-03   2.826  0.00505 ** 
## CH4          1.240e-04  5.158e-04   0.240  0.81015    
## N2O         -1.653e-02  8.565e-03  -1.930  0.05467 .  
## CFC.11      -6.631e-03  1.626e-03  -4.078 5.96e-05 ***
## CFC.12       3.808e-03  1.014e-03   3.757  0.00021 ***
## TSI          9.314e-02  1.475e-02   6.313 1.10e-09 ***
## Aerosols    -1.538e+00  2.133e-01  -7.210 5.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7436 
## F-statistic: 103.6 on 8 and 275 DF,  p-value: < 2.2e-16

Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? Select all that apply.

cor(training) > 0.7
##           Year Month   MEI   CO2   CH4   N2O CFC.11 CFC.12   TSI Aerosols
## Year      TRUE FALSE FALSE  TRUE  TRUE  TRUE  FALSE   TRUE FALSE    FALSE
## Month    FALSE  TRUE FALSE FALSE FALSE FALSE  FALSE  FALSE FALSE    FALSE
## MEI      FALSE FALSE  TRUE FALSE FALSE FALSE  FALSE  FALSE FALSE    FALSE
## CO2       TRUE FALSE FALSE  TRUE  TRUE  TRUE  FALSE   TRUE FALSE    FALSE
## CH4       TRUE FALSE FALSE  TRUE  TRUE  TRUE   TRUE   TRUE FALSE    FALSE
## N2O       TRUE FALSE FALSE  TRUE  TRUE  TRUE  FALSE   TRUE FALSE    FALSE
## CFC.11   FALSE FALSE FALSE FALSE  TRUE FALSE   TRUE   TRUE FALSE    FALSE
## CFC.12    TRUE FALSE FALSE  TRUE  TRUE  TRUE   TRUE   TRUE FALSE    FALSE
## TSI      FALSE FALSE FALSE FALSE FALSE FALSE  FALSE  FALSE  TRUE    FALSE
## Aerosols FALSE FALSE FALSE FALSE FALSE FALSE  FALSE  FALSE FALSE     TRUE
## Temp      TRUE FALSE FALSE  TRUE  TRUE  TRUE  FALSE  FALSE FALSE    FALSE
##           Temp
## Year      TRUE
## Month    FALSE
## MEI      FALSE
## CO2       TRUE
## CH4       TRUE
## N2O       TRUE
## CFC.11   FALSE
## CFC.12   FALSE
## TSI      FALSE
## Aerosols FALSE
## Temp      TRUE

Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.

fit.n2o <- lm(Temp ~ MEI + TSI + Aerosols + N2O, data=training)
summary(fit.n2o)
## 
## Call:
## lm(formula = Temp ~ MEI + TSI + Aerosols + N2O, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27916 -0.05975 -0.00595  0.05672  0.34195 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.162e+02  2.022e+01  -5.747 2.37e-08 ***
## MEI          6.419e-02  6.652e-03   9.649  < 2e-16 ***
## TSI          7.949e-02  1.487e-02   5.344 1.89e-07 ***
## Aerosols    -1.702e+00  2.180e-01  -7.806 1.19e-13 ***
## N2O          2.532e-02  1.311e-03  19.307  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09547 on 279 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7222 
## F-statistic: 184.9 on 4 and 279 DF,  p-value: < 2.2e-16

We have many variables in this problem, and as we have seen above, dropping some from the model does not decrease model quality. R provides a function, step, that will automate the procedure of trying different combinations of variables to find a good compromise of model simplicity and R2. This trade-off is formalized by the Akaike information criterion (AIC) - it can be informally thought of as the quality of the model with a penalty for the number of variables in the model.

The step function has one argument - the name of the initial model. It returns a simplified model. Use the step function in R to derive a new model, with the full model as the initial model (HINT: If your initial full model was called “climateLM”, you could create a new model with the step function by typing step(climateLM). Be sure to save your new model to a variable name so that you can look at the summary. For more information about the step function, type ?step in your R console.)

fit.new <- step(fit)
## Start:  AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## - CH4       1   0.00049 2.3135 -1350.1
## <none>                  2.3130 -1348.2
## - N2O       1   0.03132 2.3443 -1346.3
## - CO2       1   0.06719 2.3802 -1342.0
## - CFC.12    1   0.11874 2.4318 -1335.9
## - CFC.11    1   0.13986 2.4529 -1333.5
## - TSI       1   0.33516 2.6482 -1311.7
## - Aerosols  1   0.43727 2.7503 -1301.0
## - MEI       1   0.82823 3.1412 -1263.2
## 
## Step:  AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## <none>                  2.3135 -1350.1
## - N2O       1   0.03133 2.3448 -1348.3
## - CO2       1   0.06672 2.3802 -1344.0
## - CFC.12    1   0.13023 2.4437 -1336.5
## - CFC.11    1   0.13938 2.4529 -1335.5
## - TSI       1   0.33500 2.6485 -1313.7
## - Aerosols  1   0.43987 2.7534 -1302.7
## - MEI       1   0.83118 3.1447 -1264.9
summary(fit.new)
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + 
##     Aerosols, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25770 -0.05994 -0.00104  0.05588  0.32203 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.245e+02  1.985e+01  -6.273 1.37e-09 ***
## MEI          6.407e-02  6.434e-03   9.958  < 2e-16 ***
## CO2          6.402e-03  2.269e-03   2.821 0.005129 ** 
## N2O         -1.602e-02  8.287e-03  -1.933 0.054234 .  
## CFC.11      -6.609e-03  1.621e-03  -4.078 5.95e-05 ***
## CFC.12       3.868e-03  9.812e-04   3.942 0.000103 ***
## TSI          9.312e-02  1.473e-02   6.322 1.04e-09 ***
## Aerosols    -1.540e+00  2.126e-01  -7.244 4.36e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09155 on 276 degrees of freedom
## Multiple R-squared:  0.7508, Adjusted R-squared:  0.7445 
## F-statistic: 118.8 on 7 and 276 DF,  p-value: < 2.2e-16

We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?

Using the model produced from the step function, calculate temperature predictions for the testing data set, using the predict function.

Enter the testing set R2:

temp.hat <- predict(fit.new, newdata=testing)
sse <- sum((testing$Temp - temp.hat)^2)
sst <- sum((testing$Temp - mean(training$Temp))^2)
1 - sse / sst
## [1] 0.6286051

Reading Test Scores

pisaTrain <- read.csv("pisa2009train.csv")
pisaTest <- read.csv("pisa2009test.csv")
summary(pisaTrain)
##      grade            male                      raceeth    
##  Min.   : 8.00   Min.   :0.0000   White             :2015  
##  1st Qu.:10.00   1st Qu.:0.0000   Hispanic          : 834  
##  Median :10.00   Median :1.0000   Black             : 444  
##  Mean   :10.09   Mean   :0.5111   Asian             : 143  
##  3rd Qu.:10.00   3rd Qu.:1.0000   More than one race: 124  
##  Max.   :12.00   Max.   :1.0000   (Other)           :  68  
##                                   NA's              :  35  
##    preschool      expectBachelors     motherHS    motherBachelors 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.00   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :1.00   Median :0.0000  
##  Mean   :0.7228   Mean   :0.7859   Mean   :0.88   Mean   :0.3481  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.00   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00   Max.   :1.0000  
##  NA's   :56       NA's   :62       NA's   :97     NA's   :397     
##    motherWork        fatherHS      fatherBachelors    fatherWork    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.7345   Mean   :0.8593   Mean   :0.3319   Mean   :0.8531  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :93       NA's   :245      NA's   :569      NA's   :233     
##    selfBornUS      motherBornUS     fatherBornUS    englishAtHome   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.9313   Mean   :0.7725   Mean   :0.7668   Mean   :0.8717  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :69       NA's   :71       NA's   :113      NA's   :71      
##  computerForSchoolwork read30MinsADay   minutesPerWeekEnglish
##  Min.   :0.0000        Min.   :0.0000   Min.   :   0.0       
##  1st Qu.:1.0000        1st Qu.:0.0000   1st Qu.: 225.0       
##  Median :1.0000        Median :0.0000   Median : 250.0       
##  Mean   :0.8994        Mean   :0.2899   Mean   : 266.2       
##  3rd Qu.:1.0000        3rd Qu.:1.0000   3rd Qu.: 300.0       
##  Max.   :1.0000        Max.   :1.0000   Max.   :2400.0       
##  NA's   :65            NA's   :34       NA's   :186          
##  studentsInEnglish schoolHasLibrary  publicSchool        urban       
##  Min.   : 1.0      Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:20.0      1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :25.0      Median :1.0000   Median :1.0000   Median :0.0000  
##  Mean   :24.5      Mean   :0.9676   Mean   :0.9339   Mean   :0.3849  
##  3rd Qu.:30.0      3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :75.0      Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :249       NA's   :143                                       
##    schoolSize    readingScore  
##  Min.   : 100   Min.   :168.6  
##  1st Qu.: 712   1st Qu.:431.7  
##  Median :1212   Median :499.7  
##  Mean   :1369   Mean   :497.9  
##  3rd Qu.:1900   3rd Qu.:566.2  
##  Max.   :6694   Max.   :746.0  
##  NA's   :162

Average test scores of males and females

tapply(pisaTrain$readingScore, pisaTrain$male==1, mean)
##    FALSE     TRUE 
## 512.9406 483.5325

Find all columns with missing data

sapply(pisaTrain, function(c) any(is.na(c)))
##                 grade                  male               raceeth 
##                 FALSE                 FALSE                  TRUE 
##             preschool       expectBachelors              motherHS 
##                  TRUE                  TRUE                  TRUE 
##       motherBachelors            motherWork              fatherHS 
##                  TRUE                  TRUE                  TRUE 
##       fatherBachelors            fatherWork            selfBornUS 
##                  TRUE                  TRUE                  TRUE 
##          motherBornUS          fatherBornUS         englishAtHome 
##                  TRUE                  TRUE                  TRUE 
## computerForSchoolwork        read30MinsADay minutesPerWeekEnglish 
##                  TRUE                  TRUE                  TRUE 
##     studentsInEnglish      schoolHasLibrary          publicSchool 
##                  TRUE                  TRUE                 FALSE 
##                 urban            schoolSize          readingScore 
##                 FALSE                  TRUE                 FALSE

Remove all missing data from training and testing datasets

pisaTrain <- na.omit(pisaTrain)
pisaTest <- na.omit(pisaTest)

(nrow(pisaTrain))
## [1] 2414
nrow(pisaTest)
## [1] 990

Which unordered and ordered factors have at least 3 levels

str(pisaTrain)
## 'data.frame':    2414 obs. of  24 variables:
##  $ grade                : int  11 10 10 10 10 10 10 10 11 9 ...
##  $ male                 : int  1 0 1 0 1 0 0 0 1 1 ...
##  $ raceeth              : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 3 4 7 5 4 7 4 7 7 ...
##  $ preschool            : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ expectBachelors      : int  0 1 0 1 1 1 1 0 1 1 ...
##  $ motherHS             : int  1 0 1 1 1 1 1 0 1 1 ...
##  $ motherBachelors      : int  1 0 0 0 1 0 0 0 0 1 ...
##  $ motherWork           : int  1 1 1 0 1 1 1 0 0 1 ...
##  $ fatherHS             : int  1 1 1 1 0 1 1 0 1 1 ...
##  $ fatherBachelors      : int  0 0 0 0 0 0 1 0 1 1 ...
##  $ fatherWork           : int  1 1 0 1 1 0 1 1 1 1 ...
##  $ selfBornUS           : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ motherBornUS         : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ fatherBornUS         : int  1 1 0 1 1 0 1 0 1 1 ...
##  $ englishAtHome        : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ computerForSchoolwork: int  1 1 1 1 1 0 1 1 1 1 ...
##  $ read30MinsADay       : int  1 1 1 1 0 1 1 1 0 0 ...
##  $ minutesPerWeekEnglish: int  450 200 250 300 294 232 225 270 275 225 ...
##  $ studentsInEnglish    : int  25 23 35 30 24 14 20 25 30 15 ...
##  $ schoolHasLibrary     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ publicSchool         : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ urban                : int  0 1 1 0 0 0 0 1 1 1 ...
##  $ schoolSize           : int  1173 2640 1095 1913 899 1733 149 1400 1988 915 ...
##  $ readingScore         : num  575 458 614 439 466 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1249] 1 3 6 7 9 11 13 21 29 30 ...
##   .. ..- attr(*, "names")= chr [1:1249] "1" "3" "6" "7" ...

Because the race variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv() – you can see this when you run str(pisaTrain) or str(pisaTest). However, by default R selects the first level alphabetically (“American Indian/Alaska Native”) as the reference level of our factor instead of the most common level (“White”). Set the reference level of the factor by typing the following two lines in your R console:

pisaTrain$raceeth <- relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth <- relevel(pisaTest$raceeth, "White")

What is the Multiple R-squared value of lmScore on the training set?

lm.score <- lm(readingScore ~ ., data=pisaTrain)

What is the training-set root-mean squared error (RMSE) of lmScore?

sse <- sum(lm.score$residual^2)
sqrt(sse / nrow(pisaTrain))
## [1] 73.36555

Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?

m.grade <- lm.score$coeff["grade"]
(11 - 9) * m.grade
##    grade 
## 59.08541

What is the range between the maximum and minimum predicted reading score on the test set?

score.hat <- predict(lm.score, pisaTest)
summary(score.hat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   353.2   482.0   524.0   516.7   555.7   637.7

What is the sum of squared errors (SSE) and RMSE of lmScore on the testing set?

(sse <- sum((pisaTest$readingScore - score.hat)^2))
## [1] 5762082
sqrt(sse / nrow(pisaTest))
## [1] 76.29079

What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.

(score.baseline <- mean(pisaTrain$readingScore))
## [1] 517.9629

What is the sum of squared errors of the baseline model on the testing set?

(sst <- sum((pisaTest$readingScore - score.baseline)^2))
## [1] 7802354

What is the test-set R-squared value of lmScore?

1 - sse / sst
## [1] 0.2614944

Detecting flu epidemics via search engine query data

fluTrain <- read.csv("FluTrain.csv")
str(fluTrain)
## 'data.frame':    417 obs. of  3 variables:
##  $ Week   : Factor w/ 417 levels "2004-01-04 - 2004-01-10",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ILI    : num  2.42 1.81 1.71 1.54 1.44 ...
##  $ Queries: num  0.238 0.22 0.226 0.238 0.224 ...

Week with highest percentage of ILI-related physician visits:

fluTrain[which.max(fluTrain$ILI),]
##                        Week      ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892       1

Week with highest percentage of ILI-related queries

fluTrain[which.max(fluTrain$Queries),]
##                        Week      ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892       1

Data is skew right

hist(fluTrain$ILI)

When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself – this prevents the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models. In this problem, we will predict the natural log of the ILI variable, which can be computed in R using the log() function.

Plot natural log of ILI vs Queries, what does it suggest?

plot(x=fluTrain$Queries, y=log(fluTrain$ILI))

What is the training set R-squared value for FluTrend1 model (the “Multiple R-squared”)?

fluTrend <- lm(log(ILI) ~ Queries, data=fluTrain)
summary(fluTrend)
## 
## Call:
## lm(formula = log(ILI) ~ Queries, data = fluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76003 -0.19696 -0.01657  0.18685  1.06450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.49934    0.03041  -16.42   <2e-16 ***
## Queries      2.96129    0.09312   31.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared:  0.709,  Adjusted R-squared:  0.7083 
## F-statistic:  1011 on 1 and 415 DF,  p-value: < 2.2e-16

For a single variable linear regression model, there is a direct relationship between the R-squared and the correlation between the independent and the dependent variables. What is the relationship we infer from our problem?

cor(fluTrain$Queries, log(fluTrain$ILI))^2
## [1] 0.7090201

What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012?

fluTest <- read.csv("FluTest.csv")
ili.hat <- exp(predict(fluTrend, fluTest))
idx.0311 <- which(fluTest$Week == "2012-03-11 - 2012-03-17")
ili.hat[idx.0311]
##       11 
## 2.187378

What is the relative error betweeen the estimate (our prediction) and the observed value for the week of March 11, 2012?

q <- fluTest$ILI[idx.0311]
(q - ili.hat[idx.0311]) / q
##         11 
## 0.04623827

What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?

sse <- sum((fluTest$ILI - ili.hat)^2)
sqrt(sse / nrow(fluTest))
## [1] 0.7490645

The observations in this dataset are consecutive weekly measurements of the dependent and independent variables. This sort of dataset is called a “time series.” Often, statistical models can be improved by predicting the current value of the dependent variable using the value of the dependent variable from earlier weeks. In our models, this means we will predict the ILI variable in the current week using values of the ILI variable from previous weeks.

First, we need to decide the amount of time to lag the observations. Because the ILI variable is reported with a 1- or 2-week lag, a decision maker cannot rely on the previous week’s ILI value to predict the current week’s value. Instead, the decision maker will only have data available from 2 or more weeks ago. We will build a variable called ILILag2 that contains the ILI value from 2 weeks before the current observation.

library(zoo)
## Warning: package 'zoo' was built under R version 3.1.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
ILI.lag2 <- lag(zoo(fluTrain$ILI), -2, na.pad=TRUE)
fluTrain$ILI.lag2 <- coredata(ILI.lag2)
sum(is.na(fluTest$ILI.lag2))
## Warning in is.na(fluTest$ILI.lag2): is.na() applied to non-(list or
## vector) of type 'NULL'
## [1] 0

Use the plot() function to plot the log of ILILag2 against the log of ILI. Which best describes the relationship between these two variables?

plot(x=log(fluTrain$ILI), y=log(fluTrain$ILI.lag2))

Train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable. Call this model FluTrend2.

fluTrend2 <- lm(log(ILI) ~ Queries + log(ILI.lag2), data=fluTrain)

Add ILI.lag2 to fluTest

ILI.lag2 <- lag(zoo(fluTest$ILI), -2, na.pad=TRUE)
fluTest$ILI.lag2 <- coredata(ILI.lag2)
sum(is.na(fluTest$ILI.lag2))
## [1] 2

Copy missing data for ILI from fluTrain to fluTest

fluTest$ILI.lag2[1] <- fluTrain$ILI[416]
fluTest$ILI.lag2[2] <- fluTrain$ILI[417]

Obtain test set predictions of the ILI variable from the FluTrend2 model, again remembering to call the exp() function on the result of the predict() function to obtain predictions for ILI instead of log(ILI).

What is the test-set RMSE of the FluTrend2 model?

ILI.hat <- exp(predict(fluTrend2, fluTest))
sse <- sum((fluTest$ILI - ILI.hat)^2)
sqrt(sse / nrow(fluTest))
## [1] 0.2942029