Assignment2: Linear Regression

Climate Change
Reading Test Scores
Detecting Flu Epidemics via Search Engine Query Data

Climate Change

Problem 1.1 - Creating Our First Model

We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into R.

Then, split the data into a training set, consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: use subset). A training set refers to the data that will be used to build the model (this is the data we give to the lm() function), and a testing set refers to the data we will use to test our predictive ability.

Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables (Year and Month should NOT be used in the model). Use the training set to build the model.

cc <- read.csv("climate_change.csv")

ccTraining <- subset(cc, Year <= 2006)
ccTesting <- subset(cc, Year > 2006)

TempReg <- lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols,
              data = ccTraining)
summary(TempReg)

## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + 
##     TSI + Aerosols, data = ccTraining)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25888 -0.05913 -0.00082  0.05649  0.32433 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.246e+02  1.989e+01  -6.265 1.43e-09 ***
## MEI          6.421e-02  6.470e-03   9.923  < 2e-16 ***
## CO2          6.457e-03  2.285e-03   2.826  0.00505 ** 
## CH4          1.240e-04  5.158e-04   0.240  0.81015    
## N2O         -1.653e-02  8.565e-03  -1.930  0.05467 .  
## CFC.11      -6.631e-03  1.626e-03  -4.078 5.96e-05 ***
## CFC.12       3.808e-03  1.014e-03   3.757  0.00021 ***
## TSI          9.314e-02  1.475e-02   6.313 1.10e-09 ***
## Aerosols    -1.538e+00  2.133e-01  -7.210 5.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7436 
## F-statistic: 103.6 on 8 and 275 DF,  p-value: < 2.2e-16

Enter the model R2 (the “Multiple R-squared” value):

Problem 1.2 - Creating Our First Model

Which variables are significant in the model? We will consider a variable signficant only if the p-value is below 0.05. (Select all that apply.)

Problem 2.1 - Understanding the Model

Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative, indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.

Which of the following is the simplest correct explanation for this contradiction?

All of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set.

Problem 2.2 - Understanding the Model

Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? Select all that apply.

cor(ccTraining$N2O, ccTraining)

##           Year      Month         MEI       CO2       CH4 N2O    CFC.11
## [1,] 0.9938452 0.01363153 -0.05081978 0.9767198 0.8998386   1 0.5224773
##         CFC.12       TSI   Aerosols      Temp
## [1,] 0.8679308 0.1997567 -0.3370546 0.7786389

Which of the following independent variables is CFC.11 highly correlated with? Select all that apply.

cor(ccTraining$CFC.11, ccTraining)

##           Year       Month        MEI       CO2      CH4       N2O CFC.11
## [1,] 0.5691064 -0.01311122 0.06900044 0.5140597 0.779904 0.5224773      1
##         CFC.12      TSI   Aerosols      Temp
## [1,] 0.8689852 0.272046 -0.0439212 0.4077103

Problem 3 - Simplifying the Model

Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.

Enter the coefficient of N2O in this reduced model:

TempReg2 <- lm(Temp ~ MEI + N2O + TSI + Aerosols, data=ccTraining )
summary(TempReg2)

## 
## Call:
## lm(formula = Temp ~ MEI + N2O + TSI + Aerosols, data = ccTraining)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27916 -0.05975 -0.00595  0.05672  0.34195 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.162e+02  2.022e+01  -5.747 2.37e-08 ***
## MEI          6.419e-02  6.652e-03   9.649  < 2e-16 ***
## N2O          2.532e-02  1.311e-03  19.307  < 2e-16 ***
## TSI          7.949e-02  1.487e-02   5.344 1.89e-07 ***
## Aerosols    -1.702e+00  2.180e-01  -7.806 1.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09547 on 279 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7222 
## F-statistic: 184.9 on 4 and 279 DF,  p-value: < 2.2e-16

Problem 4 - Automatically Building the Model

We have many variables in this problem, and as we have seen above, dropping some from the model does not decrease model quality. R provides a function, step, that will automate the procedure of trying different combinations of variables to find a good compromise of model simplicity and R2. This trade-off is formalized by the Akaike information criterion (AIC) - it can be informally thought of as the quality of the model with a penalty for the number of variables in the model.

The step function has one argument - the name of the initial model. It returns a simplified model. Use the step function in R to derive a new model, with the full model as the initial model (HINT: If your initial full model was called “climateLM”, you could create a new model with the step function by typing step(climateLM). Be sure to save your new model to a variable name so that you can look at the summary. For more information about the step function, type ?step in your R console.)

Enter the R2 value of the model produced by the step function:

TempRegSimp <- step(TempReg)

## Start:  AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## - CH4       1   0.00049 2.3135 -1350.1
## <none>                  2.3130 -1348.2
## - N2O       1   0.03132 2.3443 -1346.3
## - CO2       1   0.06719 2.3802 -1342.0
## - CFC.12    1   0.11874 2.4318 -1335.9
## - CFC.11    1   0.13986 2.4529 -1333.5
## - TSI       1   0.33516 2.6482 -1311.7
## - Aerosols  1   0.43727 2.7503 -1301.0
## - MEI       1   0.82823 3.1412 -1263.2
## 
## Step:  AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## <none>                  2.3135 -1350.1
## - N2O       1   0.03133 2.3448 -1348.3
## - CO2       1   0.06672 2.3802 -1344.0
## - CFC.12    1   0.13023 2.4437 -1336.5
## - CFC.11    1   0.13938 2.4529 -1335.5
## - TSI       1   0.33500 2.6485 -1313.7
## - Aerosols  1   0.43987 2.7534 -1302.7
## - MEI       1   0.83118 3.1447 -1264.9

summary(TempRegSimp)

## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + 
##     Aerosols, data = ccTraining)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25770 -0.05994 -0.00104  0.05588  0.32203 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.245e+02  1.985e+01  -6.273 1.37e-09 ***
## MEI          6.407e-02  6.434e-03   9.958  < 2e-16 ***
## CO2          6.402e-03  2.269e-03   2.821 0.005129 ** 
## N2O         -1.602e-02  8.287e-03  -1.933 0.054234 .  
## CFC.11      -6.609e-03  1.621e-03  -4.078 5.95e-05 ***
## CFC.12       3.868e-03  9.812e-04   3.942 0.000103 ***
## TSI          9.312e-02  1.473e-02   6.322 1.04e-09 ***
## Aerosols    -1.540e+00  2.126e-01  -7.244 4.36e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09155 on 276 degrees of freedom
## Multiple R-squared:  0.7508, Adjusted R-squared:  0.7445 
## F-statistic: 118.8 on 7 and 276 DF,  p-value: < 2.2e-16

It is interesting to note that the step function does not address the collinearity of the variables, except that adding highly correlated variables will not improve the R2 significantly. The consequence of this is that the step function will not necessarily produce a very interpretable model - just a model that has balanced quality and simplicity for a particular weighting of quality and simplicity (AIC).

Problem 5 - Testing on Unseen Data

We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?

Using the model produced from the step function, calculate temperature predictions for the testing data set, using the predict function.

Enter the testing set R2:

#stepModel <- lm(Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data = ccTraining)

predictTemp <- predict(TempRegSimp, ccTesting)
summary(predictTemp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3142  0.3418  0.3771  0.3832  0.4245  0.4678

SSE = sum((ccTesting$Temp - predictTemp)^2)
SST = sum((ccTesting$Temp - mean(ccTraining$Temp))^2)
1 - SSE/SST

## [1] 0.6286051

Reading Test Scores

The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.

Problem 1.1 - Dataset size

Load the training and testing sets using the read.csv() function, and save them as variables with the names pisaTrain and pisaTest.

How many students are there in the training set?

pisaTrain <- read.csv("pisa2009train.csv")
pisaTest <- read.csv("pisa2009test.csv")

nrow(pisaTrain)

## [1] 3663

Problem 1.2 - Summarizing the dataset

Using tapply() on pisaTrain, what is the average reading test score of males? Of females?

## tapply(Summary Variable, Group Variable, Function)
tapply(pisaTrain$readingScore, pisaTrain$male, mean)

##        0        1 
## 512.9406 483.5325

Problem 1.3 - Locating missing values

Which variables are missing data in at least one observation in the training set? Select all that apply.

sapply(pisaTrain, function(x) sum(is.na(x)))

##                 grade                  male               raceeth 
##                     0                     0                    35 
##             preschool       expectBachelors              motherHS 
##                    56                    62                    97 
##       motherBachelors            motherWork              fatherHS 
##                   397                    93                   245 
##       fatherBachelors            fatherWork            selfBornUS 
##                   569                   233                    69 
##          motherBornUS          fatherBornUS         englishAtHome 
##                    71                   113                    71 
## computerForSchoolwork        read30MinsADay minutesPerWeekEnglish 
##                    65                    34                   186 
##     studentsInEnglish      schoolHasLibrary          publicSchool 
##                   249                   143                     0 
##                 urban            schoolSize          readingScore 
##                     0                   162                     0

Problem 1.4 - Removing missing values

Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets. Later in the course, we will learn about imputation, which deals with missing data by filling in missing values with plausible information.

Type the following commands into your R console to remove observations with any missing value from pisaTrain and pisaTest:

pisaTrain = na.omit(pisaTrain)

pisaTest = na.omit(pisaTest)

How many observations are now in the training set? How many observations are now in the testing set?

pisaTrain <- na.omit(pisaTrain)
nrow(pisaTrain)

## [1] 2414

pisaTest <- na.omit(pisaTest)
nrow(pisaTest)

## [1] 990

Problem 2.1 - Factor variables

Factor variables are variables that take on a discrete set of values, like the “Region” variable in the WHO dataset from the second lecture of Unit 1. This is an unordered factor because there isn’t any natural ordering between the levels. An ordered factor has a natural ordering between the levels (an example would be the classifications “large,” “medium,” and “small”).

Which of the following variables is an unordered factor with at least 3 levels? (Select all that apply.) - raceeth

Which of the following variables is an ordered factor with at least 3 levels? (Select all that apply.) - grade

str(pisaTrain$grade)

##  int [1:2414] 11 10 10 10 10 10 10 10 11 9 ...

str(pisaTrain$male)

##  int [1:2414] 1 0 1 0 1 0 0 0 1 1 ...

str(pisaTrain$raceeth)

##  Factor w/ 7 levels "American Indian/Alaska Native",..: 7 3 4 7 5 4 7 4 7 7 ...

Problem 2.2 - Unordered factors in regression models

To include unordered factors in a linear regression model, we define one level as the “reference level” and add a binary variable for each of the remaining levels. In this way, a factor with n levels is replaced by n-1 binary variables. The reference level is typically selected to be the most frequently occurring level in the dataset.

As an example, consider the unordered factor variable “color”, with levels “red”, “green”, and “blue”. If “green” were the reference level, then we would add binary variables “colorred” and “colorblue” to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.

Now, consider the variable “raceeth” in our problem, which has levels “American Indian/Alaska Native”, “Asian”, “Black”, “Hispanic”, “More than one race”, “Native Hawaiian/Other Pacific Islander”, and “White”. Because it is the most common in our population, we will select White as the reference level.

Which binary variables will be included in the regression model? (Select all that apply.)

levels(pisaTrain$raceeth)

## [1] "American Indian/Alaska Native"         
## [2] "Asian"                                 
## [3] "Black"                                 
## [4] "Hispanic"                              
## [5] "More than one race"                    
## [6] "Native Hawaiian/Other Pacific Islander"
## [7] "White"

raceethAmerican Indian/Alaska Native

raceethAsian

raceethBlack

raceethHispanic

raceethMore than one race

raceethNative Hawaiian/Other Pacific Islander

Problem 2.3 - Example unordered factors

Consider again adding our unordered factor race to the regression model with reference level “White”.

For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)

Problem 3.1 - Building a model

Because the race variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv() – you can see this when you run str(pisaTrain) or str(pisaTest). However, by default R selects the first level alphabetically (“American Indian/Alaska Native”) as the reference level of our factor instead of the most common level (“White”). Set the reference level of the factor by typing the following two lines in your R console:

pisaTrain$raceeth = relevel(pisaTrain$raceeth, “White”)

pisaTest$raceeth = relevel(pisaTest$raceeth, “White”)

Now, build a linear regression model (call it lmScore) using the training set to predict readingScore using all the remaining variables.

It would be time-consuming to type all the variables, but R provides the shorthand notation “readingScore ~ .” to mean “predict readingScore using all the other variables in the data frame.” The period is used to replace listing out all of the independent variables. As an example, if your dependent variable is called “Y”, your independent variables are called “X1”, “X2”, and “X3”, and your training data set is called “Train”, instead of the regular notation:

LinReg = lm(Y ~ X1 + X2 + X3, data = Train)

You would use the following command to build your model:

LinReg = lm(Y ~ ., data = Train)

What is the Multiple R-squared value of lmScore on the training set?

pisaTrain$raceeth <- relevel(pisaTrain$raceeth, "White")

pisaTest$raceeth <- relevel(pisaTest$raceeth, "White")

levels(pisaTrain$raceeth)

## [1] "White"                                 
## [2] "American Indian/Alaska Native"         
## [3] "Asian"                                 
## [4] "Black"                                 
## [5] "Hispanic"                              
## [6] "More than one race"                    
## [7] "Native Hawaiian/Other Pacific Islander"

lmScore <- lm(readingScore ~ ., data=pisaTrain)
summary(lmScore)

## 
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -247.44  -48.86    1.86   49.77  217.18 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                   143.766333  33.841226
## grade                                          29.542707   2.937399
## male                                          -14.521653   3.155926
## raceethAmerican Indian/Alaska Native          -67.277327  16.786935
## raceethAsian                                   -4.110325   9.220071
## raceethBlack                                  -67.012347   5.460883
## raceethHispanic                               -38.975486   5.177743
## raceethMore than one race                     -16.922522   8.496268
## raceethNative Hawaiian/Other Pacific Islander  -5.101601  17.005696
## preschool                                      -4.463670   3.486055
## expectBachelors                                55.267080   4.293893
## motherHS                                        6.058774   6.091423
## motherBachelors                                12.638068   3.861457
## motherWork                                     -2.809101   3.521827
## fatherHS                                        4.018214   5.579269
## fatherBachelors                                16.929755   3.995253
## fatherWork                                      5.842798   4.395978
## selfBornUS                                     -3.806278   7.323718
## motherBornUS                                   -8.798153   6.587621
## fatherBornUS                                    4.306994   6.263875
## englishAtHome                                   8.035685   6.859492
## computerForSchoolwork                          22.500232   5.702562
## read30MinsADay                                 34.871924   3.408447
## minutesPerWeekEnglish                           0.012788   0.010712
## studentsInEnglish                              -0.286631   0.227819
## schoolHasLibrary                               12.215085   9.264884
## publicSchool                                  -16.857475   6.725614
## urban                                          -0.110132   3.962724
## schoolSize                                      0.006540   0.002197
##                                               t value Pr(>|t|)    
## (Intercept)                                     4.248 2.24e-05 ***
## grade                                          10.057  < 2e-16 ***
## male                                           -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native           -4.008 6.32e-05 ***
## raceethAsian                                   -0.446  0.65578    
## raceethBlack                                  -12.271  < 2e-16 ***
## raceethHispanic                                -7.528 7.29e-14 ***
## raceethMore than one race                      -1.992  0.04651 *  
## raceethNative Hawaiian/Other Pacific Islander  -0.300  0.76421    
## preschool                                      -1.280  0.20052    
## expectBachelors                                12.871  < 2e-16 ***
## motherHS                                        0.995  0.32001    
## motherBachelors                                 3.273  0.00108 ** 
## motherWork                                     -0.798  0.42517    
## fatherHS                                        0.720  0.47147    
## fatherBachelors                                 4.237 2.35e-05 ***
## fatherWork                                      1.329  0.18393    
## selfBornUS                                     -0.520  0.60331    
## motherBornUS                                   -1.336  0.18182    
## fatherBornUS                                    0.688  0.49178    
## englishAtHome                                   1.171  0.24153    
## computerForSchoolwork                           3.946 8.19e-05 ***
## read30MinsADay                                 10.231  < 2e-16 ***
## minutesPerWeekEnglish                           1.194  0.23264    
## studentsInEnglish                              -1.258  0.20846    
## schoolHasLibrary                                1.318  0.18749    
## publicSchool                                   -2.506  0.01226 *  
## urban                                          -0.028  0.97783    
## schoolSize                                      2.977  0.00294 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3172 
## F-statistic: 41.04 on 28 and 2385 DF,  p-value: < 2.2e-16

summary(lmScore)$r.squared

## [1] 0.3251434

Note that this R-squared is lower than the ones for the models we saw in the lectures and recitation. This does not necessarily imply that the model is of poor quality. More often than not, it simply means that the prediction problem at hand (predicting a student’s test score based on demographic and school-related variables) is more difficult than other prediction problems (like predicting a team’s number of wins from their runs scored and allowed, or predicting the quality of wine from weather conditions).

Problem 3.2 - Computing the root-mean squared error of the model

What is the training-set root-mean squared error (RMSE) of lmScore?

lmScore$residuals^2 %>% mean() %>% sqrt()

## [1] 73.36555

Problem 3.3 - Comparing predictions for similar students

Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?

summary(lmScore)

## 
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -247.44  -48.86    1.86   49.77  217.18 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                   143.766333  33.841226
## grade                                          29.542707   2.937399
## male                                          -14.521653   3.155926
## raceethAmerican Indian/Alaska Native          -67.277327  16.786935
## raceethAsian                                   -4.110325   9.220071
## raceethBlack                                  -67.012347   5.460883
## raceethHispanic                               -38.975486   5.177743
## raceethMore than one race                     -16.922522   8.496268
## raceethNative Hawaiian/Other Pacific Islander  -5.101601  17.005696
## preschool                                      -4.463670   3.486055
## expectBachelors                                55.267080   4.293893
## motherHS                                        6.058774   6.091423
## motherBachelors                                12.638068   3.861457
## motherWork                                     -2.809101   3.521827
## fatherHS                                        4.018214   5.579269
## fatherBachelors                                16.929755   3.995253
## fatherWork                                      5.842798   4.395978
## selfBornUS                                     -3.806278   7.323718
## motherBornUS                                   -8.798153   6.587621
## fatherBornUS                                    4.306994   6.263875
## englishAtHome                                   8.035685   6.859492
## computerForSchoolwork                          22.500232   5.702562
## read30MinsADay                                 34.871924   3.408447
## minutesPerWeekEnglish                           0.012788   0.010712
## studentsInEnglish                              -0.286631   0.227819
## schoolHasLibrary                               12.215085   9.264884
## publicSchool                                  -16.857475   6.725614
## urban                                          -0.110132   3.962724
## schoolSize                                      0.006540   0.002197
##                                               t value Pr(>|t|)    
## (Intercept)                                     4.248 2.24e-05 ***
## grade                                          10.057  < 2e-16 ***
## male                                           -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native           -4.008 6.32e-05 ***
## raceethAsian                                   -0.446  0.65578    
## raceethBlack                                  -12.271  < 2e-16 ***
## raceethHispanic                                -7.528 7.29e-14 ***
## raceethMore than one race                      -1.992  0.04651 *  
## raceethNative Hawaiian/Other Pacific Islander  -0.300  0.76421    
## preschool                                      -1.280  0.20052    
## expectBachelors                                12.871  < 2e-16 ***
## motherHS                                        0.995  0.32001    
## motherBachelors                                 3.273  0.00108 ** 
## motherWork                                     -0.798  0.42517    
## fatherHS                                        0.720  0.47147    
## fatherBachelors                                 4.237 2.35e-05 ***
## fatherWork                                      1.329  0.18393    
## selfBornUS                                     -0.520  0.60331    
## motherBornUS                                   -1.336  0.18182    
## fatherBornUS                                    0.688  0.49178    
## englishAtHome                                   1.171  0.24153    
## computerForSchoolwork                           3.946 8.19e-05 ***
## read30MinsADay                                 10.231  < 2e-16 ***
## minutesPerWeekEnglish                           1.194  0.23264    
## studentsInEnglish                              -1.258  0.20846    
## schoolHasLibrary                                1.318  0.18749    
## publicSchool                                   -2.506  0.01226 *  
## urban                                          -0.028  0.97783    
## schoolSize                                      2.977  0.00294 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3172 
## F-statistic: 41.04 on 28 and 2385 DF,  p-value: < 2.2e-16

gradeCoef <- lmScore$coefficients[2]

2 * gradeCoef

##    grade 
## 59.08541

Problem 3.4 - Interpreting model coefficients

What is the meaning of the coefficient associated with variable raceethAsian?

# Predicted difference in the reading score between an Asian student and a white student who is otherwise identical

Problem 3.5 - Identifying variables lacking statistical significance

Based on the significance codes, which variables are candidates for removal from the model? Select all that apply. (We’ll assume that the factor variable raceeth should only be removed if none of its levels are significant.)

# preschool, motherHS, motherWork, fatherHS, fatherWork, selfBornUS, motherBornUS
# fatherBornUS, englishAtHome, minutesPerWeekEnglish, studentsInEnglish
# schoolHasLibrary, urban

Problem 4.1 - Predicting on unseen data

Using the “predict” function and supplying the “newdata” argument, use the lmScore model to predict the reading scores of students in pisaTest. Call this vector of predictions “predTest”. Do not change the variables in the model (for example, do not remove variables that we found were not significant in the previous part of this problem). Use the summary function to describe the test set predictions.

What is the range between the maximum and minimum predicted reading score on the test set?

predTest <- predict(lmScore, newdata=pisaTest)
summary(predTest)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   353.2   482.0   524.0   516.7   555.7   637.7

max(predTest) - min(predTest)

## [1] 284.4683

Problem 4.2 - Test set SSE and RMSE

What is the sum of squared errors (SSE) of lmScore on the testing set?

What is the root-mean squared error (RMSE) of lmScore on the testing set?

# SSE <- sum((predTest - pisaTest$readingScore)^2)
SSE <- (predTest - pisaTest$readingScore)^2 %>% sum()
SSE

## [1] 5762082

RMSE <- (predTest - pisaTest$readingScore)^2 %>% mean() %>% sqrt()
RMSE

## [1] 76.29079

Problem 4.3 - Baseline prediction and test-set SSE

What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.

What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST).

baselineScore <- pisaTrain$readingScore %>% mean()
baselineScore

## [1] 517.9629

SST <- (pisaTest$readingScore - baselineScore)^2 %>% sum()
SST

## [1] 7802354

Problem 4.4 - Test-set R-squared

What is the test-set R-squared value of lmScore?

1 - SSE/SST

## [1] 0.2614944

Detecting Flu Epidemics via Search Engine Query Data

Flu epidemics constitute a major public health concern causing respiratory illnesses, hospitalizations, and deaths. According to the National Vital Statistics Reports published in October 2012, influenza ranked as the eighth leading cause of death in 2011 in the United States. Each year, 250,000 to 500,000 deaths are attributed to influenza related diseases throughout the world.

The U.S. Centers for Disease Control and Prevention (CDC) and the European Influenza Surveillance Scheme (EISS) detect influenza activity through virologic and clinical data, including Influenza-like Illness (ILI) physician visits. Reporting national and regional data, however, are published with a 1-2 week lag.

The Google Flu Trends project was initiated to see if faster reporting can be made possible by considering flu-related online search queries – data that is available almost immediately.

Problem 1.1 - Understanding the Data

We would like to estimate influenza-like illness (ILI) activity using Google web search logs. Fortunately, one can easily access this data online:

ILI Data - The CDC publishes on its website the official regional and state-level percentage of patient visits to healthcare providers for ILI purposes on a weekly basis.

Google Search Queries - Google Trends allows public retrieval of weekly counts for every query searched by users around the world. For each location, the counts are normalized by dividing the count for each query in a particular week by the total number of online search queries submitted in that location during the week. Then, the values are adjusted to be between 0 and 1.

The csv file FluTrain.csv aggregates this data from January 1, 2004 until December 31, 2011 as follows:

“Week” - The range of dates represented by this observation, in year/month/day format.

“ILI” - This column lists the percentage of ILI-related physician visits for the corresponding week.

“Queries” - This column lists the fraction of queries that are ILI-related for the corresponding week, adjusted to be between 0 and 1 (higher values correspond to more ILI-related search queries).

Before applying analytics tools on the training set, we first need to understand the data at hand. Load “FluTrain.csv” into a data frame called FluTrain. Looking at the time period 2004-2011, which week corresponds to the highest percentage of ILI-related physician visits? Select the day of the month corresponding to the start of this week.

fluTrain <- read.csv("FluTrain.csv")
fluTest <- read.csv("FluTest.csv")

attach(fluTrain)

# Week corresponds to the highest percentage of ILI-related physician visits
levels(factor(Week[which.max(ILI)]))

## [1] "2009-10-18 - 2009-10-24"

# Week that corresponds to the highest percentage of ILI-related query fraction
levels(factor(Week[which.max(Queries)]))

## [1] "2009-10-18 - 2009-10-24"

Problem 1.2 - Understanding the Data

Let us now understand the data at an aggregate level. Plot the histogram of the dependent variable, ILI. What best describes the distribution of values of ILI?

Answer:

Most of the ILI values are small, with a relatively small number of much larger values (in statistics, this sort of data is called “skew right”).

library(ggplot2)

str(fluTrain)

## 'data.frame':    417 obs. of  3 variables:
##  $ Week   : Factor w/ 417 levels "2004-01-04 - 2004-01-10",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ILI    : num  2.42 1.81 1.71 1.54 1.44 ...
##  $ Queries: num  0.238 0.22 0.226 0.238 0.224 ...

ggplot(data = fluTrain, aes(x = ILI)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Problem 1.3 - Understanding the Data

When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself – this prevents the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models. In this problem, we will predict the natural log of the ILI variable, which can be computed in R using the log() function.

Plot the natural logarithm of ILI versus Queries. What does the plot suggest?.

library(ggplot2)
ggplot(data = fluTrain, aes(x = log(ILI), y = Queries)) +
    geom_point()

Problem 2.1 - Linear Regression Model

Based on the plot we just made, it seems that a linear regression model could be a good modeling choice. Based on our understanding of the data from the previous subproblem, which model best describes our estimation problem?

Answer:

log(ILI) = intercept + coefficient x Queries, where the coefficient is positive

Problem 2.2 - Linear Regression Model

Let’s call the regression model from the previous problem (Problem 2.1) FluTrend1 and run it in R. Hint: to take the logarithm of a variable Var in a regression equation, you simply use log(Var) when specifying the formula to the lm() function.

What is the training set R-squared value for FluTrend1 model (the “Multiple R-squared”)?

Answer: 0.709

fluTrend1 <- lm(log(ILI) ~ Queries, data = fluTrain)

summary(fluTrend1)

## 
## Call:
## lm(formula = log(ILI) ~ Queries, data = fluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76003 -0.19696 -0.01657  0.18685  1.06450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.49934    0.03041  -16.42   <2e-16 ***
## Queries      2.96129    0.09312   31.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared:  0.709,  Adjusted R-squared:  0.7083 
## F-statistic:  1011 on 1 and 415 DF,  p-value: < 2.2e-16

Problem 2.3 - Linear Regression Model

For a single variable linear regression model, there is a direct relationship between the R-squared and the correlation between the independent and the dependent variables. What is the relationship we infer from our problem? (Don’t forget that you can use the cor function to compute the correlation between two variables.)

Answer: R-squared = Correlation^2

correlation <- cor(log(fluTrain$ILI), fluTrain$Queries)

correlation^2

## [1] 0.7090201

Problem 3.1 - Performance on the Test Set

The csv file FluTest.csv provides the 2012 weekly data of the ILI-related search queries and the observed weekly percentage of ILI-related physician visits. Load this data into a data frame called FluTest.

Normally, we would obtain test-set predictions from the model FluTrend1 using the code

PredTest1 = predict(FluTrend1, newdata=FluTest)

However, the dependent variable in our model is log(ILI), so PredTest1 would contain predictions of the log(ILI) value. We are instead interested in obtaining predictions of the ILI value. We can convert from predictions of log(ILI) to predictions of ILI via exponentiation, or the exp() function. The new code, which predicts the ILI value, is

PredTest1 = exp(predict(FluTrend1, newdata=FluTest))

What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012? (HINT: You can either just output FluTest$Week to find which element corresponds to March 11, 2012, or you can use the “which” function in R. To learn more about the which function, type ?which in your R console.)

fluTest <- read.csv("FluTest.csv")

PredTest1 = exp(predict(fluTrend1, newdata=fluTest))

est_mar11 <- PredTest1[which(fluTest$Week == "2012-03-11 - 2012-03-17")]

Problem 3.2 - Performance on the Test Set

What is the relative error betweeen the estimate (our prediction) and the observed value for the week of March 11, 2012? Note that the relative error is calculated as

(Observed ILI - Estimated ILI)/Observed ILI

obs_mar11 <- subset(fluTest, Week == "2012-03-11 - 2012-03-17")$ILI
obs_mar11

## [1] 2.293422

(obs_mar11 - est_mar11) / obs_mar11

##         11 
## 0.04623827

Problem 3.3 - Performance on the Test Set

What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?

(fluTest$ILI - PredTest1)^2 %>% mean() %>% sqrt()

## [1] 0.7490645

Problem 4.1 - Training a Time Series Model

The observations in this dataset are consecutive weekly measurements of the dependent and independent variables. This sort of dataset is called a “time series.” Often, statistical models can be improved by predicting the current value of the dependent variable using the value of the dependent variable from earlier weeks. In our models, this means we will predict the ILI variable in the current week using values of the ILI variable from previous weeks.

First, we need to decide the amount of time to lag the observations. Because the ILI variable is reported with a 1- or 2-week lag, a decision maker cannot rely on the previous week’s ILI value to predict the current week’s value. Instead, the decision maker will only have data available from 2 or more weeks ago. We will build a variable called ILILag2 that contains the ILI value from 2 weeks before the current observation.

To do so, we will use the “zoo” package, which provides a number of helpful methods for time series models. While many functions are built into R, you need to add new packages to use some functions. New packages can be installed and loaded easily in R, and we will do this many times in this class. Run the following two commands to install and load the zoo package. In the first command, you will be prompted to select a CRAN mirror to use for your download. Select a mirror near you geographically.

install.packages(“zoo”)

library(zoo)

After installing and loading the zoo package, run the following commands to create the ILILag2 variable in the training set:

ILILag2 = lag(zoo(FluTrain$ILI), -2, na.pad=TRUE)

FluTrain$ILILag2 = coredata(ILILag2)

In these commands, the value of -2 passed to lag means to return 2 observations before the current one; a positive value would have returned future observations. The parameter na.pad=TRUE means to add missing values for the first two weeks of our dataset, where we can’t compute the data from 2 weeks earlier.

How many values are missing in the new ILILag2 variable?

library(zoo)

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

ILILag2 <- stats::lag(zoo(fluTrain$ILI), -2, na.pad = TRUE)

ILILag2

##         1         2         3         4         5         6         7 
##        NA        NA 2.4183312 1.8090560 1.7120239 1.5424951 1.4378683 
##         8         9        10        11        12        13        14 
## 1.3242740 1.3072567 1.0369770 1.0103204 1.0524925 1.0200901 0.9244187 
##        15        16        17        18        19        20        21 
## 0.7906450 0.8026098 0.8361300 0.7924358 0.6835877 0.7574523 0.7885854 
##        22        23        24        25        26        27        28 
## 0.8121710 0.8044629 0.8777009 0.7414530 0.6610222 0.7151092 0.5622412 
##        29        30        31        32        33        34        35 
## 0.7868082 0.8606578 0.6899440 0.7796912 0.6281439 0.9024586 0.8064432 
##        36        37        38        39        40        41        42 
## 0.8748878 0.9932130 0.8761408 0.9480916 0.9269426 0.9716430 0.8971591 
##        43        44        45        46        47        48        49 
## 1.0224828 1.0629632 1.1469570 1.2049501 1.3051655 1.2869916 1.5946756 
##        50        51        52        53        54        55        56 
## 1.3971432 1.4499567 1.6174545 2.1911192 2.5664893 2.1764491 2.2017121 
##        57        58        59        60        61        62        63 
## 2.5301211 3.0652381 3.9806083 4.5956803 4.7519706 4.1796206 3.4535851 
##        64        65        66        67        68        69        70 
## 3.1585224 2.6732010 2.3516104 1.8924285 1.5249048 1.4113441 1.2506826 
##        71        72        73        74        75        76        77 
## 1.2070250 1.0789550 1.1452080 1.0612426 1.0567977 1.2519310 1.0141893 
##        78        79        80        81        82        83        84 
## 1.0419693 0.9540274 0.8482299 0.8418715 0.7308936 0.7134316 0.6706772 
##        85        86        87        88        89        90        91 
## 0.6892776 0.7049290 0.6159033 0.6094256 0.6802587 0.7754884 0.6834214 
##        92        93        94        95        96        97        98 
## 0.7810748 0.8069435 1.0763468 1.0586890 1.1152326 1.1238125 1.2548892 
##        99       100       101       102       103       104       105 
## 1.3366090 1.3786364 1.6082900 1.4831056 1.6537399 2.0067892 2.5685716 
##       106       107       108       109       110       111       112 
## 3.0527762 2.4250373 2.0019506 2.0586902 2.2127697 2.3222001 2.4927920 
##       113       114       115       116       117       118       119 
## 2.7948942 2.9691114 2.8395905 2.7779902 2.4728693 2.1806146 2.0167951 
##       120       121       122       123       124       125       126 
## 1.6410133 1.3582865 1.1427983 1.0403125 0.9643469 0.9379817 0.9474493 
##       127       128       129       130       131       132       133 
## 0.8919182 0.8646427 0.9703199 0.8443901 0.7748704 0.8213725 0.8727445 
##       134       135       136       137       138       139       140 
## 0.9226345 0.8994868 0.8430824 0.8818244 0.8171452 0.8715001 0.7386205 
##       141       142       143       144       145       146       147 
## 0.7979660 1.0139373 0.8809358 0.9433663 0.8915462 1.2032228 1.0578822 
##       148       149       150       151       152       153       154 
## 1.1305354 1.1255230 1.2080820 1.3495244 1.4689004 1.8276716 1.6656012 
##       155       156       157       158       159       160       161 
## 1.8596834 2.3889130 2.7897759 3.1154858 2.2694245 1.8635464 1.9998635 
##       162       163       164       165       166       167       168 
## 2.4406044 2.8301821 3.1234256 3.2701949 3.1775688 2.7236366 2.5020140 
##       169       170       171       172       173       174       175 
## 2.4271992 1.9604132 1.5913980 1.3697835 1.3631668 1.1736951 1.0635756 
##       176       177       178       179       180       181       182 
## 0.9697111 0.9653617 0.8567489 0.8633465 0.9353695 0.7455694 0.7404281 
##       183       184       185       186       187       188       189 
## 0.6728965 0.6662820 0.6627473 0.5456190 0.5862306 0.6606867 0.5340928 
##       190       191       192       193       194       195       196 
## 0.5855491 0.6180750 0.6874647 0.7156961 0.8293131 0.8009115 0.9184839 
##       197       198       199       200       201       202       203 
## 0.8142590 1.0719708 1.2178574 1.2457554 1.3598449 1.4467085 1.5328638 
##       204       205       206       207       208       209       210 
## 1.6665324 1.9748773 1.6730547 1.6340509 1.7459475 1.9364319 2.4890534 
##       211       212       213       214       215       216       217 
## 2.2540484 2.0914715 2.3593428 3.3233143 4.4338100 5.3454714 5.4225751 
##       218       219       220       221       222       223       224 
## 5.3030330 4.2445550 3.6280001 3.0346275 2.5359536 2.0573015 1.7415035 
##       225       226       227       228       229       230       231 
## 1.4065217 1.2686070 1.0771887 0.9934452 0.9112119 0.9721091 0.9932575 
##       232       233       234       235       236       237       238 
## 1.0913202 0.8884460 0.8876915 0.8831874 0.8267564 0.7832014 0.7806103 
##       239       240       241       242       243       244       245 
## 0.7690726 0.7212979 0.7525273 0.7527210 0.7927660 0.7438962 0.8141663 
##       246       247       248       249       250       251       252 
## 0.8384009 0.8511236 1.1097575 1.0311436 1.0228436 1.0301739 1.0124478 
##       253       254       255       256       257       258       259 
## 1.0835911 1.1657765 1.1912964 1.2807470 1.2705251 1.5957825 1.4584994 
##       260       261       262       263       264       265       266 
## 1.4992072 1.6298157 2.1556121 2.0205270 1.5456623 1.6422367 1.9652378 
##       267       268       269       270       271       272       273 
## 2.3436784 2.8605744 3.3421049 3.2056588 3.1004908 2.9581850 2.4638058 
##       274       275       276       277       278       279       280 
## 2.1927224 1.8739459 1.6481690 1.4987776 1.2923267 1.2716411 2.9815890 
##       281       282       283       284       285       286       287 
## 2.4370224 2.2813011 3.8157199 4.2131523 3.1783224 2.5097162 2.0663177 
##       288       289       290       291       292       293       294 
## 1.7180460 1.5596467 1.3085629 1.1869460 1.1379623 1.1500523 1.1126189 
##       295       296       297       298       299       300       301 
## 1.1614188 1.6410714 2.4716598 3.7196936 3.9497480 4.0875636 4.0189724 
##       302       303       304       305       306       307       308 
## 4.6036164 5.6608671 6.8152222 7.6188921 7.3883586 6.3392723 4.9434950 
##       309       310       311       312       313       314       315 
## 3.8099612 3.4410588 2.6677306 2.4718250 2.3449995 2.7143498 2.6766718 
##       316       317       318       319       320       321       322 
## 1.9828382 1.8274862 1.9260563 1.9249472 2.0887684 2.0343408 1.9764946 
##       323       324       325       326       327       328       329 
## 1.9936177 1.8538260 1.8673036 1.6998677 1.4974082 1.4511188 1.2071478 
##       330       331       332       333       334       335       336 
## 1.1741508 1.1620668 1.1721343 1.1216765 1.1498116 1.1332758 1.0817133 
##       337       338       339       340       341       342       343 
## 1.1995860 0.9528083 0.9160321 0.9265822 0.8696197 0.9031331 0.7737757 
##       344       345       346       347       348       349       350 
## 0.7427744 0.7309345 0.7868818 0.7630507 0.8410432 0.7915728 0.9127318 
##       351       352       353       354       355       356       357 
## 1.0339765 0.9340091 1.0818888 1.0656260 1.1350529 1.2525629 1.2456956 
##       358       359       360       361       362       363       364 
## 1.2677380 1.4372295 1.5334125 1.6944544 1.9915024 1.8130453 2.0142579 
##       365       366       367       368       369       370       371 
## 2.5565913 3.3818486 3.4317231 2.6915111 2.9106289 3.4923189 4.0036963 
##       372       373       374       375       376       377       378 
## 4.4353368 4.2421482 4.3971861 3.9025565 3.1507275 2.7242234 2.3333563 
##       379       380       381       382       383       384       385 
## 1.9250003 1.7524260 1.5770365 1.3576558 1.3122310 1.1493747 1.1145057 
##       386       387       388       389       390       391       392 
## 1.1098449 1.0524026 1.0353647 1.1177658 0.9829495 0.9251944 0.8355311 
##       393       394       395       396       397       398       399 
## 0.8323927 0.8555910 0.7069494 0.6943868 0.6879762 0.6447430 0.6753299 
##       400       401       402       403       404       405       406 
## 0.7282297 0.8065263 0.8604084 0.9360754 0.9666827 0.9960071 1.1084635 
##       407       408       409       410       411       412       413 
## 1.2030858 1.2369566 1.2525865 1.3054612 1.4528432 1.4408922 1.4622115 
##       414       415       416       417 
## 1.6554147 1.4657230 1.5181061 1.6639544

fluTrain$ILILag2 <- coredata(ILILag2)
sum(is.na(ILILag2))

## [1] 2

Problem 4.2 - Training a Time Series Model

Use the plot() function to plot the log of ILILag2 against the log of ILI. Which best describes the relationship between these two variables?

Answer:

There is a strong positive relationship between log(ILILag2) and log(ILI).

plot(log(fluTrain$ILILag2), log(fluTrain$ILI))

Problem 4.3 - Training a Time Series Model

Train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable. Call this model FluTrend2.

Which coefficients are significant at the p=0.05 level in this regression model? (Select all that apply.)

FluTrend2 = lm(log(ILI) ~ Queries + log(ILILag2), data=fluTrain)

summary(FluTrend2)

## 
## Call:
## lm(formula = log(ILI) ~ Queries + log(ILILag2), data = fluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52209 -0.11082 -0.01819  0.08143  0.76785 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.24064    0.01953  -12.32   <2e-16 ***
## Queries       1.25578    0.07910   15.88   <2e-16 ***
## log(ILILag2)  0.65569    0.02251   29.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1703 on 412 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.9063, Adjusted R-squared:  0.9059 
## F-statistic:  1993 on 2 and 412 DF,  p-value: < 2.2e-16

Problem 4.4 - Training a Time Series Model

On the basis of R-squared value and significance of coefficients, which statement is the most accurate?

Answer:

FluTrend2 is a stronger model than FluTrend1 on the training set.

Problem 5.1 - Evaluating the Time Series Model in the Test Set

So far, we have only added the ILILag2 variable to the FluTrain data frame. To make predictions with our FluTrend2 model, we will also need to add ILILag2 to the FluTest data frame (note that adding variables before splitting into a training and testing set can prevent this duplication of effort).

Modify the code from the previous subproblem to add an ILILag2 variable to the FluTest data frame. How many missing values are there in this new variable?

ILILag2 <- stats::lag(zoo(fluTest$ILI), -2, na.pad=TRUE)
fluTest$ILILag2 = coredata(ILILag2)

summary(fluTest)

##                       Week         ILI            Queries      
##  2012-01-01 - 2012-01-07: 1   Min.   :0.9018   Min.   :0.2390  
##  2012-01-08 - 2012-01-14: 1   1st Qu.:1.1535   1st Qu.:0.2772  
##  2012-01-15 - 2012-01-21: 1   Median :1.3592   Median :0.3924  
##  2012-01-22 - 2012-01-28: 1   Mean   :1.6638   Mean   :0.4094  
##  2012-01-29 - 2012-02-04: 1   3rd Qu.:1.8637   3rd Qu.:0.4874  
##  2012-02-05 - 2012-02-11: 1   Max.   :6.0336   Max.   :0.8054  
##  (Other)                :46                                    
##     ILILag2      
##  Min.   :0.9018  
##  1st Qu.:1.1359  
##  Median :1.3409  
##  Mean   :1.5188  
##  3rd Qu.:1.7606  
##  Max.   :3.6002  
##  NA's   :2

Problem 5.2 - Evaluating the Time Series Model in the Test Set

In this problem, the training and testing sets are split sequentially – the training set contains all observations from 2004-2011 and the testing set contains all observations from 2012. There is no time gap between the two datasets, meaning the first observation in FluTest was recorded one week after the last observation in FluTrain. From this, we can identify how to fill in the missing values for the ILILag2 variable in FluTest.

Which value should be used to fill in the ILILag2 variable for the first observation in FluTest?

Answer:

The ILI value of the second-to-last observation in the FluTrain data frame.

Problem 5.3 - Evaluating the Time Series Model in the Test Set

Fill in the missing values for ILILag2 in FluTest. In terms of syntax, you could set the value of ILILag2 in row “x” of the FluTest data frame to the value of ILI in row “y” of the FluTrain data frame with “FluTest$ILILag2[x] = FluTrain$ILI[y]”. Use the answer to the previous questions to determine the appropriate values of “x” and “y”. It may be helpful to check the total number of rows in FluTrain using str(FluTrain) or nrow(FluTrain).

What is the new value of the ILILag2 variable in the first row of FluTest?

fluTest$ILILag2[1] = fluTrain$ILI[416]

fluTest$ILILag2[2] = fluTrain$ILI[417]

Problem 5.4 - Evaluating the Time Series Model in the Test Set

Obtain test set predictions of the ILI variable from the FluTrend2 model, again remembering to call the exp() function on the result of the predict() function to obtain predictions for ILI instead of log(ILI).

What is the test-set RMSE of the FluTrend2 model?

library(dplyr)
PredTest2 <- exp(predict(FluTrend2, newdata=fluTest))

RMSE <- (PredTest2 - fluTest$ILI)^2 %>% mean() %>% sqrt()
RMSE

## [1] 0.2942029

Problem 5.5 - Evaluating the Time Series Model in the Test Set

Which model obtained the best test-set RMSE?

Ans:FluTrend2

EXPLANATION:The test-set RMSE of FluTrend2 is 0.294, as opposed to the 0.749 value obtained by the FluTrend1 model. In this problem, we used a simple time series model with a single lag term. ARIMA models are a more general form of the model we built, which can include multiple lag terms as well as more complicated combinations of previous values of the dependent variable.

Assignment2: Linear Regression

Sheena Yu

6/26/2017

Climate Change

Problem 1.1 - Creating Our First Model

Problem 1.2 - Creating Our First Model

Problem 2.1 - Understanding the Model

Problem 2.2 - Understanding the Model

Problem 3 - Simplifying the Model

Problem 4 - Automatically Building the Model

Problem 5 - Testing on Unseen Data

Reading Test Scores

Problem 1.1 - Dataset size

Problem 1.2 - Summarizing the dataset

Problem 1.3 - Locating missing values

Problem 1.4 - Removing missing values

Problem 2.1 - Factor variables

Problem 2.2 - Unordered factors in regression models

Problem 2.3 - Example unordered factors

Problem 3.1 - Building a model

Problem 3.2 - Computing the root-mean squared error of the model

Problem 3.3 - Comparing predictions for similar students

Problem 3.4 - Interpreting model coefficients

Problem 3.5 - Identifying variables lacking statistical significance

Problem 4.1 - Predicting on unseen data

Problem 4.2 - Test set SSE and RMSE

Problem 4.3 - Baseline prediction and test-set SSE

Problem 4.4 - Test-set R-squared

Detecting Flu Epidemics via Search Engine Query Data

Problem 1.1 - Understanding the Data

Problem 1.2 - Understanding the Data

Problem 1.3 - Understanding the Data

Problem 2.1 - Linear Regression Model

Problem 2.2 - Linear Regression Model

Problem 2.3 - Linear Regression Model

Problem 3.1 - Performance on the Test Set

Problem 3.2 - Performance on the Test Set

Problem 3.3 - Performance on the Test Set

Problem 4.1 - Training a Time Series Model

Problem 4.2 - Training a Time Series Model

Problem 4.3 - Training a Time Series Model

Problem 4.4 - Training a Time Series Model

Problem 5.1 - Evaluating the Time Series Model in the Test Set

Problem 5.2 - Evaluating the Time Series Model in the Test Set

Problem 5.3 - Evaluating the Time Series Model in the Test Set

Problem 5.4 - Evaluating the Time Series Model in the Test Set

Problem 5.5 - Evaluating the Time Series Model in the Test Set