---
title: "MIT_Analytics_Edge_Unit2_Homework"
output: html_document
---
https://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/climate_change.csv
The file climate_change.csv contains climate data from May 1983 to December 2008. The available variables include:
Year: the observation year.
Month: the observation month.
Temp: the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia.
CO2, N2O, CH4, CFC.11, CFC.12: atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division.
CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume).
Aerosols: the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun’s energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA.
TSI: the total solar irradiance (TSI) in W/m2 (the rate at which the sun’s energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website.
MEI: multivariate El Nino Southern Oscillation index (MEI), a measure of the strength of the El Nino/La Nina-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division.
Climate = read.csv("~/Dersler/MIT_15.071/Week2/hw1/climate_change.csv")
str(Climate)
## 'data.frame': 308 obs. of 11 variables:
## $ Year : int 1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
## $ Month : int 5 6 7 8 9 10 11 12 1 2 ...
## $ MEI : num 2.556 2.167 1.741 1.13 0.428 ...
## $ CO2 : num 346 346 344 342 340 ...
## $ CH4 : num 1639 1634 1633 1631 1648 ...
## $ N2O : num 304 304 304 304 304 ...
## $ CFC.11 : num 191 192 193 194 194 ...
## $ CFC.12 : num 350 352 354 356 357 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
## $ Temp : num 0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
summary(Climate)
## Year Month MEI CO2
## Min. :1983 Min. : 1.000 Min. :-1.6350 Min. :340.2
## 1st Qu.:1989 1st Qu.: 4.000 1st Qu.:-0.3987 1st Qu.:353.0
## Median :1996 Median : 7.000 Median : 0.2375 Median :361.7
## Mean :1996 Mean : 6.552 Mean : 0.2756 Mean :363.2
## 3rd Qu.:2002 3rd Qu.:10.000 3rd Qu.: 0.8305 3rd Qu.:373.5
## Max. :2008 Max. :12.000 Max. : 3.0010 Max. :388.5
## CH4 N2O CFC.11 CFC.12
## Min. :1630 Min. :303.7 Min. :191.3 Min. :350.1
## 1st Qu.:1722 1st Qu.:308.1 1st Qu.:246.3 1st Qu.:472.4
## Median :1764 Median :311.5 Median :258.3 Median :528.4
## Mean :1750 Mean :312.4 Mean :252.0 Mean :497.5
## 3rd Qu.:1787 3rd Qu.:317.0 3rd Qu.:267.0 3rd Qu.:540.5
## Max. :1814 Max. :322.2 Max. :271.5 Max. :543.8
## TSI Aerosols Temp
## Min. :1365 Min. :0.00160 Min. :-0.2820
## 1st Qu.:1366 1st Qu.:0.00280 1st Qu.: 0.1217
## Median :1366 Median :0.00575 Median : 0.2480
## Mean :1366 Mean :0.01666 Mean : 0.2568
## 3rd Qu.:1366 3rd Qu.:0.01260 3rd Qu.: 0.4073
## Max. :1367 Max. :0.14940 Max. : 0.7390
Climate_Training = subset(Climate, Year <2007)
Climate_Test = subset(Climate, Year>2006)
str(Climate_Training)
## 'data.frame': 284 obs. of 11 variables:
## $ Year : int 1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
## $ Month : int 5 6 7 8 9 10 11 12 1 2 ...
## $ MEI : num 2.556 2.167 1.741 1.13 0.428 ...
## $ CO2 : num 346 346 344 342 340 ...
## $ CH4 : num 1639 1634 1633 1631 1648 ...
## $ N2O : num 304 304 304 304 304 ...
## $ CFC.11 : num 191 192 193 194 194 ...
## $ CFC.12 : num 350 352 354 356 357 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
## $ Temp : num 0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
str(Climate_Test)
## 'data.frame': 24 obs. of 11 variables:
## $ Year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ Month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MEI : num 0.974 0.51 0.074 -0.049 0.183 ...
## $ CO2 : num 383 384 385 386 387 ...
## $ CH4 : num 1800 1803 1803 1802 1796 ...
## $ N2O : num 321 321 321 321 320 ...
## $ CFC.11 : num 248 248 248 248 247 ...
## $ CFC.12 : num 539 539 539 539 538 ...
## $ TSI : num 1366 1366 1366 1366 1366 ...
## $ Aerosols: num 0.0054 0.0051 0.0045 0.0045 0.0041 0.004 0.004 0.0041 0.0042 0.0041 ...
## $ Temp : num 0.601 0.498 0.435 0.466 0.372 0.382 0.394 0.358 0.402 0.362 ...
# Let's create a model to predict Temp using all of the other variables as independent variables (excluding Year and Month)
ClimateReg <- lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data = Climate_Training)
summary(ClimateReg)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 +
## TSI + Aerosols, data = Climate_Training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25888 -0.05913 -0.00082 0.05649 0.32433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.246e+02 1.989e+01 -6.265 1.43e-09 ***
## MEI 6.421e-02 6.470e-03 9.923 < 2e-16 ***
## CO2 6.457e-03 2.285e-03 2.826 0.00505 **
## CH4 1.240e-04 5.158e-04 0.240 0.81015
## N2O -1.653e-02 8.565e-03 -1.930 0.05467 .
## CFC.11 -6.631e-03 1.626e-03 -4.078 5.96e-05 ***
## CFC.12 3.808e-03 1.014e-03 3.757 0.00021 ***
## TSI 9.314e-02 1.475e-02 6.313 1.10e-09 ***
## Aerosols -1.538e+00 2.133e-01 -7.210 5.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7436
## F-statistic: 103.6 on 8 and 275 DF, p-value: < 2.2e-16
###Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative, indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.
# This is because all of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set.
cor(Climate_Training)
## Year Month MEI CO2 CH4
## Year 1.00000000 -0.0279419602 -0.0369876842 0.98274939 0.91565945
## Month -0.02794196 1.0000000000 0.0008846905 -0.10673246 0.01856866
## MEI -0.03698768 0.0008846905 1.0000000000 -0.04114717 -0.03341930
## CO2 0.98274939 -0.1067324607 -0.0411471651 1.00000000 0.87727963
## CH4 0.91565945 0.0185686624 -0.0334193014 0.87727963 1.00000000
## N2O 0.99384523 0.0136315303 -0.0508197755 0.97671982 0.89983864
## CFC.11 0.56910643 -0.0131112236 0.0690004387 0.51405975 0.77990402
## CFC.12 0.89701166 0.0006751102 0.0082855443 0.85268963 0.96361625
## TSI 0.17030201 -0.0346061935 -0.1544919227 0.17742893 0.24552844
## Aerosols -0.34524670 0.0148895406 0.3402377871 -0.35615480 -0.26780919
## Temp 0.78679714 -0.0998567411 0.1724707512 0.78852921 0.70325502
## N2O CFC.11 CFC.12 TSI Aerosols
## Year 0.99384523 0.56910643 0.8970116635 0.17030201 -0.34524670
## Month 0.01363153 -0.01311122 0.0006751102 -0.03460619 0.01488954
## MEI -0.05081978 0.06900044 0.0082855443 -0.15449192 0.34023779
## CO2 0.97671982 0.51405975 0.8526896272 0.17742893 -0.35615480
## CH4 0.89983864 0.77990402 0.9636162478 0.24552844 -0.26780919
## N2O 1.00000000 0.52247732 0.8679307757 0.19975668 -0.33705457
## CFC.11 0.52247732 1.00000000 0.8689851828 0.27204596 -0.04392120
## CFC.12 0.86793078 0.86898518 1.0000000000 0.25530281 -0.22513124
## TSI 0.19975668 0.27204596 0.2553028138 1.00000000 0.05211651
## Aerosols -0.33705457 -0.04392120 -0.2251312440 0.05211651 1.00000000
## Temp 0.77863893 0.40771029 0.6875575483 0.24338269 -0.38491375
## Temp
## Year 0.78679714
## Month -0.09985674
## MEI 0.17247075
## CO2 0.78852921
## CH4 0.70325502
## N2O 0.77863893
## CFC.11 0.40771029
## CFC.12 0.68755755
## TSI 0.24338269
## Aerosols -0.38491375
## Temp 1.00000000
cor(Climate_Training)*(cor(Climate_Training)>0.9)
## Year Month MEI CO2 CH4 N2O CFC.11
## Year 1.0000000 0 0 0.9827494 0.9156595 0.9938452 0
## Month 0.0000000 1 0 0.0000000 0.0000000 0.0000000 0
## MEI 0.0000000 0 1 0.0000000 0.0000000 0.0000000 0
## CO2 0.9827494 0 0 1.0000000 0.0000000 0.9767198 0
## CH4 0.9156595 0 0 0.0000000 1.0000000 0.0000000 0
## N2O 0.9938452 0 0 0.9767198 0.0000000 1.0000000 0
## CFC.11 0.0000000 0 0 0.0000000 0.0000000 0.0000000 1
## CFC.12 0.0000000 0 0 0.0000000 0.9636162 0.0000000 0
## TSI 0.0000000 0 0 0.0000000 0.0000000 0.0000000 0
## Aerosols 0.0000000 0 0 0.0000000 0.0000000 0.0000000 0
## Temp 0.0000000 0 0 0.0000000 0.0000000 0.0000000 0
## CFC.12 TSI Aerosols Temp
## Year 0.0000000 0 0 0
## Month 0.0000000 0 0 0
## MEI 0.0000000 0 0 0
## CO2 0.0000000 0 0 0
## CH4 0.9636162 0 0 0
## N2O 0.0000000 0 0 0
## CFC.11 0.0000000 0 0 0
## CFC.12 1.0000000 0 0 0
## TSI 0.0000000 1 0 0
## Aerosols 0.0000000 0 1 0
## Temp 0.0000000 0 0 1
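# Rather than scanning the full matrix by eye, the highly correlated pairs can also be
# pulled out programmatically (a small convenience sketch, using the same 0.9 cutoff):
CorMat <- cor(Climate_Training)
HighCor <- which(abs(CorMat) > 0.9 & upper.tri(CorMat), arr.ind = TRUE)
data.frame(Var1 = rownames(CorMat)[HighCor[, 1]],
           Var2 = colnames(CorMat)[HighCor[, 2]],
           Correlation = CorMat[HighCor])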
# Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.
ClimateReg2 <- lm(Temp ~ MEI + N2O + TSI + Aerosols, data = Climate_Training)
summary(ClimateReg2)
##
## Call:
## lm(formula = Temp ~ MEI + N2O + TSI + Aerosols, data = Climate_Training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27916 -0.05975 -0.00595 0.05672 0.34195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.162e+02 2.022e+01 -5.747 2.37e-08 ***
## MEI 6.419e-02 6.652e-03 9.649 < 2e-16 ***
## N2O 2.532e-02 1.311e-03 19.307 < 2e-16 ***
## TSI 7.949e-02 1.487e-02 5.344 1.89e-07 ***
## Aerosols -1.702e+00 2.180e-01 -7.806 1.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09547 on 279 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7222
## F-statistic: 184.9 on 4 and 279 DF, p-value: < 2.2e-16
#We have many variables in this problem, and as we have seen above, dropping some from the model does not decrease model quality. R provides a function, step, that will automate the procedure of trying different combinations of variables to find a good compromise of model simplicity and R2. This trade-off is formalized by the Akaike information criterion (AIC) - it can be informally thought of as the quality of the model with a penalty for the number of variables in the model.
#The step function has one argument - the name of the initial model. It returns a simplified model. Use the step function in R to derive a new model, with the full model as the initial model (HINT: If your initial full model was called "climateLM", you could create a new model with the step function by typing step(climateLM). Be sure to save your new model to a variable name so that you can look at the summary. For more information about the step function, type ?step in your R console.)
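# For a linear model, step() scores candidates with extractAIC(), which (up to an
# additive constant) computes n*log(RSS/n) + 2*k, where k is the number of estimated
# coefficients. As a quick sanity check, the following should reproduce the starting
# AIC of the full model shown in the step() output below:
n <- nrow(Climate_Training)
n * log(sum(residuals(ClimateReg)^2) / n) + 2 * length(coef(ClimateReg))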
ClimateReg3 <-step(ClimateReg)
## Start: AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## - CH4 1 0.00049 2.3135 -1350.1
## <none> 2.3130 -1348.2
## - N2O 1 0.03132 2.3443 -1346.3
## - CO2 1 0.06719 2.3802 -1342.0
## - CFC.12 1 0.11874 2.4318 -1335.9
## - CFC.11 1 0.13986 2.4529 -1333.5
## - TSI 1 0.33516 2.6482 -1311.7
## - Aerosols 1 0.43727 2.7503 -1301.0
## - MEI 1 0.82823 3.1412 -1263.2
##
## Step: AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## <none> 2.3135 -1350.1
## - N2O 1 0.03133 2.3448 -1348.3
## - CO2 1 0.06672 2.3802 -1344.0
## - CFC.12 1 0.13023 2.4437 -1336.5
## - CFC.11 1 0.13938 2.4529 -1335.5
## - TSI 1 0.33500 2.6485 -1313.7
## - Aerosols 1 0.43987 2.7534 -1302.7
## - MEI 1 0.83118 3.1447 -1264.9
summary(ClimateReg3)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI +
## Aerosols, data = Climate_Training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25770 -0.05994 -0.00104 0.05588 0.32203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.245e+02 1.985e+01 -6.273 1.37e-09 ***
## MEI 6.407e-02 6.434e-03 9.958 < 2e-16 ***
## CO2 6.402e-03 2.269e-03 2.821 0.005129 **
## N2O -1.602e-02 8.287e-03 -1.933 0.054234 .
## CFC.11 -6.609e-03 1.621e-03 -4.078 5.95e-05 ***
## CFC.12 3.868e-03 9.812e-04 3.942 0.000103 ***
## TSI 9.312e-02 1.473e-02 6.322 1.04e-09 ***
## Aerosols -1.540e+00 2.126e-01 -7.244 4.36e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09155 on 276 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7445
## F-statistic: 118.8 on 7 and 276 DF, p-value: < 2.2e-16
#It is interesting to note that the step function does not address the collinearity of the variables, except that adding highly correlated variables will not improve the R2 significantly. The consequence of this is that the step function will not necessarily produce a very interpretable model - just a model that has balanced quality and simplicity for a particular weighting of quality and simplicity (AIC).
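# A more direct collinearity diagnostic, if desired, is the variance inflation factor;
# assuming the "car" package is installed, it could be computed as a follow-up check:
# library(car)
# vif(ClimateReg3)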
# We have developed an understanding of how well we can fit a linear regression to the training data, but does the model quality hold when applied to unseen data?
# Using the model produced from the step function, calculate temperature predictions for the testing data set, using the predict function.
ClimatePrediction <- predict(ClimateReg3, newdata = Climate_Test)
SSE = sum((ClimatePrediction - Climate_Test$Temp)^2)
SST = sum((mean(Climate_Training$Temp) - Climate_Test$Temp)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.6286051
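# Since the same SSE/SST calculation reappears in the later problems, a small helper
# function (a convenience sketch, not part of the assignment) can bundle it up:
OSR2 <- function(predictions, test_actuals, train_actuals) {
  SSE <- sum((predictions - test_actuals)^2)
  SST <- sum((mean(train_actuals) - test_actuals)^2)
  1 - SSE/SST
}
OSR2(ClimatePrediction, Climate_Test$Temp, Climate_Training$Temp)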
https://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/pisa2009train.csv https://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/pisa2009test.csv
The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.
The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data you are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.
Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam. The datasets have the following variables:
grade: The grade in school of the student (most 15-year-olds in America are in 10th grade)
male: Whether the student is male (1/0)
raceeth: The race/ethnicity composite of the student
preschool: Whether the student attended preschool (1/0)
expectBachelors: Whether the student expects to obtain a bachelor’s degree (1/0)
motherHS: Whether the student’s mother completed high school (1/0)
motherBachelors: Whether the student’s mother obtained a bachelor’s degree (1/0)
motherWork: Whether the student’s mother has part-time or full-time work (1/0)
fatherHS: Whether the student’s father completed high school (1/0)
fatherBachelors: Whether the student’s father obtained a bachelor’s degree (1/0)
fatherWork: Whether the student’s father has part-time or full-time work (1/0)
selfBornUS: Whether the student was born in the United States of America (1/0)
motherBornUS: Whether the student’s mother was born in the United States of America (1/0)
fatherBornUS: Whether the student’s father was born in the United States of America (1/0)
englishAtHome: Whether the student speaks English at home (1/0)
computerForSchoolwork: Whether the student has access to a computer for schoolwork (1/0)
read30MinsADay: Whether the student reads for pleasure for 30 minutes/day (1/0)
minutesPerWeekEnglish: The number of minutes per week the student spends in English class
studentsInEnglish: The number of students in this student’s English class at school
schoolHasLibrary: Whether this student’s school has a library (1/0)
publicSchool: Whether this student attends a public school (1/0)
urban: Whether this student’s school is in an urban area (1/0)
schoolSize: The number of students in this student’s school
readingScore: The student’s reading score, on a 1000-point scale
pisaTrain <- read.csv("~/Dersler/MIT_15.071/Week2/hw2/pisa2009train.csv")
pisaTest <- read.csv("~/Dersler/MIT_15.071/Week2/hw2/pisa2009test.csv")
str(pisaTrain)
## 'data.frame': 3663 obs. of 24 variables:
## $ grade : int 11 11 9 10 10 10 10 10 9 10 ...
## $ male : int 1 1 1 0 1 1 0 0 0 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: NA 7 7 3 4 3 2 7 7 5 ...
## $ preschool : int NA 0 1 1 1 1 0 1 1 1 ...
## $ expectBachelors : int 0 0 1 1 0 1 1 1 0 1 ...
## $ motherHS : int NA 1 1 0 1 NA 1 1 1 1 ...
## $ motherBachelors : int NA 1 1 0 0 NA 0 0 NA 1 ...
## $ motherWork : int 1 1 1 1 1 1 1 0 1 1 ...
## $ fatherHS : int NA 1 1 1 1 1 NA 1 0 0 ...
## $ fatherBachelors : int NA 0 NA 0 0 0 NA 0 NA 0 ...
## $ fatherWork : int 1 1 1 1 0 1 NA 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 0 1 1 1 ...
## $ motherBornUS : int 0 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 0 1 1 1 0 1 NA 1 1 1 ...
## $ englishAtHome : int 0 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 1 1 1 1 1 ...
## $ read30MinsADay : int 0 1 0 1 1 0 0 1 0 0 ...
## $ minutesPerWeekEnglish: int 225 450 250 200 250 300 250 300 378 294 ...
## $ studentsInEnglish : int NA 25 28 23 35 20 28 30 20 24 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 0 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 1 0 0 1 1 0 1 0 1 0 ...
## $ schoolSize : int 673 1173 1233 2640 1095 227 2080 1913 502 899 ...
## $ readingScore : num 476 575 555 458 614 ...
str(pisaTest)
## 'data.frame': 1570 obs. of 24 variables:
## $ grade : int 10 10 10 10 10 10 10 10 11 10 ...
## $ male : int 0 1 0 0 0 0 0 0 0 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 7 7 7 7 7 1 7 7 4 ...
## $ preschool : int 1 0 1 1 1 0 1 1 0 1 ...
## $ expectBachelors : int 0 0 0 0 1 0 0 0 0 1 ...
## $ motherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBachelors : int 1 0 0 1 0 0 0 0 1 1 ...
## $ motherWork : int 1 1 1 1 0 1 0 1 1 1 ...
## $ fatherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBachelors : int 0 0 0 0 1 0 0 0 1 0 ...
## $ fatherWork : int 0 1 1 0 1 1 0 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 0 1 1 1 1 ...
## $ read30MinsADay : int 0 0 0 0 0 1 1 1 1 1 ...
## $ minutesPerWeekEnglish: int 240 255 NA 160 240 200 240 270 270 350 ...
## $ studentsInEnglish : int 30 NA 30 30 30 NA 30 35 30 25 ...
## $ schoolHasLibrary : int 1 1 1 NA 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolSize : int 808 808 808 808 808 808 808 808 808 899 ...
## $ readingScore : num 355 386 523 406 454 ...
#Using tapply() on pisaTrain, what is the average reading test score of males?
tapply(pisaTrain$readingScore, pisaTrain$male,mean)
## 0 1
## 512.9406 483.5325
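# The same breakdown can be cross-checked with aggregate():
aggregate(readingScore ~ male, data = pisaTrain, FUN = mean)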
summary(pisaTrain)
## grade male raceeth
## Min. : 8.00 Min. :0.0000 White :2015
## 1st Qu.:10.00 1st Qu.:0.0000 Hispanic : 834
## Median :10.00 Median :1.0000 Black : 444
## Mean :10.09 Mean :0.5111 Asian : 143
## 3rd Qu.:10.00 3rd Qu.:1.0000 More than one race: 124
## Max. :12.00 Max. :1.0000 (Other) : 68
## NA's : 35
## preschool expectBachelors motherHS motherBachelors
## Min. :0.0000 Min. :0.0000 Min. :0.00 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.00 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.00 Median :0.0000
## Mean :0.7228 Mean :0.7859 Mean :0.88 Mean :0.3481
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.00 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00 Max. :1.0000
## NA's :56 NA's :62 NA's :97 NA's :397
## motherWork fatherHS fatherBachelors fatherWork
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :1.0000
## Mean :0.7345 Mean :0.8593 Mean :0.3319 Mean :0.8531
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :93 NA's :245 NA's :569 NA's :233
## selfBornUS motherBornUS fatherBornUS englishAtHome
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.9313 Mean :0.7725 Mean :0.7668 Mean :0.8717
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :69 NA's :71 NA's :113 NA's :71
## computerForSchoolwork read30MinsADay minutesPerWeekEnglish
## Min. :0.0000 Min. :0.0000 Min. : 0.0
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 225.0
## Median :1.0000 Median :0.0000 Median : 250.0
## Mean :0.8994 Mean :0.2899 Mean : 266.2
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 300.0
## Max. :1.0000 Max. :1.0000 Max. :2400.0
## NA's :65 NA's :34 NA's :186
## studentsInEnglish schoolHasLibrary publicSchool urban
## Min. : 1.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:20.0 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :25.0 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :24.5 Mean :0.9676 Mean :0.9339 Mean :0.3849
## 3rd Qu.:30.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :75.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :249 NA's :143
## schoolSize readingScore
## Min. : 100 Min. :168.6
## 1st Qu.: 712 1st Qu.:431.7
## Median :1212 Median :499.7
## Mean :1369 Mean :497.9
## 3rd Qu.:1900 3rd Qu.:566.2
## Max. :6694 Max. :746.0
## NA's :162
#Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets. Later in the course, we will learn about imputation, which deals with missing data by filling in missing values with plausible information.
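# Before dropping rows, it can be useful to see how many values are missing in each
# column (a quick check, not required by the assignment):
sort(sapply(pisaTrain, function(x) sum(is.na(x))), decreasing = TRUE)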
pisaTrain = na.omit(pisaTrain)
pisaTest = na.omit(pisaTest)
str(pisaTest)
## 'data.frame': 990 obs. of 24 variables:
## $ grade : int 10 10 10 10 11 10 10 10 10 10 ...
## $ male : int 0 0 0 0 0 1 0 1 1 0 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 7 1 7 7 4 7 4 7 4 ...
## $ preschool : int 1 1 1 1 0 1 0 1 1 1 ...
## $ expectBachelors : int 0 1 0 0 0 1 1 0 1 1 ...
## $ motherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBachelors : int 1 0 0 0 1 1 0 0 1 0 ...
## $ motherWork : int 1 0 0 1 1 1 0 1 1 1 ...
## $ fatherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBachelors : int 0 1 0 0 1 0 0 0 1 1 ...
## $ fatherWork : int 0 1 0 1 1 1 1 0 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 1 1 1 1 1 ...
## $ read30MinsADay : int 0 0 1 1 1 1 0 0 0 1 ...
## $ minutesPerWeekEnglish: int 240 240 240 270 270 350 350 360 350 360 ...
## $ studentsInEnglish : int 30 30 30 35 30 25 27 28 25 27 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolSize : int 808 808 808 808 808 899 899 899 899 899 ...
## $ readingScore : num 355 454 405 665 605 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:580] 2 3 4 6 12 16 17 19 22 23 ...
## .. ..- attr(*, "names")= chr [1:580] "2" "3" "4" "6" ...
str(pisaTrain)
## 'data.frame': 2414 obs. of 24 variables:
## $ grade : int 11 10 10 10 10 10 10 10 11 9 ...
## $ male : int 1 0 1 0 1 0 0 0 1 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 3 4 7 5 4 7 4 7 7 ...
## $ preschool : int 0 1 1 1 1 1 1 1 1 1 ...
## $ expectBachelors : int 0 1 0 1 1 1 1 0 1 1 ...
## $ motherHS : int 1 0 1 1 1 1 1 0 1 1 ...
## $ motherBachelors : int 1 0 0 0 1 0 0 0 0 1 ...
## $ motherWork : int 1 1 1 0 1 1 1 0 0 1 ...
## $ fatherHS : int 1 1 1 1 0 1 1 0 1 1 ...
## $ fatherBachelors : int 0 0 0 0 0 0 1 0 1 1 ...
## $ fatherWork : int 1 1 0 1 1 0 1 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 0 1 0 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 0 1 0 1 1 ...
## $ fatherBornUS : int 1 1 0 1 1 0 1 0 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 0 1 0 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 0 1 1 1 1 ...
## $ read30MinsADay : int 1 1 1 1 0 1 1 1 0 0 ...
## $ minutesPerWeekEnglish: int 450 200 250 300 294 232 225 270 275 225 ...
## $ studentsInEnglish : int 25 23 35 30 24 14 20 25 30 15 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 0 ...
## $ urban : int 0 1 1 0 0 0 0 1 1 1 ...
## $ schoolSize : int 1173 2640 1095 1913 899 1733 149 1400 1988 915 ...
## $ readingScore : num 575 458 614 439 466 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1249] 1 3 6 7 9 11 13 21 29 30 ...
## .. ..- attr(*, "names")= chr [1:1249] "1" "3" "6" "7" ...
#Because the race variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv() -- you can see this when you run str(pisaTrain) or str(pisaTest). However, by default R selects the first level alphabetically ("American Indian/Alaska Native") as the reference level of our factor instead of the most common level ("White"). Set the reference level of the factor by typing the following two lines in your R console:
pisaTrain$raceeth = relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth = relevel(pisaTest$raceeth, "White")
#Now, build a linear regression model (call it lmScore) using the training set to predict readingScore using all the remaining variables.
# It would be time-consuming to type all the variables, but R provides the shorthand notation "readingScore ~ ." to mean "predict readingScore using all the other variables in the data frame." The period replaces listing out all of the independent variables. For example, if your dependent variable is called "Y", your independent variables are called "X1", "X2", and "X3", and your training data set is called "Train", then instead of the regular notation lm(Y ~ X1 + X2 + X3, data = Train) you would build the model with lm(Y ~ ., data = Train).
lmScore <- lm(readingScore ~.,data = pisaTrain)
summary(lmScore)
##
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.44 -48.86 1.86 49.77 217.18
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 143.766333 33.841226
## grade 29.542707 2.937399
## male -14.521653 3.155926
## raceethAmerican Indian/Alaska Native -67.277327 16.786935
## raceethAsian -4.110325 9.220071
## raceethBlack -67.012347 5.460883
## raceethHispanic -38.975486 5.177743
## raceethMore than one race -16.922522 8.496268
## raceethNative Hawaiian/Other Pacific Islander -5.101601 17.005696
## preschool -4.463670 3.486055
## expectBachelors 55.267080 4.293893
## motherHS 6.058774 6.091423
## motherBachelors 12.638068 3.861457
## motherWork -2.809101 3.521827
## fatherHS 4.018214 5.579269
## fatherBachelors 16.929755 3.995253
## fatherWork 5.842798 4.395978
## selfBornUS -3.806278 7.323718
## motherBornUS -8.798153 6.587621
## fatherBornUS 4.306994 6.263875
## englishAtHome 8.035685 6.859492
## computerForSchoolwork 22.500232 5.702562
## read30MinsADay 34.871924 3.408447
## minutesPerWeekEnglish 0.012788 0.010712
## studentsInEnglish -0.286631 0.227819
## schoolHasLibrary 12.215085 9.264884
## publicSchool -16.857475 6.725614
## urban -0.110132 3.962724
## schoolSize 0.006540 0.002197
## t value Pr(>|t|)
## (Intercept) 4.248 2.24e-05 ***
## grade 10.057 < 2e-16 ***
## male -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native -4.008 6.32e-05 ***
## raceethAsian -0.446 0.65578
## raceethBlack -12.271 < 2e-16 ***
## raceethHispanic -7.528 7.29e-14 ***
## raceethMore than one race -1.992 0.04651 *
## raceethNative Hawaiian/Other Pacific Islander -0.300 0.76421
## preschool -1.280 0.20052
## expectBachelors 12.871 < 2e-16 ***
## motherHS 0.995 0.32001
## motherBachelors 3.273 0.00108 **
## motherWork -0.798 0.42517
## fatherHS 0.720 0.47147
## fatherBachelors 4.237 2.35e-05 ***
## fatherWork 1.329 0.18393
## selfBornUS -0.520 0.60331
## motherBornUS -1.336 0.18182
## fatherBornUS 0.688 0.49178
## englishAtHome 1.171 0.24153
## computerForSchoolwork 3.946 8.19e-05 ***
## read30MinsADay 10.231 < 2e-16 ***
## minutesPerWeekEnglish 1.194 0.23264
## studentsInEnglish -1.258 0.20846
## schoolHasLibrary 1.318 0.18749
## publicSchool -2.506 0.01226 *
## urban -0.028 0.97783
## schoolSize 2.977 0.00294 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
## F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
#The training-set RMSE can be computed by first computing the SSE:
SSE = sum(lmScore$residuals^2)
#and then dividing by the number of observations and taking the square root:
RMSE = sqrt(SSE / nrow(pisaTrain))
#An alternative way of getting this answer would be with the following command:
sqrt(mean(lmScore$residuals^2))
## [1] 73.36555
#Using the "predict" function and supplying the "newdata" argument, use the lmScore model to predict the reading scores of students in pisaTest. Call this vector of predictions "predTest". Do not change the variables in the model (for example, do not remove variables that we found were not significant in the previous part of this problem). Use the summary function to describe the test set predictions.
#What is the range between the maximum and minimum predicted reading score on the test set?
predTest <- predict(lmScore, newdata = pisaTest)
summary(predTest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 353.2 482.0 524.0 516.7 555.7 637.7
#What is the sum of squared errors (SSE) of lmScore on the testing set?
SSE <- sum((predTest - pisaTest$readingScore)^2)
#What is the root-mean squared error (RMSE) of lmScore on the testing set?
sqrt(SSE/nrow(pisaTest))
## [1] 76.29079
# What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.
baseline = mean(pisaTrain$readingScore)
baseline
## [1] 517.9629
# What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST).
SST <- sum((mean(pisaTrain$readingScore) - pisaTest$readingScore)^2)
#What is the test-set R-squared value of lmScore?
R2 <- 1 - SSE/SST
R2
## [1] 0.2614944
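# For reference, the corresponding baseline RMSE on the test set (predicting the
# training-set mean for every student) can be computed from the SST above:
sqrt(SST / nrow(pisaTest))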
https://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/FluTrain.csv https://courses.edx.org/asset-v1:MITx+15.071x_2a+2T2015+type@asset+block/FluTest.csv
Flu epidemics constitute a major public health concern, causing respiratory illnesses, hospitalizations, and deaths. According to the National Vital Statistics Reports published in October 2012, influenza ranked as the eighth leading cause of death in 2011 in the United States. Each year, 250,000 to 500,000 deaths worldwide are attributed to influenza-related diseases.
The U.S. Centers for Disease Control and Prevention (CDC) and the European Influenza Surveillance Scheme (EISS) detect influenza activity through virologic and clinical data, including Influenza-like Illness (ILI) physician visits. However, national and regional data are published with a 1-2 week reporting lag.
The Google Flu Trends project was initiated to see if faster reporting can be made possible by considering flu-related online search queries – data that is available almost immediately.
We would like to estimate influenza-like illness (ILI) activity using Google web search logs. Fortunately, one can easily access this data online:
ILI Data - The CDC publishes on its website the official regional and state-level percentage of patient visits to healthcare providers for ILI purposes on a weekly basis.
Google Search Queries - Google Trends allows public retrieval of weekly counts for every query searched by users around the world. For each location, the counts are normalized by dividing the count for each query in a particular week by the total number of online search queries submitted in that location during the week. Then, the values are adjusted to be between 0 and 1.
The csv file FluTrain.csv aggregates this data from January 1, 2004 until December 31, 2011 as follows:
“Week” - The range of dates represented by this observation, in year/month/day format.
“ILI” - This column lists the percentage of ILI-related physician visits for the corresponding week.
“Queries” - This column lists the fraction of queries that are ILI-related for the corresponding week, adjusted to be between 0 and 1 (higher values correspond to more ILI-related search queries).
FluTrain <- read.csv("~/Dersler/MIT_15.071/Week2/hw3/FluTrain.csv")
FluTest <- read.csv("~/Dersler/MIT_15.071/Week2/hw3/FluTest.csv")
#Looking at the time period 2004-2011, which week corresponds to the highest percentage of ILI-related physician visits?
FluTrain[which.max(FluTrain$ILI),]
## Week ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892 1
#Which week corresponds to the highest percentage of ILI-related query fraction?
FluTrain[which.max(FluTrain$Queries),]
## Week ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892 1
#Let us now understand the data at an aggregate level. Plot the histogram of the dependent variable, ILI. What best describes the distribution of values of ILI?
hist(FluTrain$ILI)
#When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself -- this prevents the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models. In this problem, we will predict the natural log of the ILI variable, which can be computed in R using the log() function.
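# As a quick visual check (not asked for in the problem), the histogram of log(ILI)
# should look far less skewed than the one above:
hist(log(FluTrain$ILI))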
#Plot the natural logarithm of ILI versus Queries. What does the plot suggest?
plot(log(FluTrain$ILI),FluTrain$Queries)
cor(log(FluTrain$ILI),FluTrain$Queries)
## [1] 0.8420333
#Based on the plot we just made, it seems that a linear regression model could be a good modeling choice. Based on our understanding of the data from the previous subproblem, which model best describes our estimation problem?
#log(ILI) = intercept + coefficient x Queries, where the coefficient is positive
FluTrend1 <-lm(log(ILI) ~ Queries, data = FluTrain)
summary(FluTrend1)
##
## Call:
## lm(formula = log(ILI) ~ Queries, data = FluTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76003 -0.19696 -0.01657 0.18685 1.06450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.49934 0.03041 -16.42 <2e-16 ***
## Queries 2.96129 0.09312 31.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared: 0.709, Adjusted R-squared: 0.7083
## F-statistic: 1011 on 1 and 415 DF, p-value: < 2.2e-16
# For a single variable linear regression model, there is a direct relationship between the R-squared and the correlation between the independent and the dependent variables. What is the relationship we infer from our problem?
# ANS: R-squared = Correlation^2
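# This can be verified numerically: squaring the correlation computed earlier
# (0.8420333) gives roughly 0.709, matching the Multiple R-squared reported above.
cor(log(FluTrain$ILI), FluTrain$Queries)^2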
# PROBLEM 3.1 - PERFORMANCE ON THE TEST SET
#The csv file FluTest.csv provides the 2012 weekly data of the ILI-related search queries and the observed weekly percentage of ILI-related physician visits. Load this data into a data frame called FluTest.
#Normally, we would obtain test-set predictions from the model FluTrend1 using the code
#PredTest1 = predict(FluTrend1, newdata=FluTest)
#However, the dependent variable in our model is log(ILI), so PredTest1 would contain predictions of the log(ILI) value. We are instead interested in obtaining predictions of the ILI value. We can convert from predictions of log(ILI) to predictions of ILI via exponentiation, or the exp() function. The new code, which predicts the ILI value, is
PredTest1 = exp(predict(FluTrend1, newdata=FluTest))
#What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012? (HINT: You can either just output FluTest$Week to find which element corresponds to March 11, 2012, or you can use the "which" function in R. To learn more about the which function, type ?which in your R console.)
PredTest1[which(FluTest$Week == "2012-03-11 - 2012-03-17")]
## 11
## 2.187378
#What is the relative error between the estimate (our prediction) and the observed value for the week of March 11, 2012? Note that the relative error is calculated as (Observed ILI - Estimated ILI)/Observed ILI
(FluTest$ILI[which(FluTest$Week == "2012-03-11 - 2012-03-17")] - PredTest1[which(FluTest$Week == "2012-03-11 - 2012-03-17")])/(FluTest$ILI[which(FluTest$Week == "2012-03-11 - 2012-03-17")])
## 11
## 0.04623827
# What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?
SSE = sum((PredTest1-FluTest$ILI)^2)
RMSE = sqrt(SSE / nrow(FluTest))
RMSE
## [1] 0.7490645
### PROBLEM 4.1 - TRAINING A TIME SERIES MODEL
#The observations in this dataset are consecutive weekly measurements of the dependent and independent variables. This sort of dataset is called a "time series." Often, statistical models can be improved by predicting the current value of the dependent variable using the value of the dependent variable from earlier weeks. In our models, this means we will predict the ILI variable in the current week using values of the ILI variable from previous weeks.
#First, we need to decide the amount of time to lag the observations. Because the ILI variable is reported with a 1- or 2-week lag, a decision maker cannot rely on the previous week's ILI value to predict the current week's value. Instead, the decision maker will only have data available from 2 or more weeks ago. We will build a variable called ILILag2 that contains the ILI value from 2 weeks before the current observation.
#To do so, we will use the "zoo" package, which provides a number of helpful methods for time series models. While many functions are built into R, you need to add new packages to use some functions. New packages can be installed and loaded easily in R, and we will do this many times in this class. Run the following two commands to install and load the zoo package. In the first command, you will be prompted to select a CRAN mirror to use for your download. Select a mirror near you geographically.
#install.packages("zoo")
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
#After installing and loading the zoo package, run the following commands to create the ILILag2 variable in the training set:
ILILag2 = lag(zoo(FluTrain$ILI), -2, na.pad=TRUE)
FluTrain$ILILag2 = coredata(ILILag2)
#In these commands, the value of -2 passed to lag means to return 2 observations before the current one; a positive value would have returned future observations. The parameter na.pad=TRUE means to add missing values for the first two weeks of our dataset, where we can't compute the data from 2 weeks earlier.
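# For reference, the same lag-2 variable could be built in base R without zoo
# (a sketch, equivalent to the zoo-based approach above):
# FluTrain$ILILag2 = c(NA, NA, head(FluTrain$ILI, -2))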
#How many values are missing in the new ILILag2 variable?
summary(ILILag2)
## Index ILILag2
## Min. : 1 Min. :0.5341
## 1st Qu.:105 1st Qu.:0.9010
## Median :209 Median :1.2519
## Mean :209 Mean :1.6754
## 3rd Qu.:313 3rd Qu.:2.0580
## Max. :417 Max. :7.6189
## NA's :2
#Use the plot() function to plot the log of ILILag2 against the log of ILI. Which best describes the relationship between these two variables?
plot(log(FluTrain$ILILag2), log(FluTrain$ILI))
#Train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable. Call this model FluTrend2.
# Which coefficients are significant at the p=0.05 level in this regression model? (Select all that apply.)
FluTrend2 <- lm(log(ILI) ~ Queries + log(ILILag2), data = FluTrain)
summary(FluTrend2)
##
## Call:
## lm(formula = log(ILI) ~ Queries + log(ILILag2), data = FluTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52209 -0.11082 -0.01819 0.08143 0.76785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.24064 0.01953 -12.32 <2e-16 ***
## Queries 1.25578 0.07910 15.88 <2e-16 ***
## log(ILILag2) 0.65569 0.02251 29.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1703 on 412 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.9063, Adjusted R-squared: 0.9059
## F-statistic: 1993 on 2 and 412 DF, p-value: < 2.2e-16
#So far, we have only added the ILILag2 variable to the FluTrain data frame. To make predictions with our FluTrend2 model, we will also need to add ILILag2 to the FluTest data frame (note that adding variables before splitting into a training and testing set can prevent this duplication of effort).
#Modify the code from the previous subproblem to add an ILILag2 variable to the FluTest data frame. How many missing values are there in this new variable?
ILILag2_test = lag(zoo(FluTest$ILI), -2, na.pad=TRUE)
FluTest$ILILag2 <- coredata(ILILag2_test)
head(FluTest)
## Week ILI Queries ILILag2
## 1 2012-01-01 - 2012-01-07 1.766707 0.5936255 NA
## 2 2012-01-08 - 2012-01-14 1.543401 0.4993360 NA
## 3 2012-01-15 - 2012-01-21 1.647615 0.5006640 1.766707
## 4 2012-01-22 - 2012-01-28 1.684297 0.4794157 1.543401
## 5 2012-01-29 - 2012-02-04 1.863542 0.4714475 1.647615
## 6 2012-02-05 - 2012-02-11 1.864079 0.5033201 1.684297
str(FluTest)
## 'data.frame': 52 obs. of 4 variables:
## $ Week : Factor w/ 52 levels "2012-01-01 - 2012-01-07",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ ILI : num 1.77 1.54 1.65 1.68 1.86 ...
## $ Queries: num 0.594 0.499 0.501 0.479 0.471 ...
## $ ILILag2: num NA NA 1.77 1.54 1.65 ...
summary(FluTest)
## Week ILI Queries
## 2012-01-01 - 2012-01-07: 1 Min. :0.9018 Min. :0.2390
## 2012-01-08 - 2012-01-14: 1 1st Qu.:1.1535 1st Qu.:0.2772
## 2012-01-15 - 2012-01-21: 1 Median :1.3592 Median :0.3924
## 2012-01-22 - 2012-01-28: 1 Mean :1.6638 Mean :0.4094
## 2012-01-29 - 2012-02-04: 1 3rd Qu.:1.8637 3rd Qu.:0.4874
## 2012-02-05 - 2012-02-11: 1 Max. :6.0336 Max. :0.8054
## (Other) :46
## ILILag2
## Min. :0.9018
## 1st Qu.:1.1359
## Median :1.3409
## Mean :1.5188
## 3rd Qu.:1.7606
## Max. :3.6002
## NA's :2
#In this problem, the training and testing sets are split sequentially -- the training set contains all observations from 2004-2011 and the testing set contains all observations from 2012. There is no time gap between the two datasets, meaning the first observation in FluTest was recorded one week after the last observation in FluTrain. From this, we can identify how to fill in the missing values for the ILILag2 variable in FluTest.
#Which value should be used to fill in the ILILag2 variable for the first observation in FluTest?
#ANS: The ILI value of the second-to-last observation in the FluTrain data frame.
# Which value should be used to fill in the ILILag2 variable for the second observation in FluTest?
# The ILI value of the last observation in the FluTrain data frame
#Fill in the missing values for ILILag2 in FluTest. In terms of syntax, you could set the value of ILILag2 in row "x" of the FluTest data frame to the value of ILI in row "y" of the FluTrain data frame with "FluTest$ILILag2[x] = FluTrain$ILI[y]". Use the answer to the previous questions to determine the appropriate values of "x" and "y". It may be helpful to check the total number of rows in FluTrain using str(FluTrain) or nrow(FluTrain).
head(FluTest)
## Week ILI Queries ILILag2
## 1 2012-01-01 - 2012-01-07 1.766707 0.5936255 NA
## 2 2012-01-08 - 2012-01-14 1.543401 0.4993360 NA
## 3 2012-01-15 - 2012-01-21 1.647615 0.5006640 1.766707
## 4 2012-01-22 - 2012-01-28 1.684297 0.4794157 1.543401
## 5 2012-01-29 - 2012-02-04 1.863542 0.4714475 1.647615
## 6 2012-02-05 - 2012-02-11 1.864079 0.5033201 1.684297
FluTest$ILILag2[1] = FluTrain$ILI[416]
FluTest$ILILag2[2] = FluTrain$ILI[417]
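# Equivalently, without hard-coding the row numbers (FluTrain has 417 rows):
# FluTest$ILILag2[1] = FluTrain$ILI[nrow(FluTrain) - 1]
# FluTest$ILILag2[2] = FluTrain$ILI[nrow(FluTrain)]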
head(FluTest)
## Week ILI Queries ILILag2
## 1 2012-01-01 - 2012-01-07 1.766707 0.5936255 1.852736
## 2 2012-01-08 - 2012-01-14 1.543401 0.4993360 2.124130
## 3 2012-01-15 - 2012-01-21 1.647615 0.5006640 1.766707
## 4 2012-01-22 - 2012-01-28 1.684297 0.4794157 1.543401
## 5 2012-01-29 - 2012-02-04 1.863542 0.4714475 1.647615
## 6 2012-02-05 - 2012-02-11 1.864079 0.5033201 1.684297
#Obtain test set predictions of the ILI variable from the FluTrend2 model, again remembering to call the exp() function on the result of the predict() function to obtain predictions for ILI instead of log(ILI).
# What is the test-set RMSE of the FluTrend2 model?
PredTest2 = exp(predict(FluTrend2, newdata=FluTest))
# What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?
SSE = sum((PredTest2-FluTest$ILI)^2)
RMSE = sqrt(SSE / nrow(FluTest))
RMSE
## [1] 0.2942029
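# For comparison, FluTrend1's test-set RMSE was 0.749, so adding the two-week lag of ILI
# reduces the out-of-sample error by more than half.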