Before applying analytics tools on the training set, we first need to understand the data at hand. Load “FluTrain.csv” into a data frame called FluTrain. Looking at the time period 2004-2011, which week corresponds to the highest percentage of ILI-related physician visits? Select the day of the month corresponding to the start of this week.
FluTrain= read.csv("Unit2/FluTrain.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
MAX= subset(FluTrain, ILI== max(ILI))
MAX
## Week ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892 1
#2009-10-18
Which week corresponds to the highest fraction of ILI-related queries?
#2009-10-18
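No code accompanies this answer. A minimal sketch of one way to check it uses which.max(), shown here on a small made-up data frame (the `toy` frame and its values are assumptions for illustration, not rows of FluTrain):

```r
# Made-up stand-in for FluTrain; the values are illustrative only.
toy <- data.frame(
  Week    = c("wk1", "wk2", "wk3", "wk4"),
  ILI     = c(1.2, 3.4, 2.1, 0.9),
  Queries = c(0.30, 1.00, 0.55, 0.20),
  stringsAsFactors = FALSE
)

# which.max() returns the index of the first maximum in a vector.
peak_ili     <- toy$Week[which.max(toy$ILI)]
peak_queries <- toy$Week[which.max(toy$Queries)]

print(peak_ili)
print(peak_queries)
```

On the real data the equivalent call would be FluTrain$Week[which.max(FluTrain$Queries)]. The subset() output above shows Queries equal to 1 for row 303, which is consistent with that row also holding the Queries maximum if Queries is scaled to at most 1 (an assumption about the data).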
Let us now understand the data at an aggregate level. Plot the histogram of the dependent variable, ILI. What best describes the distribution of values of ILI?
hist(FluTrain$ILI)
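A histogram is read by eye; a rough numeric companion is to compare the mean with the median, since a mean well above the median points to a long right tail. The sketch below uses made-up values (the vector `x` is an assumption, not FluTrain$ILI):

```r
# Made-up positive values with a few large ones in the tail.
x <- c(0.6, 0.8, 0.9, 1.0, 1.1, 1.3, 1.5, 2.0, 4.5, 7.5)

mean_x   <- mean(x)
median_x <- median(x)

# A mean noticeably larger than the median suggests right skew.
right_skewed <- mean_x > median_x

# Taking logs pulls in the long tail, so the mean-median gap shrinks.
gap_raw <- mean_x - median_x
gap_log <- mean(log(x)) - median(log(x))

print(c(mean = mean_x, median = median_x))
print(c(gap_raw = gap_raw, gap_log = gap_log))
```

The summary() of ILILag2 later in this section shows the same pattern for the real data: a mean of 1.6750 against a median of 1.2520, with a maximum of 7.6190.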
Plot the natural logarithm of ILI versus Queries. What does the plot suggest?
plot(FluTrain$Queries, log(FluTrain$ILI))
#There is a positive, linear relationship between log(ILI) and Queries.
Based on the plot we just made, it seems that a linear regression model could be a good modeling choice. Based on our understanding of the data from the previous subproblem, which model best describes our estimation problem?
#log(ILI) = intercept + coefficient x Queries, where the coefficient is positive
What is the training set R-squared value for the FluTrend1 model (the “Multiple R-squared”)? The model is stored as reg1 in the code below.
reg1= lm(log(ILI)~ Queries, FluTrain)
summary(reg1)
##
## Call:
## lm(formula = log(ILI) ~ Queries, data = FluTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76003 -0.19696 -0.01657 0.18685 1.06450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.49934 0.03041 -16.42 <2e-16 ***
## Queries 2.96129 0.09312 31.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared: 0.709, Adjusted R-squared: 0.7083
## F-statistic: 1011 on 1 and 415 DF, p-value: < 2.2e-16
#0.709
(cor(log(FluTrain$ILI), FluTrain$Queries))^2
## [1] 0.7090201
#R-squared = Correlation^2
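The identity noted above holds for a linear regression with an intercept and a single predictor; it does not carry over to models with several predictors. A small sketch on made-up data (the vectors `x` and `y` are assumptions for illustration):

```r
# Made-up data; any numeric x and y work for this identity.
x <- c(0.1, 0.2, 0.35, 0.5, 0.6, 0.8, 0.95)
y <- c(0.5, 0.9, 1.1, 1.8, 1.7, 2.6, 3.1)

fit <- lm(y ~ x)

# R-squared reported by the model versus the squared correlation.
r2_model <- summary(fit)$r.squared
r2_cor   <- cor(x, y)^2

print(c(r2_model = r2_model, r2_cor = r2_cor))
print(isTRUE(all.equal(r2_model, r2_cor)))
```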
What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012? (HINT: You can either just output FluTest$Week to find which element corresponds to March 11, 2012, or you can use the “which” function in R. To learn more about the which function, type ?which in your R console.)
FluTest = read.csv("Unit2/FluTest.csv")
PredTest1 = exp(predict(reg1, newdata=FluTest))
which(FluTest$Week == "2012-03-11 - 2012-03-17")
## [1] 11
PredTest1[11]
## 11
## 2.187378
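The exp() around predict() is needed because the model predicts log(ILI), not ILI. A minimal sketch of that back-transformation, on made-up data in which log(y) is exactly linear in x (the names `train`, `new_rows`, and the coefficients -0.5 and 3 are assumptions for illustration):

```r
# Made-up data in which log(y) is exactly linear in x.
x <- c(0.1, 0.3, 0.5, 0.7, 0.9)
y <- exp(-0.5 + 3 * x)
train <- data.frame(x = x, y = y)

# The response is modelled on the log scale.
fit <- lm(log(y) ~ x, data = train)

new_rows <- data.frame(x = c(0.2, 0.4))
pred_log <- predict(fit, newdata = new_rows)  # predictions of log(y)
pred_y   <- exp(pred_log)                     # back on the original scale

print(pred_y)
```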
What is the relative error between the estimate (our prediction) and the observed value for the week of March 11, 2012? Note that the relative error is calculated as
(Observed ILI - Estimated ILI)/Observed ILI
subset(FluTest, Week=="2012-03-11 - 2012-03-17")
## Week ILI Queries
## 11 2012-03-11 - 2012-03-17 2.293422 0.4329349
(2.293422- 2.187378)/ 2.293422
## [1] 0.04623833
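The same calculation can be wrapped in a small function so the formula is written once. This is a sketch; the helper name `relative_error` is hypothetical, and the two numbers are copied from the output above:

```r
# Relative error as defined above: (observed - estimated) / observed.
relative_error <- function(observed, estimated) {
  (observed - estimated) / observed
}

# Values copied from the output above for the week of 2012-03-11.
observed_ili  <- 2.293422
estimated_ili <- 2.187378

rel_err <- relative_error(observed_ili, estimated_ili)
print(rel_err)
```

With the data frames loaded, the inputs could instead be FluTest$ILI[11] and PredTest1[11], which avoids retyping rounded values.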
What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?
(FluTest$ILI - PredTest1)^2 %>% mean %>% sqrt
## [1] 0.7490645
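The piped expression above works because `^` binds more tightly than `%>%` in R, so the squaring happens before the result is passed to mean and sqrt. A sketch of the same calculation as a plain function, with made-up inputs (the helper name `rmse` and the two vectors are assumptions for illustration):

```r
# Root mean squared error: square the errors, average them, take the root.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up values: only the last prediction is off, by 2.
actual    <- c(1, 2, 3)
predicted <- c(1, 2, 5)

value <- rmse(actual, predicted)
print(value)
```

On the real data the call would be rmse(FluTest$ILI, PredTest1).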
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
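#A bare lag() resolves to dplyr::lag() here because dplyr is attached; stats::lag() with k = -2 makes the direction of the shift explicit.
#Each week receives the ILI value from two weeks earlier, so the first two entries are NA.
ILILag2 = stats::lag(zoo(FluTrain$ILI), -2, na.pad=TRUE)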
FluTrain$ILILag2 = coredata(ILILag2)
summary(FluTrain$ILILag2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.5341 0.9010 1.2520 1.6750 2.0580 7.6190 2
#2 (the first two weeks have no observation two weeks earlier)
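What the lag step does can be seen without any packages. The sketch below is a base R stand-in, not the zoo implementation; the helper name `lag_by` and the values in `ili` are assumptions for illustration:

```r
# Hypothetical helper: shift x later by k positions, padding with NA,
# so element i holds the value observed k steps earlier.
lag_by <- function(x, k) {
  n <- length(x)
  stopifnot(k >= 0, k <= n)
  c(rep(NA, k), x[seq_len(n - k)])
}

ili <- c(2.4, 2.1, 1.9, 1.7, 1.6)   # made-up weekly values
ili_lag2 <- lag_by(ili, 2)

print(ili_lag2)
print(sum(is.na(ili_lag2)))  # number of missing values introduced
```

The first two positions have nothing to look back to, which is why a lag of 2 always introduces exactly two missing values at the start.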
Use the plot() function to plot the log of ILILag2 against the log of ILI. Which best describes the relationship between these two variables?
plot(log(FluTrain$ILILag2), log(FluTrain$ILI))
#There is a strong positive relationship between log(ILILag2) and log(ILI).
Train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable. Call this model FluTrend2.
Which coefficients are significant at the p=0.05 level in this regression model? (Select all that apply.)
FluTrend2 = lm(log(ILI)~ Queries+ log(ILILag2), FluTrain)
summary(FluTrend2)
##
## Call:
## lm(formula = log(ILI) ~ Queries + log(ILILag2), data = FluTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52209 -0.11082 -0.01819 0.08143 0.76785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.24064 0.01953 -12.32 <2e-16 ***
## Queries 1.25578 0.07910 15.88 <2e-16 ***
## log(ILILag2) 0.65569 0.02251 29.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1703 on 412 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.9063, Adjusted R-squared: 0.9059
## F-statistic: 1993 on 2 and 412 DF, p-value: < 2.2e-16
#all
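Rather than reading the stars off the printed table, the p-values can be pulled from the coefficient matrix returned by summary(). A sketch on simulated data (the variables `x1`, `x2`, `y` and the seed are assumptions for illustration, not the flu data):

```r
set.seed(1)

# Simulated data: y depends on x1 but not on x2.
n  <- 50
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2)

# summary()$coefficients is a matrix with one row per term.
coefs    <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|t|)"]

# Names of the terms whose p-value falls below 0.05.
significant <- names(p_values)[p_values < 0.05]

print(round(p_values, 4))
print(significant)
```

Applied to the model above, the same lines with fit replaced by FluTrend2 would list the terms that pass the 0.05 threshold.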
What is the R^2 value of the FluTrend2 model?
#0.9063
On the basis of R-squared value and significance of coefficients, which statement is the most accurate?
#3
Modify the code from the previous subproblem to add an ILILag2 variable to the FluTest data frame. How many missing values are there in this new variable?
#stats::lag() with k = -2 gives each week the ILI value from two weeks earlier.
ILILag2= stats::lag(zoo(FluTest$ILI), -2, na.pad = TRUE)
FluTest$ILILag2= coredata(ILILag2)
is.na(FluTest$ILILag2)
## [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#2
Which value should be used to fill in the ILILag2 variable for the first observation in FluTest?
#The ILI value of the second-to-last observation in the FluTrain data frame.
Which value should be used to fill in the ILILag2 variable for the second observation in FluTest?
#The ILI value of the last observation in the FluTrain data frame.
What is the new value of the ILILag2 variable in the first row of FluTest?
FluTest$ILILag2[1]=FluTrain$ILI[416]
head(FluTest$ILILag2)
## [1] 1.852736 NA 1.766707 1.543401 1.647615 1.684297
#1.852736
What is the new value of the ILILag2 variable in the second row of FluTest?
FluTest$ILILag2[2]=FluTrain$ILI[417]
head(FluTest$ILILag2)
## [1] 1.852736 2.124130 1.766707 1.543401 1.647615 1.684297
#2.12413
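The two assignments above rely on the test set following the training set directly in time, so the missing lags are simply the last two training values. A sketch of the same idea on made-up vectors (`train_ili`, `test_ili`, and `k` are assumptions for illustration):

```r
# Made-up series: the test weeks directly follow the training weeks.
train_ili <- c(1.1, 1.3, 1.6, 1.9, 2.1)
test_ili  <- c(2.2, 2.0, 1.8, 1.5)

k <- 2

# Lagging within the test set alone leaves the first k entries missing.
test_lag <- c(rep(NA, k), test_ili[seq_len(length(test_ili) - k)])

# The missing entries are the last k training values, in order.
test_lag[seq_len(k)] <- tail(train_ili, k)

print(test_lag)
```

On the real data this corresponds to FluTest$ILILag2[1:2] = tail(FluTrain$ILI, 2), which avoids hard-coding the row numbers 416 and 417.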
What is the test-set RMSE of the FluTrend2 model?
L=predict(FluTrend2, newdata = FluTest) %>% exp
(L- FluTest$ILI)^2 %>% mean %>% sqrt
## [1] 0.2942029
Which model obtained the best test-set RMSE?
#FluTrend2
#The smaller the RMSE, the better.