setwd('/home/daria/Courses/R/Edx/Linear Regression')
elantra <- read.csv('elantra.csv')
elantra$Month_factor <- as.factor(elantra$Month)
train <- elantra[elantra$Year <2013,]
test <- elantra[elantra$Year > 2012,]
# create a model to predict Elantra Sales
model1 <- lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all, data = train)
summary(model1)
##
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy +
## CPI_all, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6785.2 -2101.8 -562.5 2901.7 7021.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95385.36 170663.81 0.559 0.580
## Unemployment -3179.90 3610.26 -0.881 0.385
## Queries 19.03 11.26 1.690 0.101
## CPI_energy 38.51 109.60 0.351 0.728
## CPI_all -297.65 704.84 -0.422 0.676
##
## Residual standard error: 3295 on 31 degrees of freedom
## Multiple R-squared: 0.4282, Adjusted R-squared: 0.3544
## F-statistic: 5.803 on 4 and 31 DF, p-value: 0.00132
For an increase of 1 in Unemployment, the prediction of Elantra sales decreases by approximately 3000.
# add month as independent variable to the model
model2 <- lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + Month, data = train)
summary(model2)
##
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy +
## CPI_all + Month, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6416.6 -2068.7 -597.1 2616.3 7183.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 148330.49 195373.51 0.759 0.4536
## Unemployment -4137.28 4008.56 -1.032 0.3103
## Queries 21.19 11.98 1.769 0.0871 .
## CPI_energy 54.18 114.08 0.475 0.6382
## CPI_all -517.99 808.26 -0.641 0.5265
## Month 110.69 191.66 0.578 0.5679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3331 on 30 degrees of freedom
## Multiple R-squared: 0.4344, Adjusted R-squared: 0.3402
## F-statistic: 4.609 on 5 and 30 DF, p-value: 0.003078
The model is not better because the adjusted R-squared has gone down and none of the variables (including the new one) are very significant.
The adjusted R-Squared is the R-Squared but adjusted to take into account the number of variables. If the adjusted R-Squared is lower, then this indicates that our model is not better and in fact may be worse. Furthermore, if none of the variables have become significant, then this also indicates that the model is not better.
EXPLANATION
The coefficient for Month is 110.69 (look at the summary output of the model). For the first question, January is coded numerically as 1, while March is coded numerically as 3; the difference in the prediction is therefore
110.69 * (3 - 1) = 110.69 * 2 = 221.38
For the second question, May is numerically coded as 5, while January is 1, so the difference in predicted sales is
110.69 * (5 - 1) = 110.69 * 4 = 442.76
You may be experiencing an uneasy feeling that there is something not quite right in how we have modeled the effect of the calendar month on the monthly sales of Elantras. If so, you are right. In particular, we added Month as a variable, but Month is an ordinary numeric variable. In fact, we must convert Month to a factor variable before adding it to the model.
By modeling Month as a factor variable, the effect of each calendar month is not restricted to be linear in the numerical coding of the month.
#Re-run the regression with the Month variable modeled as a factor variable. (Create a new variable that models the Month as a factor (using the as.factor function) instead of overwriting the current Month variable. We'll still use the numeric version of Month later in the problem.)
# new month factor variable is added at the top to 'elantra' variable
# re-run model with month as factor
model3 <- lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + Month_factor, data = train)
summary(model3)
##
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy +
## CPI_all + Month_factor, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3865.1 -1211.7 -77.1 1207.5 3562.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 312509.280 144061.867 2.169 0.042288 *
## Unemployment -7739.381 2968.747 -2.607 0.016871 *
## Queries -4.764 12.938 -0.368 0.716598
## CPI_energy 288.631 97.974 2.946 0.007988 **
## CPI_all -1343.307 592.919 -2.266 0.034732 *
## Month_factor2 2254.998 1943.249 1.160 0.259540
## Month_factor3 6696.557 1991.635 3.362 0.003099 **
## Month_factor4 7556.607 2038.022 3.708 0.001392 **
## Month_factor5 7420.249 1950.139 3.805 0.001110 **
## Month_factor6 9215.833 1995.230 4.619 0.000166 ***
## Month_factor7 9929.464 2238.800 4.435 0.000254 ***
## Month_factor8 7939.447 2064.629 3.845 0.001010 **
## Month_factor9 5013.287 2010.745 2.493 0.021542 *
## Month_factor10 2500.184 2084.057 1.200 0.244286
## Month_factor11 3238.932 2397.231 1.351 0.191747
## Month_factor12 5293.911 2228.310 2.376 0.027621 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2306 on 20 degrees of freedom
## Multiple R-squared: 0.8193, Adjusted R-squared: 0.6837
## F-statistic: 6.044 on 15 and 20 DF, p-value: 0.0001469
Another peculiar observation about the regression is that the sign of the Queries variable has changed. In particular, when we naively modeled Month as a numeric variable, Queries had a positive coefficient. Now, Queries has a negative coefficient. Furthermore, CPI_energy has a positive coefficient – as the overall price of energy increases, we expect Elantra sales to increase, which seems counter-intuitive (if the price of energy increases, we’d expect consumers to have less funds to purchase automobiles, leading to lower Elantra sales).
As we have seen before, changes in coefficient signs and signs that are counter to our intuition may be due to a multicolinearity problem. To check, compute the correlations of the variables in the training set.
Which of the following variables is CPI_energy highly correlated with? Select all that apply. (Include only variables where the absolute value of the correlation exceeds 0.6. For the purpose of this question, treat Month as a numeric variable, not a factor variable.)
cor(elantra[c("Unemployment","Month","Queries","CPI_energy","CPI_all")])
## Unemployment Month Queries CPI_energy CPI_all
## Unemployment 1.00000000 -0.06647839 -0.34940849 -0.70149386 -0.9635829
## Month -0.06647839 1.00000000 0.03697702 0.09709211 0.1148929
## Queries -0.34940849 0.03697702 1.00000000 0.76001600 0.5080560
## CPI_energy -0.70149386 0.09709211 0.76001600 1.00000000 0.8409909
## CPI_all -0.96358293 0.11489290 0.50805596 0.84099093 1.0000000
Let us now simplify our model (the model using the factor version of the Month variable). We will do this by iteratively removing variables, one at a time. Remove the variable with the highest p-value (i.e., the least statistically significant variable) from the model. Repeat this until there are no variables that are insignificant or variables for which all of the factor levels are insignificant. Use a threshold of 0.10 to determine whether a variable is significant.
model4 <- lm(ElantraSales ~ Unemployment + CPI_energy + CPI_all + Month_factor, data = train)
summary(model4)
##
## Call:
## lm(formula = ElantraSales ~ Unemployment + CPI_energy + CPI_all +
## Month_factor, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3866.0 -1283.3 -107.2 1098.3 3650.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 325709.15 136627.85 2.384 0.026644 *
## Unemployment -7971.34 2840.79 -2.806 0.010586 *
## CPI_energy 268.03 78.75 3.403 0.002676 **
## CPI_all -1377.58 573.39 -2.403 0.025610 *
## Month_factor2 2410.91 1857.10 1.298 0.208292
## Month_factor3 6880.09 1888.15 3.644 0.001517 **
## Month_factor4 7697.36 1960.21 3.927 0.000774 ***
## Month_factor5 7444.64 1908.48 3.901 0.000823 ***
## Month_factor6 9223.13 1953.64 4.721 0.000116 ***
## Month_factor7 9602.72 2012.66 4.771 0.000103 ***
## Month_factor8 7919.50 2020.99 3.919 0.000789 ***
## Month_factor9 5074.29 1962.23 2.586 0.017237 *
## Month_factor10 2724.24 1951.78 1.396 0.177366
## Month_factor11 3665.08 2055.66 1.783 0.089062 .
## Month_factor12 5643.19 1974.36 2.858 0.009413 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2258 on 21 degrees of freedom
## Multiple R-squared: 0.818, Adjusted R-squared: 0.6967
## F-statistic: 6.744 on 14 and 21 DF, p-value: 5.73e-05
The variable with the highest p-value is “Queries”. After removing it and looking at the model summary again, we can see that there are no variables that are insignificant, at the 0.10 p-level. Note that Month has a few values that are insignificant, but we don’t want to remove it because many values are very significant.
Using the model from Problem 6.1, make predictions on the test set. What is the sum of squared errors of the model on the test set?
predictions <- predict(model4, newdata = test)
# Compute out-of-sample R^2
SSE = sum((predictions - test$ElantraSales)^2)
SSE
## [1] 190757747
COMPARING TO A BASELINE
What would the baseline method predict for all observations in the test set? Remember that the baseline method we use predicts the average outcome of all observations in the training set.
mean(train$ElantraSales)
## [1] 14462.25
SST = sum((mean(train$ElantraSales) - test$ElantraSales)^2)
SST
## [1] 701375142
R2 = 1 - SSE/SST
R2
## [1] 0.7280232
# Compute the RMSE
RMSE = sqrt(SSE/nrow(test))
RMSE
## [1] 3691.281
What is the largest absolute error that we make in our test set predictions?
# You can get this answer by using the max function and the abs function:
max(abs(predictions - test$ElantraSales))
## [1] 7491.488
In which period (Month,Year pair) do we make the largest absolute error in our prediction?
which.max(abs(predictions - test$ElantraSales))
## 14
## 5
test[5,]
## Month Year ElantraSales Unemployment Queries CPI_energy CPI_all
## 14 3 2013 26153 7.5 313 244.598 232.075
## Month_factor
## 14 3