setwd('/home/daria/Courses/R/Edx/Linear Regression')

elantra <- read.csv('elantra.csv')
elantra$Month_factor <- as.factor(elantra$Month)

train <- elantra[elantra$Year <2013,]
test <- elantra[elantra$Year > 2012,]


# create a model to predict Elantra Sales
model1 <- lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all, data = train)
summary(model1)

## 
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy + 
##     CPI_all, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6785.2 -2101.8  -562.5  2901.7  7021.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   95385.36  170663.81   0.559    0.580
## Unemployment  -3179.90    3610.26  -0.881    0.385
## Queries          19.03      11.26   1.690    0.101
## CPI_energy       38.51     109.60   0.351    0.728
## CPI_all        -297.65     704.84  -0.422    0.676
## 
## Residual standard error: 3295 on 31 degrees of freedom
## Multiple R-squared:  0.4282, Adjusted R-squared:  0.3544 
## F-statistic: 5.803 on 4 and 31 DF,  p-value: 0.00132

For an increase of 1 in Unemployment, the prediction of Elantra sales decreases by approximately 3000.

# add month as independent variable to the model
model2 <- lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + Month, data = train)
summary(model2)

## 
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy + 
##     CPI_all + Month, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6416.6 -2068.7  -597.1  2616.3  7183.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  148330.49  195373.51   0.759   0.4536  
## Unemployment  -4137.28    4008.56  -1.032   0.3103  
## Queries          21.19      11.98   1.769   0.0871 .
## CPI_energy       54.18     114.08   0.475   0.6382  
## CPI_all        -517.99     808.26  -0.641   0.5265  
## Month           110.69     191.66   0.578   0.5679  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3331 on 30 degrees of freedom
## Multiple R-squared:  0.4344, Adjusted R-squared:  0.3402 
## F-statistic: 4.609 on 5 and 30 DF,  p-value: 0.003078

The model is not better because the adjusted R-squared has gone down and none of the variables (including the new one) are very significant.

The adjusted R-Squared is the R-Squared but adjusted to take into account the number of variables. If the adjusted R-Squared is lower, then this indicates that our model is not better and in fact may be worse. Furthermore, if none of the variables have become significant, then this also indicates that the model is not better.

EXPLANATION

The coefficient for Month is 110.69 (look at the summary output of the model). For the first question, January is coded numerically as 1, while March is coded numerically as 3; the difference in the prediction is therefore

110.69 * (3 - 1) = 110.69 * 2 = 221.38

For the second question, May is numerically coded as 5, while January is 1, so the difference in predicted sales is

110.69 * (5 - 1) = 110.69 * 4 = 442.76

You may be experiencing an uneasy feeling that there is something not quite right in how we have modeled the effect of the calendar month on the monthly sales of Elantras. If so, you are right. In particular, we added Month as a variable, but Month is an ordinary numeric variable. In fact, we must convert Month to a factor variable before adding it to the model.

By modeling Month as a factor variable, the effect of each calendar month is not restricted to be linear in the numerical coding of the month.

#Re-run the regression with the Month variable modeled as a factor variable. (Create a new variable that models the Month as a factor (using the as.factor function) instead of overwriting the current Month variable. We'll still use the numeric version of Month later in the problem.)

# new month factor variable is added at the top to 'elantra' variable

# re-run model with month as factor
model3 <- lm(ElantraSales ~ Unemployment + Queries + CPI_energy + CPI_all + Month_factor, data = train)
summary(model3)

## 
## Call:
## lm(formula = ElantraSales ~ Unemployment + Queries + CPI_energy + 
##     CPI_all + Month_factor, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3865.1 -1211.7   -77.1  1207.5  3562.2 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    312509.280 144061.867   2.169 0.042288 *  
## Unemployment    -7739.381   2968.747  -2.607 0.016871 *  
## Queries            -4.764     12.938  -0.368 0.716598    
## CPI_energy        288.631     97.974   2.946 0.007988 ** 
## CPI_all         -1343.307    592.919  -2.266 0.034732 *  
## Month_factor2    2254.998   1943.249   1.160 0.259540    
## Month_factor3    6696.557   1991.635   3.362 0.003099 ** 
## Month_factor4    7556.607   2038.022   3.708 0.001392 ** 
## Month_factor5    7420.249   1950.139   3.805 0.001110 ** 
## Month_factor6    9215.833   1995.230   4.619 0.000166 ***
## Month_factor7    9929.464   2238.800   4.435 0.000254 ***
## Month_factor8    7939.447   2064.629   3.845 0.001010 ** 
## Month_factor9    5013.287   2010.745   2.493 0.021542 *  
## Month_factor10   2500.184   2084.057   1.200 0.244286    
## Month_factor11   3238.932   2397.231   1.351 0.191747    
## Month_factor12   5293.911   2228.310   2.376 0.027621 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2306 on 20 degrees of freedom
## Multiple R-squared:  0.8193, Adjusted R-squared:  0.6837 
## F-statistic: 6.044 on 15 and 20 DF,  p-value: 0.0001469

MULTICOLINEARITY

Another peculiar observation about the regression is that the sign of the Queries variable has changed. In particular, when we naively modeled Month as a numeric variable, Queries had a positive coefficient. Now, Queries has a negative coefficient. Furthermore, CPI_energy has a positive coefficient – as the overall price of energy increases, we expect Elantra sales to increase, which seems counter-intuitive (if the price of energy increases, we’d expect consumers to have less funds to purchase automobiles, leading to lower Elantra sales).

As we have seen before, changes in coefficient signs and signs that are counter to our intuition may be due to a multicolinearity problem. To check, compute the correlations of the variables in the training set.

Which of the following variables is CPI_energy highly correlated with? Select all that apply. (Include only variables where the absolute value of the correlation exceeds 0.6. For the purpose of this question, treat Month as a numeric variable, not a factor variable.)

cor(elantra[c("Unemployment","Month","Queries","CPI_energy","CPI_all")])

##              Unemployment       Month     Queries  CPI_energy    CPI_all
## Unemployment   1.00000000 -0.06647839 -0.34940849 -0.70149386 -0.9635829
## Month         -0.06647839  1.00000000  0.03697702  0.09709211  0.1148929
## Queries       -0.34940849  0.03697702  1.00000000  0.76001600  0.5080560
## CPI_energy    -0.70149386  0.09709211  0.76001600  1.00000000  0.8409909
## CPI_all       -0.96358293  0.11489290  0.50805596  0.84099093  1.0000000

A REDUCED MODEL

Let us now simplify our model (the model using the factor version of the Month variable). We will do this by iteratively removing variables, one at a time. Remove the variable with the highest p-value (i.e., the least statistically significant variable) from the model. Repeat this until there are no variables that are insignificant or variables for which all of the factor levels are insignificant. Use a threshold of 0.10 to determine whether a variable is significant.

model4 <- lm(ElantraSales ~ Unemployment + CPI_energy + CPI_all + Month_factor, data = train)
summary(model4)

## 
## Call:
## lm(formula = ElantraSales ~ Unemployment + CPI_energy + CPI_all + 
##     Month_factor, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3866.0 -1283.3  -107.2  1098.3  3650.1 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    325709.15  136627.85   2.384 0.026644 *  
## Unemployment    -7971.34    2840.79  -2.806 0.010586 *  
## CPI_energy        268.03      78.75   3.403 0.002676 ** 
## CPI_all         -1377.58     573.39  -2.403 0.025610 *  
## Month_factor2    2410.91    1857.10   1.298 0.208292    
## Month_factor3    6880.09    1888.15   3.644 0.001517 ** 
## Month_factor4    7697.36    1960.21   3.927 0.000774 ***
## Month_factor5    7444.64    1908.48   3.901 0.000823 ***
## Month_factor6    9223.13    1953.64   4.721 0.000116 ***
## Month_factor7    9602.72    2012.66   4.771 0.000103 ***
## Month_factor8    7919.50    2020.99   3.919 0.000789 ***
## Month_factor9    5074.29    1962.23   2.586 0.017237 *  
## Month_factor10   2724.24    1951.78   1.396 0.177366    
## Month_factor11   3665.08    2055.66   1.783 0.089062 .  
## Month_factor12   5643.19    1974.36   2.858 0.009413 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2258 on 21 degrees of freedom
## Multiple R-squared:  0.818,  Adjusted R-squared:  0.6967 
## F-statistic: 6.744 on 14 and 21 DF,  p-value: 5.73e-05

The variable with the highest p-value is “Queries”. After removing it and looking at the model summary again, we can see that there are no variables that are insignificant, at the 0.10 p-level. Note that Month has a few values that are insignificant, but we don’t want to remove it because many values are very significant.

TEST SET PREDICTIONS

Using the model from Problem 6.1, make predictions on the test set. What is the sum of squared errors of the model on the test set?

predictions <- predict(model4, newdata = test)

# Compute out-of-sample R^2
SSE = sum((predictions - test$ElantraSales)^2)
SSE

## [1] 190757747

COMPARING TO A BASELINE
What would the baseline method predict for all observations in the test set? Remember that the baseline method we use predicts the average outcome of all observations in the training set.

mean(train$ElantraSales)

## [1] 14462.25

SST = sum((mean(train$ElantraSales) - test$ElantraSales)^2)
SST

## [1] 701375142

R2 = 1 - SSE/SST
R2

## [1] 0.7280232

# Compute the RMSE
RMSE = sqrt(SSE/nrow(test))
RMSE

## [1] 3691.281

ABSOLUTE ERRORS

What is the largest absolute error that we make in our test set predictions?

# You can get this answer by using the max function and the abs function:

max(abs(predictions - test$ElantraSales))

## [1] 7491.488

MONTH OF LARGEST ERROR

In which period (Month,Year pair) do we make the largest absolute error in our prediction?

which.max(abs(predictions - test$ElantraSales))

## 14 
##  5

test[5,]

##    Month Year ElantraSales Unemployment Queries CPI_energy CPI_all
## 14     3 2013        26153          7.5     313    244.598 232.075
##    Month_factor
## 14            3