The KNN classifier is used when the dependent variable is categorical, while KNN regression is used when the dependent variable is continuous (numeric).
For each observation, the classifier estimates the conditional probability of each class as the fraction of the K nearest neighbors that belong to that class, and assigns the class with the highest probability as the estimate. The regression model uses the mean response of the K nearest neighbors instead.
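The difference between the two can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `knn_predict` is a hypothetical helper, the toy data are made up, and distances are plain Euclidean.

```r
# Minimal base-R sketch of KNN prediction for a single query point.
# `knn_predict` is a hypothetical helper written for illustration only.
knn_predict <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nn <- order(d)[seq_len(k)]  # indices of the K nearest neighbors
  if (is.factor(train_y)) {
    # Classification: class proportions among the neighbors estimate the
    # conditional probabilities; predict the most probable class.
    probs <- table(train_y[nn]) / k
    names(probs)[which.max(probs)]
  } else {
    # Regression: average the neighbors' responses.
    mean(train_y[nn])
  }
}

# Made-up toy data: two well-separated clusters
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
labels <- factor(c("a", "a", "a", "b", "b", "b"))
values <- c(1, 2, 3, 10, 11, 12)

class_pred <- knn_predict(train_x, labels, query = c(0.2, 0.2), k = 3)
reg_pred   <- knn_predict(train_x, values, query = c(0.2, 0.2), k = 3)
```

The same neighbors are used in both calls; only the way their responses are summarized changes.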
library(ISLR)
pairs(Auto)
cor(Auto[ ,1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
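# Work on a copy and label origin as a factor so lm() builds dummy variables
auto_data <- Auto
auto_data$origin <- factor(auto_data$origin, labels = c("American", "European", "Japanese"))
# Regress mpg on every variable except name
lm_mpg <- lm(mpg ~ . - name, data = auto_data)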
summary(lm_mpg)
##
## Call:
## lm(formula = mpg ~ . - name, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## originEuropean 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## originJapanese 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
The p-value of the F-statistic is very low (< 2.2e-16), so we reject the null hypothesis that all coefficients are zero and conclude that at least one predictor has a statistically significant relationship with mpg.
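As a rough sketch of where that p-value comes from, the F-statistic can be pulled out of a fitted model and also rebuilt from the sums of squares. The snippet uses `mtcars`, which ships with R, as a stand-in data set (an assumption made so it runs without the ISLR package); the variable names are illustrative.

```r
# Stand-in data: mtcars ships with R (assumption; the text uses ISLR's Auto).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# summary()$fstatistic holds the F value and its two degrees of freedom.
f <- summary(fit)$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# The same F statistic rebuilt from the total and residual sums of squares.
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
p <- length(coef(fit)) - 1  # number of predictors
n <- nrow(mtcars)
f_manual <- ((tss - rss) / p) / (rss / (n - p - 1))
```

A small `p_value` here plays the same role as the `< 2.2e-16` reported by `summary()`.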
At the 5% level, all predictors are significant in explaining the variation in mpg except cylinders, horsepower and acceleration. The variable name was excluded from the model.
lm_mpg$coefficients
## (Intercept) cylinders displacement horsepower weight
## -17.954602067 -0.489709424 0.023978644 -0.018183464 -0.006710384
## acceleration year originEuropean originJapanese
## 0.079103036 0.777026939 2.630002360 2.853228228
Since the coefficient of year is positive (0.777), mpg increases with the model year of the car, at a rate of about 0.78 mpg per additional year, holding all the other variables constant. In other words, the newer the car, the better its fuel efficiency.
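The "holding the other variables constant" reading can be checked numerically: two predictions that differ by one unit in a single predictor differ by exactly that predictor's coefficient. A minimal sketch, again assuming `mtcars` as a stand-in and made-up input values:

```r
# Stand-in model on mtcars (assumption; not the Auto model from the text).
fit <- lm(mpg ~ wt + hp, data = mtcars)

base_car <- data.frame(wt = 3, hp = 110)
heavier  <- data.frame(wt = 4, hp = 110)  # one unit more in wt, hp unchanged

# Difference between the two predictions
diff_pred <- predict(fit, heavier) - predict(fit, base_car)

# Coefficient of wt, for comparison
coef(fit)["wt"]
```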
par(mfrow=c(2,2))
plot(lm_mpg)
The Q-Q plot shows a long right tail, which indicates that the residuals are skewed right: there are some outliers whose mpg is much higher than the model predicts. More precisely, observations 323 and 394 stand out in the diagnostic plots as outliers and possibly influential points.
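Reading points off the plots can be backed up with numbers. The sketch below separates outliers (unusual response, large studentized residual) from high-leverage points (unusual predictor values). It assumes `mtcars` as a stand-in, and the cut-offs (3 for studentized residuals, twice the average leverage) are common rules of thumb rather than fixed rules.

```r
# Stand-in model on mtcars (assumption; not the Auto model from the text).
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit)) - 1

stud <- rstudent(fit)        # studentized residuals: flag outliers
lev  <- hatvalues(fit)       # leverage: flag unusual predictor values
cook <- cooks.distance(fit)  # combines both into an influence measure

# Rule-of-thumb cut-offs (assumptions, not hard limits)
outliers      <- names(stud)[abs(stud) > 3]
high_leverage <- names(lev)[lev > 2 * (p + 1) / n]

# Three most influential observations by Cook's distance
diag_tab <- data.frame(stud, lev, cook)
head(diag_tab[order(-diag_tab$cook), ], 3)
```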
interaction_lm <- lm(formula = mpg ~ . * ., data = auto_data[, -9])
summary(interaction_lm)
##
## Call:
## lm(formula = mpg ~ . * ., data = auto_data[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6008 -1.2863 0.0813 1.2082 12.0382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.401e+01 5.147e+01 0.855 0.393048
## cylinders 3.302e+00 8.187e+00 0.403 0.686976
## displacement -3.529e-01 1.974e-01 -1.788 0.074638 .
## horsepower 5.312e-01 3.390e-01 1.567 0.117970
## weight -3.259e-03 1.820e-02 -0.179 0.857980
## acceleration -6.048e+00 2.147e+00 -2.818 0.005109 **
## year 4.833e-01 5.923e-01 0.816 0.415119
## originEuropean -3.517e+01 1.260e+01 -2.790 0.005547 **
## originJapanese -3.765e+01 1.426e+01 -2.640 0.008661 **
## cylinders:displacement -6.316e-03 7.106e-03 -0.889 0.374707
## cylinders:horsepower 1.452e-02 2.457e-02 0.591 0.555109
## cylinders:weight 5.703e-04 9.044e-04 0.631 0.528709
## cylinders:acceleration 3.658e-01 1.671e-01 2.189 0.029261 *
## cylinders:year -1.447e-01 9.652e-02 -1.499 0.134846
## cylinders:originEuropean -7.210e-01 1.088e+00 -0.662 0.508100
## cylinders:originJapanese 1.226e+00 1.007e+00 1.217 0.224379
## displacement:horsepower -5.407e-05 2.861e-04 -0.189 0.850212
## displacement:weight 2.659e-05 1.455e-05 1.828 0.068435 .
## displacement:acceleration -2.547e-03 3.356e-03 -0.759 0.448415
## displacement:year 4.547e-03 2.446e-03 1.859 0.063842 .
## displacement:originEuropean -3.364e-02 4.220e-02 -0.797 0.425902
## displacement:originJapanese 5.375e-02 4.145e-02 1.297 0.195527
## horsepower:weight -3.407e-05 2.955e-05 -1.153 0.249743
## horsepower:acceleration -3.445e-03 3.937e-03 -0.875 0.382122
## horsepower:year -6.427e-03 3.891e-03 -1.652 0.099487 .
## horsepower:originEuropean -4.869e-03 5.061e-02 -0.096 0.923408
## horsepower:originJapanese 2.289e-02 6.252e-02 0.366 0.714533
## weight:acceleration -6.851e-05 2.385e-04 -0.287 0.774061
## weight:year -8.065e-05 2.184e-04 -0.369 0.712223
## weight:originEuropean 2.277e-03 2.685e-03 0.848 0.397037
## weight:originJapanese -4.498e-03 3.481e-03 -1.292 0.197101
## acceleration:year 6.141e-02 2.547e-02 2.412 0.016390 *
## acceleration:originEuropean 9.234e-01 2.641e-01 3.496 0.000531 ***
## acceleration:originJapanese 7.159e-01 3.258e-01 2.198 0.028614 *
## year:originEuropean 2.932e-01 1.444e-01 2.031 0.043005 *
## year:originJapanese 3.139e-01 1.483e-01 2.116 0.035034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared: 0.8967, Adjusted R-squared: 0.8866
## F-statistic: 88.34 on 35 and 356 DF, p-value: < 2.2e-16
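Using a 5% significance level (95% confidence), nine terms are significant in explaining the variation in mpg: three main effects (acceleration, originEuropean and originJapanese) and six interaction terms (cylinders:acceleration, acceleration:year, acceleration:originEuropean, acceleration:originJapanese, year:originEuropean and year:originJapanese).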
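With this many terms, counting by eye is error-prone, so the count can be taken from the coefficient table instead. A small sketch, assuming `mtcars` as a stand-in and `(wt + hp + qsec)^2` as a compact way to request all pairwise interactions:

```r
# Stand-in model with all pairwise interactions (assumption: mtcars data).
fit <- lm(mpg ~ (wt + hp + qsec)^2, data = mtcars)

coefs  <- coef(summary(fit))        # matrix: estimate, std. error, t, p
p_vals <- coefs[-1, "Pr(>|t|)"]     # drop the intercept row

significant    <- names(p_vals)[p_vals < 0.05]
is_interaction <- grepl(":", names(p_vals), fixed = TRUE)

# Cross-tabulate term type against significance at the 5% level
table(interaction = is_interaction, significant = p_vals < 0.05)
```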
exp2_lm <- lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = auto_data)
summary(exp2_lm)
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) +
## I(horsepower^2) + I(year^2), data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4816 -1.5384 0.0735 1.3671 12.0213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.185e+02 6.966e+01 6.008 4.40e-09 ***
## cylinders 5.073e-01 3.191e-01 1.590 0.112692
## displacement -3.328e-02 2.045e-02 -1.627 0.104480
## horsepower -1.781e-01 3.953e-02 -4.506 8.81e-06 ***
## weight -1.114e-02 2.587e-03 -4.306 2.12e-05 ***
## acceleration -1.700e-01 9.652e-02 -1.762 0.078960 .
## year -1.019e+01 1.837e+00 -5.546 5.49e-08 ***
## originEuropean 1.323e+00 5.304e-01 2.494 0.013068 *
## originJapanese 1.258e+00 5.129e-01 2.452 0.014637 *
## I(weight^2) 1.182e-06 3.438e-07 3.439 0.000649 ***
## I(displacement^2) 5.839e-05 3.435e-05 1.700 0.089967 .
## I(horsepower^2) 4.388e-04 1.336e-04 3.284 0.001118 **
## I(year^2) 7.210e-02 1.207e-02 5.974 5.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8735
## F-statistic: 225.9 on 12 and 379 DF, p-value: < 2.2e-16
exp3_lm <- lm(mpg ~ . - name + I(weight^3) + I(displacement^3) + I(horsepower^3) + I(year^3), data = auto_data)
summary(exp3_lm)
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^3) + I(displacement^3) +
## I(horsepower^3) + I(year^3), data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2784 -1.6036 0.0593 1.3744 12.1152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.686e+02 4.675e+01 5.746 1.88e-08 ***
## cylinders 6.578e-01 3.334e-01 1.973 0.04923 *
## displacement -1.957e-02 1.386e-02 -1.411 0.15897
## horsepower -1.182e-01 2.461e-02 -4.802 2.26e-06 ***
## weight -8.205e-03 1.490e-03 -5.507 6.75e-08 ***
## acceleration -1.441e-01 9.639e-02 -1.495 0.13573
## year -4.618e+00 9.253e-01 -4.991 9.19e-07 ***
## originEuropean 1.442e+00 5.313e-01 2.715 0.00694 **
## originJapanese 1.334e+00 5.120e-01 2.605 0.00955 **
## I(weight^3) 1.354e-10 3.256e-11 4.159 3.96e-05 ***
## I(displacement^3) 7.671e-08 4.574e-08 1.677 0.09440 .
## I(horsepower^3) 9.763e-07 3.316e-07 2.944 0.00344 **
## I(year^3) 3.106e-04 5.312e-05 5.847 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.791 on 379 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8721
## F-statistic: 223.2 on 12 and 379 DF, p-value: < 2.2e-16
sqr2_lm <- lm(mpg ~ . - name + I(weight^0.5) + I(displacement^0.5) + I(horsepower^0.5) + I(year^0.5), data = auto_data)
summary(sqr2_lm)
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^0.5) + I(displacement^0.5) +
## I(horsepower^0.5) + I(year^0.5), data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7371 -1.5567 0.0806 1.2635 12.0194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.772e+03 2.786e+02 6.359 5.83e-10 ***
## cylinders 1.477e-01 2.937e-01 0.503 0.615288
## displacement 3.307e-02 2.721e-02 1.215 0.225048
## horsepower 1.807e-01 6.200e-02 2.914 0.003776 **
## weight 9.920e-03 4.365e-03 2.272 0.023618 *
## acceleration -2.000e-01 9.640e-02 -2.074 0.038731 *
## year 2.333e+01 3.669e+00 6.359 5.85e-10 ***
## originEuropean 1.269e+00 5.237e-01 2.423 0.015854 *
## originJapanese 1.260e+00 5.136e-01 2.454 0.014582 *
## I(weight^0.5) -1.479e+00 5.120e-01 -2.889 0.004084 **
## I(displacement^0.5) -1.082e+00 8.293e-01 -1.305 0.192697
## I(horsepower^0.5) -5.281e+00 1.395e+00 -3.785 0.000179 ***
## I(year^0.5) -3.932e+02 6.397e+01 -6.147 2.01e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.774 on 379 degrees of freedom
## Multiple R-squared: 0.8775, Adjusted R-squared: 0.8737
## F-statistic: 226.3 on 12 and 379 DF, p-value: < 2.2e-16
The transformations did not change the overall significance of the model: all three fits have the same F-test p-value (< 2.2e-16). Looking at the individual predictors, no single transformation substantially improved significance: some predictors improve, others get worse. The adjusted R2 values are also nearly identical across the transformations (0.8735 squared, 0.8721 cubed, 0.8737 square root), all around 87%, although this is above the 0.8205 of the untransformed model without interactions. In the end, the untransformed model with interactions had the highest adjusted R2 (0.8866).
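A comparison like this is easier to read when the adjusted R2 values sit side by side. A minimal sketch, assuming `mtcars` as a stand-in and a reduced set of predictors:

```r
# Stand-in candidate models on mtcars (assumption; not the Auto fits above).
models <- list(
  linear      = lm(mpg ~ wt + hp, data = mtcars),
  squared     = lm(mpg ~ wt + hp + I(wt^2) + I(hp^2), data = mtcars),
  square_root = lm(mpg ~ wt + hp + I(wt^0.5) + I(hp^0.5), data = mtcars),
  interaction = lm(mpg ~ wt * hp, data = mtcars)
)

# Pull the adjusted R-squared out of each summary and rank the models
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
sort(adj_r2, decreasing = TRUE)
```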
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
lm_carseat <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm_carseat)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
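Price: as the price increases, sales decrease. Each one-unit increase in Price is associated with a decrease of about 0.054 in Sales (about 54 units, since Sales is recorded in thousands), holding the other predictors fixed.
Urban: urban stores have slightly lower sales than rural ones (-0.022), but this predictor is not significant (p = 0.936), so there is no evidence of a real difference; the observed gap is consistent with random error.
US: stores in the US have higher sales than stores outside the US, by about 1.2 (roughly 1,200 units), holding the other predictors fixed.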
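$$\widehat{\text{Sales}} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{Urban} + 1.200573 \times \text{US}$$
Here Urban is 1 for an urban store and 0 otherwise, and US is 1 for a store in the US and 0 for a store outside the US.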
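The equation can be evaluated by hand to see how the dummy variables switch terms on and off. This is a sketch only: `predict_sales` is a hypothetical helper, the coefficients are copied from the `summary(lm_carseat)` output above, and the input values are illustrative rather than taken from the data.

```r
# Coefficients copied from the summary(lm_carseat) output.
b <- c(intercept = 13.043469, price = -0.054459,
       urban = -0.021916, us = 1.200573)

# Hypothetical helper: urban and us are 1 for "Yes" and 0 for "No".
predict_sales <- function(price, urban, us) {
  b[["intercept"]] + b[["price"]] * price +
    b[["urban"]] * urban + b[["us"]] * us
}

# Illustrative inputs: an urban store at price 100, inside and outside the US
in_us  <- predict_sales(price = 100, urban = 1, us = 1)
out_us <- predict_sales(price = 100, urban = 1, us = 0)
in_us - out_us  # the gap between the two is the USYes coefficient
```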
We can reject the null hypothesis (coefficient equal to zero) for Price and USYes, as their p-values are below 5%.
lm_carseat2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm_carseat2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The two models have very close values of R2 and adjusted R2. The first model has an R2 of 0.23928, compared with 0.23926 for the second. Because it uses one extra predictor, however, the first model has a slightly lower adjusted R2: 0.23351, compared with 0.23543 for the second. Both models explain only about 24% of the variance in Sales.
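The penalty for the extra predictor can be reproduced from the usual adjusted R2 formula, 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below plugs in the R2 values quoted above with n = 400 (396 residual degrees of freedom plus four estimated coefficients); `adj_r2` is a hypothetical helper name.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

m1 <- adj_r2(0.23928, n = 400, p = 3)  # Price + Urban + US
m2 <- adj_r2(0.23926, n = 400, p = 2)  # Price + US

round(c(model1 = m1, model2 = m2), 4)
```

Dropping Urban costs almost nothing in R2 but gains a degree of freedom, which is why the second model comes out ahead on adjusted R2.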
confint(lm_carseat2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
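These intervals are the estimate plus or minus a t quantile times the standard error. As a quick check, the Price interval can be rebuilt from the rounded values printed by `summary(lm_carseat2)`; because the inputs are rounded, the result should agree with `confint()` only to about four decimal places.

```r
# Rounded values copied from the Price row of summary(lm_carseat2).
est <- -0.05448
se  <- 0.00523
df  <- 397  # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
ci
```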
par(mfrow=c(2,2))
plot(lm_carseat2)
There are a few outliers, but overall the observations lie close to the normal reference line in the Q-Q plot.
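Since this model has no intercept (B0 = 0), the coefficient of the regression of y onto x is sum(x*y)/sum(x^2), and the coefficient of the regression of x onto y is sum(x*y)/sum(y^2). The two coefficients are therefore the same when the sum of the squared values of x equals the sum of the squared values of y.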
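The closed-form slopes of a regression through the origin can be checked against `lm()`. This is a small sketch on simulated data; the seed, sample size and true slope are arbitrary choices.

```r
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Closed-form slopes for regression through the origin
b_yx <- sum(x * y) / sum(x^2)  # y onto x
b_xy <- sum(x * y) / sum(y^2)  # x onto y

# The same slopes from lm() with the intercept removed
fit_yx <- coef(lm(y ~ x + 0))
fit_xy <- coef(lm(x ~ y + 0))
```

Both slopes share the numerator `sum(x * y)`, so they can only differ through the denominators.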
set.seed(51)
x <- rnorm(100)
y <- 3.5*x + rnorm(100, sd = 3)
data_12b <- data.frame(x, y)
lm_y_12b <- lm(y ~ x+0, data = data_12b)
lm_x_12b <- lm(x ~ y+0, data = data_12b)
lm_y_12b$coefficients
## x
## 3.511354
lm_x_12b$coefficients
## y
## 0.1866791
set.seed(59)
x <- rnorm(100)
y <- x
data_12c <- data.frame(x, y)
lm_y_12c <- lm(y ~ x + 0, data = data_12c)
lm_x_12c <- lm(x ~ y + 0, data = data_12c)
lm_y_12c$coefficients
## x
## 1
lm_x_12c$coefficients
## y
## 1
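Setting y equal to x is the simplest case, but any y with the same sum of squares as x gives matching coefficients. One way to build such a y, shown here as a sketch with an arbitrary seed, is to shuffle x:

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)  # same values in a different order, so sum(y^2) equals sum(x^2)

b_yx <- unname(coef(lm(y ~ x + 0)))
b_xy <- unname(coef(lm(x ~ y + 0)))

c(y_onto_x = b_yx, x_onto_y = b_xy)
```

Here the two coefficients agree with each other without being equal to 1, since y is no longer identical to x.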