Chapter 03 (page 120): 2, 9, 10, 12
KNN regression predicts a quantitative response Y at a given value of X by taking the k training points closest to X and averaging their responses. The KNN classifier is used when Y is qualitative: it assigns X to the class that is most common among its k nearest neighbors.
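The difference between the two methods can be sketched in a few lines of base R. This is only an illustration on made-up one-dimensional data; the names `train_x`, `train_y`, `train_class`, `nearest`, `knn_regress`, and `knn_classify` are hypothetical and do not come from any package.

```r
# Toy one-dimensional training set (made-up values, for illustration only)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_y     <- c(1.0, 2.0, 3.0, 6.0, 7.0, 8.0)   # quantitative response
train_class <- c("a", "a", "a", "b", "b", "b")   # qualitative response

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(train_x - x0))[seq_len(k)]

# KNN regression: average the neighbors' responses
knn_regress <- function(x0, k) mean(train_y[nearest(x0, k)])

# KNN classification: majority vote among the neighbors' classes
knn_classify <- function(x0, k) {
  votes <- table(train_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

knn_regress(2.4, k = 3)    # mean of the responses at x = 2, 3, 1
knn_classify(2.4, k = 3)   # all three neighbors are class "a"
```

Both functions share the same neighbor search; only the final step (average versus vote) changes.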
library(ISLR)
data(Auto)
pairs(Auto)
cor(Auto[, names(Auto) != "name"])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
Lm_model = lm(mpg ~. -name, data = Auto)
summary(Lm_model)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
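The model as a whole is significant (F-statistic 252.4, p-value < 2.2e-16), so there is a relationship between the predictors and mpg.
Four predictors are statistically significant: displacement, weight, year, and origin.
The year coefficient suggests that each additional model year is associated with an increase of about 0.75 mpg, holding the other predictors fixed.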
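The list of significant predictors can also be read off programmatically rather than by eye. The sketch below refits the same model under the hypothetical name `fit` and uses a 0.05 cutoff, which is an assumption rather than something the exercise prescribes.

```r
library(ISLR)   # provides the Auto data set used throughout

fit   <- lm(mpg ~ . - name, data = Auto)
coefs <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# Predictors (intercept excluded) whose p-value is below the 0.05 cutoff
p_values    <- coefs[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
significant

# Estimated change in mpg per additional model year, other predictors held fixed
coefs["year", "Estimate"]
```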
Residuals vs Fitted plot: there is evidence of a non-linear relationship between the response and the predictors. Normal Q-Q plot: the residuals are slightly right-skewed. Residuals vs Leverage plot: there are no major high-leverage points except observation 14.
par(mfrow = c(2, 2))
plot(Lm_model)
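The visual check for high-leverage points and outliers can be backed up numerically. This is a minimal sketch, assuming a rule-of-thumb cutoff of three times the average leverage and a studentized-residual threshold of 3; both thresholds are assumptions, and `fit`, `lev`, `avg_lev`, and `high_lev` are hypothetical names.

```r
library(ISLR)

fit <- lm(mpg ~ . - name, data = Auto)

lev      <- hatvalues(fit)                    # leverage statistic for each observation
avg_lev  <- length(coef(fit)) / nrow(Auto)    # average leverage, (p + 1) / n
high_lev <- lev[lev > 3 * avg_lev]            # rule-of-thumb cutoff (an assumption)
sort(high_lev, decreasing = TRUE)

# Observations with large studentized residuals are possible outliers
which(abs(rstudent(fit)) > 3)
```

The leverage values always average to (p + 1) / n, which is why that quantity is a natural reference point.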
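The displacement:weight interaction is statistically significant (p = 0.0029); the weight:cylinders interaction is not (p = 0.37). The main effects of displacement and weight are also significant.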
Lm_model2 = lm(mpg ~ displacement*weight+weight*cylinders, data = Auto)
summary(Lm_model2)
##
## Call:
## lm(formula = mpg ~ displacement * weight + weight * cylinders,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3698 -2.5514 -0.3861 1.7206 18.0838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.798e+01 6.440e+00 7.451 6.15e-13 ***
## displacement -1.065e-01 3.066e-02 -3.473 0.000573 ***
## weight -7.232e-03 2.165e-03 -3.341 0.000916 ***
## cylinders 1.993e+00 2.055e+00 0.970 0.332710
## displacement:weight 2.457e-05 8.205e-06 2.995 0.002924 **
## weight:cylinders -5.380e-04 6.016e-04 -0.894 0.371771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared: 0.7273, Adjusted R-squared: 0.7237
## F-statistic: 205.8 on 5 and 386 DF, p-value: < 2.2e-16
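I log-transformed horsepower, weight, and displacement. All three predictors are significant at the 5% level, although log(displacement) only marginally (p = 0.048). The Residuals vs Fitted plot no longer shows a visible curve, but the Q-Q plot still does not support normally distributed residuals.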
lmlog = lm(mpg ~ log(horsepower) + log(weight) + log(displacement), data = Auto)
summary(lmlog)
##
## Call:
## lm(formula = mpg ~ log(horsepower) + log(weight) + log(displacement),
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6539 -2.4232 -0.3147 2.0760 15.2014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161.495 11.898 13.574 < 2e-16 ***
## log(horsepower) -6.929 1.263 -5.487 7.38e-08 ***
## log(weight) -11.835 2.264 -5.228 2.81e-07 ***
## log(displacement) -2.353 1.187 -1.982 0.0482 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.978 on 388 degrees of freedom
## Multiple R-squared: 0.7422, Adjusted R-squared: 0.7402
## F-statistic: 372.3 on 3 and 388 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmlog)
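The Q-Q plot impression can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test from base R to the residuals of the same log model, refitted under the hypothetical name `fit_log`; using this particular test is a choice made here for illustration, not part of the exercise.

```r
library(ISLR)

fit_log <- lm(mpg ~ log(horsepower) + log(weight) + log(displacement), data = Auto)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(residuals(fit_log))
sw$p.value   # a small value is evidence against normality
```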
data(Carseats)
Lm_model3 = lm(Sales ~ Price+Urban+US, data = Carseats)
summary(Lm_model3)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
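Price and US are significant predictors; Urban is not (p = 0.936). Sales is recorded in thousands of units, so a $1 increase in price is associated with a decrease of about 54 units sold (0.054 thousand), holding the other predictors fixed. Stores in the US sell about 1,200 more car seats on average than stores outside the US at the same price.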
\(Sales = 13.043469 - 0.054459\,Price - 0.021916\,Urban_{Yes} + 1.200573\,US_{Yes}\)
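We can reject the null hypothesis that the coefficient is zero for ‘Price’ and ‘USYes’, because their p-values are far below 0.05. We cannot reject it for ‘UrbanYes’ (p = 0.936).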
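As a worked example of the fitted equation above, the sketch below plugs in a hypothetical store. The coefficients are copied from the summary output; the store itself (price of 100, urban, in the US) is made up, and the indicator variables take the value 1 for "Yes" and 0 for "No".

```r
# Coefficients copied from the summary(Lm_model3) output
b0      <- 13.043469
b_price <- -0.054459
b_urban <- -0.021916   # applies when Urban == "Yes"
b_us    <-  1.200573   # applies when US == "Yes"

# Hypothetical store: price of 100, urban location, in the US
price     <- 100
urban_yes <- 1
us_yes    <- 1

sales_hat <- b0 + b_price * price + b_urban * urban_yes + b_us * us_yes
sales_hat   # about 8.776, i.e. roughly 8,776 units since Sales is in thousands
```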
Lm_model4 = lm(Sales ~ Price+US, data = Carseats)
summary(Lm_model4)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
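Neither model fits the data well: both explain only about 24% of the variance in Sales (multiple R-squared 0.2393). The smaller model is marginally better after adjusting for the number of predictors (adjusted R-squared 0.2354 versus 0.2335, residual standard error 2.469 versus 2.472).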
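One way to compare the two fits directly is a partial F-test with `anova()`. The sketch below refits both models under the hypothetical names `full` and `reduced`; with a single dropped term, the F-test p-value matches the t-test p-value for UrbanYes.

```r
library(ISLR)

full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
reduced <- lm(Sales ~ Price + US, data = Carseats)

# Partial F-test: does adding Urban significantly reduce the residual sum of squares?
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared of each model, side by side
c(full = summary(full)$adj.r.squared, reduced = summary(reduced)$adj.r.squared)
```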
confint(Lm_model4)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
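The intervals reported by `confint()` are the estimate plus or minus a t critical value times the standard error. As a rough check, the sketch below rebuilds the Price interval from the rounded values printed in the summary table, so it only agrees with the output above to a few decimal places.

```r
# Estimate and standard error for Price, copied from summary(Lm_model4)
est <- -0.05448
se  <-  0.00523
df  <- 397                  # residual degrees of freedom

t_crit <- qt(0.975, df)     # two-sided 95% critical value
ci     <- est + c(-1, 1) * t_crit * se
ci   # close to the confint() row for Price; small differences come from rounding
```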
The Residuals vs Fitted plot and the Q-Q plot are consistent with normally distributed residuals. The Residuals vs Leverage plot indicates that there are outliers and some high-leverage points.
par(mfrow=c(2,2))
plot(Lm_model4)
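For regression without an intercept, the slope from regressing Y onto X is \(\sum x_i y_i / \sum x_i^2\), and the slope from regressing X onto Y is \(\sum x_i y_i / \sum y_i^2\). The two estimates are equal when the sum of squares of the observed y values equals the sum of squares of the observed x values, \(\sum y_i^2 = \sum x_i^2\).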
x <- rnorm(100)
y <- x^2
coefficients(lm(x ~ y))
## (Intercept) y
## 0.1708051 -0.2072981
coefficients(lm(y ~ x))
## (Intercept) x
## 1.1243761 -0.4287084
x <- rnorm(100)
y <- x
coefficients(lm(x ~ y))
## (Intercept) y
## -1.110223e-17 1.000000e+00
coefficients(lm(y ~ x))
## (Intercept) x
## -1.110223e-17 1.000000e+00
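The fits above keep the intercept, whereas the question concerns regression without one. A minimal sketch of the no-intercept case follows, using `+ 0` in the formula to drop the intercept. The seed and the two constructions of `y` are choices made here for illustration only.

```r
set.seed(1)                    # arbitrary seed, only for reproducibility
x <- rnorm(100)

# Case 1: y = 2x, so sum(y^2) = 4 * sum(x^2) and the two slopes differ
y <- 2 * x
b_yx <- coef(lm(y ~ x + 0))    # sum(x * y) / sum(x^2)
b_xy <- coef(lm(x ~ y + 0))    # sum(x * y) / sum(y^2)
c(b_yx, b_xy)                  # 2 and 0.5

# Case 2: y is a shuffled copy of x, so sum(y^2) equals sum(x^2) and the slopes match
y <- sample(x)
b_yx <- coef(lm(y ~ x + 0))
b_xy <- coef(lm(x ~ y + 0))
all.equal(unname(b_yx), unname(b_xy))
```

Shuffling changes the pairing of the observations but not the sum of squares, which is exactly the condition under which the two slopes coincide.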