The KNN classifier is used for classification problems, where the response is qualitative. KNN regression is used for regression problems, where the response is quantitative. For a given observation, the KNN classifier estimates the probability of each class as the fraction of the K nearest training observations that belong to it, and assigns the observation to the class with the largest estimated probability. KNN regression identifies the same neighborhood of K nearest training observations and estimates the outcome as the average of the training responses in that neighborhood.
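The difference between the two KNN variants can be sketched in base R. The helper names `knn_neighbors`, `knn_classify`, and `knn_regress`, the toy data, and the choice of Euclidean distance are assumptions made for illustration only.

```r
# Minimal base-R sketch of KNN for a single query point.
# knn_neighbors, knn_classify, and knn_regress are hypothetical helper names.

# Indices of the k training rows closest to x0 (Euclidean distance).
knn_neighbors <- function(train_x, x0, k) {
  diffs <- sweep(as.matrix(train_x), 2, x0)   # subtract x0 from every row
  dists <- sqrt(rowSums(diffs^2))
  order(dists)[seq_len(k)]
}

# Classification: estimate class probabilities as neighbor proportions,
# then pick the class with the largest estimated probability.
knn_classify <- function(train_x, train_y, x0, k) {
  idx <- knn_neighbors(train_x, x0, k)
  probs <- table(train_y[idx]) / k
  names(probs)[which.max(probs)]
}

# Regression: average the training responses in the neighborhood.
knn_regress <- function(train_x, train_y, x0, k) {
  idx <- knn_neighbors(train_x, x0, k)
  mean(train_y[idx])
}

# Toy data: two well-separated groups in two dimensions.
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
labels  <- c("a", "a", "a", "b", "b", "b")
values  <- c(1, 2, 3, 10, 11, 12)

pred_class <- knn_classify(train_x, labels, x0 = c(0.2, 0.2), k = 3)
pred_value <- knn_regress(train_x, values, x0 = c(5.5, 5.5), k = 3)
```

Both functions share the same neighborhood search; only the final step differs (majority class versus mean response).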
library(readr)
library(MASS)
library(ISLR)
Auto = read.csv("Auto.csv", na.strings = "?")
Auto = na.omit(Auto)
# scatterplot matrix
plot(Auto)
# matrix of correlations
corr = cor(Auto[1:8])
print(corr)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
attach(Auto)
# multiple linear regression
lm.fit = lm(mpg~.-name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The linear regression model shows a relationship between the predictors and the response: the F-statistic of 252.4 has a p-value below 2.2e-16, and the model has an R-squared of 0.8215.
The predictors displacement, weight, year, and origin have a statistically significant relationship with the response variable mpg, since each has a p-value below 0.05.
The coefficient on year suggests that, holding the other predictors fixed, mpg increases by about 0.75 for each additional model year.
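The "holding the other predictors fixed" reading of a coefficient can be checked directly. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, so that it runs without `Auto.csv`); the two hypothetical cars are invented for illustration.

```r
# Stand-in example on the built-in mtcars data (not the Auto data used above).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ only in wt, by exactly one unit.
cars <- data.frame(wt = c(3, 4), hp = c(110, 110))
pred <- predict(fit, newdata = cars)

# The gap between the two predictions equals the wt coefficient.
gap   <- unname(pred[2] - pred[1])
slope <- unname(coef(fit)["wt"])
```

The same reasoning applies to `year` in the Auto model: two cars identical in every other predictor but one model year apart differ in predicted mpg by the `year` coefficient.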
# diagnostic plots
par(mfrow=c(2,2))
plot(lm.fit)
The residual plot shows some unusually large outliers, which also have high leverage in the leverage plot, so those observations are influential in the model.
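The visual reading of the diagnostic plots can be backed up with numeric diagnostics. This is a sketch on the built-in `mtcars` data as a stand-in (an assumption); the cutoffs are common rules of thumb rather than fixed thresholds.

```r
# Stand-in fit on the built-in mtcars data (the Auto data is not assumed here).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit))              # number of coefficients, intercept included

lev   <- hatvalues(fit)             # leverage of each observation
rstud <- rstudent(fit)              # studentized residuals
cook  <- cooks.distance(fit)        # combined measure of influence

# Rule-of-thumb cutoffs (conventions, not hard thresholds).
high_leverage <- names(lev)[lev > 2 * p / n]
outliers      <- names(rstud)[abs(rstud) > 3]
influential   <- names(cook)[cook > 4 / n]

# An observation is most concerning when it is both an outlier and high leverage.
both <- intersect(high_leverage, outliers)
```

Replacing `fit` with `lm.fit` would apply the same checks to the Auto model.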
# interaction effects
inter_eff = lm(mpg ~ . - name + weight * year + weight * origin ,
data = Auto)
summary(inter_eff)
##
## Call:
## lm(formula = mpg ~ . - name + weight * year + weight * origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6874 -1.8186 -0.1915 1.4850 11.5235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.142e+02 1.343e+01 -8.509 4.05e-16 ***
## cylinders -2.002e-01 3.033e-01 -0.660 0.50950
## displacement 7.467e-03 7.354e-03 1.015 0.31058
## horsepower -2.435e-02 1.292e-02 -1.884 0.06033 .
## weight 2.936e-02 4.647e-03 6.318 7.37e-10 ***
## acceleration 1.159e-01 9.224e-02 1.257 0.20967
## year 1.979e+00 1.779e-01 11.129 < 2e-16 ***
## origin 4.039e+00 1.246e+00 3.243 0.00129 **
## weight:year -4.470e-04 6.305e-05 -7.090 6.54e-12 ***
## weight:origin -1.261e-03 5.363e-04 -2.352 0.01918 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.066 on 382 degrees of freedom
## Multiple R-squared: 0.8492, Adjusted R-squared: 0.8457
## F-statistic: 239.1 on 9 and 382 DF, p-value: < 2.2e-16
The interaction terms \(weight \times year\) and \(weight \times origin\) are both significant, with p-values below 0.05. R-squared also increases, from 0.8215 to 0.8492.
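As a side note on the formula syntax, `a * b` in an R formula expands to both main effects plus the interaction. The short sketch below inspects the expanded terms without fitting anything; the variable names `y`, `a`, and `b` are placeholders.

```r
# In an R formula, a * b is shorthand for a + b + a:b,
# so main effects are included automatically alongside the interaction.
full <- attr(terms(y ~ a * b), "term.labels")
long <- attr(terms(y ~ a + b + a:b), "term.labels")

# Adding weight * year to a formula that already contains both main effects
# therefore only adds the weight:year term.
dup <- attr(terms(mpg ~ weight + year + weight * year), "term.labels")
```

This is why the summary lists only `weight:year` and `weight:origin` as new rows: the main effects were already in the model through `.`.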
# transformation of variables
trans_var = lm(mpg ~ . - name + I(log(weight)) + I(log(origin)) + I(log(weight * year)), data = Auto)
summary(trans_var)
##
## Call:
## lm(formula = mpg ~ . - name + I(log(weight)) + I(log(origin)) +
## I(log(weight * year)), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5239 -1.5448 0.0142 1.4848 12.8274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.330e+03 4.740e+02 7.025 9.91e-12 ***
## cylinders -2.983e-01 2.780e-01 -1.073 0.28384
## displacement 1.160e-02 6.676e-03 1.737 0.08317 .
## horsepower -3.786e-02 1.202e-02 -3.150 0.00176 **
## weight 8.398e-03 1.556e-03 5.396 1.20e-07 ***
## acceleration 6.922e-03 8.476e-02 0.082 0.93496
## year 1.290e+01 1.868e+00 6.907 2.08e-11 ***
## origin -3.815e+00 1.643e+00 -2.323 0.02073 *
## I(log(weight)) 8.769e+02 1.419e+02 6.179 1.66e-09 ***
## I(log(origin)) 8.318e+00 2.962e+00 2.808 0.00523 **
## I(log(weight * year)) -9.183e+02 1.419e+02 -6.470 3.01e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.844 on 381 degrees of freedom
## Multiple R-squared: 0.8706, Adjusted R-squared: 0.8672
## F-statistic: 256.4 on 10 and 381 DF, p-value: < 2.2e-16
trans_var2 = lm(mpg ~ . - name + I(sqrt(weight)) +
I(sqrt(origin)) + I(sqrt(weight * year)), data = Auto)
summary(trans_var2)
##
## Call:
## lm(formula = mpg ~ . - name + I(sqrt(weight)) + I(sqrt(origin)) +
## I(sqrt(weight * year)), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7099 -1.6043 -0.0566 1.4825 12.0998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.232e+02 3.127e+01 -3.938 9.75e-05 ***
## cylinders -1.965e-01 2.813e-01 -0.699 0.48525
## displacement 1.396e-02 6.686e-03 2.089 0.03741 *
## horsepower -3.127e-02 1.195e-02 -2.616 0.00924 **
## weight 1.483e-02 3.099e-03 4.787 2.43e-06 ***
## acceleration 1.032e-01 8.559e-02 1.206 0.22845
## year 2.767e+00 3.352e-01 8.255 2.51e-15 ***
## origin -9.378e+00 3.278e+00 -2.861 0.00446 **
## I(sqrt(weight)) 3.366e+00 1.136e+00 2.963 0.00323 **
## I(sqrt(origin)) 2.756e+01 8.935e+00 3.084 0.00219 **
## I(sqrt(weight * year)) -6.559e-01 1.120e-01 -5.859 1.01e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.863 on 381 degrees of freedom
## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8654
## F-statistic: 252.5 on 10 and 381 DF, p-value: < 2.2e-16
trans_var3 = lm(mpg ~ . - name + I(weight ^ 2) + I(origin ^ 2) + I((weight *year) ^ 2), data = Auto)
summary(trans_var3)
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(origin^2) + I((weight *
## year)^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1957 -1.6635 -0.0742 1.5486 12.1525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.204e+01 8.979e+00 -4.682 3.96e-06 ***
## cylinders -4.797e-02 2.846e-01 -0.169 0.866225
## displacement 1.583e-02 6.712e-03 2.359 0.018841 *
## horsepower -3.310e-02 1.207e-02 -2.743 0.006376 **
## weight -1.716e-02 1.668e-03 -10.290 < 2e-16 ***
## acceleration 1.033e-01 8.597e-02 1.202 0.230216
## year 1.241e+00 9.390e-02 13.212 < 2e-16 ***
## origin 6.160e+00 1.755e+00 3.511 0.000501 ***
## I(weight^2) 3.650e-06 3.603e-07 10.128 < 2e-16 ***
## I(origin^2) -1.351e+00 4.337e-01 -3.115 0.001981 **
## I((weight * year)^2) -3.509e-10 6.917e-11 -5.073 6.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.882 on 381 degrees of freedom
## Multiple R-squared: 0.8671, Adjusted R-squared: 0.8636
## F-statistic: 248.6 on 10 and 381 DF, p-value: < 2.2e-16
The log transformation gives the best fit of the three, with the highest R-squared (0.8706), the highest F-statistic (256.4), and the lowest residual standard error (2.844). The model using log transformations is therefore better than the original regression model and the two models that use square root and square transformations.
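Collecting the fit statistics in one table makes this kind of comparison easier to read. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption), and `fit_stats` is a hypothetical helper name.

```r
# Stand-in comparison on the built-in mtcars data (not the Auto data).
# fit_stats is a hypothetical helper that collects the summary numbers
# compared in the text above.
fit_stats <- function(fit) {
  s <- summary(fit)
  c(r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    rse           = s$sigma,
    f_statistic   = unname(s$fstatistic["value"]))
}

models <- list(
  raw    = lm(mpg ~ wt + hp, data = mtcars),
  log    = lm(mpg ~ wt + hp + I(log(wt)), data = mtcars),
  sqrt   = lm(mpg ~ wt + hp + I(sqrt(wt)), data = mtcars),
  square = lm(mpg ~ wt + hp + I(wt^2), data = mtcars)
)

# One row per model, one column per statistic.
comparison <- t(sapply(models, fit_stats))
```

Because every transformed model nests the raw one, its R-squared can never be lower, so adjusted R-squared and residual standard error are the more informative columns.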
detach(Auto)
data(Carseats)
attach(Carseats)
contrasts(Urban)
##     Yes
## No    0
## Yes   1
contrasts(US)
##     Yes
## No    0
## Yes   1
The contrasts show the dummy coding: Urban is coded 1 if the store is in an urban area and 0 otherwise, and US is coded 1 if the store is in the US and 0 otherwise.
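The coding can also be seen in the design matrix that `lm()` builds internally. This sketch uses a toy factor named `urban` (an assumption, not the Carseats column).

```r
# A two-level factor similar to Urban or US (toy values, not the Carseats data).
urban <- factor(c("Yes", "No", "Yes", "No"))

# contrasts() shows the coding: the first level ("No") is the baseline.
coding <- contrasts(urban)

# model.matrix() shows the 0/1 column that lm() actually uses.
design <- model.matrix(~ urban)
```

The column name `urbanYes` follows the same pattern as `UrbanYes` and `USYes` in the regression output: factor name followed by the non-baseline level.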
# Model 1:
lm.fit2 = lm(Sales ~ Price + Urban + US)
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
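Holding the other predictors fixed, a higher Price is associated with lower Sales: each unit increase in Price corresponds to a decrease of about 0.054 in Sales. Stores in the US (USYes) have Sales about 1.2 higher than stores outside the US. The coefficient for stores in an urban area (UrbanYes) is slightly negative, but it is not statistically significant, so there is no evidence that an urban location affects Sales.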
\(Sales = 13.043 - 0.054 \times Price - 0.022 \times UrbanYes + 1.201 \times USYes\)
The F-statistic of 41.52 has a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero. Not every individual predictor is significant, however, so the null hypothesis for a single coefficient can be rejected only where its p-value is below 0.05.
Only the Price and USYes variables are significant, with p-values below 0.05. UrbanYes is not significant because of its high p-value (0.936), so we can fit a smaller model for the outcome by removing the Urban variable.
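Dropping a predictor can also be justified with a partial F-test between the nested models. The sketch below runs on simulated data (an assumption: the variable names mirror Carseats, but the values are generated here and `urban` has no true effect by construction).

```r
set.seed(1)
n <- 200
price <- runif(n, 50, 150)
us    <- factor(sample(c("No", "Yes"), n, replace = TRUE))
urban <- factor(sample(c("No", "Yes"), n, replace = TRUE))

# Simulated response: urban has no true effect by construction.
sales <- 13 - 0.05 * price + 1.2 * (us == "Yes") + rnorm(n, sd = 2.5)

full    <- lm(sales ~ price + urban + us)
reduced <- lm(sales ~ price + us)

# Partial F-test: does adding urban significantly reduce
# the residual sum of squares?
cmp <- anova(reduced, full)
p_value <- cmp[["Pr(>F)"]][2]
```

When the dropped term has a single coefficient, this p-value matches the t-test p-value for that coefficient in the full model, so `anova(lm.fit3, lm.fit2)` would lead to the same conclusion as the `UrbanYes` row of the summary.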
# fit a smaller model
# Model 2:
lm.fit3 = lm(Sales ~ Price + US)
summary(lm.fit3)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
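Model 2 contains only significant predictors, and its adjusted R-squared (0.2354) and residual standard error (2.469) are slightly better than those of Model 1 (0.2335 and 2.472). Both models have a highly significant F-statistic, but an R-squared of 0.2393 means they explain only about 24% of the variance in Sales, so neither model fits the data well.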
confid_int=confint(lm.fit3, level = 0.95)
par(mfrow=c(2,2))
plot(lm.fit3)
There are very few outliers and high-leverage observations in the model, so they are not very influential.
detach(Carseats)
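If the variables X and Y are equal, then the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X. In the first example below, Y = 1.25X and the two estimates differ (1.25 and 0.8); in the second, Y = X and both estimates equal 1.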
X= rnorm(n=100)
Y= 0.25*X + X
coef(lm(Y~X))
## (Intercept) X
## -3.885781e-17 1.250000e+00
coef(lm(X~Y))
## (Intercept) Y
## 2.844947e-17 8.000000e-01
X= rnorm(n=100)
Y= X
coef(lm(Y~X))
## (Intercept) X
## -1.665335e-17 1.000000e+00
coef(lm(X~Y))
## (Intercept) Y
## -1.665335e-17 1.000000e+00
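More generally, when both regressions include an intercept, the two slopes are cov(X, Y) / var(X) and cov(X, Y) / var(Y), so they coincide whenever X and Y have the same variance. The sketch below checks this on simulated data; the seed and the noise model are assumptions chosen for illustration.

```r
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

slope_yx <- unname(coef(lm(y ~ x))["x"])   # regression of y onto x
slope_xy <- unname(coef(lm(x ~ y))["y"])   # regression of x onto y

# slope_yx = cov(x, y) / var(x) and slope_xy = cov(x, y) / var(y),
# so their product is the squared correlation ...
product <- slope_yx * slope_xy

# ... and their ratio is var(y) / var(x).
# The slopes match only when the two variances match.
ratio <- slope_yx / slope_xy
```

This agrees with the first example above: 1.25 times 0.8 equals 1, the squared correlation of two perfectly correlated variables.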