Carefully explain the differences between the KNN classifier and KNN regression methods.
KNN regression is used when the response is quantitative. To predict at a new point, it finds the K training observations closest to that point and returns the average of their target values. The KNN classifier is used when the response is qualitative, for example “yes” or “no”, or “0” or “1”. It finds the same K nearest neighbors, but the prediction is the most frequent class among them.
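The contrast can be sketched with a tiny hand-rolled example. The training data and the helper names (`nearest`, `knn_reg`, `knn_class`) are made up for illustration and are not part of the assignment; this is a minimal sketch, not a full KNN implementation.

```r
# Toy 1-D training data (made up for illustration)
x_train <- c(1, 2, 3, 6, 7, 8)
y_num   <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)          # quantitative response
y_class <- c("no", "no", "no", "yes", "yes", "yes") # qualitative response

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(x_train - x0))[1:k]

# KNN regression: average the neighbors' target values
knn_reg <- function(x0, k) mean(y_num[nearest(x0, k)])

# KNN classifier: most frequent class among the neighbors
knn_class <- function(x0, k) {
  votes <- table(y_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

knn_reg(2.4, k = 3)    # 1.5
knn_class(2.4, k = 3)  # "no"
```

Both functions use the same three neighbors; only the way the neighbors' responses are combined differs.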
library(ISLR2)
Warning: package ‘ISLR2’ was built under R version 4.4.2
auto = read.csv("Auto.csv")
auto$horsepower = as.numeric(auto$horsepower)
auto$weight = as.numeric(auto$weight)
str(auto)
pairs(auto[c("mpg","displacement","weight","acceleration")])
auto.num = auto[sapply(auto, is.numeric)]  # keep numeric columns only (drops name)
cor(auto.num, use = "complete.obs")
mpg cylinders displacement horsepower
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268
cylinders -0.7776175 1.0000000 0.9508233 0.8429834
displacement -0.8051269 0.9508233 1.0000000 0.8972570
horsepower -0.7784268 0.8429834 0.8972570 1.0000000
weight -0.8322442 0.8975273 0.9329944 0.8645377
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955
year 0.5805410 -0.3456474 -0.3698552 -0.4163615
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715
weight acceleration year origin
mpg -0.8322442 0.4233285 0.5805410 0.5652088
cylinders 0.8975273 -0.5046834 -0.3456474 -0.5689316
displacement 0.9329944 -0.5438005 -0.3698552 -0.6145351
horsepower 0.8645377 -0.6891955 -0.4163615 -0.4551715
weight 1.0000000 -0.4168392 -0.3091199 -0.5850054
acceleration -0.4168392 1.0000000 0.2903161 0.2127458
year -0.3091199 0.2903161 1.0000000 0.1815277
origin -0.5850054 0.2127458 0.1815277 1.0000000
lm.fit = lm(mpg ~ . - name, data = auto)
summary(lm.fit)
Call:
lm(formula = mpg ~ . - name, data = auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.328 on 384 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The F-statistic (252.4 on 7 and 384 DF, p-value < 2.2e-16) indicates that there is a relationship between the response, mpg, and the predictors in the linear model taken together.
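The p-value reported for the overall F-test can be reproduced from the printed statistic and degrees of freedom. A small sketch, with the numbers copied from the summary output (the variable names are just for illustration):

```r
# Values copied from the summary(lm.fit) output
f_stat <- 252.4
df1 <- 7      # number of predictors
df2 <- 384    # residual degrees of freedom

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value < 2.2e-16  # TRUE
```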
The predictors that appear to have a statistically significant relationship with the response are displacement, weight, year, and origin.
The coefficient for year suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg.
The Residuals vs Fitted plot shows no obvious pattern. We can see a random scatter around zero.
In the Normal Q-Q plot the residuals look approximately normally distributed, although the right tail veers off the reference line. This implies that the residuals may not follow a normal distribution exactly. One way to check this formally is the Shapiro-Wilk test.
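A minimal sketch of running the Shapiro-Wilk test on model residuals. It uses the built-in mtcars data and a stand-in model, both assumptions for illustration rather than the Auto fit above.

```r
# Stand-in model on built-in data (not the Auto fit above)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
sw$statistic  # W statistic, close to 1 for normal-looking residuals
sw$p.value    # a small p-value is evidence against normality
```

For the Auto model the same call would be `shapiro.test(residuals(lm.fit))`.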
## Scale-Location
The Scale-Location diagnostic plot checks for homoscedasticity, that is, whether the spread of the residuals stays roughly constant across the fitted values. This plot fulfills the assumption.
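The quantities drawn in a Scale-Location plot can also be computed directly. The sketch below uses a stand-in model on the built-in mtcars data (an assumption, not the Auto fit), and the rank correlation at the end is only a rough numeric summary, not a formal test.

```r
# Stand-in model on built-in data (not the Auto fit above)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The plot shows sqrt(|standardized residual|) against the fitted values
sl <- data.frame(
  fitted = fitted(fit),
  sqrt_abs_std_resid = sqrt(abs(rstandard(fit)))
)

# Rough check: a value near zero means the spread does not trend with the fit
trend <- cor(sl$fitted, sl$sqrt_abs_std_resid, method = "spearman")
trend
```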
In the Residuals vs Leverage plot, observation 14 appears to have high leverage but remains within the Cook's distance contours.
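Leverage can be read off numerically with `hatvalues()`. A minimal sketch on a stand-in model (built-in mtcars data, not the Auto fit); the cutoff of twice the average leverage is a common rule of thumb and is an assumption here, not something the plot itself uses.

```r
# Stand-in model on built-in data (not the Auto fit above)
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)              # leverage of each observation
n <- nobs(fit)
p <- length(coef(fit)) - 1       # number of predictors
avg_leverage <- (p + 1) / n      # leverages always average to (p + 1) / n

# Rule-of-thumb flag: more than twice the average leverage
high_leverage <- names(h)[h > 2 * avg_leverage]
high_leverage
```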
lm.fit2 = lm(mpg ~ displacement * weight + year * origin, data = auto)
summary(lm.fit2)
Call:
lm(formula = mpg ~ displacement * weight + year * origin, data = auto)
Residuals:
Min 1Q Median 3Q Max
-9.5843 -1.6827 -0.0667 1.3420 13.2836
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.760e+01 7.942e+00 2.216 0.027242 *
displacement -7.542e-02 9.025e-03 -8.357 1.15e-15 ***
weight -1.044e-02 6.399e-04 -16.312 < 2e-16 ***
year 4.927e-01 1.007e-01 4.894 1.45e-06 ***
origin -1.502e+01 4.200e+00 -3.576 0.000392 ***
displacement:weight 2.117e-05 2.163e-06 9.787 < 2e-16 ***
year:origin 1.981e-01 5.399e-02 3.669 0.000277 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.964 on 390 degrees of freedom
Multiple R-squared: 0.8587, Adjusted R-squared: 0.8566
F-statistic: 395.1 on 6 and 390 DF, p-value: < 2.2e-16
The displacement:weight interaction is statistically significant, as is the year:origin interaction.
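Another way to judge an interaction is to compare the model with and without it using `anova()`. A sketch on the built-in mtcars data, with `disp` and `wt` standing in for displacement and weight (an assumption for illustration, not the Auto fit):

```r
# Stand-in models on built-in data (not the Auto fit above)
fit_add <- lm(mpg ~ disp + wt, data = mtcars)   # main effects only
fit_int <- lm(mpg ~ disp * wt, data = mtcars)   # adds the disp:wt interaction

# F-test for the one extra term
cmp <- anova(fit_add, fit_int)
cmp

# For a single added term, the F statistic equals the squared t statistic
t_int <- coef(summary(fit_int))["disp:wt", "t value"]
all.equal(cmp$F[2], t_int^2)
```

With only one added coefficient the two tests agree, so the p-value in the coefficient table is enough. The comparison becomes useful when several interaction terms are added at once.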
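lm.fit3 = lm(mpg ~ log(displacement) + log(weight) + log(horsepower) + log(acceleration), data = auto)
summary(lm.fit3)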
Call:
lm(formula = mpg ~ log(displacement) + log(weight) + log(horsepower) +
log(acceleration), data = auto)
Residuals:
Min 1Q Median 3Q Max
-12.4014 -2.4485 -0.2746 1.9905 15.4792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 165.040 11.772 14.020 < 2e-16 ***
log(displacement) -3.634 1.226 -2.964 0.003221 **
log(weight) -6.054 2.774 -2.182 0.029692 *
log(horsepower) -12.038 1.916 -6.283 8.95e-10 ***
log(acceleration) -7.167 2.043 -3.508 0.000505 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.922 on 387 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.7501, Adjusted R-squared: 0.7475
F-statistic: 290.5 on 4 and 387 DF, p-value: < 2.2e-16
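When using a log() transformation, every selected predictor has a negative coefficient, so mpg decreases as each log-transformed variable increases, holding the others fixed. This includes log(acceleration), even though acceleration on its own is positively correlated with mpg (0.42).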
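A coefficient on a logged predictor can be read as the change in mpg for a percentage change in that predictor. A small sketch using the log(horsepower) estimate copied from the output above (variable names are just for illustration):

```r
# Estimate for log(horsepower), copied from summary(lm.fit3)
b_log_hp <- -12.038

# Change in predicted mpg when horsepower rises by 1%, others held fixed:
# b * (log(1.01 * hp) - log(hp)) = b * log(1.01)
change_1pct <- b_log_hp * log(1.01)
change_1pct  # about -0.12
```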
lm.fit4 = lm(mpg ~ poly(displacement, 3) + poly(weight, 3) + poly(acceleration, 3), data = auto)
summary(lm.fit4)
Call:
lm(formula = mpg ~ poly(displacement, 3) + poly(weight, 3) +
poly(acceleration, 3), data = auto)
Residuals:
Min 1Q Median 3Q Max
-12.5820 -2.3528 -0.3239 1.9444 18.0187
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.5159 0.2043 115.077 < 2e-16 ***
poly(displacement, 3)1 -37.8218 14.9110 -2.536 0.01159 *
poly(displacement, 3)2 8.7690 7.7816 1.127 0.26049
poly(displacement, 3)3 7.1830 5.5091 1.304 0.19307
poly(weight, 3)1 -90.3471 13.3874 -6.749 5.47e-11 ***
poly(weight, 3)2 21.0103 7.0899 2.963 0.00323 **
poly(weight, 3)3 -7.8782 5.2966 -1.487 0.13772
poly(acceleration, 3)1 11.2690 5.8032 1.942 0.05288 .
poly(acceleration, 3)2 6.9245 5.2968 1.307 0.19189
poly(acceleration, 3)3 8.4332 4.4690 1.887 0.05990 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.072 on 387 degrees of freedom
Multiple R-squared: 0.7355, Adjusted R-squared: 0.7293
F-statistic: 119.5 on 9 and 387 DF, p-value: < 2.2e-16
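By default `poly()` builds orthogonal polynomial terms, so the individual coefficients in a fit like the one above are not the coefficients of x, x^2 and x^3. A sketch on the built-in mtcars data (a stand-in, not the Auto fit) showing that the orthogonal and raw versions still describe the same fitted curve:

```r
# Stand-in models on built-in data (not the Auto fit above)
fit_orth <- lm(mpg ~ poly(wt, 3), data = mtcars)              # orthogonal basis
fit_raw  <- lm(mpg ~ poly(wt, 3, raw = TRUE), data = mtcars)  # wt, wt^2, wt^3

# Different coefficients, same fitted values
coef(fit_orth)
coef(fit_raw)
all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))
```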
Using a cubic polynomial transformation, only the linear terms for displacement and weight and the quadratic term for weight are statistically significant at the 0.05 level.
# 10
car.lm = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(car.lm)
Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
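When the company increases the ‘Price’ of a car seat at a location by one unit, ‘Sales’ decrease by about 0.054 (roughly 54 units, since Sales is recorded in thousands), holding the other predictors fixed. A store located in an urban area has ‘Sales’ about 0.02 lower, but this difference is not statistically significant (p = 0.936). When the store is located in the United States, ‘Sales’ are about 1.2 higher (roughly 1,200 units).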
\(\widehat{Sales} = 13.043 - 0.054 \cdot Price - 0.022 \cdot UrbanYes + 1.201 \cdot USYes\)
We can reject the null hypothesis \(H_0: \beta_j = 0\) for \(Price\) and \(USYes\). We fail to reject it for \(UrbanYes\) (p = 0.936).
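The t value and p-value for each predictor come straight from its estimate and standard error. A small sketch for UrbanYes, with the numbers copied from the summary output above (variable names are just for illustration):

```r
# Values copied from the summary(car.lm) output
est <- -0.021916   # UrbanYes estimate
se  <- 0.271650    # UrbanYes standard error
df  <- 396         # residual degrees of freedom

t_stat <- est / se                   # about -0.081
p_val  <- 2 * pt(-abs(t_stat), df)   # two-sided p-value, about 0.936
c(t_stat = t_stat, p_val = p_val)
```

A p-value this large gives no evidence that the UrbanYes coefficient differs from zero.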
car.lm2 = lm(Sales ~ Price + US, data = Carseats)
summary(car.lm2)
Call:
lm(formula = Sales ~ Price + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
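The model in part (a) fits about the same as the model in part (e): both have an \(R^2\) of 0.2393. The adjusted \(R^2\) is a bit better for the model in part (e) than for the one in part (a) (0.2354 versus 0.2335), and its residual standard error is slightly lower (2.469 versus 2.472). The model in part (e) is preferable because it yields similar numbers with a simpler model.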
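The adjusted \(R^2\) values can be approximately recovered from the printed \(R^2\), which shows why dropping a useless predictor raises it. In this sketch the helper name `adj_r2` is hypothetical, and n = 400 is inferred from the residual degrees of freedom in the output. Because the printed \(R^2\) is rounded, the results match the summaries only to about three decimals.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400  # 396 residual df + 4 coefficients in the part (a) model

adj_a <- adj_r2(0.2393, n, p = 3)  # Price + Urban + US
adj_e <- adj_r2(0.2393, n, p = 2)  # Price + US
c(part_a = adj_a, part_e = adj_e)
adj_e > adj_a  # same R-squared, fewer predictors, smaller penalty
```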
confint(car.lm2, level = .95)
2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
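The interval for Price can be checked by hand as estimate ± t quantile × standard error. A small sketch with the numbers copied from the summary(car.lm2) output (variable names are just for illustration):

```r
# Values copied from the summary(car.lm2) output
est <- -0.05448   # Price estimate
se  <- 0.00523    # Price standard error
df  <- 397        # residual degrees of freedom

t_crit <- qt(0.975, df)              # critical value for a 95% interval
ci <- est + c(-1, 1) * t_crit * se
ci  # close to the confint() row for Price
```

Small differences in the last digits come from the rounding of the printed estimate and standard error.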
There do not appear to be any high-leverage observations or outliers in the data set.
# Cook's distance
cook = cooks.distance(car.lm2)
plot(cook)
abline(h = 4 / length(cook), col = "pink", lty = 2)
flu = which(cook > (4/length(cook)))
Carseats[flu,]
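Cook's distance combines how large a point's standardized residual is with how much leverage it has. A minimal sketch of that decomposition on a stand-in model (built-in mtcars data, not the Carseats fit); the variable names are for illustration only.

```r
# Stand-in model on built-in data (not the Carseats fit above)
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)       # leverage
r <- rstandard(fit)       # standardized residuals
k <- length(coef(fit))    # number of coefficients, including the intercept

# Cook's distance: squared standardized residual scaled by a leverage factor
cook_manual <- r^2 * h / ((1 - h) * k)
all.equal(cook_manual, cooks.distance(fit))
```

A point needs both a sizeable residual and sizeable leverage to get a large Cook's distance, which is why the flagged rows are worth inspecting even when no single plot looks alarming.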
The coefficient estimate for the regression of \(X\) onto \(Y\) will be the same as for the regression of \(Y\) onto \(X\) when the sum of squares of \(X\) is equal to the sum of squares of \(Y\).
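To see why, write out the two least squares estimates for regression without an intercept (a short derivation, assuming \(\sum_i x_i y_i \neq 0\)):

\[
\hat{\beta}_{Y \text{ onto } X} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2},
\qquad
\hat{\beta}_{X \text{ onto } Y} = \frac{\sum_{i} x_i y_i}{\sum_{i} y_i^2}
\]

The numerators are identical, so the two estimates agree exactly when \(\sum_i x_i^2 = \sum_i y_i^2\).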
# coefficient estimates are different
set.seed(123)
n = 100
x = rnorm(n, mean = 10, sd = 5)
y = 2 * x + rnorm(n, mean = 0, sd = 3)
# regression of Y onto X (no intercept)
beta.yx = sum(x * y) / sum(x^2)
beta.yx
[1] 1.969034
# regression of X onto Y (no intercept)
beta.xy = sum(x * y) / sum(y^2)
beta.xy
[1] 0.4996166
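The hand formula can be checked against `lm()` by fitting without an intercept. A minimal sketch that regenerates the same kind of data; the names `manual` and `from_lm` are just for illustration.

```r
set.seed(123)
n <- 100
x <- rnorm(n, mean = 10, sd = 5)
y <- 2 * x + rnorm(n, mean = 0, sd = 3)

# Closed-form estimate for regression through the origin
manual <- sum(x * y) / sum(x^2)

# Same model via lm(); "+ 0" removes the intercept
from_lm <- unname(coef(lm(y ~ x + 0)))

all.equal(manual, from_lm)
```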
# coefficient estimates are the same
set.seed(456)
z = rnorm(n, mean = 0, sd = 1)
x.eq = z
y.eq = -x.eq  # same sum of squares as x.eq
# Regression of Y onto X
beta_y_on_x_equal <- sum(x.eq * y.eq) / sum(x.eq^2)
beta_y_on_x_equal
[1] -1
# Regression of X onto Y
beta_x_on_y_equal <- sum(x.eq * y.eq) / sum(y.eq^2)
beta_x_on_y_equal
[1] -1
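Another way to get equal sums of squares is to make Y a random permutation of X. A sketch under that assumption; the seed and variable names are arbitrary choices for illustration.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # same values in a different order, so sum(y^2) equals sum(x^2)

b_yx <- sum(x * y) / sum(x^2)   # regression of Y onto X, no intercept
b_xy <- sum(x * y) / sum(y^2)   # regression of X onto Y, no intercept

all.equal(sum(x^2), sum(y^2))
all.equal(b_yx, b_xy)
```

Here the two coefficients match even though X and Y are not linearly related, which shows that the condition is about the sums of squares and not about the strength of the relationship.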