The difference between KNN classification and KNN regression lies in the type of response variable. KNN classification is used when the response is categorical (qualitative), and KNN regression is used when the response is numerical (quantitative). For KNN classification, we predict the class of a new observation as the majority class among its K nearest neighbors. For KNN regression, we predict the response as the average of the response values of its K nearest neighbors. KNN classification therefore predicts class labels such as 0 or 1, or Yes or No, while KNN regression predicts continuous values.
For example:
- Classification: predicting Yes/No or 0/1
- Regression: predicting house price or temperature
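The two prediction rules can be sketched in a few lines of base R. This is only an illustration: `knn_predict()` is a hypothetical helper written for this example (not a function from any package), and the toy data are made up.

```r
# Minimal base-R sketch of KNN; knn_predict() is a hypothetical helper.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new observation to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training observations
  nearest <- order(dists)[seq_len(k)]
  neighbours <- train_y[nearest]
  if (is.numeric(neighbours)) {
    mean(neighbours)                      # regression: average response
  } else {
    # classification: majority class (ties go to the first class in the table)
    names(which.max(table(neighbours)))
  }
}

# Toy data, invented for illustration only
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
class_y <- c("No", "No", "No", "Yes", "Yes", "Yes")   # categorical response
price_y <- c(100, 110, 120, 300, 310, 320)            # numerical response

pred_class <- knn_predict(train_x, class_y, c(1.5, 1.5), k = 3)
pred_price <- knn_predict(train_x, price_y, c(1.5, 1.5), k = 3)
```

The same neighbors are used in both calls; only the way their responses are combined (vote versus average) changes.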
Auto <- read.csv("Auto.csv")
Auto$horsepower <- as.numeric(as.character(Auto$horsepower))
## Warning: NAs introduced by coercion
Clean_Auto <- na.omit(Auto)
pairs(Auto[, -9])
This scatterplot matrix allows us to visualize pairwise relationships between variables.
cor(Clean_Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
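We observe that:
- mpg is negatively correlated with cylinders (-0.78), displacement (-0.81), horsepower (-0.78), and weight (-0.83).
- mpg is positively correlated with acceleration (0.42), year (0.58), and origin (0.57).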
This suggests that heavier cars with larger engines tend to have lower fuel efficiency.
We fit a multiple linear regression model using mpg as the response variable and all other variables except name as predictors.
model_fit <- lm(mpg ~ . - name, data = Clean_Auto)
summary(model_fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Clean_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
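Yes, there is a relationship between the predictors and the response.
The F-statistic (252.4 on 7 and 384 degrees of freedom) is highly significant with:
\[ p < 0.001 \]
This indicates strong evidence that at least one predictor is related to mpg.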
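As a rough check, the F-statistic can be recovered from the R-squared and the degrees of freedom reported in the summary output, and its p-value from the F distribution. The numbers below are copied from that output, so the result agrees only up to rounding.

```r
# Values copied from the summary(model_fit) output
r_squared <- 0.8215
df_model  <- 7      # number of predictors
df_resid  <- 384    # residual degrees of freedom

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_from_r2 <- (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Upper-tail probability of the reported F-statistic under the null
# hypothesis that all slope coefficients are zero
p_value <- pf(252.4, df_model, df_resid, lower.tail = FALSE)

f_from_r2
p_value
```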
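Based on p-values less than 0.05, the statistically significant predictors are displacement, weight, year, and origin. The predictors cylinders, horsepower, and acceleration are not significant at this level.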
The coefficient for year is positive (about 0.75) and statistically significant.
This suggests that, holding all other variables constant, each additional model year is associated with an increase of about 0.75 mpg, so newer cars tend to have higher fuel efficiency.
par(mfrow = c(2,2))
plot(model_fit)
[Figure: diagnostic plots for model_fit in four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]
Overall, the model performs reasonably well, although some non-linear effects may exist.
fit1 <- lm(mpg ~ horsepower * weight, data = Clean_Auto)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ horsepower * weight, data = Clean_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7725 -2.2074 -0.2708 1.9973 14.7314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.356e+01 2.343e+00 27.127 < 2e-16 ***
## horsepower -2.508e-01 2.728e-02 -9.195 < 2e-16 ***
## weight -1.077e-02 7.738e-04 -13.921 < 2e-16 ***
## horsepower:weight 5.355e-05 6.649e-06 8.054 9.93e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared: 0.7484, Adjusted R-squared: 0.7465
## F-statistic: 384.8 on 3 and 388 DF, p-value: < 2.2e-16
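Model performance: the interaction term horsepower:weight is highly significant (p = 9.93e-15), and the model has an R-squared of 0.7484 (adjusted 0.7465).
This suggests a strong interaction effect: the association between horsepower and mpg depends on weight.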
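To make the interaction concrete, the fitted slope of mpg with respect to horsepower is the horsepower coefficient plus the interaction coefficient times weight. The sketch below uses the estimates from the summary(fit1) output; the three weights are arbitrary values chosen for illustration.

```r
# Coefficients copied from the summary(fit1) output
b_horsepower  <- -2.508e-01
b_interaction <-  5.355e-05

# Change in fitted mpg per extra unit of horsepower, at a given weight
hp_slope <- function(weight) b_horsepower + b_interaction * weight

# Illustrative weights (arbitrary choices)
slopes <- hp_slope(c(2000, 3000, 4000))
slopes
```

Under these estimates the slope is negative at all three weights but moves toward zero as weight grows, which is what the positive interaction coefficient implies.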
fit2 <- lm(mpg ~ weight + I(weight^2), data = Clean_Auto)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ weight + I(weight^2), data = Clean_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6246 -2.7134 -0.3485 1.8267 16.0866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.226e+01 2.993e+00 20.800 < 2e-16 ***
## weight -1.850e-02 1.972e-03 -9.379 < 2e-16 ***
## I(weight^2) 1.697e-06 3.059e-07 5.545 5.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared: 0.7151, Adjusted R-squared: 0.7137
## F-statistic: 488.3 on 2 and 389 DF, p-value: < 2.2e-16
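Results: the quadratic term I(weight^2) is highly significant (p = 5.43e-08), and the model has an R-squared of 0.7151 (adjusted 0.7137).
This suggests a nonlinear relationship between weight and mpg.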
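One way to read the quadratic fit is through its derivative: the fitted slope of mpg with respect to weight is the linear coefficient plus twice the quadratic coefficient times weight. The sketch below uses the estimates from the summary(fit2) output; the evaluation weights are arbitrary.

```r
# Coefficients copied from the summary(fit2) output
b_weight  <- -1.850e-02
b_weight2 <-  1.697e-06

# Derivative of the fitted curve: change in mpg per extra unit of weight
weight_slope <- function(weight) b_weight + 2 * b_weight2 * weight

# Weight at which the fitted parabola reaches its minimum
turning_point <- -b_weight / (2 * b_weight2)

slope_light <- weight_slope(2000)   # illustrative lighter car
slope_heavy <- weight_slope(4000)   # illustrative heavier car
```

For weights below the turning point the fitted slope is negative but shrinks in magnitude, so each additional unit of weight costs less mpg for heavy cars than for light ones.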
Carseats <- read.csv("Carseats.csv")
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : int 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : int 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : int 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : chr "Bad" "Good" "Medium" "Medium" ...
## $ Age : int 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : int 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : chr "Yes" "Yes" "Yes" "Yes" ...
## $ US : chr "Yes" "Yes" "Yes" "Yes" ...
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
sales_model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(sales_model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
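Results:
Regression equation:
\[ \widehat{Sales} = 13.043 - 0.05446(Price) - 0.02192(UrbanYes) + 1.20057(USYes) \]
Interpretation:
- Price: holding Urban and US fixed, a one-unit increase in Price is associated with a decrease of about 0.054 in Sales.
- UrbanYes: the estimated difference between urban and non-urban stores (-0.022) is not statistically significant (p = 0.936).
- USYes: holding the other predictors fixed, US stores have Sales about 1.2 higher on average than non-US stores.
Using significance level:
\[ \alpha = 0.05 \]
Reject null hypothesis for: Price and USYes. We fail to reject it for UrbanYes.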
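The regression equation can be turned into a small prediction function. `predict_sales()` is a hypothetical helper written for this illustration; it hard-codes the estimates from the summary(sales_model) output, and the input values are arbitrary.

```r
# Hypothetical helper using the coefficients from summary(sales_model)
predict_sales <- function(price, urban, us) {
  13.043469 -
    0.054459 * price -
    0.021916 * (urban == "Yes") +   # dummy variable: 1 if Urban is "Yes"
    1.200573 * (us == "Yes")        # dummy variable: 1 if US is "Yes"
}

# Fitted Sales for an urban US store with Price = 100 (arbitrary example)
sales_urban_us <- predict_sales(100, "Yes", "Yes")

# Difference between US and non-US stores at the same Price and Urban value
us_effect <- predict_sales(100, "No", "Yes") - predict_sales(100, "No", "No")
```

The second quantity equals the USYes coefficient regardless of the Price chosen, because the model has no interaction terms.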
lm.fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
anova(sales_model, lm.fit2)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
The p-value is:
\[ 0.9357 \]
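Since this is far above 0.05, we fail to reject the null hypothesis that the Urban coefficient is zero. Dropping Urban does not significantly worsen the fit, so the reduced model performs similarly to the full model.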
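The F-test in the ANOVA table can be reproduced by hand from the residual sums of squares. The values below are copied from the table, so the result matches only up to rounding.

```r
# Values copied from the anova(sales_model, lm.fit2) table
rss_full   <- 2420.8    # RSS of the full model (Price + Urban + US)
rss_change <- 0.03979   # increase in RSS when Urban is dropped
df_dropped <- 1         # number of coefficients removed
df_full    <- 396       # residual degrees of freedom of the full model

# F = (change in RSS / coefficients dropped) / (full RSS / residual df)
f_stat  <- (rss_change / df_dropped) / (rss_full / df_full)
p_value <- pf(f_stat, df_dropped, df_full, lower.tail = FALSE)

f_stat
p_value
```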
confint(lm.fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Interpretation:
Price coefficient confidence interval:
\[ (-0.0648,\ -0.0442) \]
We are 95% confident that the true Price coefficient lies within this interval. Because the interval excludes zero, the effect of Price is significant at the 5% level.
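The interval for Price can be rebuilt from the estimate, its standard error, and a t quantile. The inputs below are the rounded values from the summary(lm.fit2) output, so the endpoints agree with confint() only to about four decimal places.

```r
# Values copied from the summary(lm.fit2) output
estimate <- -0.05448
std_err  <-  0.00523
df_resid <-  397

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_err
ci
```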
plot(predict(lm.fit2), rstudent(lm.fit2))
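Interpretation:
The studentized residuals plotted against the fitted values do not indicate a substantial number of outliers. This plot does not show leverage, so highly influential observations would need a separate check, for example with hatvalues(lm.fit2).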
The slopes of the no-intercept regressions of y on x and of x on y are equal when:
\[ \sum x_i^2 = \sum y_i^2 \]
That is, both variables have the same Euclidean norm.
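This condition follows from the least-squares solution for regression through the origin. Minimizing \( \sum (y_i - \beta x_i)^2 \) over \( \beta \), and likewise with the roles of x and y swapped, gives:
\[ \hat{\beta}_{y \sim x} = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad \hat{\beta}_{x \sim y} = \frac{\sum x_i y_i}{\sum y_i^2} \]
The numerators are identical, so the two slopes coincide when the denominators are equal. They also coincide, trivially, when \( \sum x_i y_i = 0 \), in which case both slopes are zero.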
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)
coef(lm(y ~ x + 0))
## x
## -0.006123917
coef(lm(x ~ y + 0))
## y
## -0.005455947
Output: the slope of y on x is -0.00612, while the slope of x on y is -0.00546.
The coefficients are different because:
\[ \sum x_i^2 \neq \sum y_i^2 \]
set.seed(2)
x <- 1:100
y <- 100:1
eg3 <- lm(y ~ x + 0)
eg4 <- lm(x ~ y + 0)
coef(eg3)
## x
## 0.5074627
coef(eg4)
## y
## 0.5074627
Output: both coefficients are equal to 0.5074627.
This occurs because:
\[ \sum x_i^2 = \sum y_i^2 \]
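The equality of the sums of squares, and the resulting slope, can be checked directly. This sketch recomputes both slopes from the closed-form expression for regression through the origin (cross-product sum divided by the predictor's sum of squares) and compares one of them with lm().

```r
x <- 1:100
y <- 100:1

# y is x in reverse order, so the sums of squares are identical
sum_x2 <- sum(x^2)
sum_y2 <- sum(y^2)
sum_xy <- sum(x * y)

# Closed-form slopes for regression through the origin
slope_y_on_x <- sum_xy / sum_x2
slope_x_on_y <- sum_xy / sum_y2

# Same slope as estimated by lm() without an intercept
slope_lm <- unname(coef(lm(y ~ x + 0)))
```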