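The KNN classifier is used when the response variable is categorical, whereas the KNN regressor is appropriate only when the response is numeric. The classifier predicts the most common class among the K nearest neighbours; the regressor predicts the average of their responses.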
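To make the distinction concrete, here is a minimal base R sketch of both variants with a single numeric predictor. The function name `knn_predict` and the toy data are made up for illustration; this is not how the `class` or `FNN` packages implement KNN.

```r
# Minimal KNN sketch: one numeric predictor, absolute distance.
# knn_predict() is a hypothetical helper written for this illustration.
knn_predict <- function(x_train, y_train, x0, k = 3) {
  nearest <- order(abs(x_train - x0))[seq_len(k)]  # indices of the k closest points
  neighbours <- y_train[nearest]
  if (is.numeric(neighbours)) {
    mean(neighbours)                     # regression: average the responses
  } else {
    names(which.max(table(neighbours)))  # classification: majority vote
  }
}

x_train <- c(1, 2, 3, 10, 11, 12)
y_class <- factor(c("low", "low", "low", "high", "high", "high"))
y_num   <- c(1.0, 1.5, 2.0, 9.0, 10.0, 11.0)

class_pred <- knn_predict(x_train, y_class, x0 = 2.5, k = 3)  # "low"
num_pred   <- knn_predict(x_train, y_num,   x0 = 2.5, k = 3)  # 1.5
```

The only difference between the two cases is the final aggregation step; the neighbour search is identical.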
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.2.1
pairs(Auto)
cor(Auto[ ,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
summary(lm(mpg ~ . -name, data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Ans: Yes, there is a relationship between the predictors and the response: the F-statistic is 252.4 with a p-value below 2.2e-16. However, some predictors do not have a statistically significant effect on the response (their p-values are above 0.05).
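As a rough check, the overall F-statistic can be recovered from the printed R-squared and degrees of freedom. The numbers below are copied from the summary output, so the result only matches up to rounding of R-squared; the variable names are my own.

```r
# Values copied from the summary(lm(mpg ~ . - name)) output
r2       <- 0.8215  # Multiple R-squared (rounded in the printout)
p        <- 7       # number of predictors
df_resid <- 384     # residual degrees of freedom

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat  <- (r2 / p) / ((1 - r2) / df_resid)
p_value <- pf(f_stat, p, df_resid, lower.tail = FALSE)

f_stat   # close to the reported 252.4
p_value  # far below 0.05
```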
Ans: displacement, weight, year, and origin have statistically significant relationships with the response (mpg).
Ans: With every other predictor held constant, mpg increases with each model year. The coefficient of year is 0.750773 (about 3/4), which suggests that mpg goes up by about 3 every 4 years.
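The arithmetic behind reading the year coefficient can be spelled out directly. The estimate is copied from the summary output; the variable names are only for illustration.

```r
beta_year <- 0.750773        # estimate for year, copied from the summary output

change_1yr <- beta_year * 1  # expected change in mpg after one model year
change_4yr <- beta_year * 4  # expected change after four model years, about 3 mpg

c(one_year = change_1yr, four_years = change_4yr)
```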
plot(lm(mpg ~ . -name, data=Auto))
The Residuals vs. Fitted plot shows a U-shaped pattern in the residuals, which indicates that the relationship between the predictors and mpg is non-linear. The Normal Q-Q plot shows that the residuals are approximately normally distributed, apart from a right skew in the upper tail. The spread of the residuals starts off small and then increases, which indicates heteroscedasticity (non-constant variance).
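The plot shows no extreme outliers, although the largest residual in the summary output (13.06, about 3.9 residual standard errors) exceeds the usual cutoff of 3 and is worth a closer look.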
Based on the Residuals vs. Leverage plot, no observation lies beyond the Cook's distance contours (the dashed red lines), so no point appears to be unduly influential.
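The visual checks above can be backed up numerically with `rstudent()` and `hatvalues()`. The sketch below uses a small made-up dataset with one planted outlier, not the Auto data. The helper `flag_unusual()` is hypothetical, and the cutoffs (|studentized residual| > 3, leverage above twice the average) are common rules of thumb rather than fixed rules.

```r
# Hypothetical helper: flag possible outliers and high-leverage points in an lm fit
flag_unusual <- function(fit, resid_cutoff = 3) {
  r <- rstudent(fit)                      # studentized (deleted) residuals
  h <- hatvalues(fit)                     # leverage statistics
  avg_h <- length(coef(fit)) / length(h)  # average leverage, (p + 1) / n
  list(
    outliers         = unname(which(abs(r) > resid_cutoff)),
    high_leverage    = unname(which(h > 2 * avg_h)),
    average_leverage = avg_h
  )
}

# Toy data (assumption for illustration): a straight line with one shifted point
x <- 1:20
y <- 2 * x + rep(c(-1, 1), 10)
y[10] <- y[10] + 30                       # plant an outlier at observation 10

toy_fit <- lm(y ~ x)
flagged <- flag_unusual(toy_fit)
flagged$outliers                          # observation 10 is flagged
```

For the model in this exercise, the same helper could be called on `lm(mpg ~ . - name, data = Auto)`.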
summary(lm(mpg~.-name + horsepower*displacement, data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + horsepower * displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7010 -1.6009 -0.0967 1.4119 12.6734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.894e+00 4.302e+00 -0.440 0.66007
## cylinders 6.466e-01 3.017e-01 2.143 0.03275 *
## displacement -7.487e-02 1.092e-02 -6.859 2.80e-11 ***
## horsepower -1.975e-01 2.052e-02 -9.624 < 2e-16 ***
## weight -3.147e-03 6.475e-04 -4.861 1.71e-06 ***
## acceleration -2.131e-01 9.062e-02 -2.351 0.01921 *
## year 7.379e-01 4.463e-02 16.534 < 2e-16 ***
## origin 6.891e-01 2.527e-01 2.727 0.00668 **
## displacement:horsepower 5.236e-04 4.813e-05 10.878 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.912 on 383 degrees of freedom
## Multiple R-squared: 0.8636, Adjusted R-squared: 0.8608
## F-statistic: 303.1 on 8 and 383 DF, p-value: < 2.2e-16
summary(lm(mpg~.-name + horsepower*origin, data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + horsepower * origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.277 -1.875 -0.225 1.570 12.080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.196e+01 4.396e+00 -4.996 8.94e-07 ***
## cylinders -5.275e-01 3.028e-01 -1.742 0.0823 .
## displacement -1.486e-03 7.607e-03 -0.195 0.8452
## horsepower 8.173e-02 1.856e-02 4.404 1.38e-05 ***
## weight -4.710e-03 6.555e-04 -7.186 3.52e-12 ***
## acceleration -1.124e-01 9.617e-02 -1.168 0.2434
## year 7.327e-01 4.780e-02 15.328 < 2e-16 ***
## origin 7.695e+00 8.858e-01 8.687 < 2e-16 ***
## horsepower:origin -7.955e-02 1.074e-02 -7.405 8.44e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.116 on 383 degrees of freedom
## Multiple R-squared: 0.8438, Adjusted R-squared: 0.8406
## F-statistic: 258.7 on 8 and 383 DF, p-value: < 2.2e-16
I fitted linear regression models with interaction effects (horsepower and displacement, horsepower and origin). Both interaction terms are statistically significant (p-values below 0.001).
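A note on the formula syntax: `a * b` expands to `a + b + a:b`, and because `.` already contains the main effects, adding `a * b` to `.` contributes only the `a:b` term. The toy data frame below is made up to show the expansion without needing the Auto data.

```r
# Toy data frame (assumption for illustration only)
toy <- data.frame(y = c(1, 3, 2, 5, 4, 6),
                  a = c(1, 2, 3, 4, 5, 6),
                  b = c(2, 1, 4, 3, 6, 5))

# a * b is shorthand for both main effects plus the interaction
star_terms <- attr(terms(y ~ a * b), "term.labels")

# With "." already supplying a and b, "*" and ":" give the same model terms
dot_star  <- attr(terms(y ~ . + a * b, data = toy), "term.labels")
dot_colon <- attr(terms(y ~ . + a:b,   data = toy), "term.labels")

star_terms  # "a" "b" "a:b"
dot_star
dot_colon
```

This matches the coefficient tables above, where each main effect appears once and a single interaction row is added.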
summary(lm(mpg ~ . -name + log(horsepower), data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + log(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5777 -1.6623 -0.1213 1.4913 12.0230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.674e+01 1.106e+01 7.839 4.54e-14 ***
## cylinders -5.530e-02 2.907e-01 -0.190 0.849230
## displacement -4.607e-03 7.108e-03 -0.648 0.517291
## horsepower 1.764e-01 2.269e-02 7.775 7.05e-14 ***
## weight -3.366e-03 6.561e-04 -5.130 4.62e-07 ***
## acceleration -3.277e-01 9.670e-02 -3.388 0.000776 ***
## year 7.421e-01 4.534e-02 16.368 < 2e-16 ***
## origin 8.976e-01 2.528e-01 3.551 0.000432 ***
## log(horsepower) -2.685e+01 2.652e+00 -10.127 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.959 on 383 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8562
## F-statistic: 292.1 on 8 and 383 DF, p-value: < 2.2e-16
summary(lm(mpg ~ . -name + I(horsepower^2), data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5497 -1.7311 -0.2236 1.5877 11.9955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3236564 4.6247696 0.286 0.774872
## cylinders 0.3489063 0.3048310 1.145 0.253094
## displacement -0.0075649 0.0073733 -1.026 0.305550
## horsepower -0.3194633 0.0343447 -9.302 < 2e-16 ***
## weight -0.0032712 0.0006787 -4.820 2.07e-06 ***
## acceleration -0.3305981 0.0991849 -3.333 0.000942 ***
## year 0.7353414 0.0459918 15.989 < 2e-16 ***
## origin 1.0144130 0.2545545 3.985 8.08e-05 ***
## I(horsepower^2) 0.0010060 0.0001065 9.449 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.001 on 383 degrees of freedom
## Multiple R-squared: 0.8552, Adjusted R-squared: 0.8522
## F-statistic: 282.8 on 8 and 383 DF, p-value: < 2.2e-16
I fitted models that add a log transformation and a squared transformation of horsepower. Both transformed terms are statistically significant (p-value below 2e-16 in each case).
library(ISLR)
##
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
##
## Auto, Credit
data(Carseats)
lm(Sales ~ Price+Urban+US, data= Carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Coefficients:
## (Intercept) Price UrbanYes USYes
## 13.04347 -0.05446 -0.02192 1.20057
summary(lm(Sales ~ Price+Urban+US, data= Carseats))
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
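Price: The coefficient is negative, so as Price increases, Sales decreases: holding the other predictors fixed, a one-unit increase in Price is associated with a decrease of about 0.054 in Sales (roughly 54 units, since Sales is recorded in thousands).
UrbanYes: The regression gives no evidence of a relationship between Sales and whether the store is in an urban location (p-value 0.936).
USYes: There is a positive relationship between USYes and Sales: holding the other predictors fixed, a store in the US sells approximately 1,201 more units on average than a store outside the US.
Sales = 13.04347 - 0.05446*Price - 0.02192*UrbanYes + 1.20057*USYes, where UrbanYes and USYes equal 1 if the store is urban or in the US, respectively, and 0 otherwise.
Based on the p-values, I can reject the null hypothesis for Price and USYes.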
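The fitted equation can be evaluated by hand for a given store. The sketch below hard-codes the coefficients from the output above; `predict_sales()` is a hypothetical helper, and the price of 100 is an arbitrary example value.

```r
# Hypothetical helper using the coefficients copied from the fitted model
predict_sales <- function(price, urban, us) {
  13.04347 - 0.05446 * price -
    0.02192 * as.numeric(urban) +  # UrbanYes dummy: 1 if urban, else 0
    1.20057 * as.numeric(us)       # USYes dummy: 1 if in the US, else 0
}

sales_us     <- predict_sales(price = 100, urban = TRUE, us = TRUE)
sales_non_us <- predict_sales(price = 100, urban = TRUE, us = FALSE)

sales_us                 # predicted Sales in thousands of units
sales_us - sales_non_us  # equals the USYes coefficient, 1.20057
```

In practice, `predict()` on the fitted `lm` object with a `newdata` data frame does the same job without copying coefficients.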
mod <- lm(Sales ~ Price + US, data = Carseats)
summary(mod)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
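Based on the residual standard error and R-squared, both models fit the data similarly, but the linear regression from (e) fits slightly better: its residual standard error is lower (2.469 vs. 2.472) and its adjusted R-squared is higher (0.2354 vs. 0.2335), while the multiple R-squared is unchanged at 0.2393.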
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
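The intervals printed by `confint()` can be reproduced from the estimates and standard errors in the summary table as estimate ± t-quantile × standard error. The values below are copied from the output, so the match is only up to rounding.

```r
# Estimates and standard errors copied from summary(mod)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)
df_resid <- 397                 # residual degrees of freedom

t_crit <- qt(0.975, df_resid)   # two-sided 95% critical value

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
ci
```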
plot(predict(mod), rstudent(mod))
All studentized residuals fall within the range [-3, 3], so there are no evident outliers.
par(mfrow=c(2,2))
plot(mod)
The Residuals vs. Leverage plot suggests that a few observations have high leverage.
The coefficient estimates for the regression of Y onto X and of X onto Y, both without an intercept, are the same iff ∑xj² = ∑yj².
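The condition follows from the closed-form least squares estimates for regression through the origin. A short derivation, writing the two slopes as \hat{\beta}_{Y|X} and \hat{\beta}_{X|Y} (notation chosen here for illustration):

```latex
\hat{\beta}_{Y|X} = \frac{\sum_j x_j y_j}{\sum_j x_j^2},
\qquad
\hat{\beta}_{X|Y} = \frac{\sum_j x_j y_j}{\sum_j y_j^2}.

\text{If } \sum_j x_j y_j \neq 0:\quad
\hat{\beta}_{Y|X} = \hat{\beta}_{X|Y}
\iff
\sum_j x_j^2 = \sum_j y_j^2 .
```

If ∑xj yj = 0, both estimates are zero and therefore equal regardless of the sums of squares.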
# Different sums of squares, so the two coefficient estimates differ
set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.1)
sum(y^2)
## [1] 1353606
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.223590 -0.062560 0.004426 0.058507 0.230926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0001514 0.0001548 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115418 -0.029231 -0.002186 0.031322 0.111795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.87e-05 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
# Equal sums of squares, so the two coefficient estimates coincide
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
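The matching estimates in the second example can also be checked against the closed-form slope for regression through the origin, sum(x * y) / sum(x^2). The helper name `slope_no_intercept()` is made up for this sketch.

```r
# Hypothetical helper: least squares slope for a regression without an intercept
slope_no_intercept <- function(response, predictor) {
  sum(predictor * response) / sum(predictor^2)
}

x <- 1:100
y <- 100:1

b_yx <- slope_no_intercept(y, x)       # y regressed onto x
b_xy <- slope_no_intercept(x, y)       # x regressed onto y

b_lm <- unname(coef(lm(y ~ x + 0)))    # same slope from lm()

c(b_yx = b_yx, b_xy = b_xy, b_lm = b_lm)  # all about 0.5075
```

Both slopes share the numerator sum(x * y), and here the denominators are both 338350, which is why the estimates agree.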