Both KNN regression and KNN classification are non-parametric; the main difference lies in the type of response they predict for a new data point. KNN regression returns a numerical prediction: the average response of the K nearest data points. Because the prediction is an average, it is highly susceptible to extreme outliers when the number of neighbors is small. A KNN classifier instead assigns the new data point to a group, using the mode (majority class) of the K nearest neighbors rather than their average. Classification also yields a decision boundary that shows where future points of each class are expected to lie and gives insight into the shape of the groups.
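The contrast between averaging and taking the mode can be sketched in a few lines of base R. This is a minimal illustration on made-up one-dimensional data, not part of the exercise; the helper `knn_predict` and the toy vectors are hypothetical names introduced here.

```r
# Toy training data (made up for illustration): one predictor, a numeric
# response for regression, and a class label for classification.
train_x   <- c(1, 2, 3, 6, 7, 8)
train_num <- c(1.0, 1.2, 0.9, 5.1, 4.8, 50)   # the last value is an outlier
train_cls <- c("a", "a", "a", "b", "b", "b")

# Hypothetical helper: find the k training points nearest to x0, then
# average their responses (regression) or return the most common label
# (classification).
knn_predict <- function(x0, k, y) {
  nearest <- order(abs(train_x - x0))[seq_len(k)]
  if (is.numeric(y)) {
    mean(y[nearest])
  } else {
    names(which.max(table(y[nearest])))
  }
}

reg_pred <- knn_predict(7, k = 3, y = train_num)  # mean of 4.8, 5.1 and 50
cls_pred <- knn_predict(7, k = 3, y = train_cls)  # mode of "b", "b", "b"
print(reg_pred)
print(cls_pred)
```

With only three neighbors, the single outlier pulls the regression prediction far above the other two responses, while the class prediction is unaffected.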
library(ISLR2)
Warning: package 'ISLR2' was built under R version 4.5.3
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Question 9
attach(Auto)
The following object is masked from package:lubridate:
origin
The following object is masked from package:ggplot2:
mpg
#a
pairs(Auto)
#b: remove the name variable (the last column)
subset_Auto <- select(Auto, -name)
round(cor(subset_Auto), 2)
Call:
lm(formula = mpg ~ ., data = subset_Auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
When mpg is regressed on all predictors except name, the resulting model explains about \(81.8\%\) of the variance in mpg according to the adjusted \(R^2\). The large F-statistic (252.4, p-value < 2.2e-16) indicates that at least one predictor has a coefficient that is not 0. The following are the significant predictors:
displacement: Holding all other variables constant, a one-unit increase in displacement is associated with an increase of about 0.020 in mpg.
weight: Holding all other variables constant, a one-unit increase in weight is associated with a decrease of about 0.0065 in mpg.
year: Holding all other variables constant, a one-unit increase in year is associated with an increase of about 0.75 in mpg.
origin: Holding all other variables constant, a one-unit increase in origin is associated with an increase of about 1.43 in mpg.
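The "holding all other variables constant" reading can be checked directly from the printed estimates. The sketch below copies the coefficients from the summary output above and compares two predictions that differ only in year; the car's predictor values are hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the summary output above.
b <- c(intercept = -17.218435, cylinders = -0.493376, displacement = 0.019896,
       horsepower = -0.016951, weight = -0.006474, acceleration = 0.080576,
       year = 0.750773, origin = 1.426141)

# Hypothetical car (values are assumptions, not taken from the data).
# The leading 1 multiplies the intercept.
car_a <- c(intercept = 1, cylinders = 4, displacement = 120, horsepower = 90,
           weight = 2500, acceleration = 15, year = 76, origin = 1)
car_b <- car_a
car_b["year"] <- car_a["year"] + 1   # same car, one model year newer

pred_a <- sum(b * car_a)
pred_b <- sum(b * car_b)
print(pred_b - pred_a)   # difference in predicted mpg equals the year coefficient
```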
#d
par(mfrow = c(2, 2))
plot(auto_lm)
From the residual plots we see several violations of the linear model assumptions. The slight curvature in the residuals vs. fitted plot implies a nonlinear relationship that the full model does not capture. The Q-Q plot suggests that the residuals depart from a normal distribution at the positive extremes. The scale-location plot (square root of the standardized residuals) points to non-constant variance (heteroscedasticity). Finally, the residuals vs. leverage plot flags observation 14 as a potentially influential point, with high leverage and a position on the boundary of the Cook’s distance contour.
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Call:
lm(formula = mpg ~ horsepower * weight + weight + horsepower +
year, data = subset_Auto)
Residuals:
Min 1Q Median 3Q Max
-7.9146 -1.8987 -0.0386 1.5536 12.6333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.577e+00 3.911e+00 0.915 0.361
horsepower -2.236e-01 2.063e-02 -10.837 <2e-16 ***
weight -1.185e-02 5.868e-04 -20.198 <2e-16 ***
year 7.749e-01 4.508e-02 17.190 <2e-16 ***
horsepower:weight 5.790e-05 5.020e-06 11.534 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.963 on 387 degrees of freedom
Multiple R-squared: 0.8574, Adjusted R-squared: 0.8559
F-statistic: 581.5 on 4 and 387 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(inter)
From the smaller model mpg ~ weight*horsepower + year we achieve a high F-statistic (581.5) and an adjusted \(R^2\) of about \(0.856\) while addressing the issue of multicollinearity. The interaction term is significant, with a p-value of \(<2\times10^{-16}\ll 0.05\). However, the nonlinear pattern in the residuals remains, and the residuals still depart from normality at the positive extremes.
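One common way to quantify multicollinearity is the variance inflation factor (VIF); the car package loaded above provides `vif()` for fitted models. The sketch below computes the same quantity by hand on simulated data, so the predictors `x1`, `x2`, `x3` and the helper `vif_manual` are assumptions made for illustration rather than part of the Auto analysis.

```r
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                  # unrelated to the other two

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on the remaining predictors.
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, data.frame(x2, x3))
vif_x3 <- vif_manual(x3, data.frame(x1, x2))
print(c(x1 = vif_x1, x3 = vif_x3))
```

A large value for `x1` and a value near 1 for `x3` reflect that only the first pair of predictors carries overlapping information.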
Call:
lm(formula = mpg ~ horsepower * I(weight^2) + I(weight^2) + horsepower +
year, data = subset_Auto)
Residuals:
Min 1Q Median 3Q Max
-7.6169 -1.9979 -0.0211 1.7409 12.7008
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.002e+01 3.779e+00 -2.65 0.00837 **
horsepower -1.650e-01 1.259e-02 -13.10 < 2e-16 ***
I(weight^2) -2.092e-06 1.026e-07 -20.39 < 2e-16 ***
year 7.596e-01 4.593e-02 16.54 < 2e-16 ***
horsepower:I(weight^2) 1.101e-08 7.462e-10 14.75 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.024 on 387 degrees of freedom
Multiple R-squared: 0.8514, Adjusted R-squared: 0.8499
F-statistic: 554.3 on 4 and 387 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(sqr_trans)
The final transformation, \(\log(X)\), gave the best fit with the model log(mpg) ~ horsepower*weight + year, which slightly outperformed the \(\sqrt{X}\) and \(X^2\) models. This final model addresses the nonlinear pattern and brings the residuals closer to normality. The variance of the residuals is now roughly constant (homoscedastic), and observation 14 is no longer influential. The model did increase in complexity, but it has the highest adjusted \(R^2\) so far, at \(88.74\%\).
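Why a log response can straighten a curved relationship can be sketched on simulated data. The data-generating model below is an assumption chosen so that the response is exponential in `x`; it is not the Auto data. Note that the two adjusted \(R^2\) values describe fit to different responses (`y` and `log(y)`), so the comparison is indicative only.

```r
set.seed(4)
n <- 200
x <- runif(n, 0, 6)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.1))   # curved on the original scale

fit_raw <- lm(y ~ x)        # straight line fitted to a curved relationship
fit_log <- lm(log(y) ~ x)   # linear after taking logs of the response

adj_raw <- summary(fit_raw)$adj.r.squared
adj_log <- summary(fit_log)$adj.r.squared
print(round(c(raw = adj_raw, log = adj_log), 3))
```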
Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The coefficients of the model above are interpreted as follows:
Price is statistically significant: Holding all other variables constant, a one-unit increase in price is associated with a decrease in sales of about 54.5 units (Sales is recorded in thousands, so the coefficient \(-0.0545\) corresponds to \(0.0545\times 1000\) units).
With a p-value of \(<2\times10^{-16}\ll 0.05\), we reject the null hypothesis \(H_0:\beta_{Price}=0\) and conclude that \(\beta_{Price}\neq 0\).
UrbanYes is not statistically significant: Holding all other variables constant, there is no statistical evidence of a difference in car seat sales between urban and rural locations.
With a p-value of \(0.936>0.05\), we fail to reject the null hypothesis \(H_0:\beta_{Urban}=0\).
USYes is statistically significant: Holding all other variables constant, a store located in the United States is associated with sales about 1,200 units higher than a store outside the United States (the baseline USNo).
With a p-value of \(4.86\times10^{-6}<0.05\), we reject the null hypothesis \(H_0:\beta_{US}=0\) and conclude that \(\beta_{US}\neq 0\).
Call:
lm(formula = Sales ~ Price + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
anova(fit, smaller_fit)
Analysis of Variance Table
Model 1: Sales ~ Price + Urban + US
Model 2: Sales ~ Price + US
Res.Df RSS Df Sum of Sq F Pr(>F)
1 396 2420.8
2 397 2420.9 -1 -0.03979 0.0065 0.9357
From the partial F-test of the two models, the F-statistic of 0.0065 yields a p-value of \(0.9357>0.05\), so we fail to reject the null hypothesis that the coefficient of Urban is zero. Dropping Urban therefore does not significantly change the RSS. Following the principle of parsimony, we select the simpler reduced model.
The table above identifies 5 points in the hat column that are classified as high leverage, meaning that the predictor values for those observations differ substantially from the mean of the whole dataset. Meanwhile, the response values of these observations are well behaved, since none of them crosses the Cook’s distance threshold.
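The F-statistic in the ANOVA table can be reproduced by hand from the printed sums of squares. The sketch below uses only the rounded values shown in the table, so the result agrees with the table up to rounding; the variable names are introduced here for illustration.

```r
# Values copied from the ANOVA table above (rounded as printed).
rss_full  <- 2420.8    # RSS of Sales ~ Price + Urban + US
df_full   <- 396       # residual degrees of freedom of the full model
ss_diff   <- 0.03979   # increase in RSS after dropping Urban
n_dropped <- 1         # number of coefficients removed

# Partial F-statistic: extra RSS per dropped coefficient, relative to the
# residual variance estimate of the full model.
f_stat  <- (ss_diff / n_dropped) / (rss_full / df_full)
p_value <- pf(f_stat, n_dropped, df_full, lower.tail = FALSE)
print(round(f_stat, 4))
print(round(p_value, 4))
```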
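Leverage and Cook’s distance can both be extracted from a fitted model with `hatvalues()` and `cooks.distance()`. The sketch below uses simulated data with one deliberately extreme predictor value, and it flags leverage with the rule-of-thumb cutoff \(2(p+1)/n\); that cutoff is an assumption, since the threshold used for the table above is not stated.

```r
set.seed(1)
n <- 50
x <- c(rnorm(n - 1), 8)       # the last point has an extreme predictor value
y <- 2 + 3 * x + rnorm(n)     # every point follows the same linear model
fit_sim <- lm(y ~ x)

p    <- length(coef(fit_sim)) - 1   # number of predictors
lev  <- hatvalues(fit_sim)          # leverage (diagonal of the hat matrix)
cook <- cooks.distance(fit_sim)     # influence on the fitted coefficients

# Rule-of-thumb cutoff for high leverage (an assumption, see lead-in).
high_leverage <- which(lev > 2 * (p + 1) / n)
print(high_leverage)
print(round(cook[high_leverage], 3))
```

A point can have high leverage without being influential: leverage depends only on the predictor values, while Cook’s distance also depends on how far the response falls from the fitted line.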
Question 12
The coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X (both without an intercept) when X and Y are identical or, more generally, when they have the same sum of squares.
Define the coefficient estimate for the regression of \(Y\) onto \(X\) as: \[\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\]
Next define the coefficient estimate for the regression of \(X\) onto \(Y\) as: \[\hat{\alpha} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n y_i^2}\]
For the cases where \(\hat{\beta} = \hat{\alpha}\), assuming \(\sum_{i=1}^n x_i y_i \neq 0\): \[\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n y_i^2} \implies \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 \quad \text{Q.E.D.}\]
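The formulas above describe regression through the origin, so a numerical check should drop the intercept with `+ 0` in the model formula. The sketch below is one way to do that on simulated data; the seed and the vector `y_perm` are assumptions made for illustration. Permuting `x` produces a different vector with the same sum of squares, which is the equality condition derived above.

```r
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form no-intercept estimates from the formulas above.
beta_hat  <- sum(x * y) / sum(x^2)   # Y onto X
alpha_hat <- sum(x * y) / sum(y^2)   # X onto Y

# The same estimates from lm(); "+ 0" removes the intercept.
beta_lm  <- unname(coef(lm(y ~ x + 0)))
alpha_lm <- unname(coef(lm(x ~ y + 0)))

# A permutation of x has the same sum of squares as x, so the two
# coefficients coincide even though the vectors differ.
y_perm     <- sample(x)
beta_perm  <- sum(x * y_perm) / sum(x^2)
alpha_perm <- sum(x * y_perm) / sum(y_perm^2)

print(c(beta_hat, beta_lm))
print(c(alpha_hat, alpha_lm))
print(c(beta_perm, alpha_perm))
```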
#b
set.seed(42)
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(100, mean = 0, sd = 13)
fit_Y_on_X <- lm(y ~ x)
fit_X_on_Y <- lm(x ~ y)
coef_Y_on_X <- coef(fit_Y_on_X)
coef_X_on_Y <- coef(fit_X_on_Y)
cat("Coeff of Y onto X:", round(coef_Y_on_X, 2))
Coeff of Y onto X: -1.15 0.35
cat("\n")
cat("Coeff of X onto Y:", round(coef_X_on_Y, 2))
Coeff of X onto Y: 0.04 0
#c
y <- x
fit_Y_on_X <- lm(y ~ x)
fit_X_on_Y <- lm(x ~ y)
coef_Y_on_X <- coef(fit_Y_on_X)
coef_X_on_Y <- coef(fit_X_on_Y)
cat("Coeff of Y onto X:", round(coef_Y_on_X, 2))