It really comes down to the outcome you are trying to predict (a numerical vs. a categorical response). KNN regression and KNN classification are both non-parametric approaches that use some form of local approximation to predict an outcome; the difference lies in the type of approximation taking place. KNN regression predicts the value of a numerical output variable by taking a local average of the nearest neighbors' responses, while KNN classification predicts the class to which the output variable belongs by estimating the local class probabilities and taking a majority vote.
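As a minimal sketch of the difference, assuming the ISLR Auto data loaded below (the choice of weight as the lone predictor and k = 5 are illustrative assumptions, not part of the original answer):

library(ISLR)  # Auto data

X <- scale(Auto["weight"])  # standardize the single predictor

# KNN regression (FNN package): predict mpg as the average mpg of
# each point's 5 nearest neighbors
reg_pred <- FNN::knn.reg(train = X, test = X, y = Auto$mpg, k = 5)$pred

# KNN classification (class package): predict origin by majority vote
# among each point's 5 nearest neighbors
cls_pred <- class::knn(train = X, test = X, cl = factor(Auto$origin), k = 5)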
_______________________________________________________________________________________________________________________
library(ISLR)
library(tidyverse)
pairs(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm_mpg <- lm(mpg ~ . - name, data = Auto)
summary(lm_mpg)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
fit_sales <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit_sales)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!
Price and US are significant predictors of Sales, while Urban has no significant effect on Sales. Since Sales is measured in thousands of units, each $1 increase in Price is associated with a decrease of roughly 54 units sold (all other predictors held fixed). On average, a US store sells about 1,200 more units than a non-US store (all other predictors held fixed).
Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.043469 − 0.054459(Price) − 0.021916(UrbanYes) + 1.200573(USYes), where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
Given their low p-values, we can reject the null hypothesis H0: βj = 0 for the predictors Price and US.
fit_sales2 <- lm(Sales ~ Price + US, data = Carseats)
summary(fit_sales2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Not especially well: both models explain only about 24% of the variation in the response variable Sales (R-squared ≈ 0.239 in each case).
confint(fit_sales2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow=c(2,2))
plot(fit_sales2)
A visual inspection of the Residuals vs Leverage plot suggests some points approaching or exceeding the Cook's distance cutoff, along with some general outliers. By summarizing the potentially influential observations in a list, row by row, I can easily reference that list to exclude them from a new model and see whether doing so has any significant effect on the model's performance. Running this new model and comparing the results of the two will confirm or deny what I suspected visually.
summary(influence.measures(fit_sales2))
## Potentially influential observations of
## lm(formula = Sales ~ Price + US, data = Carseats) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
# Row indices of the potentially influential observations flagged above
outliers <- c(26, 29, 43, 50, 51, 58, 69, 126, 160, 166, 172, 175, 210, 270, 298, 314, 353, 357, 368, 377, 384, 387, 396)
Carseats_Altered <- Carseats[-outliers, ]
fit_sales2NEW <- lm(Sales ~ Price + US, data = Carseats_Altered)
summary(fit_sales2NEW)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats_Altered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.263 -1.605 -0.039 1.590 5.428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.925232 0.665259 19.429 < 2e-16 ***
## Price -0.053973 0.005511 -9.794 < 2e-16 ***
## USYes 1.255018 0.248856 5.043 7.15e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared: 0.2387, Adjusted R-squared: 0.2347
## F-statistic: 58.64 on 2 and 374 DF, p-value: < 2.2e-16
Comparing how well the model fits before and after removing the influential points, the R-squared values change by less than 1%. It is probably safe to keep all of the data points in the model.
The coefficient estimate for the regression of Y onto X is the same as the coefficient estimate for the regression of X onto Y when the sum of the squares of the observed y-values is equal to the sum of the squares of the observed x-values.
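For the no-intercept regressions that problem 12 considers, the two slope estimates share a numerator, so equality of the denominators is exactly this condition:

$$\hat{\beta}_{y \sim x} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \qquad \hat{\beta}_{x \sim y} = \frac{\sum_i x_i y_i}{\sum_i y_i^2},$$

and the two are equal precisely when $\sum_i x_i^2 = \sum_i y_i^2$.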
x <- rnorm(100, 5, 2)   # sd 2
y <- rnorm(100, 10, 1)  # sd 1: the sums of squares differ, so the two slopes differ
lm(y~x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 9.82121 0.02837
lm(x~y)
##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## 3.710 0.113
x <- rnorm(100, 10, 1)  # x and y drawn from the same distribution,
y <- rnorm(100, 10, 1)  # so their sums of squares (and the two slopes) are roughly equal
lm(y~x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 12.0731 -0.2093
lm(x~y)
##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## 11.9538 -0.2087
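The examples above give only approximately equal slopes because the two sums of squares are only approximately equal. One construction that forces exact equality (my own illustration, not part of the original examples) is to make y a permutation of x:

set.seed(1)
x <- rnorm(100)
y <- sample(x)       # same values as x, reshuffled, so sum(y^2) equals sum(x^2) exactly

coef(lm(y ~ x + 0))  # slope of Y onto X, no intercept
coef(lm(x ~ y + 0))  # identical slope of X onto Y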
A helpful reference for understanding problem 12:
https://rstudio-pubs-static.s3.amazonaws.com/289958_811f0797a3454900b45ce7ecad79c101.html
Comments for 9(c)
According to the p-value associated with the F-statistic, the evidence suggests there is a relationship between the predictors and the response variable mpg.
Predictors whose p-values indicate statistically significant relationships to the response include displacement, weight, year, and origin.
Holding all other variables constant, the coefficient on year suggests that each one-year increase is associated with an increase of about 0.75 mpg.
The Residuals vs Fitted plot indicates some mild non-linearity in the data. The residual plot also shows a few outliers, and the Residuals vs Leverage plot flags one observation, #14, as a particularly high-leverage point. It may be worth seeing how much influence these points have on the model's fit by trialing models with and without the outliers and high-leverage points.
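The diagnostic plots discussed here (not shown) can be reproduced with:

par(mfrow = c(2, 2))
plot(lm_mpg)  # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage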
After running through a few examples on my own, I was able to find some significant interaction terms. Here is a summary of those specific findings:
The interaction terms cylinders:acceleration and acceleration:year appear to be statistically significant.
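The exact model specifications were not shown above, so as an illustrative assumption, interactions of this kind can be screened with something like:

# mpg regressed on the two interactions noted above (main effects included automatically)
summary(lm(mpg ~ cylinders * acceleration + acceleration * year, data = Auto))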
To look into the effects of predictor-variable transformations, I ran a few MLR models, each holding a variable plus the same variable transformed in some way (sketched below). In the first model, a log transformation of the acceleration variable made it significant, while the untransformed acceleration variable was not. I also noticed that sometimes the significance of a variable would not change, but the coefficient estimates would; this was the case for a model with a sqrt() transformation of displacement and another with an x^2 transformation of horsepower. More combinations of transformations could be trialed, but the important point is that transformations can produce real changes in both significance and coefficient estimates.
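A sketch of the kind of paired raw-plus-transformed models described above (the exact specifications were not shown, so these are assumptions):

# Each model holds a predictor alongside a transformed version of itself
summary(lm(mpg ~ acceleration + log(acceleration), data = Auto))   # log transformation
summary(lm(mpg ~ displacement + sqrt(displacement), data = Auto))  # square-root transformation
summary(lm(mpg ~ horsepower + I(horsepower^2), data = Auto))       # squared term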