One difference between the KNN classifier and KNN regression is the type of target variable: KNN regression is used when the target is numeric / continuous (y can be any number), while the KNN classifier is used when the target is categorical / discrete (e.g. y is 0 or 1). Given K and a prediction point, KNN regression finds the K nearest training points and predicts the average of those neighbors' responses. The KNN classifier instead uses the K nearest training points to estimate the conditional probability of each class, then assigns the test point to whichever class has the highest estimated probability.
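A minimal sketch contrasting the two, assuming the class and FNN packages are installed (the toy data and names below are illustrative, not from the assignment):
set.seed(42)
train_x = matrix(rnorm(100), ncol = 2)                       # 50 toy training points
test_x = matrix(rnorm(10), ncol = 2)                         # 5 toy prediction points
y_class = factor(ifelse(rowSums(train_x) > 0, "yes", "no"))  # categorical target
y_num = rowSums(train_x) + rnorm(50)                         # numeric target
# classifier: majority vote among the K nearest training points
class::knn(train = train_x, test = test_x, cl = y_class, k = 5)
# regression: average response of the K nearest training points
FNN::knn.reg(train = train_x, test = test_x, y = y_num, k = 5)$pred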
auto <- read.csv("~/R-Studio/Predictive Modeling/ALL CSV FILES - 2nd Edition/Auto.csv")
plot(auto)
library(dplyr)  # the %>% pipe below needs dplyr (or magrittr) attached
auto_no_name = auto %>%
  dplyr::select(-name)
auto_no_name$horsepower = as.numeric(auto_no_name$horsepower)
## Warning: NAs introduced by coercion
cor(auto_no_name)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7762599 -0.8044430 NA -0.8317389
## cylinders -0.7762599 1.0000000 0.9509199 NA 0.8970169
## displacement -0.8044430 0.9509199 1.0000000 NA 0.9331044
## horsepower NA NA NA 1 NA
## weight -0.8317389 0.8970169 0.9331044 NA 1.0000000
## acceleration 0.4222974 -0.5040606 -0.5441618 NA -0.4195023
## year 0.5814695 -0.3467172 -0.3698041 NA -0.3079004
## origin 0.5636979 -0.5649716 -0.6106643 NA -0.5812652
## acceleration year origin
## mpg 0.4222974 0.5814695 0.5636979
## cylinders -0.5040606 -0.3467172 -0.5649716
## displacement -0.5441618 -0.3698041 -0.6106643
## horsepower NA NA NA
## weight -0.4195023 -0.3079004 -0.5812652
## acceleration 1.0000000 0.2829009 0.2100836
## year 0.2829009 1.0000000 0.1843141
## origin 0.2100836 0.1843141 1.0000000
is.na(auto_no_name$horsepower)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [337] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [361] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [385] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE
# the NA values correspond to entries coded as "?" in the original data
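An alternative worth noting (a sketch, assuming the same file path): declaring "?" as the NA marker at read time avoids the coercion warning and lets cor() use only the complete rows.
auto2 = read.csv("~/R-Studio/Predictive Modeling/ALL CSV FILES - 2nd Edition/Auto.csv",
                 na.strings = "?")                          # "?" becomes NA on import
cor(auto2[, names(auto2) != "name"], use = "complete.obs")  # no NA entries in the matrix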
auto_lm = lm(mpg ~ ., data = auto_no_name)
summary(auto_lm)
##
## Call:
## lm(formula = mpg ~ ., data = auto_no_name)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Given the F-statistic's p-value of < 2.2e-16, we can reject the null hypothesis and conclude that at least one predictor has a significant relationship with mpg (assuming an alpha level of 0.05). Displacement, weight, year, and origin are the significant predictors. The coefficient for year is 0.750773, so, holding all other variables constant, a one-unit increase in year is associated with a 0.75 increase in mpg.
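To make the year interpretation concrete, here is a small illustration (the predictor values for this hypothetical car are made up; only year differs between the two rows):
new_cars = data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77), origin = 1)
diff(predict(auto_lm, newdata = new_cars))  # equals the year coefficient, ~0.75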
par(mfrow=c(2,2))
plot(auto_lm)
The residuals may not be normally distributed: in the Normal Q-Q plot, many
points on the upper end depart from the reference line. The model also
appears to violate homoskedasticity (equal variance): in the Residuals vs
Fitted plot the points fan out in a < shape, when they should be evenly
spread with no pattern. Looking at Cook's distance, no points cross the
0.5 threshold, so there appear to be no influential observations.
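The equal-variance read can also be checked formally (an optional sketch, assuming the lmtest package is installed):
lmtest::bptest(auto_lm)  # Breusch-Pagan test; a small p-value argues against constant variance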
auto_lm_int = lm(mpg ~ .*., data = auto_no_name)  # .*. fits all main effects plus every pairwise interaction
summary(auto_lm_int)
##
## Call:
## lm(formula = mpg ~ . * ., data = auto_no_name)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
Using an alpha level of 0.05, the following interactions are significant: displacement:year, acceleration:year, and acceleration:origin.
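One natural follow-up (a sketch, not part of the original analysis) would be to refit with only the significant interactions:
auto_lm_int2 = lm(mpg ~ . + displacement:year + acceleration:year + acceleration:origin,
                  data = auto_no_name)
summary(auto_lm_int2)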
auto_lm_trans = lm(mpg ~ log(horsepower) + horsepower, data = auto_no_name)
summary(auto_lm_trans)
##
## Call:
## lm(formula = mpg ~ log(horsepower) + horsepower, data = auto_no_name)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5118 -2.5018 -0.2533 2.4446 15.3102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.04057 12.08267 12.914 < 2e-16 ***
## log(horsepower) -31.59815 3.28363 -9.623 < 2e-16 ***
## horsepower 0.11846 0.02929 4.044 6.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.415 on 389 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6801
## F-statistic: 416.6 on 2 and 389 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(auto_lm_trans)
Previously, horsepower was not significant in the model that included all
variables, but it becomes significant with the log transformation. The
transformed model also appears to satisfy normality and equal variance
better than the full model.
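Other standard transformations could be tried the same way (a sketch; output not shown here):
summary(lm(mpg ~ sqrt(horsepower), data = auto_no_name))              # square-root transform
summary(lm(mpg ~ horsepower + I(horsepower^2), data = auto_no_name))  # quadratic term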
carseats <- read.csv("~/R-Studio/Predictive Modeling/ALL CSV FILES - 2nd Edition/Carseats.csv")
carseats_lm = lm(Sales ~ Price + Urban + US, data = carseats)
summary(carseats_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price (p < 0.05): with all other variables held constant, a one-unit
increase in Price is associated with a decrease of 0.054459 in Sales.
Urban (p > 0.05): Urban is not a significant predictor of Sales since
p = 0.936, so there is no evidence that a store being urban affects Sales;
the fitted coefficient (a 0.021916 decrease for urban stores, all else held
constant) is not distinguishable from zero. US (p < 0.05): with all other
variables held constant, a store being in the US is associated with a
1.200573 increase in Sales.
Sales = 13.043469 - 0.054459 * Price - 0.021916 * Urban + 1.200573 * US, where Urban and US are indicator variables (1 = Yes, 0 = No), so a value of 0 nullifies that term of the equation.
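A quick sanity check of the fitted equation (a sketch for a hypothetical store with Price = 100, Urban = Yes, US = Yes; both lines should give the same prediction, about 8.78):
13.043469 - 0.054459 * 100 - 0.021916 * 1 + 1.200573 * 1  # by hand
predict(carseats_lm, newdata = data.frame(Price = 100, Urban = "Yes", US = "Yes"))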
We can reject the null hypothesis H0: beta_j = 0 for Price and US, assuming an alpha level of 0.05.
carseat_e = lm(Sales ~ Price + US, data = carseats)
summary(carseat_e)
##
## Call:
## lm(formula = Sales ~ Price + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Using adjusted R-squared, Model A scores 0.2335 and Model E scores 0.2354. Both models explain roughly 23% of the variance in Sales, but Model E does so with one fewer predictor, and its F-statistic (62.43) is about 20 higher than Model A's (41.52), so Model E is the marginally better model.
confint(carseat_e)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
We are 95% confident that the true coefficient for Price lies in (-0.06475984, -0.04419543) and that the true coefficient for USYes lies in (0.69151957, 1.70776632); an interval constructed this way fails to cover the true value 5% of the time.
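The Price interval can be rebuilt by hand from the rounded summary-table numbers (estimate +/- t-quantile * standard error):
-0.05448 + c(-1, 1) * qt(0.975, df = 397) * 0.00523  # reproduces (-0.0648, -0.0442)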
par(mfrow=c(2,2))
plot(carseat_e)
Looking at the Residuals vs Leverage plot, there do not appear to be any
problematic high-leverage observations.
The two coefficient estimates are the same when the sample variance of x equals the sample variance of y: the slope from y ~ x is cov(x, y)/var(x), while the slope from x ~ y is cov(x, y)/var(y). A simple case is y = x, where both coefficients are 1.
x = rnorm(100)            # no set.seed() here, so exact numbers vary between runs
y = 0.42*x + rnorm(100)   # y depends on x with true slope 0.42 plus noise
q12_a = lm(y~x)
q12_b = lm(x~y)
summary(q12_a)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2072 -0.5763 0.0610 0.5546 2.3546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06450 0.09810 -0.658 0.51238
## x 0.28748 0.09964 2.885 0.00481 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9714 on 98 degrees of freedom
## Multiple R-squared: 0.07829, Adjusted R-squared: 0.06889
## F-statistic: 8.324 on 1 and 98 DF, p-value: 0.004811
summary(q12_b)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6802 -0.6189 0.1225 0.6761 1.5873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.14379 0.09458 1.520 0.13167
## y 0.27234 0.09439 2.885 0.00481 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9455 on 98 degrees of freedom
## Multiple R-squared: 0.07829, Adjusted R-squared: 0.06889
## F-statistic: 8.324 on 1 and 98 DF, p-value: 0.004811
coefficients(q12_a)
## (Intercept) x
## -0.06450122 0.28747757
coefficients(q12_b)
## (Intercept) y
## 0.1437865 0.2723397
The coefficient estimates are different (0.28748 vs. 0.27234), as expected, since var(x) and var(y) differ in this sample.
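A quick check of that reasoning, using the x and y already in scope (with an intercept, the fitted slope is the sample covariance over the predictor's variance):
cov(x, y) / var(x)  # matches coefficients(q12_a)["x"]
cov(x, y) / var(y)  # matches coefficients(q12_b)["y"]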
x = rnorm(100)
y = x
q12_c = lm(y~x)
q12_d = lm(x~y)
summary(q12_c)
## Warning in summary.lm(q12_c): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.653e-15 1.800e-18 5.670e-17 8.640e-17 5.662e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.661e-17 5.874e-17 -1.134e+00 0.26
## x 1.000e+00 5.400e-17 1.852e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.825e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.429e+32 on 1 and 98 DF, p-value: < 2.2e-16
summary(q12_d)
## Warning in summary.lm(q12_d): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.653e-15 1.800e-18 5.670e-17 8.640e-17 5.662e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.661e-17 5.874e-17 -1.134e+00 0.26
## y 1.000e+00 5.400e-17 1.852e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.825e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.429e+32 on 1 and 98 DF, p-value: < 2.2e-16
coefficients(q12_c)
## (Intercept) x
## -6.661338e-17 1.000000e+00
coefficients(q12_d)
## (Intercept) y
## -6.661338e-17 1.000000e+00
Now the two estimates are identical (both exactly 1) with a perfect fit, as expected, since y = x makes var(x) = var(y).