The KNN classifier is used when the target variable is discrete, while KNN regression is used when y is continuous. The classifier assigns an observation to a category, whereas the regression predicts a numeric value.
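A minimal sketch of both variants, using class::knn() and FNN::knn.reg() on the built-in iris data (the train/test split and k = 5 are illustrative assumptions, not part of the original analysis):

library(FNN)  # knn.reg() for regression; the class package provides knn() for classification

set.seed(1)
train.idx = sample(nrow(iris), 100)

# Classification: predict the discrete Species label
knn.class = class::knn(train = iris[train.idx, 1:4],
                       test  = iris[-train.idx, 1:4],
                       cl    = iris$Species[train.idx], k = 5)

# Regression: predict the continuous Sepal.Length value
knn.regr = FNN::knn.reg(train = iris[train.idx, 2:4],
                        test  = iris[-train.idx, 2:4],
                        y     = iris$Sepal.Length[train.idx], k = 5)
head(knn.regr$pred)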
auto = read.csv("Auto.csv", na.strings = "?")
auto = na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
auto$horsepower = as.numeric(auto$horsepower)
pairs(auto[1:8])
cor(auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
auto.lm = lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = auto)
summary(auto.lm)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Yes, the linear model appears to have four significant variables for predicting mpg: displacement, weight, year, and origin. All of these except weight have a positive relationship with mpg. This makes sense, as weight tends to slow a car down.
The most significant predictors are weight and year. Both have p-values below 2e-16, far below the .05 level of significance.
The coefficient for year is .7508. This indicates that mpg increases by about .7508 for every one-unit (one-year) increase in year, holding the other predictors fixed. In other words, the newer the car, the higher the mpg.
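As a quick programmatic check of the claims above, the coefficient table can be filtered by p-value (a convenience sketch, not part of the original output):

coefs = summary(auto.lm)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]  # terms significant at the .05 level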
par(mfrow=c(2,2))
plot(auto.lm, which=c(1:4))
The residuals vs. fitted plot seems to have a slight pattern, which could indicate heteroscedasticity. The red line plotted in the scale-location chart is roughly horizontal and therefore seems fine. The points on the Q-Q plot mostly follow the Q-Q line, which indicates normality. Lastly, the Cook's distance plot indicates that there might be three outliers. The one with the highest Cook's distance has a value of about .08 and could be removed.
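To locate those influential points programmatically, the distances can be sorted directly (the 4/n cutoff is a common rule of thumb, assumed here rather than taken from the original analysis):

cd = cooks.distance(auto.lm)
head(sort(cd, decreasing = TRUE), 3)  # the three most influential observations
which(cd > 4 / nrow(auto))            # rule-of-thumb cutoff for influence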
auto.lm2 = lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin +
              cylinders*displacement + cylinders*horsepower + cylinders*weight + cylinders*acceleration + cylinders*year + cylinders*origin +
              horsepower*weight + horsepower*acceleration + horsepower*year + horsepower*origin +
              weight*acceleration + weight*year + weight*origin +
              acceleration*year + acceleration*origin + year*origin, data = auto)
summary(auto.lm2)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin + cylinders * displacement +
## cylinders * horsepower + cylinders * weight + cylinders *
## acceleration + cylinders * year + cylinders * origin + horsepower *
## weight + horsepower * acceleration + horsepower * year +
## horsepower * origin + weight * acceleration + weight * year +
## weight * origin + acceleration * year + acceleration * origin +
## year * origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5120 -1.4640 -0.0339 1.3639 11.3647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.575e+01 5.013e+01 1.711 0.08801 .
## cylinders -6.267e+00 5.546e+00 -1.130 0.25919
## displacement 1.199e-02 3.019e-02 0.397 0.69140
## horsepower 3.396e-01 3.299e-01 1.029 0.30396
## weight -2.183e-02 1.514e-02 -1.441 0.15038
## acceleration -5.227e+00 2.124e+00 -2.461 0.01433 *
## year 1.174e-01 5.786e-01 0.203 0.83932
## origin -1.734e+01 6.923e+00 -2.504 0.01271 *
## cylinders:displacement -1.999e-03 4.488e-03 -0.445 0.65628
## cylinders:horsepower -8.357e-03 1.486e-02 -0.562 0.57429
## cylinders:weight 1.537e-03 5.757e-04 2.670 0.00793 **
## cylinders:acceleration 7.697e-02 9.863e-02 0.780 0.43566
## cylinders:year 8.582e-03 6.453e-02 0.133 0.89427
## cylinders:origin 7.531e-01 3.931e-01 1.916 0.05618 .
## horsepower:weight 2.088e-05 1.796e-05 1.163 0.24557
## horsepower:acceleration -6.810e-03 3.221e-03 -2.114 0.03516 *
## horsepower:year -4.214e-03 3.812e-03 -1.105 0.26974
## horsepower:origin 3.354e-03 2.888e-02 0.116 0.90763
## weight:acceleration 2.388e-04 1.992e-04 1.199 0.23147
## weight:year 2.183e-05 1.755e-04 0.124 0.90108
## weight:origin 4.048e-04 1.133e-03 0.357 0.72102
## acceleration:year 5.187e-02 2.496e-02 2.079 0.03835 *
## acceleration:origin 4.785e-01 1.456e-01 3.286 0.00111 **
## year:origin 7.480e-02 7.128e-02 1.049 0.29468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.719 on 368 degrees of freedom
## Multiple R-squared: 0.8858, Adjusted R-squared: 0.8786
## F-statistic: 124.1 on 23 and 368 DF, p-value: < 2.2e-16
At the .05 level there appear to be four significant interaction effects: cylinders:weight, horsepower:acceleration, acceleration:year, and acceleration:origin (cylinders:origin is only marginally significant, at the .1 level). The most significant is acceleration:origin.
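As an aside, R's formula syntax can request every pairwise interaction compactly. Note that this sketch also includes the displacement interactions that auto.lm2 above omits, so its output would differ slightly:

# All main effects plus all 21 pairwise interactions among the 7 predictors
auto.lm2b = lm(mpg ~ (cylinders + displacement + horsepower + weight +
                      acceleration + year + origin)^2, data = auto)
summary(auto.lm2b)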
auto.lm3 = lm(mpg ~ log(cylinders) + log(displacement) + log(horsepower) + log(weight) + log(acceleration) + log(year) + log(origin), data = auto)
summary(auto.lm3)
##
## Call:
## lm(formula = mpg ~ log(cylinders) + log(displacement) + log(horsepower) +
## log(weight) + log(acceleration) + log(year) + log(origin),
## data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5987 -1.8172 -0.0181 1.5906 12.8132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -66.5643 17.5053 -3.803 0.000167 ***
## log(cylinders) 1.4818 1.6589 0.893 0.372273
## log(displacement) -1.0551 1.5385 -0.686 0.493230
## log(horsepower) -6.9657 1.5569 -4.474 1.01e-05 ***
## log(weight) -12.5728 2.2251 -5.650 3.12e-08 ***
## log(acceleration) -4.9831 1.6078 -3.099 0.002082 **
## log(year) 54.9857 3.5555 15.465 < 2e-16 ***
## log(origin) 1.5822 0.5083 3.113 0.001991 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.069 on 384 degrees of freedom
## Multiple R-squared: 0.8482, Adjusted R-squared: 0.8454
## F-statistic: 306.5 on 7 and 384 DF, p-value: < 2.2e-16
auto.lm4 = lm(mpg ~ sqrt(cylinders) + sqrt(displacement) + sqrt(horsepower) + sqrt(weight) + sqrt(acceleration) + sqrt(year) + sqrt(origin), data = auto)
summary(auto.lm4)
##
## Call:
## lm(formula = mpg ~ sqrt(cylinders) + sqrt(displacement) + sqrt(horsepower) +
## sqrt(weight) + sqrt(acceleration) + sqrt(year) + sqrt(origin),
## data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5250 -1.9822 -0.1111 1.7347 13.0681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -49.79814 9.17832 -5.426 1.02e-07 ***
## sqrt(cylinders) -0.23699 1.53753 -0.154 0.8776
## sqrt(displacement) 0.22580 0.22940 0.984 0.3256
## sqrt(horsepower) -0.77976 0.30788 -2.533 0.0117 *
## sqrt(weight) -0.62172 0.07898 -7.872 3.59e-14 ***
## sqrt(acceleration) -0.82529 0.83443 -0.989 0.3233
## sqrt(year) 12.79030 0.85891 14.891 < 2e-16 ***
## sqrt(origin) 3.26036 0.76767 4.247 2.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.21 on 384 degrees of freedom
## Multiple R-squared: 0.8338, Adjusted R-squared: 0.8308
## F-statistic: 275.3 on 7 and 384 DF, p-value: < 2.2e-16
# Note: ^ and ** inside an R formula are formula operators, not arithmetic,
# so these terms reduce to the plain predictors (see the discussion below)
auto.lm5 = lm(mpg ~ (cylinders)**2 + (displacement)**2 + (horsepower)**2 + (weight)**2 + (acceleration)**2 + (year)**2 + (origin)**2, data = auto)
summary(auto.lm5)
##
## Call:
## lm(formula = mpg ~ (cylinders)^2 + (displacement)^2 + (horsepower)^2 +
## (weight)^2 + (acceleration)^2 + (year)^2 + (origin)^2, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The best-predicting model created during this part was the log model, with an R-squared value of .8482. Note that auto.lm5 did not actually fit squared terms: inside an R formula, ^ (and **) are formula operators rather than arithmetic, so (weight)**2 reduces to plain weight, and the model is identical to auto.lm, as the identical output confirms. The best model overall was the interaction model created in the previous part, with an R-squared of .8858.
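To genuinely square the predictors, each term must be wrapped in I() (or supplied via poly()); a minimal corrected sketch:

# I() protects ^ from being interpreted as a formula operator
auto.lm5b = lm(mpg ~ I(cylinders^2) + I(displacement^2) + I(horsepower^2) +
                 I(weight^2) + I(acceleration^2) + I(year^2) + I(origin^2),
               data = auto)
summary(auto.lm5b)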
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
car = Carseats
car = na.omit(car)
car.lm = lm(Sales ~ Price + Urban + US, data = car)
summary(car.lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
According to the model, only Price and US are significant in predicting child car seat sales. As the price of the car seats increases, sales decrease: for every one-unit increase in Price, Sales decreases by .0545 units. Sales are slightly lower if the store is in an urban location (though this effect is not significant) and higher if the store is in the US.
Yhat = 13.0435 - 0.0545*Price - 0.0219*UrbanYes + 1.2006*USYes
The null hypotheses for Price and USYes are rejected because both have p-values below .05, meaning these two variables are significant.
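Plugging values into the fitted equation is equivalent to calling predict(); the Price of 120 below is a hypothetical value chosen purely for illustration:

# Expected sales for a $120 seat sold in a non-urban US store
predict(car.lm, newdata = data.frame(Price = 120, Urban = "No", US = "Yes"))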
car.lm2 = lm(Sales ~ Price + US, data = car)
summary(car.lm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models have low R-squared values. The model in part E) has a value of .2393, equal to the one in part A). Model E) has a slightly higher adjusted R-squared, .2354 versus .2335 in model A). The coefficients of the variables barely change.
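Because car.lm2 is simply car.lm with Urban dropped, the two are nested and can also be compared formally with a partial F-test (not part of the original output):

anova(car.lm2, car.lm)  # tests whether Urban adds explanatory power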
confint(car.lm2, level = .95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
cook.model = cooks.distance(car.lm2)
plot(cook.model, col = "red", pch = 20, cex = 1)
According to the Cook's distance plot, no outliers or high-leverage observations appear in the model. All the points seem to be close to one another.
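Leverage can also be checked directly via the hat values; the 2p/n cutoff below is a common rule of thumb, assumed here for illustration:

hat = hatvalues(car.lm2)
sum(hat > 2 * length(coef(car.lm2)) / nrow(car))  # observations above the 2p/n cutoff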
The coefficient estimate for the regression of Y onto X equals the estimate for the regression of X onto Y when the sum of the squared values of x equals the sum of the squared values of y: the slope of y ~ x is sum(x*y)/sum(x^2), while the slope of x ~ y is sum(x*y)/sum(y^2). This holds, for example, when y is a permutation of x, as in the second example below.
set.seed(70)
x = runif(100)
set.seed(58)
y = rnorm(100)
model = lm(y~x)
model2 = lm(x~y)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91100 -0.76273 0.02784 0.61826 2.30196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3556 0.1963 1.811 0.0731 .
## x -0.7723 0.3365 -2.295 0.0238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.935 on 98 degrees of freedom
## Multiple R-squared: 0.05102, Adjusted R-squared: 0.04133
## F-statistic: 5.268 on 1 and 98 DF, p-value: 0.02385
summary(model2)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52975 -0.19776 0.01285 0.22857 0.51232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.51039 0.02737 18.649 <2e-16 ***
## y -0.06606 0.02878 -2.295 0.0238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2734 on 98 degrees of freedom
## Multiple R-squared: 0.05102, Adjusted R-squared: 0.04133
## F-statistic: 5.268 on 1 and 98 DF, p-value: 0.02385
plot(x,y)
abline(model)
plot(y,x)
abline(model2)
set.seed(70)
x2 = sample.int(100,100)
y2 = sample.int(100,100)
model3 = lm(y2~x2)
model4 = lm(x2~y2)
summary(model3)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.136 -23.523 0.382 23.836 49.518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.9915 5.8408 7.703 1.09e-11 ***
## x2 0.1091 0.1004 1.086 0.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.99 on 98 degrees of freedom
## Multiple R-squared: 0.0119, Adjusted R-squared: 0.001816
## F-statistic: 1.18 on 1 and 98 DF, p-value: 0.28
summary(model4)
##
## Call:
## lm(formula = x2 ~ y2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.754 -24.723 0.173 24.509 51.300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.9915 5.8408 7.703 1.09e-11 ***
## y2 0.1091 0.1004 1.086 0.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.99 on 98 degrees of freedom
## Multiple R-squared: 0.0119, Adjusted R-squared: 0.001816
## F-statistic: 1.18 on 1 and 98 DF, p-value: 0.28
plot(x2,y2)
abline(model3)
plot(y2,x2)
abline(model4)
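A quick check that the equal-sum-of-squares condition behind the matching coefficients holds for the permuted data:

sum(x2^2) == sum(y2^2)  # TRUE: x2 and y2 are both permutations of 1..100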