This document applies linear regression and k-nearest neighbors (KNN) to the mtcars dataset. Both methods predict a target value, but linear regression is usually easier to interpret because its coefficients show which variables are the strongest predictors. KNN can be reliable when the dataset is large and nearby points can be assumed to have similar target values.
A preview of the dataset:
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The structure of the dataset is as follows:
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
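Q1. The variables most strongly correlated with mpg are wt (-0.87), cyl (-0.85), and disp (-0.85). All three correlations are negative: heavier cars with more cylinders and larger displacement tend to have lower mpg.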
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
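The ranking in Q1 can be read directly off the matrix above. A minimal sketch of how it could be computed, assuming the default Pearson correlation (the names `cor_mat`, `mpg_cor`, and `ranked` are illustrative, not taken from the original analysis):

```r
# Full correlation matrix of all 11 variables (Pearson by default)
cor_mat <- cor(mtcars)

# Correlations of every other variable with mpg
mpg_cor <- cor_mat[, "mpg"]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Order by absolute strength so strong negative correlations rank first
ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
ranked

# Optional plot of the matrix; requires the corrplot package
# corrplot::corrplot(cor_mat)
```

Ranking by absolute value matters here because the three strongest relationships with mpg are all negative.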
There are no missing values:
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 0 0 0 0 0 0 0 0
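The preview, structure, and missing-value outputs are the kind produced by a few base R calls. A short sketch, assuming the built-in `mtcars` data frame is used unmodified (`na_counts` is an illustrative name):

```r
# mtcars ships with base R's datasets package, so nothing needs to be installed
data(mtcars)

head(mtcars)   # first six rows
str(mtcars)    # dimensions and column types

# Number of missing values in each column
na_counts <- colSums(is.na(mtcars))
na_counts
```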
Q3. Linear regression assumes that the relationship between the predictors and the response is linear, that the errors are independent, that the residuals are approximately normally distributed, and that the residuals have constant variance (homoscedasticity).
In this dataset, the residuals are fairly normal and their variance is fairly even.
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
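The assumptions in Q3 are usually checked on the residuals of the fitted model. The sketch below refits the model from the `Call:` line above and applies two common checks; the choice of the Shapiro-Wilk test and of these two diagnostic plots is an assumption, since the source does not say how normality and equal variance were assessed.

```r
# Same model as in the summary above: mpg regressed on all other variables
fit_full <- lm(mpg ~ ., data = mtcars)
summary(fit_full)

# Normality check: Shapiro-Wilk test on the residuals
# (null hypothesis: the residuals come from a normal distribution)
shapiro.test(residuals(fit_full))

# Visual checks: residuals vs fitted values (constant variance, linearity)
# and a normal Q-Q plot (normality)
par(mfrow = c(1, 2))
plot(fit_full, which = 1:2)
```

A roughly horizontal, evenly spread band in the first plot and points close to the diagonal in the second would support the assumptions.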
Q4. The interaction between cyl and wt is significant,
with a p-value of 0.02. R-squared increases from 0.87 to 0.90
and adjusted R-squared from 0.81 to 0.85. This means that the
effect of wt on mpg depends on cyl: the positive cyl:wt coefficient
(1.03) indicates that the drop in mpg per unit of weight becomes
smaller as the number of cylinders increases.
##
## Call:
## lm(formula = mpg ~ . + cyl * wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8411 -1.5118 -0.1535 1.0062 4.4899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.194963 17.531505 1.437 0.16614
## cyl -2.455271 1.325962 -1.852 0.07889 .
## disp 0.004617 0.016360 0.282 0.78068
## hp -0.024417 0.019518 -1.251 0.22536
## drat 0.380865 1.472647 0.259 0.79857
## wt -10.803352 3.309691 -3.264 0.00388 **
## qsec 0.994343 0.657765 1.512 0.14625
## vs 0.365762 1.883568 0.194 0.84799
## am 0.674658 1.983863 0.340 0.73735
## gear 1.644552 1.394046 1.180 0.25196
## carb -0.038582 0.744507 -0.052 0.95918
## cyl:wt 1.029314 0.412776 2.494 0.02152 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.372 on 20 degrees of freedom
## Multiple R-squared: 0.9001, Adjusted R-squared: 0.8451
## F-statistic: 16.38 on 11 and 20 DF, p-value: 1.14e-07
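One way to see what the interaction term does is to compare the two nested models and then compute the slope of mpg with respect to wt at different cylinder counts. This is a sketch using the formulas shown in the `Call:` lines; the object names (`fit_full`, `fit_int`, `wt_slope`) are illustrative.

```r
fit_full <- lm(mpg ~ ., data = mtcars)
fit_int  <- lm(mpg ~ . + cyl * wt, data = mtcars)

# F-test for the nested models; with a single added term its p-value
# equals the p-value of the t-test on cyl:wt
cmp <- anova(fit_full, fit_int)
cmp

# Slope of mpg with respect to wt at a given cylinder count:
# (coefficient of wt) + (coefficient of cyl:wt) * cyl
b <- coef(fit_int)
wt_slope <- b["wt"] + b["cyl:wt"] * c(4, 6, 8)
names(wt_slope) <- c("4 cyl", "6 cyl", "8 cyl")
wt_slope
```

The slopes stay negative but move toward zero as cyl grows, which is the numeric form of the interaction.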
## [1] "Mean Squared Error for Linear Model: 3.52"
## [1] "Mean Squared Error for stdKNN: 2.16"
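The linear-model value of 3.52 is consistent with the in-sample (training) MSE of the interaction model, since 2.372² × 20 / 32 ≈ 3.52. A sketch of that calculation, assuming this is how the figure was obtained:

```r
fit_int <- lm(mpg ~ . + cyl * wt, data = mtcars)

# In-sample mean squared error: average squared residual over all 32 cars
mse_lm <- mean(residuals(fit_int)^2)
round(mse_lm, 2)
```

Because both MSE figures appear to be computed on the data used for fitting, they compare how closely each model fits the training set, not how well it would predict new cars.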
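Q2. As the value of k increases, the MSE increases,
meaning that the model fits the data less and less closely. The MSE of 0 at k = 1 suggests that the error was measured on the training data, where each car is its own nearest neighbor, so these values describe fit to the training set rather than predictive accuracy on new data.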
## k MSE
## 1 1 0.0000000
## 2 3 0.8303497
## 3 5 2.1607451
## 4 7 3.4229878
## 5 9 4.3303489
## 6 11 4.9779654
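The pattern in the table can be reproduced in spirit with a small base R implementation of KNN regression. This is a sketch under stated assumptions: all ten predictors are standardized, Euclidean distance is used, and the error is measured on the training data. The source does not show which KNN function or settings produced the table, so the exact values may differ; `knn_train_mse` is a hypothetical helper.

```r
# Standardize predictors so no single variable dominates the distance
x <- scale(mtcars[, names(mtcars) != "mpg"])
y <- mtcars$mpg
d <- as.matrix(dist(x))   # pairwise Euclidean distances between all 32 cars

# Predict each car's mpg as the mean mpg of its k nearest cars.
# Each car counts as its own neighbor (distance 0), so this is training error.
knn_train_mse <- function(k) {
  pred <- apply(d, 1, function(row) mean(y[order(row)[1:k]]))
  mean((y - pred)^2)
}

ks <- c(1, 3, 5, 7, 9, 11)
data.frame(k = ks, MSE = sapply(ks, knn_train_mse))
```

With k = 1 every car is predicted by itself, which gives an MSE of exactly 0. Holding out data (for example, leave-one-out cross-validation) would give a fairer basis for choosing k.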
Q5. When we truncate hp, R-squared and adjusted
R-squared stay about the same, at 0.87 and 0.81.
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3995 -1.6189 -0.1607 1.1006 4.6034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.67402 18.65510 0.733 0.472
## cyl -0.07669 1.03686 -0.074 0.942
## disp 0.01385 0.01726 0.802 0.432
## hp -0.02747 0.02369 -1.160 0.259
## drat 0.97639 1.61615 0.604 0.552
## wt -3.53681 1.83844 -1.924 0.068 .
## qsec 0.71170 0.73841 0.964 0.346
## vs 0.28096 2.05922 0.136 0.893
## am 2.31767 2.03927 1.137 0.269
## gear 0.69316 1.48211 0.468 0.645
## carb -0.27777 0.76186 -0.365 0.719
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.628 on 21 degrees of freedom
## Multiple R-squared: 0.8712, Adjusted R-squared: 0.8099
## F-statistic: 14.2 on 10 and 21 DF, p-value: 3.211e-07
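The source does not state how hp was truncated. One common approach is to cap extreme values at an upper quantile; the sketch below uses the 95th percentile purely as a hypothetical cutoff and works on a copy (`cars_capped`) so the original data frame is left unchanged. Its results are not expected to match the summary above exactly.

```r
cars_capped <- mtcars

# Hypothetical cutoff: the 95th percentile of hp (not taken from the source)
hp_cap <- unname(quantile(mtcars$hp, 0.95))

# Replace any hp above the cutoff with the cutoff itself
cars_capped$hp <- pmin(cars_capped$hp, hp_cap)

fit_capped <- lm(mpg ~ ., data = cars_capped)
s <- summary(fit_capped)
c(r2 = s$r.squared, adj_r2 = s$adj.r.squared)
```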
Q6. When we make the dataset larger, KNN becomes a better tool, as long as we don’t add dimensionality. When the data become high-dimensional, KNN becomes less accurate because distances between points become less informative.
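The effect of dimensionality can be illustrated with a small simulation on synthetic data (not mtcars). The sketch draws uniform random points and compares the distance from one point to its nearest and farthest neighbors; the sample size, dimensions, and seed are arbitrary choices, and `contrast` is a hypothetical helper.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Ratio of nearest to farthest distance from the first point to all others,
# for n uniform random points in d dimensions. A ratio near 1 means "near"
# and "far" neighbors are barely distinguishable.
contrast <- function(d, n = 500) {
  pts <- matrix(runif(n * d), nrow = n)
  dists <- sqrt(colSums((t(pts[-1, ]) - pts[1, ])^2))
  min(dists) / max(dists)
}

dims <- c(2, 10, 100, 1000)
ratios <- sapply(dims, contrast)
data.frame(dimensions = dims, nearest_over_farthest = ratios)
```

As the ratio approaches 1, the "nearest" neighbors are hardly closer than any other point, which is why KNN predictions degrade in high dimensions.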