Introduction

This document applies linear regression and k-nearest neighbors (KNN) to the mtcars dataset. Both methods predict a target value, but linear regression is usually easier to interpret, and it makes it clearer which variables are the best predictors. KNN can be reliable when the dataset is large and when it is reasonable to assume that nearby points have similar target values.

A preview of the dataset:

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The structure of the dataset is as follows:

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
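
The preview and structure listings above look like output from base R's `head()` and `str()`. A minimal sketch, assuming the built-in `mtcars` data frame from the `datasets` package is used unmodified:

```r
# mtcars ships with base R (datasets package), so no extra package is needed.
data(mtcars)

# First six rows, as in the preview.
head(mtcars)

# Column types and dimensions, as in the structure listing.
str(mtcars)
```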

Dataset Visuals:

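Q1. The variables most strongly correlated with mpg are wt (-0.87), cyl (-0.85), and disp (-0.85). All three correlations are negative: heavier cars with more cylinders and larger displacement tend to get fewer miles per gallon.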

##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
## corrplot 0.92 loaded
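
One way to read the Q1 answer straight off the matrix is to take the mpg column and rank it by absolute value. This is a sketch using only base R's `cor()`; the variable names `cors` and `ranked` are illustrative and not taken from the original analysis.

```r
# Correlation of every variable with mpg, taken from the full matrix.
cors <- cor(mtcars)[, "mpg"]

# Drop mpg's correlation with itself, then rank by absolute strength.
cors <- cors[names(cors) != "mpg"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]

# The three strongest correlations, rounded for display.
round(head(ranked, 3), 2)
```

Ranking by absolute value matters here because the strongest relationships with mpg are all negative.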

There are no missing values:

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0
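
A per-column count like the one above can be produced with base R alone. The sketch below assumes that is how the check was done; `na_counts` is an illustrative name.

```r
# Count missing values in each column; all zeros means the data are complete.
na_counts <- colSums(is.na(mtcars))
na_counts

# Single TRUE/FALSE answer for the whole data frame.
anyNA(mtcars)
```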

Q3. Linear regression assumes that the predictors have a linear relationship with the response, that the residuals are approximately normally distributed, and that the residuals have equal variance (homoscedasticity).

In this dataset, the distribution is fairly normal, and the variance is fairly even.

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
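
The summary above corresponds to the call `lm(mpg ~ ., data = mtcars)`. The sketch below refits that model and adds one possible check of the normality assumption, a Shapiro-Wilk test on the residuals. The choice of test is an assumption made for illustration; the original analysis does not say how normality and equal variance were assessed.

```r
# Full model: mpg regressed on all ten remaining variables.
fit_full <- lm(mpg ~ ., data = mtcars)
summary(fit_full)

# Residuals are what the normality and equal-variance assumptions refer to.
res <- residuals(fit_full)

# Shapiro-Wilk test: a large p-value gives no evidence against normal residuals.
sw <- shapiro.test(res)
sw$p.value
```

For a visual check, `plot(fit_full, which = 1:2)` draws the residuals-versus-fitted and normal Q-Q plots, which are the usual way to judge equal variance and normality by eye.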

Q4. The interaction between cyl and wt is significant in the model, with a p-value of 0.02. R-squared increases from 0.87 to 0.90 and adjusted R-squared from 0.81 to 0.85. This means that the effect of wt on mpg depends on cyl: the positive cyl:wt coefficient (1.03) indicates that the negative effect of weight on mpg weakens as the number of cylinders increases.

## 
## Call:
## lm(formula = mpg ~ . + cyl * wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8411 -1.5118 -0.1535  1.0062  4.4899 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  25.194963  17.531505   1.437  0.16614   
## cyl          -2.455271   1.325962  -1.852  0.07889 . 
## disp          0.004617   0.016360   0.282  0.78068   
## hp           -0.024417   0.019518  -1.251  0.22536   
## drat          0.380865   1.472647   0.259  0.79857   
## wt          -10.803352   3.309691  -3.264  0.00388 **
## qsec          0.994343   0.657765   1.512  0.14625   
## vs            0.365762   1.883568   0.194  0.84799   
## am            0.674658   1.983863   0.340  0.73735   
## gear          1.644552   1.394046   1.180  0.25196   
## carb         -0.038582   0.744507  -0.052  0.95918   
## cyl:wt        1.029314   0.412776   2.494  0.02152 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.372 on 20 degrees of freedom
## Multiple R-squared:  0.9001, Adjusted R-squared:  0.8451 
## F-statistic: 16.38 on 11 and 20 DF,  p-value: 1.14e-07
## [1] "Mean Squared Error for Linear Model: 3.52"
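
The comparison in Q4 can be sketched by fitting both models and testing the added term. Because `.` already contains the main effects, `mpg ~ . + cyl:wt` specifies the same model as the `cyl * wt` form shown in the call above. The reported MSE of 3.52 matches the in-sample mean squared residual implied by the interaction model's residual standard error (2.372² × 20 / 32 ≈ 3.52), so this sketch assumes that is how it was computed.

```r
# Model without and with the cyl-by-wt interaction.
fit_full <- lm(mpg ~ ., data = mtcars)
fit_int  <- lm(mpg ~ . + cyl:wt, data = mtcars)

# F-test for the single added term (equivalent to its t-test in the summary).
anova(fit_full, fit_int)

# Adjusted R-squared side by side.
c(full        = summary(fit_full)$adj.r.squared,
  interaction = summary(fit_int)$adj.r.squared)

# In-sample mean squared error of the interaction model (assumed definition).
mse_int <- mean(residuals(fit_int)^2)
round(mse_int, 2)
```

Note that this error is measured on the same 32 rows used for fitting, so it is an optimistic estimate of how the model would do on new cars.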

## [1] "Mean Squared Error for stdKNN: 2.16"

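Q2. As the value of k increases, the MSE increases from 0 at k = 1 to 4.98 at k = 11, so the model fits these data less closely as k grows. An MSE of exactly 0 at k = 1 is what would be expected if the error is measured on the same rows used for fitting, since each car is then its own nearest neighbor; if so, the low values at small k reflect memorization of the data rather than better predictive accuracy.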

##    k       MSE
## 1  1 0.0000000
## 2  3 0.8303497
## 3  5 2.1607451
## 4  7 3.4229878
## 5  9 4.3303489
## 6 11 4.9779654
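
To make the mechanics concrete, here is a hand-rolled KNN regression in base R. It is a sketch under several assumptions: all ten predictors are standardized (suggested by the "stdKNN" label), distances are Euclidean, and the error is evaluated on the training rows. The original analysis may have used a package and different settings, so the values need not match the table exactly. The helper `knn_train_mse` is hypothetical.

```r
# Standardize predictors so no single variable dominates the distance.
X <- scale(mtcars[, setdiff(names(mtcars), "mpg")])
y <- mtcars$mpg

# Pairwise Euclidean distances between all cars (32 x 32 matrix).
D <- as.matrix(dist(X))

# Predict each car's mpg as the mean mpg of its k nearest cars.
# Each car is at distance 0 from itself, so it counts as its own neighbor.
knn_train_mse <- function(k) {
  preds <- apply(D, 1, function(d) mean(y[order(d)[seq_len(k)]]))
  mean((y - preds)^2)
}

ks <- c(1, 3, 5, 7, 9, 11)
data.frame(k = ks, MSE = sapply(ks, knn_train_mse))
```

With k = 1 every car predicts itself, which is why the training MSE is exactly 0; at the other extreme, k = 32 predicts the overall mean mpg for every car. A held-out or cross-validated error would be needed to choose k for prediction.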

Q5. When we truncate hp, R-squared and adjusted R-squared stay about the same, at 0.87 and 0.81.

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3995 -1.6189 -0.1607  1.1006  4.6034 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 13.67402   18.65510   0.733    0.472  
## cyl         -0.07669    1.03686  -0.074    0.942  
## disp         0.01385    0.01726   0.802    0.432  
## hp          -0.02747    0.02369  -1.160    0.259  
## drat         0.97639    1.61615   0.604    0.552  
## wt          -3.53681    1.83844  -1.924    0.068 .
## qsec         0.71170    0.73841   0.964    0.346  
## vs           0.28096    2.05922   0.136    0.893  
## am           2.31767    2.03927   1.137    0.269  
## gear         0.69316    1.48211   0.468    0.645  
## carb        -0.27777    0.76186  -0.365    0.719  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.628 on 21 degrees of freedom
## Multiple R-squared:  0.8712, Adjusted R-squared:  0.8099 
## F-statistic:  14.2 on 10 and 21 DF,  p-value: 3.211e-07
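
The output does not show how hp was truncated. Since the model still has 21 residual degrees of freedom, all 32 rows were kept, which is consistent with capping extreme values rather than dropping cars. The sketch below illustrates that idea with a hypothetical cap at the 95th percentile of hp; the threshold is an assumption, so the numbers it produces will not necessarily match the summary above.

```r
# Copy the data so the original mtcars is left untouched.
cars_capped <- mtcars

# Hypothetical threshold: the 95th percentile of hp (not from the original).
hp_cap <- unname(quantile(mtcars$hp, 0.95))

# Cap hp at the threshold; values below it are unchanged.
cars_capped$hp <- pmin(cars_capped$hp, hp_cap)

# Refit the full model on the capped data and compare fit statistics.
fit_capped <- lm(mpg ~ ., data = cars_capped)
c(r.squared     = summary(fit_capped)$r.squared,
  adj.r.squared = summary(fit_capped)$adj.r.squared)
```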

Q6. When we make the dataset larger, KNN becomes a better tool as long as we don’t add dimensionality. When the data become high-dimensional, KNN becomes less accurate, because distances between points become less informative about which points are truly similar.
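
The effect of dimensionality can be illustrated with simulated data (not mtcars). The sketch below draws random points in the unit cube and compares the nearest and farthest distance from a random query point. The function `contrast` and its settings are illustrative assumptions. A ratio close to 1 means the nearest neighbor is barely closer than the farthest point, which leaves KNN little to work with.

```r
set.seed(1)

# Ratio of nearest to farthest distance from a random query point
# to n uniform random points in d dimensions.
contrast <- function(d, n = 200) {
  X <- matrix(runif(n * d), nrow = n)       # n points in the unit cube
  q <- runif(d)                             # one query point
  dists <- sqrt(rowSums(sweep(X, 2, q)^2))  # Euclidean distance to each point
  min(dists) / max(dists)
}

dims <- c(2, 10, 100, 1000)
ratios <- sapply(dims, contrast)
data.frame(dimensions = dims, nearest_over_farthest = ratios)
```

The ratio tends toward 1 as the number of dimensions grows, which is one common explanation for KNN losing accuracy in high-dimensional data unless the number of observations grows much faster.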