Linear Regression

Carefully explain the differences between the KNN classifier and KNN regression methods.

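Both methods start by finding the K training observations closest to a given point x0. KNN regression is used when the response is quantitative: it predicts Y as the average of the responses of those K neighbors. The KNN classifier is used when the response is qualitative: it estimates the probability of each class as the fraction of the K neighbors that belong to that class, and assigns x0 to the class with the highest estimated probability.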
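
The contrast can be made concrete with a small base-R sketch. The helper names knn_regress and knn_classify, the toy data, and the choice k = 3 are illustrative assumptions, not part of the exercise.

```r
# Toy one-dimensional training data (made up for illustration)
train_x <- c(1, 2, 3, 6, 7, 8)
train_y_num <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)                 # quantitative response
train_y_cls <- c("low", "low", "low", "high", "high", "high")  # qualitative response

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(train_x - x0))[seq_len(k)]

# KNN regression: average the neighbors' responses
knn_regress <- function(x0, k = 3) mean(train_y_num[nearest(x0, k)])

# KNN classifier: majority vote among the neighbors' classes
knn_classify <- function(x0, k = 3) {
  votes <- table(train_y_cls[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

knn_regress(2.2)   # returns a number
knn_classify(2.2)  # returns a class label
```

The neighbor search is identical in both helpers; only the way the neighbors' responses are combined differs.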

This question involves the use of multiple linear regression on the Auto data set.

Auto <- read.csv("C:/Users/dpmar/OneDrive/Documents/R/DATA/Auto.csv", stringsAsFactors = TRUE)

## A) Produce a scatterplot matrix which includes all of the variables in the dataset.

pairs(Auto)

## B) Compute the matrix of correlations between the variables using the function cor().

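# Drop column 9 (name), which is qualitative
Auto <- Auto[, -9]
# Also drop column 4 (horsepower) before computing the correlations
AutoTF <- Auto[, -4]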

cor(AutoTF)

##                     mpg  cylinders displacement     weight acceleration
## mpg           1.0000000 -0.7762599   -0.8044430 -0.8317389    0.4222974
## cylinders    -0.7762599  1.0000000    0.9509199  0.8970169   -0.5040606
## displacement -0.8044430  0.9509199    1.0000000  0.9331044   -0.5441618
## weight       -0.8317389  0.8970169    0.9331044  1.0000000   -0.4195023
## acceleration  0.4222974 -0.5040606   -0.5441618 -0.4195023    1.0000000
## year          0.5814695 -0.3467172   -0.3698041 -0.3079004    0.2829009
## origin        0.5636979 -0.5649716   -0.6106643 -0.5812652    0.2100836
##                    year     origin
## mpg           0.5814695  0.5636979
## cylinders    -0.3467172 -0.5649716
## displacement -0.3698041 -0.6106643
## weight       -0.3079004 -0.5812652
## acceleration  0.2829009  0.2100836
## year          1.0000000  0.1843141
## origin        0.1843141  1.0000000
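
Column 4 (horsepower) is left out of the correlation matrix above. One common reason, assumed here rather than confirmed by the source, is that a CSV version of the Auto data marks missing horsepower values with "?", which makes read.csv() treat the column as non-numeric. A minimal sketch of one way to keep such a column, using a made-up four-row sample instead of the real file:

```r
# Made-up sample mimicking a CSV with "?" for a missing value (not the real data)
csv_text <- "mpg,horsepower,weight
18,130,3504
25,?,2046
32,65,2130
15,150,3777"

# na.strings = "?" turns the placeholder into NA so the column stays numeric
sample_auto <- read.csv(text = csv_text, na.strings = "?")
str(sample_auto)

# Correlations computed only on rows without missing values
cor(sample_auto, use = "complete.obs")
```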

## C) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

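## Based on the multiple linear regression, there is a relationship between the predictors and the response: the F-statistic is 298.9 with a p-value below 2.2e-16. Weight, year, and origin are the most significant predictors (p < 0.001); displacement and acceleration are significant at the 0.05 level, and cylinders is not significant. The coefficient for year (0.770) suggests that mpg increases by about 0.77 for each additional model year, holding the other predictors fixed. Note that horsepower is not included in this fit.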

lm.Auto=lm(mpg~cylinders + displacement + weight + acceleration + year + origin, data=Auto)

summary(lm.Auto)

## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + weight + acceleration + 
##     year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5573 -2.1745 -0.0456  1.8454 12.9946 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.014e+01  4.145e+00  -4.858 1.72e-06 ***
## cylinders    -4.198e-01  3.203e-01  -1.311   0.1908    
## displacement  1.742e-02  7.189e-03   2.423   0.0158 *  
## weight       -6.928e-03  5.781e-04 -11.983  < 2e-16 ***
## acceleration  1.591e-01  7.741e-02   2.055   0.0405 *  
## year          7.703e-01  4.934e-02  15.613  < 2e-16 ***
## origin        1.356e+00  2.691e-01   5.040 7.16e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.333 on 390 degrees of freedom
## Multiple R-squared:  0.8214, Adjusted R-squared:  0.8186 
## F-statistic: 298.9 on 6 and 390 DF,  p-value: < 2.2e-16

## D) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit.

## The residuals-versus-fitted plot shows the spread of the residuals increasing as the fitted values increase, which suggests non-constant variance. The residuals-versus-leverage plot also flags 3 observations with higher than average leverage.

plot(lm.Auto)
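
A numerical check can complement the diagnostic plots. The sketch below uses the built-in mtcars data as a stand-in for Auto (an assumption made so the example runs without the CSV file), and the cutoff of twice the average leverage is a common rule of thumb rather than something stated in the exercise.

```r
# Stand-in model on a built-in data set (not the Auto fit from the text)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Leverage of each observation; the average leverage is (p + 1) / n
lev <- hatvalues(fit)
avg_lev <- length(coef(fit)) / nrow(mtcars)

# Observations with leverage above twice the average (rule-of-thumb cutoff)
high_leverage <- names(lev)[lev > 2 * avg_lev]
high_leverage

# Studentized residuals beyond about 3 in absolute value suggest outliers
outliers <- names(which(abs(rstudent(fit)) > 3))
outliers
```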

## E) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

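## All five of the tested interaction terms are statistically significant (p < 0.001). Note that this model is specified with : only, so it contains the products without the corresponding main effects; the * symbol would include both.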

lm.Auto.Inter=lm(mpg~cylinders:displacement + displacement:weight + weight:acceleration + acceleration:year + year:origin, data=Auto)

summary(lm.Auto.Inter)

## 
## Call:
## lm(formula = mpg ~ cylinders:displacement + displacement:weight + 
##     weight:acceleration + acceleration:year + year:origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9192 -2.1888 -0.0468  1.7326 12.4489 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.256e+01  1.500e+00   8.372 1.02e-15 ***
## cylinders:displacement -3.464e-03  1.033e-03  -3.352 0.000881 ***
## displacement:weight     1.055e-05  2.228e-06   4.738 3.03e-06 ***
## weight:acceleration    -5.898e-04  3.409e-05 -17.301  < 2e-16 ***
## acceleration:year       2.817e-02  1.624e-03  17.344  < 2e-16 ***
## year:origin             1.297e-02  3.454e-03   3.754 0.000201 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.386 on 391 degrees of freedom
## Multiple R-squared:  0.8152, Adjusted R-squared:  0.8128 
## F-statistic: 344.9 on 5 and 391 DF,  p-value: < 2.2e-16
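
The fit above uses only the : operator, so it contains products without the corresponding main effects. As a sketch of the alternative, a * b expands to a + b + a:b. The built-in mtcars data is used here as a stand-in for Auto, which is an assumption made for the sake of a self-contained example.

```r
# a * b is shorthand for a + b + a:b (main effects plus interaction)
fit_star <- lm(mpg ~ wt * hp, data = mtcars)
fit_long <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Both formulas produce the same terms
names(coef(fit_star))

# The interaction row of the coefficient table carries its own t-test
summary(fit_star)$coefficients["wt:hp", ]
```

Keeping the main effects in the model makes the interaction coefficient easier to interpret, because it then measures only the departure from additivity.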

## F) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

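## The log and square-root transformations slightly improve the fit: R-squared rises from 0.8214 to 0.8415 (log) and 0.832 (sqrt). In both transformed models, displacement and acceleration are no longer significant, while weight, year, and origin remain significant. The squared model reproduces the untransformed fit exactly, because ^ inside a formula is a formula operator rather than arithmetic; squaring a predictor requires wrapping it in I(), as in I(weight^2).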

lm.Auto.log=lm(mpg~log(cylinders) + log(displacement) + log(weight) + log(acceleration) + log(year) + log(origin), data=Auto)

lm.Auto.sqrt=lm(mpg~sqrt(cylinders) + sqrt(displacement) + sqrt(weight) + sqrt(acceleration) + sqrt(year) + sqrt(origin), data=Auto)

lm.Auto.sqrd=lm(mpg~(cylinders)^2 + (displacement)^2 + (weight)^2 + (acceleration)^2 + (year)^2 + (origin)^2, data=Auto)

summary(lm.Auto.log)

## 
## Call:
## lm(formula = mpg ~ log(cylinders) + log(displacement) + log(weight) + 
##     log(acceleration) + log(year) + log(origin), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7552 -2.0393 -0.0793  1.6484 12.9417 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -81.1712    17.4335  -4.656 4.43e-06 ***
## log(cylinders)      1.7036     1.6906   1.008  0.31422    
## log(displacement)  -1.0799     1.5707  -0.688  0.49217    
## log(weight)       -18.8500     1.7867 -10.550  < 2e-16 ***
## log(acceleration)   0.3410     1.1020   0.309  0.75713    
## log(year)          59.1308     3.5025  16.883  < 2e-16 ***
## log(origin)         1.3358     0.5123   2.608  0.00947 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.139 on 390 degrees of freedom
## Multiple R-squared:  0.8415, Adjusted R-squared:  0.8391 
## F-statistic: 345.2 on 6 and 390 DF,  p-value: < 2.2e-16

summary(lm.Auto.sqrt)

## 
## Call:
## lm(formula = mpg ~ sqrt(cylinders) + sqrt(displacement) + sqrt(weight) + 
##     sqrt(acceleration) + sqrt(year) + sqrt(origin), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5371 -2.0696 -0.0884  1.7341 13.0395 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -61.95588    8.01021  -7.735 8.94e-14 ***
## sqrt(cylinders)      0.07329    1.54182   0.048 0.962110    
## sqrt(displacement)   0.16085    0.22924   0.702 0.483293    
## sqrt(weight)        -0.74142    0.06558 -11.306  < 2e-16 ***
## sqrt(acceleration)   0.67239    0.59463   1.131 0.258845    
## sqrt(year)          13.41805    0.83071  16.152  < 2e-16 ***
## sqrt(origin)         2.93024    0.75371   3.888 0.000119 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.232 on 390 degrees of freedom
## Multiple R-squared:  0.832,  Adjusted R-squared:  0.8294 
## F-statistic:   322 on 6 and 390 DF,  p-value: < 2.2e-16

summary(lm.Auto.sqrd)

## 
## Call:
## lm(formula = mpg ~ (cylinders)^2 + (displacement)^2 + (weight)^2 + 
##     (acceleration)^2 + (year)^2 + (origin)^2, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5573 -2.1745 -0.0456  1.8454 12.9946 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.014e+01  4.145e+00  -4.858 1.72e-06 ***
## cylinders    -4.198e-01  3.203e-01  -1.311   0.1908    
## displacement  1.742e-02  7.189e-03   2.423   0.0158 *  
## weight       -6.928e-03  5.781e-04 -11.983  < 2e-16 ***
## acceleration  1.591e-01  7.741e-02   2.055   0.0405 *  
## year          7.703e-01  4.934e-02  15.613  < 2e-16 ***
## origin        1.356e+00  2.691e-01   5.040 7.16e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.333 on 390 degrees of freedom
## Multiple R-squared:  0.8214, Adjusted R-squared:  0.8186 
## F-statistic: 298.9 on 6 and 390 DF,  p-value: < 2.2e-16
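
The squared fit reports exactly the same coefficients as the untransformed model. Inside a formula, ^ crosses terms, so (weight)^2 reduces to weight; arithmetic squaring needs I(). A minimal sketch on the built-in mtcars data, used here as an assumed stand-in for Auto:

```r
# Without I(), ^ is a formula operator and the term collapses to wt
fit_plain <- lm(mpg ~ wt, data = mtcars)
fit_caret <- lm(mpg ~ (wt)^2, data = mtcars)
all.equal(coef(fit_plain), coef(fit_caret))

# With I(), wt^2 is evaluated arithmetically and enters as its own predictor
fit_sq <- lm(mpg ~ wt + I(wt^2), data = mtcars)
names(coef(fit_sq))
```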

plot(lm.Auto.log)

plot(lm.Auto.sqrt)

plot(lm.Auto.sqrd)

This question should be answered using the Carseats data set.

library(ISLR2)

## 
## Attaching package: 'ISLR2'

## The following object is masked _by_ '.GlobalEnv':
## 
##     Auto

head(Carseats)

##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes

## A) Fit a multiple regression model to predict Sales using Price, Urban, and US.

lm.Carseats=lm(Sales ~ Price + Urban + US, data=Carseats)
summary(lm.Carseats)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

## B) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

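## Price: each one-unit increase in the price charged for car seats is associated with a decrease in Sales of about 0.054 (Sales is measured in thousands of units, so roughly 54 units), holding Urban and US fixed. UrbanYes: stores in urban locations sell on average about 0.022 less (roughly 22 units) than rural stores, holding the other predictors fixed, but this difference is not statistically significant (p = 0.936). USYes: stores in the US sell on average about 1.2 more (roughly 1,200 units) than stores outside the US, holding the other predictors fixed.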

## C) Write out the model in equation form, being careful to handle the qualitative variables properly.

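## Sales = 13.043469 - 0.054459 * Price - 0.021916 * UrbanYes + 1.200573 * USYes + error, where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.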
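
As a worked example of how the qualitative terms enter the equation, the sketch below plugs the printed coefficients into a hypothetical helper, predict_sales. The helper name and the example store (price 100) are illustrative assumptions; Yes is coded as 1 and No as 0, matching the UrbanYes and USYes rows of the summary.

```r
# Coefficients copied from the summary output above
b0 <- 13.043469
b_price <- -0.054459
b_urban <- -0.021916
b_us <- 1.200573

# Hypothetical helper: dummy variables are 1 for "Yes" and 0 for "No"
predict_sales <- function(price, urban, us) {
  b0 + b_price * price + b_urban * (urban == "Yes") + b_us * (us == "Yes")
}

# Same price, different locations
predict_sales(100, urban = "Yes", us = "Yes")
predict_sales(100, urban = "No",  us = "No")
```

The difference between the two predictions is the sum of the UrbanYes and USYes coefficients, since the price term is the same in both.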

## D) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

## We can reject the null hypothesis H0 : βj = 0 for Price and US, since their p-values are below the chosen significance level of 0.05. We cannot reject it for Urban (p = 0.936).

## E) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome. 

lm.Carseats.assn=lm(Sales ~ Price + US, data=Carseats)
summary(lm.Carseats.assn)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

## F) How well do the models in (a) and (e) fit the data?

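## Neither model fits the data particularly well: both explain only about 24% of the variance in Sales (R-squared = 0.2393), with a residual standard error of about 2.47. The two fits are almost identical, and the smaller model in (e) has a slightly higher adjusted R-squared (0.2354 versus 0.2335), so dropping Urban costs nothing. The residual plots show a random spread and the Q-Q plots are nearly identical for the two models, so the diagnostics do not point to a serious violation of the model assumptions.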
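
Fit quality for nested models is usually compared with R-squared, adjusted R-squared, the residual standard error, and an F-test. The sketch below shows the calls on the built-in mtcars data, an assumed stand-in chosen so the example runs without the ISLR2 package.

```r
# Stand-in nested models (not the Carseats fits from the text)
full  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
small <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared never decreases when a predictor is added; adjusted R-squared can
c(full = summary(full)$r.squared, small = summary(small)$r.squared)
c(full = summary(full)$adj.r.squared, small = summary(small)$adj.r.squared)

# Residual standard error of each fit
c(full = sigma(full), small = sigma(small))

# F-test for whether the extra predictor improves the fit
anova(small, full)
```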

plot(lm.Carseats)

plot(lm.Carseats.assn)

## G) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(lm.Carseats.assn)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
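
The intervals above can be reproduced approximately from the printed estimates and standard errors as estimate ± t * SE, with 397 residual degrees of freedom. Because the inputs are rounded, the results agree with confint() only to a few decimal places.

```r
# Estimates and standard errors copied from the summary of the smaller model
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)

# Critical value of the t distribution for a 95% interval, 397 degrees of freedom
t_crit <- qt(0.975, df = 397)

# Lower and upper limits, one row per coefficient
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```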

This problem involves simple linear regression without an intercept.

## A) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

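## From (3.38), the coefficient estimate for the regression of Y onto X is sum(x_i * y_i) / sum(x_i^2), and for the regression of X onto Y it is sum(x_i * y_i) / sum(y_i^2). The two estimates are the same when the denominators are equal, that is, when sum(x_i^2) = sum(y_i^2).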
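
The closed-form slopes for regression through the origin, sum(x*y) / sum(x^2) for Y onto X and sum(x*y) / sum(y^2) for X onto Y, can be checked against lm(). The simulated data and the seed below are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the run is reproducible
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form slopes for regression without an intercept
b_y_on_x <- sum(x * y) / sum(x^2)
b_x_on_y <- sum(x * y) / sum(y^2)

# The same estimates from lm(); "+ 0" removes the intercept
c(formula = b_y_on_x, lm = unname(coef(lm(y ~ x + 0))))
c(formula = b_x_on_y, lm = unname(coef(lm(x ~ y + 0))))

# The two slopes coincide when sum(x^2) equals sum(y^2)
c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))
```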

## B) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

## Because y = 2x exactly, the coefficient estimate for Y onto X is 2, while the estimate for X onto Y is 0.5, so the two differ.

x <- rnorm(100)

y <- 2 * x

# Regression through the origin: "+ 0" removes the intercept
yce <- coef(lm(y ~ x + 0))  # coefficient estimate for Y onto X
xce <- coef(lm(x ~ y + 0))  # coefficient estimate for X onto Y

yce

xce

## C) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.


## With y = x, the sums of squares of x and y are equal, so both coefficient estimates are 1.

x <- rnorm(100)

y <- x

# Regression through the origin: "+ 0" removes the intercept
yce <- coef(lm(y ~ x + 0))  # coefficient estimate for Y onto X
xce <- coef(lm(x ~ y + 0))  # coefficient estimate for X onto Y

yce

xce

Marco Teniente

2023-02-12
