To predict Y for a given value of X, KNN regression finds the k training points closest to X and averages their responses. The KNN classifier finds the same k nearest training points but assigns the most common class among them, so it applies to qualitative responses.
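The distinction can be sketched in base R on simulated one-dimensional data. Everything named below (`x_train`, `knn_regress`, `knn_classify`, the cutoff at 5) is illustrative and not part of the exercises that follow.

```r
# Minimal base-R sketch of KNN on simulated 1-D data (all names are illustrative).
set.seed(1)
x_train   <- runif(50, 0, 10)
y_train   <- 3 * x_train + rnorm(50)               # quantitative response
cls_train <- ifelse(x_train > 5, "high", "low")    # qualitative response

# Indices of the k training points closest to x0.
knn_neighbors <- function(x0, x, k) order(abs(x - x0))[1:k]

# Regression: average the responses of the k nearest neighbours.
knn_regress <- function(x0, x, y, k) mean(y[knn_neighbors(x0, x, k)])

# Classification: majority vote among the k nearest neighbours.
knn_classify <- function(x0, x, cls, k) {
  votes <- table(cls[knn_neighbors(x0, x, k)])
  names(votes)[which.max(votes)]
}

pred_reg <- knn_regress(7, x_train, y_train, k = 5)
pred_cls <- knn_classify(7, x_train, cls_train, k = 5)
```

Using an odd k with two classes avoids tied votes in the classifier.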
Auto <- read.csv("C:/Users/dpmar/OneDrive/Documents/R/DATA/Auto.csv", stringsAsFactors = TRUE)
## A): Produce a scatterplot matrix which includes all of the variables in the dataset.
pairs(Auto)
## B): Compute the matrix of correlations between the variables using the function cor().
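Auto <- Auto[, -9]    # drop column 9 (name)
AutoTF <- Auto[, -4]  # drop column 4 (horsepower) so cor() receives only the columns shown below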
cor(AutoTF)
## mpg cylinders displacement weight acceleration
## mpg 1.0000000 -0.7762599 -0.8044430 -0.8317389 0.4222974
## cylinders -0.7762599 1.0000000 0.9509199 0.8970169 -0.5040606
## displacement -0.8044430 0.9509199 1.0000000 0.9331044 -0.5441618
## weight -0.8317389 0.8970169 0.9331044 1.0000000 -0.4195023
## acceleration 0.4222974 -0.5040606 -0.5441618 -0.4195023 1.0000000
## year 0.5814695 -0.3467172 -0.3698041 -0.3079004 0.2829009
## origin 0.5636979 -0.5649716 -0.6106643 -0.5812652 0.2100836
## year origin
## mpg 0.5814695 0.5636979
## cylinders -0.3467172 -0.5649716
## displacement -0.3698041 -0.6106643
## weight -0.3079004 -0.5812652
## acceleration 0.2829009 0.2100836
## year 1.0000000 0.1843141
## origin 0.1843141 1.0000000
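## The column drops above are by position. A sketch of an alternative that selects numeric columns by type, shown on a small made-up data frame (the name toy and its values are hypothetical, not taken from Auto.csv):

```r
# Hypothetical toy data: two numeric columns and one factor column.
toy <- data.frame(
  mpg    = c(18, 15, 26, 31, 22),
  weight = c(3504, 3693, 2234, 1985, 2830),
  name   = factor(c("a", "b", "c", "d", "e"))
)

# Keep only the columns cor() can handle, without hard-coding positions.
num_cols <- vapply(toy, is.numeric, logical(1))
cor_toy  <- cor(toy[, num_cols])
round(cor_toy, 3)
```

## Selecting by type keeps working if the column order of the input file changes.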
## C): Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
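## The F-statistic (298.9, p-value < 2.2e-16) indicates a relationship between the predictors and the response. Weight, year, and origin are the most significant predictors (p < 0.001); displacement and acceleration are significant at the 0.05 level, and cylinders is not significant. The year coefficient (0.77) suggests that mpg increases by about 0.77 per model year, holding the other predictors fixed. Note that horsepower is not included in this fit.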
lm.Auto=lm(mpg~cylinders + displacement + weight + acceleration + year + origin, data=Auto)
summary(lm.Auto)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + weight + acceleration +
## year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5573 -2.1745 -0.0456 1.8454 12.9946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.014e+01 4.145e+00 -4.858 1.72e-06 ***
## cylinders -4.198e-01 3.203e-01 -1.311 0.1908
## displacement 1.742e-02 7.189e-03 2.423 0.0158 *
## weight -6.928e-03 5.781e-04 -11.983 < 2e-16 ***
## acceleration 1.591e-01 7.741e-02 2.055 0.0405 *
## year 7.703e-01 4.934e-02 15.613 < 2e-16 ***
## origin 1.356e+00 2.691e-01 5.040 7.16e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.333 on 390 degrees of freedom
## Multiple R-squared: 0.8214, Adjusted R-squared: 0.8186
## F-statistic: 298.9 on 6 and 390 DF, p-value: < 2.2e-16
## D) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit.
## The residuals-vs-fitted plot shows the spread of the residuals growing as the fitted values increase, which suggests non-constant variance. The leverage plot also identifies 3 observations with higher than average leverage.
plot(lm.Auto)
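## A sketch of how high-leverage points might be flagged numerically, using simulated data rather than the Auto fit. The cutoff of twice the average leverage is a common rule of thumb and is an assumption here, as are the names fit, lev, and high.

```r
set.seed(2)
n <- 100
x <- c(rnorm(n - 1), 8)                 # one deliberately extreme predictor value
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

lev     <- hatvalues(fit)               # leverage of each observation
avg_lev <- length(coef(fit)) / n        # average leverage is (p + 1) / n
high    <- which(lev > 2 * avg_lev)     # rule-of-thumb flag (assumption)
high
```

## Observation 100, the one with the extreme x value, should appear among the flagged indices.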
## E) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
## All five tested interaction terms are statistically significant (p < 0.001). This model contains only the : product terms and no main effects.
lm.Auto.Inter=lm(mpg~cylinders:displacement + displacement:weight + weight:acceleration + acceleration:year + year:origin, data=Auto)
summary(lm.Auto.Inter)
##
## Call:
## lm(formula = mpg ~ cylinders:displacement + displacement:weight +
## weight:acceleration + acceleration:year + year:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9192 -2.1888 -0.0468 1.7326 12.4489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.256e+01 1.500e+00 8.372 1.02e-15 ***
## cylinders:displacement -3.464e-03 1.033e-03 -3.352 0.000881 ***
## displacement:weight 1.055e-05 2.228e-06 4.738 3.03e-06 ***
## weight:acceleration -5.898e-04 3.409e-05 -17.301 < 2e-16 ***
## acceleration:year 2.817e-02 1.624e-03 17.344 < 2e-16 ***
## year:origin 1.297e-02 3.454e-03 3.754 0.000201 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.386 on 391 degrees of freedom
## Multiple R-squared: 0.8152, Adjusted R-squared: 0.8128
## F-statistic: 344.9 on 5 and 391 DF, p-value: < 2.2e-16
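## A short sketch of the difference between the two formula operators, on simulated data (the data frame d and its columns a, b, y are hypothetical): a:b adds only the product term, while a * b expands to a + b + a:b.

```r
set.seed(3)
d <- data.frame(a = rnorm(200), b = rnorm(200))
d$y <- 1 + 2 * d$a - d$b + 0.5 * d$a * d$b + rnorm(200)

fit_colon <- lm(y ~ a:b, data = d)    # product term only, no main effects
fit_star  <- lm(y ~ a * b, data = d)  # expands to a + b + a:b

names(coef(fit_colon))
names(coef(fit_star))
```

## Fitting with * keeps the main effects in the model alongside the interaction.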
## F) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.
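## Under the log and sqrt transformations, weight, year, and origin remain significant, while displacement and acceleration lose significance. R-squared rises slightly from 0.8214 (untransformed) to 0.8415 (log) and 0.832 (sqrt). The "squared" fit below reproduces the untransformed model exactly, because ^ inside a formula is a formula operator rather than arithmetic; squaring a predictor requires I(), e.g. I(weight^2).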
lm.Auto.log=lm(mpg~log(cylinders) + log(displacement) + log(weight) + log(acceleration) + log(year) + log(origin), data=Auto)
lm.Auto.sqrt=lm(mpg~sqrt(cylinders) + sqrt(displacement) + sqrt(weight) + sqrt(acceleration) + sqrt(year) + sqrt(origin), data=Auto)
lm.Auto.sqrd=lm(mpg~(cylinders)^2 + (displacement)^2 + (weight)^2 + (acceleration)^2 + (year)^2 + (origin)^2, data=Auto)
summary(lm.Auto.log)
##
## Call:
## lm(formula = mpg ~ log(cylinders) + log(displacement) + log(weight) +
## log(acceleration) + log(year) + log(origin), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7552 -2.0393 -0.0793 1.6484 12.9417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -81.1712 17.4335 -4.656 4.43e-06 ***
## log(cylinders) 1.7036 1.6906 1.008 0.31422
## log(displacement) -1.0799 1.5707 -0.688 0.49217
## log(weight) -18.8500 1.7867 -10.550 < 2e-16 ***
## log(acceleration) 0.3410 1.1020 0.309 0.75713
## log(year) 59.1308 3.5025 16.883 < 2e-16 ***
## log(origin) 1.3358 0.5123 2.608 0.00947 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.139 on 390 degrees of freedom
## Multiple R-squared: 0.8415, Adjusted R-squared: 0.8391
## F-statistic: 345.2 on 6 and 390 DF, p-value: < 2.2e-16
summary(lm.Auto.sqrt)
##
## Call:
## lm(formula = mpg ~ sqrt(cylinders) + sqrt(displacement) + sqrt(weight) +
## sqrt(acceleration) + sqrt(year) + sqrt(origin), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5371 -2.0696 -0.0884 1.7341 13.0395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -61.95588 8.01021 -7.735 8.94e-14 ***
## sqrt(cylinders) 0.07329 1.54182 0.048 0.962110
## sqrt(displacement) 0.16085 0.22924 0.702 0.483293
## sqrt(weight) -0.74142 0.06558 -11.306 < 2e-16 ***
## sqrt(acceleration) 0.67239 0.59463 1.131 0.258845
## sqrt(year) 13.41805 0.83071 16.152 < 2e-16 ***
## sqrt(origin) 2.93024 0.75371 3.888 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.232 on 390 degrees of freedom
## Multiple R-squared: 0.832, Adjusted R-squared: 0.8294
## F-statistic: 322 on 6 and 390 DF, p-value: < 2.2e-16
summary(lm.Auto.sqrd)
##
## Call:
## lm(formula = mpg ~ (cylinders)^2 + (displacement)^2 + (weight)^2 +
## (acceleration)^2 + (year)^2 + (origin)^2, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5573 -2.1745 -0.0456 1.8454 12.9946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.014e+01 4.145e+00 -4.858 1.72e-06 ***
## cylinders -4.198e-01 3.203e-01 -1.311 0.1908
## displacement 1.742e-02 7.189e-03 2.423 0.0158 *
## weight -6.928e-03 5.781e-04 -11.983 < 2e-16 ***
## acceleration 1.591e-01 7.741e-02 2.055 0.0405 *
## year 7.703e-01 4.934e-02 15.613 < 2e-16 ***
## origin 1.356e+00 2.691e-01 5.040 7.16e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.333 on 390 degrees of freedom
## Multiple R-squared: 0.8214, Adjusted R-squared: 0.8186
## F-statistic: 298.9 on 6 and 390 DF, p-value: < 2.2e-16
plot(lm.Auto.log)
plot(lm.Auto.sqrt)
plot(lm.Auto.sqrd)
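## A minimal sketch, on simulated data, of how a squared term can be entered with I(); the data frame d and its columns are hypothetical.

```r
set.seed(4)
d <- data.frame(x = runif(100, 1, 10))
d$y <- 2 + 0.5 * d$x^2 + rnorm(100)

fit_plain <- lm(y ~ x, data = d)
fit_caret <- lm(y ~ (x)^2, data = d)       # formula operator: same model as y ~ x
fit_sq    <- lm(y ~ I(x^2), data = d)      # arithmetic square of x
fit_poly  <- lm(y ~ x + I(x^2), data = d)  # linear and squared terms together

all.equal(coef(fit_plain), coef(fit_caret))
names(coef(fit_sq))
```

## The first comparison shows that (x)^2 without I() leaves the model unchanged.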
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
## A) Fit a multiple regression model to predict Sales using Price, Urban, and US.
lm.Carseats=lm(Sales ~ Price + Urban + US, data=Carseats)
summary(lm.Carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
## B) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
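## Price (-0.054): a one-unit increase in the price the company charges is associated with a decrease in Sales of about 0.054, i.e. roughly 54 units since Sales is recorded in thousands, holding Urban and US fixed. UrbanYes (-0.022): stores in urban locations sell about 22 fewer units than rural stores on average, but this difference is not statistically significant (p = 0.936). USYes (1.201): stores in the US sell about 1,201 more units on average than stores outside the US, holding the other predictors fixed.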
## C) Write out the model in equation form, being careful to handle the qualitative variables properly.
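## Y = 13.043469 - 0.054459x1 - 0.021916x2 + 1.200573x3 + Error, where Y = Sales, x1 = Price, x2 = 1 if Urban is Yes and 0 otherwise, and x3 = 1 if US is Yes and 0 otherwise.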
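## A sketch of how R turns Yes/No factors into 0/1 dummy columns, which is what the UrbanYes and USYes coefficients multiply. The four-row data frame stores is hypothetical.

```r
# Hypothetical four-store example showing how R codes Yes/No factors.
stores <- data.frame(
  Price = c(120, 83, 80, 97),
  Urban = factor(c("Yes", "Yes", "No", "No")),
  US    = factor(c("Yes", "No", "Yes", "No"))
)

# Design matrix that lm() would build for Sales ~ Price + Urban + US.
X <- model.matrix(~ Price + Urban + US, data = stores)
X
```

## The baseline level of each factor ("No") is absorbed into the intercept.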
## D) For which of the predictors can you reject the null hypothesis H0 : βj = 0?
## We can reject H0 : βj = 0 for Price and US, whose p-values are far below the designated alpha of 0.05. We cannot reject it for Urban (p = 0.936).
## E) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.Carseats.assn=lm(Sales ~ Price + US, data=Carseats)
summary(lm.Carseats.assn)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
## F) How well do the models in (a) and (e) fit the data?
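## Neither model fits the data particularly well: both have an R-squared of 0.2393, so they explain only about 24% of the variance in Sales. The model in (e) is marginally better, with a slightly lower residual standard error (2.469 vs. 2.472) and a higher adjusted R-squared (0.2354 vs. 0.2335). The residual plots show a random spread, and the Q-Q plots of the two models are almost identical.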
plot(lm.Carseats)
plot(lm.Carseats.assn)
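## A sketch of how the fit of two nested models might be compared numerically. The data are simulated to loosely resemble the setting above; the data frame d, its columns, and the helper fit_stats are hypothetical.

```r
set.seed(5)
n <- 400
d <- data.frame(
  price = rnorm(n, 115, 24),
  urban = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  us    = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
d$sales <- 13 - 0.05 * d$price + 1.2 * (d$us == "Yes") + rnorm(n, sd = 2.5)

full    <- lm(sales ~ price + urban + us, data = d)
reduced <- lm(sales ~ price + us, data = d)

# Collect R-squared, adjusted R-squared, and residual standard error.
fit_stats <- function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, rse = s$sigma)
}

rbind(full = fit_stats(full), reduced = fit_stats(reduced))
anova(reduced, full)  # F-test for the extra term in the full model
```

## R-squared can never decrease when a predictor is added, so adjusted R-squared and the F-test are the more informative comparisons.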
## G) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(lm.Carseats.assn)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
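## A sketch of what confint() computes for a linear model: estimate plus or minus a t critical value times the standard error. The data and the names fit, est, se, and manual are hypothetical.

```r
set.seed(6)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

est   <- coef(summary(fit))[, "Estimate"]
se    <- coef(summary(fit))[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))  # two-sided 95% critical value

manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
all.equal(unname(manual), unname(confint(fit)))
```

## An interval that excludes 0 corresponds to rejecting H0 : βj = 0 at the 0.05 level.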
## A) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
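## The coefficient estimate for the regression of X onto Y is the same as for Y onto X when the sum of the squared x values equals the sum of the squared y values, i.e. sum(x_i^2) = sum(y_i^2). The two estimates share the numerator sum(x_i * y_i) and differ only in the denominator.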
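## A sketch that checks this numerically. Without an intercept, the least-squares slope of Y onto X is sum(x*y)/sum(x^2), and that of X onto Y is sum(x*y)/sum(y^2), so the two agree whenever sum(x^2) equals sum(y^2). Taking y as a permutation of x is one assumed way to satisfy this without setting y equal to x.

```r
set.seed(7)
x <- rnorm(100)
y <- sample(x)  # a permutation of x: different pairing, same sum of squares

slope_yx <- sum(x * y) / sum(x^2)  # Y onto X, no intercept
slope_xy <- sum(x * y) / sum(y^2)  # X onto Y, no intercept

c(slope_yx, slope_xy)
all.equal(slope_yx, unname(coef(lm(y ~ x + 0))))
```

## The last line confirms that the closed-form slope matches what lm() returns for the no-intercept fit.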
## B) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
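x <- rnorm(100)
y <- 2 * x  # sum(y^2) = 4 * sum(x^2), so the two slopes must differ
b.yx <- coef(lm(y ~ x + 0))  # slope of Y onto X, no intercept: 2
b.xy <- coef(lm(x ~ y + 0))  # slope of X onto Y, no intercept: 0.5
xce <- 2 * y
yce <- y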
xce
## [1] -1.50307484 -4.22910319 0.06350131 -0.54890003 5.85909429
## [6] -5.14578863 2.09898368 -4.68057257 -0.70744727 -0.09645434
## [11] 0.86808516 -4.03948203 2.59265150 -0.22308245 -4.73432461
## [16] -1.83899245 -4.51019057 -0.96246875 -3.84508468 -4.34804540
## [21] 2.22319322 5.50961202 4.89665203 11.18010811 1.93451732
## [26] 3.49608733 0.22258607 1.65060242 -0.44507075 4.00837371
## [31] -0.94250301 -5.96891023 -8.50616633 -1.40479666 1.78677852
## [36] -3.03831064 0.49820551 -3.51849927 2.60075162 7.16007958
## [41] -4.28132278 -0.84913458 8.02047925 -1.51097397 -1.02192414
## [46] -3.16278335 3.13050860 4.63854987 3.03882096 -6.49240493
## [51] -2.54267442 4.80837729 -6.53397278 -1.27966204 -0.50017737
## [56] 4.37518883 1.40526365 2.41950947 -0.76203112 7.86800633
## [61] -2.22969000 2.57050199 -1.38091789 -4.30443762 2.11303876
## [66] 6.10986641 2.12802633 4.55904760 3.29844180 -4.46351765
## [71] -2.24872680 2.71581262 -4.91537967 -5.45372254 -2.59457958
## [76] -6.26998979 -11.62183260 2.69952142 2.68584179 6.21192082
## [81] 3.70936245 -3.93907650 0.06709711 -8.30639946 0.59407051
## [86] 1.92046926 -4.04635901 -3.20564527 -2.58792010 3.73235774
## [91] 3.70973099 1.07294758 2.27745545 1.05110334 -5.88034577
## [96] 6.72669967 -3.41422173 1.23711287 -4.20459905 0.61937218
yce
## [1] -0.75153742 -2.11455160 0.03175065 -0.27445002 2.92954714 -2.57289431
## [7] 1.04949184 -2.34028629 -0.35372364 -0.04822717 0.43404258 -2.01974102
## [13] 1.29632575 -0.11154122 -2.36716231 -0.91949622 -2.25509529 -0.48123437
## [19] -1.92254234 -2.17402270 1.11159661 2.75480601 2.44832601 5.59005406
## [25] 0.96725866 1.74804366 0.11129304 0.82530121 -0.22253538 2.00418686
## [31] -0.47125150 -2.98445511 -4.25308316 -0.70239833 0.89338926 -1.51915532
## [37] 0.24910275 -1.75924964 1.30037581 3.58003979 -2.14066139 -0.42456729
## [43] 4.01023962 -0.75548698 -0.51096207 -1.58139167 1.56525430 2.31927493
## [49] 1.51941048 -3.24620247 -1.27133721 2.40418864 -3.26698639 -0.63983102
## [55] -0.25008868 2.18759442 0.70263183 1.20975474 -0.38101556 3.93400316
## [61] -1.11484500 1.28525100 -0.69045894 -2.15221881 1.05651938 3.05493321
## [67] 1.06401317 2.27952380 1.64922090 -2.23175883 -1.12436340 1.35790631
## [73] -2.45768984 -2.72686127 -1.29728979 -3.13499490 -5.81091630 1.34976071
## [79] 1.34292090 3.10596041 1.85468123 -1.96953825 0.03354855 -4.15319973
## [85] 0.29703525 0.96023463 -2.02317950 -1.60282263 -1.29396005 1.86617887
## [91] 1.85486550 0.53647379 1.13872772 0.52555167 -2.94017289 3.36334984
## [97] -1.70711086 0.61855643 -2.10229952 0.30968609
## C) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
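x <- rnorm(100)
y <- x  # identical vectors, so sum(x^2) equals sum(y^2)
b.yx.same <- coef(lm(y ~ x + 0))  # slope of Y onto X, no intercept: 1
b.xy.same <- coef(lm(x ~ y + 0))  # slope of X onto Y, no intercept: 1
xce <- y
yce <- y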
xce
## [1] 1.40258684 -0.09232460 -1.13643722 0.32131762 -0.51444968 0.23753795
## [7] 2.24061324 0.09905938 1.90533667 -1.26252965 -0.86713627 -1.32077319
## [13] -0.99143782 -0.43549833 0.45280918 -0.10051682 -0.86594593 1.69724393
## [19] -2.01516273 0.15700135 0.37972247 0.06229742 1.38766604 -1.24720761
## [25] 2.19178269 -0.37934698 -1.58664114 -1.23418136 -1.71844607 -0.31692126
## [31] -2.39415199 -0.13636846 -0.73088830 0.48139233 0.30320475 -0.33252151
## [37] 0.53626035 -0.13852238 -0.81673783 -0.54330717 -0.76116166 -0.47306687
## [43] 0.42727005 -0.48477424 0.96924812 1.06695495 -0.83902122 0.00577806
## [49] -0.33072203 -1.81645637 0.43283737 -0.20746934 0.62648646 0.38930720
## [55] -1.28285860 -0.69284443 0.97732377 0.12086957 0.63127328 -1.95948097
## [61] -0.64504028 -0.05091389 1.64951163 0.11958931 0.85366946 -0.54343614
## [67] 1.57336820 1.04466815 0.84866483 -0.04265028 -0.09827115 0.04776925
## [73] -0.92800285 1.76399295 0.07133766 -0.99621124 1.53986861 1.31406265
## [79] 0.70452274 0.72759971 0.24305432 -0.81293438 -0.65339727 -1.90879716
## [85] 0.15890392 -1.90955204 -0.24802290 1.31146723 -0.02321760 -0.88757668
## [91] 0.23403576 1.08076592 1.71136399 0.25142325 0.11611089 -0.33394041
## [97] -0.31354320 0.26563564 0.53627070 -1.64660419
yce
## [1] 1.40258684 -0.09232460 -1.13643722 0.32131762 -0.51444968 0.23753795
## [7] 2.24061324 0.09905938 1.90533667 -1.26252965 -0.86713627 -1.32077319
## [13] -0.99143782 -0.43549833 0.45280918 -0.10051682 -0.86594593 1.69724393
## [19] -2.01516273 0.15700135 0.37972247 0.06229742 1.38766604 -1.24720761
## [25] 2.19178269 -0.37934698 -1.58664114 -1.23418136 -1.71844607 -0.31692126
## [31] -2.39415199 -0.13636846 -0.73088830 0.48139233 0.30320475 -0.33252151
## [37] 0.53626035 -0.13852238 -0.81673783 -0.54330717 -0.76116166 -0.47306687
## [43] 0.42727005 -0.48477424 0.96924812 1.06695495 -0.83902122 0.00577806
## [49] -0.33072203 -1.81645637 0.43283737 -0.20746934 0.62648646 0.38930720
## [55] -1.28285860 -0.69284443 0.97732377 0.12086957 0.63127328 -1.95948097
## [61] -0.64504028 -0.05091389 1.64951163 0.11958931 0.85366946 -0.54343614
## [67] 1.57336820 1.04466815 0.84866483 -0.04265028 -0.09827115 0.04776925
## [73] -0.92800285 1.76399295 0.07133766 -0.99621124 1.53986861 1.31406265
## [79] 0.70452274 0.72759971 0.24305432 -0.81293438 -0.65339727 -1.90879716
## [85] 0.15890392 -1.90955204 -0.24802290 1.31146723 -0.02321760 -0.88757668
## [91] 0.23403576 1.08076592 1.71136399 0.25142325 0.11611089 -0.33394041
## [97] -0.31354320 0.26563564 0.53627070 -1.64660419