library(MASS)
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
The KNN classifier is used when the dependent variable is categorical; KNN regression is used when the dependent variable is numeric.
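As a quick illustrative sketch (not part of the assignment, and assuming the class and FNN packages are installed), the two methods differ mainly in the type of prediction they return:
# Illustrative sketch only: assumes the 'class' and 'FNN' packages are installed.
set.seed(1)
x   = matrix(rnorm(200), ncol = 2)                    # two numeric predictors
grp = factor(ifelse(x[, 1] + x[, 2] > 0, "A", "B"))   # categorical response
num = x[, 1] - 2 * x[, 2] + rnorm(100)                # numeric response
# KNN classification: class::knn() returns predicted class labels.
class::knn(train = x, test = x, cl = grp, k = 5)[1:5]
# KNN regression: FNN::knn.reg() returns numeric predictions
# (the average response of the k nearest neighbours).
FNN::knn.reg(train = x, test = x, y = num, k = 5)$pred[1:5]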
data(Auto)
Part A.
pairs(Auto)
Part B.
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Part C.
Auto$origin[Auto$origin == 1] = "American"
Auto$origin[Auto$origin == 2] = "European"
Auto$origin[Auto$origin == 3] = "Japanese"
Auto$origin = as.factor(Auto$origin)
m1 = lm(mpg ~ .-name, data = Auto)
summary(m1)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## originEuropean 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## originJapanese 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
There is a relationship between the predictors and the response: the overall model is statistically significant (F-test p < 2.2e-16), so we can conclude that mpg is related to at least some of the predictors.
Displacement has a positive significant relationship with mpg (p = 0.002), weight has a negative significant relationship with mpg (p < 2e-16), year has a positive significant relationship with mpg (p < 2e-16), Japanese origin has a positive significant relationship with mpg (p = 3.93e-07), and European origin has a positive significant relationship with mpg (p = 4.72e-06).
Holding all other predictors constant, a one-unit increase in year is associated with an increase of 0.777 in mpg.
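As a quick sanity check of this interpretation (a sketch, not required by the exercise), increasing year by one for a single car while holding everything else fixed should change the fitted mpg by exactly the year coefficient:
# Sketch: bump year by 1 for one car and compare fitted values.
newcar = Auto[1, ]                          # any row will do
bumped = newcar
bumped$year = bumped$year + 1
predict(m1, bumped) - predict(m1, newcar)   # equals coef(m1)["year"], about 0.777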
Part D.
par(mfrow = c(2,2))
plot(m1)
The Q-Q plot shows that the residuals are approximately normal. There is no clear pattern in the residuals vs. fitted plot, suggesting roughly constant variance. In the residuals vs. leverage plot, we can see a few outliers.
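To put numbers behind the plots (a sketch using standard base-R diagnostics), the flagged observations can be listed directly:
# Sketch: flag potential outliers and high-leverage observations numerically.
stud = rstudent(m1)                  # studentized residuals
lev  = hatvalues(m1)                 # leverage values
which(abs(stud) > 3)                 # possible outliers
which(lev > 2 * mean(lev))           # leverage above twice the average (rule of thumb)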
Part E.
m2_interactions = lm(mpg ~ . * ., data=Auto[,-9])
summary(m2_interactions)
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6008 -1.2863 0.0813 1.2082 12.0382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.401e+01 5.147e+01 0.855 0.393048
## cylinders 3.302e+00 8.187e+00 0.403 0.686976
## displacement -3.529e-01 1.974e-01 -1.788 0.074638 .
## horsepower 5.312e-01 3.390e-01 1.567 0.117970
## weight -3.259e-03 1.820e-02 -0.179 0.857980
## acceleration -6.048e+00 2.147e+00 -2.818 0.005109 **
## year 4.833e-01 5.923e-01 0.816 0.415119
## originEuropean -3.517e+01 1.260e+01 -2.790 0.005547 **
## originJapanese -3.765e+01 1.426e+01 -2.640 0.008661 **
## cylinders:displacement -6.316e-03 7.106e-03 -0.889 0.374707
## cylinders:horsepower 1.452e-02 2.457e-02 0.591 0.555109
## cylinders:weight 5.703e-04 9.044e-04 0.631 0.528709
## cylinders:acceleration 3.658e-01 1.671e-01 2.189 0.029261 *
## cylinders:year -1.447e-01 9.652e-02 -1.499 0.134846
## cylinders:originEuropean -7.210e-01 1.088e+00 -0.662 0.508100
## cylinders:originJapanese 1.226e+00 1.007e+00 1.217 0.224379
## displacement:horsepower -5.407e-05 2.861e-04 -0.189 0.850212
## displacement:weight 2.659e-05 1.455e-05 1.828 0.068435 .
## displacement:acceleration -2.547e-03 3.356e-03 -0.759 0.448415
## displacement:year 4.547e-03 2.446e-03 1.859 0.063842 .
## displacement:originEuropean -3.364e-02 4.220e-02 -0.797 0.425902
## displacement:originJapanese 5.375e-02 4.145e-02 1.297 0.195527
## horsepower:weight -3.407e-05 2.955e-05 -1.153 0.249743
## horsepower:acceleration -3.445e-03 3.937e-03 -0.875 0.382122
## horsepower:year -6.427e-03 3.891e-03 -1.652 0.099487 .
## horsepower:originEuropean -4.869e-03 5.061e-02 -0.096 0.923408
## horsepower:originJapanese 2.289e-02 6.252e-02 0.366 0.714533
## weight:acceleration -6.851e-05 2.385e-04 -0.287 0.774061
## weight:year -8.065e-05 2.184e-04 -0.369 0.712223
## weight:originEuropean 2.277e-03 2.685e-03 0.848 0.397037
## weight:originJapanese -4.498e-03 3.481e-03 -1.292 0.197101
## acceleration:year 6.141e-02 2.547e-02 2.412 0.016390 *
## acceleration:originEuropean 9.234e-01 2.641e-01 3.496 0.000531 ***
## acceleration:originJapanese 7.159e-01 3.258e-01 2.198 0.028614 *
## year:originEuropean 2.932e-01 1.444e-01 2.031 0.043005 *
## year:originJapanese 3.139e-01 1.483e-01 2.116 0.035034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared: 0.8967, Adjusted R-squared: 0.8866
## F-statistic: 88.34 on 35 and 356 DF, p-value: < 2.2e-16
Using p < 0.05, the significant interactions are: cylinders and acceleration, acceleration and year, acceleration and European origin, acceleration and Japanese origin, year and European origin, and year and Japanese origin.
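These can also be pulled out programmatically rather than read off the summary; a short sketch that keeps only the interaction rows with p < 0.05:
# Sketch: list interaction terms with p-values below 0.05.
coefs = summary(m2_interactions)$coefficients
inter = coefs[grepl(":", rownames(coefs)), ]       # keep only interaction rows
inter[inter[, "Pr(>|t|)"] < 0.05, c("Estimate", "Pr(>|t|)")]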
Part F.
I transformed acceleration using the log, the square root, and the square (second power).
In all three models, acceleration is statistically significant. Acceleration has a larger coefficient when it is log-transformed, although coefficients from different transformations are not on the same scale; judged by R-squared, the log transformation fits best (0.190 vs. 0.186 for the square root and 0.163 for the square), as summarized in the sketch after the three fits below.
m3log = lm(mpg ~ log(acceleration), data = Auto)
summary(m3log)
##
## Call:
## lm(formula = mpg ~ log(acceleration), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0234 -5.6231 -0.9787 4.5943 23.0872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.834 5.373 -5.180 3.56e-07 ***
## log(acceleration) 18.801 1.966 9.565 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.033 on 390 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.1879
## F-statistic: 91.49 on 1 and 390 DF, p-value: < 2.2e-16
m4sqt = lm(mpg ~ sqrt(acceleration), data = Auto)
summary(m4sqt)
##
## Call:
## lm(formula = mpg ~ sqrt(acceleration), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.034 -5.601 -1.027 4.713 23.184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.177 4.008 -3.537 0.000453 ***
## sqrt(acceleration) 9.582 1.017 9.424 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.053 on 390 degrees of freedom
## Multiple R-squared: 0.1855, Adjusted R-squared: 0.1834
## F-statistic: 88.81 on 1 and 390 DF, p-value: < 2.2e-16
m5p = lm(mpg ~ I(acceleration^2), data = Auto)
summary(m5p)
##
## Call:
## lm(formula = mpg ~ I(acceleration^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.753 -5.694 -1.432 4.978 23.238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.597505 1.077541 13.547 <2e-16 ***
## I(acceleration^2) 0.035518 0.004075 8.716 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.15 on 390 degrees of freedom
## Multiple R-squared: 0.163, Adjusted R-squared: 0.1609
## F-statistic: 75.96 on 1 and 390 DF, p-value: < 2.2e-16
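As referenced above, a short sketch collecting the R-squared values of the three transformed fits in one place:
# Sketch: compare fit across the three single-predictor transformations.
sapply(list(log = m3log, sqrt = m4sqt, squared = m5p),
       function(m) summary(m)$r.squared)
# approximately 0.190 (log), 0.186 (sqrt), 0.163 (squared), per the summaries above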
data("Carseats")
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Part A.
m6 = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(m6)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Part B.
With a one-unit increase in Price, holding all other variables constant, Sales decrease by about 0.0545 thousand units, i.e., roughly 54.46 car seats (p < 2e-16).
Stores in urban areas sell about 0.022 thousand units (roughly 21.92 car seats) fewer than stores in rural areas, holding all other variables constant; however, this difference is not statistically significant (p = 0.94).
Stores in the US sell about 1.20 thousand units (roughly 1200.57 car seats) more than stores outside the US, holding all other predictors constant (p = 4.86e-06).
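Since the Carseats documentation records Sales in thousands of units, the coefficients can be rescaled to individual car seats; a small sketch (this is the scaling used in the interpretation above):
# Sketch: Sales is recorded in thousands of units, so multiplying the
# coefficients by 1000 expresses them in individual car seats.
round(coef(m6)[-1] * 1000, 2)   # Price about -54.46, UrbanYes about -21.92, USYes about 1200.57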
Part C.
Sales = 13.0434689 + (−0.0544588) × Price + (−0.0219162) × UrbanYes + (1.2005727) × USYes + ε, where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
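The two qualitative predictors enter the model as 0/1 dummy variables, which can be checked directly from the design matrix; a brief sketch confirming that the equation above reproduces the fitted values:
# Sketch: UrbanYes and USYes are 0/1 dummy columns of the design matrix,
# so the equation above reproduces the fitted values exactly.
X = model.matrix(m6)             # columns: (Intercept), Price, UrbanYes, USYes
head(X)
all.equal(unname(X %*% coef(m6))[, 1], unname(fitted(m6)))   # TRUE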
Part D.
We can reject the null hypothesis that the coefficient is zero for Price and USYes (both p < 0.001), but not for UrbanYes (p = 0.94).
Part E.
m7 = lm(Sales ~ Price + US, data = Carseats)
summary(m7)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Part F.
Model with Price, Urban, and US: R-squared = 0.2393, Adjusted R-squared = 0.2335. This model explains about 23.35% of the variation in Sales (by adjusted R-squared).
Model with Price and US: R-squared = 0.2393, Adjusted R-squared = 0.2354. This model explains about 23.54% of the variation in Sales (by adjusted R-squared).
Both fits are modest; dropping Urban slightly improves the adjusted R-squared.
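A direct way to compare the two fits (a small sketch) is to place the adjusted R-squared values side by side and run a partial F-test for dropping Urban:
# Sketch: compare the two models directly.
c(with_Urban = summary(m6)$adj.r.squared,
  without_Urban = summary(m7)$adj.r.squared)
anova(m7, m6)    # partial F-test for adding Urban (expected to be non-significant)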
Part G.
confint(m7)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Part H.
par(mfrow = c(2,2))
plot(m7)
Looking at the residuals vs. leverage plot, we can see there are a few outliers.
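To quantify which observations stand out (a sketch using Cook's distance, a standard influence measure):
# Sketch: rank observations by Cook's distance to see how influential
# the flagged points are.
cd = cooks.distance(m7)
head(sort(cd, decreasing = TRUE), 5)
which(cd > 4 / nrow(Carseats))      # a common rough cutoff for influence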
Part A. For regression through the origin, the coefficient estimate for Y regressed on X is sum(x_i * y_i) / sum(x_i^2), and for X regressed on Y it is sum(x_i * y_i) / sum(y_i^2). The two estimates are therefore the same when the sum of the squares of the observed y-values equals the sum of the squares of the observed x-values.
Part B.
set.seed(1)
x = rnorm(100)
y = 2*x
m8 = lm(y~x+0)
m9 = lm(x~y+0)
summary(m8)
## Warning in summary.lm(m8): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.776e-16 -3.378e-17 2.680e-18 6.113e-17 5.105e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.000e+00 1.296e-17 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.167e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
summary(m9)
## Warning in summary.lm(m9): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.888e-16 -1.689e-17 1.339e-18 3.057e-17 2.552e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.24e-18 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
Part C.
set.seed(1)
x = rnorm(100)
y = -sample(x, 100)
sum(x^2)
## [1] 81.05509
sum(y^2)
## [1] 81.05509
m10 = lm(y~x+0)
m11 = lm(x~y+0)
summary(m10)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2833 -0.6945 -0.1140 0.4995 2.1665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.07768 0.10020 0.775 0.44
##
## Residual standard error: 0.9021 on 99 degrees of freedom
## Multiple R-squared: 0.006034, Adjusted R-squared: -0.004006
## F-statistic: 0.601 on 1 and 99 DF, p-value: 0.4401
summary(m11)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2182 -0.4969 0.1595 0.6782 2.4017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.07768 0.10020 0.775 0.44
##
## Residual standard error: 0.9021 on 99 degrees of freedom
## Multiple R-squared: 0.006034, Adjusted R-squared: -0.004006
## F-statistic: 0.601 on 1 and 99 DF, p-value: 0.4401