# Loading in dependencies
pacman::p_load(ISLR2)
The K-Nearest Neighbors (KNN) classifier is used for classification tasks, where the goal is to predict a discrete class or label. KNN regression is used for regression tasks, where the goal is to predict a continuous value.
With a KNN classifier and k = 3, suppose the three nearest neighbors (the training points closest to the point being predicted) have classes [A, B, A]. The predicted class is A, since it appears twice and wins the majority vote. With KNN regression and k = 3, if the nearest neighbors have values [5, 7, 6], the predicted value is their mean, (5 + 7 + 6) / 3 = 6.
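The two prediction rules can be sketched in a few lines of base R. The labels and values are the ones from the example above; `neighbor_labels` and `neighbor_values` are illustrative names, and the neighbors are assumed to have been found already (no distance computation is shown).

```r
# Hypothetical neighbors, assumed already selected by distance (k = 3)
neighbor_labels <- c("A", "B", "A")   # classes of the 3 nearest neighbors
neighbor_values <- c(5, 7, 6)         # responses of the 3 nearest neighbors

# KNN classification: majority vote among the neighbors
votes <- table(neighbor_labels)
predicted_class <- names(votes)[which.max(votes)]

# KNN regression: average of the neighbors' responses
predicted_value <- mean(neighbor_values)

predicted_class   # "A"
predicted_value   # 6
```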
# Reading in data
auto = read.csv('Auto.csv', na.strings = '?', stringsAsFactors = TRUE)
pairs(auto)
cor(Filter(is.numeric, na.omit(auto)))
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Is there a relationship between the predictors and the response? Yes. The overall F-statistic is 252.4 on 7 and 384 degrees of freedom with a p-value below 2.2e-16, so we reject the null hypothesis that all slope coefficients are zero. The model's R^2 is 0.8215, meaning it explains about 82% of the variance in mpg (the adjusted R^2 is 0.8182).
Which predictors appear to have a statistically significant relationship to the response? Four do at the 0.05 level: displacement, weight, year, and origin, with p-values of 0.00844, < 2e-16, < 2e-16, and 4.67e-07 respectively.
What does the coefficient for the year variable suggest? Holding the other predictors fixed, each additional model year is associated with an average increase of about 0.75 mpg, so newer cars tend to be more fuel efficient.
linear_model = lm(mpg ~ . -name, data=auto)
summary(linear_model)
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
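The overall F-test and the two R^2 values can be pulled out of the summary object instead of being read off the printout. This is a minimal sketch that uses the built-in mtcars data as a stand-in (an assumption, since Auto.csv may not be available); `fit` is an illustrative name, and the same calls should work on the Auto model.

```r
# Stand-in data (assumption): built-in mtcars instead of Auto.csv
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

# Overall F-test: the null hypothesis is that all slope coefficients are zero
f <- s$fstatistic   # named vector with elements value, numdf, dendf
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Share of variance explained, with and without the penalty for model size
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

c(overall_p = overall_p, r2 = r2, adj_r2 = adj_r2)
```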
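Based on the Residuals vs Leverage plot, some points lie far from the bulk of the data, which indicates there might be some influential points. The plot labels three observations: # 327, 394, and 14. R labels the points with the largest Cook's distance, so these are the strongest candidates for influential observations rather than necessarily the ones with the highest leverage. The red line is a smoothed trend through the residuals, not the center of the data.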
par(mfrow = c(2, 2))
plot(linear_model)
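Leverage and influence can also be checked numerically rather than by eye. The sketch below uses the built-in mtcars data as a stand-in (an assumption); the 2 * p / n cutoff is a common rule of thumb, not a formal test.

```r
# Stand-in data (assumption): built-in mtcars instead of Auto.csv
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

h <- hatvalues(fit)        # leverage of each observation
d <- cooks.distance(fit)   # influence of each observation on the fit

p <- length(coef(fit))     # number of coefficients, intercept included
n <- nobs(fit)

# Rule of thumb: leverage above twice the average leverage (p / n)
high_leverage <- names(h)[h > 2 * p / n]

# The three observations with the largest Cook's distance
top_cook <- names(sort(d, decreasing = TRUE))[1:3]

high_leverage
top_cook
```

The two lists need not coincide: a point can have high leverage and still sit close to the fitted surface, in which case its Cook's distance stays small.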
Of the interaction terms kept in the stepwise model, 14 are statistically significant at the
0.05 significance level. These are weight:acceleration,
cylinders:weight:acceleration,
displacement:weight:acceleration,
horsepower:weight:acceleration,
weight:acceleration:year,
horsepower:acceleration:origin,
horsepower:year:origin,
displacement:horsepower:weight:acceleration,
cylinders:horsepower:acceleration:year,
displacement:weight:acceleration:year,
displacement:weight:acceleration:origin,
horsepower:weight:acceleration:origin,
displacement:acceleration:year:origin, and
displacement:horsepower:weight:acceleration:origin.
stepwise_model = step(lm(mpg ~ cylinders * displacement * horsepower * weight * acceleration * year * origin, data=auto), direction= 'both', trace=0)
summary(stepwise_model)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin + cylinders:displacement + cylinders:horsepower +
## displacement:horsepower + cylinders:weight + displacement:weight +
## horsepower:weight + cylinders:acceleration + displacement:acceleration +
## horsepower:acceleration + weight:acceleration + cylinders:year +
## displacement:year + horsepower:year + weight:year + acceleration:year +
## cylinders:origin + displacement:origin + horsepower:origin +
## weight:origin + acceleration:origin + year:origin + cylinders:displacement:horsepower +
## cylinders:displacement:weight + cylinders:horsepower:weight +
## displacement:horsepower:weight + cylinders:displacement:acceleration +
## cylinders:horsepower:acceleration + displacement:horsepower:acceleration +
## cylinders:weight:acceleration + displacement:weight:acceleration +
## horsepower:weight:acceleration + cylinders:displacement:year +
## cylinders:horsepower:year + displacement:horsepower:year +
## cylinders:weight:year + displacement:weight:year + horsepower:weight:year +
## cylinders:acceleration:year + displacement:acceleration:year +
## horsepower:acceleration:year + weight:acceleration:year +
## cylinders:displacement:origin + cylinders:horsepower:origin +
## displacement:horsepower:origin + cylinders:weight:origin +
## displacement:weight:origin + horsepower:weight:origin + cylinders:acceleration:origin +
## displacement:acceleration:origin + horsepower:acceleration:origin +
## weight:acceleration:origin + cylinders:year:origin + displacement:year:origin +
## horsepower:year:origin + weight:year:origin + acceleration:year:origin +
## cylinders:displacement:horsepower:weight + cylinders:displacement:horsepower:acceleration +
## cylinders:displacement:weight:acceleration + cylinders:horsepower:weight:acceleration +
## displacement:horsepower:weight:acceleration + cylinders:displacement:horsepower:year +
## cylinders:displacement:weight:year + cylinders:horsepower:weight:year +
## displacement:horsepower:weight:year + cylinders:displacement:acceleration:year +
## cylinders:horsepower:acceleration:year + displacement:horsepower:acceleration:year +
## displacement:weight:acceleration:year + horsepower:weight:acceleration:year +
## cylinders:displacement:horsepower:origin + cylinders:displacement:weight:origin +
## cylinders:horsepower:weight:origin + displacement:horsepower:weight:origin +
## cylinders:displacement:acceleration:origin + displacement:horsepower:acceleration:origin +
## cylinders:weight:acceleration:origin + displacement:weight:acceleration:origin +
## horsepower:weight:acceleration:origin + cylinders:displacement:year:origin +
## cylinders:horsepower:year:origin + cylinders:weight:year:origin +
## horsepower:weight:year:origin + cylinders:acceleration:year:origin +
## displacement:acceleration:year:origin + cylinders:displacement:horsepower:weight:acceleration +
## cylinders:displacement:horsepower:weight:year + cylinders:displacement:horsepower:acceleration:year +
## displacement:horsepower:weight:acceleration:year + cylinders:displacement:horsepower:weight:origin +
## displacement:horsepower:weight:acceleration:origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6216 -1.0616 -0.0128 0.9286 9.2300
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 3.398e+04 2.742e+04
## cylinders -1.053e+04 6.739e+03
## displacement -9.496e+02 6.373e+02
## horsepower -8.256e+02 5.865e+02
## weight 3.993e+01 2.470e+01
## acceleration -5.148e+03 3.668e+03
## year 7.021e+02 3.951e+02
## origin -3.909e+04 2.641e+04
## cylinders:displacement 2.479e+02 1.590e+02
## cylinders:horsepower 2.234e+02 1.463e+02
## displacement:horsepower 5.765e+00 3.828e+00
## cylinders:weight -9.089e+00 6.186e+00
## displacement:weight 1.188e-01 9.094e-02
## horsepower:weight 2.286e-01 1.792e-01
## cylinders:acceleration 1.375e+03 9.175e+02
## displacement:acceleration 5.857e+00 4.567e+00
## horsepower:acceleration -2.529e-01 1.718e+00
## weight:acceleration -2.081e-01 6.665e-02
## cylinders:year -1.539e+02 9.982e+01
## displacement:year 4.605e+00 3.350e+00
## horsepower:year -9.157e-01 6.448e-01
## weight:year -7.774e-01 4.972e-01
## acceleration:year 5.603e+01 4.002e+01
## cylinders:origin 1.037e+04 6.633e+03
## displacement:origin 9.782e+02 6.328e+02
## horsepower:origin 8.746e+02 5.799e+02
## weight:origin -3.593e+01 2.491e+01
## acceleration:origin 5.123e+03 3.656e+03
## year:origin -6.564e+02 4.006e+02
## cylinders:displacement:horsepower -1.518e+00 9.565e-01
## cylinders:displacement:weight -3.340e-02 2.261e-02
## cylinders:horsepower:weight -6.587e-02 4.462e-02
## displacement:horsepower:weight -1.598e-03 1.144e-03
## cylinders:displacement:acceleration -2.054e+00 1.125e+00
## cylinders:horsepower:acceleration -6.146e-01 3.540e-01
## displacement:horsepower:acceleration 8.598e-03 1.261e-02
## cylinders:weight:acceleration 1.527e-02 7.111e-03
## displacement:weight:acceleration 8.317e-04 2.847e-04
## horsepower:weight:acceleration 1.657e-03 7.240e-04
## cylinders:displacement:year -1.257e+00 8.279e-01
## cylinders:horsepower:year 7.418e-02 1.002e-01
## displacement:horsepower:year 2.044e-03 3.333e-03
## cylinders:weight:year 1.850e-01 1.249e-01
## displacement:weight:year 1.953e-04 1.012e-04
## horsepower:weight:year 4.711e-04 2.620e-04
## cylinders:acceleration:year -1.486e+01 9.970e+00
## displacement:acceleration:year 8.143e-03 1.521e-02
## horsepower:acceleration:year -2.020e-02 1.858e-02
## weight:acceleration:year 1.354e-03 6.612e-04
## cylinders:displacement:origin -2.481e+02 1.586e+02
## cylinders:horsepower:origin -2.237e+02 1.455e+02
## displacement:horsepower:origin -5.994e+00 3.785e+00
## cylinders:weight:origin 8.770e+00 6.215e+00
## displacement:weight:origin -1.365e-01 8.911e-02
## horsepower:weight:origin -2.684e-01 1.764e-01
## cylinders:acceleration:origin -1.306e+03 9.171e+02
## displacement:acceleration:origin -6.365e+00 4.359e+00
## horsepower:acceleration:origin 7.977e-01 3.847e-01
## weight:acceleration:origin 6.408e-02 3.311e-02
## cylinders:year:origin 1.593e+02 1.004e+02
## displacement:year:origin -4.888e+00 3.284e+00
## horsepower:year:origin 5.032e-01 2.315e-01
## weight:year:origin 7.339e-01 5.007e-01
## acceleration:year:origin -5.441e+01 3.984e+01
## cylinders:displacement:horsepower:weight 4.356e-04 2.864e-04
## cylinders:displacement:horsepower:acceleration 1.973e-03 1.773e-03
## cylinders:displacement:weight:acceleration -2.318e-05 1.303e-05
## cylinders:horsepower:weight:acceleration -5.812e-05 3.006e-05
## displacement:horsepower:weight:acceleration -8.167e-06 2.890e-06
## cylinders:displacement:horsepower:year 6.352e-05 4.238e-04
## cylinders:displacement:weight:year -1.416e-05 1.086e-05
## cylinders:horsepower:weight:year -3.530e-05 2.493e-05
## displacement:horsepower:weight:year -1.685e-06 1.035e-06
## cylinders:displacement:acceleration:year 3.905e-03 2.407e-03
## cylinders:horsepower:acceleration:year 9.913e-03 4.724e-03
## displacement:horsepower:acceleration:year 1.674e-05 1.558e-04
## displacement:weight:acceleration:year -6.470e-06 3.001e-06
## horsepower:weight:acceleration:year -1.109e-05 7.578e-06
## cylinders:displacement:horsepower:origin 1.523e+00 9.518e-01
## cylinders:displacement:weight:origin 3.487e-02 2.244e-02
## cylinders:horsepower:weight:origin 6.957e-02 4.436e-02
## displacement:horsepower:weight:origin 1.758e-03 1.129e-03
## cylinders:displacement:acceleration:origin 1.818e+00 1.112e+00
## displacement:horsepower:acceleration:origin -6.105e-03 3.195e-03
## cylinders:weight:acceleration:origin -9.371e-03 6.239e-03
## displacement:weight:acceleration:origin -1.793e-04 8.734e-05
## horsepower:weight:acceleration:origin -3.753e-04 1.689e-04
## cylinders:displacement:year:origin 1.247e+00 8.219e-01
## cylinders:horsepower:year:origin -1.023e-01 5.713e-02
## cylinders:weight:year:origin -1.821e-01 1.253e-01
## horsepower:weight:year:origin -4.899e-05 2.984e-05
## cylinders:acceleration:year:origin 1.374e+01 9.962e+00
## displacement:acceleration:year:origin -6.591e-03 2.343e-03
## cylinders:displacement:horsepower:weight:acceleration 2.296e-07 1.342e-07
## cylinders:displacement:horsepower:weight:year 1.407e-07 1.137e-07
## cylinders:displacement:horsepower:acceleration:year -3.300e-05 2.414e-05
## displacement:horsepower:weight:acceleration:year 4.934e-08 2.531e-08
## cylinders:displacement:horsepower:weight:origin -4.505e-04 2.849e-04
## displacement:horsepower:weight:acceleration:origin 2.700e-06 1.290e-06
## t value Pr(>|t|)
## (Intercept) 1.239 0.21621
## cylinders -1.562 0.11930
## displacement -1.490 0.13729
## horsepower -1.408 0.16026
## weight 1.617 0.10705
## acceleration -1.404 0.16148
## year 1.777 0.07663 .
## origin -1.480 0.13988
## cylinders:displacement 1.559 0.12017
## cylinders:horsepower 1.527 0.12776
## displacement:horsepower 1.506 0.13309
## cylinders:weight -1.469 0.14285
## displacement:weight 1.307 0.19230
## horsepower:weight 1.276 0.20306
## cylinders:acceleration 1.499 0.13502
## displacement:acceleration 1.282 0.20071
## horsepower:acceleration -0.147 0.88305
## weight:acceleration -3.123 0.00197 **
## cylinders:year -1.542 0.12415
## displacement:year 1.375 0.17030
## horsepower:year -1.420 0.15663
## weight:year -1.564 0.11901
## acceleration:year 1.400 0.16256
## cylinders:origin 1.564 0.11896
## displacement:origin 1.546 0.12326
## horsepower:origin 1.508 0.13257
## weight:origin -1.442 0.15031
## acceleration:origin 1.401 0.16216
## year:origin -1.639 0.10238
## cylinders:displacement:horsepower -1.587 0.11351
## cylinders:displacement:weight -1.477 0.14075
## cylinders:horsepower:weight -1.476 0.14096
## displacement:horsepower:weight -1.397 0.16335
## cylinders:displacement:acceleration -1.826 0.06884 .
## cylinders:horsepower:acceleration -1.736 0.08356 .
## displacement:horsepower:acceleration 0.682 0.49594
## cylinders:weight:acceleration 2.147 0.03263 *
## displacement:weight:acceleration 2.922 0.00375 **
## horsepower:weight:acceleration 2.289 0.02277 *
## cylinders:displacement:year -1.518 0.12998
## cylinders:horsepower:year 0.741 0.45957
## displacement:horsepower:year 0.613 0.54003
## cylinders:weight:year 1.481 0.13956
## displacement:weight:year 1.930 0.05454 .
## horsepower:weight:year 1.798 0.07319 .
## cylinders:acceleration:year -1.490 0.13726
## displacement:acceleration:year 0.535 0.59288
## horsepower:acceleration:year -1.087 0.27779
## weight:acceleration:year 2.048 0.04144 *
## cylinders:displacement:origin -1.565 0.11878
## cylinders:horsepower:origin -1.538 0.12514
## displacement:horsepower:origin -1.583 0.11440
## cylinders:weight:origin 1.411 0.15926
## displacement:weight:origin -1.532 0.12666
## horsepower:weight:origin -1.522 0.12921
## cylinders:acceleration:origin -1.424 0.15561
## displacement:acceleration:origin -1.460 0.14529
## horsepower:acceleration:origin 2.073 0.03901 *
## weight:acceleration:origin 1.935 0.05393 .
## cylinders:year:origin 1.586 0.11372
## displacement:year:origin -1.488 0.13774
## horsepower:year:origin 2.173 0.03057 *
## weight:year:origin 1.466 0.14376
## acceleration:year:origin -1.366 0.17304
## cylinders:displacement:horsepower:weight 1.521 0.12936
## cylinders:displacement:horsepower:acceleration 1.113 0.26677
## cylinders:displacement:weight:acceleration -1.779 0.07634 .
## cylinders:horsepower:weight:acceleration -1.933 0.05418 .
## displacement:horsepower:weight:acceleration -2.826 0.00504 **
## cylinders:displacement:horsepower:year 0.150 0.88096
## cylinders:displacement:weight:year -1.304 0.19329
## cylinders:horsepower:weight:year -1.416 0.15780
## displacement:horsepower:weight:year -1.627 0.10475
## cylinders:displacement:acceleration:year 1.622 0.10581
## cylinders:horsepower:acceleration:year 2.098 0.03672 *
## displacement:horsepower:acceleration:year 0.107 0.91450
## displacement:weight:acceleration:year -2.156 0.03190 *
## horsepower:weight:acceleration:year -1.463 0.14444
## cylinders:displacement:horsepower:origin 1.601 0.11056
## cylinders:displacement:weight:origin 1.554 0.12130
## cylinders:horsepower:weight:origin 1.568 0.11786
## displacement:horsepower:weight:origin 1.557 0.12050
## cylinders:displacement:acceleration:origin 1.635 0.10303
## displacement:horsepower:acceleration:origin -1.911 0.05698 .
## cylinders:weight:acceleration:origin -1.502 0.13420
## displacement:weight:acceleration:origin -2.053 0.04096 *
## horsepower:weight:acceleration:origin -2.223 0.02700 *
## cylinders:displacement:year:origin 1.517 0.13037
## cylinders:horsepower:year:origin -1.790 0.07441 .
## cylinders:weight:year:origin -1.453 0.14716
## horsepower:weight:year:origin -1.642 0.10176
## cylinders:acceleration:year:origin 1.380 0.16879
## displacement:acceleration:year:origin -2.813 0.00524 **
## cylinders:displacement:horsepower:weight:acceleration 1.711 0.08805 .
## cylinders:displacement:horsepower:weight:year 1.237 0.21693
## cylinders:displacement:horsepower:acceleration:year -1.367 0.17271
## displacement:horsepower:weight:acceleration:year 1.950 0.05217 .
## cylinders:displacement:horsepower:weight:origin -1.581 0.11487
## displacement:horsepower:weight:acceleration:origin 2.093 0.03717 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.371 on 293 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.9308, Adjusted R-squared: 0.9077
## F-statistic: 40.23 on 98 and 293 DF, p-value: < 2.2e-16
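With this many terms, the significant interactions are easier to extract from the coefficient table than to pick out by hand. A minimal sketch, again on the built-in mtcars data as a stand-in (an assumption) with a much smaller interaction model; `fit_int` is an illustrative name.

```r
# Stand-in model (assumption): all interactions among three mtcars predictors
fit_int <- lm(mpg ~ wt * hp * qsec, data = mtcars)
coefs <- summary(fit_int)$coefficients

# Interaction terms are the rows whose names contain ":"
is_interaction <- grepl(":", rownames(coefs), fixed = TRUE)
p_values <- coefs[is_interaction, "Pr(>|t|)"]

# Interaction terms significant at the 0.05 level
significant_interactions <- names(p_values)[p_values < 0.05]

length(significant_interactions)
significant_interactions
```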
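Each transformation below is applied to the response, mpg, with the same seven predictors as before. The adjusted R^2 values are on different response scales, so they are only a rough guide when comparing the three models.
Log Transformation: Reduces the effect of large values and can stabilize the variance. In the Residuals vs Fitted plot the points are closer together, which suggests the variance is closer to constant. The adjusted R^2 is 0.8773.
Square Root Transformation: Reduces the effect of extreme values, which is useful for a skewed response. The Q-Q plot is closer to a straight line, which suggests the residuals are closer to a normal distribution. The adjusted R^2 is 0.8535.
Square Transformation: Squaring the response stretches large values, the opposite of the log and square root. This is different from adding a squared predictor, which is what one would use to capture a quadratic relationship between a predictor and the response. I noticed in the Scale-Location plot the points had a smaller spread of the residuals, but the adjusted R^2 drops to 0.7243.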
auto$log_mpg = log(auto$mpg)
auto$sqrt_mpg = sqrt(auto$mpg)
auto$sq_mpg = auto$mpg^2
log_model = lm(log_mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data=auto)
summary(log_model)
##
## Call:
## lm(formula = log_mpg ~ cylinders + displacement + horsepower +
## weight + acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40955 -0.06533 0.00079 0.06785 0.33925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.751e+00 1.662e-01 10.533 < 2e-16 ***
## cylinders -2.795e-02 1.157e-02 -2.415 0.01619 *
## displacement 6.362e-04 2.690e-04 2.365 0.01852 *
## horsepower -1.475e-03 4.935e-04 -2.989 0.00298 **
## weight -2.551e-04 2.334e-05 -10.931 < 2e-16 ***
## acceleration -1.348e-03 3.538e-03 -0.381 0.70339
## year 2.958e-02 1.824e-03 16.211 < 2e-16 ***
## origin 4.071e-02 9.955e-03 4.089 5.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1191 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8795, Adjusted R-squared: 0.8773
## F-statistic: 400.4 on 7 and 384 DF, p-value: < 2.2e-16
sqrt_model = lm(sqrt_mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data=auto)
summary(sqrt_model)
##
## Call:
## lm(formula = sqrt_mpg ~ cylinders + displacement + horsepower +
## weight + acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98891 -0.18946 0.00505 0.16947 1.02581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.075e+00 4.290e-01 2.506 0.0126 *
## cylinders -5.942e-02 2.986e-02 -1.990 0.0474 *
## displacement 1.752e-03 6.942e-04 2.524 0.0120 *
## horsepower -2.512e-03 1.274e-03 -1.972 0.0493 *
## weight -6.367e-04 6.024e-05 -10.570 < 2e-16 ***
## acceleration 2.738e-03 9.131e-03 0.300 0.7644
## year 7.381e-02 4.709e-03 15.675 < 2e-16 ***
## origin 1.217e-01 2.569e-02 4.735 3.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3074 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8561, Adjusted R-squared: 0.8535
## F-statistic: 326.3 on 7 and 384 DF, p-value: < 2.2e-16
squared_model = lm(sq_mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data=auto)
summary(squared_model)
##
## Call:
## lm(formula = sq_mpg ~ cylinders + displacement + horsepower +
## weight + acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -483.45 -141.87 -19.62 103.58 1042.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.878e+03 2.928e+02 -6.412 4.22e-10 ***
## cylinders -1.436e+01 2.038e+01 -0.704 0.48157
## displacement 1.328e+00 4.738e-01 2.802 0.00534 **
## horsepower -3.587e-01 8.693e-01 -0.413 0.68009
## weight -3.522e-01 4.111e-02 -8.567 2.62e-16 ***
## acceleration 9.278e+00 6.232e+00 1.489 0.13740
## year 4.081e+01 3.214e+00 12.698 < 2e-16 ***
## origin 9.509e+01 1.754e+01 5.422 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.8 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.7292, Adjusted R-squared: 0.7243
## F-statistic: 147.8 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(log_model)
plot(sqrt_model)
plot(squared_model)
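Because each model predicts a different version of the response, one way to compare them is to convert the fitted values back to the original scale and compute the error there. This is a sketch on the built-in mtcars data as a stand-in (an assumption); the back-transformation is the naive one and ignores retransformation bias.

```r
# Stand-in data (assumption): built-in mtcars instead of Auto.csv
fit_raw  <- lm(mpg ~ wt + hp, data = mtcars)
fit_log  <- lm(log(mpg) ~ wt + hp, data = mtcars)
fit_sqrt <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)

# Root mean squared error on the original mpg scale
rmse <- function(pred) sqrt(mean((mtcars$mpg - pred)^2))

rmse_by_model <- c(
  raw  = rmse(fitted(fit_raw)),
  log  = rmse(exp(fitted(fit_log))),   # undo the log
  sqrt = rmse(fitted(fit_sqrt)^2)      # undo the square root
)
rmse_by_model
```

Writing the transformation inside the formula, as in `log(mpg) ~ ...`, also avoids adding extra columns to the data frame.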
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
carseats_model = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseats_model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
coef(carseats_model)[4]
## USYes
## 1.200573
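The coefficient for Price is -0.054459. Sales is measured in thousands of units, so for
every dollar increase in the price of my car seat, my store's sales
decrease by about 54 units on average, holding Urban and US fixed. The coefficient for Urban = Yes is
-0.021916 with a p-value of 0.936, so there is no evidence that sales differ between urban and rural stores. The coefficient for US = Yes is
1.200573, which means that, on average, US stores sell about 1,200 more units
than stores outside the US at the same price.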
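\(\widehat{Sales} = 13.04 - 0.054 \times Price - 0.022 \times Urban + 1.20 \times US\), where Urban = 1 for an urban store and US = 1 for a US store (0 otherwise), and Sales is in thousands of units.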
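As a worked example of the equation, the fitted coefficients can be plugged in by hand. The store below (price of 100, urban, in the US) is hypothetical; the coefficients are copied from the summary output above.

```r
# Coefficients copied from the summary of carseats_model
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical store: price of 100, urban location, inside the US
price <- 100
urban <- 1   # 1 = Yes, 0 = No
us    <- 1   # 1 = Yes, 0 = No

predicted_sales <- unname(b["intercept"] + b["price"] * price +
                          b["urban_yes"] * urban + b["us_yes"] * us)
predicted_sales   # about 8.78, in thousands of units
```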
See part (b) for interpretation. Price and
US = Yes are significant at the 0.05 level, so for those two predictors we can reject the null
hypothesis \(H_0 : \beta_j = 0\). For Urban (p = 0.936) we fail to reject it.
carseats_model_2 = lm(Sales ~ Price + US, data = Carseats)
summary(carseats_model_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Neither model fits the data well. R-squared is 0.2393 for both, so each explains only about 24% of the variance in Sales. Adjusted R-squared is 0.2335 for part (a) and 0.2354 for part (e), so dropping Urban gives a marginally better model once the number of predictors is taken into account.
confint(carseats_model_2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
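The interval for Price can be reproduced by hand as the estimate plus or minus a t quantile times the standard error. The inputs are the rounded values from the summary above, so the result matches `confint` only to about four decimal places.

```r
# Values copied from the summary of carseats_model_2
estimate  <- -0.05448   # Price coefficient
std_error <- 0.00523    # its standard error
df_resid  <- 397        # residual degrees of freedom

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
ci
```

Since the interval excludes zero, it agrees with the t-test for Price at the 0.05 level.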
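Based on the Residuals vs Leverage plot, some points lie far from the bulk of the data, which indicates there might be some influential points. The plot labels three observations, # 26, 368, and 50, which are the ones with the largest Cook's distance. The influence.measures output agrees in part: all three are starred on dffit, but only 368 is also starred on hat. The observations starred for high leverage (hat) are 43, 126, 166, 175, 314, and 368. No observation is starred on cook.d, and the largest Cook's distance is only 0.03, so none of these points appears to change the fit much.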
par(mfrow=c(2,2))
plot(carseats_model_2)
summary(influence.measures(carseats_model_2))
## Potentially influential observations of
## lm(formula = Sales ~ Price + US, data = Carseats) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
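The asterisks come from fixed cutoffs that depend on the number of observations and coefficients. The sketch below computes the cutoffs as I understand `influence.measures` to apply them (an assumption worth checking against the R documentation); they are consistent with which rows are starred above.

```r
n <- 400   # observations in Carseats
k <- 3     # coefficients in Sales ~ Price + US, intercept included

cutoffs <- c(
  dfbeta = 1,                       # flag if |dfbeta| exceeds 1
  dffit  = 3 * sqrt(k / (n - k)),   # flag if |dffit| exceeds this
  cov_r  = 3 * k / (n - k),         # flag if |1 - cov.r| exceeds this
  hat    = 3 * k / n                # flag if the hat value exceeds this
)
# Cook's distance is flagged when pf(cook.d, k, n - k) exceeds 0.5

round(cutoffs, 4)
```

With a dffit cutoff of about 0.26, observations 26, 50, and 368 (0.28, 0.26, 0.27 after rounding) are flagged, while 166 at -0.24 is not.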
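The coefficient estimates are the same in both directions when X and Y have the same sample variance (or, for regression without an intercept, when \(\sum x_i^2 = \sum y_i^2\)). With an intercept, the slope of Y onto X is \(r \, s_Y / s_X\) and the slope of X onto Y is \(r \, s_X / s_Y\), where \(r\) is the sample correlation and \(s_X\), \(s_Y\) are the sample standard deviations, so the two agree only when \(s_X = s_Y\) (or trivially when \(r = 0\)). Perfect correlation alone is not enough: if Y = 2X, then r = 1 but the slopes are 2 and 0.5. In the first example below the variances differ and the slopes are 7.03 and 0.14; in the second, Y = X, so the variances are equal and both slopes are 1.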
set.seed(42)
# Generating a random set of data for X
n = 100
X = rnorm(n)
# Y as a linear function of X
Y = 7 * X + rnorm(n, mean = 5, sd = 1)
# Fitting regression of X onto Y
model_X_on_Y = lm(X ~ Y)
# Fitting regression of Y onto X
model_Y_on_X = lm(Y ~ X)
# Output coefficients for both regressions
coef(model_Y_on_X)
## (Intercept) X
## 4.911633 7.027159
coef(model_X_on_Y)
## (Intercept) Y
## -0.6879615 0.1401672
# Set seed for reproducibility
set.seed(42)
# Generating a random set of data for X
n = 100
X = rnorm(n)
# Creating Y as a perfectly correlated variable with X
Y = X
# Fitting regression of X onto Y
model_X_on_Y = lm(X ~ Y)
# Fitting regression of Y onto X
model_Y_on_X = lm(Y ~ X)
coef(model_Y_on_X)
## (Intercept) X
## -1.665335e-17 1.000000e+00
coef(model_X_on_Y)
## (Intercept) Y
## -1.665335e-17 1.000000e+00
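A further check separates correlation from variance. With an intercept, the slope of Y onto X is r * sd(Y) / sd(X) and the slope of X onto Y is r * sd(X) / sd(Y), so the slopes should match whenever the two standard deviations match. The variables `Y1` and `Y2` below are illustrative.

```r
set.seed(42)
n <- 100
X <- rnorm(n)

# Case 1: perfect correlation but unequal variances, so the slopes differ
Y1 <- 2 * X
slope_y1_on_x <- unname(coef(lm(Y1 ~ X))[2])   # 2
slope_x_on_y1 <- unname(coef(lm(X ~ Y1))[2])   # 0.5

# Case 2: imperfect correlation but equal variances, so the slopes match
Y2 <- sample(X)                                # same values, shuffled
slope_y2_on_x <- unname(coef(lm(Y2 ~ X))[2])
slope_x_on_y2 <- unname(coef(lm(X ~ Y2))[2])

c(slope_y1_on_x, slope_x_on_y1)
all.equal(slope_y2_on_x, slope_x_on_y2)
```

In the second case both slopes equal the sample correlation between X and the shuffled copy, since the ratio of standard deviations is 1.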