Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier is used to predict classes, while KNN regression is used to predict numerical values.
Specifically, KNN classification will predict the class of a new instance by finding the K nearest training instances and assigning the most frequent class amongst them as the prediction for the new instance.
In contrast, KNN regression estimates the numerical value of a new instance by finding the K nearest training instances and averaging or using another aggregate function on their values to create an estimation of the new instance.
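The contrast can be sketched with a toy implementation. This is a minimal illustration on made-up data, not a replacement for a library routine; `knn_predict()` and the six training points are hypothetical names and values introduced only for this example.

```r
# Toy KNN for a single query point, using Euclidean distance.
# knn_predict() is a hypothetical helper written for this illustration.
knn_predict <- function(train_x, train_y, query, k, type = c("class", "reg")) {
  type <- match.arg(type)
  # Distance from the query to every training row
  q <- matrix(query, nrow(train_x), ncol(train_x), byrow = TRUE)
  d <- sqrt(rowSums((train_x - q)^2))
  nn <- order(d)[seq_len(k)]          # indices of the k nearest neighbours
  if (type == "class") {
    votes <- table(train_y[nn])
    names(votes)[which.max(votes)]    # most frequent class among the neighbours
  } else {
    mean(train_y[nn])                 # average response of the neighbours
  }
}

# Made-up training data: two well-separated clusters
train_x <- matrix(c(1, 1,  1, 2,  2, 1,  6, 6,  7, 6,  6, 7), ncol = 2, byrow = TRUE)
labels  <- c("a", "a", "a", "b", "b", "b")   # qualitative response
values  <- c(1.0, 1.2, 0.8, 5.0, 5.5, 4.5)   # quantitative response

knn_class <- knn_predict(train_x, labels, query = c(1.5, 1.5), k = 3, type = "class")
knn_reg   <- knn_predict(train_x, values, query = c(1.5, 1.5), k = 3, type = "reg")
```

The neighbour search is identical in both cases; only the final aggregation step (majority vote versus mean) differs.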
This question involves the use of multiple linear regression on the Auto data set.
\((a)\) Produce a scatterplot matrix which includes all of the variables in the data set.
\((b)\) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
\((c)\) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
##
## Call:
## lm(formula = mpg ~ ., data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
\(i.\) Is there a relationship between the predictors and the response?
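Yes, there is. The F-statistic of 252.4 (p-value < 2.2e-16) gives strong evidence that at least one predictor is related to the response, although some individual predictors do not have a statistically significant effect. The multiple R-squared implies that 82.15% of the variance in the response is explained by the predictors in this regression model (81.82% after adjusting for the number of predictors).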
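The summary reports both a multiple and an adjusted R-squared. As a quick check, the adjusted figure and the F-test p-value can be recovered from the printed numbers. The variable names below are introduced for illustration only, and n = 392 is inferred from the 384 residual degrees of freedom plus 8 estimated coefficients.

```r
# Values read off the summary output above
r2 <- 0.8215   # multiple R-squared
n  <- 392      # observations: 384 residual df + 7 predictors + 1 intercept
p  <- 7        # number of predictors

# Adjusted R-squared penalises the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)

# p-value of the overall F-test (upper tail of the F distribution)
f_pvalue <- pf(252.4, df1 = 7, df2 = 384, lower.tail = FALSE)
f_pvalue
```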
\(ii.\) Which predictors appear to have a statistically significant relationship to the response?
displacement, weight, year, and origin appear to have a statistically significant relationship to the response (each with p < 0.01). cylinders, horsepower, and acceleration do not.
\(iii.\) What does the coefficient for the year variable suggest?
When every other predictor is held constant, mpg increases by about 0.75 for each additional model year. This suggests that newer cars are more fuel efficient.
\((d)\) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
The Residuals vs Fitted plot suggests a non-linear relationship between the response and the predictors. The QQ plot shows that the residuals are not normally distributed and are right skewed. The Scale-Location plot shows that the constant error variance assumption does not hold for this model. In the Residuals vs Leverage plot, observation 14 stands out as a potential high-leverage point; no other observation appears to have unusually high leverage.
\((e)\) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
## cylinders displacement horsepower weight acceleration year
## 10.737535 21.836792 9.943693 10.831260 2.625806 1.244952
## origin
## 1.772386
## GVIF Df GVIF^(1/(2*Df))
## cylinders 8.904486 4 1.314320
## horsepower 9.761605 1 3.124357
## weight 9.675322 1 3.110518
## acceleration 2.651730 1 1.628413
## year 1.305357 1 1.142522
## origin 2.023470 2 1.192681
##
## Call:
## lm(formula = mpg ~ . - displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4432 -1.9509 -0.0635 1.5634 12.7861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.420180 4.567763 -4.689 3.82e-06 ***
## cylinders4 7.674517 1.624427 4.724 3.25e-06 ***
## cylinders5 8.387557 2.483501 3.377 0.000807 ***
## cylinders6 5.244637 1.684031 3.114 0.001983 **
## cylinders8 8.031903 1.792288 4.481 9.82e-06 ***
## horsepower -0.025486 0.012811 -1.989 0.047369 *
## weight -0.005097 0.000578 -8.819 < 2e-16 ***
## acceleration -0.000761 0.093156 -0.008 0.993486
## year 0.722135 0.048950 14.753 < 2e-16 ***
## origin2 1.280424 0.522596 2.450 0.014730 *
## origin3 2.213387 0.507419 4.362 1.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.121 on 381 degrees of freedom
## Multiple R-squared: 0.8442, Adjusted R-squared: 0.8401
## F-statistic: 206.5 on 10 and 381 DF, p-value: < 2.2e-16
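origin and cylinders were converted into factors given their categorical nature. displacement, which has the largest value in the first variance inflation factor output above (about 21.8), is excluded from the baseline model.
Interactions will be tested amongst the significant predictors in the
baseline model: cylinders, horsepower,
weight, year, and origin.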
##
## Call:
## lm(formula = mpg ~ . - displacement + year:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5305 -1.9615 -0.1253 1.3497 13.6865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.222e+00 5.405e+00 -1.151 0.25040
## cylinders4 7.322e+00 1.586e+00 4.617 5.35e-06 ***
## cylinders5 6.609e+00 2.433e+00 2.716 0.00690 **
## cylinders6 4.461e+00 1.650e+00 2.705 0.00714 **
## cylinders8 7.012e+00 1.759e+00 3.986 8.07e-05 ***
## horsepower -2.693e-02 1.243e-02 -2.166 0.03094 *
## weight -5.099e-03 5.606e-04 -9.095 < 2e-16 ***
## acceleration 1.665e-04 9.072e-02 0.002 0.99854
## year 5.333e-01 6.110e-02 8.729 < 2e-16 ***
## origin2 -4.390e+01 9.637e+00 -4.555 7.06e-06 ***
## origin3 -2.591e+01 8.722e+00 -2.971 0.00316 **
## year:origin2 5.922e-01 1.265e-01 4.683 3.95e-06 ***
## year:origin3 3.618e-01 1.122e-01 3.224 0.00137 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.026 on 379 degrees of freedom
## Multiple R-squared: 0.8543, Adjusted R-squared: 0.8497
## F-statistic: 185.2 on 12 and 379 DF, p-value: < 2.2e-16
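In Automodel1, the year:origin terms (2 = European, 3
= Asian) are statistically significant. They indicate that European cars
have a steeper increase in mpg over time than American cars, and that Asian
cars also improve faster than American cars, but at a slightly lower rate
than European cars. The R-squared value increases marginally in this model
(0.8543, up from 0.8442 in the baseline).
cylinders, horsepower, and acceleration are removed from
future models. Of these, only acceleration is clearly not significant
(p ≈ 0.999); horsepower is significant only at the 5% level and the
cylinders levels remain significant, so dropping them is a simplification
rather than something the p-values require.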
Automodel2 <- lm(mpg ~ . - displacement - cylinders - horsepower - acceleration
                 + year:origin + weight:origin, data = Auto)
summary(Automodel2)
##
## Call:
## lm(formula = mpg ~ . - displacement - cylinders - horsepower -
## acceleration + year:origin + weight:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.129 -1.899 -0.049 1.764 12.360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.439e+00 4.997e+00 -1.889 0.05964 .
## weight -5.653e-03 2.764e-04 -20.450 < 2e-16 ***
## year 6.421e-01 6.007e-02 10.690 < 2e-16 ***
## origin2 -3.219e+01 9.855e+00 -3.267 0.00119 **
## origin3 -1.209e+01 9.296e+00 -1.301 0.19409
## year:origin2 5.402e-01 1.287e-01 4.197 3.36e-05 ***
## year:origin3 3.513e-01 1.145e-01 3.069 0.00230 **
## weight:origin2 -2.663e-03 8.390e-04 -3.174 0.00162 **
## weight:origin3 -5.579e-03 1.144e-03 -4.878 1.57e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.138 on 383 degrees of freedom
## Multiple R-squared: 0.8417, Adjusted R-squared: 0.8384
## F-statistic: 254.5 on 8 and 383 DF, p-value: < 2.2e-16
In Automodel2, year:origin continues to be
significant. The weight:origin coefficients are negative and significant
for both levels, which suggests that the penalty on fuel
efficiency due to weight is largest for Asian cars, followed by European
cars, and smallest for American cars. The R-squared value decreases
marginally in this model (0.8417, compared with 0.8543 for Automodel1),
which is unsurprising given that several predictors were dropped.
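The ordering of the weight penalty across origins follows from adding each interaction coefficient to the baseline weight slope. A small sketch using the coefficients printed in the Automodel2 summary (the variable names are introduced here for illustration, and weight is assumed to be in pounds, as in the Auto data documentation):

```r
# Coefficients copied from the Automodel2 summary above
b_weight         <- -5.653e-03   # weight slope for origin 1 (American, the baseline)
b_weight_origin2 <- -2.663e-03   # additional slope for origin 2 (European)
b_weight_origin3 <- -5.579e-03   # additional slope for origin 3 (Asian)

# Effective weight slope for each origin
weight_slope <- c(
  American = b_weight,
  European = b_weight + b_weight_origin2,
  Asian    = b_weight + b_weight_origin3
)

# Change in mpg per 1000 units of additional weight
weight_slope * 1000
```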
Automodel3 <- lm(mpg ~ . - displacement - cylinders - horsepower - acceleration
                 + year:origin + weight:origin + year:weight,
                 data = Auto)
summary(Automodel3)
##
## Call:
## lm(formula = mpg ~ . - displacement - cylinders - horsepower -
## acceleration + year:origin + weight:origin + year:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0371 -1.8479 -0.0772 1.6285 12.2731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.526e+01 1.722e+01 -5.533 5.86e-08 ***
## weight 2.166e-02 5.266e-03 4.113 4.78e-05 ***
## year 1.786e+00 2.278e-01 7.840 4.53e-14 ***
## origin2 -1.256e+01 1.026e+01 -1.225 0.2214
## origin3 1.164e+01 1.009e+01 1.154 0.2493
## year:origin2 2.589e-01 1.358e-01 1.907 0.0573 .
## year:origin3 7.586e-03 1.290e-01 0.059 0.9531
## weight:origin2 -1.895e-03 8.253e-04 -2.296 0.0222 *
## weight:origin3 -4.523e-03 1.125e-03 -4.019 7.04e-05 ***
## weight:year -3.656e-04 7.040e-05 -5.194 3.36e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.037 on 382 degrees of freedom
## Multiple R-squared: 0.8521, Adjusted R-squared: 0.8486
## F-statistic: 244.5 on 9 and 382 DF, p-value: < 2.2e-16
In Automodel3, year:origin is no longer
significant at the 5% level. This could be due to the inclusion of
year:weight, which better explains how the effect of
year on mpg varies. The negative weight:year coefficient
suggests that in
older cars, weight did not reduce mpg as much, but in newer cars, added
weight leads to a larger drop in fuel efficiency. The R-squared value
(0.8521) is now greater than the value found in the baseline model
(Automodel0.1).
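To see how the weight effect changes with model year, the weight slope implied by the Automodel3 coefficients can be evaluated at a few years. This is only a sketch from the printed estimates; it applies to the baseline origin (American cars), the helper `weight_slope_at()` is hypothetical, and year is assumed to be coded 70 to 82 as in the Auto data.

```r
# Coefficients copied from the Automodel3 summary above
b_weight      <-  2.166e-02   # main effect of weight
b_weight_year <- -3.656e-04   # weight:year interaction

# Slope of mpg with respect to weight at a given model year (baseline origin)
weight_slope_at <- function(year) b_weight + b_weight_year * year

slopes <- weight_slope_at(c(70, 76, 82))
names(slopes) <- c("year70", "year76", "year82")
slopes
```

The slope is negative throughout the observed range and becomes more negative for later years, which is what the interpretation above relies on.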
\((f)\) Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
Auto$log_year <- log(Auto$year)
Auto$sqrt_year <- sqrt(Auto$year)
Auto$sq_year <- Auto$year^2
Auto$log_weight <- log(Auto$weight)
Auto$sqrt_weight <- sqrt(Auto$weight)
Auto$sq_weight <- Auto$weight^2
Automodel5 <- lm(mpg ~ cylinders + horsepower + log_weight + acceleration
                 + log_year + origin, Auto)
summary(Automodel5)
##
## Call:
## lm(formula = mpg ~ cylinders + horsepower + log_weight + acceleration +
## log_year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7011 -1.9145 -0.0667 1.4550 12.7600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -89.48120 17.49646 -5.114 5.00e-07 ***
## cylinders4 7.11016 1.56295 4.549 7.25e-06 ***
## cylinders5 8.84993 2.38531 3.710 0.000238 ***
## cylinders6 5.52389 1.61655 3.417 0.000701 ***
## cylinders8 7.52252 1.69958 4.426 1.25e-05 ***
## horsepower -0.01283 0.01220 -1.052 0.293331
## log_weight -17.68816 1.60401 -11.027 < 2e-16 ***
## acceleration 0.04044 0.08792 0.460 0.645780
## log_year 57.07426 3.59855 15.860 < 2e-16 ***
## origin2 1.05522 0.50244 2.100 0.036369 *
## origin3 1.61346 0.49685 3.247 0.001268 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.997 on 381 degrees of freedom
## Multiple R-squared: 0.8563, Adjusted R-squared: 0.8526
## F-statistic: 227.1 on 10 and 381 DF, p-value: < 2.2e-16
Replacing year and weight with log
transformations in a model that included the other predictors
from the baseline model resulted in log_weight and
log_year being statistically significant. For
log_weight, a 1% increase in weight leads to a
decrease in mpg of approximately 0.18 mpg. This suggests that heavier
cars have worse fuel efficiency, but the rate of decline slows as weight
increases. For log_year, a 1% increase in year results in an increase of
~0.57 mpg. Because year is coded from 70 to 82, one additional model year
is a 1.2% to 1.4% increase, which corresponds to roughly 0.7 to 0.8 mpg,
possibly due to
advancements in technology, regulations, and manufacturing techniques.
The R-squared value is greater than the baseline model in this case.
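The arithmetic behind interpreting a log-transformed predictor can be checked directly: a change from x to c·x shifts the prediction by the coefficient times log(c). The sketch below uses the Automodel5 estimates printed above; the variable names are introduced for illustration, and year is assumed to be coded 70 to 82 as in the Auto data.

```r
# Coefficients copied from the Automodel5 summary above
b_log_weight <- -17.68816
b_log_year   <-  57.07426

# Effect of a 1% increase in the untransformed predictor
weight_1pct <- b_log_weight * log(1.01)
year_1pct   <- b_log_year * log(1.01)

# Effect of one additional model year at both ends of the range
year_70_71 <- b_log_year * log(71 / 70)
year_81_82 <- b_log_year * log(82 / 81)

round(c(weight_1pct = weight_1pct, year_1pct = year_1pct,
        year_70_71 = year_70_71, year_81_82 = year_81_82), 3)
```

Note that a one-year step is larger than a 1% change in year, so the per-year effect is larger than the per-1% effect.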
##
## Call:
## lm(formula = mpg ~ cylinders + sqrt_weight + sqrt_year + origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2281 -1.8103 -0.0506 1.5655 12.9361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.35442 7.46492 -8.621 < 2e-16 ***
## cylinders4 7.66922 1.57850 4.859 1.73e-06 ***
## cylinders5 9.23969 2.41597 3.824 0.000153 ***
## cylinders6 5.76830 1.64268 3.512 0.000499 ***
## cylinders8 7.58827 1.73561 4.372 1.59e-05 ***
## sqrt_weight -0.67217 0.04752 -14.144 < 2e-16 ***
## sqrt_year 13.35462 0.80174 16.657 < 2e-16 ***
## origin2 1.18526 0.51201 2.315 0.021145 *
## origin3 1.77767 0.49760 3.572 0.000399 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.059 on 383 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8464
## F-statistic: 270.3 on 8 and 383 DF, p-value: < 2.2e-16
When horsepower and acceleration were
included in the model, they were insignificant. Thus, they were removed
and the model was refit. In this updated model, the predictors have a
square root transformation instead. The square root transformations of
weight and year are both statistically
significant. The R-squared value in this model decreases marginally
compared to the previous model (0.8496 versus 0.8563).
##
## Call:
## lm(formula = mpg ~ cylinders + sq_weight + sq_year + origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5047 -2.1180 -0.0608 1.6639 13.1049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.809e+00 2.616e+00 -2.220 0.026976 *
## cylinders4 8.613e+00 1.712e+00 5.032 7.49e-07 ***
## cylinders5 8.251e+00 2.623e+00 3.145 0.001791 **
## cylinders6 4.802e+00 1.782e+00 2.695 0.007344 **
## cylinders8 6.439e+00 1.908e+00 3.375 0.000812 ***
## sq_weight -7.205e-07 7.032e-08 -10.246 < 2e-16 ***
## sq_year 4.876e-03 3.290e-04 14.819 < 2e-16 ***
## origin2 1.496e+00 5.560e-01 2.691 0.007428 **
## origin3 2.692e+00 5.296e-01 5.083 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.324 on 383 degrees of freedom
## Multiple R-squared: 0.8223, Adjusted R-squared: 0.8186
## F-statistic: 221.6 on 8 and 383 DF, p-value: < 2.2e-16
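In this model, a squared transformation is applied to
year and weight. Both transformations are once
again statistically significant. However, the R-squared value of this
model decreases to 82.23% (81.86% adjusted). Of the three transformations,
the log model has the lowest residual standard error (2.997) and the
highest adjusted R-squared (0.8526), followed by the square root model
(3.059 and 0.8464) and the squared model (3.324 and 0.8186). The log model
also includes horsepower and acceleration, so the comparison is not exactly
like for like; between the two models with identical predictors, the
square root transformation fits better than the squared one.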
This question should be answered using the Carseats data set.
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
\((a)\) Fit a multiple regression model to predict Sales using Price, Urban, and US.
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
\((b)\) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Price: The coefficient is negative and highly significant. Holding the other predictors fixed, each one-unit increase in price is associated with a decrease in sales of approximately 0.0545 units. Since Sales is recorded in thousands of units in the Carseats data, that is about 54 fewer car seats per unit increase in price.
UrbanYes: The coefficient (-0.022) is the estimated difference in sales between urban and rural stores, holding the other predictors fixed. With a p-value of 0.936, there is not enough evidence to suggest a link between the urban or rural location of the store and sales, so UrbanYes is not a significant predictor in this model.
USYes: The coefficient is positive and significant. Holding the other predictors fixed, stores located in the US sell approximately 1.2 more sales units (about 1,200 car seats, with Sales in thousands) than stores outside the US.
\((c)\) Write out the model in equation form, being careful to handle the qualitative variables properly.
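\(\text{Sales}=13.04-0.054\times\text{Price}-0.022\times\text{UrbanYes}+1.20\times\text{USYes}\), where \(\text{UrbanYes}\) equals 1 if the store is in an urban location and 0 otherwise, and \(\text{USYes}\) equals 1 if the store is in the US and 0 otherwise.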
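The equation can be turned into a small prediction function to make the handling of the two qualitative variables explicit. This is a sketch using the unrounded estimates from the summary in (a); `predict_sales()` is a hypothetical helper, not part of the original analysis, and the inputs are made-up values.

```r
# Fitted equation from part (a); the Yes/No factors enter as 0/1 indicators
predict_sales <- function(price, urban, us) {
  13.043469 -
    0.054459 * price -
    0.021916 * (urban == "Yes") +   # 1 for urban stores, 0 otherwise
    1.200573 * (us == "Yes")        # 1 for US stores, 0 otherwise
}

# Example: an urban US store versus a rural non-US store at the same price
sales_urban_us <- predict_sales(price = 100, urban = "Yes", us = "Yes")
sales_rural_non_us <- predict_sales(price = 100, urban = "No", us = "No")
c(sales_urban_us, sales_rural_non_us)
```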
\((d)\) For which of the predictors can you reject the null hypothesis \(H_0\): \(\beta_j\)=0?
The null hypothesis can be rejected for Price and USYes based on the p-values.
\((e)\) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
\((f)\) How well do the models in \((a)\) and \((e)\) fit the data?
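Both models fit the data similarly, and neither fits it especially well: each explains about 23.9% of the variance in Sales (multiple R-squared of 0.2393). The smaller model in \((e)\) has a slightly higher adjusted R-squared (0.2354 versus 0.2335) and a slightly lower residual standard error (2.469 versus 2.472).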
\((g)\) Using the model from \((e)\), obtain 95% confidence intervals for the coefficient(s).
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
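These intervals are the estimate plus or minus a t quantile times the standard error. As a check, they can be approximately reproduced from the rounded values in the summary of the model in (e); small differences in the last digits are expected from that rounding. The object names below are introduced for illustration.

```r
# Estimates and standard errors copied from the summary of the model in (e)
est <- c("(Intercept)" = 13.03079, Price = -0.05448, USYes = 1.19964)
se  <- c("(Intercept)" = 0.63098,  Price = 0.00523,  USYes = 0.25846)

# Two-sided 95% interval uses the 97.5% t quantile with the residual df
t_crit <- qt(0.975, df = 397)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```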
\((h)\) Is there evidence of outliers or high leverage observations in the model from \((e)\)?
## Potentially influential observations of
## lm(formula = Sales ~ Price + US, data = Carseats) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
The residuals appear to be bounded close to the reference line. Therefore, we can say that there are not many outliers present in the data.
The listed DFBETA values are all small (none exceeds 0.26 in absolute value), so no observation appears to have a strong influence on an individual coefficient. A few DFFIT values are flagged as influential (obs. #26, #50, and #368). Cook’s D values do not cause any alarm in this case. The hat values are low overall, but observations 43, 126, 166, 175, 314, and 368 are flagged and may be high-leverage points.
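The `_*` flags in the listing come from fixed cutoffs. Assuming the listing was produced by `summary(influence.measures(...))` from base R, and assuming the cutoffs are the ones recalled below from that function's source (worth confirming against `?influence.measures`), they can be computed for this model with n = 400 observations and k = 3 coefficients. The object names are introduced for illustration.

```r
n <- 400   # observations in Carseats
k <- 3     # estimated coefficients: intercept, Price, USYes

# Assumed cutoffs (to be confirmed against the stats package documentation)
cutoffs <- c(
  dfbeta    = 1,                       # |DFBETA| above 1
  dffit     = 3 * sqrt(k / (n - k)),   # |DFFIT| above this value
  cov_ratio = 3 * k / (n - k),         # |1 - cov.r| above this value
  hat       = 3 * k / n                # hat value above this value
)
round(cutoffs, 4)
```

Under these assumptions the hat cutoff is 0.0225 and the DFFIT cutoff is about 0.26, which is consistent with which rows carry a flag in the table.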
This problem involves simple linear regression without an intercept.
\((a)\) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The coefficient estimate for the regression of Y onto X is \(\hat{\beta}\) = \(\frac{\sum_ix_iy_i}{\sum_jx_j^2}\).
The coefficient estimate for the regression of X onto Y is \(\hat{\beta}\) = \(\frac{\sum_ix_iy_i}{\sum_jy_j^2}\).
The coefficients are the same if \(\sum_jx_j^2=\sum_jy_j^2\).
\((b)\) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2*x+rnorm(100, mean = 0, sd = 1)
fit.Y <- lm(y ~ x)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769 0.09699 -0.389 0.698
## x 1.99894 0.10773 18.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
The slope on y in the regression of X onto Y (0.389) is much smaller than the slope on x in the regression of Y onto X (1.999), so the two regressions do not have the same coefficient estimate. Note that both fits shown here include an intercept.
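Since the problem concerns regression without an intercept, the same comparison can also be sketched with the intercept removed (`+ 0` in the formula) and checked against the closed-form estimates from part (a). This is an illustrative variant under the same simulated data, not the fit shown above.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# No-intercept fits: "+ 0" drops the intercept term
b_yx <- coef(lm(y ~ x + 0))[["x"]]   # regression of Y onto X
b_xy <- coef(lm(x ~ y + 0))[["y"]]   # regression of X onto Y

# Closed-form estimates from part (a)
formula_yx <- sum(x * y) / sum(x^2)
formula_xy <- sum(x * y) / sum(y^2)

c(b_yx = b_yx, formula_yx = formula_yx, b_xy = b_xy, formula_xy = formula_xy)
```

The two slopes differ because the sum of squares of x and the sum of squares of y are not equal here.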
\((c)\) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
## Warning in summary.lm(fit.Y): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.318e-16 -6.100e-17 -2.560e-17 -8.000e-19 3.220e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.110e-17 3.379e-17 -3.290e-01 0.743
## x 1.000e+00 3.283e-17 3.046e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.378e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 9.28e+32 on 1 and 98 DF, p-value: < 2.2e-16
## Warning in summary.lm(fit.X): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.318e-16 -6.100e-17 -2.560e-17 -8.000e-19 3.220e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.110e-17 3.379e-17 -3.290e-01 0.743
## y 1.000e+00 3.283e-17 3.046e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.378e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 9.28e+32 on 1 and 98 DF, p-value: < 2.2e-16
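The coefficient estimates are exactly the same: both slopes equal 1, and both fits are essentially perfect, as the warnings indicate. This is consistent with y being a copy of x, in which case \(\sum_jx_j^2=\sum_jy_j^2\).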
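A perfect fit is not required for the two estimates to agree. Any y with the same sum of squares as x works, for example a random permutation of x. The sketch below is one such construction, fitted without an intercept; it is an illustrative alternative and not the code that produced the output above.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # a permutation of x: same values, so the same sum of squares

# No-intercept fits in both directions
b_yx <- coef(lm(y ~ x + 0))[["x"]]
b_xy <- coef(lm(x ~ y + 0))[["y"]]

c(b_yx = b_yx, b_xy = b_xy, sum_x2 = sum(x^2), sum_y2 = sum(y^2))
```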