Conceptual
Problem 2 - Carefully explain the differences between the KNN classifier and KNN regression methods: The KNN classification method predicts a class for an observation by taking a majority vote among the classes of the K observations closest to it. The KNN regression method estimates f(x) as the average response of the K observations nearest to it. In short, KNN regression predicts a numeric value, while KNN classification predicts a class label.
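A minimal sketch contrasting the two methods (an illustration added here, assuming the class and FNN packages are installed; neither is used elsewhere in this document):
library(class)  # class::knn() for classification
library(FNN)    # FNN::knn.reg() for regression
set.seed(1)
x_train <- matrix(rnorm(100), ncol = 2)                    # 50 training points
y_class <- factor(ifelse(rowSums(x_train) > 0, "A", "B"))  # class labels
y_num   <- rowSums(x_train) + rnorm(50, sd = 0.1)          # numeric response
x_test  <- matrix(rnorm(10), ncol = 2)                     # 5 test points
# Classification: majority class among the 5 nearest neighbors
class::knn(train = x_train, test = x_test, cl = y_class, k = 5)
# Regression: average response of the 5 nearest neighbors
FNN::knn.reg(train = x_train, test = x_test, y = y_num, k = 5)$pred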
library("ISLR")
library(ggplot2)
data(Auto)
View(Auto)
Problem 9
9.a Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)
9.b Compute the matrix of correlations between the variables using the cor() function. You will need to exclude the name variable, which is qualitative.
cor(Auto[, names(Auto) !="name"])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
9.c Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
lm.9c <- lm(mpg ~ . - name, data = Auto)
summary(lm.9c)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
9.c.i - Is there a relationship between the predictors and the response?
- Yes, there is a relationship between the predictors and the response; in particular weight, year, and origin are strongly associated with mpg. The F-statistic is far from 1 and its p-value is essentially zero, providing strong evidence against the null hypothesis that all of the coefficients are zero.
9.c.ii Which predictors appear to have a statistically significant relationship to the response?
- The p-value associated with each predictor’s t-stat shows that displacement, weight, year, and origin have a statistically significant relationship, while cylinders, horsepower, and acceleration do not.
9.c.iii What does the coefficient for the year variable suggest?
- The coefficient for year, 0.750773, suggests that, holding the other predictors fixed, mpg increases by about 0.75 for each additional model year. In other words, cars become more fuel efficient by almost 1 mile per gallon per year.
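As an illustration of this interpretation, the fitted model can be used to predict mpg for the same car specification one model year apart (the specification below is hypothetical, not a row from the data):
newcars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77),
                      origin = 1)
predict(lm.9c, newdata = newcars)
diff(predict(lm.9c, newdata = newcars))  # difference equals the year coefficient, about 0.75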
9.d Produce diagnostic plots of the linear regression fit as described in the lab. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(lm.9c)
The residuals show a clear non-linear pattern rather than random scatter around zero, indicating that the linear model may not be capturing the true relationship between the predictors and mpg. The Scale-Location plot shows relatively consistent spread of residuals across fitted values, indicating that the assumption of homoscedasticity is reasonably satisfied.
The primary concern here is the non-linear pattern in the residuals plot, suggesting that a linear model may not be the most appropriate fit for this data.
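As a supplementary numeric check (a common rule of thumb, not part of the lab's plots), studentized residuals larger than about 3 in absolute value are often treated as potential outliers, and leverage values more than roughly twice the average (p + 1)/n as high-leverage points:
which(abs(rstudent(lm.9c)) > 3)                                 # unusually large studentized residuals
which(hatvalues(lm.9c) > 2 * length(coef(lm.9c)) / nrow(Auto))  # rule-of-thumb high leverage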
9.e Fit some models with interactions as described in the lab. Do any interactions appear to be statistically significant?
model_wgt_yr <- lm(mpg ~ weight * year + origin, data = Auto)
model_wgt_acc <- lm(mpg ~ weight + acceleration + weight:acceleration, data = Auto)
model_cyl_dis <- lm(mpg ~ cylinders * displacement, data = Auto)
summary(model_wgt_yr)
##
## Call:
## lm(formula = mpg ~ weight * year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9328 -1.8729 -0.0833 1.5977 12.2103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.074e+02 1.276e+01 -8.420 7.44e-16 ***
## weight 2.594e-02 4.363e-03 5.945 6.18e-09 ***
## year 1.961e+00 1.704e-01 11.510 < 2e-16 ***
## origin 9.176e-01 2.452e-01 3.742 0.00021 ***
## weight:year -4.295e-04 5.860e-05 -7.330 1.36e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.141 on 387 degrees of freedom
## Multiple R-squared: 0.8397, Adjusted R-squared: 0.838
## F-statistic: 506.8 on 4 and 387 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_wgt_yr)
The weight:year interaction appears to be statistically significant, indicating that the effect of weight on mpg varies by model year. This suggests that the impact of weight on fuel efficiency is not constant over time, and that newer cars may respond differently to changes in weight than older models.
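As a cross-check not shown in the original output, a nested-model F test comparing the additive model to the interaction model reaches the same conclusion:
model_additive <- lm(mpg ~ weight + year + origin, data = Auto)
anova(model_additive, model_wgt_yr)  # F test for the weight:year term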
summary(model_wgt_acc)
##
## Call:
## lm(formula = mpg ~ weight + acceleration + weight:acceleration,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5823 -2.6411 -0.3517 2.2611 15.6704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.814e+01 4.872e+00 5.776 1.57e-08 ***
## weight -3.168e-03 1.461e-03 -2.168 0.03076 *
## acceleration 1.117e+00 3.097e-01 3.608 0.00035 ***
## weight:acceleration -2.787e-04 9.694e-05 -2.875 0.00426 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.249 on 388 degrees of freedom
## Multiple R-squared: 0.706, Adjusted R-squared: 0.7037
## F-statistic: 310.5 on 3 and 388 DF, p-value: < 2.2e-16
The relationship between weight and acceleration, including their interaction, is statistically significant. With a p-value of 0.031, weight has a negative effect on mpg, and acceleration has a significant positive effect on mpg (p-value 0.00035). The significant interaction term suggests that the relationship between weight and mpg changes depending on acceleration, which makes logical/mechanical sense: heavier cars might be more fuel-efficient at certain acceleration levels than at others.
summary(model_cyl_dis)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0432 -2.4308 -0.2263 2.2048 20.9051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.22040 2.34712 20.545 < 2e-16 ***
## cylinders -2.41838 0.53456 -4.524 8.08e-06 ***
## displacement -0.13436 0.01615 -8.321 1.50e-15 ***
## cylinders:displacement 0.01182 0.00207 5.711 2.24e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.454 on 388 degrees of freedom
## Multiple R-squared: 0.6769, Adjusted R-squared: 0.6744
## F-statistic: 271 on 3 and 388 DF, p-value: < 2.2e-16
This interaction between cylinders and displacement is also statistically significant with a p-value of 2.24e-08. The significant interaction between cylinders and displacement suggests that the relationship between number of cylinders and fuel efficiency depends on engine displacement (or vice versa). This makes mechanical sense because engine efficiency varies by configuration of the engine. The positive interaction coefficient (0.01182) suggests that as both cylinders and displacement increase together, the combined negative effect on mpg is somewhat less than you’d expect from just adding their individual effects.
9.f Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.
wgt_yr_squared <- lm(mpg ~ weight + I(weight^2) + year + origin, data = Auto)
summary(wgt_yr_squared)
##
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8991 -1.5687 -0.1412 1.6235 12.7733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.422e-01 4.058e+00 -0.109 0.9133
## weight -2.041e-02 1.537e-03 -13.281 <2e-16 ***
## I(weight^2) 2.213e-06 2.332e-07 9.487 <2e-16 ***
## year 8.247e-01 4.416e-02 18.675 <2e-16 ***
## origin 5.026e-01 2.435e-01 2.064 0.0397 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.019 on 387 degrees of freedom
## Multiple R-squared: 0.8519, Adjusted R-squared: 0.8504
## F-statistic: 556.5 on 4 and 387 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(wgt_yr_squared)
wgt_yr_log <- lm(log(mpg) ~ weight + year + origin, data = Auto)
summary(wgt_yr_log)
##
## Call:
## lm(formula = log(mpg) ~ weight + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42915 -0.06814 0.00772 0.06895 0.37959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.541e+00 1.447e-01 10.647 <2e-16 ***
## weight -2.915e-04 9.190e-06 -31.721 <2e-16 ***
## year 3.129e-02 1.748e-03 17.902 <2e-16 ***
## origin 3.081e-02 9.372e-03 3.288 0.0011 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1211 on 388 degrees of freedom
## Multiple R-squared: 0.8742, Adjusted R-squared: 0.8732
## F-statistic: 898.8 on 3 and 388 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(wgt_yr_log)
The log transformation provides the best statistical fit and better satisfies the model assumptions, but it requires more careful interpretation of the coefficients, since each one is an effect on log(mpg) rather than on mpg itself. For prediction purposes this model would likely perform best, but for simple communication of results the original scale might be preferred.
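For completeness, the √X transformation mentioned in the prompt could be explored the same way; a minimal sketch (output not shown here):
wgt_sqrt <- lm(mpg ~ sqrt(weight) + year + origin, data = Auto)
summary(wgt_sqrt)
par(mfrow = c(2, 2))
plot(wgt_sqrt)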
Problem 10
10.a Fit a multiple regression model to predict Sales using Price, Urban, and US
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
attach(Carseats)
fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
10.b Provide an interpretation of each coefficient in the model.
Price and whether the store is in the US are significant predictors of Sales. The Price coefficient of -0.054459 means that, holding the other predictors fixed, each one-dollar increase in price is associated with a decrease in sales of about 0.054 thousand units (roughly 54 car seats). Sales at US stores are about 1.20 thousand units (roughly 1,200 car seats) higher than at non-US stores, all else equal. The Urban coefficient is not statistically significant, so there is no evidence that urban location affects sales.
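To make these interpretations concrete, the fitted model can be evaluated on a small hypothetical grid (the $120 price is an arbitrary illustrative value, not taken from the exercise):
grid <- expand.grid(Price = 120, Urban = c("No", "Yes"), US = c("No", "Yes"))
cbind(grid, predicted_sales = predict(fit, newdata = grid))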
10.c Write out the model in equation form, being careful to handle the qualitative variables properly
\[Sales = 13.04 - 0.054 \cdot x_{Price} - 0.022 \cdot x_{Urban} + 1.20 \cdot x_{US}\]
where \(x_{Urban}\) and \(x_{US}\) are indicator variables equal to 1 when the store is in an urban location or in the US, respectively, and 0 otherwise.
10.d For which of the predictors can you reject the null hypothesis H0 : βj = 0?
Price and US; their p-values are far below 0.05, while the p-value for Urban is not.
10.e On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
littleFit <- lm(Sales ~ Price + US, data = Carseats)
summary(littleFit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
10.f How well do the models in (a) and (e) fit the data?
The models do not fit the data especially well: with R² of about 0.24, they explain only 24% of the variance in Sales. However, for business/sales data this is not necessarily terrible; it depends on context. Both models are statistically significant (F-test p < 2.2e-16), so they are better than no model at all, and dropping Urban leaves R² essentially unchanged.
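A quick sanity check not included in the original output: a nested-model comparison confirms that dropping Urban does not significantly worsen the fit.
anova(littleFit, fit)  # F test for the Urban term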
10.g Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(littleFit)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
10.h Is there evidence of outliers or high leverage observations in the model from (e)?
# Each observation gets a set of influence statistics
summary(influence.measures(littleFit))
## Potentially influential observations of
## lm(formula = Sales ~ Price + US, data = Carseats) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
par(mfrow = c(2,2))
plot(littleFit)
Residuals vs. Fitted: There are several observations with large residuals that stand out from the main cluster, particularly some points with residuals around +5 and -5. These could be considered outliers.
Normal Q-Q: The residuals generally follow the normal line well, but there are some deviations at both tails.
Scale-Location: Shows relatively consistent spread, though there are some points with notably higher standardized residuals.
Residuals vs. Leverage: This plot shows several observations with higher leverage toward the right side of the plot.
There is evidence of both outliers and high leverage observations. The points in the upper right of the leverage plot would be of particular concern as they combine high leverage with large residuals, making them potentially influential observations that could be affecting the model fit.
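As a supplementary check (not part of the original output), the observations with the largest Cook's distances can be listed directly; consistent with the influence table above, none come close to the common D > 1 threshold, so these points are influential relative to the rest of the data but not extreme in absolute terms.
head(sort(cooks.distance(littleFit), decreasing = TRUE), 5)  # largest Cook's distances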
Problem 12.
This problem involves simple linear regression without an intercept.
12.a Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). *Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?*
The condition for the coefficient estimates of the regression of X onto Y and of Y onto X to be the same is:
\[ \sum_{i=1}^n X_i^2 = \sum_{i=1}^n Y_i^2 \]
This means that the sum of squares of X must equal the sum of squares of Y.
In other words, X and Y must have the same overall magnitude of variation about zero. This is a rare circumstance, but it occurs whenever the values of X and Y are on the same scale in this sense, for example when Y is simply a rearrangement of the values of X. Thus, for the coefficient estimates to be the same, the two variables must have an equal sum of squares.
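For reference, this follows directly from (3.38): the two no-intercept slope estimates share the same numerator and differ only in their denominators, so they coincide exactly when the denominators are equal.
\[\hat{\beta}_{Y \text{ on } X} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}, \qquad \hat{\beta}_{X \text{ on } Y} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n Y_i^2}\]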
12.b Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(42)
n <- 100
X <- rnorm(n)
Y <- 2 * X + rnorm(n)
# Perform the regression of Y onto X
model_Y_on_X <- lm(Y ~ X - 1)
# Perform the regression of X onto Y
model_X_on_Y <- lm(X ~ Y - 1)
cat("Coefficient estimate for Y onto X:\n")
## Coefficient estimate for Y onto X:
print(coef(model_Y_on_X))
## X
## 2.024486
cat("Coefficient estimate for X onto Y:\n")
## Coefficient estimate for X onto Y:
print(coef(model_X_on_Y))
## Y
## 0.4167146
12.c Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(42)
n <- 100
X <- rnorm(n)
# Create Y with exactly the same values as X, so sum(X^2) == sum(Y^2)
Y <- X
model_Y_on_X <- lm(Y ~ X - 1)
model_X_on_Y <- lm(X ~ Y - 1)
cat("Coefficient estimate for Y onto X:\n")
## Coefficient estimate for Y onto X:
print(coef(model_Y_on_X))
## X
## 1
cat("\nCoefficient estimate for X onto Y:\n")
##
## Coefficient estimate for X onto Y:
print(coef(model_X_on_Y))
## Y
## 1
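An alternative, less trivial construction (an extra illustration, not part of the original answer): any permutation of the values of X has the same sum of squares as X, so the two coefficient estimates still agree even though neither equals 1.
set.seed(42)
X <- rnorm(100)
Y <- sample(X)           # same values in a different order, so sum(X^2) == sum(Y^2)
coef(lm(Y ~ X - 1))
coef(lm(X ~ Y - 1))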