library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.3
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
data(Auto)

Question 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

K-nearest neighbors (KNN) is a non-parametric method that can be used in two ways: to predict a category (classification) or to predict a numeric value (regression). In both cases, the prediction for a point is based on the K training observations closest to it. The KNN classifier assigns the point to the most common class among those neighbors (a majority vote), while KNN regression predicts the average of the neighbors' response values. Classification is used when the response is a label, like "yes" or "no," while regression is used when the response is a number, like a house price. The two methods are also evaluated differently: classification accuracy is typically measured by the misclassification (error) rate, whereas regression accuracy is measured by the mean squared error.
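
To make the distinction concrete, below is a minimal sketch in R contrasting the two uses of KNN on the Auto data. It assumes the class and FNN packages are installed; the choice of predictors (weight and horsepower), the train/test split, and k = 5 are arbitrary illustration choices, not part of the original assignment.

set.seed(123)
train_idx <- sample(nrow(Auto), 300)
train_x <- Auto[train_idx, c("weight", "horsepower")]
test_x  <- Auto[-train_idx, c("weight", "horsepower")]

# KNN classification: predict a label (high vs. low mpg) by a majority vote of the neighbors
high_mpg   <- factor(Auto$mpg > median(Auto$mpg), labels = c("low", "high"))
class_pred <- class::knn(train_x, test_x, cl = high_mpg[train_idx], k = 5)
mean(class_pred != high_mpg[-train_idx])            # judged by the misclassification rate

# KNN regression: predict a number (mpg itself) by averaging the neighbors' responses
reg_pred <- FNN::knn.reg(train_x, test_x, y = Auto$mpg[train_idx], k = 5)$pred
mean((reg_pred - Auto$mpg[-train_idx])^2)           # judged by the mean squared error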

Question 9

This question involves the use of multiple linear regression on the Auto data set.

a) Produce a scatterplot matrix which includes all of the variables in the data set.

ggpairs(
  Auto,
  title = "Scatterplot Matrix of Auto Dataset",
  upper = list(continuous = wrap("cor", size = 3)),
  lower = list(continuous = wrap("points", alpha = 0.6, size = 0.8)),
  diag = list(continuous = wrap("densityDiag", alpha = 0.5)),
  cardinality_threshold = 500  
) +
  theme_minimal(base_size = 10)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

The correlation matrix was computed to quantify the relationships between variables. The results indicated strong negative correlations between mpg and weight (-0.83), displacement (-0.81), and horsepower (-0.78). Positive correlations were found between mpg and both year (0.58) and origin (0.57). These findings confirm that lighter, newer, and imported cars generally have higher fuel efficiency.

cor_matrix <- cor(Auto[, -which(names(Auto) == "name")])
print(cor_matrix)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

i. Is there a relationship between the predictors and the response?

The F-statistic for the model is 252.4 with a p-value < 2 × 10⁻¹⁶, so the null hypothesis that all slope coefficients are zero is decisively rejected. Combined with an adjusted R² of 0.818, this shows that, taken together, the predictors explain a large share of the variation in miles per gallon (mpg). In short, there is a clear relationship between the set of predictors and the response.

ii. Which predictors appear to have a statistically significant relationship to the response?

The variables weight, year, origin, and displacement are statistically significant (p < 0.05), indicating a detectable association with mpg after accounting for the other predictors. In contrast, cylinders, horsepower, and acceleration are not significant once the other variables are included in the model.

iii. What does the coefficient for the year variable suggest?

The coefficient for year (≈ 0.75) means that, holding all other variables constant, a car that is one model-year newer is expected to achieve about 0.75 additional miles per gallon. This suggests that fuel efficiency tended to improve steadily from year to year over the period covered by the dataset.

model <- lm(mpg ~ . - name, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
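
To make the interpretation of the year coefficient concrete, the short check below (a hypothetical illustration, not part of the original assignment) predicts mpg for two cars that are identical except for a one-model-year difference; because the model is linear, the difference between the two predictions equals the year coefficient.

# Two hypothetical cars identical except for model year (values are arbitrary but plausible)
newcars <- data.frame(
  cylinders = 4, displacement = 150, horsepower = 95, weight = 2800,
  acceleration = 15.5, year = c(76, 77), origin = 1
)
diff(predict(model, newdata = newcars))   # equals the year coefficient, about 0.75 mpg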

d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

Diagnostic plots were used to assess the model’s validity. The residuals appeared randomly distributed, the Q-Q plot suggested the residuals were approximately normally distributed, and the scale-location plot indicated fairly constant variance. Although a few high-leverage points were identified, none were extreme or influential. Overall, the regression assumptions were reasonably met.

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
# which = c(1, 2, 3, 5) gives the four standard panels, including Residuals vs Leverage
plot(model, which = c(1, 2, 3, 5), pch = 19, col = "#0072B2", cex = 0.7)
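
As a numeric complement to the leverage panel, a simple illustrative check (not part of the original write-up) is to inspect the hat values directly:

# Which observation has the largest leverage, and how large is it relative to the average hat value?
which.max(hatvalues(model))
max(hatvalues(model)) / mean(hatvalues(model))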

e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Both interaction terms are statistically significant (p < 0.001). The displacement:weight interaction is negative, meaning the penalty of extra weight on miles per gallon (mpg) grows as engine displacement increases, so heavy cars with big engines use disproportionately more fuel. The displacement:horsepower interaction is positive, so the relationship between engine size and mpg also depends on the car's horsepower. These results show that fuel efficiency depends on how these features work together, not on any single factor alone.

interaction_model <- lm(mpg ~ displacement * weight + displacement:horsepower + . - name - displacement - weight - horsepower, data = Auto)
summary(interaction_model)
## 
## Call:
## lm(formula = mpg ~ displacement * weight + displacement:horsepower + 
##     . - name - displacement - weight - horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5314  -2.4221  -0.0745   2.1164  13.6442 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -2.928e+01  5.021e+00  -5.830 1.17e-08 ***
## cylinders               -9.337e-01  3.178e-01  -2.938 0.003502 ** 
## acceleration             1.218e-01  9.699e-02   1.256 0.209884    
## year                     7.578e-01  5.778e-02  13.116  < 2e-16 ***
## origin                   1.464e+00  2.997e-01   4.886 1.51e-06 ***
## displacement:weight     -1.143e-05  1.950e-06  -5.862 9.83e-09 ***
## displacement:horsepower  1.513e-04  4.053e-05   3.733 0.000218 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.824 on 385 degrees of freedom
## Multiple R-squared:  0.7636, Adjusted R-squared:   0.76 
## F-statistic: 207.3 on 6 and 385 DF,  p-value: < 2.2e-16
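
Note that the formula above removes the main effects of displacement, weight, and horsepower, keeping only their interactions. A more conventional specification retains the main effects alongside the interaction terms; a sketch of that alternative (not fitted in the original write-up) would be:

# Hypothetical alternative specification that keeps the main effects
interaction_model2 <- lm(mpg ~ . - name + displacement:weight + displacement:horsepower,
                         data = Auto)
summary(interaction_model2)$coefficients[c("displacement:weight", "displacement:horsepower"), ]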

f) Try a few different transformations of the variables, such as log(X), √X, and X². Comment on your findings.

Several variable transformations were applied to improve model performance. Transforming horsepower with a logarithm, weight with a square root, or displacement with a square each raised the adjusted R-squared slightly, to roughly 0.833–0.834, compared with 0.818 for the untransformed additive model. These transformations help capture non-linear patterns; they also fit better than the interaction-only model from part (e), whose adjusted R-squared was 0.76.

log_model <- lm(mpg ~ log(horsepower) + weight + year + displacement + acceleration + cylinders + origin, data = Auto)
summary(log_model)
## 
## Call:
## lm(formula = mpg ~ log(horsepower) + weight + year + displacement + 
##     acceleration + cylinders + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3115 -2.0041 -0.1726  1.8393 12.6579 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     27.254005   8.589614   3.173  0.00163 ** 
## log(horsepower) -9.506436   1.539619  -6.175 1.69e-09 ***
## weight          -0.004266   0.000694  -6.148 1.97e-09 ***
## year             0.705329   0.048456  14.556  < 2e-16 ***
## displacement     0.019456   0.006876   2.830  0.00491 ** 
## acceleration    -0.292088   0.103804  -2.814  0.00515 ** 
## cylinders       -0.486206   0.306692  -1.585  0.11372    
## origin           1.482435   0.259347   5.716 2.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.18 on 384 degrees of freedom
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.834 
## F-statistic: 281.6 on 7 and 384 DF,  p-value: < 2.2e-16
sqrt_model <- lm(mpg ~ sqrt(weight) + horsepower + year + displacement + acceleration + cylinders + origin, data = Auto)
summary(sqrt_model)
## 
## Call:
## lm(formula = mpg ~ sqrt(weight) + horsepower + year + displacement + 
##     acceleration + cylinders + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4018 -2.0112  0.0246  1.7565 12.8943 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.840893   4.486253   0.633  0.52695    
## sqrt(weight) -0.794322   0.066906 -11.872  < 2e-16 ***
## horsepower   -0.010706   0.013111  -0.817  0.41469    
## year          0.773764   0.049030  15.781  < 2e-16 ***
## displacement  0.021846   0.007134   3.062  0.00235 ** 
## acceleration  0.131710   0.094051   1.400  0.16220    
## cylinders    -0.430040   0.310000  -1.387  0.16618    
## origin        1.210091   0.268519   4.507 8.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.191 on 384 degrees of freedom
## Multiple R-squared:  0.8359, Adjusted R-squared:  0.8329 
## F-statistic: 279.4 on 7 and 384 DF,  p-value: < 2.2e-16
sq_model <- lm(mpg ~ I(displacement^2) + horsepower + weight + year + acceleration + cylinders + origin, data = Auto)
summary(sq_model)
## 
## Call:
## lm(formula = mpg ~ I(displacement^2) + horsepower + weight + 
##     year + acceleration + cylinders + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7420 -1.8872 -0.0646  1.6601 12.5615 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.378e+01  4.488e+00  -3.070  0.00229 ** 
## I(displacement^2)  6.951e-05  1.074e-05   6.475 2.91e-10 ***
## horsepower        -4.248e-02  1.381e-02  -3.075  0.00225 ** 
## weight            -6.446e-03  5.848e-04 -11.024  < 2e-16 ***
## year               7.644e-01  4.886e-02  15.646  < 2e-16 ***
## acceleration       7.466e-02  9.427e-02   0.792  0.42884    
## cylinders         -7.083e-01  2.614e-01  -2.710  0.00703 ** 
## origin             1.337e+00  2.537e-01   5.271 2.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.188 on 384 degrees of freedom
## Multiple R-squared:  0.8361, Adjusted R-squared:  0.8331 
## F-statistic: 279.9 on 7 and 384 DF,  p-value: < 2.2e-16
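
To summarize the transformation results discussed above, the adjusted R² values of the fitted models can be collected side by side (a small convenience snippet, not part of the original output):

# Adjusted R-squared for the baseline and transformed fits defined above
sapply(list(baseline = model, log_hp = log_model, sqrt_wt = sqrt_model, disp_sq = sq_model),
       function(m) summary(m)$adj.r.squared)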

Question 10

This question should be answered using the Carseats data set.

a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

data(Carseats)
model_a <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_a)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

The multiple regression model shows how Price, Urban, and US affect Sales in the Carseats dataset. The intercept of 13.04 represents the predicted Sales when Price is 0 and both Urban and US are "No". The coefficient for Price is -0.0545, indicating that each $1 increase in price is associated with a decrease of about 0.05 units in Sales, holding the other variables fixed; this effect is highly statistically significant (p < 2e-16). The UrbanYes coefficient is -0.0219, a very small and statistically insignificant difference in Sales between urban and non-urban stores (p = 0.936). In contrast, the USYes coefficient is 1.20, meaning stores in the US are predicted to sell about 1.2 more units than those outside the US, a statistically significant difference (p ≈ 4.86e-06). Overall, Price and US have meaningful relationships with Sales, while Urban does not. The model explains approximately 24% of the variation in Sales (R² = 0.2393), and the overall fit is statistically significant (F = 41.52, p < 2.2e-16).

c) Write out the model in equation form, being careful to handle the qualitative variables properly.

The regression model for predicting Sales is:

Sales = 13.0435 − 0.0545 × Price − 0.0219 × UrbanYes + 1.2006 × USYes.

This means that when Price is 0 and the store is neither in an urban area nor in the US, the predicted sales are 13.04 units. For every $1 increase in price, sales drop by about 0.05 units. Stores in urban areas sell slightly fewer units (about 0.02), but this difference is not statistically significant. Stores in the US sell about 1.2 more units than those outside the US, which is a significant difference.
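
To see exactly how R encodes the qualitative variables as the 0/1 dummies used in this equation, one can inspect the design matrix of the fitted model (shown here only as an illustration):

# First rows of the design matrix: Urban and US become 0/1 columns named UrbanYes and USYes
# (the baseline level "No" is absorbed into the intercept)
head(model.matrix(model_a))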

d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

Based on the coefficient table in part (a), the null hypothesis H0 : βj = 0 can be rejected for Price (p < 2e-16) and US (p ≈ 4.86e-06), but not for Urban (p = 0.936). The diagnostic plots below also show that the regression model fits the data fairly well. The Residuals vs Fitted plot suggests a mostly linear relationship with no major patterns, the Q-Q plot shows residuals that are close to normally distributed with small deviations in the tails, and the Scale-Location plot indicates a fairly even spread of residuals. The Residuals vs Leverage plot shows a few points with higher leverage (such as observations 368 and 86), but none appear to be influential outliers. Overall, the model meets the key assumptions and looks reliable.

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(model_a,
     which = c(1, 2, 3, 5),  # include the Residuals vs Leverage panel discussed above
     pch = 19,
     col = "#0072B2",
     cex = 0.7)

e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

The smaller model includes only Price and US, since Urban was not a significant predictor. In this model, both Price and US remain statistically significant. Higher prices are linked to lower sales, and stores in the US sell about 1.2 more units than those outside the US. The adjusted R-squared is 0.2354, almost the same as the original model, which means dropping Urban did not hurt the model’s accuracy. This simpler model is just as effective and easier to interpret.

model_e <- lm(Sales ~ Price + US, data = Carseats)
summary(model_e)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

f) How well do the models in (a) and (e) fit the data?

Both models fit the data to a similar, modest degree, each explaining about 24% of the variance in Sales. The smaller model (e), which includes only Price and US, performs essentially as well as the full model (a), which also includes Urban: its adjusted R-squared is slightly higher and its residual standard error slightly lower. Dropping Urban therefore costs nothing in accuracy and yields a simpler, easier-to-interpret model.

Model      Predictors          Adjusted R²   Residual Std. Error
Model (a)  Price, Urban, US    0.2335        2.472
Model (e)  Price, US           0.2354        2.469
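
The figures in this table can be reproduced directly from the two fitted model objects, for example:

# Pull adjusted R-squared and residual standard error from both fits
data.frame(
  model  = c("(a) Price, Urban, US", "(e) Price, US"),
  adj_r2 = c(summary(model_a)$adj.r.squared, summary(model_e)$adj.r.squared),
  sigma  = c(summary(model_a)$sigma, summary(model_e)$sigma)
)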

g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

The 95% confidence intervals show that all variables in the smaller model are important. The intercept is between 11.79 and 14.27, which is the expected sales when price is 0 and the store is not in the US. The interval for Price is from -0.0648 to -0.0442, meaning higher prices lead to lower sales. The interval for USYes is from 0.69 to 1.71, showing that US stores sell more than non-US stores. Since none of the intervals include 0, both Price and US have a real effect on sales.

confint(model_e, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

h) Is there evidence of outliers or high leverage observations in the model from (e)?

The diagnostic plots show that the model fits the data well. The residuals are spread out evenly with no clear pattern, and the Q-Q plot shows the residuals are mostly normal. The variance looks consistent across the fitted values. A few points, like 368 and 86, have slightly higher leverage, but they are not influential. Overall, there are no major issues with outliers or leverage in this model.

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(model_e,
     which = c(1, 2, 3, 5),  # include the Residuals vs Leverage panel discussed above
     pch = 19,
     col = "#0072B2",
     cex = 0.7)
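
As an illustrative numeric cross-check of these plots (not part of the original write-up), studentized residuals can flag outliers and hat values well above the average can flag high-leverage points:

# Studentized residuals: |values| beyond about 3 would suggest outliers
range(rstudent(model_e))
# Rough count of observations with leverage more than twice the average hat value
sum(hatvalues(model_e) > 2 * mean(hatvalues(model_e)))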

Question 12

This problem involves simple linear regression without an intercept.

a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

In simple linear regression without an intercept, the slope for regressing Y onto X is β̂ = (∑ᵢ xᵢyᵢ) / (∑ᵢ xᵢ²), while the slope for regressing X onto Y is (∑ᵢ xᵢyᵢ) / (∑ᵢ yᵢ²). The two estimates share the same numerator, so they are equal exactly when the denominators agree, that is, when ∑ᵢ xᵢ² = ∑ᵢ yᵢ². In general the two slopes differ; they coincide only when X and Y have the same sum of squares, for example when Y is a rearrangement of X or when the two variables are identical.

b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)

x <- rnorm(100)
y <- 2 * x + rnorm(100)

model_y_on_x <- lm(y ~ x + 0)
coef(model_y_on_x)
##        x 
## 1.993876
model_x_on_y <- lm(x ~ y + 0)
coef(model_x_on_y)
##         y 
## 0.3911145
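
These fitted slopes can be verified against the closed-form no-intercept estimate from part (a), β̂ = ∑ᵢ xᵢyᵢ / ∑ᵢ xᵢ²:

# Check the fitted slopes against the closed-form expressions
sum(x * y) / sum(x^2)   # slope of y ~ x + 0, about 1.99
sum(x * y) / sum(y^2)   # slope of x ~ y + 0, about 0.39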

c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(2)

x <- rnorm(100)
y <- x  

model_y_on_x_equal <- lm(y ~ x + 0)
coef(model_y_on_x_equal)
## x 
## 1
model_x_on_y_equal <- lm(x ~ y + 0)
coef(model_x_on_y_equal)
## y 
## 1
