---
title: "Assignment 2"
output: html_notebook
author: "Sreeja Yalamaddi"
---

#### 2. **KNN Classification vs. KNN Regression**

- **KNN Classification:**  
  - **Goal:** Predict a categorical label.  
  - **Method:** Assigns the new observation the most frequent class among its \( K \) nearest neighbors (majority vote).

- **KNN Regression:**  
  - **Goal:** Predict a continuous value.  
  - **Method:** Computes the average (or weighted average) of the responses from the \( K \) nearest neighbors.

Both methods use a distance metric (usually Euclidean) to identify neighbors, and the choice of \( K \) controls the bias-variance trade-off in the same way: small \( K \) gives flexible fits with low bias but high variance, while large \( K \) gives smoother fits with higher bias and lower variance.
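
As a quick illustration (a minimal hand-rolled sketch, not part of the assignment; the data and query point below are made up), the only difference between the two methods is the final aggregation step over the same set of neighbors:

```{r}
set.seed(1)
train_x  <- matrix(rnorm(40), ncol = 2)            # 20 made-up training points in 2-D
train_cl <- factor(ifelse(train_x[, 1] + train_x[, 2] > 0, "A", "B"))  # class labels
train_y  <- train_x[, 1] + rnorm(20, sd = 0.1)     # continuous response
x0 <- c(0.25, -0.10)                               # query point
K  <- 5

d  <- sqrt(colSums((t(train_x) - x0)^2))           # Euclidean distance from x0 to each training point
nn <- order(d)[1:K]                                # indices of the K nearest neighbors

names(which.max(table(train_cl[nn])))              # KNN classification: majority vote
mean(train_y[nn])                                  # KNN regression: average of the neighbors' responses
```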

#### 9.

##### (a) **Scatterplot matrix of all quantitative variables**

```{r}
library(ISLR2)
pairs(Auto[, -9])  # Exclude 'name' as it's qualitative
```
##### (b) **Correlation Matrix**

```{r}
cor_matrix <- cor(Auto[, -9])  # Compute correlation matrix without 'name'
print(cor_matrix)

```
##### (c) **Multiple linear regression**

```{r}
lm_fit <- lm(mpg ~ . -name, data = Auto)  # Exclude 'name' from predictors
summary(lm_fit)

```
i)
  -   Yes, there is a relationship between the predictors and the response (mpg).
  -   The overall F-statistic is 252.4 with a p-value < 2.2e-16, indicating that at least one of the predictors has a statistically significant relationship with the response.
  -   The Multiple R-squared of 0.8215 means that about 82.15% of the variability in mpg is explained by the predictors in the model.
  -   Taken together, this indicates a strong overall relationship between the predictors (cylinders, displacement, horsepower, weight, acceleration, year, origin) and the response (mpg).

ii) At the 5% level, the statistically significant predictors are displacement (positive), weight (negative), year (positive), and origin (positive). These variables are the most important in explaining variation in mpg.

iii) The coefficient for year (about 0.75) suggests that newer cars are generally more fuel-efficient: holding the other predictors fixed, mpg increases by approximately 0.75 for each additional model year.


##### (d) **Diagnostic plots of the linear regression fit**

```{r}
par(mfrow=c(2,2))  # Arrange plots in a 2x2 grid
plot(lm_fit)

```

**Inference**:

-   Heteroscedasticity: There's some evidence of non-constant variance in the residuals vs fitted values and scale-location plots. The spread seems to increase slightly with fitted values.
-   Outliers: No unusually large outliers are apparent.
-   High Leverage: Observation 14 has relatively high leverage, but its Cook's distance doesn't indicate it's overly influential.
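
To back up the leverage claim numerically (an optional sketch, not part of the original answer), the largest hat values can be compared with the average leverage \( (p+1)/n \):

```{r}
# Three largest leverage (hat) values for the fit, with the average leverage for reference.
head(sort(hatvalues(lm_fit), decreasing = TRUE), 3)
length(coef(lm_fit)) / nobs(lm_fit)   # average leverage = (p + 1) / n
```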

##### (e) **Linear regression models with interaction effects**

```{r}
x <- subset(Auto, select = -name)
summary(lm(mpg ~ . + cylinders*displacement, data = x))
summary(lm(mpg ~ . + displacement:horsepower, data = x))
summary(lm(mpg ~ . + cylinders:horsepower*weight, data = x))
summary(lm(mpg ~ . + weight:cylinders, data = x))
```
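
A brief side note on the formula syntax used above (illustrative, not part of the original answer): in R, `a*b` expands to both main effects plus their interaction, whereas `a:b` adds only the interaction term on top of whatever else is in the formula.

```{r}
# The two formula operators expand to different sets of model terms.
attr(terms(mpg ~ cylinders * displacement), "term.labels")
attr(terms(mpg ~ displacement + horsepower + displacement:horsepower), "term.labels")
```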
**Inference**:

Several interaction effects appear to be statistically significant:

-   cylinders:displacement (from the first model)
-   displacement:horsepower (from the second model)
-   cylinders:horsepower:weight (the three-way interaction in the third model)
-   cylinders:weight (from the fourth model)

These significant interactions indicate that the effect of one predictor on mpg depends on the level of another predictor. Models that include these interaction terms therefore capture the variability in mpg more effectively than the main-effects-only model.

##### (f) **Try a few different transformations of the variables, such as log(X), sqrt(X), and X^2. Comment on your findings.**

```{r}
par(mfrow = c(2, 2))
plot(Auto$horsepower, Auto$mpg, cex = 0.2)
plot(log(Auto$horsepower), Auto$mpg, cex = 0.2)
plot(sqrt(Auto$horsepower), Auto$mpg, cex = 0.2)
plot(Auto$horsepower^2, Auto$mpg, cex = 0.2)

x <- subset(Auto, select = -name)
x$weight <- log(x$horsepower)   # NOTE: this overwrites the weight column with log(horsepower),
                                # so the model below contains horsepower and log(horsepower) but not weight
fit <- lm(mpg ~ ., data = x)
summary(fit)
par(mfrow = c(2, 2))
plot(fit, cex = 0.2)

x1 <- subset(Auto, select = -name)
x1$weight <- sqrt(x1$horsepower)   # same pattern: weight column replaced by sqrt(horsepower)
fit2 <- lm(mpg ~ ., data = x1)
summary(fit2)
par(mfrow = c(2, 2))
plot(fit2, cex = 0.2)

x2 <- subset(Auto, select = -name)
x2$weight <- x2$horsepower^2   # same pattern: weight column replaced by horsepower^2
fit3 <- lm(mpg ~ ., data = x2)
summary(fit3)
par(mfrow = c(2, 2))
plot(fit3, cex = 0.2)
```

**Summary of Findings on Transformations**:

-   Log transformation (log(horsepower)): the log term is highly significant and enters with a negative coefficient (higher horsepower is associated with lower mpg). Multiple R-squared: 0.8495.
-   Square-root transformation (sqrt(horsepower)): also highly significant with a negative coefficient, and gives the best fit by a small margin. Multiple R-squared: 0.8499 (highest of the three).
-   Squared transformation (horsepower^2): the squared term is significant with a small positive coefficient while the linear horsepower term is negative, consistent with a curved relationship between horsepower and mpg. Multiple R-squared: 0.8464.

All three transformations give very similar fits (multiple R-squared of roughly 0.846-0.850), with sqrt(horsepower) slightly ahead. Note that in each of these fits the transformed horsepower term was stored in the weight column, so weight itself does not appear in the model; the comparison is among models that contain horsepower and its transform rather than simple augmentations of the original fit.
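
For a compact side-by-side comparison of the three transformed fits (an optional check, not part of the original answer), AIC can be used; lower values indicate a better fit after penalizing for model size:

```{r}
# AIC for the log, sqrt, and squared fits (each has the same number of parameters).
AIC(fit, fit2, fit3)
```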


#### 10.

##### (a) **Multiple linear regression model to predict sales**

```{r}
library(ISLR2)
data(Carseats)

# Fit the model
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)

```
##### (b) **Interpretation of Coefficients**

-   Intercept (\( \beta_0 = 13.0435 \)): the predicted Sales when Price = 0 for a store that is neither urban nor located in the US. Since a price of zero is outside the range of the data, this mainly serves as a baseline rather than a realistic prediction.
-   Price (\( \beta_1 = -0.0545 \)): for every one-unit increase in Price, Sales decrease by about 0.0545 units, holding Urban and US fixed. Higher prices tend to reduce sales.
-   UrbanYes (\( \beta_2 = -0.0219 \)): the estimated difference in sales between urban and rural stores is negligible, as this coefficient is very close to zero. With a p-value of 0.936, we fail to reject the null hypothesis \( H_0: \beta_2 = 0 \); there is no evidence that being in an urban area significantly affects sales.
-   USYes (\( \beta_3 = 1.2006 \)): stores in the US sell about 1.2006 more units than stores outside the US, holding the other predictors constant. This coefficient is statistically significant (p-value 4.86e-06), so we reject the null hypothesis and conclude that being in the US is positively associated with sales.

##### (c) **Model Equation**

The model can be written as:

\[
\text{Sales} = 13.0435 - 0.0545 \times \text{Price} - 0.0219 \times 1(\text{Urban} = \text{Yes}) + 1.2006 \times 1(\text{US} = \text{Yes}) + \epsilon
\]
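
To see how the indicator variables enter this equation, here is an illustration with a hypothetical store (Price = 120, urban, outside the US; the values are made up and not part of the original answer):

```{r}
# predict() applies the same dummy coding as the fitted equation above.
predict(model, newdata = data.frame(Price = 120, Urban = "Yes", US = "No"))
```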

##### (d) **For which of the predictors can you reject the null hypothesis \( H_0: \beta_j = 0 \)?**

-   Price: \( p < 2 \times 10^{-16} \), highly significant, reject \( H_0 \)
-   USYes: \( p = 4.86 \times 10^{-6} \), significant, reject \( H_0 \)

For UrbanYes, however, the p-value is 0.936, far above 0.05, so we cannot reject the null hypothesis for this predictor. There is no evidence that urban versus rural location significantly affects sales.
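
The p-values quoted above can also be pulled out of the fitted model programmatically (a small convenience sketch, not part of the original answer):

```{r}
# Column of p-values from the coefficient table of summary(model).
coef(summary(model))[, "Pr(>|t|)"]
```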

##### (e) **Reduced model**

```{r}
reduced_model <- lm(Sales ~ Price + US, data = Carseats)
summary(reduced_model)

```
##### (f) **Model Fit Comparison**

-   Adjusted R-squared (Full model): 0.2335
-   Adjusted R-squared (Reduced model): 0.2354

The adjusted R-squared of the reduced model is slightly higher than that of the full model, and the residual standard error is slightly lower (2.469 vs. 2.472), so removing Urban does not hurt the model's explanatory power. Urban contributes essentially nothing to the model.
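
As an optional formal check (not part of the original answer), a nested-model F-test comparing the two fits gives the same conclusion as the t-test for UrbanYes:

```{r}
# F-test for dropping Urban from the full model; a large p-value indicates that
# Urban adds no significant explanatory power.
anova(reduced_model, model)
```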

##### (g) **95% Confidence Intervals for coefficients of the reduced model**

```{r}
confint(reduced_model)

```
**Inference**:
- Price is negatively associated with Sales; its 95% confidence interval lies entirely below zero, so the effect is statistically significant.
- Being in the US (USYes) is positively associated with Sales; its 95% confidence interval lies entirely above zero, so this effect is also statistically significant.

##### (h) **Evidence of outliers or high leverage observations**

```{r}
par(mfrow = c(2, 2))
plot(reduced_model)

```
**Inference**:

-   *Linearity*: The Residuals vs. Fitted plot does not show any strong curvature, suggesting the linear relationship is reasonable.
-   *Normality of Residuals*: The Q-Q plot shows that residuals are largely on the diagonal, indicating no major deviation from normality.
-   *Constant Variance (Homoscedasticity)*: The Scale-Location plot looks fairly consistent, so there is no strong evidence of heteroscedasticity.
-   *Influential Points*: While a few observations have somewhat larger residuals or leverage, none appear excessively influential based on Cook's distance.
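
To back up the visual impression numerically (an optional check, not part of the original answer), the largest Cook's distances can be listed directly:

```{r}
# Three largest Cook's distances for the reduced model; values far below 1
# suggest no single observation is unduly influential.
head(sort(cooks.distance(reduced_model), decreasing = TRUE), 3)
```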

#### 12.
##### (a) 
- For regression without an intercept, \( \hat{\beta}_{Y \sim X} = \sum_i x_i y_i / \sum_i x_i^2 \) and \( \hat{\beta}_{X \sim Y} = \sum_i x_i y_i / \sum_i y_i^2 \). The two coefficient estimates are the same exactly when \( \sum_i x_i^2 = \sum_i y_i^2 \), i.e., when X and Y have the same sum of squares (for example, when Y is a reordering of X); otherwise they differ.

##### (b)
```{r}
set.seed(42)

# Generate X and Y with different variances
X <- rnorm(100)
Y <- rnorm(100, sd = 5)

# Fit the regression of Y onto X (without intercept)
reg_Y_on_X <- lm(Y ~ X - 1)
summary(reg_Y_on_X)

# Fit the regression of X onto Y (without intercept)
reg_X_on_Y <- lm(X ~ Y - 1)
summary(reg_X_on_Y)
```
-   **Inference**: The two slope estimates differ (0.1224 for Y onto X versus 0.006441 for X onto Y) because X and Y were generated on different scales, so \( \sum_i x_i^2 \neq \sum_i y_i^2 \). Neither slope is statistically significant (p = 0.78), as expected since X and Y were generated independently.
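
As a quick numeric check of this (illustrative, not part of the original answer), the two no-intercept slopes share the numerator sum(X*Y) but divide by different sums of squares:

```{r}
# Shared numerator and the two different denominators behind the slopes above.
c(sum_XY = sum(X * Y), sum_X2 = sum(X^2), sum_Y2 = sum(Y^2))
```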

##### (c)
```{r}

# Set seed for reproducibility
set.seed(42)

# Generate Y as a permutation of X, so that sum(X^2) equals sum(Y^2)
X <- rnorm(100)
Y <- sample(X)  # same values as X, in a different order

# Fit the regression of Y onto X (without intercept)
reg_Y_on_X <- lm(Y ~ X - 1)
summary(reg_Y_on_X)

# Fit the regression of X onto Y (without intercept)
reg_X_on_Y <- lm(X ~ Y - 1)
summary(reg_X_on_Y)

```

-   **Inference**: Because Y contains exactly the same values as X in a different order, \( \sum_i x_i^2 = \sum_i y_i^2 \), and the two regressions return identical slope estimates, as required in part (c).

