Question 2

invisible("The main difference is the type of response being predicted. KNN classification predicts a categorical label by taking a majority vote among the K nearest neighbors, while KNN regression predicts a continuous numerical value by averaging the responses of those neighbors. Classification is evaluated with the error rate or accuracy, whereas regression is evaluated with the Mean Squared Error (MSE). In short, classification sorts observations into groups, while regression estimates a specific number.")
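
As a minimal sketch of that difference, the base R code below applies both rules to the same K nearest neighbors. The toy data and the helper names (nearest_k, knn_classify, knn_regress) are hypothetical and chosen only for illustration; they are not part of the assignment or of any package.

```r
# Toy 1-D training data (hypothetical values, for illustration only)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_label <- c("A", "A", "A", "B", "B", "B")   # categorical response
train_value <- c(1.1, 1.9, 3.2, 6.1, 6.8, 8.3)   # continuous response

# Indices of the k training points closest to the query point x0
nearest_k <- function(x0, k) order(abs(train_x - x0))[seq_len(k)]

# Classification: majority vote among the k nearest labels
knn_classify <- function(x0, k = 3) {
  votes <- table(train_label[nearest_k(x0, k)])
  names(votes)[which.max(votes)]
}

# Regression: average of the k nearest responses
knn_regress <- function(x0, k = 3) {
  mean(train_value[nearest_k(x0, k)])
}

pred_class <- knn_classify(2.5)   # a label
pred_value <- knn_regress(2.5)    # a number
```

Both functions use the same neighbors; only the final aggregation step (vote versus mean) differs.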

Question 9

# 9(a)
library(ISLR)
pairs(Auto)

# 9(b)
cor(Auto[, -9])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

# 9(c)
model <- lm(mpg ~ . - name, data = Auto)
summary(model)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

# 9(c-I)
invisible("Yes, there is a relationship between the predictors and mpg. The p-value of the overall F-statistic (252.4 on 7 and 384 DF) is below 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero.")

# 9(c-II)
invisible("Using a 0.05 significance level, the statistically significant predictors are displacement, weight, year, and origin. The other variables (cylinders, horsepower, and acceleration) are not statistically significant.")

# 9(c-III)
invisible("The positive coefficient for year (0.75) means that for every additional year, a car's fuel efficiency increases by roughly 0.75 miles per gallon, assuming all other variables remain constant. Newer cars in the dataset tend to get better gas mileage.")

# 9(d)
plot(model)

invisible("The Residuals vs Fitted plot shows a U-shaped pattern, which indicates a non-linear relationship that the linear model does not capture. Points 327 and 394 stand out as outliers, with large positive residuals where actual efficiency is much higher than predicted. Point 14 shows unusually high leverage, so it can have a disproportionate impact on the model.")
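
The visual reading of the diagnostic plots can be cross-checked numerically. The sketch below refits the 9(c) model and extracts leverage and studentized residuals. The cutoffs used here (three times the average leverage, and an absolute studentized residual above 3) are common rules of thumb adopted as assumptions, not values taken from the assignment.

```r
library(ISLR)

# Same model as in 9(c)
model <- lm(mpg ~ . - name, data = Auto)

# Leverage (hat values); their average is always (p + 1) / n
lev     <- hatvalues(model)
avg_lev <- length(coef(model)) / nrow(Auto)

# Studentized residuals for spotting outliers
stud <- rstudent(model)

# Assumed rule-of-thumb cutoffs (not from the assignment)
high_leverage <- names(lev)[lev > 3 * avg_lev]
outliers      <- names(stud)[abs(stud) > 3]

high_leverage
outliers
```

The names returned are the row labels of Auto, which are the same labels that plot(model) prints next to flagged points.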

# 9(e)
interaction_model <- lm(mpg ~ cylinders * displacement + displacement * year, data = Auto)
summary(interaction_model)

## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement * 
##     year, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4366  -1.8976  -0.1448   1.9019  18.2536 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -4.824e+01  8.962e+00  -5.383 1.27e-07 ***
## cylinders              -2.138e+00  4.582e-01  -4.666 4.24e-06 ***
## displacement            7.828e-02  4.564e-02   1.715   0.0871 .  
## year                    1.246e+00  1.085e-01  11.486  < 2e-16 ***
## cylinders:displacement  1.231e-02  1.734e-03   7.100 6.03e-12 ***
## displacement:year      -2.850e-03  5.578e-04  -5.110 5.09e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.516 on 386 degrees of freedom
## Multiple R-squared:  0.7997, Adjusted R-squared:  0.7971 
## F-statistic: 308.2 on 5 and 386 DF,  p-value: < 2.2e-16

invisible("Both interaction effects are statistically significant. The interaction between cylinders and displacement (cylinders:displacement) and the interaction between displacement and year (displacement:year) both have p-values below 0.05, showing that the effect of each variable on mpg depends on the level of the other.")
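
One way to test the two interaction terms jointly, rather than one t-test at a time, is a partial F-test against the main-effects model. This is a sketch under the assumption that the main-effects model with cylinders, displacement, and year is the right baseline; the object names main_effects and with_interactions are introduced here for illustration.

```r
library(ISLR)

# Baseline: the same three predictors without interactions
main_effects <- lm(mpg ~ cylinders + displacement + year, data = Auto)

# Model from 9(e) with both interaction terms
with_interactions <- lm(mpg ~ cylinders * displacement + displacement * year,
                        data = Auto)

# Partial F-test: do the two interaction terms jointly improve the fit?
f_test <- anova(main_effects, with_interactions)
print(f_test)
```

A small p-value in the second row of the table indicates that adding the interactions reduces the residual sum of squares by more than chance would explain.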

# 9(f)
transformed_model <- lm(mpg ~ log(displacement) + sqrt(weight) + I(horsepower^2) + year + origin, data = Auto)
summary(transformed_model)

## 
## Call:
## lm(formula = mpg ~ log(displacement) + sqrt(weight) + I(horsepower^2) + 
##     year + origin, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5959  -1.9058  -0.0392   1.6828  13.1154 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        7.239e+00  5.148e+00   1.406   0.1605    
## log(displacement) -2.145e+00  1.046e+00  -2.051   0.0409 *  
## sqrt(weight)      -6.328e-01  6.512e-02  -9.717   <2e-16 ***
## I(horsepower^2)    6.665e-05  3.069e-05   2.172   0.0305 *  
## year               7.846e-01  4.867e-02  16.122   <2e-16 ***
## origin             6.059e-01  2.839e-01   2.134   0.0334 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.202 on 386 degrees of freedom
## Multiple R-squared:  0.8339, Adjusted R-squared:  0.8317 
## F-statistic: 387.6 on 5 and 386 DF,  p-value: < 2.2e-16

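invisible("The log, square root, and squared terms are all statistically significant at the 0.05 level. Compared with the untransformed model in 9(c), the transformed model has a higher R-squared (0.8339 vs 0.8215) and a lower residual standard error (3.202 vs 3.328), which suggests the transformations better fit the non-linear patterns in the data.")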
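
A like-for-like comparison keeps the same five predictors and changes only the transformations. The sketch below does that; the untransformed counterpart and the fit_summary table are additions made here for illustration and were not part of the original answer.

```r
library(ISLR)

# Same predictors, no transformations (hypothetical comparison model)
untransformed <- lm(mpg ~ displacement + weight + horsepower + year + origin,
                    data = Auto)

# Model from 9(f)
transformed <- lm(mpg ~ log(displacement) + sqrt(weight) + I(horsepower^2) +
                    year + origin, data = Auto)

# Collect adjusted R-squared and residual standard error side by side
fit_summary <- data.frame(
  model  = c("untransformed", "transformed"),
  adj_r2 = c(summary(untransformed)$adj.r.squared,
             summary(transformed)$adj.r.squared),
  rse    = c(sigma(untransformed), sigma(transformed))
)
fit_summary
```

Because both models have the same number of coefficients, a higher adjusted R-squared and a lower residual standard error point in the same direction.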

Question 10

library(ISLR)

# (10a)
carseats_model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseats_model)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

# (10b)
invisible("Sales is measured in thousands of units, so each coefficient is multiplied by 1,000 to express it in units. The intercept represents a baseline prediction of about 13,043 units sold for a rural store outside the US charging a price of $0. For every $1 increase in price, sales decrease by about 54.5 units, holding all other variables constant. Stores in urban areas sell about 21.9 fewer units than rural stores, although this difference is not statistically significant, while stores in the US sell about 1,200.6 more units than stores outside the US.")
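
The unit figures quoted in this answer come from rescaling the fitted coefficients. A minimal sketch of that arithmetic, assuming Sales is recorded in thousands of units as described in the ISLR documentation for Carseats, with the coefficients copied from the summary output above:

```r
# Coefficients copied from the summary(carseats_model) output
coefs <- c(Intercept = 13.043469,
           Price     = -0.054459,
           UrbanYes  = -0.021916,
           USYes     = 1.200573)

# Assumption: Sales is in thousands of units, so multiply by 1000
units <- coefs * 1000
round(units, 1)
```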

# (10c)
# Sales = 13.0435 - 0.0545 * Price - 0.0219 * UrbanYes + 1.2006 * USYes
invisible("UrbanYes = 1 if the store is in an urban location, 0 if rural. USYes = 1 if the store is in the US, 0 if it is outside the US.")
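
The equation can be wrapped in a small function to make the role of the two dummy variables concrete. The function name predict_sales and the example store (price of 100, urban, in the US) are hypothetical and used only for illustration.

```r
# Fitted equation from 10(c); urban and us are 0/1 dummy variables
predict_sales <- function(price, urban, us) {
  13.0435 - 0.0545 * price - 0.0219 * urban + 1.2006 * us
}

# Hypothetical store: price 100, urban location, inside the US
example_sales <- predict_sales(price = 100, urban = 1, us = 1)
example_sales
```

Setting urban or us to 0 simply drops the corresponding term, which is how the baseline categories (rural, outside the US) are absorbed into the intercept.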

# (10d)
invisible("Reject the null hypothesis for Price and US, because their p-values are below the 0.05 significance level; both have a statistically significant association with sales. Fail to reject the null hypothesis for Urban, because its p-value is 0.936; there is no evidence of a statistically significant relationship between a store's urban location and its sales.")

# (10e)
carseats_smaller_model <- lm(Sales ~ Price + US, data = Carseats)
summary(carseats_smaller_model)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

invisible("The smaller model uses only Price and US as predictors, because Urban did not show evidence of an association with Sales.")

# (10f)
invisible("Both models fit the data similarly, but the smaller model (10e) is slightly better because it removes the unnecessary Urban variable. The R-squared value for both models is approximately 0.2393. The smaller model has a slightly higher adjusted R-squared (0.2354 vs 0.2335) and a lower residual standard error (2.469 vs 2.472).")

# (10g)
confint(carseats_smaller_model, level = 0.95)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

invisible("The 95% confidence interval for Price ranges from -0.0648 to -0.0442, meaning that for each $1 increase in price, sales decrease by between 44.2 and 64.8 units. The 95% confidence interval for USYes ranges from 0.6915 to 1.7078, indicating that US stores sell between 691.5 and 1,707.8 more units than stores outside the US. Neither interval contains zero, so both predictors remain statistically significant.")
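
The intervals reported by confint() can be reproduced by hand as estimate plus or minus a t quantile times the standard error. The sketch below uses the rounded estimates and standard errors printed in the 10(e) summary, so the results should agree with confint() only up to rounding.

```r
# Estimates and standard errors copied from summary(carseats_smaller_model)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)

# Residual degrees of freedom from the same summary
df_resid <- 397

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
ci
```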

# (10h)
plot(carseats_smaller_model, which = 5)

invisible("There are high leverage observations: multiple points exceed the average leverage of 0.0075 (3/400), and one point surpasses 0.04. This is not a problem, because these points fall well within the Cook's distance boundaries, meaning they do not unduly influence the regression model.")

Question 12

# (12a)
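invisible("Without an intercept, the coefficient for Y onto X is sum(X_i * Y_i) / sum(X_i^2) and the coefficient for X onto Y is sum(X_i * Y_i) / sum(Y_i^2). The two estimates are therefore identical when the sum of squares of X equals the sum of squares of Y: sum(X_i^2) = sum(Y_i^2).")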

# (12b)
set.seed(123)

X <- rnorm(100, mean = 2, sd = 1)
Y <- 3 * X + rnorm(100, mean = 0, sd = 1)

model_yx <- lm(Y ~ X - 1)
model_xy <- lm(X ~ Y - 1)

coef(model_yx)

##       X 
## 2.94839

coef(model_xy)

##         Y 
## 0.3323714

invisible("The coefficient for Y onto X is 2.9484, while the coefficient for X onto Y is 0.3324. The two estimates are different because the sums of squares of X and Y are not equal.")
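
The two coefficients can also be computed directly from the closed-form expression for regression through the origin, which makes the role of the denominators explicit. This sketch regenerates the same data with the same seed; the names b_yx and b_xy are introduced here for illustration.

```r
set.seed(123)

X <- rnorm(100, mean = 2, sd = 1)
Y <- 3 * X + rnorm(100, mean = 0, sd = 1)

# Closed-form slopes for regression without an intercept
b_yx <- sum(X * Y) / sum(X^2)   # Y onto X
b_xy <- sum(X * Y) / sum(Y^2)   # X onto Y

# The numerators match, so the slopes differ only through the denominators
c(b_yx = b_yx, b_xy = b_xy, ss_x = sum(X^2), ss_y = sum(Y^2))
```

The ratio b_yx / b_xy equals sum(Y^2) / sum(X^2), so the slopes coincide only when that ratio is 1.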

# (12c)
set.seed(123)

X <- rnorm(100, mean = 0, sd = 1)
Y <- sample(X)

model_yx <- lm(Y ~ X - 1)
model_xy <- lm(X ~ Y - 1)

coef(model_yx)

##         X 
## 0.1331142

coef(model_xy)

##         Y 
## 0.1331142

invisible("Because Y is a random permutation of X, the set of values is identical, which forces the sums of squares of X and Y to be exactly equal. Both regressions therefore yield an identical coefficient estimate of 0.1331.")
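
A permutation is only one way to satisfy the condition. As a further sketch, taking Y as the absolute value of X also preserves the sum of squares, so the two slopes should again coincide. This alternative construction is an illustration added here, not part of the original answer.

```r
set.seed(123)

X <- rnorm(100, mean = 0, sd = 1)
Y <- abs(X)   # not a permutation of X, but sum(Y^2) equals sum(X^2)

b_yx <- unname(coef(lm(Y ~ X - 1)))
b_xy <- unname(coef(lm(X ~ Y - 1)))

c(b_yx = b_yx, b_xy = b_xy, ss_x = sum(X^2), ss_y = sum(Y^2))
```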

Assignment 2

Jairo Cordon

2026-06-18
