Assignment 2

Author

Jonathan McCanlas

Assignment 2

Problem 2

KNN classifier and KNN regression: K-Nearest Neighbors (KNN) makes predictions by looking at the K closest data points. If you are predicting a category, like whether a college is private or public, KNN classification takes a majority vote among the K nearest neighbors and picks the most common class. If you are predicting a number, like a salary or a house price, KNN regression averages the responses of the K nearest neighbors. So the difference is simple: use KNN classification when you are picking a label, and KNN regression when you are estimating a number.
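To make the distinction concrete, here is a minimal sketch in R. It uses the built-in iris data purely for illustration (not the assignment data), the class package for the classifier, and a small hand-rolled knn_reg helper (a hypothetical function, not from the assignment) for the regression case:

library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)

# Classification: majority vote among the k nearest neighbors
pred_label <- knn(train = iris[train_idx, 1:4],
                  test  = iris[-train_idx, 1:4],
                  cl    = iris$Species[train_idx],
                  k     = 5)

# Regression: average the responses of the k nearest neighbors
knn_reg <- function(x_train, y_train, x_new, k = 5) {
  d <- sqrt(colSums((t(x_train) - x_new)^2))   # Euclidean distances to x_new
  mean(y_train[order(d)[1:k]])                 # mean response of the k closest points
}
pred_value <- knn_reg(as.matrix(iris[train_idx, 2:4]),
                      iris$Sepal.Length[train_idx],
                      as.numeric(iris[1, 2:4]))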

Problem 9

# Read Auto.csv directly ("?" marks missing values), since the ISLR package
# is not loaded at this point.
auto <- read.csv("Auto.csv", na.strings = "?")
auto <- na.omit(auto)   # drop rows with missing values

Section A

pairs(auto[, -which(names(auto) == "name")],
      main = "Scatterplot Matrix of Auto Data")

Section B

auto_numeric <- auto[, -which(names(auto) == "name")]
cor(auto_numeric[, sapply(auto_numeric, is.numeric)])
                    mpg  cylinders displacement horsepower     weight
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
             acceleration       year     origin
mpg             0.4233285  0.5805410  0.5652088
cylinders      -0.5046834 -0.3456474 -0.5689316
displacement   -0.5438005 -0.3698552 -0.6145351
horsepower     -0.6891955 -0.4163615 -0.4551715
weight         -0.4168392 -0.3091199 -0.5850054
acceleration    1.0000000  0.2903161  0.2127458
year            0.2903161  1.0000000  0.1815277
origin          0.2127458  0.1815277  1.0000000

Section C

model <- lm(mpg ~ . - name, data = auto)
summary(model)

Call:
lm(formula = mpg ~ . - name, data = auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Part 1 Is there a relationship between the predictors and the response (mpg)?

Yes, definitely. The F-statistic is 252.4 with a very small p-value (< 2.2e-16), which means that at least one of the predictors is significantly related to mpg. The R-squared is 0.8215, meaning that the model explains about 82% of the variance in mpg. That's a strong fit.

Part 2 Which predictors are statistically significant?

From the p-value column, the following predictors are significant at the 0.05 level:

displacement (p = 0.00844)

weight (p < 2e-16)

year (p < 2e-16)

origin (p = 4.67e-07)

These have strong relationships with mpg, while cylinders, horsepower, and acceleration are not significant in this model.

Part 3 What does the coefficient for year suggest?

The year coefficient is about 0.751, which means:

Each additional model year increases the expected mpg by about 0.75, holding all other variables constant.

This suggests that newer cars are more fuel-efficient.

Section D

model <- lm(mpg ~ . - name, data = auto)
plot(model)

The Residuals vs Fitted plot shows a curved pattern, suggesting some non-linearity; the relationship between the predictors and mpg may not be purely linear.

A few observations (e.g., 323, 327, 326) stand out as potential outliers or high-leverage points.

These points may have a strong influence on the model and should be reviewed.

Overall, the model might benefit from transformations or nonlinear terms to improve fit.
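To follow up on the flagged points, a small sketch using base R's influence measures (assuming the model object fitted above) could look like this:

lev  <- hatvalues(model)        # leverage of each observation
cook <- cooks.distance(model)   # overall influence of each observation

# Observations with unusually high leverage (rule of thumb: > 2 * p / n)
p <- length(coef(model)); n <- nrow(auto)
which(lev > 2 * p / n)

# The five most influential observations by Cook's distance
head(sort(cook, decreasing = TRUE), 5)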

Section E

model_interact <- lm(mpg ~ horsepower * weight, data = auto)
summary(model_interact)

Call:
lm(formula = mpg ~ horsepower * weight, data = auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.7725  -2.2074  -0.2708   1.9973  14.7314 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        6.356e+01  2.343e+00  27.127  < 2e-16 ***
horsepower        -2.508e-01  2.728e-02  -9.195  < 2e-16 ***
weight            -1.077e-02  7.738e-04 -13.921  < 2e-16 ***
horsepower:weight  5.355e-05  6.649e-06   8.054 9.93e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.93 on 388 degrees of freedom
Multiple R-squared:  0.7484,    Adjusted R-squared:  0.7465 
F-statistic: 384.8 on 3 and 388 DF,  p-value: < 2.2e-16

There is a strong and statistically significant interaction between horsepower and weight, meaning the effect of horsepower on fuel efficiency (mpg) depends on the weight of the car. Because the interaction coefficient is positive, the negative effect of an extra unit of horsepower on mpg shrinks as weight increases (and the weight penalty shrinks as horsepower increases): the two penalties partially offset rather than simply adding up.
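To see this numerically, a quick sketch using the fitted coefficients (the weight values below are just illustrative):

b <- coef(model_interact)
weights <- c(2000, 3000, 4000, 5000)

# Marginal effect of one extra horsepower on mpg at each weight:
# beta_horsepower + beta_interaction * weight
slope_hp <- b["horsepower"] + b["horsepower:weight"] * weights
round(setNames(slope_hp, paste0("weight=", weights)), 3)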

model_year_origin <- lm(mpg ~ year * origin, data = auto)
summary(model_year_origin)

Call:
lm(formula = mpg ~ year * origin, data = auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.3141  -3.7120  -0.6513   3.3621  15.5859 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -83.3809    12.0000  -6.948 1.57e-11 ***
year          1.3089     0.1576   8.305 1.68e-15 ***
origin       17.3752     6.8325   2.543   0.0114 *  
year:origin  -0.1663     0.0889  -1.871   0.0621 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.199 on 388 degrees of freedom
Multiple R-squared:  0.5596,    Adjusted R-squared:  0.5562 
F-statistic: 164.4 on 3 and 388 DF,  p-value: < 2.2e-16

The year of the car has a strong positive effect on mpg: newer cars tend to get better mileage. Higher values of origin (European and Japanese cars) are associated with higher mpg (p = 0.011), but the year:origin interaction is only marginally significant (p = 0.062), so there is no strong evidence that the relationship between year and mpg changes based on the car's origin.
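Because origin is coded 1 = American, 2 = European, 3 = Japanese, treating it as a number forces a single linear trend across regions. A possible follow-up (a sketch, not required by the problem) is to refit with origin as a factor, so each region gets its own intercept and its own year slope:

auto$origin_f <- factor(auto$origin, levels = c(1, 2, 3),
                        labels = c("American", "European", "Japanese"))
model_year_origin_f <- lm(mpg ~ year * origin_f, data = auto)
summary(model_year_origin_f)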

Section F

model_log <- lm(mpg ~ log(horsepower) + log(weight), data = auto)
summary(model_log)

Call:
lm(formula = mpg ~ log(horsepower) + log(weight), data = auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6665  -2.4028  -0.3842   2.1558  15.3359 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      179.973      7.420   24.25  < 2e-16 ***
log(horsepower)   -7.672      1.210   -6.34 6.36e-10 ***
log(weight)      -15.244      1.478  -10.32  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.993 on 389 degrees of freedom
Multiple R-squared:  0.7396,    Adjusted R-squared:  0.7382 
F-statistic: 552.4 on 2 and 389 DF,  p-value: < 2.2e-16

Both predictors are statistically significant (p-values < 0.001).

Negative coefficients:

As log(horsepower) increases, mpg decreases (−7.672). As log(weight) increases, mpg decreases (−15.244).

The log-transformed model shows that both horsepower and weight are strong, statistically significant predictors of mpg. As either increases, mpg drops, confirming that heavier and more powerful cars are less fuel-efficient. The model explains about 74% of the variation in mpg, with an average prediction error of around ±4 mpg. Using the log transformation improved how well the model captures non-linear relationships.
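One way to read the log coefficients, shown as a small sketch below: a proportional change in a predictor shifts predicted mpg by the log of that proportion times the coefficient.

# Example: the mpg change associated with a 10% increase in weight,
# holding horsepower fixed (uses the fitted model_log from above).
b_w <- coef(model_log)["log(weight)"]
log(1.10) * b_w   # roughly -1.45 mpg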

model_sqrt <- lm(mpg ~ sqrt(horsepower) + sqrt(weight), data = auto)
summary(model_sqrt)

Call:
lm(formula = mpg ~ sqrt(horsepower) + sqrt(weight), data = auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.9211  -2.6240  -0.3587   2.2098  15.7776 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      68.49726    1.49097  45.941  < 2e-16 ***
sqrt(horsepower) -1.27083    0.23678  -5.367 1.38e-07 ***
sqrt(weight)     -0.59713    0.05522 -10.813  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.096 on 389 degrees of freedom
Multiple R-squared:  0.726, Adjusted R-squared:  0.7246 
F-statistic: 515.5 on 2 and 389 DF,  p-value: < 2.2e-16

Both sqrt(horsepower) and sqrt(weight) are statistically significant predictors of mpg (p-values < 0.001)

The negative coefficients mean that as horsepower or weight increases (even slightly), fuel efficiency decreases.

For each unit increase in sqrt(horsepower), mpg drops by about 1.27. For each unit increase in sqrt(weight), mpg drops by about 0.60.

Model Performance: R-squared = 0.726 → the model explains about 72.6% of the variability in mpg.

Residual Standard Error (RSE) = 4.096 → on average, predictions are off by about ±4 mpg.

Using the square root transformation improves the fit over a basic linear model and effectively captures some non-linear effects. The fit is slightly less strong than the log-transformed model, but still solid.

model_quad <- lm(mpg ~ horsepower + I(horsepower^2) + weight + I(weight^2), data = auto)
summary(model_quad)

Call:
lm(formula = mpg ~ horsepower + I(horsepower^2) + weight + I(weight^2), 
    data = auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.7988  -2.2736  -0.2347   2.0022  14.8074 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      6.444e+01  2.850e+00  22.608  < 2e-16 ***
horsepower      -2.163e-01  3.937e-02  -5.495 7.09e-08 ***
I(horsepower^2)  5.727e-04  1.400e-04   4.092 5.21e-05 ***
weight          -1.261e-02  2.306e-03  -5.468 8.17e-08 ***
I(weight^2)      1.258e-06  3.483e-07   3.610 0.000346 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.937 on 387 degrees of freedom
Multiple R-squared:  0.7481,    Adjusted R-squared:  0.7455 
F-statistic: 287.4 on 4 and 387 DF,  p-value: < 2.2e-16

R-squared = 0.7481 → About 74.8% of the variation in mpg is explained — a strong fit.

Residual Standard Error (RSE) = 3.937 → Average error in predictions is about ±4 mpg, slightly better than the square root model.

Including squared terms helps capture nonlinear effects between mpg, horsepower, and weight. The improvement is modest compared to log transformations, but it’s a solid approach for modeling curvature while keeping predictors on their original scale.
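As a side-by-side check of the three transformed fits (a sketch that assumes model_log, model_sqrt, and model_quad are still in the workspace):

data.frame(
  model    = c("log", "sqrt", "quadratic"),
  adj_r_sq = c(summary(model_log)$adj.r.squared,
               summary(model_sqrt)$adj.r.squared,
               summary(model_quad)$adj.r.squared),
  RSE      = c(summary(model_log)$sigma,
               summary(model_sqrt)$sigma,
               summary(model_quad)$sigma),
  AIC      = c(AIC(model_log), AIC(model_sqrt), AIC(model_quad))
)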

Problem 10

Section A

library(ISLR2)
data("Carseats")
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)

Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Section B

In this regression model, the intercept of 13.04 represents the baseline sales (in thousands of units) for a store that is not located in an urban area, is not in the US, and has a price of zero (price = 0 is not realistic, but it anchors the model). The coefficient for Price is −0.054, indicating that each $1 increase in price is associated with a decrease of about 0.054 thousand units (roughly 54 car seats) in sales, holding other factors constant. The coefficient for UrbanYes is −0.022, suggesting urban stores have slightly lower sales than non-urban ones, but this effect is not statistically significant (p = 0.936). US stores, on the other hand, sell significantly more, about 1.2 thousand units more than non-US stores, as shown by the USYes coefficient of 1.20.
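As an illustration of what those coefficients imply (the Price value of 120 is made up for this example), the fitted model can be used to compare predicted sales for two hypothetical stores:

new_stores <- data.frame(Price = c(120, 120),
                         Urban = c("Yes", "No"),
                         US    = c("Yes", "No"))
predict(model, newdata = new_stores)   # predicted Sales, in thousands of units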

Section C

Sales = 13.04 − 0.054 × Price − 0.022 × UrbanYes + 1.20 × USYes, where UrbanYes = 1 for urban stores and USYes = 1 for US stores (0 otherwise).

Section D

Predictor | Coefficient | p-value | Significance
----------|-------------|---------|--------------
Price | −0.054 | < 2e-16 | Yes (reject H_0)
UrbanYes | −0.022 | 0.936 | No (do not reject H_0)
USYes | +1.201 | 4.86e-06 | Yes (reject H_0)

We can reject the null hypothesis for:

Price (the p-value is extremely small)

USYes (also a very small p-value)

We cannot reject the null hypothesis for:

UrbanYes (the p-value of 0.936 is far too large, so it is not statistically significant)

Section E

model_small <- lm(Sales ~ Price + US, data = Carseats)
summary(model_small)

Call:
lm(formula = Sales ~ Price + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

This model drops Urban because it didn’t show a statistically significant relationship with Sales.

Section F

We tried two models to predict Sales. The first model used three things: Price, Urban, and US. The second, simpler model used just Price and US.

The results showed that both models explain the same share of the variation in Sales (R-squared = 0.2393 for each), while the simpler model has a slightly higher adjusted R-squared (0.2354 vs. 0.2335) and a slightly lower residual standard error. In other words, Urban did not really help explain sales, so we did not need it.

Using fewer, more useful predictors worked just as well — and made the model easier to understand.
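A formal way to confirm that dropping Urban costs nothing is an F-test comparing the two nested models (a sketch, using the model and model_small objects fitted above):

anova(model_small, model)   # tests whether adding Urban significantly improves the fit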

Section G

confint(model_small)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632

Section H

plot(model_small)

The diagnostic plots suggest that the model fits the data reasonably well. The residuals appear evenly spread, which supports the assumption of constant variance. The Q-Q plot shows that most residuals follow a normal pattern. The scale-location plot also confirms that error sizes stay fairly consistent. Lastly, the leverage plot shows no major influential observations, so there’s no serious concern about individual points skewing the model.

Problem 12

Section A

The coefficient estimate for the regression of Y onto X (without an intercept) is the same as the coefficient for the regression of X onto Y (also without an intercept) if and only if the two variables have the same sum of squares, i.e., sum(x_i^2) = sum(y_i^2). This follows because the Y-on-X coefficient is sum(x_i * y_i) / sum(x_i^2) and the X-on-Y coefficient is sum(x_i * y_i) / sum(y_i^2), so they agree exactly when the denominators match.
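A quick self-contained check of those closed-form coefficients (this sketch generates its own toy data and is separate from the simulations in Sections B and C):

set.seed(42)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# lm() with no intercept reproduces the closed-form coefficients
c(lm = unname(coef(lm(y ~ x + 0))), formula = sum(x * y) / sum(x^2))
c(lm = unname(coef(lm(x ~ y + 0))), formula = sum(x * y) / sum(y^2))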

Section B

# Set seed for reproducibility
set.seed(1)

# Generate Y with sd = 3
Y <- rnorm(100, mean = 0, sd = 3)

# Create X as a noisy version of Y, so that sum(X^2) differs from sum(Y^2)
X <- Y + rnorm(100, mean = 0, sd = 1)

# Regression: Y onto X (no intercept)
model_Y_on_X <- lm(Y ~ X + 0)
coef(model_Y_on_X)
        X 
0.8905322 
# Regression: X onto Y (no intercept)
model_X_on_Y <- lm(X ~ Y + 0)
coef(model_X_on_Y)
        Y 
0.9979587 

Section C

set.seed(2)

# Generate a base variable Z
Z <- rnorm(100)

# Let X and Y be identical, so that sum(X^2) = sum(Y^2)
X <- Z
Y <- Z

# Regression: Y onto X (no intercept)
model_Y_on_X <- lm(Y ~ X + 0)
coef(model_Y_on_X)
X 
1 
# Regression: X onto Y (no intercept)
model_X_on_Y <- lm(X ~ Y + 0)
coef(model_X_on_Y)
Y 
1 
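As a further illustration (not required by the problem), X and Y do not have to be identical: any pair with equal sums of squares works, for example letting Y be a random permutation of X.

set.seed(3)
X <- rnorm(100)
Y <- sample(X)        # same values in a different order, so sum(X^2) == sum(Y^2)

coef(lm(Y ~ X + 0))   # the two no-intercept coefficients...
coef(lm(X ~ Y + 0))   # ...are equal, though not equal to 1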
