The KNN classifier and the KNN regressor predict different kinds of responses: the KNN classifier predicts categorical variables, while the KNN regressor predicts continuous numerical variables.
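For contrast, here is a minimal sketch of the two in R, assuming the class and FNN packages are installed (the toy data below is invented purely for illustration and is not from the Auto data set):

set.seed(1)
train_x <- matrix(rnorm(100), ncol = 2)        # 50 training points, 2 features
test_x <- matrix(rnorm(10), ncol = 2)          # 5 new points to predict
cls <- factor(rep(c("A", "B"), each = 25))     # categorical response
num <- rowSums(train_x) + rnorm(50, sd = 0.1)  # continuous response

class::knn(train_x, test_x, cl = cls, k = 3)        # classifier: returns class labels
FNN::knn.reg(train_x, test_x, y = num, k = 3)$pred  # regressor: returns numeric predictions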
Auto <- read.table("Auto.data", header = T, na.strings = "?",
    stringsAsFactors = T)
Auto <- na.omit(Auto)
View(Auto)
pairs(Auto[, -9])  # scatterplot matrix, excluding the qualitative name column
cor(Auto[, -9])    # correlation matrix of the quantitative variables
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
There is a relationship between the predictors and the response. The multiple R-squared value (0.8215) shows that around 82.15% of the variance in mpg is explained by the predictors in the model. Also, since the p-value of the F-statistic is very small (< 2.2e-16), at least one of the predictors has a linear relationship with the mpg variable.
The variables that have a statistically significant relationship to the response are weight, year, origin, and displacement.
The coefficient for the year variable is 0.750773. This suggests that, holding the other predictors fixed, each one-year increase in model year is associated with an increase of about 0.75 miles per gallon.
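These conclusions can be read straight off the coefficient table; a small sketch pulling out the predictors with p-values below 0.05:

# Keep only the rows of the coefficient table whose p-value is below 0.05
coefs <- summary(lm.fit)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]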
plot(lm.fit)  # residuals vs. fitted, Q-Q, scale-location, and residuals-vs-leverage plots
We can see in the Q-Q plot that the residuals fall fairly close to the reference line, but there are deviations, especially in the tails. The residuals-vs-leverage plot does show an observation with unusually high leverage, and that is observation 14.
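One way to confirm this numerically is to look at the hat values, which measure leverage (a sketch using base R's hatvalues()):

# The largest hat value identifies the highest-leverage observation
lev <- hatvalues(lm.fit)
which.max(lev)       # index of the flagged observation
lev[which.max(lev)]  # its leverage, compared with the average (p + 1)/n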
lm.fit.inter <- lm(mpg ~ . - name + cylinders:displacement + displacement:weight, data = Auto)
summary(lm.fit.inter)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders:displacement + displacement:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0609 -1.7589 -0.0494 1.5790 12.1496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.795e+00 4.515e+00 -1.062 0.28883
## cylinders -1.091e-01 5.965e-01 -0.183 0.85502
## displacement -7.186e-02 1.363e-02 -5.273 2.25e-07 ***
## horsepower -3.457e-02 1.304e-02 -2.651 0.00836 **
## weight -1.030e-02 1.064e-03 -9.680 < 2e-16 ***
## acceleration 6.618e-02 8.817e-02 0.751 0.45334
## year 7.840e-01 4.566e-02 17.171 < 2e-16 ***
## origin 5.475e-01 2.643e-01 2.071 0.03901 *
## cylinders:displacement 1.186e-03 2.715e-03 0.437 0.66251
## displacement:weight 2.141e-05 3.712e-06 5.768 1.66e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.967 on 382 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8555
## F-statistic: 258.2 on 9 and 382 DF, p-value: < 2.2e-16
Yes, there are interactions that are statistically significant. The interaction displacement:weight is statistically significant because it has a very small p-value (1.66e-08). This means that the effect of displacement on mpg depends on the level of weight (and vice versa).
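A partial F-test comparing the original fit with the interaction model makes the same point formally (a sketch using base R's anova(); it assumes lm.fit from above is still in the workspace):

# Does adding the two interaction terms significantly improve the fit?
anova(lm.fit, lm.fit.inter)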
lm.fit.log <- lm(mpg ~ log(displacement) + cylinders + horsepower + weight + acceleration + year + origin, data = Auto)
summary(lm.fit.log)
##
## Call:
## lm(formula = mpg ~ log(displacement) + cylinders + horsepower +
## weight + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6594 -1.8712 -0.0741 1.6427 12.8462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3922996 7.1102490 0.336 0.736709
## log(displacement) -5.2475829 1.3910486 -3.772 0.000187 ***
## cylinders 0.8052759 0.3081112 2.614 0.009312 **
## horsepower -0.0048428 0.0130659 -0.371 0.711106
## weight -0.0044886 0.0006912 -6.494 2.58e-10 ***
## acceleration -0.0047404 0.0986602 -0.048 0.961703
## year 0.7437614 0.0503990 14.757 < 2e-16 ***
## origin 0.6282457 0.3011778 2.086 0.037642 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.297 on 384 degrees of freedom
## Multiple R-squared: 0.8247, Adjusted R-squared: 0.8215
## F-statistic: 258.1 on 7 and 384 DF, p-value: < 2.2e-16
lm.fit.sqrt <- lm(sqrt(mpg) ~ displacement + cylinders + horsepower + weight + acceleration + year + origin, data = Auto)
summary(lm.fit.sqrt)
##
## Call:
## lm(formula = sqrt(mpg) ~ displacement + cylinders + horsepower +
## weight + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98891 -0.18946 0.00505 0.16947 1.02581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.075e+00 4.290e-01 2.506 0.0126 *
## displacement 1.752e-03 6.942e-04 2.524 0.0120 *
## cylinders -5.942e-02 2.986e-02 -1.990 0.0474 *
## horsepower -2.512e-03 1.274e-03 -1.972 0.0493 *
## weight -6.367e-04 6.024e-05 -10.570 < 2e-16 ***
## acceleration 2.738e-03 9.131e-03 0.300 0.7644
## year 7.381e-02 4.709e-03 15.675 < 2e-16 ***
## origin 1.217e-01 2.569e-02 4.735 3.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3074 on 384 degrees of freedom
## Multiple R-squared: 0.8561, Adjusted R-squared: 0.8535
## F-statistic: 326.3 on 7 and 384 DF, p-value: < 2.2e-16
lm.fit.sq <- lm(mpg ~ I(horsepower^2) + displacement + cylinders + horsepower + weight + acceleration + year + origin, data = Auto)
summary(lm.fit.sq)
##
## Call:
## lm(formula = mpg ~ I(horsepower^2) + displacement + cylinders +
## horsepower + weight + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5497 -1.7311 -0.2236 1.5877 11.9955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3236564 4.6247696 0.286 0.774872
## I(horsepower^2) 0.0010060 0.0001065 9.449 < 2e-16 ***
## displacement -0.0075649 0.0073733 -1.026 0.305550
## cylinders 0.3489063 0.3048310 1.145 0.253094
## horsepower -0.3194633 0.0343447 -9.302 < 2e-16 ***
## weight -0.0032712 0.0006787 -4.820 2.07e-06 ***
## acceleration -0.3305981 0.0991849 -3.333 0.000942 ***
## year 0.7353414 0.0459918 15.989 < 2e-16 ***
## origin 1.0144130 0.2545545 3.985 8.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.001 on 383 degrees of freedom
## Multiple R-squared: 0.8552, Adjusted R-squared: 0.8522
## F-statistic: 282.8 on 8 and 383 DF, p-value: < 2.2e-16
Based on these results, we can see that adding squared horsepower clearly improved the fit (R-squared of 0.8552 versus 0.8215 for the original model), and the log transformation of displacement also improved the model slightly (R-squared of 0.8247). The sqrt(mpg) model reports the highest R-squared (0.8561), but since its response is on a different scale, that value is not directly comparable to the models that predict mpg itself.
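To line up the fits in one place, one could collect the adjusted R-squared values (a sketch; recall the caveat that lm.fit.sqrt models sqrt(mpg) rather than mpg):

# Adjusted R-squared for each candidate model
models <- list(base = lm.fit, log_disp = lm.fit.log,
               sqrt_mpg = lm.fit.sqrt, hp_sq = lm.fit.sq)
sapply(models, function(m) summary(m)$adj.r.squared)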
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
data(Carseats)
lm.fit.full <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit.full)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The intercept is the estimated value of Sales when all of the predictors are zero; its estimate is 13.043469. So if Price is zero, the store is not in an urban area, and the store is not in the US, predicted Sales would be about 13.04 units. (In the Carseats data, Sales is recorded in thousands of units, so this is roughly 13,043 car seats.)
The estimated coefficient for the Price variable is -0.054459: for every one-dollar increase in Price, Sales are predicted to decrease by about 0.054 units, holding Urban and US fixed. This shows an inverse relationship between Price and Sales.
Next is Urban. Urban is a qualitative variable, and UrbanYes is its dummy variable. The estimated coefficient for UrbanYes is -0.021916, the difference in Sales between stores in urban areas and stores that are not. Taken at face value, it says urban stores have sales lower by about 0.022 units, but we shouldn't read anything into this estimate because it's not statistically significant (p = 0.936).
Lastly, we have US. Similar to before, US is a qualitative variable, and USYes is its dummy variable. The estimated coefficient for USYes is 1.200573, the difference in Sales between stores in the US and stores outside it: US stores have sales higher by about 1.2 units. We should take this estimate seriously because it is statistically significant (p = 4.86e-06).
Sales = β0 + β1 * Price + β2 * UrbanYes + β3 * USYes
β0 = intercept
β1 = coefficient for Price
β2 = coefficient for UrbanYes (1 if the store is in an urban area, 0 otherwise)
β3 = coefficient for USYes (1 if the store is in the US, 0 otherwise)
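As a quick check on the fitted equation, one could predict Sales for a hypothetical store directly (a sketch; the Price value of 120 is arbitrary):

# Predicted Sales for a non-urban, non-US store priced at 120
predict(lm.fit.full, newdata = data.frame(Price = 120, Urban = "No", US = "No"))
# The same value computed by hand from the coefficients
coef(lm.fit.full)["(Intercept)"] + coef(lm.fit.full)["Price"] * 120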
I can reject the null hypothesis H0: βj = 0 for Price and USYes, since both have very small p-values; for UrbanYes, I cannot reject it.
lm.fit.reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit.reduced)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
summary(lm.fit.full)$r.squared
## [1] 0.2392754
summary(lm.fit.reduced)$r.squared
## [1] 0.2392629
We can see from these results that neither model fits the data very well (R-squared is about 0.24 for both). The full model's R-squared is only trivially higher (0.2392754 vs. 0.2392629), and the reduced model actually has the higher adjusted R-squared (0.2354 vs. 0.2335), so dropping Urban costs essentially nothing.
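A partial F-test gives a formal version of the comparison (a sketch using base R's anova(), assuming both fits are still in the workspace):

# A large p-value would mean Urban adds nothing beyond Price and US
anova(lm.fit.reduced, lm.fit.full)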
confint(lm.fit.reduced)  # 95% confidence intervals for the coefficient estimates
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(lm.fit.reduced)  # diagnostic plots for the reduced model
As we can see from these plots, there are some potential outliers in the model from part (e), along with a few observations whose leverage is well above the average (p + 1)/n.
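To back this up numerically, one could check studentized residuals and hat values (a sketch; the |rstudent| > 3 cutoff and the 3 * (p + 1)/n leverage threshold are common rules of thumb):

# Potential outliers: observations with large studentized residuals
which(abs(rstudent(lm.fit.reduced)) > 3)
# High-leverage observations: hat values far above the average (p + 1)/n
p <- length(coef(lm.fit.reduced)) - 1
n <- nrow(Carseats)
which(hatvalues(lm.fit.reduced) > 3 * (p + 1) / n)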
Σ(Xi^2) = Σ(Yi^2)
For regression without an intercept, the coefficient estimate for Y onto X is β_YX = Σ(Xi * Yi) / Σ(Xi^2), while the estimate for X onto Y is β_XY = Σ(Xi * Yi) / Σ(Yi^2). The numerators are identical, so the two coefficient estimates will be the same exactly when the denominators agree, that is, when the sum of the squared X values is equal to the sum of the squared Y values.
set.seed(123)
n <- 100
X <- rnorm(n, mean = 5, sd = 2)
Y <- 2 * X + rnorm(n, mean = 0, sd = 1)  # Y is roughly 2X, so sum(Y^2) != sum(X^2)
beta_YX <- sum(X * Y) / sum(X^2)  # no-intercept slope estimate for Y onto X
print(paste("Beta (Y onto X):", beta_YX))
## [1] "Beta (Y onto X): 1.97864171794108"
beta_XY <- sum(X * Y) / sum(Y^2)  # no-intercept slope estimate for X onto Y
print(paste("Beta (X onto Y):", beta_XY))
## [1] "Beta (X onto Y): 0.501472437357611"
print(paste("Are betas different:", beta_YX != beta_XY))
## [1] "Are betas different: TRUE"
set.seed(456)
n <- 100
Z <- rnorm(n, mean = 0, sd = 1)
X_same <- Z * sqrt(runif(n, 0.5, 2))  # both variables share the same underlying Z
Y_same <- Z * sqrt(runif(n, 0.5, 2))
X_same <- X_same * sqrt(sum(Y_same^2) / sum(X_same^2))  # rescale so sum(X_same^2) == sum(Y_same^2)
beta_YX_same <- sum(X_same * Y_same) / sum(X_same^2)
print(paste("Beta (Y_same onto X_same):", beta_YX_same))
## [1] "Beta (Y_same onto X_same): 0.963116954969162"
beta_XY_same <- sum(X_same * Y_same) / sum(Y_same^2)
print(paste("Beta (X_same onto Y_same):", beta_XY_same))
## [1] "Beta (X_same onto Y_same): 0.963116954969162"
print(paste("Are betas the same:", abs(beta_YX_same - beta_XY_same) < 1e-10))
## [1] "Are betas the same: TRUE"