Assignment 2

Irving Cedillo

Assignment 2

Question 2

Both KNN regression and the KNN classifier are non-parametric methods; the main difference lies in the type of response they predict for a new data point. KNN regression returns a numerical prediction: the average response of the K nearest training points. Because this prediction is an average, it is highly susceptible to extreme outliers when the number of neighbors is small. A KNN classifier instead assigns the new point to a class by majority vote (the mode) among its K nearest neighbors rather than by averaging. Classification also implies a decision boundary between the classes, which shows where future points would be assigned and gives insight into the shape of each group.
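
The contrast between averaging and voting can be sketched in a few lines of base R. This is only an illustration: the training values and the helper knn_predict are hypothetical and are not part of the assignment data.

```r
# Toy one-dimensional training set (hypothetical values, for illustration only)
train_x     <- c(1, 2, 3, 10, 11, 12)
train_y_num <- c(1.0, 1.2, 0.8, 5.0, 5.5, 4.5)   # numeric response (regression)
train_y_cls <- c("a", "a", "a", "b", "b", "b")   # class labels (classification)

# Hypothetical helper: find the k nearest training points to x0, then
# average their numeric responses and take the mode of their class labels
knn_predict <- function(x0, k) {
  nearest <- order(abs(train_x - x0))[seq_len(k)]
  list(
    regression     = mean(train_y_num[nearest]),
    classification = names(which.max(table(train_y_cls[nearest])))
  )
}

knn_predict(2.5, k = 3)   # neighbors are the three points at x = 1, 2, 3
knn_predict(11,  k = 3)   # neighbors are the three points at x = 10, 11, 12
```

The same neighbor set feeds both predictions; only the summary applied to it (mean versus mode) changes.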

library(ISLR2)

library(tidyverse)

Question 9

attach(Auto)

The following object is masked from package:lubridate:

    origin

The following object is masked from package:ggplot2:

    mpg

#a: scatterplot matrix of every variable in Auto
pairs(Auto)

#b: drop the qualitative name variable, then compute the correlation matrix
subset_Auto <- select(Auto, -name)
round(cor(subset_Auto), 2)

               mpg cylinders displacement horsepower weight acceleration  year origin
mpg           1.00     -0.78        -0.81      -0.78  -0.83         0.42  0.58   0.57
cylinders    -0.78      1.00         0.95       0.84   0.90        -0.50 -0.35  -0.57
displacement -0.81      0.95         1.00       0.90   0.93        -0.54 -0.37  -0.61
horsepower   -0.78      0.84         0.90       1.00   0.86        -0.69 -0.42  -0.46
weight       -0.83      0.90         0.93       0.86   1.00        -0.42 -0.31  -0.59
acceleration  0.42     -0.50        -0.54      -0.69  -0.42         1.00  0.29   0.21
year          0.58     -0.35        -0.37      -0.42  -0.31         0.29  1.00   0.18
origin        0.57     -0.57        -0.61      -0.46  -0.59         0.21  0.18   1.00

#c
auto_lm <- lm(mpg~.,data=subset_Auto)
summary(auto_lm)


Call:
lm(formula = mpg ~ ., data = subset_Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Fitting the full model, which regresses mpg on all predictors except name, explains about \(81.8\%\) of the variance in mpg according to the adjusted \(R^2\) (0.8182). The large F-statistic of 252.4 (p-value \(<2.2\times10^{-16}\)) lets us reject the null hypothesis that all slope coefficients are zero, so at least one predictor is related to mpg. The following predictors are significant at the 0.05 level:

displacement: While holding all other variables constant, a one-unit increase in displacement is associated with a 0.020 increase in mpg.
weight: While holding all other variables constant, a one-unit (one-pound) increase in weight is associated with a 0.006 decrease in mpg.
year: While holding all other variables constant, a one-unit increase in year is associated with a 0.75 increase in mpg.
origin: While holding all other variables constant, a one-unit increase in origin is associated with a 1.43 increase in mpg. Since origin is a numerically coded category (1 = American, 2 = European, 3 = Japanese), this reading treats the codes as equally spaced.
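
Because the per-unit effects of weight and year look small on their own scale, a rough worked reading over larger changes may help. The coefficients come from the summary output above; the 1000-pound and 10-year changes are illustrative choices and are not part of the exercise.

\[\Delta\widehat{mpg}_{weight} = 1000 \times (-0.006474) \approx -6.47, \qquad \Delta\widehat{mpg}_{year} = 10 \times 0.750773 \approx 7.51\]

With the other predictors held fixed, a car that is 1000 pounds heavier is predicted to get about 6.5 fewer mpg, and a car built ten model years later about 7.5 more mpg.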

#d
par(mfrow=c(2,2))
plot(auto_lm)

From the diagnostic plots we see several violations of the linear model assumptions. The slight curvature in the residuals vs fitted plot implies that the full model misses a nonlinear relationship. The Q-Q plot suggests that the residuals depart from a normal distribution in the upper tail. The scale-location plot (square root of the absolute standardized residuals) also points to non-constant variance, i.e. heteroscedasticity. Finally, the residuals vs leverage plot flags observation 14 as a potentially influential point: it has high leverage and lies near the Cook’s distance boundary.

library(car)

#e
inter<-lm(mpg~horsepower*weight+weight+horsepower+year,data=subset_Auto)
summary(inter)


Call:
lm(formula = mpg ~ horsepower * weight + weight + horsepower + 
    year, data = subset_Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9146 -1.8987 -0.0386  1.5536 12.6333 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        3.577e+00  3.911e+00   0.915    0.361    
horsepower        -2.236e-01  2.063e-02 -10.837   <2e-16 ***
weight            -1.185e-02  5.868e-04 -20.198   <2e-16 ***
year               7.749e-01  4.508e-02  17.190   <2e-16 ***
horsepower:weight  5.790e-05  5.020e-06  11.534   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.963 on 387 degrees of freedom
Multiple R-squared:  0.8574,    Adjusted R-squared:  0.8559 
F-statistic: 581.5 on 4 and 387 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(inter)

The smaller model mpg~weight*horsepower+year achieves a higher F-statistic (581.5) and adjusted \(R^2\) (\(0.856\)) than the full model while using fewer of the highly correlated predictors (cylinders and displacement are dropped), which was our way of addressing multicollinearity. The interaction term is significant with a p-value of \(<2\times10^{-16}\ll .05\). However, the nonlinear pattern remains in the residuals, and the residuals still depart from normality in the upper tail.
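
One way to read the significant interaction is that the slope for horsepower depends on weight. Using the estimates from the summary of inter (the 3000-pound weight is an illustrative value, not one taken from the exercise):

\[\frac{\partial\,\widehat{mpg}}{\partial\,horsepower} = -0.2236 + 5.790\times10^{-5}\,weight\]

\[weight = 3000:\quad -0.2236 + 5.790\times10^{-5}\times 3000 \approx -0.050\]

So for a 3000-pound car, one additional horsepower is associated with roughly 0.05 fewer mpg, and the penalty shrinks as weight grows. The fitted slope would reach zero near \(0.2236 / (5.790\times10^{-5}) \approx 3862\) pounds.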

#f
log_trans<-lm(log(mpg)~horsepower*weight+weight+horsepower+year,data=subset_Auto)
summary(log_trans)


Call:
lm(formula = log(mpg) ~ horsepower * weight + weight + horsepower + 
    year, data = subset_Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.37317 -0.07025  0.00474  0.06668  0.36732 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.159e+00  1.514e-01  14.259  < 2e-16 ***
horsepower        -6.044e-03  7.985e-04  -7.569 2.78e-13 ***
weight            -4.042e-04  2.271e-05 -17.795  < 2e-16 ***
year               3.040e-02  1.745e-03  17.424  < 2e-16 ***
horsepower:weight  1.369e-06  1.943e-07   7.048 8.38e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1147 on 387 degrees of freedom
Multiple R-squared:  0.8874,    Adjusted R-squared:  0.8862 
F-statistic: 762.4 on 4 and 387 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(log_trans)

sqrt_trans<-lm(sqrt(mpg)~horsepower*weight+weight+horsepower+year,data=subset_Auto)
summary(sqrt_trans)


Call:
lm(formula = sqrt(mpg) ~ horsepower * weight + weight + horsepower + 
    year, data = subset_Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.82552 -0.18225 -0.00418  0.16285  0.99239 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.652e+00  3.739e-01   7.093 6.27e-12 ***
horsepower        -1.886e-02  1.972e-03  -9.561  < 2e-16 ***
weight            -1.099e-03  5.610e-05 -19.590  < 2e-16 ***
year               7.610e-02  4.309e-03  17.658  < 2e-16 ***
horsepower:weight  4.667e-06  4.799e-07   9.726  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2833 on 387 degrees of freedom
Multiple R-squared:  0.8768,    Adjusted R-squared:  0.8756 
F-statistic: 688.9 on 4 and 387 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(sqrt_trans)

sqr_trans<-lm(mpg~horsepower*I(weight^2)+I(weight^2)+horsepower+year,data=subset_Auto)
summary(sqr_trans)


Call:
lm(formula = mpg ~ horsepower * I(weight^2) + I(weight^2) + horsepower + 
    year, data = subset_Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6169 -1.9979 -0.0211  1.7409 12.7008 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -1.002e+01  3.779e+00   -2.65  0.00837 ** 
horsepower             -1.650e-01  1.259e-02  -13.10  < 2e-16 ***
I(weight^2)            -2.092e-06  1.026e-07  -20.39  < 2e-16 ***
year                    7.596e-01  4.593e-02   16.54  < 2e-16 ***
horsepower:I(weight^2)  1.101e-08  7.462e-10   14.75  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.024 on 387 degrees of freedom
Multiple R-squared:  0.8514,    Adjusted R-squared:  0.8499 
F-statistic: 554.3 on 4 and 387 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(sqr_trans)

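Applying a log transformation to the response gave the best fit: log(mpg)~horsepower*weight+year slightly outperformed the \(\sqrt{mpg}\) response model and the model that uses \(weight^2\) as a predictor. This final model addresses the nonlinear pattern and brings the residuals closer to normality. The residual variance looks roughly constant, and observation 14 no longer stands out as influential. The transformed response makes the model harder to interpret, but it has the highest adjusted \(R^2\) so far at \(88.62\%\) (multiple \(R^2\) of \(88.74\%\)). Because the log and square-root models are fit on a different response scale, their \(R^2\) values are only loosely comparable with those of the untransformed models.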
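
Since the response is now log(mpg), coefficients act multiplicatively on mpg. As a short worked example using the year estimate from the summary of log_trans (a sketch of the usual back-transformation, with the other predictors held fixed):

\[\frac{\widehat{mpg}(year+1)}{\widehat{mpg}(year)} = e^{0.0304} \approx 1.031\]

Each additional model year is therefore associated with roughly a \(3.1\%\) increase in predicted mpg, rather than a fixed number of miles per gallon.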

Question 10

attach(Carseats)
#a, b
fit<- lm(Sales~Price+Urban+US, data=Carseats)
summary(fit)


Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

The following provides interpretation of the coefficients of the model above:

Price is statistically significant: While holding all other variables constant, a one-unit increase in price is associated with a decrease in sales of about 54.5 units (Sales is recorded in thousands of units).
- With a p-value of \(<2\times10^{-16}\ll .05\) we reject the null hypothesis \(H_0:\beta_{Price}=0\) and conclude that \(\beta_{Price}\neq0\)
UrbanYes is not statistically significant: While holding all other variables constant, there is no statistical evidence of a difference in car seat sales between urban and rural locations.
- With a p-value of \(.936>.05\) we fail to reject the null hypothesis \(H_0:\beta_{Urban}= 0\)
USYes is statistically significant: While holding all other variables constant, a store located in the United States is associated with about 1,200 more units sold than the baseline USNo, a store outside of the United States.
- With a p-value of \(4.86\times10^{-6}<.05\) we reject the null hypothesis \(H_0:\beta_{US}=0\) and conclude that \(\beta_{US}\neq 0\)
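
Written in equation form, with the qualitative predictors entering as 0/1 indicators and the estimates taken from the summary of fit, the model is approximately:

\[\widehat{Sales} = 13.043 - 0.0545\,Price - 0.0219\,\mathbf{1}(Urban = Yes) + 1.2006\,\mathbf{1}(US = Yes)\]

As an illustrative check (the price of 100 is a made-up value), an urban store in the United States would have predicted sales of

\[13.043 - 0.0545\times100 - 0.0219 + 1.2006 \approx 8.78\]

thousand units.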

#e, f
smaller_fit<- lm(Sales~Price+US, data=Carseats)
summary(smaller_fit)


Call:
lm(formula = Sales ~ Price + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

anova(fit, smaller_fit)

Analysis of Variance Table

Model 1: Sales ~ Price + Urban + US
Model 2: Sales ~ Price + US
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    396 2420.8                           
2    397 2420.9 -1  -0.03979 0.0065 0.9357

From the partial F-test of the two models, the F-statistic of 0.0065 yields a p-value of \(.9357>.05\), meaning that we fail to reject the null hypothesis that the coefficient of Urban is zero. Dropping Urban therefore does not significantly change the RSS (2420.8 vs 2420.9). Following the principle of parsimony, we select the simpler reduced model.
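
The F-statistic in the ANOVA table can be reproduced by hand from the quantities it reports, where \(q = 1\) is the number of dropped coefficients and 396 is the residual degrees of freedom of the full model:

\[F = \frac{(RSS_{reduced} - RSS_{full})/q}{RSS_{full}/df_{full}} = \frac{0.03979/1}{2420.8/396} \approx \frac{0.03979}{6.113} \approx 0.0065\]

With a single dropped coefficient this is, up to rounding, the square of the Urban t-statistic from the full model, \((-0.081)^2 \approx 0.0066\), which is why both tests report the same p-value.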

confint(smaller_fit)

                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632

summary(influence.measures(smaller_fit))

Potentially influential observations of
     lm(formula = Sales ~ Price + US, data = Carseats) :

    dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00

The table above flags 6 observations in the hat column (43, 126, 166, 175, 314 and 368) as high leverage, meaning that the predictor values for those observations are unusually far from the mean of the predictors. None of the observations is flagged in the cook.d column (the largest value is 0.03), so no single point has a large influence on the fitted model.
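
The starred entries appear to follow the default cutoffs of influence.measures. Assuming those defaults, with \(p = 3\) estimated coefficients and \(n = 400\) observations, the thresholds work out to:

\[hat > \frac{3p}{n} = \frac{9}{400} = 0.0225, \qquad |dffit| > 3\sqrt{\frac{p}{n-p}} = 3\sqrt{\frac{3}{397}} \approx 0.261, \qquad |1 - cov.r| > \frac{3p}{n-p} = \frac{9}{397} \approx 0.0227\]

This is consistent with the table, where some hat values printed as 0.02 are starred and others are not: the printed values are rounded, and only those above 0.0225 are flagged.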

Question 12

1. Without an intercept, the coefficient from regressing Y onto X equals the coefficient from regressing X onto Y when X and Y have the same sum of squares, for example when X and Y are identical.

Define the coefficient estimate for the regression of \(Y\) onto \(X\) (no intercept) as: \[\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\]

Next define the coefficient estimate for the regression of \(X\) onto \(Y\) (no intercept) as: \[\hat{\alpha} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n y_i^2}\]

Setting \(\hat{\beta} = \hat{\alpha}\) and assuming \(\sum_{i=1}^n x_i y_i \neq 0\) (otherwise both estimates are zero and trivially equal): \[\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n y_i^2} \implies \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 \quad \text{Q.E.D.}\]

#b
set.seed(42)
x<-rnorm(100,mean=0, sd=1)
y<-rnorm(100,mean=0, sd=13)

fit_Y_on_X<-lm(y~x)
fit_X_on_Y<-lm(x~y)

coef_Y_on_X<-coef(fit_Y_on_X)
coef_X_on_Y<-coef(fit_X_on_Y)

cat("Coeff of Y onto X:", round(coef_Y_on_X,2))

Coeff of Y onto X: -1.15 0.35

cat("\n")

cat("Coeff of X onto Y:", round(coef_X_on_Y,2))

Coeff of X onto Y: 0.04 0

#c
y=x
fit_Y_on_X<-lm(y~x)
fit_X_on_Y<-lm(x~y)

coef_Y_on_X<-coef(fit_Y_on_X)
coef_X_on_Y<-coef(fit_X_on_Y)

cat("Coeff of Y onto X:", round(coef_Y_on_X,2))

Coeff of Y onto X: 0 1

cat("\n")

cat("Coeff of X onto Y:", round(coef_X_on_Y,2))

Coeff of X onto Y: 0 1
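
The derivation in part 1 is for regression without an intercept, whereas the lm() calls above include one (hence the two numbers printed per fit). A minimal sketch of the no-intercept version follows, using the `+ 0` formula term to drop the intercept. The seed, the simulated relationship and the variable names are assumptions made for this illustration, and no output is shown because it was not run as part of the original analysis.

```r
# Regression through the origin: the slope should equal sum(x*y) / sum(x^2)
set.seed(1)                       # arbitrary seed chosen for this sketch
x <- rnorm(100)
y <- 2 * x + rnorm(100)           # hypothetical relationship between x and y

slope_y_on_x <- unname(coef(lm(y ~ x + 0)))
slope_x_on_y <- unname(coef(lm(x ~ y + 0)))

# Compare each fitted slope with its closed-form expression
all.equal(slope_y_on_x, sum(x * y) / sum(x^2))
all.equal(slope_x_on_y, sum(x * y) / sum(y^2))

# A permutation of x has the same sum of squares as x,
# so the two no-intercept slopes coincide without y being equal to x
y_perm <- sample(x)
slope_perm_y_on_x <- unname(coef(lm(y_perm ~ x + 0)))
slope_perm_x_on_y <- unname(coef(lm(x ~ y_perm + 0)))
all.equal(slope_perm_y_on_x, slope_perm_x_on_y)
```

The permutation case gives a less trivial example for part c than y = x, since the two variables differ observation by observation yet share the same sum of squares.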