Question 2

Explain the difference between KNN Classifier & KNN Regression methods.

The KNN Classifier is used for classification problems (problems with qualitative outputs), while KNN Regression is used for regression problems (problems with quantitative outputs). Given a test point x0, the KNN Classifier predicts the most common class among the K nearest neighbors of x0, while KNN Regression estimates f(x0) as the average of the responses of the K nearest neighbors.
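
A minimal sketch contrasting the two on toy data (assuming the class and FNN packages are installed; neither is used elsewhere in this document):

set.seed(1)
x_train <- matrix(rnorm(100), ncol = 2)   # 50 training points
x_test  <- matrix(rnorm(10),  ncol = 2)   # 5 test points

# Classification: predict the majority class among the K nearest neighbors
y_class <- factor(sample(c("A", "B"), 50, replace = TRUE))
class::knn(train = x_train, test = x_test, cl = y_class, k = 5)

# Regression: predict the average response of the K nearest neighbors
y_num <- rnorm(50)
FNN::knn.reg(train = x_train, test = x_test, y = y_num, k = 5)$pred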

Question 9

Use the Auto data set for multiple linear regression.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
data(Auto)

Part A) Scatterplot matrix of all variables.

plot(Auto)

Part B) Matrix of Correlations (excluding ‘name’ variable).

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
cor(Auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Part C) Multiple Linear Regression (response = ‘mpg’ & predictors = all other variables except ‘name’)

Auto_lm <- lm(mpg ~ ., data = Auto[, 1:8])
summary(Auto_lm)
## 
## Call:
## lm(formula = mpg ~ ., data = Auto[, 1:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Relationship between the predictors and response?

At least some of the predictors have a relationship with ‘mpg’ (the response). This is indicated by the large F-statistic (252.4), whose p-value (< 2.2e-16) is essentially zero.

Predictors that appear to have a statistically significant relationship to the response?

‘Displacement’ - with a 1 unit increase, ‘mpg’ increases 0.019896 units (all others constant)
‘Weight’ - with a 1 unit increase, ‘mpg’ decreases 0.006474 units (all others constant)
‘Year’ - with a 1 unit increase, ‘mpg’ increases 0.750773 units (all others constant)
‘Origin’ - with a 1 unit increase, ‘mpg’ increases 1.426141 units (all others constant)

‘Cylinders’, ‘horsepower’, and ‘acceleration’ show no significant relationship to ‘mpg’, as indicated by their higher p-values.

Coefficient for ‘year’ suggests?

The ‘year’ coefficient of 0.750773 indicates that, on average, a car one model year newer gets about 0.75 more miles per gallon, all other predictors held constant.
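
As a quick numeric check of the ‘year’ interpretation, one can compare predictions for two hypothetical cars that are identical except for model year (the predictor values below are made up for illustration):

# Two hypothetical cars, identical except for model year
new_cars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                       weight = 2500, acceleration = 15, year = c(76, 77),
                       origin = 1)
diff(predict(Auto_lm, newdata = new_cars))   # equals the 'year' coefficient, ~0.75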

Part D) Diagnostic Plots of Linear Regression Fit

par(mfrow = c(2,2))
plot(Auto_lm)

Comments on Problems

The Residuals vs. Fitted plot shows some non-linearity in the data. The Residuals vs. Leverage plot also shows that there are some outliers in the data.
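
One way to flag the high-leverage points visible in the Residuals vs. Leverage panel numerically (a sketch, using the common rule-of-thumb cutoff of twice the average leverage):

lev <- hatvalues(Auto_lm)
head(sort(lev, decreasing = TRUE))   # largest leverage values
which(lev > 2 * mean(lev))           # flag points above twice the average leverage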

Part E) Linear Regression Models with Interaction Effects

Auto_lm_2 <- lm(mpg ~ cylinders * displacement + horsepower * weight + acceleration * year, data = Auto[, 1:8])
summary(Auto_lm_2)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + horsepower * weight + 
##     acceleration * year, data = Auto[, 1:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3265 -1.5779  0.0389  1.3483 11.6961 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.162e+02  1.853e+01   6.274 9.53e-10 ***
## cylinders              -1.803e-01  4.776e-01  -0.377   0.7061    
## displacement           -2.867e-02  1.425e-02  -2.013   0.0449 *  
## horsepower             -2.261e-01  2.609e-02  -8.664  < 2e-16 ***
## weight                 -1.019e-02  9.020e-04 -11.296  < 2e-16 ***
## acceleration           -7.081e+00  1.158e+00  -6.113 2.41e-09 ***
## year                   -6.719e-01  2.417e-01  -2.780   0.0057 ** 
## cylinders:displacement  2.790e-03  2.067e-03   1.350   0.1779    
## horsepower:weight       5.154e-05  6.727e-06   7.661 1.53e-13 ***
## acceleration:year       9.113e-02  1.502e-02   6.069 3.10e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.819 on 382 degrees of freedom
## Multiple R-squared:  0.8726, Adjusted R-squared:  0.8696 
## F-statistic: 290.6 on 9 and 382 DF,  p-value: < 2.2e-16

Statistically Significant Interactions

The interactions between ‘horsepower’ and ‘weight’ (p = 1.53e-13) and between ‘acceleration’ and ‘year’ (p = 3.10e-09) are statistically significant. However, the interaction between ‘cylinders’ and ‘displacement’ (p = 0.1779) is not statistically significant.
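
To check that the interaction terms improve the fit as a group, one could run a partial F-test against the same main effects without interactions (a sketch; Auto_lm_main is introduced here only for this comparison, and no output is shown):

# The six main effects alone, for a nested comparison with Auto_lm_2
Auto_lm_main <- lm(mpg ~ cylinders + displacement + horsepower + weight +
                     acceleration + year, data = Auto[, 1:8])
anova(Auto_lm_main, Auto_lm_2)   # partial F-test for the three interaction terms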

Part F) Additional Findings

Graph 1: Horsepower

par(mfrow = c(2,2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)

Graph 2: Acceleration

par(mfrow = c(2,2))
plot(log(Auto$acceleration), Auto$mpg)
plot(sqrt(Auto$acceleration), Auto$mpg)
plot((Auto$acceleration)^2, Auto$mpg)

Graph 3: Cylinders

par(mfrow = c(2,2))
plot(log(Auto$cylinders), Auto$mpg)
plot(sqrt(Auto$cylinders), Auto$mpg)
plot((Auto$cylinders)^2, Auto$mpg)

Comments on Findings

Among the variables that did not show a significant relationship with ‘mpg’ in the original fit (‘horsepower’, ‘acceleration’, and ‘cylinders’), the log transformation of ‘horsepower’ appears closest to a linear relationship out of the variations tried.
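
One way to back up that visual impression numerically is to compare the R-squared values of simple regressions on each transformation (a sketch; no output shown):

# R-squared of mpg against each transformation of horsepower
summary(lm(mpg ~ log(horsepower),  data = Auto))$r.squared
summary(lm(mpg ~ sqrt(horsepower), data = Auto))$r.squared
summary(lm(mpg ~ I(horsepower^2),  data = Auto))$r.squared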

Question 10

Use the Carseats data set

data(Carseats)

Part A) Multiple Linear Regression Model to predict ‘Sales’ using ‘Price’, ‘Urban’, & ‘US’

Carseats_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(Carseats_lm)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Part B) Interpretation of Variables

‘Price’ - with a $1 increase in price, ‘Sales’ decreases by about 54 units sold (0.054459 thousand units), all others constant.
‘US’ - a US store sells about 1,201 more units (1.200573 thousand units) than a non-US store, all others constant.
‘Urban’ - shows no significant relationship to ‘Sales’, given its large p-value (0.936).

Note that ‘Sales’ in the Carseats data is recorded in thousands of units, not dollars.
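
R encodes the two factors as 0/1 treatment dummies, with ‘No’ as the baseline level; the coding can be inspected directly:

contrasts(Carseats$Urban)   # column 'Yes': 1 for Yes, 0 for No
contrasts(Carseats$US)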

Part C) Model in Equation Form

Sales = 13.043469 - (0.054459 * Price) - (0.021916 * (Urban = yes)) + (1.200573 * (US = yes))

Where (Urban = yes) = 1 for Urban and 0 for Not Urban & (US = yes) = 1 for a US Store and 0 for a Non-US Store
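
As a sanity check, the equation can be evaluated by hand and compared with predict() for a hypothetical store (the values below are made up):

# Hypothetical urban, non-US store with Price = 100
predict(Carseats_lm, newdata = data.frame(Price = 100, Urban = "Yes", US = "No"))
13.043469 - (0.054459 * 100) - (0.021916 * 1) + (1.200573 * 0)   # same value, ~7.576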

Part D) For which predictors can we reject the null?

The null hypothesis can be rejected for ‘Price’ and ‘US’, since both have p-values far below 0.05. It cannot be rejected for ‘Urban’.

Part E) Smaller model with just ‘Price’ & ‘US’

Carseats_lm_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(Carseats_lm_2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Part F) How well do the Models (Carseats_lm & Carseats_lm_2) fit the data?

Both models explain only about 24% of the variability in ‘Sales’ (R-squared ≈ 0.239), so neither fits the data particularly well. That said, Carseats_lm_2 is slightly preferable: with one fewer predictor it keeps the same R-squared, with a marginally higher adjusted R-squared (0.2354 vs. 0.2335) and a marginally lower residual standard error (2.469 vs. 2.472).
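
Since Carseats_lm_2 is nested in Carseats_lm, a partial F-test can confirm that dropping ‘Urban’ loses essentially nothing (a sketch; no output shown):

anova(Carseats_lm_2, Carseats_lm)   # tests whether adding 'Urban' improves the fit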

Part G) Use Carseats_lm_2 to obtain 95% confidence intervals for the coefficients

confint(Carseats_lm_2, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Part H) Evidence of Outliers or High Leverage Observations?

Diagnostic Plots

par(mfrow = c(2,2))
plot(Carseats_lm_2)

Observations

The Residuals vs. Fitted plot shows no strong pattern, so the linearity assumption appears reasonable. However, the Residuals vs. Leverage plot shows some outliers, and a few observations have leverage above 0.01, well beyond the average leverage of (p + 1)/n = 3/400 = 0.0075, indicating high-leverage points.

Question 12

Part A) When is the coefficient estimate for regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The two coefficient estimates are the same when the sum of X^2 is equal to the sum of Y^2. Without an intercept, the coefficient estimate for regressing Y onto X is sum(x*y) / sum(x^2), while the estimate for regressing X onto Y is sum(x*y) / sum(y^2). The numerators are identical, so the two estimates are equal exactly when sum(x^2) = sum(y^2).
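
A quick numerical check of those formulas against lm(), using arbitrary data:

set.seed(2)                    # arbitrary data for illustration
x <- rnorm(10); y <- rnorm(10)
sum(x * y) / sum(x^2)          # coefficient for y ~ x + 0
coef(lm(y ~ x + 0))            # matches the formula
sum(x * y) / sum(y^2)          # coefficient for x ~ y + 0
coef(lm(x ~ y + 0))            # equal to the above only if sum(x^2) == sum(y^2)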

Part B) The two coefficient estimates are DIFFERENT with n = 100

set.seed(1)
X <- 1:100
sum(X^2)
## [1] 338350
Y <- X * -153
sum(Y^2)
## [1] 7920435150
fit.X <- lm(Y ~ X + 0)   # regression of Y onto X (no intercept)
fit.Y <- lm(X ~ Y + 0)   # regression of X onto Y (no intercept)
summary(fit.Y)
## Warning in summary.lm(fit.Y): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = X ~ Y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.493e-13 -1.826e-15  4.100e-17  1.549e-15  1.140e-14 
## 
## Coefficients:
##     Estimate Std. Error    t value Pr(>|t|)    
## Y -6.536e-03  2.846e-19 -2.297e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.532e-14 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.276e+32 on 1 and 99 DF,  p-value: < 2.2e-16
summary(fit.X)
## Warning in summary.lm(fit.X): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = Y ~ X + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.926e-12 -2.460e-13 -1.000e-15  2.110e-13  3.282e-11 
## 
## Coefficients:
##     Estimate Std. Error    t value Pr(>|t|)    
## X -1.530e+02  5.745e-15 -2.663e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.342e-12 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.093e+32 on 1 and 99 DF,  p-value: < 2.2e-16

Part C) The two coefficient estimates are the SAME with n = 100 (using A for X and B for Y, so the Part B variables are left untouched)

A <- 1:100
sum(A^2)
## [1] 338350
B <- 100:1
sum(B^2)
## [1] 338350
fit.A <- lm(A ~ B + 0)   # regression of A onto B (no intercept)
fit.B <- lm(B ~ A + 0)   # regression of B onto A (no intercept)
summary(fit.B)
## 
## Call:
## lm(formula = B ~ A + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## A   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
summary(fit.A)
## 
## Call:
## lm(formula = A ~ B + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## B   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08