Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier predicts a qualitative (categorical) response: it assigns an observation to the class that occurs most frequently among its K nearest neighbors. KNN regression predicts a quantitative (continuous) response: it estimates the value as the average of the responses of the K nearest neighbors.
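A minimal sketch of the contrast, assuming the CRAN packages class (for classification) and FNN (for regression) are installed; the simulated data here are purely illustrative:
set.seed(1)
train_x <- matrix(rnorm(200), ncol = 2)
test_x  <- matrix(rnorm(20),  ncol = 2)
# Classifier: predict the majority class among the K = 5 nearest neighbors
train_cl <- factor(ifelse(train_x[, 1] + train_x[, 2] > 0, "A", "B"))
class::knn(train_x, test_x, cl = train_cl, k = 5)
# Regression: predict the average response of the K = 5 nearest neighbors
train_y <- train_x[, 1] + rnorm(100)
FNN::knn.reg(train_x, test_x, y = train_y, k = 5)$pred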
This question involves the use of multiple linear regression on the Auto data set.
First we will import the Auto dataset and repeat preprocessing done in assignment 1:
auto <- read.table('auto.data', header = T)
# head(auto)
# sum(is.na(auto)) # no outright missing values
# unique(auto$horsepower) # there is a "?" which is a missing value; we will remove these rows
library(dplyr)
auto <- auto %>%
  filter(!grepl("\\?", horsepower)) # drop the rows where horsepower is "?"
# unique(auto$horsepower)
Now we can plot a scatterplot matrix of the variables, excluding name, along with a correlation matrix:
auto$horsepower <- as.numeric(auto$horsepower)
pairs(auto[,-9])
library(corrplot) # needed for corrplot(); assumed installed
auto_cor <- cor(auto[,-9])
corrplot(auto_cor)
auto_noname <- auto[,-9] # for convenience
full_model <- lm(mpg~., data=auto_noname)
summary(full_model)
##
## Call:
## lm(formula = mpg ~ ., data = auto_noname)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Yes, there appears to be a relationship between the predictors and the response: the overall F-statistic of 252.4 has a p-value below 2.2e-16, so we can reject the null hypothesis that none of the predictors is related to mpg.
Using a significance level of 0.05, the statistically significant predictors are displacement, weight, year, and origin.
The coefficient for year is about 0.75, meaning that, holding all other variables constant, each additional model year is associated with an increase of roughly 0.75 mpg.
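To make the "holding all else constant" interpretation concrete, here is a quick sketch; the predictor values are arbitrary, chosen only for illustration. Two otherwise identical cars one model year apart differ in predicted mpg by exactly the year coefficient:
# Two hypothetical cars identical except for a one-unit difference in year
newdata <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77),
                      origin = 1)
diff(predict(full_model, newdata)) # equals the year coefficient, 0.7508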
par(mfrow = c(2,2))
plot(full_model)
Looking at the residuals vs. fitted plot, we see some evidence of a pattern in the residuals, suggesting a non-linear relationship may be present. We can also see in the residuals vs. leverage plot that point 14 has high leverage, which could disproportionately influence the model fit.
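The leverage reading can also be checked numerically. A sketch using the common rule of thumb that leverage above 2(p+1)/n deserves attention:
hv <- hatvalues(full_model)
head(sort(hv, decreasing = TRUE), 3) # largest leverage values
which(hv > 2 * length(coef(full_model)) / nrow(auto_noname)) # rule-of-thumb cutoff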
interactions_model <- lm(mpg~.*., data=auto_noname)
summary(interactions_model)
##
## Call:
## lm(formula = mpg ~ . * ., data = auto_noname)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
If we look at a model containing all terms and their pairwise interactions, a few are statistically significant at the 0.05 level: displacement, acceleration, origin, displacement:year, acceleration:year, and acceleration:origin.
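Since individual t-tests can be misleading with this many correlated interaction terms, one way to check whether the interactions improve the fit as a group is a nested-model F-test:
# Compare the main-effects model against the model with all pairwise interactions
anova(full_model, interactions_model)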
par(mfrow=c(2,2))
auto_log <- log(auto_noname)
log_model <- lm(mpg~., data=auto_log)
summary(log_model)
##
## Call:
## lm(formula = mpg ~ ., data = auto_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41298 -0.07098 0.00055 0.06150 0.39532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.155391 0.648230 -0.240 0.81068
## cylinders -0.082815 0.061429 -1.348 0.17841
## displacement 0.006625 0.056970 0.116 0.90748
## horsepower -0.294389 0.057652 -5.106 5.18e-07 ***
## weight -0.569666 0.082397 -6.914 1.98e-11 ***
## acceleration -0.179239 0.059536 -3.011 0.00278 **
## year 2.243989 0.131661 17.044 < 2e-16 ***
## origin 0.044848 0.018821 2.383 0.01767 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1136 on 384 degrees of freedom
## Multiple R-squared: 0.8903, Adjusted R-squared: 0.8883
## F-statistic: 445.3 on 7 and 384 DF, p-value: < 2.2e-16
plot(log_model, main = "Log Model")
auto_sqrt <- sqrt(auto_noname)
sqrt_model <- lm(mpg~., data=auto_sqrt)
summary(sqrt_model)
##
## Call:
## lm(formula = mpg ~ ., data = auto_sqrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98667 -0.17280 -0.00315 0.16145 1.02245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.949286 0.847481 -2.300 0.021979 *
## cylinders -0.108552 0.141968 -0.765 0.444964
## displacement 0.019707 0.021182 0.930 0.352752
## horsepower -0.090896 0.028428 -3.197 0.001502 **
## weight -0.061414 0.007292 -8.422 7.48e-16 ***
## acceleration -0.107258 0.077048 -1.392 0.164699
## year 1.266015 0.079308 15.963 < 2e-16 ***
## origin 0.272324 0.070883 3.842 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2964 on 384 degrees of freedom
## Multiple R-squared: 0.8662, Adjusted R-squared: 0.8638
## F-statistic: 355.1 on 7 and 384 DF, p-value: < 2.2e-16
plot(sqrt_model, main ='Square Root Model')
auto_sq <- auto_noname^2
sq_model <- lm(mpg~., data=auto_sq)
summary(sq_model)
##
## Call:
## lm(formula = mpg ~ ., data = auto_sq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -501.89 -145.36 -18.91 111.41 1034.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.523e+02 1.456e+02 -5.165 3.87e-07 ***
## cylinders -3.746e+00 1.559e+00 -2.403 0.016713 *
## displacement 3.356e-03 8.547e-04 3.926 0.000102 ***
## horsepower 1.279e-04 3.076e-03 0.042 0.966851
## weight -4.833e-05 5.551e-06 -8.707 < 2e-16 ***
## acceleration 4.892e-01 1.663e-01 2.941 0.003474 **
## year 2.731e-01 2.183e-02 12.513 < 2e-16 ***
## origin 2.608e+01 4.275e+00 6.101 2.57e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 218.8 on 384 degrees of freedom
## Multiple R-squared: 0.7055, Adjusted R-squared: 0.7001
## F-statistic: 131.4 on 7 and 384 DF, p-value: < 2.2e-16
plot(sq_model, main = 'Square Model')
Applying each of these transformations to the data set shows their effects. Recall that the original adjusted R^2 was about 0.82; log(x) improves it to about 0.89 and sqrt(x) to about 0.86, while x^2 hurts performance, dropping it to about 0.70.
Reviewing the diagnostic plots of each transformation, log(x) appears to remove the pattern found in the base model's residuals, and sqrt(x) does so to a lesser degree, while x^2 worsens the trend. None of these transformations appears to address the high-leverage point, however.
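The adjusted R^2 comparison above can be collected in one place; a small sketch using the models already fit:
# Adjusted R^2 for the base model and each transformed fit, side by side
sapply(list(base = full_model, log = log_model, sqrt = sqrt_model,
            square = sq_model),
       function(m) summary(m)$adj.r.squared)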
This question should be answered using the Carseats data set.
data("Carseats")
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
model <- lm(Sales ~ Price + Urban + US, data=Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Since Sales is recorded in thousands of units, the coefficients translate as follows. For each one-dollar increase in Price, sales decrease by about 54 units, holding every other variable constant. A store in an urban location sells about 22 fewer units than one in a rural location, all else held constant (though this effect is not statistically significant). A store in the US sells about 1,200 more units than one outside the US, all other variables held constant.
If we treat ‘Yes’ as 1 and ‘No’ as 0 for the categorical variables Urban and US, we can use the following equation:
\[ \text{Sales} = 13.043 - 0.054 \cdot \text{Price} - 0.022 \cdot \text{Urban} + 1.200 \cdot \text{US}\]
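The 0/1 coding assumed in this equation is exactly the dummy coding R uses for these factors, which can be confirmed with contrasts():
# Confirm that R codes Yes = 1, No = 0 for the two qualitative predictors
contrasts(Carseats$Urban)
contrasts(Carseats$US)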
Using a significance level of 0.05, we can reject the null hypothesis that there is no relationship between the predictor and the response for Price and US.
smaller_model <- lm(Sales ~ Price + US, data=Carseats)
summary(smaller_model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
We can compare the models using R^2 and adjusted R^2. Both have an R^2 of 0.2393, meaning each accounts for 23.93% of the variance in Sales, but the smaller model from part (e) has a slightly higher adjusted R^2 (0.2354 vs. 0.2335). The two models fit the data essentially equally well, so the smaller model is preferable: it achieves the same fit with fewer variables.
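Because the model in (e) is nested within the model from (a), the comparison can also be made formally with an F-test; given Urban's t-statistic near zero, we would expect this test not to favor the larger model:
anova(smaller_model, model) # tests whether adding Urban improves the fit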
confint(smaller_model, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
We can identify outliers using Cook's distance and high-leverage points using the residuals vs. leverage plot, both of which are among the standard diagnostic plots.
# Cook's distance to identify outliers
n <- nrow(Carseats)
plot(smaller_model, which = 4)
abline(h = 4/n, lty = 2, col = "red") # threshold for outliers
# residuals vs. leverage plot:
plot(smaller_model, which = 5)
From here we can see several outliers by Cook's distance, above the 4/n threshold. There also appear to be a few high-leverage points in the second plot. One particular point of concern is observation 368, which has a Cook's distance above the threshold and relatively high leverage.
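These visual reads can be backed up numerically. A sketch using studentized residuals (|r| > 3 as a loose outlier flag, following ISLR's convention) and the 2(p+1)/n leverage cutoff:
rs <- rstudent(smaller_model)
which(abs(rs) > 3) # candidate outliers
hv <- hatvalues(smaller_model)
which(hv > 2 * length(coef(smaller_model)) / n) # high-leverage points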
This problem involves simple linear regression without an intercept.
For regression without an intercept, the coefficient estimate for the regression of \(Y\) onto \(X\) is
\[ \hat{\beta} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2}, \]
while the estimate for the regression of \(X\) onto \(Y\) is
\[ \hat{\beta}' = \frac{\sum_{i} x_i y_i}{\sum_{i} y_i^2}. \]
The two estimates are therefore equal exactly when \(\sum_i x_i^2 = \sum_i y_i^2\); for example, when \(y\) is a permutation of \(x\), with \(y = x\) being the simplest case.
set.seed(123)
x <- rnorm(100)
y <- 5 * x + rnorm(100)
diff_coef_model <- lm(y~x)
summary(diff_coef_model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9073 -0.6835 -0.0875 0.5806 3.2904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.10280 0.09755 -1.054 0.295
## x 4.94753 0.10688 46.291 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9707 on 98 degrees of freedom
## Multiple R-squared: 0.9563, Adjusted R-squared: 0.9558
## F-statistic: 2143 on 1 and 98 DF, p-value: < 2.2e-16
x <- rnorm(100)
y <- 1 * x + rnorm(100)
same_coef_model <- lm(y~x)
summary(same_coef_model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4412 -0.6537 0.0080 0.7412 2.5266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03031 0.10515 -0.288 0.774
## x 0.95094 0.11036 8.617 1.2e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.043 on 98 degrees of freedom
## Multiple R-squared: 0.4311, Adjusted R-squared: 0.4253
## F-statistic: 74.25 on 1 and 98 DF, p-value: 1.204e-13
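Note that since the question concerns regression without an intercept, the fits should strictly use the y ~ x + 0 form. A short sketch fitting both directions with no intercept: in the first construction the two estimates differ, while permuting x with sample() forces \(\sum_i x_i^2 = \sum_i y_i^2\), so the two estimates agree exactly.
set.seed(123)
x <- rnorm(100)
y <- 5 * x + rnorm(100)
coef(lm(y ~ x + 0)) # roughly 5
coef(lm(x ~ y + 0)) # roughly 0.19; the two estimates differ
# Reuse the same values in a different order, so sum(x^2) == sum(y^2) exactly
y <- sample(x)
coef(lm(y ~ x + 0))
coef(lm(x ~ y + 0)) # identical to the previous estimate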