Assignment 2

Question 2: Carefully explain the differences between KNN classifier and KNN regression methods.

KNN regression predicts a continuous (quantitative) response, while a KNN classifier predicts a categorical (qualitative) response. Both methods find the k training observations closest to the query point, but KNN regression predicts the mean of the neighbors' response values, whereas the KNN classifier predicts the class that wins the majority vote among the neighbors.
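
A minimal sketch on made-up data (not part of the assignment) illustrating the difference: both methods use the same k nearest neighbors, but the regression averages their responses while the classifier takes a majority vote.

set.seed(1)
x_train <- runif(20)
y_num   <- 3 * x_train + rnorm(20, sd = 0.2)        # continuous response
y_cat   <- factor(ifelse(x_train > 0.5, "A", "B"))  # categorical response

knn_predict <- function(x0, k = 3) {
  nn <- order(abs(x_train - x0))[1:k]                # indices of the k nearest neighbors of x0
  list(regression     = mean(y_num[nn]),             # KNN regression: mean of neighbor responses
       classification = names(which.max(table(y_cat[nn]))))  # KNN classifier: majority vote
}
knn_predict(0.7)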

Question 9a. Produce a scatterplot matrix which includes all of the variables in the data set.

#load data and libraries
library(car)
library(ggplot2)
library(corrplot)
library(dplyr)
library(tidyr)
library(MASS)
library(broom)
library(GGally)
# Question 9
auto <- read.csv("C:/Users/lclha/Documents/MSDA Program 2024-2026/Summer 2025/Predictive Modeling/Datasets/Auto.csv", stringsAsFactors = TRUE)
auto$origin <- as.numeric(auto$origin)
# horsepower contains "?" entries and is read as a factor; convert via
# as.character so those entries become NA (and are removed by drop_na() below)
auto$horsepower <- as.numeric(as.character(auto$horsepower))
auto <- auto %>% tidyr::drop_na() %>% dplyr::select(-name)
pairs(auto)
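
Since GGally is loaded above, ggpairs() could be used as an alternative that adds correlations and density plots to the scatterplot matrix (a possible variation, not required by the question):

# Alternative scatterplot matrix using GGally
GGally::ggpairs(auto)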

Question 9b. Compute the matrix of correlations between the variables using the DataFrame.corr() method.

corr_matrix <- cor(auto)
corrplot(corr_matrix, method = "color", type = "upper", tl.col = "black")
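
To also report the numeric correlations alongside the plot, the matrix can simply be printed, rounded for readability:

# Numeric correlation matrix, rounded to two decimals
round(corr_matrix, 2)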

Question 9c. Use the sm.OLS() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summarize() function to print the results. Comment on the output. For instance:

model <- lm(mpg ~ ., data = auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ ., data = auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.629 -2.034 -0.046  1.801 13.010 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.128e+01  4.259e+00  -4.998 8.78e-07 ***
## cylinders    -2.927e-01  3.382e-01  -0.865   0.3874    
## displacement  1.603e-02  7.284e-03   2.201   0.0283 *  
## horsepower    7.942e-03  6.809e-03   1.166   0.2442    
## weight       -6.870e-03  5.799e-04 -11.846  < 2e-16 ***
## acceleration  1.539e-01  7.750e-02   1.986   0.0477 *  
## year          7.734e-01  4.939e-02  15.661  < 2e-16 ***
## origin        1.346e+00  2.691e-01   5.004 8.52e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.331 on 389 degrees of freedom
## Multiple R-squared:  0.822,  Adjusted R-squared:  0.8188 
## F-statistic: 256.7 on 7 and 389 DF,  p-value: < 2.2e-16
anova(model)
## Analysis of Variance Table
## 
## Response: mpg
##               Df  Sum Sq Mean Sq   F value    Pr(>F)    
## cylinders      1 14613.9 14613.9 1317.0129 < 2.2e-16 ***
## displacement   1  1112.8  1112.8  100.2818 < 2.2e-16 ***
## horsepower     1    21.4    21.4    1.9290   0.16566    
## weight         1  1190.6  1190.6  107.2945 < 2.2e-16 ***
## acceleration   1    59.9    59.9    5.3969   0.02069 *  
## year           1  2659.4  2659.4  239.6702 < 2.2e-16 ***
## origin         1   277.8   277.8   25.0398 8.523e-07 ***
## Residuals    389  4316.5    11.1                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(model)["year"]
##      year 
## 0.7734417

Question 9c i. Is there a relationship between the predictors and the response? Use the anova_lm() function from statsmodels to answer this question.

Yes. The overall F-statistic of 256.7 with a p-value < 2.2e-16 indicates a relationship between the predictors and mpg; in the sequential ANOVA table, every predictor is significant except horsepower.

Question 9c ii. Which predictors appear to have a statistically significant relationship to the response?

Based on the t-tests in the summary output, displacement, weight, acceleration, year, and origin have a statistically significant relationship to the response (p < 0.05), while cylinders and horsepower do not.

Question 9c iii. What does the coefficient for the year variable suggest?

The coefficient for year is 0.77, which suggests that, holding all other variables constant, mpg increases by about 0.77 for each additional model year.

Question 9d. Produce some of the diagnostic plots of the linear regression fit as described in the lab. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

#Diagnostic plots
par(mfrow = c(2, 2))
plot(model)

The Residuals vs Fitted plot does not look random; the residuals follow a U-shaped curve, suggesting some non-linearity in the fit. The Q-Q plot is mostly linear but drifts above the line in the upper tail, though it does not suggest unusually large outliers. The Scale-Location plot shows a reasonable spread of residuals, and the Residuals vs Leverage plot identifies a few points with unusually high leverage.
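
To attach numbers to the high-leverage points seen in the plot, one possible check is to inspect the hat values directly (a sketch; the 3x-average-leverage cutoff is only a rule of thumb):

# Leverage (hat) values; the average leverage is (p + 1)/n
h <- hatvalues(model)
head(sort(h, decreasing = TRUE))   # largest leverage values
which(h > 3 * mean(h))             # observations well above average leverage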

Question 9e. Fit some models with interactions as described in the lab. Do any interactions appear to be statistically significant?

interaction_model <- lm(mpg ~ weight * acceleration + year + displacement + horsepower + origin + cylinders, data = auto)
summary(interaction_model)
## 
## Call:
## lm(formula = mpg ~ weight * acceleration + year + displacement + 
##     horsepower + origin + cylinders, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3855 -2.0117  0.1639  1.7455 12.2559 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.800e+01  5.849e+00  -8.207 3.39e-15 ***
## weight               2.007e-03  1.505e-03   1.334 0.182926    
## acceleration         1.627e+00  2.437e-01   6.677 8.44e-11 ***
## year                 8.196e-01  4.763e-02  17.208  < 2e-16 ***
## displacement        -2.156e-03  7.511e-03  -0.287 0.774247    
## horsepower           7.009e-03  6.492e-03   1.080 0.280948    
## origin               9.063e-01  2.657e-01   3.412 0.000714 ***
## cylinders            3.655e-03  3.257e-01   0.011 0.991052    
## weight:acceleration -5.181e-04  8.167e-05  -6.344 6.24e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.175 on 388 degrees of freedom
## Multiple R-squared:  0.8387, Adjusted R-squared:  0.8354 
## F-statistic: 252.3 on 8 and 388 DF,  p-value: < 2.2e-16
interaction_model <- lm(mpg ~ weight * year + displacement + horsepower + acceleration + origin + cylinders, data = auto)
summary(interaction_model)
## 
## Call:
## lm(formula = mpg ~ weight * year + displacement + horsepower + 
##     acceleration + origin + cylinders, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0323 -1.8948  0.0079  1.5384 11.7087 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.220e+02  1.331e+01  -9.170  < 2e-16 ***
## weight        2.916e-02  4.576e-03   6.372 5.29e-10 ***
## year          2.089e+00  1.722e-01  12.134  < 2e-16 ***
## displacement  7.674e-03  6.848e-03   1.121 0.263115    
## horsepower    5.503e-03  6.332e-03   0.869 0.385391    
## acceleration  2.694e-01  7.345e-02   3.668 0.000278 ***
## origin        1.038e+00  2.529e-01   4.103 4.98e-05 ***
## cylinders     5.689e-02  3.172e-01   0.179 0.857764    
## weight:year  -4.794e-04  6.046e-05  -7.929 2.38e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.094 on 388 degrees of freedom
## Multiple R-squared:  0.8468, Adjusted R-squared:  0.8437 
## F-statistic: 268.2 on 8 and 388 DF,  p-value: < 2.2e-16
interaction_model <- lm(mpg ~ weight + acceleration * cylinders + year + displacement + horsepower + origin, data = auto)
summary(interaction_model)
## 
## Call:
## lm(formula = mpg ~ weight + acceleration * cylinders + year + 
##     displacement + horsepower + origin, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6385 -2.0148  0.0181  1.8525 12.5798 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -37.997857   5.777865  -6.576 1.56e-10 ***
## weight                  -0.006220   0.000589 -10.560  < 2e-16 ***
## acceleration             1.077460   0.233595   4.613 5.41e-06 ***
## cylinders                2.603860   0.767972   3.391 0.000769 ***
## year                     0.800046   0.048790  16.398  < 2e-16 ***
## displacement             0.003954   0.007698   0.514 0.607805    
## horsepower               0.007197   0.006672   1.079 0.281377    
## origin                   1.135397   0.268341   4.231 2.90e-05 ***
## acceleration:cylinders  -0.178164   0.042618  -4.181 3.60e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.263 on 388 degrees of freedom
## Multiple R-squared:  0.8297, Adjusted R-squared:  0.8262 
## F-statistic: 236.3 on 8 and 388 DF,  p-value: < 2.2e-16

Yes. All three interactions tested are statistically significant: weight:acceleration (p = 6.24e-10), weight:year (p = 2.38e-14), and acceleration:cylinders (p = 3.60e-05) all have p-values well below 0.05.
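
As a sanity check, each interaction can also be compared against the main-effects model with a partial F-test; for example, for the weight:year interaction (refitting it here because interaction_model was overwritten above):

# Partial F-test: does adding weight:year improve on the main-effects model?
wy_model <- lm(mpg ~ weight * year + displacement + horsepower +
                 acceleration + origin + cylinders, data = auto)
anova(model, wy_model)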

Question 9f. Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

# Add log(weight) and horsepower²
auto <- auto %>%
  mutate(log_weight = log(weight),
         horsepower2 = horsepower^2)

# Fit transformed model
trans_model <- lm(mpg ~ log_weight + horsepower2 + displacement + year + acceleration + origin + cylinders, data = auto)
summary(trans_model)
## 
## Call:
## lm(formula = mpg ~ log_weight + horsepower2 + displacement + 
##     year + acceleration + origin + cylinders, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3429 -1.9097  0.0201  1.6754 12.7258 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.368e+02  1.107e+01  12.365  < 2e-16 ***
## log_weight   -2.263e+01  1.488e+00 -15.213  < 2e-16 ***
## horsepower2  -2.072e-05  6.718e-05  -0.308  0.75796    
## displacement  1.951e-02  6.405e-03   3.046  0.00248 ** 
## year          8.030e-01  4.603e-02  17.445  < 2e-16 ***
## acceleration  1.807e-01  7.134e-02   2.532  0.01172 *  
## origin        9.595e-01  2.506e-01   3.829  0.00015 ***
## cylinders    -4.202e-01  3.170e-01  -1.325  0.18579    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.089 on 389 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.8442 
## F-statistic: 307.5 on 7 and 389 DF,  p-value: < 2.2e-16

I applied a logarithmic transformation to weight and squared horsepower, keeping the other variables the same, to see whether the transformed terms have a statistically significant effect on mpg. In the resulting model, log(weight), displacement, year, acceleration, and origin are statistically significant, while the squared horsepower term and cylinders are not.
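
For completeness, the square-root transformation mentioned in the prompt could be tried the same way; a sketch (not run above), replacing the squared term with sqrt(horsepower):

# Sketch: square-root transformation of horsepower
sqrt_model <- lm(mpg ~ log_weight + sqrt(horsepower) + displacement + year +
                   acceleration + origin + cylinders, data = auto)
summary(sqrt_model)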

Question 10a. Fit a multiple regression model to predict Sales using Price, Urban, and US.

library(ISLR)
data(Carseats)
str(Carseats)  # View structure of dataset
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# Fit the full model with all three predictors
model2 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model2)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Question 10b. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

The coefficients for this model are as follows: Intercept: 13.04, Price: -0.054, UrbanYes: -0.021, and USYes: 1.20.

The intercept of 13.04 is the predicted Sales (in thousands of units) when Price = 0, Urban = No, and US = No.

The Price coefficient of -0.054 indicates that, holding the other variables constant, each $1 increase in price is associated with a decrease in Sales of about 0.054 thousand units (roughly 54 car seats).

The UrbanYes coefficient of -0.021 indicates that, all else equal, stores in urban areas sell about 0.021 thousand fewer units than stores in non-urban areas; this difference is not statistically significant (p = 0.936).

The USYes coefficient of 1.20 indicates that, all else equal, stores in the US sell about 1.20 thousand more units than stores outside the US.

Question 10c. Write out the model in equation form, being careful to handle the qualitative variables properly.

Predicted Sales = 13.04 - 0.054(Price) - 0.021(UrbanYes) + 1.20(USYes)

The qualitative variables enter as dummy variables, where UrbanYes and USYes equal 1 for Yes and 0 for No.
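
As a quick check that this equation matches the fitted model, the hand-computed prediction for a hypothetical urban US store with Price = 120 can be compared with predict(); the two numbers should agree up to rounding of the printed coefficients:

# Hand-computed prediction vs. predict() for a hypothetical store
new_store <- data.frame(Price = 120, Urban = "Yes", US = "Yes")
13.043469 - 0.054459 * 120 - 0.021916 * 1 + 1.200573 * 1
predict(model2, newdata = new_store)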

Question 10d. For which of the predictors can you reject the null hypothesis H0: βj = 0?

We can reject the null hypothesis for Price and USYes, since their p-values are well below the 0.05 significance level; UrbanYes is not significant (p = 0.936), so we cannot reject H0 for it.

Question 10e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

model2.1 <- lm(Sales ~ Price + US, data = Carseats)
summary(model2.1)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Question 10f. How well do the models in (a) and (e) fit the data?

summary(model2)$adj.r.squared
## [1] 0.2335123
summary(model2.1)$adj.r.squared
## [1] 0.2354305

The model from (a) has an adjusted R-squared of 0.2335 (23.35%) and the model from (e) has an adjusted R-squared of 0.2354 (23.54%). The fits are essentially the same, with the smaller model from (e) performing marginally better; both explain only about a quarter of the variance in Sales.
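
Because the two models are nested, they can also be compared formally with a partial F-test on the dropped UrbanYes term (a possible check):

# Partial F-test: reduced model (e) vs. full model (a)
anova(model2.1, model2)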

Question 10g. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(model2.1, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

The 95% confidence interval for the Price coefficient is (-0.0648, -0.0442).

The 95% confidence interval for the USYes coefficient is (0.6915, 1.7078).

Question 10h. Is there evidence of outliers or high leverage observations in the model from (e)?

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model2.1)

# High leverage points
influencePlot(model2.1)
##        StudRes         Hat       CookD
## 26   2.5996518 0.011621599 0.026109457
## 43  -0.5349931 0.043337657 0.004329756
## 51  -2.8358431 0.004224147 0.011173381
## 175 -1.2144859 0.029686718 0.015024314
## 368  1.7366086 0.023707048 0.024287363
## 377  2.8915213 0.006637175 0.018282191

Yes. The Residuals vs Leverage plot and the influencePlot() output identify a few observations with unusually high leverage (for example, observation 43 with a hat value of about 0.043, far above the average leverage of 3/400 ≈ 0.0075). A few observations also have relatively large studentized residuals (26, 51, and 377, all near ±3), but none appear to be extreme outliers.
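
A numeric follow-up to the plots (a sketch; the 2x-average-leverage and |t| > 3 cutoffs are only rules of thumb):

# Leverage and studentized residuals for the reduced model
h2 <- hatvalues(model2.1)
which(h2 > 2 * mean(h2))              # high-leverage observations
which(abs(rstudent(model2.1)) > 3)    # candidate outliers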

Question 12a. Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), the no-intercept coefficient for regressing Y onto X is sum(x_i * y_i) / sum(x_i^2), while for regressing X onto Y it is sum(x_i * y_i) / sum(y_i^2). The two estimates are therefore the same exactly when the sums of squares are equal, i.e. when sum(x_i^2) = sum(y_i^2).

Question 12b. Generate an example in Python with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(42)

n <- 100
x <- rnorm(n)
y <- 2 * x

# Regression of Y onto X (no intercept)
model_y_on_x <- lm(y ~ x + 0)
coef_y_on_x <- coef(model_y_on_x)

# Regression of X onto Y (no intercept)
model_x_on_y <- lm(x ~ y + 0)
coef_x_on_y <- coef(model_x_on_y)

# Show coefficients
coef_y_on_x
## x 
## 2
coef_x_on_y
##   y 
## 0.5

Question 12c. Generate an example in Python with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(42)

n <- 100
x <- rnorm(n)
y <- x * 3

# Rescale y so that sum(y_scaled^2) equals sum(x^2); with equal sums of
# squares the two no-intercept coefficient estimates are identical
y_scaled <- y / sqrt(9)  # divide by sqrt(9) = 3, so y_scaled == x

# Fit models
coef(lm(y_scaled ~ x + 0))  # Should match
## x 
## 1
coef(lm(x ~ y_scaled + 0))
## y_scaled 
##        1
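
A quick numerical confirmation of the condition from 12a: after rescaling, the sums of squares are equal, which is exactly why the two no-intercept estimates coincide.

# Equal sums of squares => identical coefficient estimates
c(sum(x^2), sum(y_scaled^2))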