KNN regression predicts a continuous response, while KNN classification predicts a categorical one. For a new observation, KNN regression averages the response values of the k nearest neighbors, whereas a KNN classifier takes a majority vote among the neighbors' class labels.
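As an illustrative aside (not part of the assignment), here is a minimal hand-rolled KNN in base R for a single numeric predictor; the function knn_predict and all of the data below are made up for demonstration:
# Minimal one-dimensional KNN: mean of neighbors for a numeric response,
# majority vote for a factor response (sketch only)
knn_predict <- function(x_train, y_train, x_new, k = 3) {
  idx <- order(abs(x_train - x_new))[1:k]  # indices of the k nearest points
  neighbors <- y_train[idx]
  if (is.numeric(neighbors)) {
    mean(neighbors)                        # regression: average the neighbors
  } else {
    names(which.max(table(neighbors)))     # classification: majority vote
  }
}
set.seed(1)
x <- runif(20)
y_num <- 2 * x + rnorm(20, sd = 0.1)             # continuous response
y_cat <- factor(ifelse(x > 0.5, "high", "low"))  # categorical response
knn_predict(x, y_num, 0.4)  # KNN regression prediction (mean of 3 neighbors)
knn_predict(x, y_cat, 0.4)  # KNN classification prediction (majority vote)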
# Load data and libraries
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.3.3
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(broom)
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Question 10
auto <- read.csv("C:/Users/lclha/Documents/MSDA Program 2024-2026/Summer 2025/Predictive Modeling/Datasets/Auto.csv", stringsAsFactors = TRUE)
auto$origin <- as.numeric(auto$origin)
# horsepower is read as a factor because of "?" entries; converting via
# character keeps the actual values (a direct as.numeric() on a factor
# returns level codes) and turns "?" into NA for drop_na() below
auto$horsepower <- as.numeric(as.character(auto$horsepower))
auto <- auto %>% tidyr::drop_na() %>% dplyr::select(-name)
pairs(auto)
corr_matrix <- cor(auto)
corrplot(corr_matrix, method = "color", type = "upper", tl.col = "black")
model <- lm(mpg ~ ., data = auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ ., data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.629 -2.034 -0.046 1.801 13.010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.128e+01 4.259e+00 -4.998 8.78e-07 ***
## cylinders -2.927e-01 3.382e-01 -0.865 0.3874
## displacement 1.603e-02 7.284e-03 2.201 0.0283 *
## horsepower 7.942e-03 6.809e-03 1.166 0.2442
## weight -6.870e-03 5.799e-04 -11.846 < 2e-16 ***
## acceleration 1.539e-01 7.750e-02 1.986 0.0477 *
## year 7.734e-01 4.939e-02 15.661 < 2e-16 ***
## origin 1.346e+00 2.691e-01 5.004 8.52e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.331 on 389 degrees of freedom
## Multiple R-squared: 0.822, Adjusted R-squared: 0.8188
## F-statistic: 256.7 on 7 and 389 DF, p-value: < 2.2e-16
anova(model)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cylinders 1 14613.9 14613.9 1317.0129 < 2.2e-16 ***
## displacement 1 1112.8 1112.8 100.2818 < 2.2e-16 ***
## horsepower 1 21.4 21.4 1.9290 0.16566
## weight 1 1190.6 1190.6 107.2945 < 2.2e-16 ***
## acceleration 1 59.9 59.9 5.3969 0.02069 *
## year 1 2659.4 2659.4 239.6702 < 2.2e-16 ***
## origin 1 277.8 277.8 25.0398 8.523e-07 ***
## Residuals 389 4316.5 11.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(model)["year"]
## year
## 0.7734417
Yes, there is a relationship between the predictors and mpg: the overall F-statistic is 256.7 with p < 2.2e-16, so at least one predictor is related to the response.
At the individual level, displacement, weight, acceleration, year, and origin have statistically significant relationships with mpg, while cylinders and horsepower do not.
The coefficient for year is about 0.77, which suggests that, holding the other variables constant, mpg increases by roughly 0.77 for each additional model year.
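A compact way to pull out just the significant terms, using the broom package loaded earlier (a sketch; the 0.05 cutoff is my assumption):
# Keep only the coefficients significant at the 5% level
broom::tidy(model) %>%
  dplyr::filter(p.value < 0.05)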
# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
In the Residuals vs Fitted plot, the residuals do not look random; they follow a U-shaped curve, suggesting some nonlinearity. The Q-Q plot of residuals is mostly linear, though it drifts above the line in the upper tail, which does not suggest unusually extreme outliers. The Scale-Location plot shows a reasonably even spread, and the Residuals vs Leverage plot shows a few high-leverage points.
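One way to identify those high-leverage observations numerically (a sketch; the twice-the-average-hat-value cutoff is a common rule of thumb, not from the text):
h <- hatvalues(model)   # leverage of each observation
which(h > 2 * mean(h))  # flag points above twice the average leverage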
interaction_model <- lm(mpg ~ weight * acceleration + year + displacement + horsepower + origin + cylinders, data = auto)
summary(interaction_model)
##
## Call:
## lm(formula = mpg ~ weight * acceleration + year + displacement +
## horsepower + origin + cylinders, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3855 -2.0117 0.1639 1.7455 12.2559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.800e+01 5.849e+00 -8.207 3.39e-15 ***
## weight 2.007e-03 1.505e-03 1.334 0.182926
## acceleration 1.627e+00 2.437e-01 6.677 8.44e-11 ***
## year 8.196e-01 4.763e-02 17.208 < 2e-16 ***
## displacement -2.156e-03 7.511e-03 -0.287 0.774247
## horsepower 7.009e-03 6.492e-03 1.080 0.280948
## origin 9.063e-01 2.657e-01 3.412 0.000714 ***
## cylinders 3.655e-03 3.257e-01 0.011 0.991052
## weight:acceleration -5.181e-04 8.167e-05 -6.344 6.24e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.175 on 388 degrees of freedom
## Multiple R-squared: 0.8387, Adjusted R-squared: 0.8354
## F-statistic: 252.3 on 8 and 388 DF, p-value: < 2.2e-16
interaction_model <- lm(mpg ~ weight * year + displacement + horsepower + acceleration + origin + cylinders, data = auto)
summary(interaction_model)
##
## Call:
## lm(formula = mpg ~ weight * year + displacement + horsepower +
## acceleration + origin + cylinders, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0323 -1.8948 0.0079 1.5384 11.7087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.220e+02 1.331e+01 -9.170 < 2e-16 ***
## weight 2.916e-02 4.576e-03 6.372 5.29e-10 ***
## year 2.089e+00 1.722e-01 12.134 < 2e-16 ***
## displacement 7.674e-03 6.848e-03 1.121 0.263115
## horsepower 5.503e-03 6.332e-03 0.869 0.385391
## acceleration 2.694e-01 7.345e-02 3.668 0.000278 ***
## origin 1.038e+00 2.529e-01 4.103 4.98e-05 ***
## cylinders 5.689e-02 3.172e-01 0.179 0.857764
## weight:year -4.794e-04 6.046e-05 -7.929 2.38e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.094 on 388 degrees of freedom
## Multiple R-squared: 0.8468, Adjusted R-squared: 0.8437
## F-statistic: 268.2 on 8 and 388 DF, p-value: < 2.2e-16
interaction_model <- lm(mpg ~ weight + acceleration * cylinders + year + displacement + horsepower + origin, data = auto)
summary(interaction_model)
##
## Call:
## lm(formula = mpg ~ weight + acceleration * cylinders + year +
## displacement + horsepower + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6385 -2.0148 0.0181 1.8525 12.5798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.997857 5.777865 -6.576 1.56e-10 ***
## weight -0.006220 0.000589 -10.560 < 2e-16 ***
## acceleration 1.077460 0.233595 4.613 5.41e-06 ***
## cylinders 2.603860 0.767972 3.391 0.000769 ***
## year 0.800046 0.048790 16.398 < 2e-16 ***
## displacement 0.003954 0.007698 0.514 0.607805
## horsepower 0.007197 0.006672 1.079 0.281377
## origin 1.135397 0.268341 4.231 2.90e-05 ***
## acceleration:cylinders -0.178164 0.042618 -4.181 3.60e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.263 on 388 degrees of freedom
## Multiple R-squared: 0.8297, Adjusted R-squared: 0.8262
## F-statistic: 236.3 on 8 and 388 DF, p-value: < 2.2e-16
Yes, several interactions appear statistically significant: weight:acceleration, weight:year, and acceleration:cylinders each have very small p-values in their respective models.
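To confirm that an interaction genuinely improves on the additive fit, a nested-model F-test can be used (a sketch, comparing the original additive model to the last interaction model fit above):
# F-test: does adding acceleration:cylinders improve the additive model?
anova(model, interaction_model)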
# Add log(weight) and horsepower²
auto <- auto %>%
mutate(log_weight = log(weight),
horsepower2 = horsepower^2)
# Fit transformed model
trans_model <- lm(mpg ~ log_weight + horsepower2 + displacement + year + acceleration + origin + cylinders, data = auto)
summary(trans_model)
##
## Call:
## lm(formula = mpg ~ log_weight + horsepower2 + displacement +
## year + acceleration + origin + cylinders, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3429 -1.9097 0.0201 1.6754 12.7258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.368e+02 1.107e+01 12.365 < 2e-16 ***
## log_weight -2.263e+01 1.488e+00 -15.213 < 2e-16 ***
## horsepower2 -2.072e-05 6.718e-05 -0.308 0.75796
## displacement 1.951e-02 6.405e-03 3.046 0.00248 **
## year 8.030e-01 4.603e-02 17.445 < 2e-16 ***
## acceleration 1.807e-01 7.134e-02 2.532 0.01172 *
## origin 9.595e-01 2.506e-01 3.829 0.00015 ***
## cylinders -4.202e-01 3.170e-01 -1.325 0.18579
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.089 on 389 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.8442
## F-statistic: 307.5 on 7 and 389 DF, p-value: < 2.2e-16
I applied a logarithmic transformation to weight and squared horsepower, keeping the other variables the same, to see whether the transformed terms have a statistically significant effect on mpg. In this model, log(weight), displacement, year, acceleration, and origin are statistically significant, while squared horsepower and cylinders are not.
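As a quick check on whether the transformation helped overall, the original and transformed fits can be compared on AIC (a sketch; lower AIC indicates the better fit on the same data):
AIC(model, trans_model)  # compare the additive and transformed models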
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.3
data(Carseats)
str(Carseats) # View structure of dataset
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# Fit the full model with all three predictors
model2 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model2)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The coefficients for this model are: Intercept 13.04, Price -0.054, UrbanYes -0.021, and USYes 1.20.
The intercept of 13.04 is the predicted Sales when Price = 0, Urban = No, and US = No.
The Price coefficient of -0.054 indicates that each $1 increase in price decreases Sales by about 0.054 units, holding the other variables constant.
The UrbanYes coefficient of -0.021 indicates that stores in urban areas sell about 0.021 fewer units than stores in non-urban areas, holding the other variables constant (though this effect is not statistically significant).
The USYes coefficient of 1.20 indicates that stores in the US sell about 1.20 more units than stores outside the US, holding the other variables constant.
Predicted Sales = 13.04 - 0.054(Price) - 0.021(UrbanYes) + 1.20(USYes),
where UrbanYes and USYes are indicator variables equal to 1 for Yes and 0 for No.
For the Price and USYes variables we can reject the null hypothesis that the coefficient is zero, since their p-values are below the 0.05 level; they are significant predictors. For UrbanYes (p = 0.936) we fail to reject the null hypothesis.
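The fitted equation can also be applied directly with predict(); the example store below (Price = 120, urban, in the US) is made up for illustration:
# Predicted Sales for a hypothetical store: Price 120, Urban = Yes, US = Yes
predict(model2, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))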
model2.1 <- lm(Sales ~ Price + US, data = Carseats)
summary(model2.1)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
summary(model2)$adj.r.squared
## [1] 0.2335123
summary(model2.1)$adj.r.squared
## [1] 0.2354305
The model from part (a) has an adjusted R-squared of 0.2335 (23.35%) and the model from part (e) has an adjusted R-squared of 0.2354 (23.54%). The two fit about equally well, with the smaller model from part (e) performing marginally better.
confint(model2.1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The 95% confidence interval for the Price coefficient is (-0.0648, -0.0442).
The 95% confidence interval for the USYes coefficient is (0.6915, 1.7078).
# Diagnostic plots
par(mfrow = c(2, 2))
plot(model2.1)
# High leverage points
influencePlot(model2.1)
## StudRes Hat CookD
## 26 2.5996518 0.011621599 0.026109457
## 43 -0.5349931 0.043337657 0.004329756
## 51 -2.8358431 0.004224147 0.011173381
## 175 -1.2144859 0.029686718 0.015024314
## 368 1.7366086 0.023707048 0.024287363
## 377 2.8915213 0.006637175 0.018282191
Yes, there is evidence of a few high-leverage observations: points on the far right of the Residuals vs Leverage plot, and the observations flagged by influencePlot() (e.g., observation 43, with the largest hat value), indicating potentially influential points.
The coefficient estimate for the no-intercept regression of Y onto X is sum(x_i * y_i) / sum(x_i^2), while for X onto Y it is sum(x_i * y_i) / sum(y_i^2); the two estimates are the same exactly when the sums of squares are equal, i.e., when sum(x_i^2) = sum(y_i^2).
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 * x
# Regression of Y onto X (no intercept)
model_y_on_x <- lm(y ~ x + 0)
coef_y_on_x <- coef(model_y_on_x)
# Regression of X onto Y (no intercept)
model_x_on_y <- lm(x ~ y + 0)
coef_x_on_y <- coef(model_x_on_y)
# Show coefficients
coef_y_on_x
## x
## 2
coef_x_on_y
## y
## 0.5
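Note that y is an exact multiple of x here, so the two slopes are exact reciprocals; a quick check (an illustrative aside, not part of the exercise):
coef_y_on_x * coef_x_on_y  # equals 1 because y = 2x exactly (no noise)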
set.seed(42)
n <- 100
x <- rnorm(n)
y <- x * 3
# Regression of Y onto X (no intercept)
model_y_on_x <- lm(y ~ x + 0)
coef_y_on_x <- coef(model_y_on_x)
# Regression of X onto Y (no intercept)
model_x_on_y <- lm(x ~ y + 0)
coef_x_on_y <- coef(model_x_on_y)
# Now rescale y so its sum of squares matches that of x
y_scaled <- y / sqrt(9) # divide by sqrt(9) = 3
# Fit models
coef(lm(y_scaled ~ x + 0)) # Should match
## x
## 1
coef(lm(x ~ y_scaled + 0))
## y_scaled
## 1
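As a final check, the condition from part (a) now holds for x and the rescaled y (a quick verification using the objects above):
sum(x^2)         # sum of squares of x
sum(y_scaled^2)  # equal to sum(x^2), so the two no-intercept slopes match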