Exploring Linear Regression and KNN in R
The purpose of this assignment is to help you gain practical experience with two different machine learning algorithms: linear regression and k-Nearest Neighbors (KNN). You will work with a sample dataset to explore, analyze, and make predictions.
You can use the mtcars dataset, which is available in R by default. This dataset comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.
• Use summary statistics and visualizations to understand the dataset.
• Identify any trends, correlations, or patterns.
### Here, I am checking for the summary stats of mtcars
data(mtcars)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## printed first 5 rows of mtcars; there are 11 columns
head(mtcars, 5)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## general information: a description of the data, its source, and its format
#?mtcars
## the column names and the row names
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
rownames(mtcars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
## Loading the package for data visualization:
library(ggplot2)
## Boxplots of all variables are drawn together below; hp, wt, qsec, and carb all have outliers
#hist(mtcars$disp)
#hist(mtcars$mpg)
corr_matrix <- cor(mtcars) #print(corr_matrix)
## Boxplots are drawn first, followed by histograms
par(mfrow = c(3, 4))
for (i in 1:11) {
  boxplot(mtcars[, i], main = colnames(mtcars)[i], ylab = "")
}
## vs and am are binary, split between 0 and 1. disp, hp, carb, and gear are skewed to the right. drat and qsec are close to normally distributed
par(mfrow = c(3, 4))
# Columns 2 to 11 are the predictors (mpg is column 1)
for (col in 2:ncol(mtcars)) {
  hist(mtcars[, col], main = colnames(mtcars)[col], xlab = colnames(mtcars)[col])
}
## no missing values
sum(is.na(mtcars))
## [1] 0
• Create a linear regression model to predict mpg (miles per gallon) based on other variables in the dataset.
• Interpret the coefficients.
• Evaluate the model using MSE (Mean Square Error)
## mpg and wt have a negative relationship: heavier cars tend to have lower mpg. The intercept is 37.29 (the predicted mpg at wt = 0, which is an extrapolation) and the coefficient for wt is -5.3445, so each additional 1000 lbs of weight is associated with about 5.3 fewer miles per gallon.
## Simple model: mpg on wt
mtcars_lm_simple <- lm(mpg ~ wt, data = mtcars)
summary(mtcars_lm_simple)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
## A second simple model, mpg on carb, stored under its own name so the wt model is not overwritten
mtcars_lm_carb <- lm(mpg ~ carb, data = mtcars)
summary(mtcars_lm_carb)
##
## Call:
## lm(formula = mpg ~ carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.250 -3.316 -1.433 3.384 10.083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.8723 1.8368 14.085 9.22e-15 ***
## carb -2.0557 0.5685 -3.616 0.00108 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.113 on 30 degrees of freedom
## Multiple R-squared: 0.3035, Adjusted R-squared: 0.2803
## F-statistic: 13.07 on 1 and 30 DF, p-value: 0.001084
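## The residual standard error of the full model is 2.65 on 21 degrees of freedom. This is not the MSE: the in-sample MSE is RSS/n = 2.65^2 * 21 / 32, or about 4.61. Smaller values indicate a closer fit. The model explains about 86.9% of the variance in mpg (adjusted R-squared 0.8066), which suggests a reasonably good fit, although no individual coefficient is significant at the 5% level.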
mtcars_lm <- lm(mpg ~ ., data = mtcars)
summary(mtcars_lm)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
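The difference between the residual standard error printed by summary() and the MSE the assignment asks for can be checked directly. The following is a minimal sketch; it works on a fresh copy of the data, and the names cars, fit_full, rss, mse, and rse are illustrative choices, not part of the analysis above.

```r
# Fresh copy, so any in-place changes to mtcars do not affect the result
cars <- datasets::mtcars
fit_full <- lm(mpg ~ ., data = cars)

n   <- nrow(cars)                          # number of observations
rss <- sum(residuals(fit_full)^2)          # residual sum of squares
mse <- rss / n                             # in-sample mean squared error
rse <- sqrt(rss / df.residual(fit_full))   # residual standard error, as in summary()

cat("RSE:", round(rse, 3), " in-sample MSE:", round(mse, 3), "\n")
```

The RSE divides by the residual degrees of freedom and takes a square root, so it is on the scale of mpg; the MSE divides by n and is on the scale of mpg squared. The two should not be compared with each other directly.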
• Use k-Nearest Neighbors to also predict mpg.
• Try different values of k.
• Evaluate the model using MSE.
# Diagnostic Plots
par(mfrow =c(2, 2))
plot(mtcars_lm)
# Perform KNN
library(kknn)
## Warning: package 'kknn' was built under R version 4.3.1
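k <- 5 # Number of neighbors
# kknn standardizes the predictors by default (scale = TRUE)
knn_fit_mpg <- kknn(mpg ~ ., mtcars, mtcars, k = k)
# Same model on the raw, unscaled predictors
knn_fit <- kknn(mpg ~ ., mtcars, mtcars, k = k, scale = FALSE)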
mse_df <- data.frame(k = integer(), MSE = numeric())
for (k in c(1, 3, 5, 7, 9, 11)) {
  # Fit the k-NN model using the kknn function (train and test are both mtcars)
  knn_model <- kknn(mpg ~ ., train = mtcars, test = mtcars, k = k)
  # Calculate the in-sample MSE
  mse <- mean((knn_model$fitted.values - mtcars$mpg)^2)
  # Append to mse_df; binding to mse instead would discard the earlier rows
  mse_df <- rbind(mse_df, data.frame(k = k, MSE = mse))
}
# Show the MSE data frame
print(mse_df)
## In the recorded run, k = 11 gave an MSE of 4.977965.
knn_std_mse <- mean((knn_fit_mpg$fitted.values - mtcars$mpg)^2)
print(paste("Mean Squared Error for stdKNN:", round(knn_std_mse, 2)))
## [1] "Mean Squared Error for stdKNN: 2.16"
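# This reuses knn_fit_mpg, so it repeats the value above; knn_fit$fitted.values would score the unscaled fit
knn_mse <- mean((knn_fit_mpg$fitted.values - mtcars$mpg)^2)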
print(paste("Mean Squared Error for KNN:", round(knn_mse, 2)))
## [1] "Mean Squared Error for KNN: 2.16"
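• Compare the performance of the linear regression model and the KNN model.
The full linear regression model has a residual standard error of 2.65, which corresponds to an in-sample MSE of about 4.61 (2.65^2 * 21 / 32). The KNN model's in-sample MSE is 2.16 at k = 5 and 4.98 at k = 11. Linear regression therefore fits the training data better than KNN at k = 11 but worse than KNN at k = 5. All of these are training-set errors: KNN with a small k nearly memorizes the data, so its in-sample MSE understates the error on new cars. A held-out or cross-validated comparison is needed before calling either model the better predictor.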
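One way to put the linear models on an out-of-sample footing is leave-one-out cross-validation. The sketch below is an illustration rather than part of the original analysis: it uses the standard identity that the held-out residual of a linear model equals e_i / (1 - h_ii), so no refitting is needed. The helper name loocv_mse_lm is hypothetical.

```r
cars <- datasets::mtcars

# Leave-one-out CV error of a linear model via the hat-value identity:
# held-out residual_i = residual_i / (1 - leverage_i)
loocv_mse_lm <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

fit_wt   <- lm(mpg ~ wt, data = cars)
fit_full <- lm(mpg ~ .,  data = cars)

cat("LOOCV MSE, mpg ~ wt:", round(loocv_mse_lm(fit_wt), 2), "\n")
cat("LOOCV MSE, mpg ~ . :", round(loocv_mse_lm(fit_full), 2), "\n")
```

With only 32 cars and 10 predictors, the full model's cross-validated error can be noticeably larger than its in-sample MSE, which is the reason to look at both.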
• Discuss the advantages and disadvantages of both methods.
Pros of LR: interpretable coefficients that help explain the relationship between the independent and dependent variables
Cons of LR: assumes a linear relationship and is sensitive to outliers
Pros of KNN: makes no assumptions about the underlying data distribution
Cons of KNN: the choice of k strongly affects model performance
• Explain which model you would recommend and why.
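I would recommend the linear regression model. The relationship between mpg and the predictors is roughly linear, the residual standard error is low (2.65), and the R-squared is high (0.869), which suggests that the linear model provides a good fit to the mtcars data. Its coefficients are also directly interpretable.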
Make sure you answer these questions clearly with figure/table as evidence to support your arguments:
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.1
## corrplot 0.92 loaded
corrplot(corr_matrix, method="circle", type="upper", order="hclust",
tl.col="black", tl.srt=45)
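MPG is most strongly correlated with wt (r of about -0.87, consistent with the R-squared of 0.7528 in the simple regression above), followed by cyl, disp, and hp, all negatively. Among the positive correlations, vs [engine (0 = V-shaped, 1 = straight)] is moderate, while qsec [1/4 mile time] is comparatively weak.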
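The circles in the correlation plot can be hard to rank by eye. A short sketch like the following lists the correlations with mpg in order of strength; the names cars, r, and r_sorted are illustrative.

```r
cars <- datasets::mtcars

# Correlation of every other variable with mpg
r <- cor(cars)["mpg", ]
r <- r[names(r) != "mpg"]

# Sort by absolute value so strong negative correlations rank high too
r_sorted <- r[order(abs(r), decreasing = TRUE)]
print(round(r_sorted, 2))
```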
2. How does the value of k in KNN affect the model’s performance?
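When k = 1, each prediction copies a single neighbor, so the model is highly influenced by noise and outliers (high variance, overfitting). Larger values of k average over more cars and give smoother predictions. If k is too large, the model is over-smoothed and oversimplifies, missing real patterns in the data (high bias, underfitting).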
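The trade-off can be seen by scoring each k on cars the model has not seen. The sketch below is an assumption-laden illustration: it implements plain unweighted KNN regression in base R (not the kernel-weighted method that kknn uses) and evaluates it with leave-one-out cross-validation. The helper loocv_mse_knn is hypothetical, and the predictors are standardized once on all rows for simplicity.

```r
cars <- datasets::mtcars
x <- scale(as.matrix(cars[, names(cars) != "mpg"]))  # standardized predictors
y <- cars$mpg

d <- as.matrix(dist(x))  # pairwise Euclidean distances between cars
diag(d) <- Inf           # a car may not be its own neighbor (leave-one-out)

# Unweighted KNN regression: predict the mean mpg of the k nearest other cars
loocv_mse_knn <- function(k) {
  pred <- sapply(seq_along(y), function(i) mean(y[order(d[i, ])[1:k]]))
  mean((pred - y)^2)
}

ks <- c(1, 3, 5, 7, 9, 11)
mse_by_k <- data.frame(k = ks, MSE = sapply(ks, loocv_mse_knn))
print(mse_by_k)
```

Because each car is excluded from its own neighborhood, k = 1 no longer gets a perfect score, and the table shows where added smoothing starts to cost accuracy.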
3. What assumptions are being made when we use linear regression? Are they met in this dataset? Just describe what you observe from the diagnostic plots.
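Linear regression assumes a linear relationship, independent errors, constant error variance, and approximately normal residuals. The Q-Q plot of the residuals is close to linear, with most points following the reference line, so the normality assumption looks reasonable. In the scale-location plot, the sqrt(standardized residuals) rises slightly as the fitted values increase, which hints at mild heteroscedasticity. In the residuals vs fitted plot, the smoothed line is not flat: the residuals dip and then rise again once the fitted values pass about 20, which suggests some nonlinearity that the model does not capture.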
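4. Try adding interaction terms to your linear regression model. Try to find at least one interaction term that has a statistically significant coefficient. Report the interaction term, check how it influences the model's performance in terms of R^2, and interpret your new model.
What if we believe there is an interaction effect between mpg and vs?
The model below reports an R^2 of 0.775, but this figure is not meaningful. The formula puts the response mpg on the right-hand side, so R drops it as a main effect (see the warnings) and keeps only mpg:vs, which equals mpg itself for cars with vs = 1 and 0 otherwise. That is why the mpg:vs coefficient is exactly 1 and the vs coefficient exactly cancels the intercept: the model reproduces mpg perfectly for the vs = 1 cars and predicts the vs = 0 group mean (16.62) for the rest. A valid interaction term must combine two predictors, for example wt and hp, rather than the response.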
mtcars_lm_it1 <- lm(mpg ~ vs + vs + mpg*vs, data = mtcars)
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared on
## the right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 2 in
## model.matrix: no columns are assigned
summary(mtcars_lm_it1)
##
## Call:
## lm(formula = mpg ~ vs + vs + mpg * vs, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.217 -1.192 0.000 0.000 9.383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.6167 0.6967 23.850 < 2e-16 ***
## vs -16.6167 3.8882 -4.274 0.000189 ***
## mpg:vs 1.0000 0.1524 6.561 3.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.956 on 29 degrees of freedom
## Multiple R-squared: 0.775, Adjusted R-squared: 0.7595
## F-statistic: 49.94 on 2 and 29 DF, p-value: 4.048e-10
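An interaction between two predictors can be tested as in the sketch below. The choice of wt and hp is an illustrative assumption, not something taken from the analysis above, and the names fit_add and fit_int are hypothetical.

```r
cars <- datasets::mtcars

fit_add <- lm(mpg ~ wt + hp, data = cars)  # main effects only
fit_int <- lm(mpg ~ wt * hp, data = cars)  # main effects plus the wt:hp product

# Estimate, standard error, t value, and p-value of the interaction term
print(coef(summary(fit_int))["wt:hp", ])

# R-squared with and without the interaction
print(c(additive    = summary(fit_add)$r.squared,
        interaction = summary(fit_int)$r.squared))

# F test of the nested models: does wt:hp improve the fit?
print(anova(fit_add, fit_int))
```

A significant wt:hp coefficient would mean that the effect of weight on mpg depends on horsepower, so the wt and hp coefficients can no longer be read as constant slopes on their own.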
5. Are there any outliers in the dataset? If yes, apply truncation or winsorization techniques to handle outliers. Compare the performance of the models before and after applying these techniques. What differences do you observe?
## There are outliers in hp, qsec, and carb, so we will apply winsorization to each. After capping, the boxplots no longer show outliers, which will affect the KNN results and the LR model.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
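# Upper cap for hp: mean + 1.5 * IQR (the usual boxplot fence would be Q3 + 1.5 * IQR)
bench_hp <- 146.7 + 1.5*IQR(mtcars$hp)
bench_hp
## [1] 271.95
# Cap only the hp column; indexing the whole data frame would cap every column
mtcars$hp[mtcars$hp > bench_hp] <- bench_hp
boxplot(mtcars$hp)
# Upper cap for qsec, applied to the qsec column only
bench_qsec <- 17.85 + 1.5*IQR(mtcars$qsec)
bench_qsec
## [1] 20.86125
mtcars$qsec[mtcars$qsec > bench_qsec] <- bench_qsec
boxplot(mtcars$qsec)
# Upper cap for carb, applied to the carb column only
bench_carb <- 2.812 + 1.5*IQR(mtcars$carb)
bench_carb
## [1] 5.812
mtcars$carb[mtcars$carb > bench_carb] <- bench_carb
boxplot(mtcars$carb)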
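To compare model performance before and after capping, the models need to be refit on both versions of the data. The following is a minimal sketch under stated assumptions: it works on copies of the original data, the helper names cap_upper and in_sample_mse are hypothetical, and the cap uses the computed column mean rather than the rounded values typed in above.

```r
raw <- datasets::mtcars

# Hypothetical helper: cap one numeric vector at mean + 1.5 * IQR,
# mirroring the rule used above (the usual Tukey fence is Q3 + 1.5 * IQR)
cap_upper <- function(x) pmin(x, mean(x) + 1.5 * IQR(x))

# Winsorize only the three columns of interest; all others stay untouched
capped <- raw
for (v in c("hp", "qsec", "carb")) capped[[v]] <- cap_upper(raw[[v]])

# In-sample MSE of the full linear model on a given data frame
in_sample_mse <- function(d) mean(residuals(lm(mpg ~ ., data = d))^2)

cat("LR MSE before:", round(in_sample_mse(raw), 3),
    " after:", round(in_sample_mse(capped), 3), "\n")
```

The same two data frames can be passed to the KNN code to complete the comparison.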
6. How could feature scaling affect the KNN model?
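Feature scaling has a large impact on KNN because the model relies on a distance metric to determine the closest neighbors (hence the name). Without scaling, variables with large ranges such as disp and hp dominate the distance, and variables such as wt or drat barely count. Normalization or robust scaling puts the variables on a comparable range, and robust scaling, like truncation, also limits the influence of outliers.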
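The effect of scaling on the distance metric can be made concrete by asking how much each variable contributes to the squared Euclidean distances between cars. This is a sketch for illustration only; the four columns and the helper name share are arbitrary choices.

```r
cars <- datasets::mtcars
x_raw    <- as.matrix(cars[, c("disp", "hp", "wt", "qsec")])
x_scaled <- scale(x_raw)  # each column: mean 0, standard deviation 1

# Share of the total squared pairwise distance contributed by each column
share <- function(x) {
  ss <- apply(x, 2, function(col) sum(dist(col)^2))
  round(ss / sum(ss), 3)
}

print(share(x_raw))     # raw units: large-range columns dominate
print(share(x_scaled))  # standardized: every column contributes equally
```

On the raw data nearly all of the distance comes from disp and hp, so an unscaled KNN model is in effect choosing neighbors by those two variables alone.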
7. What insights can you derive from comparing the linear regression and KNN models?
After comparing the linear regression and KNN models, I see how sensitive KNN is to the choice of k, especially at small values, and how sensitive the LR model is to outliers.