homework3_MS5333

Homework Assignment:

Exploring Linear Regression and KNN in R

Objective:

The purpose of this assignment is to help you gain practical experience with two different machine learning algorithms: linear regression and k-Nearest Neighbors (KNN). You will work with a sample dataset to explore, analyze, and make predictions.

Dataset:

You can use the mtcars dataset, which is available in R by default. This dataset comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.

Instructions:

Data Exploration

• Use summary statistics and visualizations to understand the dataset.
• Identify any trends, correlations, or patterns.

### Here, I am checking for the summary stats of mtcars
data(mtcars)
summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

## print the first 5 rows of mtcars (11 columns)
head(mtcars, 5)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

## general information describing where the data came from and its format
#?mtcars
## general information about the column names and the row names
colnames(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

rownames(mtcars)

##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

## I'm loading the package for data visualization:
library(ggplot2)
#hist(mtcars$disp)
#hist(mtcars$mpg)
## correlation matrix, used later for the corrplot
corr_matrix <- cor(mtcars)
#print(corr_matrix)
## Boxplots of all variables on one page. hp, wt, qsec, and carb all have outliers.
par(mfrow = c(3, 4))
for (i in 1:11) {
  boxplot(mtcars[, i], main = colnames(mtcars)[i], ylab = "")
}
## Histograms of the predictors (column 1, mpg, is skipped).
## vs and am are split between 0 and 1. disp, hp, carb, and gear are skewed to the right. drat and qsec are close to normally distributed.
par(mfrow = c(3, 4))

for (col in 2:ncol(mtcars)) {
  hist(mtcars[, col], main = colnames(mtcars)[col], xlab = colnames(mtcars)[col])
}

Data Pre-processing

• If necessary, handle missing or inconsistent data.

## no missing values
sum(is.na(mtcars))

## [1] 0

Linear Regression using lm

• Create a linear regression model to predict mpg (miles per gallon) based on other variables in the dataset.
• Interpret the coefficients.
• Evaluate the model using MSE (Mean Squared Error).

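## mpg and weight have a negative relationship: heavier cars tend to have lower mpg. The intercept is 37.29, the predicted mpg at wt = 0 (an extrapolation), and the coefficient for wt is -5.3445, so each additional 1000 lbs of weight lowers the predicted mpg by about 5.34.
## The in-sample MSE is RSS/n = (3.046^2 * 30) / 32, about 8.70.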
mtcars_lm_simple <- lm(mpg ~ wt, data = mtcars)
summary(mtcars_lm_simple)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

## second simple model, stored under its own name so the wt model is not overwritten
mtcars_lm_carb <- lm(mpg ~ carb, data = mtcars)
summary(mtcars_lm_carb)

## 
## Call:
## lm(formula = mpg ~ carb, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.250 -3.316 -1.433  3.384 10.083 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.8723     1.8368  14.085 9.22e-15 ***
## carb         -2.0557     0.5685  -3.616  0.00108 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.113 on 30 degrees of freedom
## Multiple R-squared:  0.3035, Adjusted R-squared:  0.2803 
## F-statistic: 13.07 on 1 and 30 DF,  p-value: 0.001084

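## The residual standard error is 2.65; this is not the MSE. The in-sample MSE is RSS/n = (2.65^2 * 21) / 32, about 4.61. A smaller value indicates a closer fit. The model explains about 0.869 of the variance in mpg (adjusted R-squared 0.8066), which suggests a reasonably good fit, although no individual coefficient is significant at the 5% level.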
mtcars_lm <- lm(mpg ~ ., data = mtcars)
summary(mtcars_lm)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
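The summaries above report the residual standard error, sqrt(RSS / df), while the assignment asks for the MSE, RSS / n. A minimal sketch of computing the in-sample MSE directly from the fitted models follows; the helper name `mse_of` and the `fit_*` object names are hypothetical, and `datasets::mtcars` is used so the result does not depend on any later modification of `mtcars`.

```r
# Hypothetical helper: in-sample mean squared error of a fitted lm object.
mse_of <- function(fit) mean(residuals(fit)^2)

cars <- datasets::mtcars            # untouched copy of the built-in data

fit_wt   <- lm(mpg ~ wt,   data = cars)
fit_carb <- lm(mpg ~ carb, data = cars)
fit_full <- lm(mpg ~ .,    data = cars)

# Collect the three in-sample MSE values in one table
mse_table <- data.frame(
  model = c("mpg ~ wt", "mpg ~ carb", "mpg ~ ."),
  MSE   = c(mse_of(fit_wt), mse_of(fit_carb), mse_of(fit_full))
)
print(mse_table)
```

Because the full model nests the simple ones, its in-sample MSE can only be lower or equal; that alone does not show it predicts new cars better.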

KNN using kknn

• Use k-Nearest Neighbors to also predict mpg.
• Try different values of k.
• Evaluate the model using MSE.

# Diagnostic Plots
par(mfrow =c(2, 2))
plot(mtcars_lm)

# Perform KNN
library(kknn)

## Warning: package 'kknn' was built under R version 4.3.1

k <- 5 # Number of neighbors
knn_fit_mpg <- kknn(mpg ~ ., mtcars, mtcars, k=k)
knn_fit <- kknn(mpg ~ ., mtcars, mtcars, k=k, scale = FALSE)
mse_df <- data.frame(k = integer(), MSE = numeric())

for (k in c(1, 3, 5, 7, 9, 11)) {
  # Fit the k-NN model using the kknn function
  knn_model <- kknn(mpg ~ ., train = mtcars, test = mtcars, k = k)

  # Calculate the in-sample MSE
  mse <- mean((knn_model$fitted.values - mtcars$mpg)^2)
  # Append to mse_df (not to the scalar mse) so that every k is kept
  mse_df <- rbind(mse_df, data.frame(k = k, MSE = mse))
}
# Show the MSE data frame
print(mse_df)

## Known values: k = 5 gives an MSE of about 2.16; k = 11 gives 4.977965.
knn_std_mse <- mean((knn_fit_mpg$fitted.values - mtcars$mpg)^2)
print(paste("Mean Squared Error for stdKNN:", round(knn_std_mse, 2)))

## [1] "Mean Squared Error for stdKNN: 2.16"

# Use the unscaled fit (knn_fit); knn_fit_mpg would only repeat the scaled result above
knn_mse <- mean((knn_fit$fitted.values - mtcars$mpg)^2)
print(paste("Mean Squared Error for KNN:", round(knn_mse, 2)))

Comparison and Insights

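• Compare the performance of the linear regression model and the KNN model.

The residual standard error of the linear regression model (2.65) is not an MSE, so it cannot be compared directly with the KNN values. The in-sample MSE of the full linear regression model is about 4.61, computed as (2.65^2 * 21) / 32. That is slightly lower than the KNN MSE of 4.98 at k = 11, while the KNN fit at k = 5 reaches 2.16. All of these are training-set errors, which favor small k, so they do not by themselves show which model predicts unseen cars better.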
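Training-set MSE flatters both models, so a held-out estimate is a fairer basis for the comparison. As a minimal sketch (not part of the original analysis), the leave-one-out MSE of a linear model can be obtained without refitting, using the identity that the leave-one-out residual equals e_i / (1 - h_ii), where h_ii is the leverage of observation i. The object names are hypothetical.

```r
cars <- datasets::mtcars
fit  <- lm(mpg ~ ., data = cars)

# Ordinary in-sample MSE
in_sample <- mean(residuals(fit)^2)

# Leave-one-out residuals via the hat-matrix identity e_i / (1 - h_ii)
loo_res <- residuals(fit) / (1 - hatvalues(fit))
loo_mse <- mean(loo_res^2)

print(c(in_sample = in_sample, leave_one_out = loo_mse))
```

The gap between the two numbers indicates how much the 10-predictor model overfits 32 observations.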

• Discuss the advantages and disadvantages of both methods.

Pros of LR: helps us understand the relationship between the independent and dependent variables through interpretable coefficients

Cons of LR: assumes a linear relationship and is sensitive to outliers

Pros of KNN: makes no assumptions about the underlying data distribution

Cons of KNN: the choice of k can strongly affect model performance, and the distances depend on how the features are scaled

• Explain which model you would recommend and why.

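I would recommend the linear regression model. The relationship between mpg and the main predictors is roughly linear, the model has a high R-squared (0.869), and its coefficients are interpretable. With only 32 observations, the low training error of KNN at small k is likely to reflect overfitting rather than better prediction.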

Additional Question/Answers:

Make sure you answer these questions clearly with figure/table as evidence to support your arguments:

1. What variables are most strongly correlated with mpg?

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.1

## corrplot 0.92 loaded

corrplot(corr_matrix, method="circle", type="upper", order="hclust",
         tl.col="black", tl.srt=45)

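MPG is most strongly correlated with wt (weight, r about -0.87), followed by cyl and disp (both about -0.85) and hp (about -0.78); all four correlations are negative. VS [Engine (0 = V-shaped, 1 = straight)] and QSEC [1/4 mile time] are positively correlated with mpg, but more weakly (about 0.66 and 0.42).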
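The circles in a corrplot are hard to rank by eye, so a small table can serve as supporting evidence. A minimal sketch, assuming the untouched built-in data (`datasets::mtcars`) and hypothetical object names, sorts the correlations with mpg by absolute size:

```r
# Correlation of every variable with mpg
cors <- cor(datasets::mtcars)[, "mpg"]
cors <- cors[names(cors) != "mpg"]          # drop the trivial mpg-mpg entry

# Rank by strength, ignoring sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on the absolute value keeps strong negative relationships, such as the one with weight, at the top of the list.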

2. How does the value of k in KNN affect the model’s performance?

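When k = 1, each prediction copies a single neighbor, so the model is highly influenced by noise and outliers and overfits the training data. Larger values of k average over more neighbors; the predictions become overly smooth and can miss real patterns in the data (underfitting). On the training data the MSE rises from about 2.16 at k = 5 to 4.98 at k = 11.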
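Evaluating on the training data hides the overfitting at small k, because each car is among its own nearest neighbors. The sketch below estimates the error by leave-one-out instead. It is a hand-rolled, unweighted k-NN in base R, not the kernel-weighted method that kknn uses, and the function name `knn_loo_mse` is hypothetical. Scaling is done once on all rows, which is a small simplification.

```r
# Hypothetical helper: leave-one-out MSE of plain (unweighted) k-NN regression.
knn_loo_mse <- function(X, y, k) {
  X <- scale(X)                        # standardize each feature
  D <- as.matrix(dist(X))              # pairwise Euclidean distances
  diag(D) <- Inf                       # a row may not be its own neighbor
  preds <- vapply(seq_along(y), function(i) {
    nn <- order(D[i, ])[seq_len(k)]    # indices of the k closest other rows
    mean(y[nn])                        # prediction = average mpg of neighbors
  }, numeric(1))
  mean((preds - y)^2)
}

cars <- datasets::mtcars
X    <- cars[, setdiff(names(cars), "mpg")]
ks   <- c(1, 3, 5, 7, 9, 11)

loo <- data.frame(
  k   = ks,
  MSE = vapply(ks, function(k) knn_loo_mse(X, cars$mpg, k), numeric(1))
)
print(loo)
```

Plotting `loo$MSE` against `loo$k` would give the figure this question asks for.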

3. What assumptions are being made when we use linear regression? Are they met in this dataset? Just describe what you observe from the diagnostic plots.

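Linear regression assumes a linear relationship, independent errors, constant error variance, and approximately normal residuals. The Q-Q plot of the residuals is close to linear, with most points following the reference line, so the normality assumption looks reasonable. In the scale-location plot, the square root of the standardized residuals rises slightly as the fitted values increase, which hints at mild non-constant variance. In the residuals vs. fitted plot the smoothed line is not flat: the residuals dip and then rise again once the fitted values pass about 20, which suggests some nonlinearity that the model does not capture.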
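Visual impressions from the diagnostic plots can be backed up with simple numeric checks. The sketch below is an assumption-laden illustration rather than a formal diagnosis: it applies the Shapiro-Wilk test to the residuals and uses a Spearman correlation between the absolute residuals and the fitted values as a crude indicator of non-constant variance. The object names are hypothetical.

```r
fit <- lm(mpg ~ ., data = datasets::mtcars)
res <- residuals(fit)

# Normality of the residuals (null hypothesis: residuals are normal)
sw <- shapiro.test(res)
print(sw)

# Crude check for non-constant variance: do |residuals| grow with the fit?
sp <- cor.test(abs(res), fitted(fit), method = "spearman", exact = FALSE)
print(sp$estimate)
```

With only 32 observations both checks have low power, so they complement the plots rather than replace them.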

4. Try adding interaction terms to your linear regression model. Try to find at least one interaction term that has a statistically significant coefficient. Report the interaction term and check how these interaction terms influence the model’s performance in terms of R^2. How do you interpret your new model?

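Interaction between one continuous variable and one categorical variable

What if we believe there is an interaction effect involving vs on mpg?

The model below reports an R^2 of 0.775, but the formula places mpg, the response, on the right-hand side. R drops the mpg main effect with a warning and keeps mpg:vs, whose coefficient is exactly 1 because for vs = 1 the term reproduces mpg itself (the intercept and the vs coefficient cancel). The fit is therefore perfect for the vs = 1 cars and equal to the group mean of 16.62 for the vs = 0 cars, so this R^2 does not measure a genuine interaction. A valid specification crosses vs with a predictor, for example mpg ~ wt * vs.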

mtcars_lm_it1 <- lm(mpg ~ vs + vs + mpg*vs, data = mtcars)

## Warning in model.matrix.default(mt, mf, contrasts): the response appeared on
## the right-hand side and was dropped

## Warning in model.matrix.default(mt, mf, contrasts): problem with term 2 in
## model.matrix: no columns are assigned

summary(mtcars_lm_it1)

## 
## Call:
## lm(formula = mpg ~ vs + vs + mpg * vs, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.217 -1.192  0.000  0.000  9.383 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.6167     0.6967  23.850  < 2e-16 ***
## vs          -16.6167     3.8882  -4.274 0.000189 ***
## mpg:vs        1.0000     0.1524   6.561 3.46e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.956 on 29 degrees of freedom
## Multiple R-squared:  0.775,  Adjusted R-squared:  0.7595 
## F-statistic: 49.94 on 2 and 29 DF,  p-value: 4.048e-10
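An interaction term has to cross two predictors, never the response. As a minimal sketch of a valid specification, the code below crosses a continuous predictor (wt) with a binary one (am) and compares it with the additive model. The choice of wt and am is an assumption made for illustration; the same pattern applies to wt * vs. Object names are hypothetical.

```r
cars <- datasets::mtcars

fit_add <- lm(mpg ~ wt + am, data = cars)   # additive: same wt slope for both groups
fit_int <- lm(mpg ~ wt * am, data = cars)   # expands to wt + am + wt:am

# Coefficient table; the wt:am row is the difference in wt slope between groups
int_coefs <- coef(summary(fit_int))
print(int_coefs)

# R^2 with and without the interaction
r2 <- c(additive    = summary(fit_add)$r.squared,
        interaction = summary(fit_int)$r.squared)
print(round(r2, 4))

# F-test for whether the interaction improves on the additive model
print(anova(fit_add, fit_int))
```

The wt:am coefficient is read as the change in the slope of mpg on wt when am goes from 0 to 1, so a significant value means the weight penalty differs between the two transmission types.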

5. Are there any outliers in the dataset? If yes, apply truncation or winsorization techniques to handle outliers. Compare the performance of the models before and after applying these techniques. What differences do you observe?

## Given the outliers in hp, qsec, and carb, we apply winsorization to each by capping values above a benchmark. After capping, the boxplots no longer show outliers, which will affect both the KNN results and the linear regression model.
summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

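## Upper fence for outliers: third quartile + 1.5 * IQR
bench_hp <- quantile(mtcars$hp, 0.75, names = FALSE) + 1.5*IQR(mtcars$hp)
bench_hp

## [1] 305.25

## Cap only the hp column, not the whole data frame
mtcars$hp[mtcars$hp > bench_hp] <- bench_hp
boxplot(mtcars$hp)

bench_qsec <- quantile(mtcars$qsec, 0.75, names = FALSE) + 1.5*IQR(mtcars$qsec)
bench_qsec

## [1] 21.91125

mtcars$qsec[mtcars$qsec > bench_qsec] <- bench_qsec
boxplot(mtcars$qsec)

bench_carb <- quantile(mtcars$carb, 0.75, names = FALSE) + 1.5*IQR(mtcars$carb)
bench_carb

## [1] 7

mtcars$carb[mtcars$carb > bench_carb] <- bench_carb
boxplot(mtcars$carb)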
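The question also asks for a before-and-after comparison of the models. The sketch below is one way to set that up, under stated assumptions: the helper `winsorize` is hypothetical and caps at both Tukey fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR), it works on a copy so the original data stay available, and only the linear model is compared.

```r
# Hypothetical helper: cap values that fall outside the Tukey fences.
winsorize <- function(x) {
  q     <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

raw  <- datasets::mtcars
wins <- raw                                    # work on a copy
for (v in c("hp", "qsec", "carb")) {
  wins[[v]] <- winsorize(raw[[v]])             # touch only the named column
}

# In-sample MSE of the full linear model before and after winsorization
mse_of <- function(fit) mean(residuals(fit)^2)
cmp <- c(raw        = mse_of(lm(mpg ~ ., data = raw)),
         winsorized = mse_of(lm(mpg ~ ., data = wins)))
print(round(cmp, 3))
```

The same `wins` data frame can be passed to the KNN fit to complete the comparison for both models.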

6. How could feature scaling affect the KNN model?

Feature scaling has a large impact on KNN because the method relies on a distance metric to determine the closest neighbors when making a prediction (hence the name). Without scaling, variables with large ranges such as disp and hp dominate the distance. Normalization or robust scaling puts the variables on comparable ranges, and robust scaling also reduces the influence of outliers, much as truncation does.
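A small experiment makes the effect concrete. The sketch below uses only two features with very different ranges (disp and wt) and looks up the nearest neighbor of one car with and without standardization. The helper `nearest` and the choice of car are hypothetical and for illustration only.

```r
cars <- datasets::mtcars
X    <- cars[, c("disp", "wt")]       # disp spans hundreds, wt spans single digits

# Spread of each feature before scaling
print(round(sapply(X, sd), 2))

# Hypothetical helper: name of the closest other row by Euclidean distance.
nearest <- function(M, row) {
  D <- as.matrix(dist(M))
  diag(D) <- Inf                      # exclude the row itself
  rownames(D)[which.min(D[row, ])]
}

raw_nn    <- nearest(X, "Mazda RX4")
scaled_nn <- nearest(scale(X), "Mazda RX4")
print(c(raw = raw_nn, scaled = scaled_nn))
```

On the raw data the distance is driven almost entirely by disp; after `scale()` both features contribute on the same footing, which can change which cars count as neighbors.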

7. What insights can you derive from comparing the linear regression and KNN models?

After comparing the linear regression and KNN models, I see how closely KNN can follow the training data, including its noise, with small k values, and how sensitive the linear regression model is to outliers.