HOMEWORK 3

Madisson Beckworth

#1. Data Exploration

The mtcars dataset ships with R. It contains data on 32 cars, including miles per gallon, weight, horsepower, and eight other variables. In this section we load the dataset, review its summary statistics, and plot mpg against key variables to understand how they relate.

# Load packages and dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(performance)
## Warning: package 'performance' was built under R version 4.5.1
data(mtcars)

# View first few rows
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Summary statistics
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
# Structure of dataset
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

#1.2 Trends? Correlations? Patterns?

We will look for patterns and relationships between MPG and the other variables.

# Scatterplot: MPG vs Weight
plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs Car Weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon",
     pch = 19, col = "blue")

# Scatterplot: MPG vs Horsepower
plot(mtcars$hp, mtcars$mpg,
     main = "MPG vs Horsepower",
     xlab = "Horsepower",
     ylab = "Miles Per Gallon",
     pch = 19, col = "red")

From these plots, I observed that heavier cars and cars with higher horsepower tended to have lower MPG, which suggests a negative relationship between MPG and each of these variables.

#1.3 Correlation with MPG

Next, we check which variables are most strongly correlated with mpg.

# Correlation matrix
cor_matrix <- cor(mtcars)
cor_matrix["mpg", ]
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594  0.4186840 
##         vs         am       gear       carb 
##  0.6640389  0.5998324  0.4802848 -0.5509251

Weight (r = -0.87) has the strongest negative correlation with MPG, followed by cylinders (-0.85), displacement (-0.85), and horsepower (-0.78). Heavier cars and cars with more horsepower tend to have lower fuel efficiency.
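To see the full ordering rather than reading it off the printed row, the correlations with mpg can be sorted. This is a minimal sketch; the names `mpg_cor` and `ranked` are illustrative and not part of the original analysis.

```r
# Correlations of every other variable with mpg, sorted from most negative
# to most positive. Uses the built-in mtcars data.
mpg_cor <- cor(mtcars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]  # drop mpg's correlation with itself
ranked <- sort(mpg_cor)
round(ranked, 2)
```

The first entries of `ranked` are the variables most negatively correlated with mpg, and the last entries are the most positively correlated.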

#2 Data Processing

Before running the regression model, I want to check for any missing or invalid data that can skew our results.

# Check for missing values
colSums(is.na(mtcars))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0

There are no missing values in the mtcars dataset, and the ranges in the summary above show no impossible values (for example, no negative weights or mpg). The data appear clean.
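The check for impossible values can also be made explicit in code. The sketch below is one possible set of sanity checks; which rules count as "impossible" is an assumption made here for illustration (positive measurements, 0/1 indicator columns, whole-number counts).

```r
# Illustrative validity checks on the built-in mtcars data.
# The rules themselves are assumptions, not part of the original homework.
checks <- c(
  no_missing      = !anyNA(mtcars),
  positive_values = all(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")] > 0),
  binary_flags    = all(as.matrix(mtcars[, c("vs", "am")]) %in% c(0, 1)),
  whole_counts    = all(mtcars$cyl %% 1 == 0 &
                        mtcars$gear %% 1 == 0 &
                        mtcars$carb %% 1 == 0)
)
checks
```

Any `FALSE` in `checks` would point to a column that needs a closer look before modeling.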

#3 Linear Regression using lm

We will create a linear regression model to predict mpg using weight and horsepower.

# Linear regression model
model1 <- lm(mpg ~ wt + hp, data = mtcars)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

  • Intercept (37.23): the predicted mpg when weight and horsepower are both zero. No such car exists, so the value has no practical interpretation.

  • Weight: for each 1,000 lb increase in weight, mpg decreases by about 3.9, holding horsepower constant.

  • Horsepower: for each additional horsepower, mpg decreases by about 0.03, holding weight constant.

  • Both variables are statistically significant (p < 0.01).

  • About 82.7% of the variation in mpg is explained by weight and horsepower (R² = 0.8268).
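To make the coefficient interpretation concrete, the fitted equation can be applied to a single car. The sketch below uses a hypothetical car (3,000 lb, 150 hp; values chosen only for illustration) and compares the hand calculation with `predict()`.

```r
# Refit the additive model so this block runs on its own
model1 <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical car (illustrative values): 3,000 lb and 150 hp
new_car <- data.frame(wt = 3, hp = 150)

# Prediction by hand: intercept + wt slope * wt + hp slope * hp
b <- coef(model1)
by_hand <- unname(b["(Intercept)"] + b["wt"] * new_car$wt + b["hp"] * new_car$hp)

# Same prediction using predict()
by_predict <- unname(predict(model1, newdata = new_car))

c(by_hand = by_hand, by_predict = by_predict)  # both about 20.8 mpg
```

The two numbers agree because `predict()` simply evaluates the fitted linear equation at the new values.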

#3.2 Regression Assumptions and Diagnostic Plots

We need to check for linearity, normality of residuals and constant variance within our data.

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model1)

par(mfrow = c(1, 1))

Interpretation:

  • Linearity is satisfied.

  • Residuals are approximately normal.

  • Homoscedasticity: the residuals show an even spread.

All major assumptions of linear regression are met.
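The visual checks could be complemented with numeric ones. The sketch below is one possible approach, not part of the original assignment: a Shapiro-Wilk test for normality of the residuals, plus a rough heuristic (not a formal test) for constant variance.

```r
# Refit the additive model so this block runs on its own
model1 <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value would count as evidence against normality.
normality <- shapiro.test(residuals(model1))
normality$p.value

# Rough heuristic for constant variance: correlation between the absolute
# residuals and the fitted values. Values near zero suggest an even spread.
spread <- cor(abs(residuals(model1)), fitted(model1))
spread
```

With only 32 observations these checks have limited power, so they are best read alongside the diagnostic plots rather than in place of them.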

#3.3 Mean Squared Error

# Compute Mean Squared Error
mse <- mean(model1$residuals^2)
mse
## [1] 6.095242

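The MSE is 6.10. Its square root (the RMSE) is about 2.47, meaning the model's predictions typically differ from the actual mpg values by about 2.47 mpg.

Relative to the observed mpg range of 10.4 to 33.9, this error is fairly small, so the model fits the data reasonably well.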
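The figure of about 2.47 mpg is the square root of the MSE. The sketch below computes it and also shows why it differs slightly from the "Residual standard error" reported by `summary()`; the variable names are illustrative.

```r
# Refit the additive model so this block runs on its own
model1 <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(model1)

mse  <- mean(res^2)  # divides the squared error by n = 32
rmse <- sqrt(mse)    # back in the same units as mpg

# The "Residual standard error" in summary() divides by the residual
# degrees of freedom (n - 3 = 29) instead of n, so it is slightly larger.
rse <- sqrt(sum(res^2) / df.residual(model1))

round(c(mse = mse, rmse = rmse, rse = rse), 3)
```

Both `rmse` and `rse` summarize the typical prediction error in mpg; they differ only in the denominator.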

#3.4 Interaction Term

We are now going to include an interaction between weight and horsepower to see if their combined effect is significant.

# Model with interaction term
model2 <- lm(mpg ~ wt * hp, data = mtcars)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ wt * hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0632 -1.6491 -0.7362  1.4211  4.5513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
## wt          -8.21662    1.26971  -6.471 5.20e-07 ***
## hp          -0.12010    0.02470  -4.863 4.04e-05 ***
## wt:hp        0.02785    0.00742   3.753 0.000811 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared:  0.8848, Adjusted R-squared:  0.8724 
## F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

The interaction term is significant (p = 0.000811, below 0.05). R² is 0.88, meaning the model explains about 88% of the variance in mpg, up from about 83% without the interaction. Including the interaction between horsepower and weight therefore improves the fit: the effect of each variable on mpg depends on the level of the other.
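One way to read the interaction is to compute the horsepower slope at different weights: in this model the change in mpg per extra horsepower is the `hp` coefficient plus the `wt:hp` coefficient times weight. The sketch below does this for two illustrative weights and also compares the two models with a nested F-test; `hp_slope` is a hypothetical helper.

```r
# Refit both models so this block runs on its own
model1 <- lm(mpg ~ wt + hp, data = mtcars)
model2 <- lm(mpg ~ wt * hp, data = mtcars)
b <- coef(model2)

# Change in predicted mpg per extra horsepower at a given weight (1000 lbs)
hp_slope <- function(wt) unname(b["hp"] + b["wt:hp"] * wt)

# Illustrative weights: a light car (2,000 lb) and a heavy car (4,000 lb)
slopes <- c(light = hp_slope(2), heavy = hp_slope(4))
round(slopes, 3)

# Nested-model F-test: does adding the interaction improve on model1?
comparison <- anova(model1, model2)
comparison
```

Because the interaction coefficient is positive, the negative horsepower slope is steeper for light cars and flatter for heavy ones.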

#3.5 Outlier Detection

Check for outliers using a boxplot.

boxplot(mtcars$hp, main = "Boxplot of Horsepower", col = "lightblue")

The boxplot shows at least one outlier: a point lies above the upper whisker.
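The points drawn beyond the whiskers can also be listed directly instead of being read off the plot. A minimal sketch using `boxplot.stats()`, which applies the same 1.5 × IQR rule as `boxplot()`:

```r
# Values that boxplot() draws as individual points beyond the whiskers
hp_out <- boxplot.stats(mtcars$hp)$out
hp_out

# Which cars those values belong to
outlier_cars <- rownames(mtcars)[mtcars$hp %in% hp_out]
outlier_cars
```

On the built-in data this flags a single car, the Maserati Bora at 335 hp.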

#3.7 Winsorization

We apply winsorization at the 5th and 95th percentiles of the hp variable to reduce the influence of outliers:

# Winsorization
hp_winsor <- mtcars$hp
lower <- quantile(hp_winsor, 0.05)
upper <- quantile(hp_winsor, 0.95)

hp_winsor[hp_winsor < lower] <- lower
hp_winsor[hp_winsor > upper] <- upper

# Add winsorized hp to mtcars as a new column
mtcars$hp_winsor <- hp_winsor

# Fit new model
model3 <- lm(mpg ~ wt + hp_winsor, data = mtcars)
summary(model3)
## 
## Call:
## lm(formula = mpg ~ wt + hp_winsor, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8825 -1.6545 -0.0968  0.8367  5.7259 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.31722    1.56964  23.774  < 2e-16 ***
## wt          -3.58279    0.66427  -5.394  8.5e-06 ***
## hp_winsor   -0.03952    0.01059  -3.732 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.546 on 29 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8215 
## F-statistic: 72.34 on 2 and 29 DF,  p-value: 5.348e-12

The original additive model (model1) had an R² of 0.827; after winsorizing hp, R² is 0.833. Capping the extreme horsepower values therefore changed the fit only slightly, which suggests the outliers were not strongly influencing the model.
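The capping step can be wrapped in a small reusable function, which also makes it easy to count how many values were changed. This is a sketch: `winsorize` is a hypothetical helper written here, not a function from a package used in this report.

```r
# Hypothetical helper: cap values that fall outside the given quantiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

hp_capped <- winsorize(mtcars$hp)

# How many observations were changed, and the new range
n_changed <- sum(hp_capped != mtcars$hp)
n_changed
range(hp_capped)
```

With only 32 cars, only a handful of horsepower values are capped, which is consistent with a modest change in the fitted model.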

#3.8 Multicollinearity

The variance inflation factor (VIF) checks for multicollinearity, that is, whether the predictors are highly correlated with each other.

library(performance)
check_collinearity(model1)
## # Check for Multicollinearity
## 
## Low Correlation
## 
##  Term  VIF   VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
##    wt 1.77 [1.29, 3.00]     1.33      0.57     [0.33, 0.77]
##    hp 1.77 [1.29, 3.00]     1.33      0.57     [0.33, 0.77]

There are no multicollinearity concerns: both predictors have a VIF of 1.77, well below the commonly used threshold of 5. Separately, weight (r = -0.87) and horsepower (r = -0.78) are both strongly negatively correlated with mpg, which is consistent with our initial observation that heavier, higher-horsepower cars tend to have lower fuel efficiency.
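The VIF can be reproduced by hand: for a given predictor it is 1 / (1 − R²), where R² comes from regressing that predictor on the other predictors. A minimal sketch for the two-predictor model, with illustrative variable names:

```r
# R² from regressing one predictor on the other
r2_wt_on_hp <- summary(lm(wt ~ hp, data = mtcars))$r.squared

# VIF = 1 / (1 - R²)
vif_by_hand <- 1 / (1 - r2_wt_on_hp)
round(vif_by_hand, 2)

# With only two predictors, this R² is the squared correlation between them
wt_hp_cor <- cor(mtcars$wt, mtcars$hp)
round(wt_hp_cor, 2)
```

Because there are only two predictors, both share the same VIF, which matches the identical rows in the `check_collinearity()` output.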

#Conclusion

The regression analysis shows that both car weight and horsepower are significantly associated with lower miles per gallon. Adding the interaction term between the two improved the model fit, indicating that the effect of horsepower on fuel efficiency depends on vehicle weight. Diagnostic plots suggested that the model assumptions were met, and winsorizing horsepower had minimal impact. Multicollinearity was low, and the interaction model explained about 88% of the variance in mpg. Overall, the regression model is statistically significant and consistent with expectations.