R Markdown

# Load required libraries
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
# Load the mtcars dataset
data(mtcars)

# Open the codebook (help page) for mtcars
?mtcars
# Display the first few rows and summary statistics
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

2.1 Create a linear regression model to predict mpg (miles per gallon) with just wt and hp variables in the dataset. 10pts

# Create linear regression model
model <- lm(mpg ~ wt + hp, data = mtcars)

# View model summary
summary(model)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

2.2 Interpret both coefficients, like what we did in class. 10pts

Answer: The fitted model is mpg = 37.23 − 3.88 × wt − 0.032 × hp. Both weight and horsepower have statistically significant negative relationships with mpg (p = 1.12e-06 and p = 0.00145, respectively). The coefficient for wt is −3.88. Because wt is recorded in units of 1000 lbs, a 1000-pound increase in weight is associated with a decrease of about 3.88 mpg on average, holding horsepower constant. The coefficient for hp is −0.032. Each additional unit of horsepower is associated with a decrease of about 0.03 mpg on average, holding weight constant.
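One way to check this reading of the coefficients is to compare predictions for two cars that differ in only one predictor. The sketch below is illustrative: the baseline values (wt = 3, hp = 150) are arbitrary choices made for this example, not values from the assignment.

```r
# Refit the model from 2.1 so this block runs on its own
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical cars (values chosen only for illustration)
base_car <- data.frame(wt = 3, hp = 150)
heavier  <- data.frame(wt = 4, hp = 150)  # +1 unit of wt (1000 lbs), same hp
stronger <- data.frame(wt = 3, hp = 151)  # +1 hp, same wt

# Differences in predicted mpg reproduce the fitted coefficients
wt_effect <- unname(predict(model, heavier)  - predict(model, base_car))
hp_effect <- unname(predict(model, stronger) - predict(model, base_car))

wt_effect
hp_effect
```

Because the model is additive, the same differences would appear for any other baseline car.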

2.3 What assumptions are being made when we use linear regression? Are they met in this dataset? Use diagnostic plots and describe what you observe from the plots. 10pts

# Fit model
model <- lm(mpg ~ wt + hp, data = mtcars)

# Diagnostic plots
par(mfrow = c(2,2))
plot(model)

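Answer: Linear regression assumes (1) a linear relationship between the predictors and mpg, (2) independent errors, (3) constant error variance (homoscedasticity), and (4) approximately normal residuals. It is also sensitive to highly influential observations. I checked these with the four diagnostic plots from plot(model). In the Residuals vs Fitted plot, the points look fairly randomly scattered, so the linear relationship seems reasonable. The Normal Q-Q plot mostly follows a straight line, so the residuals appear approximately normal. The Scale-Location plot does not show a strong funnel shape, suggesting the variance is fairly constant. The Residuals vs Leverage plot does not show any extremely influential points. Independence cannot be judged from these plots, since it depends on how the data were collected. Overall, the assumptions appear to be reasonably met for this dataset, although with only 32 cars these visual checks are approximate.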
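The visual checks can be complemented with numeric ones. The sketch below uses only base R: a Shapiro-Wilk test on the residuals for normality and Cook's distance for influence. The 4/n cutoff is a common rule of thumb assumed here for illustration; it is not specified in the assignment.

```r
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Normality of residuals: a small p-value would suggest a departure from normality
sw <- shapiro.test(residuals(model))
sw$p.value

# Influence: Cook's distance for each car
cd <- cooks.distance(model)
cutoff <- 4 / nrow(mtcars)  # rule-of-thumb threshold (an assumption)

# Cars whose Cook's distance exceeds the threshold, largest first
sort(cd[cd > cutoff], decreasing = TRUE)
```

Any cars flagged this way warrant a closer look rather than automatic removal.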

2.4 Evaluate the model by reporting the MSE (Mean Square Error) 10pts

# Get predicted values
predicted <- predict(model)

# Calculate MSE
mse <- mean((mtcars$mpg - predicted)^2)

mse
## [1] 6.095242

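Answer: The Mean Squared Error (MSE) of the regression model is 6.095. This means that, on average, the squared difference between the actual and predicted mpg values is about 6.10, in squared mpg units. Taking the square root gives an RMSE of about 2.47 mpg, which is easier to read on the original scale. Because the predictions come from the same 32 cars used to fit the model, this is an in-sample error.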
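The MSE here divides the residual sum of squares by n, while the residual standard error reported by summary(model) divides by the residual degrees of freedom. A short sketch of how the two relate, using only base R:

```r
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

res <- residuals(model)

mse  <- mean(res^2)  # divides by n = 32
rmse <- sqrt(mse)    # back on the mpg scale

# summary(model) divides the same sum of squares by the residual df (29)
sigma_hat <- sqrt(sum(res^2) / df.residual(model))

c(mse = mse, rmse = rmse, sigma_hat = sigma_hat)
```

The residual standard error is slightly larger than the RMSE because it accounts for the three estimated coefficients.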

2.5 Evaluate the model by reporting the R^2 10pts

summary(model)$r.squared
## [1] 0.8267855

Answer: The R squared value of the regression model is 0.8268. This means that approximately 82.7% of the variability in miles per gallon (mpg) is explained by weight and horsepower. This indicates that the model has a strong fit.
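As a check on what this number measures, R^2 can be reproduced by hand as one minus the ratio of unexplained to total variation. The sketch below also computes the adjusted R^2 with the standard formula; the variable names are chosen for this example.

```r
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

sse <- sum(residuals(model)^2)                 # unexplained variation
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation in mpg
r2  <- 1 - sse / sst

# Adjusted R^2 penalizes the number of predictors (p = 2 here)
n <- nrow(mtcars)
p <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```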

2.6 Try adding an interaction term between wt and hp to the linear regression model fitted in question 2.1.

• Report the significance of the interaction term at the 0.05 level. 10pts

• Check how this interaction term influences the model’s performance in terms of R^2. 10pts

• How do you interpret your new model? 10pts

# Model with interaction term
model_int <- lm(mpg ~ wt * hp, data = mtcars)

summary(model_int)
## 
## Call:
## lm(formula = mpg ~ wt * hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0632 -1.6491 -0.7362  1.4211  4.5513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
## wt          -8.21662    1.26971  -6.471 5.20e-07 ***
## hp          -0.12010    0.02470  -4.863 4.04e-05 ***
## wt:hp        0.02785    0.00742   3.753 0.000811 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared:  0.8848, Adjusted R-squared:  0.8724 
## F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13
summary(model_int)$r.squared
## [1] 0.8847637

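Answer: I added an interaction term between wt and hp using wt * hp, which fits both main effects plus the wt:hp term. The interaction term is statistically significant (p = 0.000811 < 0.05), meaning the effect of weight on mpg depends on horsepower. The R² increased from 0.8268 to 0.8848 and the adjusted R² from 0.8148 to 0.8724, indicating that the interaction improves the model’s ability to explain variation in mpg. To interpret the new model: the wt and hp coefficients (−8.22 and −0.120) are now the slopes when the other variable equals zero, which is outside the range of the data, so they should not be read on their own. The positive interaction coefficient (0.0279) means the negative effect of weight weakens as horsepower rises. The slope for wt is −8.22 + 0.0279 × hp, which is about −5.4 mpg per 1000 lbs at hp = 100 and about −2.6 at hp = 200.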
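With an interaction, the slope for weight is no longer a single number. The sketch below evaluates it at the quartiles of hp and compares the two nested models with an F-test. The helper wt_slope is a hypothetical name introduced for this example.

```r
data(mtcars)
model     <- lm(mpg ~ wt + hp, data = mtcars)
model_int <- lm(mpg ~ wt * hp, data = mtcars)

b <- coef(model_int)

# Slope of mpg with respect to wt at a given horsepower:
# b_wt + b_wt:hp * hp
wt_slope <- function(hp) unname(b["wt"] + b["wt:hp"] * hp)

# Evaluate the slope at the quartiles of hp
hp_q <- quantile(mtcars$hp, c(0.25, 0.50, 0.75))
sapply(hp_q, wt_slope)

# Nested-model F-test: does the interaction add explanatory power?
cmp <- anova(model, model_int)
cmp
```

Since only one term is added, the F-test p-value matches the t-test p-value for wt:hp in summary(model_int).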

2.7 Now assume hp has outliers and apply the 5%/95% winsorization technique to fix it. Then fit a new linear regression model on the winsorized hp and the original wt. No interaction term is needed. Compare the performance of the newly fitted model with the model in question 2.1. What differences do you observe in R^2 and the coefficients? 10pts

# Find 5th and 95th percentiles
lower_hp <- quantile(mtcars$hp, 0.05)
upper_hp <- quantile(mtcars$hp, 0.95)

# Winsorize hp
mtcars$hp_win <- ifelse(mtcars$hp < lower_hp, lower_hp,
                        ifelse(mtcars$hp > upper_hp, upper_hp,
                               mtcars$hp))

# Fit the model on the original wt and the winsorized hp
model_win <- lm(mpg ~ wt + hp_win, data = mtcars)
summary(model_win)
## 
## Call:
## lm(formula = mpg ~ wt + hp_win, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8825 -1.6545 -0.0968  0.8367  5.7259 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.31722    1.56964  23.774  < 2e-16 ***
## wt          -3.58279    0.66427  -5.394  8.5e-06 ***
## hp_win      -0.03952    0.01059  -3.732 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.546 on 29 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8215 
## F-statistic: 72.34 on 2 and 29 DF,  p-value: 5.348e-12
summary(model_win)$r.squared
## [1] 0.8330309

Answer: After winsorizing horsepower at the 5% and 95% levels, I refitted the regression model using wt and hp_win. The R² increased slightly from 0.8268 to 0.8330 (adjusted R² from 0.8148 to 0.8215), showing a small improvement in model fit. The wt coefficient moved from −3.88 to −3.58 and the horsepower coefficient from −0.032 to −0.040. Both remain negative and statistically significant, and weight still has the larger t-statistic in absolute value (−5.39 versus −3.73). Overall, winsorizing hp made only a minor difference to the model.
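The same winsorization can be written as a small reusable function with pmin() and pmax(). This is a sketch, not the method required by the assignment; the winsorize name and its arguments are hypothetical and introduced only for this example.

```r
data(mtcars)

# Hypothetical helper: clamp x to its [lower, upper] quantiles
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

hp_win <- winsorize(mtcars$hp)

# Which cars were changed, and to what value?
changed <- hp_win != mtcars$hp
data.frame(car    = rownames(mtcars)[changed],
           hp     = mtcars$hp[changed],
           hp_win = hp_win[changed])
```

Listing the changed rows makes it easy to see how few observations the procedure actually touches.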

2.8 Check multicollinearity using vif() on the model fitted in question 2.1. What do you find? 10pts

library(car)
## Loading required package: carData
vif(model)
##       wt       hp 
## 1.766625 1.766625

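Answer: The VIF values for wt and hp are both about 1.77, well below the common rule-of-thumb threshold of 5. The two values are identical because the model has only two predictors, so each VIF depends on the same correlation between wt and hp. This indicates that there is no serious multicollinearity problem in the model.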
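The VIF for a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the others. A minimal base-R sketch that reproduces the value without the car package:

```r
data(mtcars)

# Auxiliary regression: one predictor explained by the other
aux    <- lm(wt ~ hp, data = mtcars)
r2_aux <- summary(aux)$r.squared

vif_manual <- 1 / (1 - r2_aux)
vif_manual

# With two predictors, r2_aux is the squared correlation between them
cor(mtcars$wt, mtcars$hp)^2
```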

2.9 (no credit, just think) Does an improved R^2 really improve the model predictability?

Answer: An improved R² does not necessarily mean the model has better predictive performance. R² only measures how well the model fits the data it was trained on. Adding predictors never lowers R², even when they do not improve predictions on new data. Therefore, prediction accuracy should be evaluated on data the model has not seen, for example with the MSE on a hold-out set or with cross-validation.
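One way to sketch this check is leave-one-out cross-validation. For a linear model it does not require refitting, because the leave-one-out residual equals e_i / (1 − h_ii), where h_ii is the leverage. The function names below are hypothetical and introduced for this example.

```r
data(mtcars)
model     <- lm(mpg ~ wt + hp, data = mtcars)
model_int <- lm(mpg ~ wt * hp, data = mtcars)

# Leave-one-out MSE via the identity e_(-i) = e_i / (1 - h_ii)
loocv_mse <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

# In-sample MSE for comparison
in_sample_mse <- function(fit) mean(residuals(fit)^2)

data.frame(
  model     = c("wt + hp", "wt * hp"),
  in_sample = c(in_sample_mse(model), in_sample_mse(model_int)),
  loocv     = c(loocv_mse(model), loocv_mse(model_int))
)
```

The leave-one-out error is always at least as large as the in-sample error, and the gap between the two gives a rough sense of how optimistic the in-sample fit is.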