library(MASS)
data(mtcars)
#1. Create a linear regression model to predict mpg (miles per gallon) with just wt and hp variables in the dataset. 10pts
mtcars_lm_simple <- lm(mpg ~ wt + hp, data = mtcars)
summary(mtcars_lm_simple)
##
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.941 -1.600 -0.182  1.050  5.854
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
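#2. Interpret both coefficients, like what we did in class. 10pts
Both wt and hp have p-values below .01, so both are significant predictors of mpg. Holding hp constant, each one-unit increase in wt (1000 lbs) is associated with a decrease of about 3.88 mpg. Holding wt constant, each additional unit of hp is associated with a decrease of about 0.032 mpg. The intercept (37.23) is the predicted mpg when wt and hp are both 0, which lies outside the range of the data.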
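To make the interpretation concrete, the sketch below refits the same model on a fresh copy of mtcars and compares predictions for two hypothetical cars that differ only by one unit of wt. The names cars_a, fit_a, new_cars and the example values (wt = 3 and 4, hp = 150) are illustrative assumptions, not part of the assignment.

```r
# Refit the question 1 model on a fresh copy of the data
cars_a <- datasets::mtcars
fit_a <- lm(mpg ~ wt + hp, data = cars_a)

# Two hypothetical cars: same hp, wt differs by one unit (1000 lbs)
new_cars <- data.frame(wt = c(3, 4), hp = c(150, 150))
pred_a <- predict(fit_a, newdata = new_cars)

# The difference in predicted mpg equals the wt coefficient
wt_effect <- unname(pred_a[2] - pred_a[1])
wt_effect
coef(fit_a)["wt"]
```

The same comparison with wt held fixed and hp changed by one unit would reproduce the hp coefficient.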
#3. What assumptions are being made when we use linear regression? Are they met in this dataset? Use diagnostic plots and describe what you observe from the plots. 10pts
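Linear regression assumes that the relationship between the predictors and the response is linear, that the observations are independent of one another, that the residuals have constant variance (homoscedasticity), and that the residuals are approximately normally distributed.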
plot(mtcars_lm_simple)
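Judging by the diagnostic plots, the assumptions appear to be reasonably met: the Residuals vs Fitted plot shows points dispersed randomly around 0, which supports linearity and constant variance, and the Q-Q plot shows the residuals falling close to the reference line, which supports approximate normality.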
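A visual reading of the plots can be supplemented with a formal check. The sketch below draws the four diagnostic plots in one panel and runs a Shapiro-Wilk test on the residuals (null hypothesis: the residuals are normally distributed). The names cars_d, fit_d and sw are illustrative, and the test is offered as one possible supplement rather than the method used in class.

```r
# Refit the question 1 model on a fresh copy of the data
cars_d <- datasets::mtcars
fit_d <- lm(mpg ~ wt + hp, data = cars_d)

# Show the four default diagnostic plots in a 2x2 panel
old_par <- par(mfrow = c(2, 2))
plot(fit_d)
par(old_par)  # restore the previous plotting layout

# Shapiro-Wilk test on the residuals; a small p-value suggests non-normality
sw <- shapiro.test(residuals(fit_d))
sw$p.value
```

With only 32 observations the test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.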
#4. Evaluate the model by reporting the MSE (Mean Square Error) 10pts
mse <- mean(residuals(mtcars_lm_simple)^2)
mse
## [1] 6.095242
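The MSE above divides the residual sum of squares by n. The residual standard error in the summary output divides by the residual degrees of freedom instead, which is why 2.593^2 is not exactly 6.095. A minimal sketch of both versions, using illustrative names (cars_m, fit_m, mse_n, mse_df):

```r
# Refit the question 1 model on a fresh copy of the data
cars_m <- datasets::mtcars
fit_m <- lm(mpg ~ wt + hp, data = cars_m)

rss <- sum(residuals(fit_m)^2)       # residual sum of squares
n <- nrow(cars_m)                    # number of observations (32)

mse_n <- rss / n                     # divides by n, as in the answer above
mse_df <- rss / df.residual(fit_m)   # divides by n - 3 = 29 degrees of freedom

# sqrt(mse_df) matches the "Residual standard error" in summary()
c(mse_n = mse_n, mse_df = mse_df, rse = sqrt(mse_df))
```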
#5. Evaluate the model by reporting the R^2 10pts
summary(mtcars_lm_simple)$r.squared
## [1] 0.8267855
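R^2 is the share of the variance in mpg explained by the model, 1 - RSS/TSS. The sketch below recomputes it by hand as a check on the value reported by summary(); the names cars_r, fit_r, rss_r, tss_r and r2_manual are illustrative.

```r
# Refit the question 1 model on a fresh copy of the data
cars_r <- datasets::mtcars
fit_r <- lm(mpg ~ wt + hp, data = cars_r)

rss_r <- sum(residuals(fit_r)^2)                   # unexplained variation
tss_r <- sum((cars_r$mpg - mean(cars_r$mpg))^2)    # total variation in mpg

r2_manual <- 1 - rss_r / tss_r
r2_manual
```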
#6. Try adding interaction term between wt and hp to your linear regression model fitted in question 1.
mtcars_lm_interaction <- lm(mpg ~ wt * hp, data = mtcars)
summary(mtcars_lm_interaction)
##
## Call:
## lm(formula = mpg ~ wt * hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.0632 -1.6491 -0.7362  1.4211  4.5513
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
## wt          -8.21662    1.26971  -6.471 5.20e-07 ***
## hp          -0.12010    0.02470  -4.863 4.04e-05 ***
## wt:hp        0.02785    0.00742   3.753 0.000811 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
## F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
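#Report the significance of the interaction term at 0.05 level. 10pts
The p-value of wt:hp is 0.000811, which is below .05, so the interaction term is significant.
#Check how this interaction term influences the model’s performance in terms of R^2. 10pts
R^2 increased from 0.8268 to 0.8848 (adjusted R^2 from 0.8148 to 0.8724), so adding the interaction improved the model’s performance.
#How do you interpret your new model? 10pts
The interaction term is significant, which shows that the relationship between wt and mpg depends on hp. Because the wt:hp coefficient is positive (0.02785), the negative effect of wt on mpg becomes weaker as hp increases, and likewise the negative effect of hp becomes weaker as wt increases. In this model the wt and hp coefficients on their own describe the effect of each variable only when the other equals 0. This model explains more variance in mpg than the simpler model.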
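In an interaction model the slope of wt is not a single number: it equals the wt coefficient plus the wt:hp coefficient times hp. The sketch below evaluates that slope at a few hp values and compares the two nested models with an F-test. The hp values (100, 150, 200) and the names cars_i, fit_i, fit_add and wt_slope are illustrative assumptions.

```r
# Fit the interaction model on a fresh copy of the data
cars_i <- datasets::mtcars
fit_i <- lm(mpg ~ wt * hp, data = cars_i)
b <- coef(fit_i)

# Slope of wt at selected hp values: b_wt + b_wt:hp * hp
hp_values <- c(100, 150, 200)
wt_slope <- unname(b["wt"] + b["wt:hp"] * hp_values)
names(wt_slope) <- paste0("hp=", hp_values)
wt_slope

# F-test comparing the additive model with the interaction model
fit_add <- lm(mpg ~ wt + hp, data = cars_i)
model_cmp <- anova(fit_add, fit_i)
model_cmp
```

Because only one term is added, the F-test p-value is the same as the t-test p-value for wt:hp in the summary output.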
#7. And let’s assume hp has outliers, apply 5%/95% winsorization technique to fix it. Then fit a new linear regression model on the winsorized hp and original wt. No interaction term is needed. Compare the performance of the newly fitted model with the model in question 1. What differences do you observe in R^2 and the coefficients? 10pts
#Calculate the 5th and 95th percentiles for hp
lower_hp <- quantile(mtcars$hp, 0.05)
upper_hp <- quantile(mtcars$hp, 0.95)
#Winsorize into a new variable so the original hp is not overwritten
mtcars$hp_win <- pmin(pmax(mtcars$hp, lower_hp), upper_hp)
summary(mtcars[,c("mpg","hp_win")])
##       mpg            hp_win
##  Min.   :10.40   Min.   : 63.65
##  1st Qu.:15.43   1st Qu.: 96.50
##  Median :19.20   Median :123.00
##  Mean   :20.09   Mean   :144.23
##  3rd Qu.:22.80   3rd Qu.:180.00
##  Max.   :33.90   Max.   :253.55
#fit new regression model
mtcars_lm_win <- lm(mpg ~ wt + hp_win, data = mtcars)
summary(mtcars_lm_win)
##
## Call:
## lm(formula = mpg ~ wt + hp_win, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.8825 -1.6545 -0.0968  0.8367  5.7259
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.31722    1.56964  23.774  < 2e-16 ***
## wt          -3.58279    0.66427  -5.394  8.5e-06 ***
## hp_win      -0.03952    0.01059  -3.732 0.000824 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.546 on 29 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8215
## F-statistic: 72.34 on 2 and 29 DF, p-value: 5.348e-12
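R^2 increased slightly, from 0.8268 in the question 1 model to 0.833. The hp coefficient became more negative (-0.03177 to -0.03952) and the wt coefficient became less negative (-3.87783 to -3.58279), so after capping the extreme hp values the estimated per-unit effect of hp is stronger. The winsorized model explains slightly more variance than the original.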
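The comparison can be laid out side by side. The sketch below wraps the winsorization step in a small helper and tabulates the coefficients and R^2 of both models. The function winsorize() and the names cars_w, fit_orig, fit_win and comparison are hypothetical, written for this illustration; they are not from a package.

```r
# Hypothetical helper: cap values below/above the given quantiles
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- unname(quantile(x, probs = c(lower, upper), na.rm = TRUE))
  pmin(pmax(x, bounds[1]), bounds[2])
}

# Work on a fresh copy so the original hp column stays intact
cars_w <- datasets::mtcars
cars_w$hp_win <- winsorize(cars_w$hp)

fit_orig <- lm(mpg ~ wt + hp, data = cars_w)
fit_win <- lm(mpg ~ wt + hp_win, data = cars_w)

# One row per model: intercept, wt slope, hp slope, R^2
comparison <- rbind(
  original = c(unname(coef(fit_orig)), summary(fit_orig)$r.squared),
  winsorized = c(unname(coef(fit_win)), summary(fit_win)$r.squared)
)
colnames(comparison) <- c("intercept", "wt", "hp", "r2")
round(comparison, 4)
```

Keeping the winsorized values in a separate column means both models can be fitted from the same data frame.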
#8. Check multicollinearity using vif() on the model fitted in task 3.1. What do you find? 10pts
library(car)
## Loading required package: carData
car::vif(mtcars_lm_simple)
## wt hp
## 1.766625 1.766625
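Both VIF values are about 1.77, well below the common rule-of-thumb thresholds of 5 or 10. This indicates that multicollinearity is not a concern in this model: wt and hp are correlated, but not strongly enough to noticeably inflate the variance of the coefficient estimates.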
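The VIF for a predictor is 1 / (1 - R^2_j), where R^2_j comes from regressing that predictor on the other predictors. With only two predictors both VIFs are identical, because each auxiliary regression has the same R^2 (the squared correlation between wt and hp). A minimal base R sketch, using illustrative names (cars_v, r2_wt, vif_wt, r2_hp, vif_hp):

```r
# Fresh copy of the data with the original hp values
cars_v <- datasets::mtcars

# Auxiliary regression of wt on hp, then VIF = 1 / (1 - R^2)
r2_wt <- summary(lm(wt ~ hp, data = cars_v))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# Same calculation for hp regressed on wt
r2_hp <- summary(lm(hp ~ wt, data = cars_v))$r.squared
vif_hp <- 1 / (1 - r2_hp)

c(wt = vif_wt, hp = vif_hp)

# The shared R^2 is the squared correlation between the two predictors
cor(cars_v$wt, cars_v$hp)^2
```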