Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
The dataset I decided to build a multiple regression model for is the mtcars dataset.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
The variables are:
[, 1] mpg   Miles/(US) gallon
[, 2] cyl   Number of cylinders
[, 3] disp  Displacement (cu.in.)
[, 4] hp    Gross horsepower
[, 5] drat  Rear axle ratio
[, 6] wt    Weight (1000 lbs)
[, 7] qsec  1/4 mile time
[, 8] vs    Engine (0 = V-shaped, 1 = straight)
[, 9] am    Transmission (0 = automatic, 1 = manual)
[,10] gear  Number of forward gears
[,11] carb  Number of carburetors
Here I look at the summary of the dataset and try to identify variables to include in my model. Since the task is to include one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term, I chose:
plot(mtcars$mpg)
mtcars.lm.full <- lm(mpg ~ am + cyl*hp + disp^2, data= mtcars)
summary(mtcars.lm.full)
##
## Call:
## lm(formula = mpg ~ am + cyl * hp + disp^2, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3473 -1.4555 -0.5026 0.7588 6.2112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.314518 5.996433 7.557 5.06e-08 ***
## am 2.886649 1.320967 2.185 0.03807 *
## cyl -2.632419 0.944821 -2.786 0.00983 **
## hp -0.192397 0.059866 -3.214 0.00348 **
## disp -0.014390 0.009921 -1.450 0.15889
## cyl:hp 0.021298 0.007775 2.739 0.01097 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.541 on 26 degrees of freedom
## Multiple R-squared: 0.8509, Adjusted R-squared: 0.8222
## F-statistic: 29.67 on 5 and 26 DF, p-value: 5.828e-10
The resulting formula is
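\[ mpg = 45.314518 + 2.886649 \cdot am - 2.632419 \cdot cyl - 0.192397 \cdot hp - 0.014390 \cdot disp + 0.021298 \cdot (cyl \times hp) \]
The intercept (45.31) is the predicted mpg for an automatic car with cyl, hp and disp all equal to zero, so it has no practical meaning on its own. Holding the other variables fixed, a manual transmission (am = 1) is associated with 2.89 more mpg than an automatic, and each additional cubic inch of displacement is associated with 0.014 fewer mpg. Because of the cyl:hp interaction, the cyl and hp coefficients cannot be read separately: the slope of hp is -0.192 + 0.0213 * cyl (about -0.107 mpg per horsepower at 4 cylinders and about -0.022 at 8 cylinders), and the slope of cyl is -2.632 + 0.0213 * hp. Note that inside an R formula disp^2 is read as plain disp, so the fitted model contains only a linear disp term, as the coefficient table shows; a true quadratic term would have to be written I(disp^2). Also, cyl:hp is an interaction between two quantitative variables rather than between a dichotomous and a quantitative one.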
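Since the model contains a cyl:hp interaction, the effect of hp on mpg depends on the number of cylinders. The following is a minimal sketch of how those conditional slopes could be pulled out of the fitted coefficients; the names `fit`, `b`, `cyl_values` and `hp_slope` are illustrative and not part of the original analysis.

```r
# Refit the model from the text so this block runs on its own.
fit <- lm(mpg ~ am + cyl * hp + disp^2, data = mtcars)
b <- coef(fit)

# Conditional slope of hp at a given cylinder count:
#   d(mpg)/d(hp) = b["hp"] + b["cyl:hp"] * cyl
cyl_values <- c(4, 6, 8)
hp_slope <- b[["hp"]] + b[["cyl:hp"]] * cyl_values

# One row per cylinder count, with the matching hp slope.
data.frame(cyl = cyl_values, hp_slope = hp_slope)
```

The same approach gives the slope of cyl at a chosen horsepower by swapping the roles of the two variables.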
The p-value for disp (0.159) shows that it is not statistically significant at the 0.05 level, so it is a candidate for removal from the model.
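One way to check whether dropping disp is reasonable is to fit the reduced model and compare it with the full one. This is only a sketch: the object names `full`, `reduced` and `cmp` are hypothetical, and disp is written without the exponent because that is how R treats disp^2 in a formula.

```r
# Full model as fitted in the text (disp enters linearly).
full <- lm(mpg ~ am + cyl * hp + disp, data = mtcars)

# Reduced model: same terms, with disp removed.
reduced <- update(full, . ~ . - disp)

# Partial F-test for the single dropped term.
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared of both models, side by side.
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

With a single dropped term, the F-test p-value matches the t-test p-value reported for disp in the summary table.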
par(mfrow=c(2,2))
plot(mtcars.lm.full)
plot(fitted(mtcars.lm.full), resid(mtcars.lm.full))
The residuals show no clear pattern against the fitted values, which suggests that a linear model may be reasonable.
qqnorm(resid(mtcars.lm.full))
qqline(resid(mtcars.lm.full))
There is a heavy tail at the upper end of the QQ plot, indicating that the residuals are not nearly normal; this suggests the model is not a good fit for this data set.
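The visual impression from the QQ plot can be supplemented with a formal normality test. The sketch below uses the Shapiro-Wilk test from base R; choosing this particular test is an assumption made here, not something the original analysis did, and the names `fit` and `sw` are illustrative.

```r
# Refit the model from the text so this block runs on its own.
fit <- lm(mpg ~ am + cyl * hp + disp^2, data = mtcars)

# Shapiro-Wilk test on the residuals.
# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(resid(fit))
sw
```

A small p-value would support the reading of the QQ plot; with only 32 observations the test has limited power, so it is best read alongside the plot rather than in place of it.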
The adequacy of a linear model is judged from its residuals. In the
residual plots above there is no clear pattern against the fitted
values, but the QQ plot has a heavy tail. I would not say this model
is a good fit. To improve this model, I would remove the disp
variable, since the summary shows that it is not statistically
significant.
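A further refinement would be to write the formula so that it actually contains the quadratic term and the dichotomous vs. quantitative interaction that the task asks for. The sketch below is one possible specification, not the model analysed above: pairing am with hp for the interaction and squaring disp are assumptions made for illustration, and `alt` is a hypothetical name.

```r
# am * hp expands to am + hp + am:hp (dichotomous x quantitative interaction).
# I(disp^2) protects the exponent so R computes the square of disp
# instead of reading ^ as a formula operator.
alt <- lm(mpg ~ am * hp + disp + I(disp^2), data = mtcars)

# Coefficient table for the alternative model.
summary(alt)
```

The residual plots and QQ plot would need to be redone for this model before drawing any conclusion about its fit.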