Multiple Regression

Discussion Board Post 12

Task:

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

The dataset I chose for my multiple regression model is the mtcars dataset.

Load Data
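
The loading code does not appear in this post. A minimal sketch, assuming the built-in mtcars data frame that ships with R's datasets package (no external file is assumed), might look like this:

```r
# mtcars ships with base R (datasets package), so no download is needed
data(mtcars)

# Quick structural check of the data frame: dimensions and column types
str(mtcars)
```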

Summarize the Data

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

The variables are:

  [, 1] mpg   Miles/(US) gallon
  [, 2] cyl   Number of cylinders
  [, 3] disp  Displacement (cu.in.)
  [, 4] hp    Gross horsepower
  [, 5] drat  Rear axle ratio
  [, 6] wt    Weight (1000 lbs)
  [, 7] qsec  1/4 mile time
  [, 8] vs    Engine (0 = V-shaped, 1 = straight)
  [, 9] am    Transmission (0 = automatic, 1 = manual)
  [,10] gear  Number of forward gears
  [,11] carb  Number of carburetors

Here I look at the summary of the dataset to identify variables to include in my model. Since the task is to include one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term, I chose:

  1. am as my dichotomous term
  2. cyl * hp as my interaction term (cyl takes the values 4, 6 and 8, so this is a quantitative vs. quantitative interaction rather than a dichotomous vs. quantitative one)
  3. disp as my quadratic term
plot(mtcars$mpg)

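# Note: inside a formula, disp^2 is read as formula crossing and reduces to disp,
# so this call fits a linear disp term; a squared term would need I(disp^2).
mtcars.lm.full <- lm(mpg ~ am + cyl*hp + disp^2, data = mtcars)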
summary(mtcars.lm.full)
## 
## Call:
## lm(formula = mpg ~ am + cyl * hp + disp^2, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3473 -1.4555 -0.5026  0.7588  6.2112 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.314518   5.996433   7.557 5.06e-08 ***
## am           2.886649   1.320967   2.185  0.03807 *  
## cyl         -2.632419   0.944821  -2.786  0.00983 ** 
## hp          -0.192397   0.059866  -3.214  0.00348 ** 
## disp        -0.014390   0.009921  -1.450  0.15889    
## cyl:hp       0.021298   0.007775   2.739  0.01097 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.541 on 26 degrees of freedom
## Multiple R-squared:  0.8509, Adjusted R-squared:  0.8222 
## F-statistic: 29.67 on 5 and 26 DF,  p-value: 5.828e-10

The resulting formula is

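\[ \widehat{mpg} = 45.314518 + 2.886649 \cdot am - 2.632419 \cdot cyl - 0.192397 \cdot hp - 0.014390 \cdot disp + 0.021298 \cdot cyl \cdot hp \]

Holding the other predictors fixed, a manual transmission (am = 1) is associated with about 2.89 more mpg than an automatic, and each additional cubic inch of displacement is associated with about 0.014 fewer mpg. Because of the cyl:hp interaction, the cyl and hp coefficients cannot be read on their own: the slope for hp is -0.192397 + 0.021298 * cyl (about -0.107 mpg per horsepower at 4 cylinders and about -0.022 at 8 cylinders), and the slope for cyl is -2.632419 + 0.021298 * hp. The intercept, 45.31, is the predicted mpg for an automatic car with cyl, hp and disp all equal to zero, which lies outside the data and has no practical meaning.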
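
Because the model contains a cyl:hp interaction, the effect of hp on mpg changes with the number of cylinders. The sketch below refits the model from this post and evaluates that slope at each cylinder count; the names fit, b, cyl_values and hp_slope are illustrative and not part of the original analysis.

```r
# Refit the model from the post (disp^2 inside a formula is equivalent to disp)
fit <- lm(mpg ~ am + cyl * hp + disp, data = mtcars)
b <- coef(fit)

# Marginal effect of hp on mpg at a given cylinder count:
#   d(mpg)/d(hp) = b_hp + b_cyl:hp * cyl
cyl_values <- c(4, 6, 8)
hp_slope <- b["hp"] + b["cyl:hp"] * cyl_values

# One row per cylinder count observed in mtcars
data.frame(cyl = cyl_values, hp_slope = unname(hp_slope))
```

The same idea applies to cyl: its slope is the cyl coefficient plus the interaction coefficient times hp.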

The p-value for disp (0.159) is above 0.05, so disp is not statistically significant in this model and is a candidate for removal.
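
The summary lists disp rather than a squared term because, in R's formula syntax, ^ denotes term crossing rather than arithmetic. The sketch below first shows that behaviour and then fits one possible specification that would contain a genuine quadratic term and a dichotomous vs. quantitative interaction. The choice of am * hp and the name fit_alt are assumptions made for illustration; this is not the model analysed in this post, and its results are not reported here.

```r
# In a formula, ^ means term crossing, so disp^2 collapses to plain disp
attr(terms(mpg ~ disp^2), "term.labels")

# Hypothetical alternative specification (not the model fitted in this post):
#   I(disp^2) protects the arithmetic square so it enters as a quadratic term
#   am * hp expands to am + hp + am:hp, a dichotomous vs. quantitative interaction
fit_alt <- lm(mpg ~ am * hp + disp + I(disp^2), data = mtcars)

# Confirm which terms actually entered the model
names(coef(fit_alt))
```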

par(mfrow=c(2,2))
plot(mtcars.lm.full)

plot(fitted(mtcars.lm.full), resid(mtcars.lm.full))

The residuals show no clear pattern against the fitted values, which suggests that a linear model may be reasonable.

qqnorm(resid(mtcars.lm.full))
qqline(resid(mtcars.lm.full))

The upper end of the QQ plot shows a heavy tail (the largest residual is 6.21, compared with a minimum of -3.35), indicating that the residuals are not nearly normal; therefore this model is not a good fit for this data set.
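
A formal normality test can complement the visual reading of the QQ plot. The sketch below applies the Shapiro-Wilk test from base R's stats package to the residuals of the model in this post; the names fit and sw are illustrative. With only 32 observations the test has limited power, so it is best read alongside the plot rather than in place of it.

```r
# Refit the model from the post (disp^2 inside a formula is equivalent to disp)
fit <- lm(mpg ~ am + cyl * hp + disp, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value is evidence against normality
sw <- shapiro.test(resid(fit))
sw$p.value
```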

Conclusion:

The appropriateness of a linear model is judged from its residuals. The residual plots above show no clear pattern against the fitted values, but the QQ plot has a heavy tail, so I would not say this model is a good fit. To improve the model, I would remove the disp variable, since the summary shows that it is not statistically significant.
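
One way to check whether dropping disp is justified is to compare the two nested models directly. The sketch below is a minimal example using base R only; the names fit_full and fit_reduced are illustrative and not part of the original analysis.

```r
# Full model as fitted in this post (disp^2 inside a formula is equivalent to disp)
fit_full <- lm(mpg ~ am + cyl * hp + disp, data = mtcars)

# Reduced model: the same terms with disp removed
fit_reduced <- update(fit_full, . ~ . - disp)

# Partial F-test of the reduced model against the full model
cmp <- anova(fit_reduced, fit_full)
cmp

# AIC for both models (lower is better)
AIC(fit_reduced, fit_full)
```

Because only one term is dropped, the F-test p-value matches the t-test p-value for disp in the summary, so the comparison mainly confirms that removing disp costs little in fit.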