We work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, the magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG?”
“Quantify the MPG difference between automatic and manual transmissions”
We take the mtcars data set and write up an analysis to answer these questions using exploratory data analysis and regression models.
data(mtcars)
head(mtcars,10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
We will analyse how the following variables affect mpg (miles per US gallon): cyl (number of cylinders), hp (gross horsepower), drat (rear axle ratio), wt (weight in 1000 lbs), qsec (1/4 mile time), vs (engine shape: 0 = V-shaped, 1 = straight) and am (transmission: 0 = automatic, 1 = manual).
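As a small convenience for later plots and summaries (a sketch, not used in the model fits below), the 0/1 transmission codes could be given readable labels in a separate factor, leaving the numeric mtcars columns untouched:
# Labeled copy of the transmission variable; kept separate so cor(mtcars) below
# still works on an all-numeric data frame
transmission <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
table(transmission)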
Comparing the group means shows that, on average, fuel economy (mpg) is higher for manual than for automatic transmissions. This is visualized in the boxplot in the Appendix.
aggregate(mpg~am, data = mtcars, mean)
## am mpg
## 1 0 17.14737
## 2 1 24.39231
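As a quick sanity check (not part of the original write-up), a Welch two-sample t-test could be used to confirm that this difference in group means is statistically significant before moving on to the regression models.
# Welch two-sample t-test of mpg by transmission type (a sketch)
t.test(mpg ~ am, data = mtcars)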
To quantify this result we fit a simple linear regression model, with mpg as the dependent variable and am as the independent variable.
single <- lm(mpg ~ am, data = mtcars)
summary(single)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Since the p-value (0.000285) is less than 0.05, we reject the null hypothesis that transmission type has no association with mpg. However, the adjusted R-squared is 0.3385, meaning the model explains only about a third of the variance in mpg, so we need to include other predictor variables.
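From the simple model alone, the am coefficient (7.245) is the estimated increase in mpg for a manual relative to an automatic transmission; as a sketch, a 95% confidence interval for this estimate could be obtained with confint.
# 95% confidence intervals for the coefficients of the simple model (sketch)
confint(single)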
The simple linear regression model is therefore not sufficient to explain the data; several other predictor variables must be taken into account. To identify the most relevant variables we look at the correlation of each variable with mpg using the cor function.
cor(mtcars)[1,]
## mpg cyl disp hp drat wt
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## qsec vs am gear carb
## 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
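To make the variable selection in the next step explicit, the correlations with mpg could be ranked by absolute value (a sketch, not part of the original output):
# Rank the other variables by the absolute strength of their correlation with mpg
sort(abs(cor(mtcars)[1, -1]), decreasing = TRUE)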
Using the variables most strongly correlated with mpg, namely wt (weight), hp (horsepower), disp (displacement) and cyl (number of cylinders), together with am, we fit a multivariable regression model.
multi <- lm(mpg ~ wt+hp+disp+cyl+am, data = mtcars)
summary(multi)
##
## Call:
## lm(formula = mpg ~ wt + hp + disp + cyl + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5952 -1.5864 -0.7157 1.2821 5.5725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.20280 3.66910 10.412 9.08e-11 ***
## wt -3.30262 1.13364 -2.913 0.00726 **
## hp -0.02796 0.01392 -2.008 0.05510 .
## disp 0.01226 0.01171 1.047 0.30472
## cyl -1.10638 0.67636 -1.636 0.11393
## am 1.55649 1.44054 1.080 0.28984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.505 on 26 degrees of freedom
## Multiple R-squared: 0.8551, Adjusted R-squared: 0.8273
## F-statistic: 30.7 on 5 and 26 DF, p-value: 4.029e-10
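Because the simple model is nested in the multivariable model, an anova comparison could be added as a sketch to confirm that the extra predictors (wt, hp, disp and cyl) significantly improve the fit:
# Compare the nested models: does adding wt, hp, disp and cyl improve the fit?
anova(single, multi)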
We plot the residuals to examine heteroskedasticity between the fitted and residual values, and to check for non-normality.
plot(multi)
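As a complement to the visual inspection, a formal normality check on the residuals could be added (a sketch, not part of the original analysis):
# Shapiro-Wilk test of the residuals of the multivariable model
shapiro.test(residuals(multi))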
library(ggplot2)
# Appendix plot: boxplot of mpg by transmission type, with the raw data
# points overlaid as jitter
g = ggplot(mtcars, aes(factor(am), mpg, fill = factor(am)))
g = g + geom_boxplot()
g = g + geom_jitter(position = position_jitter(width = .1, height = 0))
g = g + scale_fill_discrete(name = "Type", breaks = c("0", "1"),
                            labels = c("Automatic", "Manual"))
g = g + scale_x_discrete(breaks = c("0", "1"), labels = c("Automatic", "Manual"))
g = g + xlab("")
g