Regression Model Project

Executive summary

The aim of this project is to analyze data on a collection of cars and explore the relationship between a set of variables and the cars' fuel economy in miles per gallon (MPG, the outcome). Specifically, we try to answer two questions:

Question 1: Is an automatic or manual transmission better for MPG?

Question 2: Quantify the MPG difference between automatic and manual transmissions.

Loading the data

For the purpose of this analysis, we will be using the mtcars dataset provided in R.

library(datasets)
data(mtcars)
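Before exploring, it can help to check the structure of the data and give the transmission indicator readable labels. The sketch below assumes the coding documented in the dataset's help page (am: 0 = automatic, 1 = manual). The data frame `mt` and the column `transmission` are names introduced here for illustration only; the rest of this report keeps using `mtcars` and `am` directly.

```r
library(datasets)
data(mtcars)

# Inspect dimensions and column types (every column is stored as numeric)
dim(mtcars)
str(mtcars)

# Hypothetical working copy with a labelled version of `am`
# (0 = automatic, 1 = manual, per ?mtcars)
mt <- mtcars
mt$transmission <- factor(mt$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual"))

# Number of cars in each transmission group
table(mt$transmission)
```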

Exploratory data analysis

Let's take a quick look at the mtcars data.

head(mtcars,3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

We can make a boxplot to see how miles per gallon varies with transmission type, and a matrix of scatter plots to see the pairwise relationships among all the variables in the mtcars dataset. Both plots are shown in the Appendix.

From the boxplot we can see that cars with a manual transmission tend to have higher mileage than cars with an automatic transmission. To understand this further, let's perform a hypothesis test.
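The visual impression from the boxplot can be backed up with simple group summaries. This is a minimal sketch using base R only; the variable names `grp_mean`, `grp_median` and `grp_n` are chosen here for illustration.

```r
library(datasets)
data(mtcars)

# Summaries of mpg by transmission type (0 = automatic, 1 = manual)
grp_mean   <- tapply(mtcars$mpg, mtcars$am, mean)
grp_median <- tapply(mtcars$mpg, mtcars$am, median)
grp_n      <- table(mtcars$am)

# One row per transmission group
cbind(n = as.vector(grp_n), mean = grp_mean, median = grp_median)

# Raw difference in mean mpg (manual minus automatic)
unname(grp_mean["1"] - grp_mean["0"])
```

The group means are the same quantities that the t-test in the next section reports as its sample estimates.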

Hypothesis Testing

We can do a hypothesis test to check whether transmission type (manual or automatic) has a statistically significant association with the mileage of a car. The null hypothesis is that the mean mpg of manual and automatic cars is the same.

t.test(mtcars$mpg ~ mtcars$am)
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

The 95% confidence interval for the difference in means (automatic minus manual), (-11.28, -3.21), lies entirely below 0, and the p-value (0.0014) is below 0.05, so we reject the null hypothesis that the group means are equal. On average, manual cars get about 7.2 mpg more than automatic cars (24.39 vs. 17.15). This comparison does not account for other differences between the cars, so to quantify the effect of transmission type we build a regression model.
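To make the test less of a black box, the Welch statistic can be reproduced by hand. The sketch below assumes the standard Welch formulas (unequal-variance standard error and the Welch-Satterthwaite degrees of freedom); the variable names are illustrative.

```r
library(datasets)
data(mtcars)

auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Per-group variance of the sample mean
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Test statistic: (mean of group 0 - mean of group 1) / standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value
p_val <- 2 * pt(-abs(t_stat), welch_df)

c(t = t_stat, df = welch_df, p = p_val)
```

These values should agree with the t, df and p-value printed by t.test above.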

Regression Models

Next we build our model. We start by examining the fit of the base model, in which mpg is the dependent variable and am is the only predictor.

fit <- lm(mpg~factor(am),data=mtcars)
summary(fit)$r.squared
## [1] 0.3597989

We can see that only about 36% of the variability in mpg is explained by our base model, so we need a better model. Combining the pairs plot (see the Appendix) with a correlation matrix of the variables in the mtcars dataset, we should be able to pick out variables that might be helpful in defining our regression model.

head(cor(mtcars),4)
##             mpg        cyl       disp         hp       drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.4487591  0.6587479
##            qsec         vs         am       gear       carb
## mpg   0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251
## cyl  -0.5912421 -0.8108118 -0.5226070 -0.4926866  0.5269883
## disp -0.4336979 -0.7104159 -0.5912270 -0.5555692  0.3949769
## hp   -0.7082234 -0.7230967 -0.2432043 -0.1257043  0.7498125
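The mpg/am entry of the correlation matrix ties back to the base model: for a linear regression with a single predictor, R-squared equals the squared correlation between the outcome and that predictor. A quick check, using illustrative variable names:

```r
library(datasets)
data(mtcars)

base_fit <- lm(mpg ~ factor(am), data = mtcars)

# R-squared reported by the model
r2_model <- summary(base_fit)$r.squared

# Squared Pearson correlation between mpg and the 0/1 transmission indicator
r2_cor <- cor(mtcars$mpg, mtcars$am)^2

c(model = r2_model, squared_cor = r2_cor)
all.equal(r2_model, r2_cor)
```

This identity holds only for the one-predictor case; with several predictors, R-squared has to be read from the model summary.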

Let's consider the variables cyl, disp, hp and wt, in addition to am, as candidates for our regression model; these four have the strongest correlations with mpg. To find the best-fitting model we try several, starting with the base model (mpg on am alone) and then adding the variables that seem to have the most impact on mpg. The variables cyl and disp are strongly correlated with each other (0.90), which suggests collinearity, so we build models that exclude both, include only cyl, and include both.

fit <- lm(mpg~factor(am),data=mtcars)
fit1 <- lm(mpg~factor(am)+wt+hp,data=mtcars)
fit2 <- lm(mpg~factor(am)+cyl+wt+hp,data=mtcars)
fit3 <- lm(mpg~factor(am)+cyl+disp+wt+hp,data=mtcars)
anova(fit,fit1,fit2,fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt + hp
## Model 3: mpg ~ factor(am) + cyl + wt + hp
## Model 4: mpg ~ factor(am) + cyl + disp + wt + hp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 180.29  2    540.61 43.0841 5.576e-09 ***
## 3     27 170.00  1     10.29  1.6407    0.2115    
## 4     26 163.12  1      6.88  1.0963    0.3047    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
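The F statistic in row 2 of the table can be reproduced by hand, which shows what the test actually compares. One detail worth noting: when anova is given several lm fits, the denominator is the residual mean square of the largest model in the table (fit3 here), not of fit1. The helper `rss` and the other variable names below are introduced for illustration.

```r
library(datasets)
data(mtcars)

fit  <- lm(mpg ~ factor(am), data = mtcars)
fit1 <- lm(mpg ~ factor(am) + wt + hp, data = mtcars)
fit3 <- lm(mpg ~ factor(am) + cyl + disp + wt + hp, data = mtcars)

# Residual sum of squares of a fitted model
rss <- function(m) sum(resid(m)^2)

# Numerator: drop in RSS from fit to fit1, per added parameter
df_added <- df.residual(fit) - df.residual(fit1)
num <- (rss(fit) - rss(fit1)) / df_added

# Denominator: residual mean square of the largest model in the table
den <- rss(fit3) / df.residual(fit3)

f_stat <- num / den
p_val  <- pf(f_stat, df_added, df.residual(fit3), lower.tail = FALSE)

c(F = f_stat, p = p_val)
```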

With a p-value of 5.6 × 10^-9, we can conclude that the model fit1 is significantly better than the base model, which uses only am as the predictor. Adding cyl (p = 0.21) and then disp (p = 0.30) does not significantly improve on fit1.
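One way to quantify the collinearity among the candidate predictors is the variance inflation factor (VIF), defined as 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others. The sketch below computes it with base R only; `vif_one` is a helper written for this example, not a function from any package, and am is treated as a numeric 0/1 variable.

```r
library(datasets)
data(mtcars)

# VIF for a single predictor: regress it on the remaining predictors
# and convert the resulting R^2 into 1 / (1 - R^2)
vif_one <- function(target, others, data) {
  f <- reformulate(others, response = target)
  1 / (1 - summary(lm(f, data = data))$r.squared)
}

# Predictors of the largest model considered (fit3)
predictors <- c("am", "cyl", "disp", "wt", "hp")

vifs <- sapply(predictors, function(p) {
  vif_one(p, setdiff(predictors, p), mtcars)
})

round(vifs, 2)
```

A VIF close to 1 means a predictor is nearly uncorrelated with the others. Larger values mean its coefficient is estimated less precisely, which is consistent with extra correlated predictors adding little to the fit.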

summary(fit1)
## 
## Call:
## lm(formula = mpg ~ factor(am) + wt + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## factor(am)1  2.083710   1.376420   1.514 0.141268    
## wt          -2.878575   0.904971  -3.181 0.003574 ** 
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

The residual plots (shown in the Appendix) suggest that the residuals are approximately normally distributed. This model explains approximately 84% of the variability in mpg, compared with 36% for the base model. The coefficient of am suggests that, holding wt and hp constant, a manual transmission gives 2.084 more miles per gallon than an automatic transmission. Similarly, holding the other variables constant, every 1000 lb increase in weight reduces mpg by 2.88, and every 100-unit increase in horsepower reduces mpg by 3.75. However, the p-value for the am coefficient is 0.141, so the effect of transmission type is not statistically significant at the 5% level once weight and horsepower are accounted for.
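The uncertainty in the transmission effect is easier to read from a confidence interval than from the p-value alone. The sketch below refits the same model and uses confint; since the p-value for am is above 0.05, the 95% interval for that coefficient should include zero.

```r
library(datasets)
data(mtcars)

fit1 <- lm(mpg ~ factor(am) + wt + hp, data = mtcars)

# 95% confidence intervals for all coefficients
ci <- confint(fit1, level = 0.95)

# Interval for the manual-vs-automatic difference, adjusted for wt and hp
ci["factor(am)1", ]

# Effects rescaled to the units used in the text
coef(fit1)["wt"]        # change in mpg per 1000 lb (wt is recorded in 1000 lb)
coef(fit1)["hp"] * 100  # change in mpg per 100 hp
```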

Appendix

library(ggplot2)
p1 <- ggplot(mtcars, aes(factor(am), mpg)) +
  geom_boxplot(aes(fill = factor(am))) +
  ggtitle("MPG vs Transmission type") +
  xlab("Transmission type (Automatic=0, Manual=1)") + ylab("Miles per Gallon (MPG)")
plot(p1)

[Figure: boxplot of MPG by transmission type]

pairs(mtcars)

[Figure: scatter plot matrix of all variables in mtcars]

par(mfrow=c(2,2))
plot(fit1)

[Figure: diagnostic plots for fit1]
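The diagnostic plots can be complemented with a numeric check. This is a minimal sketch, not part of the original analysis: it applies the Shapiro-Wilk test to the residuals of fit1 (null hypothesis: the residuals are normally distributed) and lists the observations with the highest leverage.

```r
library(datasets)
data(mtcars)

fit1 <- lm(mpg ~ factor(am) + wt + hp, data = mtcars)

# Shapiro-Wilk test on the residuals; a large p-value gives no evidence
# against normality, while a small one suggests a departure from it
sw <- shapiro.test(resid(fit1))
sw$p.value

# The three cars with the largest leverage (hat values)
head(sort(hatvalues(fit1), decreasing = TRUE), 3)
```

With only 32 observations the test has limited power, so it is best read together with the Q-Q plot rather than instead of it.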