Johns Hopkins | Coursera - Regression Modelling Project

Saurabh Sindwani

05/05/2017

Executive Summary

From our analysis of the mtcars dataset, cars with manual transmissions achieve, on average, more miles per gallon (mpg) than cars with automatic transmissions. In a linear regression model with transmission type as the only explanatory variable, a change from automatic to manual transmission is associated with an increase of 7.245 mpg. However, transmission type alone explains only 36% of the variability in mpg.

A linear regression model that keeps only weight, horsepower and the number of cylinders as predictors explains 86% of the variability in mpg.

Report

The Data

We investigate the relationship between miles per gallon (mpg, a numerical variable) and a set of independent variables. The independent variables and their classes are:

cyl: number of cylinders (factor -> 4,6,8)
disp: displacement (cu.in.) (numerical)
hp: gross horsepower (numerical)
drat: rear axle ratio (numerical)
wt: weight (1000 pounds) (numerical)
qsec: 1/4 mile time (numerical)
vs: V/S, V-engine or Straight engine (factor -> V,S)
am: transmission type (factor -> automatic, manual)
gear: number of forward gears (factor -> 3,4,5)
carb: number of carburetors (factor -> 1,2,3,4,6,8)

Data Processing

Let’s convert the categorical variables, which mtcars stores as numeric, to factors.

# Convert the 5 categorical variables from numeric to factor
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am, levels = c(0,1), labels= c("automatic", "manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
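
As a quick sanity check on the conversion above, the following sketch confirms that each converted column is a factor and lists its levels. It works on a separate copy named `cars` (a name chosen here for illustration, not part of the original analysis) so that it does not disturb the `mtcars` object used in the rest of the report.

```r
# Work on a fresh copy so the check does not depend on earlier state.
# `cars` and `factor_cols` are illustrative names, not from the original report.
cars <- datasets::mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)

# factor() sorts the numeric codes, so the levels of am are "0", "1";
# rename them in that order: 0 -> automatic, 1 -> manual.
levels(cars$am) <- c("automatic", "manual")

# Confirm every converted column is a factor and inspect its levels.
sapply(cars[factor_cols], is.factor)
lapply(cars[factor_cols], levels)

# Number of cars per transmission type.
table(cars$am)
```

Listing the levels is useful because a factor only contains the values that actually occur in the data, which may be fewer than the full numeric range.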

Exploratory Data Analysis

Are there differences in MPG because of the transmission type?

library(ggplot2)
ggplot(mtcars, aes(x = am, y = mpg, color = am)) +
  geom_boxplot() +
  geom_jitter() +
  ggtitle("Miles Per Gallon by Transmission Type") +
  xlab("Transmission") +
  ylab("Miles Per Gallon")

The boxplot above suggests that there are differences.

Statistical Confirmation

Let’s test whether the difference is statistically significant with a two-sample t-test.

test <- t.test(mpg ~ am, data = mtcars)
test

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual 
##                17.14737                24.39231

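The difference in mean mpg between the two transmission types is statistically significant: the p-value (0.0014) is less than 0.05, and the 95% confidence interval for the difference excludes zero. Note that the test establishes a difference between the two groups of cars, not that transmission type by itself causes it.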
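
The individual numbers in the printed test can also be pulled out of the returned object, which is convenient when quoting them in text. The sketch below is illustrative only; `cars` and `welch` are names introduced here, and the components used (`estimate`, `conf.int`, `p.value`) are standard elements of the `htest` object that `t.test` returns.

```r
# Fresh copy with am as a labelled factor (illustrative names).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Welch two-sample t-test (unequal variances is the default in t.test).
welch <- t.test(mpg ~ am, data = cars)

welch$estimate   # group means: automatic, then manual
welch$conf.int   # 95% CI for mean(automatic) - mean(manual)
welch$p.value    # two-sided p-value

# The difference is automatic minus manual, hence the negative sign.
mean_diff <- unname(welch$estimate[1] - welch$estimate[2])
mean_diff
```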

Model Building

Let’s build a model that predicts mpg using only am (transmission type).

model1 <- lm(mpg ~ am, data = mtcars)
summary(model1)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The model quantifies the difference in mean mpg between automatic and manual transmissions: the intercept (17.147) is the mean mpg for automatic cars, and the ammanual coefficient says that manual cars average 7.245 mpg more. The R squared value is approximately 0.36, i.e. the model explains only 36% of the variability in mpg, so we should try other models.
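
To make the interpretation of the coefficients concrete, here is a minimal sketch showing that, with a single two-level factor as predictor, the intercept equals the mean of the reference group and the slope equals the difference in group means. It also shows that the slope's t value matches a pooled-variance t-test, which is an addition for illustration (the report itself uses the Welch test). The names `cars`, `fit`, `group_means`, `diff_means` and `pooled` are introduced here.

```r
# Fresh copy with am as a labelled factor (illustrative names).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept = mean mpg of the reference level (automatic).
all.equal(unname(coef(fit)["(Intercept)"]), unname(group_means["automatic"]))

# Slope = mean(manual) - mean(automatic).
diff_means <- unname(group_means["manual"] - group_means["automatic"])
all.equal(unname(coef(fit)["ammanual"]), diff_means)

# The regression assumes equal variances, so its t value for ammanual
# agrees (up to sign) with the pooled-variance t-test, not the Welch test.
pooled <- t.test(mpg ~ am, data = cars, var.equal = TRUE)
abs(unname(pooled$statistic))
summary(fit)$coefficients["ammanual", "t value"]
```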

Let’s look at the pair plot to find relationships between the various variables.

library(GGally)

# ggpairs allows categorical variables, so we will use
# all the variables for simplicity
ggpairs(mtcars, aes(color=am))

We notice that disp, wt, hp and drat are highly correlated with mpg. The plot of mpg against cyl also suggests a strong negative association: mpg drops as the number of cylinders increases. Several of these predictors are also strongly correlated with each other, so multicollinearity could be a problem. We should keep that in mind and use PCA if need be.
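
The visual impressions from the pair plot can be backed with numbers. The sketch below computes the correlation of each candidate predictor with mpg, the correlations among the predictors, and variance inflation factors calculated by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. It is an illustration under two assumptions: cyl is kept numeric here (unlike in the models, where it is a factor) so that `cor()` can handle it, and the names `cars`, `predictors` and `vif_by_hand` are introduced for this example.

```r
# Fresh numeric copy; cyl is deliberately left numeric for cor().
cars <- datasets::mtcars
predictors <- c("disp", "wt", "hp", "drat", "cyl")

# Pearson correlation of each candidate predictor with mpg.
round(cor(cars[predictors], cars$mpg), 2)

# Pairwise correlations among the predictors themselves.
round(cor(cars[predictors]), 2)

# Variance inflation factor by hand: regress each predictor on the
# remaining ones and compute 1 / (1 - R^2).
vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = cars))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 1)
```

A common rule of thumb treats a VIF well above 5 or 10 as a sign that a predictor is largely explained by the others, though the threshold is a convention rather than a test.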

Let’s now build a model by using variables disp, wt, hp, drat, cyl.

model2 <- lm(mpg ~ disp+wt+hp+drat+cyl, data = mtcars)
summary(model2)

## 
## Call:
## lm(formula = mpg ~ disp + wt + hp + drat + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9344 -1.2739 -0.3381  1.1240  5.3297 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 32.132223   6.279735   5.117 2.76e-05 ***
## disp         0.005026   0.013122   0.383  0.70495    
## wt          -3.263607   1.096444  -2.977  0.00639 ** 
## hp          -0.026771   0.013311  -2.011  0.05521 .  
## drat         0.902752   1.375897   0.656  0.51774    
## cyl6        -3.100793   1.580532  -1.962  0.06100 .  
## cyl8        -3.069259   3.030280  -1.013  0.32083    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 25 degrees of freedom
## Multiple R-squared:  0.8602, Adjusted R-squared:  0.8267 
## F-statistic: 25.64 on 6 and 25 DF,  p-value: 1.545e-09

This model explains 86% of the variability in mpg, with an adjusted R squared of 83%. This is a much better model than the first one, which used only am as a predictor. In model 2 only wt is significant at the 0.05 level, while hp and cyl are marginal (p < 0.1) and disp and drat are clearly not significant. Let’s create a final model (#3) that keeps wt, hp and cyl and drops disp and drat. We will then compare models 2 and 3 and choose between them.

model3 <- lm(mpg ~ wt+hp+cyl, data = mtcars)
summary(model3)

## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2612 -1.0320 -0.3210  0.9281  5.3947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.84600    2.04102  17.563 2.67e-16 ***
## wt          -3.18140    0.71960  -4.421 0.000144 ***
## hp          -0.02312    0.01195  -1.934 0.063613 .  
## cyl6        -3.35902    1.40167  -2.396 0.023747 *  
## cyl8        -3.18588    2.17048  -1.468 0.153705    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.44 on 27 degrees of freedom
## Multiple R-squared:  0.8572, Adjusted R-squared:  0.8361 
## F-statistic: 40.53 on 4 and 27 DF,  p-value: 4.869e-11

The R squared is 86% and the adjusted R squared is 84%, very similar to model 2 (86% and 83%). Let’s compare models 2 and 3 next.

anova(model2, model3)

## Analysis of Variance Table
## 
## Model 1: mpg ~ disp + wt + hp + drat + cyl
## Model 2: mpg ~ wt + hp + cyl
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     25 157.42                           
## 2     27 160.78 -2   -3.3614 0.2669 0.7679

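The F-test gives a p-value of 0.77, much greater than 0.05, so we cannot reject the hypothesis that the coefficients of disp and drat are both zero: dropping them does not significantly worsen the fit. So, we will go with the simpler model, which is model #3.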
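
For readers who want to see where the F value in the table comes from, the following sketch reproduces it by hand from the residual sums of squares of the two nested models. The names `cars`, `full`, `reduced`, `f_stat` and `p_value` are introduced here for illustration.

```r
# Fresh copy with cyl as a factor, matching the models in the report.
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)

full    <- lm(mpg ~ disp + wt + hp + drat + cyl, data = cars)
reduced <- lm(mpg ~ wt + hp + cyl, data = cars)

rss_full    <- sum(resid(full)^2)      # residual sum of squares, larger model
rss_reduced <- sum(resid(reduced)^2)   # residual sum of squares, smaller model

# Number of coefficients dropped (disp and drat).
df_dropped <- df.residual(reduced) - df.residual(full)

# F = [(RSS_reduced - RSS_full) / df_dropped] / [RSS_full / df_full]
f_stat <- ((rss_reduced - rss_full) / df_dropped) /
  (rss_full / df.residual(full))

# Upper-tail probability under the F distribution.
p_value <- pf(f_stat, df_dropped, df.residual(full), lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

With the values from the table this is (3.3614 / 2) / (157.42 / 25), which is approximately 0.267.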

Finally, let’s draw the diagnostic plots for the final model.

par(mfrow = c(2,2))
plot(model3)

There do not appear to be any serious problems in these plots: the residuals appear randomly scattered around zero, the standardized residuals appear approximately normally distributed, and there are no highly influential points. The plots label the three observations with the largest residuals (plot.lm does this by default); these can be evaluated further.
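
The points flagged in the plots can be examined numerically. The sketch below lists the observations with the largest standardized residuals and computes leverage and Cook's distance for the final model. The cutoff of 4 / n for Cook's distance is a common rule of thumb and an assumption of this example, not something used in the original analysis; the names `cars`, `fit`, `std_res`, `lev` and `cooks` are also introduced here.

```r
# Refit the final model on a fresh copy (illustrative names).
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ wt + hp + cyl, data = cars)

std_res <- rstandard(fit)       # standardized residuals
lev     <- hatvalues(fit)       # leverage of each observation
cooks   <- cooks.distance(fit)  # overall influence of each observation

# The three cars with the largest absolute standardized residuals.
head(sort(abs(std_res), decreasing = TRUE), 3)

# Observations above the 4 / n rule-of-thumb cutoff for Cook's distance.
cooks[cooks > 4 / nrow(cars)]

# Formal check of residual normality to complement the Q-Q plot.
shapiro.test(resid(fit))
```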

Conclusion

We have determined that there is a difference in mpg between transmission types and have quantified that difference: manual cars average 7.245 mpg more than automatic cars. However, transmission type on its own is not a very good explanatory variable for mpg (R squared of 36%); weight, horsepower and number of cylinders together explain far more of the variability (R squared of 86%).
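
One way to probe this conclusion further, offered here only as a sketch and not as part of the original analysis, is to add am to the final model and check whether transmission type still contributes once weight, horsepower and cylinders are accounted for. The names `cars`, `base_fit` and `with_am` are introduced for this example.

```r
# Fresh copy with the two factors used below (illustrative names).
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Final model from the report, then the same model with am added.
base_fit <- lm(mpg ~ wt + hp + cyl, data = cars)
with_am  <- update(base_fit, . ~ . + am)

# Estimate, standard error, t value and p-value for the manual effect
# after adjusting for the other predictors.
summary(with_am)$coefficients["ammanual", ]

# Nested-model F-test: does adding am significantly improve the fit?
anova(base_fit, with_am)
```

If the adjusted coefficient for ammanual is small relative to its standard error, that would support the view that much of the raw 7.245 mpg gap reflects differences in weight and engine size between the two groups of cars.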