Regression Models Course Project

We examined the mtcars dataset, a collection of cars, to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. We were particularly interested in the following two questions: is an automatic or a manual transmission better for MPG, and how large is the MPG difference between the two?

We found that although cars with manual transmission have better mpg, it is uncertain whether this is an effect of the transmission itself or whether those cars simply tend to be lighter and have fewer cylinders.

Overview

Variable  Description
mpg       Miles/(US) gallon
cyl       Number of cylinders
disp      Displacement (cu.in.)
hp        Gross horsepower
drat      Rear axle ratio
wt        Weight (1000 lbs)
qsec      1/4 mile time (seconds)
vs        Engine shape (0 = V-shaped, 1 = straight)
am        Transmission (0 = automatic, 1 = manual)
gear      Number of forward gears
carb      Number of carburetors

dim(mtcars)
## [1] 32 11
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
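
As a quick orientation, the split between transmission types can be tabulated directly. This is a small sketch rather than part of the original analysis; the labels follow the coding of am in the variable table, and the names transmission and counts are illustrative.

```r
# Label the 0/1 coding of am for readability (0 = automatic, 1 = manual).
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Number of cars in each transmission group.
counts <- table(transmission)
print(counts)
```

The groups are unequal in size, which is one reason a Welch test (which does not assume equal variances) is a reasonable default for comparing them.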

Exploratory analysis

Fig1 (see Appendix) suggests that mpg is higher for cars with manual transmission. Let’s check this with a t-test.

t.test(mpg ~ am, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

As we can see, the t-test confirms that the difference in mean mpg between the two transmission types is significant: with a p-value of 0.001374 we reject the null hypothesis that there is no difference between them.
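
To make the test less of a black box, the Welch statistic can be recomputed by hand. The following is a minimal sketch using the standard Welch formulas; the variable names (auto, manual, v_auto, v_manual, t_stat, df, p_value) are illustrative and not taken from the original report.

```r
# Split mpg by transmission type (0 = automatic, 1 = manual).
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Squared standard error contributed by each group.
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means divided by its standard error.
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation of the degrees of freedom.
df <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t = t_stat, df = df, p = p_value))
```

The negative sign of the statistic only reflects the order of the groups (automatic minus manual).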

But is am the only variable that affects mpg?

From the correlation plot (Fig2) we can see strong negative correlations between mpg and cyl, disp, hp and wt, and positive correlations between mpg and drat and am. We can also see strong negative correlations between am and wt, disp and cyl, and a positive correlation between am and drat.
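
The numbers behind such a plot can be listed directly. This sketch assumes Pearson correlations on the raw numeric columns, which may differ from the method used to draw Fig2; mpg_cor and am_cor are illustrative names.

```r
# Pearson correlation of every variable with mpg,
# sorted from most negative to most positive.
mpg_cor <- sort(cor(mtcars)[, "mpg"])
print(round(mpg_cor, 2))

# The same for am, to see which variables move together with transmission type.
am_cor <- sort(cor(mtcars)[, "am"])
print(round(am_cor, 2))
```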

Let’s check the influence of these variables with linear regression.

Before fitting the models, we convert cyl, vs, am, gear and carb to factors.

Single variable model

single <- lm(mpg ~ am, transform(mtcars, 
                                 cyl  = as.factor(cyl),
                                 vs   = as.factor(vs),
                                 am   = as.factor(am),
                                 gear = as.factor(gear),
                                 carb = as.factor(carb)))
summary(single)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am1          7.244939   1.764422  4.106127 2.850207e-04
summary(single)$r.squared
## [1] 0.3597989

The single variable model shows a strong influence of transmission type on mpg (+7.2 mpg for manual transmission). We can also see that this model explains only 36% of the variance.
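
With a single two-level factor as predictor, the regression simply reproduces the group means: the intercept is the mean mpg of automatic cars and the am1 coefficient is the difference between the two means. A short sketch to check this (the names cars, fit and group_means are illustrative):

```r
# Treat am as a factor, as in the report.
cars <- transform(mtcars, am = as.factor(am))
fit  <- lm(mpg ~ am, data = cars)

# Mean mpg per transmission type (names "0" and "1").
group_means <- tapply(cars$mpg, cars$am, mean)

intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["am1"]]

# Intercept = automatic mean; intercept + slope = manual mean.
print(c(automatic = intercept, manual = intercept + slope))
print(group_means)
```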

Selecting the best model

Adding more variables explains more variance, but it also increases the risk of overfitting. Let’s use the step function, which selects variables by AIC, to find the best combination of predictors for our model.

best_model <- step(lm(mpg ~ ., transform(mtcars, 
                                 cyl  = as.factor(cyl),
                                 vs   = as.factor(vs),
                                 am   = as.factor(am),
                                 gear = as.factor(gear),
                                 carb = as.factor(carb))))
summary(best_model)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## cyl6        -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8        -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp          -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt          -2.49682942 0.88558779 -2.819404 9.081408e-03
## am1          1.80921138 1.39630450  1.295714 2.064597e-01
summary(best_model)$r.squared
## [1] 0.8658799

As we can see, the best model contains am along with cyl, wt and hp, and explains 87% of the variance.
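
To see what the selection actually optimised, the AIC of the full model can be compared with that of the selected one. This is a sketch only: the names cars, full and selected are illustrative, and trace = 0 is used just to suppress the step-by-step log.

```r
# Same factor conversion as in the report.
cars <- transform(mtcars,
                  cyl  = as.factor(cyl),
                  vs   = as.factor(vs),
                  am   = as.factor(am),
                  gear = as.factor(gear),
                  carb = as.factor(carb))

full     <- lm(mpg ~ ., data = cars)
selected <- step(full, trace = 0)  # stepwise selection by AIC, log suppressed

# Formula of the selected model.
print(formula(selected))

# extractAIC() returns c(equivalent df, AIC); step() minimises the second value.
print(c(full = extractAIC(full)[2], selected = extractAIC(selected)[2]))
```

A lower AIC for the selected model means the dropped variables did not improve the fit enough to justify their extra parameters.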

Model comparison

Let’s compare our models to confirm that the best model is in fact better than the one based on transmission type only.

anova(single, best_model)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     26 151.03  4    569.87 24.527 1.688e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As expected, there is a significant difference between the two models, and our best model is better than the single variable model.
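
The F statistic in this table follows from the residual sums of squares and degrees of freedom of the two nested models. A minimal sketch that recomputes it by hand, refitting the second model from the formula shown in the anova output (the names small, large, f_stat and p_value are illustrative):

```r
cars  <- transform(mtcars, cyl = as.factor(cyl), am = as.factor(am))
small <- lm(mpg ~ am, data = cars)
large <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Residual sum of squares and residual degrees of freedom of each model.
rss_small <- sum(resid(small)^2)
rss_large <- sum(resid(large)^2)
df_small  <- df.residual(small)
df_large  <- df.residual(large)

# F = (drop in RSS per added parameter) / (residual variance of the larger model).
f_stat  <- ((rss_small - rss_large) / (df_small - df_large)) /
  (rss_large / df_large)
p_value <- pf(f_stat, df_small - df_large, df_large, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```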

Outliers and Leverages

From the residual plots (Fig3), we can see that there is no pattern in the residuals and that they appear homoscedastic.
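
Diagnostic plots of this kind can be drawn with the default plot method for lm objects. The sketch below assumes the standard four panels and may not match Fig3 exactly; the model is refitted from the formula chosen by step, and cars, fit and old_par are illustrative names.

```r
cars <- transform(mtcars, cyl = as.factor(cyl), am = as.factor(am))
fit  <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Four default panels: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)  # restore the previous plotting layout
```

A roughly flat trend in the residuals vs fitted and scale-location panels is what supports the "no pattern, constant variance" reading.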

head(sort(dfbetas(best_model)[,'am1'], decreasing = TRUE), n = 3)
##     Toyota Corona          Fiat 128 Chrysler Imperial 
##         0.7305402         0.4292043         0.3507458
head(sort(hatvalues(best_model), decreasing = TRUE), n = 3)
##       Maserati Bora Lincoln Continental       Toyota Corona 
##           0.4713671           0.2936819           0.2777872

The points with the largest influence and leverage also stand out on our plots, and none of them appears to distort the fit substantially, so we consider the analysis reliable.
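
For a more mechanical check, the same quantities can be compared against rule-of-thumb cutoffs. The thresholds used here (2p/n for leverage and 2/sqrt(n) for dfbetas) are common conventions assumed for illustration, not something the original analysis relied on, and a flagged car is a candidate for a closer look rather than proof of a problem.

```r
cars <- transform(mtcars, cyl = as.factor(cyl), am = as.factor(am))
fit  <- lm(mpg ~ cyl + hp + wt + am, data = cars)

n <- nrow(cars)         # number of observations
p <- length(coef(fit))  # number of estimated coefficients

# Leverage: flag cars whose hat value exceeds 2p/n (assumed cutoff).
lev           <- hatvalues(fit)
high_leverage <- lev[lev > 2 * p / n]

# Influence on the am1 coefficient: flag |dfbetas| above 2/sqrt(n) (assumed cutoff).
dfb_am       <- dfbetas(fit)[, "am1"]
large_dfbeta <- dfb_am[abs(dfb_am) > 2 / sqrt(n)]

print(high_leverage)
print(large_dfbeta)
```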

Quantifying the difference

Let’s look at our best model again to understand how the different factors affect mpg.

summary(best_model)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## cyl6        -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8        -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp          -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt          -2.49682942 0.88558779 -2.819404 9.081408e-03
## am1          1.80921138 1.39630450  1.295714 2.064597e-01

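It looks like manual transmission is better for mpg than automatic transmission (+1.8 mpg with cyl, hp and wt held fixed), although this coefficient is not statistically significant (p = 0.21). We can also see that every additional 1000 lbs of weight decreases mpg by 2.5, and that cars with 6 or 8 cylinders have lower mpg than cars with 4 cylinders.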
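
A confidence interval conveys the uncertainty of the transmission effect more directly than the point estimate alone. This sketch refits the selected model from its formula and uses confint at the conventional 95% level; cars, fit and ci are illustrative names.

```r
cars <- transform(mtcars, cyl = as.factor(cyl), am = as.factor(am))
fit  <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# 95% confidence intervals for all coefficients.
ci <- confint(fit, level = 0.95)

# Interval for the manual-vs-automatic difference, other predictors held fixed.
print(ci["am1", ])
```

If the interval for am1 spans zero, the data are compatible with no transmission effect once cylinders, horsepower and weight are accounted for.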

Comparing weight and cylinders for transmission types

From the plots in Fig4 we can assume that cars with automatic transmission are much heavier and have more cylinders.

t.test(wt ~ am, data = mtcars)$p.value
## [1] 6.27202e-06
t.test(cyl ~ am, data = mtcars)$p.value
## [1] 0.002464713

Both t-tests confirm these assumptions.
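
The size of these differences is easier to judge from the group summaries themselves. A small sketch (wt_by_am and cyl_by_am are illustrative names):

```r
# Mean weight (in 1000 lbs) per transmission type: "0" = automatic, "1" = manual.
wt_by_am <- tapply(mtcars$wt, mtcars$am, mean)
print(round(wt_by_am, 2))

# Cross-tabulation of transmission type against number of cylinders.
cyl_by_am <- table(am = mtcars$am, cyl = mtcars$cyl)
print(cyl_by_am)
```

Because cyl takes only three values, the cross-tabulation is arguably more informative than a comparison of mean cylinder counts.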

Conclusion

From our analysis of the mtcars dataset, it looks like cars with manual transmission do indeed have better mpg. However, it is not certain that the effect is due to the transmission type itself: cars with manual transmission in this dataset tend to be lighter and have fewer cylinders, and those two parameters have a significant effect on mpg.

Appendix

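Fig1: mpg by transmission type

Fig2: correlation plot of the mtcars variables

Fig3: residual plots for the best model

Fig4: weight and number of cylinders by transmission type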