We analyzed the mtcars dataset, a collection of cars, to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. We were particularly interested in the effect of transmission type on MPG.
We found that although cars with manual transmission have better mpg, it is uncertain whether this is an effect of the transmission itself or whether these cars simply tend to be lighter and have fewer cylinders.
| Variable | Description |
|---|---|
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | 1/4 mile time |
| vs | V/S |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
dim(mtcars)
## [1] 32 11
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
From Fig1 (see Appendix) it looks like mpg is higher for manual transmission. Let’s check this with a t-test.
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
As we can see, the t-test confirms that the difference in mpg between transmission types is significant: the p-value is 0.001374 under the null hypothesis that there is no difference between the two types.
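As a minimal sketch, assuming only base R and the built-in mtcars data, the numbers quoted above can be pulled out of the object returned by t.test instead of being read off the printout. The variable name `tt` is ours and does not appear in the original analysis.

```r
# Welch two-sample t-test of mpg by transmission type (0 = automatic, 1 = manual)
tt <- t.test(mpg ~ am, data = mtcars)

tt$estimate   # group means of mpg
tt$conf.int   # 95% confidence interval for mean(am = 0) - mean(am = 1)
tt$p.value    # two-sided p-value
```

Both ends of the interval are negative because the difference is taken as automatic minus manual.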
But is am the only variable that affects mpg?
From the correlation plot (Fig2) we can see strong negative correlations between mpg and cyl, disp, hp, and wt, and positive correlations between mpg and drat and am. We can also see strong negative correlations between am and wt, disp, and cyl, and a positive correlation between am and drat.
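The same relationships can be inspected numerically. The sketch below assumes the plot is based on Pearson correlations (the default of `cor`); the name `mpg_cor` is ours.

```r
# Pearson correlations of every other variable with mpg,
# sorted from most negative to most positive
mpg_cor <- sort(cor(mtcars)["mpg", colnames(mtcars) != "mpg"])
round(mpg_cor, 2)

# Correlations of the transmission indicator with the variables named above
round(cor(mtcars)["am", c("wt", "disp", "cyl", "drat")], 2)
```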
Let’s check the influence of these columns with linear regression.
To fit the model, we convert cyl, vs, am, gear and carb to factors.
single <- lm(mpg ~ am, transform(mtcars,
    cyl = as.factor(cyl),
    vs = as.factor(vs),
    am = as.factor(am),
    gear = as.factor(gear),
    carb = as.factor(carb)))
summary(single)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am1 7.244939 1.764422 4.106127 2.850207e-04
summary(single)$r.squared
## [1] 0.3597989
With the single-variable model we can see a strong influence of transmission type on mpg (about +7.2 mpg for manual transmission). We can also see that this model explains only 36% of the variance.
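With a single two-level predictor, the fitted slope is simply the difference between the two group means, which is why it agrees with the means from the t-test. A small sketch of that check, using names of our own choosing (`group_means`, `diff_means`, `fit`, `slope`):

```r
# Mean mpg for automatic ("0") and manual ("1") cars
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
diff_means  <- unname(group_means["1"] - group_means["0"])

# Slope of the one-predictor regression on the transmission factor
fit   <- lm(mpg ~ factor(am), data = mtcars)
slope <- unname(coef(fit)[2])

c(diff_means = diff_means, slope = slope)
```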
Adding more columns will explain more variance, but it also carries a risk of overfitting. Let’s use the step function to find the best combination of columns for our model.
best_model <- step(lm(mpg ~ ., transform(mtcars,
    cyl = as.factor(cyl),
    vs = as.factor(vs),
    am = as.factor(am),
    gear = as.factor(gear),
    carb = as.factor(carb))))
summary(best_model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## cyl6 -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8 -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt -2.49682942 0.88558779 -2.819404 9.081408e-03
## am1 1.80921138 1.39630450 1.295714 2.064597e-01
summary(best_model)$r.squared
## [1] 0.8658799
As we can see, the best model contains the am column along with cyl, wt and hp, and explains 87% of the variance.
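For readers who want to see what the selection did, here is a hedged sketch of the same call with the intermediate output silenced. `step` compares candidate models by AIC and, when no `scope` is given, searches backward from the full model. The names `cars`, `full` and `selected` are ours.

```r
# Same factor conversion as in the text
cars <- transform(mtcars,
                  cyl  = as.factor(cyl),
                  vs   = as.factor(vs),
                  am   = as.factor(am),
                  gear = as.factor(gear),
                  carb = as.factor(carb))

full <- lm(mpg ~ ., data = cars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
selected <- step(full, trace = 0)

formula(selected)      # terms that survived the selection
AIC(full, selected)    # the selected model has the lower AIC
```

Stepwise selection optimizes AIC, not the significance of any particular coefficient, so a term such as am can be retained even when its own p-value is large.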
Let’s compare the two models to confirm that our best model is in fact better than the one based on transmission type only.
anova(single, best_model)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As expected, there is a significant difference between the two models, and our best model is better than the single-variable model.
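The F statistic in the table can be reproduced by hand from the two residual sums of squares, which makes it clearer what the comparison measures. This is a sketch under the assumption that the two models are refit exactly as above; the names `small`, `large`, `f_stat` and `p_val` are ours.

```r
cars <- transform(mtcars, cyl = as.factor(cyl), am = as.factor(am))

small <- lm(mpg ~ am, data = cars)
large <- lm(mpg ~ cyl + hp + wt + am, data = cars)

rss_small <- deviance(small)   # residual sum of squares of the smaller model
rss_large <- deviance(large)   # residual sum of squares of the larger model
df_diff   <- df.residual(small) - df.residual(large)

# F = (drop in RSS per extra parameter) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_large) / df_diff) / (rss_large / df.residual(large))
p_val  <- pf(f_stat, df_diff, df.residual(large), lower.tail = FALSE)

c(F = f_stat, p = p_val)
```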
From the residual plots (Fig3), we can see no pattern in the residuals, and they appear homoscedastic.
head(sort(dfbetas(best_model)[,'am1'], decreasing = TRUE), n = 3)
## Toyota Corona Fiat 128 Chrysler Imperial
## 0.7305402 0.4292043 0.3507458
head(sort(hatvalues(best_model), decreasing = TRUE), n = 3)
## Maserati Bora Lincoln Continental Toyota Corona
## 0.4713671 0.2936819 0.2777872
We can also see that all outliers are shown on our plots and don’t have significant influence, so we can conclude that our analysis is accurate.
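Another common summary of influence, not used in the original analysis, is Cook's distance, which combines residual size and leverage into a single number per car. The sketch below refits the selected model under the names `cars`, `fit`, `cd` and `lev`, all of which are ours.

```r
cars <- transform(mtcars, cyl = as.factor(cyl), am = as.factor(am))
fit  <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Cook's distance: one influence value per observation
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), n = 3)

# Leverage alone, for comparison with the hatvalues() output above
lev <- hatvalues(fit)
names(which.max(lev))
```

Any cutoff for these values is a convention, so it is best read alongside the plots and not in place of them.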
Let’s look at our best model again to understand how the different factors affect mpg.
summary(best_model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## cyl6 -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8 -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt -2.49682942 0.88558779 -2.819404 9.081408e-03
## am1 1.80921138 1.39630450 1.295714 2.064597e-01
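It looks like manual transmission is better for mpg than automatic (+1.8 mpg with cyl, hp and wt held fixed), although this coefficient is not statistically significant (p = 0.21). We can also see that every additional 1000 lbs of weight decreases mpg by 2.5, and that a larger number of cylinders decreases mpg.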
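Confidence intervals give a feel for how uncertain these adjusted effects are. A minimal sketch, refitting the selected model under our own names `cars`, `fit` and `ci`:

```r
cars <- transform(mtcars, cyl = as.factor(cyl), am = as.factor(am))
fit  <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# 95% confidence intervals for the coefficients
ci <- confint(fit)
ci["am1", ]   # manual vs automatic, holding cyl, hp and wt fixed
ci["wt", ]    # change in mpg per additional 1000 lbs
```

Because the p-value for am1 is above 0.05, its interval includes zero, while the interval for wt lies entirely below zero.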
From the plots in Fig4 we can assume that cars with automatic transmission are much heavier and have more cylinders.
t.test(wt ~ am, data = mtcars)$p.value
## [1] 6.27202e-06
t.test(cyl ~ am, data = mtcars)$p.value
## [1] 0.002464713
Both t-tests confirm these assumptions.
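The p-values alone do not show the direction or size of the differences. A short sketch that tabulates the group means directly (the name `by_am` is ours):

```r
# Mean weight (1000 lbs) and mean cylinder count by transmission type
# (am: 0 = automatic, 1 = manual)
by_am <- aggregate(cbind(wt, cyl) ~ am, data = mtcars, FUN = mean)
by_am
```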
From our analysis of the mtcars dataset, it looks like cars with manual transmission do indeed have better mpg. However, it is not certain that the effect is due to transmission type itself: the cars with manual transmission in this dataset tend to be lighter and have fewer cylinders, and those two parameters have a significant effect on mpg.