In this assignment, we will analyse the given mtcars dataset and try to find out the relationship between a set of variables and miles per gallon (MPG) (outcome).
This assignment will also answer the two following questions (the purpose of this assignment):
The key finding is that cars with manual transmissions get an estimated 2.084 miles per gallon more on average than cars with automatic transmissions. This estimate is adjusted for the confounding variables horsepower (hp) and weight (wt); the unadjusted difference in group means is 7.245 miles per gallon.
require(datasets); require(ggplot2)
data(mtcars); head(mtcars) # "am" is the transmission type (0 = automatic, 1 = manual)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
summary(mtcars$mpg[mtcars$am==0]) # taking a summary of MPG from automatic cars
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.40 14.95 17.30 17.15 19.20 24.40
summary(mtcars$mpg[mtcars$am==1]) # taking a summary of MPG from manual cars
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 15.00 21.00 22.80 24.39 30.40 33.90
# the same group means in a more compact code snippet:
aggregate(mtcars["mpg"], by = list(am = mtcars$am), mean)
#> am mpg
#> 1 0 17.14737
#> 2 1 24.39231
# Exploratory plot via dotchart
# cars are grouped by automatic/manual
dat <- mtcars[order(mtcars$am, mtcars$mpg), ] # ordering the data
dat$am <- factor(dat$am) # from numeric to factor
# introduce a new column with color
dat$color[dat$am==0] <- "red" # am==0: automatic
dat$color[dat$am==1] <- "blue" # am==1: manual
# dotchart
dotchart(dat$mpg,
labels = row.names(dat),
cex = 0.7,
groups = dat$am,
gcolor = "black",
color = dat$color,
pch = 19,
main = "MPG for Car Models\ngrouped by Transmission\n0:automatic 1:manual",
xlab = "Miles per Gallon")
# Exploratory plot via ggplot2: violin plot with an overlaid box plot
g <- ggplot(dat, aes(x = am, y = mpg)) +
geom_violin(fill = "lightblue") +
geom_boxplot(fill = "cornflowerblue", color = "black", width = 0.2) +
geom_point(position = "jitter", color = "blue", alpha = 0.5) +
geom_rug(sides = "l", color = "black") # the geom_rug argument is `sides`, not `side`
g
Synopsis: Looking at the plots (see appendix) and at the difference in mean MPG between the two transmission types (automatic/manual), manual transmission appears to give better MPG than automatic. To quantify this claim, we try to find a suitable regression model for it.
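Before fitting a regression, the raw difference in group means can be checked with a two-sample t-test. The following is a minimal sketch and not part of the original analysis; it assumes the default Welch test (unequal variances) from base R, and the names `tt` and `mean_diff` are introduced here for illustration only.

```r
# Welch two-sample t-test of mpg by transmission (0 = automatic, 1 = manual)
data(mtcars)
tt <- t.test(mpg ~ am, data = mtcars)

tt$estimate    # mean mpg in each group
tt$conf.int    # 95% confidence interval for (automatic - manual)
tt$p.value     # two-sided p-value

# difference in means, manual minus automatic (hypothetical helper variable)
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])
mean_diff
```

The difference in means matches the gap between the two rows of the `aggregate()` output above; the regression models that follow ask how much of that gap remains once other variables are held constant.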
# let's start simple first
fit1 <- lm(mpg ~ am, data = dat)
summary(fit1)
#>
#> Call:
#> lm(formula = mpg ~ am, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -9.3923 -3.0923 -0.2974 3.2439 9.5077
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 17.147 1.125 15.247 1.13e-15 ***
#> am1 7.245 1.764 4.106 0.000285 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.902 on 30 degrees of freedom
#> Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
#> F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Synopsis: The am1 coefficient shows that cars with manual transmission average 7.245 MPG more than cars with automatic transmission.
Furthermore, the p-value of 0.000285 is very small, so the difference in MPG between the two groups is significant at the 5% level (equivalently, the 95% confidence interval for the am1 coefficient does not include zero). However, the adjusted R-squared is only 0.3385 (multiple R-squared 0.3598), which means that the model explains only about a third of the variance in MPG. We need to consider more predictor variables to fit the data better.
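The confidence interval mentioned above is not printed by `summary()`. A short sketch of how it could be obtained with base R's `confint()` follows; the model is refit here so the snippet runs on its own, and `ci_am` is a name introduced for illustration.

```r
# Refit the simple model so this snippet is self-contained
data(mtcars)
dat <- mtcars
dat$am <- factor(dat$am)          # 0 = automatic, 1 = manual
fit1 <- lm(mpg ~ am, data = dat)

# 95% confidence interval for the manual-vs-automatic coefficient
ci_am <- confint(fit1, "am1", level = 0.95)
ci_am
```

If both bounds are positive, the interval excludes zero, which is consistent with the small p-value reported for am1.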
fit2 <- lm(mpg ~ am + hp, data = dat)
fit3 <- lm(mpg ~ am + hp + wt, data = dat)
fit4 <- lm(mpg ~ am + hp + wt + cyl, data = dat)
anova(fit1, fit2, fit3, fit4)
#> Analysis of Variance Table
#>
#> Model 1: mpg ~ am
#> Model 2: mpg ~ am + hp
#> Model 3: mpg ~ am + hp + wt
#> Model 4: mpg ~ am + hp + wt + cyl
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 30 720.90
#> 2 29 245.44 1 475.46 75.5148 2.638e-09 ***
#> 3 28 180.29 1 65.15 10.3472 0.003356 **
#> 4 27 170.00 1 10.29 1.6348 0.211917
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Synopsis: Here the models are nested: each of fit1, fit2 and fit3 is contained in the next, larger model. The anova() function tests sequentially whether hp, wt and cyl each add to the linear prediction above and beyond the terms already in the model, starting from am alone. Because the test for fit4 is nonsignificant (p-value = 0.2119), we conclude that cyl doesn't add to the linear prediction once am, hp and wt are included, so we can drop fit4. The variable wt, tested in fit3, is significant (p-value = 0.00336), and hp, tested in fit2, is highly significant (p-value = 2.638e-09). At this point, we select fit3 as the model to go with.
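To make the sequential tests concrete, the F statistics in the table can be reproduced by hand from the RSS and Res.Df columns. This is a sketch under the assumption that each reduction in RSS is scaled by the residual mean square of the largest model (fit4), which agrees with the numbers in the table above; `mse_full`, `f_by_hand` and `p_by_hand` are illustrative names.

```r
# Refit the four nested models so this snippet is self-contained
data(mtcars)
dat <- mtcars
dat$am <- factor(dat$am)
fit1 <- lm(mpg ~ am, data = dat)
fit2 <- lm(mpg ~ am + hp, data = dat)
fit3 <- lm(mpg ~ am + hp + wt, data = dat)
fit4 <- lm(mpg ~ am + hp + wt + cyl, data = dat)

tab <- anova(fit1, fit2, fit3, fit4)
rss    <- tab$RSS       # residual sum of squares per model
res_df <- tab$Res.Df    # residual degrees of freedom per model

# Residual mean square of the largest model is the common denominator
mse_full <- rss[4] / res_df[4]

# Drop in RSS per added degree of freedom, scaled by mse_full
df_added  <- -diff(res_df)
f_by_hand <- (-diff(rss) / df_added) / mse_full
p_by_hand <- pf(f_by_hand, df_added, res_df[4], lower.tail = FALSE)

cbind(F = f_by_hand, p = p_by_hand)
```

Each row corresponds to one added term (hp, then wt, then cyl), so the last row is the test that leads to dropping fit4.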
summary(fit3)
#>
#> Call:
#> lm(formula = mpg ~ am + hp + wt, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.4221 -1.7924 -0.3788 1.2249 5.5317
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 34.002875 2.642659 12.867 2.82e-13 ***
#> am1 2.083710 1.376420 1.514 0.141268
#> hp -0.037479 0.009605 -3.902 0.000546 ***
#> wt -2.878575 0.904971 -3.181 0.003574 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.538 on 28 degrees of freedom
#> Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
#> F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11
Compared to the simple linear model (fit1), the adjusted R-squared for this model (fit3) is 0.823, which means it explains approximately 82.3% of the variance in MPG.
We can conclude that hp and wt are confounding variables in the relationship between am and mpg: once they are held constant, the estimated advantage of manual transmission drops from 7.245 to 2.084 miles per gallon. Note that in fit3 the am1 coefficient is no longer significant at the 5% level (p-value = 0.141).
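The effect of the confounders on the transmission estimate can be seen by lining up the am1 coefficient across the nested models. The sketch below refits the models so it runs on its own; `fits`, `am_est` and `ci3` are names introduced here for illustration.

```r
# Refit the models so this snippet is self-contained
data(mtcars)
dat <- mtcars
dat$am <- factor(dat$am)
fits <- list(
  fit1 = lm(mpg ~ am, data = dat),
  fit2 = lm(mpg ~ am + hp, data = dat),
  fit3 = lm(mpg ~ am + hp + wt, data = dat)
)

# am1 estimate in each model: how the manual-vs-automatic gap
# changes as hp and wt are held constant
am_est <- sapply(fits, function(f) coef(f)[["am1"]])
round(am_est, 3)

# 95% confidence interval for am1 in the selected model
ci3 <- confint(fits$fit3, "am1")
ci3
```

An interval for am1 in fit3 that spans zero is another way of reading the p-value of 0.141 in the summary above.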
# Let's have a closer look at fit3: residual plot
par(mfrow=c(2,2))
plot(fit3)
# Let's look at the correlations - the whole mtcars data
require(GGally)
ggcorr(mtcars, palette = "RdBu", label = TRUE)
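For readers without GGally installed, the correlations with mpg can also be listed numerically in base R. This is a minimal sketch, assuming Pearson correlation (the default of `cor()`); `r_mpg` and `ranked` are illustrative names.

```r
# Pearson correlation of every variable with mpg, strongest first
data(mtcars)
r_mpg <- cor(mtcars)["mpg", ]
r_mpg <- r_mpg[names(r_mpg) != "mpg"]    # drop the trivial self-correlation

# order by absolute strength while keeping the sign
ranked <- r_mpg[order(abs(r_mpg), decreasing = TRUE)]
round(ranked, 3)
```

Variables that correlate strongly with mpg and also differ between transmission types are the natural candidates for confounders in the models above.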