You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG”
“Quantify the MPG difference between automatic and manual transmissions”
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
We see that the variable we are interested in, am, is numeric. For our statistical analysis we need to convert it to a categorical variable. R calls categorical variables “factors”.
library(plyr)
## Warning: package 'plyr' was built under R version 4.0.4
mtcars$am <- factor(mtcars$am)
mtcars$am <- mapvalues(mtcars$am, from = c("0", "1"), to = c("automatic", "manual"))
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
levels(mtcars$am)
## [1] "automatic" "manual"
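The same conversion can also be done in base R without loading plyr. The following is a minimal sketch, not the method used in this report; the copy named cars is an assumption introduced here so that mtcars itself is left untouched.

```r
# Work on a fresh copy (hypothetical name `cars`) so mtcars is not modified.
cars <- datasets::mtcars

# factor() pairs each value in `levels` with the label at the same position,
# so 0 becomes "automatic" and 1 becomes "manual" in a single step.
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

levels(cars$am)  # the two labels, in the order given above
table(cars$am)   # number of cars in each transmission group
```

Giving levels and labels together makes the mapping from 0/1 to names explicit, which avoids relying on the default ordering of the factor levels.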
Exploratory Data Analysis. We use box plots (also called box-and-whisker plots), which John Tukey popularized in his 1977 book Exploratory Data Analysis. With them we can quickly compare automatic and manual transmissions graphically. We will use the ggplot2 library.
# Basic box plot
library(ggplot2)
ggplot(mtcars, aes(x=mpg, y=am)) +
geom_boxplot(fill="gray")+
labs(title="Miles Per Gallon Versus Transmission",x="Miles Per Gallon", y = "Transmission")+
theme_classic()
From the box plot we can see that mpg is more spread out for manual cars than for automatic cars, which suggests that the two groups have different variances. The t-test also assumes that mpg is approximately normally distributed within each group; otherwise its results would not be valid. So, we will create a density plot for each group to check this assumption.
library(ggplot2)
ggplot(mtcars, aes(mpg, fill = am, color = am)) + geom_density(alpha = 0.2) + facet_wrap(.~am, ncol=1) + xlim(c(0,60))
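A visual check can be complemented with numbers. The sketch below is an addition to the original analysis and makes two assumptions: the copy named cars, and the choice of the Shapiro-Wilk test (shapiro.test from base R's stats package) as the normality check. It also prints the standard deviation of each group, to back up the difference in spread seen in the box plot.

```r
# Fresh copy (hypothetical name `cars`) with am as a labelled factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Standard deviation of mpg within each transmission group.
spread <- tapply(cars$mpg, cars$am, sd)
spread

# Shapiro-Wilk p-value per group. The null hypothesis is that the sample
# comes from a normal distribution, so a small p-value would cast doubt
# on the normality assumption.
normality <- tapply(cars$mpg, cars$am, function(x) shapiro.test(x)$p.value)
normality
```

With groups this small, the test has limited power, so it should be read alongside the plots rather than as a replacement for them.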
From the plots it is reasonable to assume that the distribution of mpg is approximately normal for both automatic and manual cars. Using that, together with the assumption that the variances are different, we will use the Welch t-test to test whether the mean mpg of automatic and manual cars is equal.
automatic <- mtcars[ which(mtcars$am== "automatic"), ]
manual <- mtcars[ which(mtcars$am== "manual"), ]
#OPTION 1
t.test(automatic$mpg, manual$mpg, var.equal=FALSE, paired=FALSE)
##
## Welch Two Sample t-test
##
## data: automatic$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
#OPTION 2: the same Welch test with the default alternative="two.sided" written out
t.test(automatic$mpg, manual$mpg, alternative="two.sided", var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: automatic$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
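Both options above pass the two groups as separate vectors. t.test also accepts a formula, which avoids splitting the data frame by hand. The sketch below assumes a copy named cars (not part of the original code) and stores the result so that individual pieces can be pulled out.

```r
# Fresh copy (hypothetical name `cars`) with am as a labelled factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# mpg ~ am splits mpg by the two levels of am. var.equal = FALSE is the
# default and requests the Welch version of the test.
welch <- t.test(mpg ~ am, data = cars, var.equal = FALSE)

welch$statistic  # t value
welch$parameter  # Welch degrees of freedom
welch$p.value    # two-sided p-value
welch$conf.int   # 95 percent confidence interval, automatic minus manual
```

Because automatic is the first factor level, the difference is again automatic minus manual, so the values match the output shown above.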
If the p-value is less than 0.05, we reject the null hypothesis that there is no difference between the means and conclude that a significant difference does exist. As the p-value is 0.001374, we reject the null hypothesis.
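To see where the t value, the degrees of freedom and the p-value come from, they can be recomputed by hand from the group means, variances and sizes. This is an illustrative sketch of the standard Welch formulas; the variable names (x, y, se2_x, se2_y, t_stat, welch_df, p_value) are assumptions introduced here.

```r
cars <- datasets::mtcars
x <- cars$mpg[cars$am == 0]  # automatic
y <- cars$mpg[cars$am == 1]  # manual

# Squared standard error contributed by each group: s^2 / n.
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over the combined standard error.
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), welch_df)

c(t = t_stat, df = welch_df, p = p_value)
```

The non-integer degrees of freedom in the t.test output (18.332) come from this approximation, which is what distinguishes the Welch test from the pooled-variance t-test.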
#"Quantify the MPG difference between automatic and manual transmissions"
mean(automatic$mpg) - mean(manual$mpg)
## [1] -7.244939
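The best estimate of the difference between automatic and manual is the difference between the group means: automatic cars average about 7.24 mpg less than manual cars. The same value can be obtained using the lm function, where it appears with the opposite sign (ammanual = 7.245) because the coefficient measures manual relative to the automatic baseline. For example: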
#"Quantify the MPG difference between automatic and manual transmissions"
lm(mpg ~ am, mtcars)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Coefficients:
## (Intercept) ammanual
## 17.147 7.245
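The fitted model can also report the uncertainty around that coefficient. The sketch below assumes a copy named cars and a model object named fit, neither of which appears in the original code. Note that lm assumes a common variance for both groups, so its interval need not match the Welch interval from t.test.

```r
# Fresh copy (hypothetical name `cars`) with am as a labelled factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# The intercept is the mean mpg of the baseline level (automatic);
# ammanual is the mean of manual minus the mean of automatic.
fit <- lm(mpg ~ am, data = cars)

coef(fit)                   # point estimates
confint(fit, level = 0.95)  # 95 percent intervals under equal variances
summary(fit)$coefficients   # estimates with standard errors and p-values
```

Adding the intercept and the ammanual coefficient recovers the mean mpg of the manual group.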
Executive Summary:
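We can conclude that vehicles with manual transmission have better mpg than vehicles with automatic transmission. On average, manual cars get about 7.24 more miles per gallon (24.39 versus 17.15), with a 95 percent confidence interval for the difference of 3.21 to 11.28 mpg, and the difference is statistically significant (Welch t-test, p-value = 0.001374).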