You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
In this analysis, we will use R’s built-in mtcars dataset to answer these questions. A detailed description is available in the R documentation (?mtcars).
The data is already loaded, so let’s look at its basic dimensions.
dim(mtcars)
## [1] 32 11
head(mtcars)
Let’s look at the variables in the data.
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
From the description, “am” is the transmission variable: 0 for automatic and 1 for manual. To analyze mpg by transmission type, we keep only these two columns.
mpg_data <- mtcars[,c("mpg","am")]
summary(mpg_data)
## mpg am
## Min. :10.40 Min. :0.0000
## 1st Qu.:15.43 1st Qu.:0.0000
## Median :19.20 Median :0.0000
## Mean :20.09 Mean :0.4062
## 3rd Qu.:22.80 3rd Qu.:1.0000
## Max. :33.90 Max. :1.0000
The summary shows that “am” is stored as a numeric variable. Let’s convert it to a factor with the labels auto (0) and manual (1).
mpg_data$am <- factor(mpg_data$am, levels = c(0, 1), labels = c("auto", "manual"))
Let’s run a t-test for the difference in mpg between automatic and manual transmissions.
t.test(mpg_data$mpg ~ mpg_data$am, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mpg_data$mpg by mpg_data$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group auto mean in group manual
## 17.14737 24.39231
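To make the output above less of a black box, the Welch statistic can be reproduced by hand. The following is a minimal sketch, not part of the original analysis; the variable names (auto, manual, se2_auto, and so on) are chosen here for illustration. It computes the t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value directly from the group means and variances, and should agree with the t.test output shown above.

```r
# Split mpg by transmission type (am: 0 = automatic, 1 = manual)
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Squared standard error of each group mean
se2_auto   <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation for the degrees of freedom
df <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t = t_stat, df = df, p.value = p_value))
```

Because the two groups have different sizes and variances, the degrees of freedom are not a whole number, which is why the t.test output reports a fractional df.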
Since the p-value (0.001374) is smaller than 0.05, we can reject the null hypothesis that the mean mpg is the same for the two transmission types. Let’s visualize the difference in mpg with the boxplot in the Appendix.
We can see a clear difference in mpg between the two transmission groups. Now, let’s quantify the difference and see how mpg differs between the two groups.
fit <- lm(mpg ~ am, data = mpg_data)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mpg_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
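From the summary, the coefficient for manual is 7.245 with a p-value of 0.000285 (< 0.05). With a single two-level predictor, the intercept (17.147) is the mean mpg of automatic cars, so manual cars average 7.245 mpg more than automatic cars.
From the analysis of the mpg difference between transmission types, we can draw the following conclusion.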
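The link between the regression coefficients and the group means can be checked directly. This is a small sketch under the assumption that “am” is coded as a two-level factor with auto as the reference level; it rebuilds mpg_data so that it runs on its own, and group_means is a name introduced here for illustration.

```r
# Rebuild the two-column data set with am as a labelled factor
mpg_data <- data.frame(
  mpg = mtcars$mpg,
  am  = factor(mtcars$am, levels = c(0, 1), labels = c("auto", "manual"))
)

fit <- lm(mpg ~ am, data = mpg_data)

# Mean mpg within each transmission group
group_means <- tapply(mpg_data$mpg, mpg_data$am, mean)

# The intercept is the mean of the reference group (auto)
print(c(intercept = unname(coef(fit)["(Intercept)"]),
        mean_auto = unname(group_means["auto"])))

# The slope is the difference between the two group means
print(c(slope     = unname(coef(fit)["ammanual"]),
        mean_diff = unname(group_means["manual"] - group_means["auto"])))

# 95% confidence intervals for both coefficients
print(confint(fit))
```

The interval from confint differs slightly from the Welch interval in the t-test output, because lm assumes a common residual variance for both groups.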
From the linear regression fit, the adjusted R-squared is 0.3385, which means only about 33.85% of the variation in mpg is explained by the model. The reason is that we have included only “am” as a predictor of mpg. If the model included other variables, the adjusted R-squared would be expected to be higher.
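The expectation about adjusted R-squared can be explored with a second model. The sketch below is an illustration only: the choice of weight (wt) and horsepower (hp) as extra predictors is an assumption made here, not a model selected in the analysis above, and the names fit_single and fit_multi are hypothetical.

```r
# Single-predictor model, equivalent to the fit used above
fit_single <- lm(mpg ~ factor(am), data = mtcars)

# Illustrative larger model: wt and hp are assumed extra predictors
fit_multi <- lm(mpg ~ factor(am) + wt + hp, data = mtcars)

# Compare adjusted R-squared between the two models
adj_r2_single <- summary(fit_single)$adj.r.squared
adj_r2_multi  <- summary(fit_multi)$adj.r.squared
print(c(single = adj_r2_single, multi = adj_r2_multi))

# Inspect how the transmission coefficient changes once
# the other variables are held fixed
print(coef(summary(fit_multi)))
```

Note that the transmission coefficient in the larger model measures the difference at fixed weight and horsepower, so it need not match the 7.245 estimate from the single-predictor fit.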
boxplot(mpg ~ am, data = mpg_data, main = "MPG vs Transmission",
        xlab = "Transmission", ylab = "MPG")