Zhe Jiang

Nov 2, 2018

Introduction

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in two questions: whether an automatic or a manual transmission is better for MPG, and how large the MPG difference between the two transmission types is.

Data

In this analysis, we will use R’s built-in dataset mtcars to answer these questions. A detailed description of the variables is available in the R documentation (?mtcars).

Data Processing

The dataset ships with R, so it is already available. Let’s look at its basic information.

dim(mtcars)
## [1] 32 11
head(mtcars)

Let’s see the variables in the data.

names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

To analyze mpg by transmission type, we need the “am” variable. According to the dataset description, it encodes the transmission: 0 for automatic and 1 for manual.

mpg_data <- mtcars[,c("mpg","am")]
summary(mpg_data)
##       mpg              am        
##  Min.   :10.40   Min.   :0.0000  
##  1st Qu.:15.43   1st Qu.:0.0000  
##  Median :19.20   Median :0.0000  
##  Mean   :20.09   Mean   :0.4062  
##  3rd Qu.:22.80   3rd Qu.:1.0000  
##  Max.   :33.90   Max.   :1.0000

We can see that the “am” variable is stored as numeric. Let’s convert it to a factor with the labels auto and manual so that it is treated as a categorical variable.

mpg_data$am <- factor(mpg_data$am, levels = c(0, 1), labels = c("auto", "manual"))
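
As a quick sanity check on the recoding, the group sizes can be tabulated. The sketch below is self-contained and assumes only the built-in mtcars data; the names `cars` and `group_sizes` are illustrative and not part of the original analysis.

```r
# Scratch copy of the two columns used in this report (illustrative name)
cars <- mtcars[, c("mpg", "am")]

# Recode am (0 = automatic, 1 = manual) as a labelled factor
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Count the cars in each transmission group
group_sizes <- table(cars$am)
print(group_sizes)
```

The two counts should add up to the 32 rows reported by dim(mtcars), and the share of manual cars should match the mean of “am” (0.4062) in the summary above.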

Let’s run a t-test for the difference in mean mpg between automatic and manual transmissions.

t.test(mpg_data$mpg ~ mpg_data$am, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  mpg_data$mpg by mpg_data$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
##   mean in group auto mean in group manual 
##             17.14737             24.39231
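
To make the Welch test less of a black box, the statistic, degrees of freedom, and p-value can be recomputed by hand. This is a minimal sketch using the standard Welch formulas on the raw mtcars columns; the variable names are illustrative.

```r
# Split mpg by transmission (am: 0 = automatic, 1 = manual)
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Squared standard error contributed by each group (no pooled variance)
se2_auto   <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over the unpooled standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), welch_df)

print(round(c(t = t_stat, df = welch_df, p = p_value), 4))
```

The values should agree with the t, df, and p-value lines of the t.test output above, which is a useful check that the test was applied to the intended groups.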

Since the p-value (0.001374) is smaller than 0.05, we can reject the null hypothesis that the two groups have the same mean mpg. The boxplot in the Appendix visualizes the difference.

We can see a clear difference in mpg between the two transmission groups. Now, let’s quantify the difference and see how mpg differs between them.

fit <- lm(mpg ~ am, data = mpg_data)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mpg_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

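From the summary, we can see that the coefficient for manual is 7.245 with a p-value of 0.000285, which is below 0.05. The intercept (17.147) is the mean mpg of the automatic group, so manual cars in this sample average about 7.245 mpg more than automatic cars.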
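
With a single two-level predictor, the regression coefficients are just group means in disguise. The sketch below illustrates this on a scratch copy of the data and also shows a confidence interval for the difference; the names `cars`, `group_means`, and `fit_check` are illustrative.

```r
# Scratch copy with am recoded as a labelled factor (illustrative names)
cars <- mtcars[, c("mpg", "am")]
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Mean mpg within each transmission group
group_means <- tapply(cars$mpg, cars$am, mean)

# Same single-predictor model as in the report
fit_check <- lm(mpg ~ am, data = cars)

# Intercept = automatic-group mean; ammanual = manual mean minus automatic mean
print(coef(fit_check))
print(group_means[["manual"]] - group_means[["auto"]])

# 95% confidence interval for the estimated difference
print(confint(fit_check, "ammanual"))
```

Note that this interval comes from the regression, which assumes equal variances in the two groups, so it need not match the Welch interval from t.test exactly.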

Result

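From the analysis of the mpg difference between transmission types, we can draw the following conclusions. Manual-transmission cars have a significantly higher mean mpg than automatic ones (Welch t-test, p-value = 0.001374). On average, the difference is about 7.245 mpg (24.39 for manual versus 17.15 for automatic).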

Note

From the linear regression fit, the adjusted R-squared is 0.3385 (multiple R-squared 0.3598), which means the model explains only about a third of the variation in mpg. The reason is that we have included “am” as the single predictor of mpg. If the model included other features, the adjusted R-squared would likely be higher.
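
The effect of adding predictors can be checked directly. The sketch below compares the single-predictor model with one that also includes weight (wt) and horsepower (hp); the choice of those two variables is an assumption made purely for illustration, not a model selected in this report.

```r
# Full dataset with am recoded as a labelled factor (illustrative names)
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Model from the report, and a larger one with two assumed extra predictors
fit_am   <- lm(mpg ~ am, data = cars)
fit_more <- lm(mpg ~ am + wt + hp, data = cars)

# Compare adjusted R-squared across the two models
adj_r2 <- c(
  am_only  = summary(fit_am)$adj.r.squared,
  am_wt_hp = summary(fit_more)$adj.r.squared
)
print(round(adj_r2, 4))

# F test for whether the extra predictors improve the fit (nested models)
print(anova(fit_am, fit_more))
```

The coefficient on “am” can change once other predictors are in the model, so the 7.245 estimate above is best read as an unadjusted difference between the two groups.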

Appendix

boxplot(mpg ~ am, data = mpg_data, main = "MPG vs Transmission",
        xlab = "Transmission",ylab = "MPG")