This project is for Motor Trend, a magazine about the automobile industry. We analyze the mtcars dataset to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. The dataset was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). We will focus on answering these two questions:

1. Is an automatic or manual transmission better for MPG?
2. Quantify the MPG difference between automatic and manual transmissions.

To answer these questions, we will perform exploratory data analysis (EDA) and use a hypothesis test and linear regression to make statistical inferences.
library(ggplot2)
library(statsr)
library(GGally)
library(knitr)
library(car)
library(MASS)
data(mtcars)
dim(mtcars)
## [1] 32 11
The data has 32 observations and 11 variables. We will look at the first six rows to get a better sense of the dataset.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We will visualize the relationship between am and mpg via a boxplot
ggplot(mtcars, aes(y = mpg, x = factor(am, labels = c("automatic", "manual")), fill = factor(am))) +
geom_boxplot(colour = "black", size = 1) +
xlab("Transmission") +ylab("MPG")
According to the plot, automatic cars have lower MPG than manual cars, which suggests automatic cars are less fuel efficient. However, this difference could have arisen by random chance, so we will run a hypothesis test later to check. First, we will draw a pairwise plot to see which variables correlate with each other.
ggpairs(mtcars,
        lower = list(continuous = "smooth"),
        diag = list(continuous = "barDiag"),
        axisLabels = "show")
We assume the observations are independent, that mpg is approximately normally distributed within each group, and we set the significance level alpha to 5%. We will perform a two-sample t-test, using the t.test function to obtain the 95% confidence interval.

State the hypotheses: H0: automatic and manual cars have the same mean MPG; Ha: their mean MPGs differ.

Run the test:
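As a quick sanity check before the formal test, we can summarize mpg by transmission type with base R's aggregate (a minimal sketch; the group means should match the sample estimates reported by t.test below):

# Group mean, standard deviation, and size of mpg by transmission (0 = automatic, 1 = manual)
aggregate(mpg ~ am, data = mtcars,
          FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))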
test <- t.test(mpg ~ am, data = mtcars, var.equal = FALSE, paired = FALSE, conf.level = 0.95)
test
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
The p-value (0.001374) is well below 0.05, so we reject the null hypothesis. We have strong evidence that automatic and manual cars differ in fuel efficiency (MPG).
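To make the test statistic concrete, here is a minimal sketch that reproduces the Welch t-statistic by hand from its definition, the difference in group means divided by the combined standard error:

# Welch t-statistic from the group summaries; should match t = -3.7671 above
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]
se <- sqrt(var(auto) / length(auto) + var(manual) / length(manual))
(mean(auto) - mean(manual)) / se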
mtcars$amfactor <- factor(mtcars$am, labels = c("automatic", "manual"))
summary(lm(mpg ~ factor(amfactor), data = mtcars))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## factor(amfactor)manual 7.244939 1.764422 4.106127 2.850207e-04
The intercept of 17.147368 is simply the mean MPG of automatic-transmission cars. The slope of 7.244939 is the difference in mean MPG between manual and automatic transmissions. The p-value of 2.850207e-04 for this difference is significant, so according to this model manual transmission is more fuel efficient.
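Because the only predictor is a binary factor, the fitted values are just the two group means, which we can confirm directly (these are the same sample estimates shown in the t-test output above):

# Intercept = mean MPG of automatic cars; intercept + slope = mean MPG of manual cars
mean(mtcars$mpg[mtcars$am == 0])   # 17.14737 (intercept)
mean(mtcars$mpg[mtcars$am == 1])   # 24.39231 = 17.147368 + 7.244939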
Taking only one predictor into consideration is not sufficient, because we do not know whether other variables are related to the outcome. Thus, we should fit a model that includes the effects of the other variables.
summary(lm(mpg ~ cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## factor(vs)1 0.31776281 2.10450861 0.1509915 0.88142347
## factor(am)1 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
One problem with a multiple linear regression (MLR) model is collinearity. If two or more highly correlated predictors are both included in the model, the true standard errors increase and we get very unstable estimates of the slopes. We can assess collinearity with the variance inflation factor (VIF).
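To show what the VIF measures, here is a minimal sketch that computes it by hand for wt: regress wt on the remaining predictors and take 1 / (1 - R^2). The result should match the wt entry reported by vif below (about 15.16):

# VIF for wt by hand: how much of wt's variance is explained by the other predictors
r2_wt <- summary(lm(wt ~ cyl + disp + hp + drat + qsec + factor(vs) +
                    factor(am) + gear + carb, data = mtcars))$r.squared
1 / (1 - r2_wt)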
fitvif <- lm(mpg~cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars)
vif(fitvif)
## cyl disp hp drat wt qsec
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958
## factor(vs) factor(am) gear carb
## 4.965873 4.648487 5.357452 7.908747
To present the result more clearly, we will use the kable function.
kable(vif(fitvif), align = "c")
| Variable | VIF |
|---|---|
| cyl | 15.373833 |
| disp | 21.620241 |
| hp | 9.832037 |
| drat | 3.374620 |
| wt | 15.164887 |
| qsec | 7.527958 |
| factor(vs) | 4.965873 |
| factor(am) | 4.648487 |
| gear | 5.357452 |
| carb | 7.908747 |
VIF values greater than 10 are considered large, so we should keep only one of each group of highly correlated variables in the model. To decide which variables to keep, we will perform stepwise model selection.
fit <- lm(mpg~cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars)
step <- stepAIC(fit, direction = "both", trace = FALSE)
summary(step)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## factor(am)1 2.935837 1.4109045 2.080819 4.671551e-02
summary(step)$r.squared
## [1] 0.8496636
So, besides am, the variables wt and qsec are strongly related to mpg. An R-squared of 0.8497 means the model explains about 84.97% of the variance in mpg. To confirm this selection, we compare nested models with ANOVA:
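As a check on this interpretation, R-squared can be recomputed from its definition, 1 - RSS/TSS (a minimal sketch; it should reproduce the 0.8496636 above):

# R^2 = 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(step)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - rss / tss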
fit1 <- lm(mpg ~ factor(am), data = mtcars)
fit2 <- lm(mpg ~ factor(am) + wt, data = mtcars)
fit3 <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)
anova(fit1, fit2, fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt
## Model 3: mpg ~ factor(am) + wt + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 73.203 2.673e-09 ***
## 3 28 169.29 1 109.03 18.034 0.0002162 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both wt and qsec significantly improve the fit over the transmission-only model. We will now add other variables to see whether the model improves further.
fit4 <- lm(mpg ~ factor(am) + wt + qsec + hp, data = mtcars)
fit5 <- lm(mpg ~ factor(am) + wt + qsec + hp + drat, data = mtcars)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt
## Model 3: mpg ~ factor(am) + wt + qsec
## Model 4: mpg ~ factor(am) + wt + qsec + hp
## Model 5: mpg ~ factor(am) + wt + qsec + hp + drat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 72.536 5.362e-09 ***
## 3 28 169.29 1 109.03 17.870 0.0002579 ***
## 4 27 160.07 1 9.22 1.511 0.2299925
## 5 26 158.64 1 1.43 0.234 0.6326111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values for fit4 and fit5 are not significant: adding hp and drat does not meaningfully reduce the residual sum of squares. So wt, am, and qsec are the only significant predictors to keep in the model.
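As a complementary check on the nested-model comparison, we can compare the five candidate models by AIC, which penalizes extra parameters (a quick sketch; lower is better, and fit3 should come out ahead of fit4 and fit5):

# AIC comparison of the candidate models
AIC(fit1, fit2, fit3, fit4, fit5)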
fit_final <- lm(mpg ~ wt + factor(am) + qsec, data = mtcars)
summary(fit_final)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## factor(am)1 2.935837 1.4109045 2.080819 4.671551e-02
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
We can answer the second research question with this table.

Question: Quantify the MPG difference between automatic and manual transmissions. Answer: On average, manual-transmission cars get 2.94 more MPG than automatic-transmission cars, holding wt and qsec constant.
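The point estimate alone does not convey the uncertainty, so we can also report a 95% confidence interval for the transmission coefficient from the final model (a quick sketch using confint; from the estimate of 2.94 and standard error of 1.41 above, the interval works out to roughly 0.05 to 5.83 MPG):

# 95% CI for the manual-vs-automatic difference, adjusted for wt and qsec
confint(fit_final)["factor(am)1", ]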
We recheck the VIFs for the final model:

fitvif <- lm(mpg ~ wt + factor(am) + qsec, data = mtcars)
kable(vif(fitvif), align = "c")
| Variable | VIF |
|---|---|
| wt | 2.482951 |
| factor(am) | 2.541437 |
| qsec | 1.364339 |

All VIFs are well below 10, so collinearity is not a concern in the final model.
Finally, we check the regression diagnostics for the final model:

# Residuals against each numeric predictor: check the linearity condition
plot(fit_final$residuals ~ mtcars$wt + mtcars$qsec, main = "Linearity Condition")
# Residuals vs fitted values: should look like random scatter around zero
plot(fit_final$residuals ~ fit_final$fitted, main = "Residuals vs Fitted")
# Absolute residuals vs fitted values: check constant variance
plot(abs(fit_final$residuals) ~ fit_final$fitted, main = "Absolute value of residuals vs. fitted")
# Normal Q-Q plot: check that residuals are approximately normal
qqnorm(fit_final$residuals)
qqline(fit_final$residuals)
# Residuals in observation order: check independence
plot(fit_final$residuals, main = "Independent residuals")

If the residual plots show no systematic pattern, the Q-Q points fall close to the reference line, and the residuals show no trend in observation order, the linear-model conditions are reasonably satisfied.
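Beyond the visual checks, a formal normality test of the residuals can supplement the Q-Q plot (a minimal sketch using the Shapiro-Wilk test from base R's stats package):

# Shapiro-Wilk test: a large p-value is consistent with normally distributed residuals
shapiro.test(residuals(fit_final))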