The purpose of the analysis is to explore the relationship between a set of variables in a cars dataset and the outcome, miles per gallon (MPG). There are two questions to be answered:
1- Is an automatic or manual transmission better for MPG?
2- Quantify the MPG difference between automatic and manual transmissions.
In this analysis, a simple linear model and a multivariable model are explored in turn. A stepwise search based on AIC selects a multivariable model whose R squared is much higher than that of the simple model.
Data Source:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
mtcars is a data frame containing 32 observations of 11 variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb.
1- mpg: Miles/(US) gallon
2- cyl: Number of cylinders
3- disp: Displacement
4- hp: Gross horsepower
5- drat: Rear axle ratio
6- wt: Weight (1000 lbs)
7- qsec: 1/4 mile time
8- vs: Engine shape (0 = V-shaped, 1 = straight)
9- am: Transmission (0 = automatic, 1 = manual)
10- gear: Number of forward gears
11- carb: Number of carburetors
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Test the null hypothesis that the mean MPG of automatic-transmission cars and manual-transmission cars is the same, using a t test.
fit<-lm(mpg~am,mtcars)
summary(fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
t.test(mpg~am,data=mtcars)$conf.int
## [1] -11.280194 -3.209684
## attr(,"conf.level")
## [1] 0.95
t.test(mpg~am,data=mtcars)$p.value
## [1] 0.001373638
1- When am=0 (automatic transmission), mpg=beta0_hat (the intercept): the average MPG is 17.147 miles/gallon.
2- When am=1 (manual transmission), mpg=beta0_hat + beta1_hat: the average MPG is 24.39231 miles/gallon, which is 7.244939 miles/gallon higher than for automatic cars.
3- The p-value of the t test is 0.00137, which is less than the significance level 0.05. We reject the null hypothesis in favor of the alternative hypothesis that mean MPG differs between automatic and manual transmission cars.
4- The 95% confidence interval for the difference in mean MPG (automatic minus manual) is [-11.280194, -3.209684]. The interval excludes zero, so at the 95% confidence level manual cars average between about 3.2 and 11.3 miles/gallon more than automatic cars.
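The link between the regression coefficients and the group means, and the reason the two p-values above differ, can be checked with a short sketch. It assumes only the built-in mtcars data; the object names group_means, welch and pooled are introduced here for illustration.

```r
data(mtcars)

# Mean MPG per transmission type: "0" = automatic, "1" = manual
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Simple regression on the 0/1 transmission indicator
fit <- lm(mpg ~ am, data = mtcars)
b <- coef(fit)

# The intercept is the automatic mean; intercept + slope is the manual mean
c(automatic = unname(b["(Intercept)"]),
  manual    = unname(b["(Intercept)"] + b["am"]))
group_means

# t.test() defaults to the Welch test (unequal variances);
# var.equal = TRUE gives the pooled-variance test instead
welch  <- t.test(mpg ~ am, data = mtcars)
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# The pooled p-value coincides with the p-value of the am coefficient in lm()
c(welch  = welch$p.value,
  pooled = pooled$p.value,
  lm     = summary(fit)$coefficients["am", "Pr(>|t|)"])
```

The Welch test does not assume equal variances in the two groups, which is why its p-value (0.00137) is not the same as the one reported for am in the regression table.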
summary(fit)$r.squared
## [1] 0.3597989
However, the R squared of this model is only 0.3597989, so transmission type alone explains about 36% of the variability in MPG. Most of the variability of the response is not explained by the simple linear model, and a multivariable model needs to be considered.
To select a model with a small set of useful predictors, we perform a stepwise search based on AIC, starting from the model that contains all ten predictors.
lm1<-lm(mpg~.,data=mtcars)
fit_final<-step(lm1)
The AIC value after each search step and the summary of the final model are shown below.
fit_final$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 21 147.4944 70.89774
## 2 - cyl 1 0.07987121 22 147.5743 68.91507
## 3 - vs 1 0.26852280 23 147.8428 66.97324
## 4 - carb 1 0.68546077 24 148.5283 65.12126
## 5 - gear 1 1.56497053 25 150.0933 63.45667
## 6 - drat 1 3.34455117 26 153.4378 62.16190
## 7 - disp 1 6.62865369 27 160.0665 61.51530
## 8 - hp 1 9.21946935 28 169.2859 61.30730
summary(fit_final)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## am 2.935837 1.4109045 2.080819 4.671551e-02
summary(fit_final)$adj.r.squared
## [1] 0.8335561
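The stepwise search above can be reproduced without the step-by-step trace, as a minimal sketch. The names full and reduced are introduced here for illustration; direction = "backward" is spelled out because that is what step() uses when no scope is given.

```r
data(mtcars)

# Start from the model with all ten predictors
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the printed search log
reduced <- step(full, direction = "backward", trace = 0)

# Predictors that survive the search
formula(reduced)

# step() scores models with extractAIC(); for lm fits on the same data this
# differs from AIC() only by an additive constant
c(full = extractAIC(full)[2], reduced = extractAIC(reduced)[2])

# R squared and adjusted R squared of the selected model
c(r2     = summary(reduced)$r.squared,
  adj_r2 = summary(reduced)$adj.r.squared)
```

The search minimizes AIC, not adjusted R squared, so the selected model is the best one found along this elimination path under the AIC criterion.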
Compare the base model, which has only am as a predictor, with the selected model above, which also contains the confounding variables wt and qsec.
anova(fit,fit_final)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 45.618 1.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
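The F statistic in the table above comes from the drop in residual sum of squares relative to the residual variance of the larger model. A minimal sketch of that arithmetic follows; the variable names are illustrative, and the two models are refit so the block runs on its own.

```r
data(mtcars)
fit       <- lm(mpg ~ am, data = mtcars)
fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Residual sums of squares and residual degrees of freedom of both models
rss_small <- deviance(fit)
rss_large <- deviance(fit_final)
df_small  <- df.residual(fit)
df_large  <- df.residual(fit_final)

# F = (reduction in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_large) / (df_small - df_large)) / (rss_large / df_large)

# Upper-tail probability of the F distribution
p_val <- pf(f_stat, df_small - df_large, df_large, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

With the rounded values from the table, (551.61 / 2) / (169.29 / 28) gives approximately 45.62, in agreement with the anova() output.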
The Final Model: mpg ~ wt + qsec + am
1- The adjusted R squared is 0.8335561, so the model explains about 83% of the variability in MPG after adjusting for the number of predictors, compared with an R squared of about 0.36 for the simple model.
2- The intercept, 9.617781, is the predicted MPG for a car with automatic transmission (am=0) at wt=0 and qsec=0. No such car exists, so the intercept has no direct physical interpretation.
3- The coefficient of am, 2.935837, is the estimated mean increase in MPG for manual transmission (am=1) relative to automatic transmission when wt and qsec are held constant.
The sum 9.617781+2.935837=12.55362 is the predicted MPG for a manual car at wt=0 and qsec=0, not the mean MPG of manual cars. The adjusted difference between the two transmission types is the coefficient 2.935837 itself.
4- The p-value for the variable ‘am’ is 0.0467, which is just below the significance level alpha=0.05.
We therefore reject the null hypothesis of no MPG difference between the two ‘am’ groups, automatic and manual transmission, after adjusting for wt and qsec, although the evidence is weaker than in the unadjusted comparison.
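Because the intercept refers to a car with wt=0 and qsec=0, predictions at realistic covariate values are easier to read. The sketch below uses a hypothetical "typical" car with sample-mean weight and quarter-mile time; that choice, and the name typical, are assumptions made for illustration.

```r
data(mtcars)
fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Hypothetical typical car: sample-mean wt and qsec, once per transmission type
typical <- data.frame(wt   = mean(mtcars$wt),
                      qsec = mean(mtcars$qsec),
                      am   = c(0, 1))

pred <- predict(fit_final, newdata = typical)
names(pred) <- c("automatic", "manual")
pred

# The gap between the two predictions equals the am coefficient
diff(pred)
coef(fit_final)["am"]

# 95% confidence interval for the adjusted transmission effect
confint(fit_final)["am", ]
```

The difference between the two predictions is the same at any fixed wt and qsec, since the model has no interaction terms.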
Residuals should be uncorrelated with the fitted values, and independent and identically distributed with mean zero. The diagnostic plots and influence measures below check these conditions.
par(mfrow=c(1,2))
plot(fit_final,which=1,lwd=3,col="red")
plot(fit_final,which=2,lwd=3,col="red")
influence<-dfbetas(fit_final)
head(sort(influence[,4],decreasing=TRUE),3)
## Chrysler Imperial Fiat 128 Toyota Corona
## 0.5626418 0.4765680 0.4050410
leverage<-hatvalues(fit_final)
head(sort(leverage,decreasing=TRUE),3)
## Merc 230 Lincoln Continental Chrysler Imperial
## 0.2970422 0.2642151 0.2296338
Result Analysis:
1- The points in the Residuals-Fitted plot are scattered without an obvious pattern, which is consistent with the independence and constant-variance conditions.
2- In the Normal Q-Q plot most points fall close to the line, indicating that the residuals are approximately normally distributed.
3- Toyota Corolla, Fiat 128 and Chrysler Imperial stand out in the Residuals-Fitted plot as the points that most affect its shape.
4- The points with the largest dfbetas() values for the am coefficient are Chrysler Imperial, Fiat 128 and Toyota Corona; the first two also stand out in the Residuals-Fitted plot.
5- Merc 230, Lincoln Continental and Chrysler Imperial are the points with the highest leverage according to hatvalues().
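The influence and leverage checks can be made slightly more systematic, as a rough sketch. Two assumptions are made here that are not in the analysis above: dfbetas are ranked by absolute value, so that large negative effects are not missed, and leverage is compared against the common 2p/n rule of thumb.

```r
data(mtcars)
fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Standardized change in the am coefficient when each car is left out,
# ranked by magnitude rather than by signed value
dfb_am <- dfbetas(fit_final)[, "am"]
head(sort(abs(dfb_am), decreasing = TRUE), 3)

# Leverage (diagonal of the hat matrix); the values sum to the number of coefficients
h <- hatvalues(fit_final)
p <- length(coef(fit_final))
n <- nrow(mtcars)

# Rule of thumb (an assumption, not a formal test): flag leverage above 2p/n
threshold <- 2 * p / n
h[h > threshold]
```

Here the threshold is 2 * 4 / 32 = 0.25, so of the three cars listed by hatvalues() above only Merc 230 and Lincoln Continental exceed it.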
The boxplot of mpg~am shows that the median MPG with manual transmission is higher than with automatic transmission, in line with the group means estimated above. The pairs plot shows the pairwise relationships among the variables ‘mpg’, ‘cyl’, ‘hp’, ‘wt’ and ‘am’.
boxplot(mpg~am,mtcars,col="lightblue",names=c("Automatic","Manual"))
pairs(mtcars[,c(1,2,4,6,9)],panel=panel.smooth)
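A boxplot draws medians and hinges, not means, so the numbers behind the plot can be inspected directly. The sketch below relies on boxplot() returning its statistics when plot = FALSE; the row and column labels are added here for readability.

```r
data(mtcars)

# Compute the boxplot statistics without drawing anything
bp <- boxplot(mpg ~ am, data = mtcars, plot = FALSE)

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
stats <- bp$stats
dimnames(stats) <- list(
  c("lower whisker", "lower hinge", "median", "upper hinge", "upper whisker"),
  c("Automatic", "Manual")
)
stats

# Group means, which the boxplot itself does not show
means <- sapply(split(mtcars$mpg, mtcars$am), mean)
names(means) <- c("Automatic", "Manual")
means
```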