Summary

The purpose of this analysis is to explore the relationship between a set of variables in a cars dataset and the outcome, miles per gallon (MPG). There are two questions to be answered:

1- Is an automatic or manual transmission better for MPG?
2- Quantify the MPG difference between automatic and manual transmissions.

In this analysis, a simple linear model and a multivariable model are explored in turn. The model selected by an AIC-based stepwise search explains much more of the variability in MPG (adjusted R-squared 0.83) than the simple model (R-squared 0.36).

Exploratory Data Analysis

Data Source:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
mtcars is a data frame containing 32 observations of 11 variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb.
1- mpg: Miles/(US) gallon
2- cyl: Number of cylinders
3- disp: Displacement
4- hp: Gross horsepower
5- drat: Rear axle ratio
6- wt: Weight (1000 lbs)
7- qsec: 1/4 mile time
8- vs: Engine (0=V-shaped, 1=straight)
9- am: Transmission (0=automatic, 1=manual)
10- gear: Number of forward gears
11- carb: Number of carburetors

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Fit the regression line with mpg as the outcome and am as the predictor

Use a t-test to test the null hypothesis that the mean MPG of automatic transmission and manual transmission cars is the same.

fit<-lm(mpg~am,mtcars)
summary(fit)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04
t.test(mpg~am,data=mtcars)$conf.int
## [1] -11.280194  -3.209684
## attr(,"conf.level")
## [1] 0.95
t.test(mpg~am,data=mtcars)$p.value
## [1] 0.001373638
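
The t-test and the regression above report different p-values (0.00137 versus 0.000285) because t.test() runs Welch's unequal-variance test by default, while lm() assumes a common variance in both groups. The following sketch compares the two; the object names welch, pooled and p_lm are illustrative and not part of the original analysis.

```r
# Welch t-test (R's default): does not assume equal variances in the two groups
welch <- t.test(mpg ~ am, data = mtcars)

# Pooled-variance t-test: assumes equal variances, as the linear model does
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# p-value of the am coefficient in the simple regression
fit <- lm(mpg ~ am, data = mtcars)
p_lm <- summary(fit)$coefficients["am", "Pr(>|t|)"]

# The pooled test and the regression should give the same p-value
c(welch = welch$p.value, pooled = pooled$p.value, lm = p_lm)
```

Either way the p-value is well below 0.05, so the conclusion does not depend on the variance assumption.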

1- When am=0 (Automatic Transmission), mpg=beta0_hat (the intercept), so the average mpg is 17.147 miles/gallon.
2- When am=1 (Manual Transmission), mpg=beta0_hat + beta1_hat, so the average mpg is 24.392 miles/gallon.
3- The p-value of the t-test is 0.00137, which is less than the significance level 0.05. We reject the null hypothesis in favour of the alternative that the mean MPG differs between automatic and manual transmission cars.
4- The 95% confidence interval for the difference in mean MPG (automatic minus manual) is [-11.280194, -3.209684]. The interval excludes zero: at the 95% confidence level, manual cars average between about 3.2 and 11.3 more miles/gallon than automatic cars.
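
With a single 0/1 predictor, the fitted coefficients are just the group means in disguise. A minimal sketch to check this, assuming only the built-in mtcars data (group_means, auto_mean and manual_mean are illustrative names):

```r
fit <- lm(mpg ~ am, data = mtcars)
beta <- coef(fit)

# Sample mean of mpg within each transmission group ("0" = automatic, "1" = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept = mean of the automatic group; intercept + slope = mean of the manual group
auto_mean <- unname(beta["(Intercept)"])
manual_mean <- unname(beta["(Intercept)"] + beta["am"])

rbind(from_data = group_means, from_model = c(auto_mean, manual_mean))
```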

summary(fit)$r.squared 
## [1] 0.3597989

However, the R-squared value of this model is only 0.3597989: transmission type alone explains about 36% of the variability in MPG, and the remaining 64% is left unexplained by the simple linear model. A multivariable model needs to be considered.
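
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand for the simple model; rss, tss and r2_manual are illustrative names.

```r
fit <- lm(mpg ~ am, data = mtcars)

# Residual sum of squares: variability left over after fitting the model
rss <- sum(resid(fit)^2)

# Total sum of squares: variability of mpg around its overall mean
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Share of the total variability explained by the model
r2_manual <- 1 - rss / tss
c(manual = r2_manual, from_summary = summary(fit)$r.squared)
```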

Multivariable Model Exploration

In order to select the best model with significant predictors, we perform the stepwise search based on AIC.

lm1<-lm(mpg~.,data=mtcars)
fit_final<-step(lm1)

The AIC value after each search step and the summary of the final model are shown below.

fit_final$anova
##     Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1        NA         NA        21   147.4944 70.89774
## 2  - cyl  1 0.07987121        22   147.5743 68.91507
## 3   - vs  1 0.26852280        23   147.8428 66.97324
## 4 - carb  1 0.68546077        24   148.5283 65.12126
## 5 - gear  1 1.56497053        25   150.0933 63.45667
## 6 - drat  1 3.34455117        26   153.4378 62.16190
## 7 - disp  1 6.62865369        27   160.0665 61.51530
## 8   - hp  1 9.21946935        28   169.2859 61.30730
summary(fit_final)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## am           2.935837  1.4109045  2.080819 4.671551e-02
summary(fit_final)$adj.r.squared
## [1] 0.8335561
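
The AIC column reported by step() for a linear model is n*log(RSS/n) + 2*k, where k is the number of estimated coefficients. As a rough check, the sketch below recomputes the AIC of the final model by hand. It passes trace = 0 to step() to suppress the step-by-step printout; the names full, k, aic_manual and aic_step are illustrative.

```r
# Backward stepwise search starting from the full model (printing suppressed)
full <- lm(mpg ~ ., data = mtcars)
fit_final <- step(full, trace = 0)

n <- nrow(mtcars)               # number of observations
rss <- deviance(fit_final)      # residual sum of squares of the final model
k <- length(coef(fit_final))    # number of estimated coefficients

# AIC on the scale used by step(), up to an additive constant
aic_manual <- n * log(rss / n) + 2 * k

# extractAIC() returns the equivalent degrees of freedom and the AIC
aic_step <- extractAIC(fit_final)[2]
c(manual = aic_manual, extractAIC = aic_step)
```

Each row of the table above is the AIC after dropping one more term. The search stops at mpg ~ wt + qsec + am because removing any remaining term would raise the AIC.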

Compare the base model, with only am as the predictor, against the best model obtained above, which also includes wt and qsec as possible confounders.

anova(fit,fit_final)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
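
The F statistic in this table compares the drop in residual sum of squares per added term with the residual variance of the larger model. A minimal sketch of the arithmetic, with illustrative variable names:

```r
fit <- lm(mpg ~ am, data = mtcars)
fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)

rss_small <- deviance(fit)         # RSS of the base model
rss_large <- deviance(fit_final)   # RSS of the final model
df_large <- df.residual(fit_final) # residual degrees of freedom of the final model
df_diff <- df.residual(fit) - df_large  # number of added terms

# F = (reduction in RSS per added term) / (residual variance of the larger model)
f_manual <- ((rss_small - rss_large) / df_diff) / (rss_large / df_large)
p_manual <- pf(f_manual, df_diff, df_large, lower.tail = FALSE)

c(F = f_manual, p = p_manual)
```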

Result Analysis:

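The Final Model: mpg ~ wt + qsec + am
1- The adjusted R-squared is 0.8335561, so the model explains about 83% of the variability in MPG, compared with an R-squared of 0.3597989 for the simple model. The ANOVA F-test above (p = 1.55e-09) confirms that adding wt and qsec significantly improves the fit.
2- The intercept, 9.617781, is the expected MPG for a car with automatic transmission (am=0) when wt=0 and qsec=0. No real car has zero weight or a zero quarter-mile time, so the intercept has no practical interpretation by itself.
3- The coefficient of am, 2.935837, is the expected increase in MPG for manual transmission (am=1) over automatic transmission when wt and qsec are held constant.
   This is much smaller than the unadjusted difference of 7.244939 from the simple model, so a large part of the raw difference is associated with weight and quarter-mile time rather than with transmission type.
4- The p-value for the variable ‘am’ is 0.0467, which is just below the significance level alpha=0.05.
   The null hypothesis of no difference in mpg between the two ‘am’ groups (automatic and manual transmission) is rejected after adjusting for wt and qsec, although the evidence is much weaker than in the simple model.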
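
Fitted values are easier to interpret at realistic predictor values. The sketch below predicts MPG for an automatic and a manual car that both have the sample-average weight and quarter-mile time, and also computes a 95% confidence interval for the am coefficient. The data frame typical and the other names are illustrative; choosing the sample means as the reference car is an assumption made for this example.

```r
fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Two hypothetical cars with average weight and quarter-mile time,
# differing only in transmission (0 = automatic, 1 = manual)
typical <- data.frame(wt = mean(mtcars$wt),
                      qsec = mean(mtcars$qsec),
                      am = c(0, 1))
pred <- predict(fit_final, newdata = typical)

# The gap between the two predictions equals the am coefficient
adjusted_diff <- unname(diff(pred))

# 95% confidence interval for the adjusted manual-vs-automatic difference
ci_am <- confint(fit_final, "am", level = 0.95)

pred
adjusted_diff
ci_am
```

The lower limit of the interval lies only slightly above zero, which matches a p-value just under 0.05.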

Residuals vs Fitted and Normal Q-Q plots

Residuals should be uncorrelated with the fitted values, and independent and identically distributed with mean zero.

par(mfrow=c(1,2))
plot(fit_final,which=1,lwd=3,col="red")
plot(fit_final,which=2,lwd=3,col="red")

influence<-dfbetas(fit_final)
head(sort(influence[,4],decreasing=TRUE),3)
## Chrysler Imperial          Fiat 128     Toyota Corona 
##         0.5626418         0.4765680         0.4050410
leverage<-hatvalues(fit_final)
head(sort(leverage,decreasing=TRUE),3)
##            Merc 230 Lincoln Continental   Chrysler Imperial 
##           0.2970422           0.2642151           0.2296338

Result Analysis:
1- The points in the Residuals vs Fitted plot are scattered randomly with no clear pattern, which is consistent with the independence condition.
2- The points in the Normal Q-Q plot mostly fall on the line, indicating that the residuals are approximately normally distributed.
3- Toyota Corolla, Fiat 128 and Chrysler Imperial stand out in the Residuals vs Fitted plot as the points with the largest residuals.
4- The largest dfbetas() values for the am coefficient belong to Chrysler Imperial, Fiat 128 and Toyota Corona, so dfbetas() agrees with the plot on Chrysler Imperial and Fiat 128.
5- Merc 230, Lincoln Continental and Chrysler Imperial are the points with the highest leverage according to hatvalues().
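
The lists above show only the three largest values and do not say whether those values are large. One common way to screen them is with rule-of-thumb cutoffs: leverage above 2p/n and |dfbetas| above 2/sqrt(n), where p is the number of coefficients and n the number of observations. These cutoffs are conventions assumed for this sketch, not part of the original analysis, and the variable names are illustrative.

```r
fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)
n <- nrow(mtcars)             # 32 observations
p <- length(coef(fit_final))  # 4 coefficients, including the intercept

# Leverage: flag hat values above the conventional 2p/n cutoff (0.25 here)
lev <- hatvalues(fit_final)
high_leverage <- lev[lev > 2 * p / n]

# Influence on the am coefficient: flag |dfbetas| above 2/sqrt(n)
dfb_am <- dfbetas(fit_final)[, "am"]
influential_am <- dfb_am[abs(dfb_am) > 2 / sqrt(n)]

sort(high_leverage, decreasing = TRUE)
sort(influential_am, decreasing = TRUE)
```

Under these cutoffs, Merc 230 and Lincoln Continental exceed the leverage threshold, while Chrysler Imperial (0.2296) falls just below it.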

Appendix

The boxplot of mpg~am and the pairs plot of the variables ‘mpg’, ‘cyl’, ‘hp’, ‘wt’ and ‘am’ are shown below. The boxplot shows that the median MPG is higher with manual transmission than with automatic transmission, in line with the difference in means found above.

boxplot(mpg~am,mtcars,col="lightblue",names=c("Automatic","Manual"))

pairs(mtcars[,c(1,2,4,6,9)],panel=panel.smooth)