This report explores the relationship between a set of variables and miles per gallon (MPG), guided by two questions: is an automatic or a manual transmission better for MPG, and how large is the MPG difference between the two?

Exploring the mtcars data

data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
?mtcars
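# Label the transmission type (am: 0 = automatic, 1 = manual)
mtcars$tm <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
library(ggplot2)  # provides qplot(), theme() and element_text()
qplot(mpg, facets = tm ~ ., data = mtcars, binwidth = 1, col = I("black"), fill = I("grey"), main = "Exploring whether manual or automatic transmission is better for mpg") + theme(plot.title = element_text(hjust = 0.5))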

# Mean mpg by transmission type
am_mpg <- mean(mtcars[mtcars$am == 0, ]$mpg)  # automatic
m_mpg <- mean(mtcars[mtcars$am == 1, ]$mpg)   # manual

From the histogram and the average mpg for automatic (\(17.1473684\) miles per gallon) and manual (\(24.3923077\) miles per gallon) transmissions, manual cars appear to be better for mpg. However, the difference may be due to other factors such as weight, number of cylinders, and gross horsepower rather than the transmission itself.

Simple linear regression model

Use am as the predictor and mpg as the outcome.

fit1<-lm(mpg~factor(tm),data = mtcars )
summary(fit1)$coef
##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)      17.147368   1.124603 15.247492 1.133983e-15
## factor(tm)manual  7.244939   1.764422  4.106127 2.850207e-04
fit1pvalue<-summary(fit1)$coef[2,4]
slopfit1<-summary(fit1)$coef[2,1]
interceptfit1<-summary(fit1)$coef[1,1]

The intercept \(17.1473684\) is the mean mpg for automatic transmission, and the slope \(7.2449393\) is the difference in mean mpg between manual and automatic transmission. This is because the predictor am takes only two values, 0 and 1, and is treated as a dummy variable. The p-value \(2.8502074\times 10^{-4}\) is much less than 0.05, so in this unadjusted model the higher mean mpg of manual transmission cars is statistically significant.
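As a quick sanity check of this reading, the sketch below compares the two coefficients with the group means directly. It refits the model on am itself (an assumption made so the block runs without the tm column), and the names `mean_auto`, `mean_manual`, and `fit_check` are illustrative only.

```r
data("mtcars")

# Mean mpg per transmission type (am: 0 = automatic, 1 = manual)
mean_auto <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

# Same model as fit1, built from am directly so the block is self-contained
fit_check <- lm(mpg ~ factor(am), data = mtcars)
coefs <- unname(coef(fit_check))

# The intercept should match the automatic mean,
# and the slope should match manual mean minus automatic mean
intercept_matches <- isTRUE(all.equal(coefs[1], mean_auto))
slope_matches <- isTRUE(all.equal(coefs[2], mean_manual - mean_auto))

print(c(intercept = intercept_matches, slope = slope_matches))
```

Both comparisons use `all.equal()` rather than `==` so that floating-point rounding does not produce a spurious mismatch.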

Multivariable regression models

A model based on a single variable does not seem sufficient to predict and explain mpg, so next we use more variables to explain it.

Fitting all the variables

# Use only the original columns; the added factor tm duplicates am and is left out.
fit2<-lm(mpg~cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb,data = mtcars)
summary(fit2)$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## factor(vs)1  0.31776281  2.10450861  0.1509915 0.88142347
## factor(am)1  2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

Since the p-values of all the variables are larger than 0.05, no variable is individually a significant predictor of mpg in this full model. Next, we use stepwise variable selection to choose the variables we need.
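One common reason for a full model in which no single term is significant is strong correlation among the predictors. The sketch below looks at pairwise correlations for a few of them. The choice of columns and the names `pred_cor` and `high_pairs` are illustrative assumptions, not part of the original analysis.

```r
data("mtcars")

# Pairwise correlations among a subset of the predictors used in fit2
pred_cor <- cor(mtcars[, c("cyl", "disp", "hp", "wt")])
print(round(pred_cor, 2))

# Collect the pairs whose absolute correlation exceeds 0.8
high <- which(abs(pred_cor) > 0.8 & upper.tri(pred_cor), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(pred_cor)[high[, "row"]],
  var2 = colnames(pred_cor)[high[, "col"]],
  r = round(pred_cor[high], 2)
)
print(high_pairs)
```

The 0.8 cutoff is only a rule of thumb chosen for this illustration.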

library(MASS)  # not strictly needed here: step() comes from the stats package
stepfinal <- step(fit2, direction = "backward", trace = FALSE)
summary(stepfinal)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## factor(am)1  2.935837  1.4109045  2.080819 4.671551e-02

The p-values of the three selected coefficients (wt, qsec, and am) are all less than 0.05, although the intercept's is not, so each of these variables contributes significantly to explaining mpg. We can also analyze nested models and the VIF of the final model.
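By default `step()` compares models by AIC rather than by p-values, so the selected model is the one the backward search could not improve on in AIC terms. A minimal sketch of that comparison, refitting the full model under the illustrative names `full_fit` and `step_fit`:

```r
data("mtcars")

# Full model, same terms as fit2
full_fit <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + factor(vs) +
                 factor(am) + gear + carb, data = mtcars)

# Backward elimination driven by AIC (the default criterion, k = 2)
step_fit <- step(full_fit, direction = "backward", trace = FALSE)

# Terms kept by the backward search
kept_terms <- attr(terms(step_fit), "term.labels")
print(kept_terms)

# AIC of the full and the selected model (lower is better)
print(c(full = AIC(full_fit), selected = AIC(step_fit)))
```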

fit01<-lm(mpg~factor(am),data = mtcars)
fit02<-lm(mpg~factor(am)+qsec, data = mtcars)
fit03<-lm(mpg~factor(am)+qsec+wt, data = mtcars)
fit04<-lm(mpg~factor(am)+qsec+wt+hp, data = mtcars)
fit05<-lm(mpg~factor(am)+qsec+wt+hp+drat, data = mtcars)
fit06<-lm(mpg~factor(am)+qsec+wt+hp+drat+disp,data = mtcars)
anova(fit01, fit02, fit03, fit04,fit05,fit06)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + qsec
## Model 3: mpg ~ factor(am) + qsec + wt
## Model 4: mpg ~ factor(am) + qsec + wt + hp
## Model 5: mpg ~ factor(am) + qsec + wt + hp + drat
## Model 6: mpg ~ factor(am) + qsec + wt + hp + drat + disp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 352.63  1    368.26 61.3391 3.453e-08 ***
## 3     28 169.29  1    183.35 30.5389 9.615e-06 ***
## 4     27 160.07  1      9.22  1.5356    0.2268    
## 5     26 158.64  1      1.43  0.2378    0.6300    
## 6     25 150.09  1      8.55  1.4233    0.2441    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the nested model comparison, adding qsec and then wt to am each significantly improves the fit (p < 0.001 for both steps), whereas adding hp, drat, or disp on top of am, qsec, and wt does not (p-values of 0.23, 0.63, and 0.24). So the final model is as follows.
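To make the F column of the table less of a black box, the sketch below recomputes it by hand. It assumes the usual behavior of `anova()` on a sequence of `lm` fits, where each step's drop in residual sum of squares is divided by the residual mean square of the largest model. The names `m1` to `m6` and `f_by_hand` are illustrative.

```r
data("mtcars")

# The same nested sequence as fit01 ... fit06
m1 <- lm(mpg ~ factor(am), data = mtcars)
m2 <- lm(mpg ~ factor(am) + qsec, data = mtcars)
m3 <- lm(mpg ~ factor(am) + qsec + wt, data = mtcars)
m4 <- lm(mpg ~ factor(am) + qsec + wt + hp, data = mtcars)
m5 <- lm(mpg ~ factor(am) + qsec + wt + hp + drat, data = mtcars)
m6 <- lm(mpg ~ factor(am) + qsec + wt + hp + drat + disp, data = mtcars)
tab <- anova(m1, m2, m3, m4, m5, m6)

# Residual sum of squares of each model (deviance() returns RSS for lm fits)
rss <- sapply(list(m1, m2, m3, m4, m5, m6), deviance)

# Residual mean square of the largest model, used as the common denominator
sigma2_big <- deviance(m6) / df.residual(m6)

# Each step adds one parameter, so F = (drop in RSS) / 1 / sigma2_big
f_by_hand <- -diff(rss) / sigma2_big
print(round(f_by_hand, 4))
```

For the first step this gives 368.26 / (150.09 / 25), which is about 61.34 and matches the table.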

Final model

finalfit <- lm(mpg ~ wt+qsec+factor(am), data = mtcars)
summary(finalfit)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## factor(am)1  2.935837  1.4109045  2.080819 4.671551e-02
finalr<-summary(finalfit)$r.squared
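Before interpreting the coefficient for am, it can help to look at its uncertainty as well as its point estimate. The sketch below computes a 95% confidence interval with `confint()`. The names `final_fit` and `ci` are illustrative, and the model is refitted so the block runs on its own.

```r
data("mtcars")

# Same specification as finalfit
final_fit <- lm(mpg ~ wt + qsec + factor(am), data = mtcars)

# 95% confidence intervals for all coefficients
ci <- confint(final_fit, level = 0.95)

# Interval for the manual-vs-automatic difference, adjusted for wt and qsec
print(ci["factor(am)1", ])
```

Because the p-value for am is just below 0.05, the lower end of this interval sits just above zero.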

From the coefficient for am, manual transmission cars get on average 2.94 more miles per gallon than automatic transmission cars, holding weight and qsec fixed. However, this effect is much smaller than the 7.24 mpg difference found when we did not adjust for weight and qsec.

Regression diagnostics

In this section, the VIF and the residual plots are given.

library(car)
vif(finalfit)
##         wt       qsec factor(am) 
##   2.482952   1.364339   2.541437
plot(finalfit)

The VIFs of all the variables are less than 5, which suggests no serious multicollinearity, and the normal Q-Q plot suggests that the residuals are approximately normally distributed.
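For readers without the car package, the VIF values can be reproduced in base R from the definition \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. This is a sketch under the assumption that each term has a single degree of freedom, as is the case here. The name `vif_by_hand` is illustrative.

```r
data("mtcars")

# R^2 from regressing each predictor of the final model on the other two
r2_wt <- summary(lm(wt ~ qsec + am, data = mtcars))$r.squared
r2_qsec <- summary(lm(qsec ~ wt + am, data = mtcars))$r.squared
r2_am <- summary(lm(am ~ wt + qsec, data = mtcars))$r.squared

# Variance inflation factor: 1 / (1 - R_j^2)
vif_by_hand <- 1 / (1 - c(wt = r2_wt, qsec = r2_qsec, am = r2_am))
print(round(vif_by_hand, 6))
```

These values should agree with the `vif(finalfit)` output shown above.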