Introduction

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG), which is the outcome. They are particularly interested in the following two questions:

1.- "Is an automatic or manual transmission better for MPG?"
2.- "Quantify the MPG difference between automatic and manual transmissions."

data(mtcars)
mtcars2 <- mtcars #We will use this later

The following are the variables in the mtcars data set:

names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Automatic vs Manual transmission

As we can see, the transmission type is stored in the variable am: 0 means that the car has an automatic transmission, whereas 1 means that it has a manual transmission. We convert it to a factor with descriptive labels:

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic transmission", "Manual transmission")

We will make a boxplot to compare the mpg of automatic and manual transmission cars. Looking at the boxplot, cars with manual transmission tend to have higher mpg than cars with automatic transmission.

boxplot(mpg ~ am, data = mtcars, col = "red", ylab = "Miles per gallon (mpg)")
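The boxplot shows the difference visually but does not put numbers on it. As a minimal sketch, the group sizes, means and medians can be computed directly. The working copy `cars` and the names `group_n`, `group_mean` and `group_median` are introduced here only for illustration.

```r
# Self-contained: reload the data and recode am as in the text.
data(mtcars)
cars <- mtcars  # hypothetical working copy, so mtcars itself is left alone
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic transmission", "Manual transmission"))

# Number of cars, mean mpg and median mpg per transmission type.
group_n      <- table(cars$am)
group_mean   <- tapply(cars$mpg, cars$am, mean)
group_median <- tapply(cars$mpg, cars$am, median)

print(group_n)
print(round(group_mean, 2))
print(group_median)
```

The means printed here are the same two group means that the t-test reports as sample estimates.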

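To check whether this difference could be due to chance, we take as the null hypothesis that the mean mpg is the same for both transmission types, and we perform a Welch two-sample t-test. The obtained p-value is 0.001374. As this value is smaller than our significance level (0.05), we reject the null hypothesis: manual cars have a higher mean mpg than automatic cars (24.39 vs 17.15, a difference of about 7.24 mpg, with a 95% confidence interval of 3.21 to 11.28). Note that the t-test does not hold the other variables equal, so by itself it does not show that, for example, of two cars with the same horsepower the manual one will have better mpg. That question requires a model that adjusts for the other variables.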

t.test(mtcars$mpg ~ mtcars$am, conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic transmission    mean in group Manual transmission 
##                             17.14737                             24.39231
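To see where the numbers in this output come from, the Welch statistic can be reproduced by hand. This is only a sketch using the standard Welch formulas; the variable names (`auto`, `manual`, `se_diff`, `df_welch`, and so on) are chosen here for illustration.

```r
data(mtcars)

# Split mpg by transmission type (am is 0/1 in the raw data).
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Difference in means, manual minus automatic.
mean_diff <- mean(manual) - mean(auto)

# Per-group variance of the mean, and standard error of the difference.
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)
se_diff  <- sqrt(v_auto + v_manual)

# t.test reports automatic minus manual, hence the negative sign.
t_stat <- -mean_diff / se_diff

# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(mean_diff = mean_diff, t = t_stat, df = df_welch, p = p_value))
```

The value of `mean_diff` is the quantity of interest for the second question when no other variable is taken into account.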

Creating a linear model

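To choose the predictors for the model, we look at the correlation of each variable with mpg and keep those with a strong correlation, together with am, which is the variable of interest. The correlations are computed on mtcars2, the untouched copy of the data in which am is still numeric.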
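One way to make "most correlated" concrete is to rank the predictors by the absolute value of their correlation with mpg. The sketch below does this on the raw data; the names `r` and `ranked` are introduced here for illustration, and no particular cut-off is implied.

```r
data(mtcars)

# Correlation of mpg with every other column (am is numeric 0/1 here).
# cor() returns a 1 x 10 matrix; [1, ] turns it into a named vector.
r <- cor(mtcars$mpg, mtcars[, -1])[1, ]

# Sort by strength of association, ignoring the sign.
ranked <- sort(abs(r), decreasing = TRUE)
print(round(ranked, 3))
```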

cor(mtcars2$mpg,mtcars2[,-1])
##            cyl       disp         hp      drat         wt     qsec
## [1,] -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684
##             vs        am      gear       carb
## [1,] 0.6640389 0.5998324 0.4802848 -0.5509251
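mod <- lm(mpg ~ am + vs + drat + wt + hp + cyl, data = mtcars)  # am plus the selected predictors
m1 <- lm(mpg ~ am, data = mtcars)  # baseline: transmission type only
# F-test: do the extra predictors improve on the baseline?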
anova(m1,mod)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + vs + drat + wt + hp + cyl
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     25 166.84  5    554.06 16.605 3.033e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mod)
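The F statistic in the table above can be rebuilt from the residual sums of squares of the two nested models. This is a sketch of the standard nested-model F-test; the copy `cars` and the names `rss_small`, `rss_large`, `df_extra` and `f_stat` are introduced here for illustration.

```r
data(mtcars)
cars <- mtcars  # hypothetical working copy
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic transmission", "Manual transmission"))

m1  <- lm(mpg ~ am, data = cars)
mod <- lm(mpg ~ am + vs + drat + wt + hp + cyl, data = cars)

# Residual sum of squares of each model.
rss_small <- sum(resid(m1)^2)
rss_large <- sum(resid(mod)^2)

# Number of extra parameters in the larger model.
df_extra <- df.residual(m1) - df.residual(mod)

# F = (drop in RSS per extra parameter) / (residual variance of larger model).
f_stat  <- ((rss_small - rss_large) / df_extra) / (rss_large / df.residual(mod))
p_value <- pf(f_stat, df_extra, df.residual(mod), lower.tail = FALSE)

print(c(rss_small = rss_small, rss_large = rss_large, F = f_stat, p = p_value))
```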

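The F-test (p-value 3.033e-07) indicates that the extended model fits the data significantly better than the model with am alone. We can also create the model using all the variables: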

mod2 <- lm(mpg ~ ., data = mtcars)
anova(m1,mod2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     21 147.49  9     573.4 9.0711 1.779e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mod2)

summary(mod)
## 
## Call:
## lm(formula = mpg ~ am + vs + drat + wt + hp + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5421 -1.5787 -0.4003  1.3326  5.4488 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           31.34852    9.01121   3.479  0.00186 **
## amManual transmission  1.83252    1.76168   1.040  0.30820   
## vs                     1.19317    1.84800   0.646  0.52438   
## drat                   0.40474    1.51180   0.268  0.79111   
## wt                    -2.50419    0.96337  -2.599  0.01545 * 
## hp                    -0.02660    0.01437  -1.850  0.07611 . 
## cyl                   -0.32673    0.85544  -0.382  0.70573   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.583 on 25 degrees of freedom
## Multiple R-squared:  0.8518, Adjusted R-squared:  0.8163 
## F-statistic: 23.96 on 6 and 25 DF,  p-value: 3.139e-09
summary(mod2)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           12.30337   18.71788   0.657   0.5181  
## cyl                   -0.11144    1.04502  -0.107   0.9161  
## disp                   0.01334    0.01786   0.747   0.4635  
## hp                    -0.02148    0.02177  -0.987   0.3350  
## drat                   0.78711    1.63537   0.481   0.6353  
## wt                    -3.71530    1.89441  -1.961   0.0633 .
## qsec                   0.82104    0.73084   1.123   0.2739  
## vs                     0.31776    2.10451   0.151   0.8814  
## amManual transmission  2.52023    2.05665   1.225   0.2340  
## gear                   0.65541    1.49326   0.439   0.6652  
## carb                  -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

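It can be seen that the first model explains 85.18% of the variance in mpg (multiple R-squared) and the second one 86.9%. The adjusted R-squared, which penalizes the extra predictors, is slightly higher for the first model (0.8163 vs 0.8066), so adding all the variables does not really improve the fit. Once the other variables are taken into account, the estimated advantage of a manual transmission shrinks from about 7.24 mpg in the t-test to 1.83 mpg in the first model and 2.52 mpg in the second, and in neither model is it statistically significant at the 0.05 level (p-values 0.308 and 0.234).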
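To quantify the adjusted manual vs automatic difference together with its uncertainty, the am coefficient and its 95% confidence interval can be pulled out of both models. This is a sketch only; the helper `fit_summary`, the copy `cars` and the table `results` are hypothetical names introduced here.

```r
data(mtcars)
cars <- mtcars  # hypothetical working copy
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic transmission", "Manual transmission"))

mod  <- lm(mpg ~ am + vs + drat + wt + hp + cyl, data = cars)
mod2 <- lm(mpg ~ ., data = cars)

# Name of the am coefficient as it appears in the summaries.
am_term <- "amManual transmission"

# Hypothetical helper: fit quality plus the am estimate and its 95% CI.
fit_summary <- function(fit) {
  s  <- summary(fit)
  ci <- confint(fit, parm = am_term, level = 0.95)
  c(r2          = s$r.squared,
    adj_r2      = s$adj.r.squared,
    am_estimate = unname(coef(fit)[am_term]),
    ci_low      = unname(ci[1, 1]),
    ci_high     = unname(ci[1, 2]))
}

results <- rbind(mod = fit_summary(mod), mod2 = fit_summary(mod2))
print(round(results, 4))
```

A confidence interval that contains 0 corresponds to an am coefficient that is not significant at the 0.05 level in that model.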