library(ggplot2)
library(ggthemes)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Executive summary

The mtcars dataset was extracted from the 1974 Motor Trend US magazine. This report analyzes the relationship between transmission type and fuel efficiency in miles per gallon (MPG).

The analysis covers two tasks: determine whether an automatic or manual transmission is better for MPG, and quantify the MPG difference between automatic and manual transmissions.

The dataset contains 11 columns:

* mpg: Miles per (US) gallon, the distance traveled on one gallon of fuel.
* cyl: Number of cylinders.
* disp: Engine displacement (cu. in.).
* hp: Gross horsepower.
* drat: Rear axle ratio.
* wt: Weight of the car (1000 lbs).
* qsec: 1/4 mile time.
* vs: Engine cylinder arrangement (0 = V-shaped, 1 = straight).
* am: Transmission (0 = automatic, 1 = manual).
* gear: Number of forward gears.
* carb: Number of carburetors.

Load Data

We need to transform some columns from numeric to factor type.

data("mtcars")
as.factor(mtcars$cyl)->mtcars$cyl
as.factor(mtcars$vs)->mtcars$vs
as.factor(mtcars$am)->mtcars$am
as.factor(mtcars$gear)->mtcars$gear
as.factor(mtcars$carb)->mtcars$carb
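
The five conversions above can also be written as a single step. The following is a minimal sketch rather than the report's own code: it works on a copy with the illustrative name `cars`, so `mtcars` itself is left untouched, and `factor_cols` is likewise a name chosen only for this example.

```r
# Work on a copy so the original data frame is not modified.
cars <- datasets::mtcars

# Columns that hold categories rather than measurements.
factor_cols <- c("cyl", "vs", "am", "gear", "carb")

# Convert every listed column to a factor in one pass.
cars[factor_cols] <- lapply(cars[factor_cols], as.factor)

# Check the result: each listed column should now be a factor.
sapply(cars[factor_cols], is.factor)
```

Keeping the column names in one vector makes it easier to see at a glance which variables are treated as categorical.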

Exploratory Analysis

The structure of the dataset is shown below, followed by its first six and last six rows.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars);tail(mtcars)

The Welch two-sample t-test below gives a p-value of 0.001374, which is below the 0.05 significance level, so we reject the null hypothesis in favor of the alternative: the true difference in means is not equal to 0.

This means that one transmission type has a greater mean MPG than the other; in this case it is group 1 (manual transmission), with a mean of 24.39 against 17.15 for automatic.

t.test(mpg~am, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231
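
To quantify the difference rather than only test it, the pieces of the test result can be pulled out of the object returned by `t.test`. This is a small sketch under the assumption that the difference is wanted as manual minus automatic; the names `cars`, `tt`, `mpg_diff` and `ci_manual_minus_auto` are illustrative only.

```r
cars <- datasets::mtcars
tt <- t.test(mpg ~ am, data = cars)

# Group means: automatic (am = 0) first, manual (am = 1) second.
group_means <- unname(tt$estimate)

# Difference in mean mpg, manual minus automatic.
mpg_diff <- group_means[2] - group_means[1]

# t.test reports the interval for (group 0 - group 1), so negate it and
# reverse the order to express it as manual minus automatic.
ci_manual_minus_auto <- rev(-tt$conf.int)

round(mpg_diff, 2)              # about 7.24
round(ci_manual_minus_auto, 2)  # about 3.21 to 11.28
```

Without adjusting for any other variable, manual cars in this sample travel roughly 7.2 more miles per gallon on average.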

Regression Analysis

The variables mpg and am are the focus of the analysis. A simple regression shows that am is significant, but the R-squared indicates that the model explains only about 36% of the variance in mpg, so a multivariable regression is needed.

lm(mpg ~ am, data = mtcars)->lm1; summary(lm1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

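The intercept is kept in the regression because it is needed to compare MPG between the two transmission types: it estimates the mean MPG of automatic cars (17.147), and the am1 coefficient estimates how much higher the mean is for manual cars (7.245).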
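
The role of the intercept can be checked directly against the group means. The sketch below refits the simple model on a copy of the data (the names `cars`, `fit` and `means` are illustrative) and compares its coefficients with the per-group averages.

```r
cars <- datasets::mtcars
cars$am <- as.factor(cars$am)

fit <- lm(mpg ~ am, data = cars)

# Mean mpg within each transmission group.
means <- tapply(cars$mpg, cars$am, mean)

# The intercept equals the mean of the reference level (am = 0, automatic).
all.equal(unname(coef(fit)["(Intercept)"]), unname(means["0"]))

# The am1 coefficient equals the manual mean minus the automatic mean.
all.equal(unname(coef(fit)["am1"]), unname(means["1"] - means["0"]))
```

Both comparisons return TRUE, which is why dropping the intercept would change what the am coefficient means rather than improve the model.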

The multivariable regression explains approximately 89% of the variance in mpg, although the adjusted R-squared is lower, at about 78%.

No coefficient has a p-value below 0.05, so this model cannot tell us which variables are the most significant.

lm(mpg ~ ., data = mtcars )->lm2; summary(lm2)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vs1          1.93085    2.87126   0.672   0.5115  
## am1          1.21212    3.21355   0.377   0.7113  
## gear4        1.11435    3.79952   0.293   0.7733  
## gear5        2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124
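
The gap between the multiple and the adjusted R-squared comes from the number of coefficients relative to the sample size. As a hedged illustration, the adjusted value can be recomputed by hand from the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1); the names `cars`, `full`, `n` and `p` are chosen for this example only.

```r
cars <- datasets::mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], as.factor)

full <- lm(mpg ~ ., data = cars)
s <- summary(full)

n <- nrow(cars)              # 32 observations
p <- length(coef(full)) - 1  # 16 coefficients besides the intercept

# Adjusted R-squared penalizes R-squared for the number of coefficients.
adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

# Should agree with the value reported by summary().
all.equal(adj, s$adj.r.squared)
```

With 16 coefficients estimated from 32 cars, only 15 residual degrees of freedom remain, so the penalty is substantial.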

We use the step function to search for the model with the lowest AIC and, with it, the most relevant variables. The selected model and its variables are shown below.

step(lm2, trace = 0)->sp1
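
For readers who want to see what the selection does, here is a sketch of the same kind of search with the direction written out. When no `scope` is supplied, `step` searches backward from the model it is given, so `direction = "backward"` only makes that default explicit. The names `cars`, `full` and `reduced` are illustrative.

```r
cars <- datasets::mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], as.factor)

full <- lm(mpg ~ ., data = cars)

# Backward elimination: drop terms one at a time while the AIC improves.
reduced <- step(full, direction = "backward", trace = 0)

# The selected formula and the AIC of each model (lower is better).
formula(reduced)
AIC(full, reduced)
```

`AIC()` and the criterion used inside `step` differ by a constant that depends only on the sample size, so the two agree on which model ranks lower.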

The new model has 4 variables: number of cylinders (cyl), gross horsepower (hp), weight (wt) and transmission (am). Its R-squared is approximately 87%, close to that of the full model with far fewer terms. The significant variables are also easier to identify: wt, hp and cyl6 have p-values below 0.05, while am1 (p = 0.206) does not.

summary(sp1)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## am1          1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
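
The am1 coefficient in this model is the MPG difference between manual and automatic cars after adjusting for cylinders, horsepower and weight. One way to express its uncertainty is a confidence interval. The sketch below refits the selected formula on a copy of the data (`cars`, `simple` and `fit` are illustrative names) and also compares it with the simple model through a nested F-test.

```r
cars <- datasets::mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], as.factor)

simple <- lm(mpg ~ am, data = cars)
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# 95% confidence interval for the adjusted manual-vs-automatic difference.
# The interval contains zero, in line with the am1 p-value of 0.206.
confint(fit, "am1", level = 0.95)

# Nested F-test: do cyl, hp and wt improve on the model with am alone?
anova(simple, fit)
```

Reading the interval alongside the point estimate of 1.81 gives a fuller picture than the p-value alone.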

Residuals and Diagnostics

The plots show that:

Conclusion

Appendix

# Boxplot of MPG by transmission type, with the individual cars overlaid.
ggplot(data = mtcars, aes(x = am, y = mpg, color = mpg)) +
     theme_tufte() +
     geom_boxplot(fill = "gray") +
     geom_jitter() +
     xlab("Transmission Type (0 = automatic, 1 = manual)") +
     ylab("Miles Per Gallon") +
     ggtitle("MPG vs Transmission type")

ggpairs(mtcars[c(1,2,4,6,9)], aes(colour = cyl, alpha=0.4))

par(mfrow = c(2,2))
plot(sp1 )
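
The diagnostic plots can be complemented with a few numeric checks. This is a sketch only, and it assumes the selected model is mpg ~ cyl + hp + wt + am as reported in the summary; the names `cars`, `fit`, `lev` and `infl` are illustrative, and no particular result is asserted here.

```r
cars <- datasets::mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], as.factor)

fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Shapiro-Wilk test of the normality of the residuals.
shapiro.test(residuals(fit))

# Leverage: the three cars with the largest hat values.
lev <- hatvalues(fit)
head(sort(lev, decreasing = TRUE), 3)

# Influence on the am1 coefficient: the three largest absolute dfbetas.
infl <- dfbetas(fit)[, "am1"]
head(sort(abs(infl), decreasing = TRUE), 3)
```

Cars that appear in both lists are the ones worth inspecting first, since they combine an unusual predictor profile with a visible effect on the transmission estimate.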