Executive Summary

In this report we use a data set describing a collection of cars to try to answer the following two questions:

This report includes five sections:

  1. Exploratory Analysis

  2. Model Selection Analysis

  3. Residual Plot and Diagnostic

  4. Conclusions

  5. Annexes

1. Exploratory Analysis

We will use the mtcars data set. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

A data frame with 32 observations on 11 variables:

Let’s see some of the values included in the data set:

data(mtcars)
print.data.frame(mtcars[1:3,])
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
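
As a quick check of the description above, the dimensions and column classes can be inspected directly. This is a minimal sketch; `cars` is an illustrative name (not used elsewhere in this report) for a private copy of the data, so the report's own `mtcars` object is left untouched.

```r
# Illustrative private copy of the built-in data set (the name `cars` is an assumption)
cars <- datasets::mtcars

dim(cars)            # number of observations (rows) and variables (columns)
sapply(cars, class)  # storage class of each column before any factor conversion
```

Seeing that every column is stored as a number is what motivates converting the categorical ones to factors before plotting or modelling.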

We will focus mainly on the relationship between mpg and am. Let’s convert some variables from numeric to factor class and draw a plot to check whether a relationship exists:

library(ggplot2)

mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

# Box plot
p <- ggplot(mtcars, aes(x=am, y=mpg,fill=am,group=am)) + 
  labs(title="Plot of MPG per Transmission Type",
       x="Transmission Type (0=automatic, 1=manual)", 
       y = "Miles/gallon") + 
  geom_boxplot() + theme(legend.position="none")
## Output in Annexes Section

According to the plot, there appears to be a difference in MPG between automatic and manual cars. Let’s run an inference test to validate that.
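
The visual impression from the box plot can be backed up with per-group numbers. The sketch below assumes the same coding as the plot axis (0 = automatic, 1 = manual); `cars`, `means` and `counts` are illustrative names that do not appear elsewhere in this report.

```r
# Illustrative private copy so the report's `mtcars` object is not modified
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

means  <- tapply(cars$mpg, cars$am, mean)  # mean MPG per transmission type
counts <- table(cars$am)                   # number of cars per transmission type

print(round(means, 2))
print(counts)
```

The group sizes are worth checking alongside the means, because unbalanced groups are one reason to prefer a test that does not assume equal variances.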

1.1 Inference Statistics Test

t_test <- t.test(mpg ~ am, data=mtcars)
## Output in Annexes Section

Because the p-value (0.001374) is < 0.05, we can reject the null hypothesis of equal means, so transmission type has a statistically significant association with MPG. The mean MPG for manual transmission cars (am=1) is 24.39, about 7.24 higher than for automatic cars (17.15).
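
The figures used in this interpretation can be read from the test object rather than copied from the printed output. A minimal sketch, using an illustrative copy named `cars` and an illustrative object name `tt` (both assumptions, not part of the original analysis):

```r
# Illustrative private copy so the report's `mtcars` object is not modified
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# t.test() runs Welch's test (unequal variances) unless var.equal = TRUE is set
tt <- t.test(mpg ~ am, data = cars)

tt$p.value                                  # p-value of the two-sided test
tt$estimate                                 # group means for am = 0 and am = 1
mean_diff <- unname(diff(tt$estimate))      # mean(am = 1) minus mean(am = 0)
mean_diff
tt$conf.int                                 # 95% CI for mean(am = 0) - mean(am = 1)
```

Note that the confidence interval is reported for automatic minus manual, which is why both of its bounds are negative while the difference quoted in the text is positive.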

2. Model Selection Analysis

To choose the best model, we will consider a Simple Linear Regression Model and a Multiple Linear Regression Model (Full and Stepwise Backward).

2.1. Simple Linear Regression Model

Let’s check the validity of a simple linear regression model using mpg ~ am:

s_model <- lm(mpg ~ am, data=mtcars)
## Output in Annexes Section

This model has a Residual Standard Error of 4.902 with 30 degrees of freedom, and the Adjusted R-squared is 0.3385, which means that the model explains only about 34% of the variance in MPG. So let’s consider including other variables in order to build a better model.
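
With a single two-level factor as predictor, the regression is a restatement of the group comparison: the intercept is the mean of the baseline group and the slope is the difference in means. The sketch below illustrates this and pulls the fit statistics from the summary object; `cars` and `fit` are illustrative names, not objects from the original analysis.

```r
# Illustrative private copy so the report's `mtcars` object is not modified
cars <- datasets::mtcars
cars$am <- factor(cars$am)

fit <- lm(mpg ~ am, data = cars)

coef(fit)                    # (Intercept) = mean MPG for am = 0; am1 = difference in means
summary(fit)$sigma           # residual standard error
summary(fit)$adj.r.squared   # adjusted R-squared

# The slope should match the raw difference in group means
group_means <- tapply(cars$mpg, cars$am, mean)
all.equal(unname(coef(fit)["am1"]), unname(group_means["1"] - group_means["0"]))
```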

2.2. Multiple Linear Regression Model

Let’s fit the full model and then use stepwise regression to find the best one:

2.2.a Full Model

The full model including all the variables has the following results:

f_model <- lm(mpg ~ . , data=mtcars)
## Output in Annexes Section

This model has a Residual Standard Error of 2.833 with 15 degrees of freedom, and the Adjusted R-squared is 0.779, which means that the model explains about 78% of the variance in MPG. That is much better than the previous one.

2.2.b Stepwise Backward Model

Finally, let’s use the stepwise backward regression method:

b_model <- step(f_model, direction="backward", trace=0)
## Output in Annexes Section

According to this, the best model is mpg ~ cyl + hp + wt + am, which has a Residual Standard Error of 2.41 with 26 degrees of freedom and an Adjusted R-squared of 0.8401, meaning that the model explains about 84% of the variance in MPG. We choose this model because it has the highest Adjusted R-squared and the lowest Residual Standard Error of the three.
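
To make the comparison easier to follow, the three fits can be placed side by side. This is a sketch under the same factor conversions as in Section 1; the names `cars`, `simple`, `full`, `backward` and `comparison` are illustrative and are not the objects used elsewhere in this report.

```r
# Illustrative private copy, with the same factor conversions as in Section 1
cars <- datasets::mtcars
for (v in c("am", "cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])

simple   <- lm(mpg ~ am, data = cars)
full     <- lm(mpg ~ ., data = cars)
# step() drops terms one at a time, using AIC as its default selection criterion
backward <- step(full, direction = "backward", trace = 0)

models <- list(simple = simple, full = full, backward = backward)
comparison <- data.frame(
  resid_se = sapply(models, function(m) summary(m)$sigma),          # residual standard error
  resid_df = sapply(models, df.residual),                           # residual degrees of freedom
  adj_r2   = sapply(models, function(m) summary(m)$adj.r.squared),  # adjusted R-squared
  aic      = sapply(models, AIC)                                    # lower is better
)
print(round(comparison, 4))
```

Adjusted R-squared and AIC both penalise extra parameters, which is why the full model can fit the sample closely and still rank below the smaller stepwise model.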

3. Residual Plot and Diagnostic

par(mfrow = c(2, 2))
##plot(b_model)
## Output in Annexes Section

Based on the residual plots, we can verify the following assumptions:

  * The Residuals vs Fitted plot shows no consistent pattern, supporting the independence assumption.

The above analysis suggests that the model meets the basic assumptions of linear regression and answers the questions.
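
The diagnostic plots can be complemented with the numbers behind the Residuals vs Leverage panel. The sketch below refits the selected formula on an illustrative copy of the data (`cars`, `fit`, `lev` and `cook` are assumed names) and lists the observations with the largest leverage and Cook's distance. It does not replace a visual check of the plots.

```r
# Illustrative private copy; only the factors used by the selected model are converted
cars <- datasets::mtcars
for (v in c("am", "cyl")) cars[[v]] <- factor(cars[[v]])

fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # influence of each observation on the fitted values

head(sort(lev,  decreasing = TRUE), 3)   # three highest-leverage cars
head(sort(cook, decreasing = TRUE), 3)   # three most influential cars
```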

4. Conclusions

Looking at the selected model (please check 5.2.2.b Stepwise Backward Model), we can see how mpg is affected by changes in cyl, hp and wt:
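
One way to read these effects is to tabulate each coefficient with its 95% confidence interval. This is a minimal sketch that refits the selected formula on an illustrative copy of the data; `cars`, `fit` and `coef_table` are assumed names, not objects from the analysis above.

```r
# Illustrative private copy; only the factors used by the selected model are converted
cars <- datasets::mtcars
for (v in c("am", "cyl")) cars[[v]] <- factor(cars[[v]])

fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Point estimates next to their 95% confidence intervals
coef_table <- cbind(estimate = coef(fit), confint(fit, level = 0.95))
print(round(coef_table, 3))
```

Each estimate is the expected change in mpg for a one-unit change in that variable (or for a change of factor level relative to the baseline), holding the other predictors fixed. An interval that spans zero, as the p-value for am1 in the annex output implies, means the effect is not distinguishable from zero at the 5% level once the other predictors are in the model.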

5. Annexes

Here you can find the output of the different functions and plots.

5.1. Plot of MPG per Transmission Type

p

5.1.1 Inference Statistics Test

t_test
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

5.2.1 Simple Linear Regression Model

summary(s_model)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

5.2.2.a Full Model

summary(f_model)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vs1          1.93085    2.87126   0.672   0.5115  
## am1          1.21212    3.21355   0.377   0.7113  
## gear4        1.11435    3.79952   0.293   0.7733  
## gear5        2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124

5.2.2.b Stepwise Backward Model

summary(b_model)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## am1          1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

5.3 Residual Plot and Diagnostic

par(mfrow = c(2, 2))
plot(b_model)