library(MASS)
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.1.3

The objective of this work is to analyze a data set of a collection of cars, (mtcars) to explore the relationship between a set of variables and miles per gallon (MPG) (outcome) and analyze which type of transmission automatic or manual transmission better for MPG. Also we are interested in Quantifying the MPG difference between automatic and manual transmissions.

Plan of analysis. 1. Read in data and search meaning of variables 2. Exploratory analysis 3. Select the best regression model 4. Do diagnostic of selected model 5. Conclusions.

Step 1. Read in data and search meaning of variables

library(MASS)
library(ISLR)
data(mtcars)

We see from appendix 1 that the data set that contains 32 observaciones and 11 variables from which cyl, am, vs, gear and carb are factor variables while mpg, disp, hp, drat, wt and qsec are continuous.

step 2 Exploratory analysis

Appendix 2 gives us a the exploratory analysis of each individual variable in relation with mpg. We present the scatter plots, simple linear regression between each variables with mpg. The continuous variables with significative regression coefficients with mpg are dspl, wt and horsepowers. The Manual transmission seems to have higher MPG than automatic transmission.

Step 3 Select the best regression model

To find the best fitting model a glm linear model is fit against the most important variables obtained in step2. Discrete variables are fed into the model as a factor to generate dummy variables, and we use a stepwise method to select the best model

stepfit <- glm(mpg ~ as.factor(am) + as.factor(cyl) + as.factor(gear) + disp + hp + drat + wt, data = mtcars)
modelfit <- stepAIC(stepfit, direction = 'both'); modelfit$anova
## Start:  AIC=161.91
## mpg ~ as.factor(am) + as.factor(cyl) + as.factor(gear) + disp + 
##     hp + drat + wt
## 
##                   Df Deviance    AIC
## - as.factor(gear)  2   150.10 158.27
## - drat             1   148.41 159.91
## - disp             1   149.66 160.18
## - as.factor(am)    1   151.25 160.51
## <none>                 148.40 161.91
## - as.factor(cyl)   2   174.77 163.14
## - hp               1   173.03 164.82
## - wt               1   174.42 165.07
## 
## Step:  AIC=158.27
## mpg ~ as.factor(am) + as.factor(cyl) + disp + hp + drat + wt
## 
##                   Df Deviance    AIC
## - drat             1   150.41 156.34
## - disp             1   150.81 156.42
## - as.factor(am)    1   157.42 157.79
## <none>                 150.10 158.27
## - as.factor(cyl)   2   175.67 159.30
## + as.factor(gear)  2   148.40 161.91
## - wt               1   182.38 162.50
## - hp               1   182.68 162.56
## 
## Step:  AIC=156.34
## mpg ~ as.factor(am) + as.factor(cyl) + disp + hp + wt
## 
##                   Df Deviance    AIC
## - disp             1   151.03 154.47
## <none>                 150.41 156.34
## - as.factor(am)    1   160.13 156.34
## - as.factor(cyl)   2   179.91 158.07
## + drat             1   150.10 158.27
## + as.factor(gear)  2   148.41 159.91
## - hp               1   182.87 160.59
## - wt               1   183.04 160.62
## 
## Step:  AIC=154.47
## mpg ~ as.factor(am) + as.factor(cyl) + hp + wt
## 
##                   Df Deviance    AIC
## <none>                 151.03 154.47
## - as.factor(am)    1   160.78 154.47
## - as.factor(cyl)   2   180.29 156.13
## + disp             1   150.41 156.34
## + drat             1   150.81 156.42
## + as.factor(gear)  2   149.66 158.18
## - hp               1   182.97 158.61
## - wt               1   197.20 161.00
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## mpg ~ as.factor(am) + as.factor(cyl) + as.factor(gear) + disp + 
##     hp + drat + wt
## 
## Final Model:
## mpg ~ as.factor(am) + as.factor(cyl) + hp + wt
## 
## 
##                Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                                       22   148.4018 161.9061
## 2 - as.factor(gear)  2 1.6986725        24   150.1005 158.2703
## 3            - drat  1 0.3082973        25   150.4088 156.3359
## 4            - disp  1 0.6167861        26   151.0256 154.4669

The results of an ANOVA test from the stepwise fit algorithm suggest a final model based on the transmission type compunded with the number of cylinders in the car’s engine (cyl), the horsepower of the car (hp) and the weight of the car (wt).

modelo_final <- glm(mpg ~ as.factor(am) + as.factor(cyl) + hp + wt, data = mtcars)
summary(modelo_final)$coefficients; summary(modelo_final$residuals)
##                    Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     33.70832390 2.60488618 12.940421 7.733392e-13
## as.factor(am)1   1.80921138 1.39630450  1.295714 2.064597e-01
## as.factor(cyl)6 -3.03134449 1.40728351 -2.154040 4.068272e-02
## as.factor(cyl)8 -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp              -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt              -2.49682942 0.88558779 -2.819404 9.081408e-03
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.9390 -1.2560 -0.4013  0.0000  1.1250  5.0510

The results of the final model suggest a much lower relationship, with a manual transmission attributable to a 1.8 mile per gallon lift when controlled for the other factors. However, the high p-value of 0.21 for the transmission factor means that the relationship as observed in this data is not significant.

More significant factors are found among the engine size, horsepower, and the weight of the car. Behind this factor, is a likely significant factor that the presence of a 6 cylindar engine as compared to a 4 cylindar baseline attributes a reduction of mpg by 3.0 miles per gallon.

Examination of the model fit and residual fit (Appendix 3 show that given the data and variables for analysis, incorporating the additional variable results in a significantly better explantory and predictive model.

step 4 REGRESSION DIAGNOSIS

In the appendix 3 we show The residual plots for diagnostics. There is no particular pattern in residuals vs fitted, scale-location, and residuals vs leverage plots. For QQ-plot, it seems that the normal asumption is fullfilled

step 5 Conclusions

The final model for the outcome variable obtained with stepwise regression is mpg ~ 33/70+1.8am1 -3.03cyl6 -2.17cyl8 - 0.03hp -2.50 wt this model is high significant with R-squared:0.8659. The significant variables for this model are: hp, wt, and cyl6, they have negative sign. The am is positive but insignificant. The manual transmission gets higer relationship than automatic one attributable to a 1.8 mile per gallon lift when controlled for the other factors. However, the high p-value of 0.21 for the transmission factor means that the relationship as observed in this data is not significant.

Appendix 1

names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Appendix 2: Single Variable Regression

attach(mtcars)
regplot=function(x,y,...){fit=lm(y~x,);plot(x,y,...);
abline(fit,col="green"); summary(fit)$coefficients: cor(x,y) }
regplot(disp,mpg, xlab="Displacement", ylab="mpg", title(sub="Rsquare= .7183433"), col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado

##  [1] 29.5998548 28.5998548 27.5998548 26.5998548 25.5998548 24.5998548
##  [7] 23.5998548 22.5998548 21.5998548 20.5998548 19.5998548 18.5998548
## [13] 17.5998548 16.5998548 15.5998548 14.5998548 13.5998548 12.5998548
## [19] 11.5998548 10.5998548  9.5998548  8.5998548  7.5998548  6.5998548
## [25]  5.5998548  4.5998548  3.5998548  2.5998548  1.5998548  0.5998548
## [31] -0.4001452
regplot(hp,mpg, xlab="Horsepower", ylab="mpg", title(sub="Rsquare= .6024373"),col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado

##  [1] 30.09886054 29.09886054 28.09886054 27.09886054 26.09886054
##  [6] 25.09886054 24.09886054 23.09886054 22.09886054 21.09886054
## [11] 20.09886054 19.09886054 18.09886054 17.09886054 16.09886054
## [16] 15.09886054 14.09886054 13.09886054 12.09886054 11.09886054
## [21] 10.09886054  9.09886054  8.09886054  7.09886054  6.09886054
## [26]  5.09886054  4.09886054  3.09886054  2.09886054  1.09886054
## [31]  0.09886054
regplot(drat,mpg, xlab="Rear axle ratio", ylab="mpg", title(sub="Rsquare= .4639952"),col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado

## [1] -7.5246184 -6.5246184 -5.5246184 -4.5246184 -3.5246184 -2.5246184
## [7] -1.5246184 -0.5246184  0.4753816
regplot(wt,mpg, xlab="Weight(lb/1000)", ylab="mpg", title(sub="Rsquare= .7528328"),col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado

##  [1] 37.2851262 36.2851262 35.2851262 34.2851262 33.2851262 32.2851262
##  [7] 31.2851262 30.2851262 29.2851262 28.2851262 27.2851262 26.2851262
## [13] 25.2851262 24.2851262 23.2851262 22.2851262 21.2851262 20.2851262
## [19] 19.2851262 18.2851262 17.2851262 16.2851262 15.2851262 14.2851262
## [25] 13.2851262 12.2851262 11.2851262 10.2851262  9.2851262  8.2851262
## [31]  7.2851262  6.2851262  5.2851262  4.2851262  3.2851262  2.2851262
## [37]  1.2851262  0.2851262 -0.7148738
regplot(qsec,mpg, xlab="1/4 mile time", ylab="mpg", title(sub="Rsquare= .1752963"),col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado

## [1] -5.1140382 -4.1140382 -3.1140382 -2.1140382 -1.1140382 -0.1140382
regplot2=function(x,y,...){x=factor(x);plot(x,y,...);fit=lm(y~x);summary(fit)}

regplot2(cyl,mpg, xlab="cyl", ylab="mpg", col="red", pch=20)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2636 -1.8357  0.0286  1.3893  7.2364 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
## x6           -6.9208     1.5583  -4.441 0.000119 ***
## x8          -11.5636     1.2986  -8.905 8.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared:  0.7325, Adjusted R-squared:  0.714 
## F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09
regplot2(vs,mpg, xlab="V/S", ylab="mpg", col="red", pch=20)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.757 -3.082 -1.267  2.828  9.383 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   16.617      1.080  15.390 8.85e-16 ***
## x1             7.940      1.632   4.864 3.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.581 on 30 degrees of freedom
## Multiple R-squared:  0.4409, Adjusted R-squared:  0.4223 
## F-statistic: 23.66 on 1 and 30 DF,  p-value: 3.416e-05
regplot2(am,mpg, xlab="Transmission (0 = automatic, 1 = manual)", ylab="mpg", col="red", pch=20)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## x1             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
regplot2(gear,mpg, xlab="Number of forward gears", ylab="mpg", col="red", pch=20)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7333 -3.2333 -0.9067  2.8483  9.3667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   16.107      1.216  13.250 7.87e-14 ***
## x4             8.427      1.823   4.621 7.26e-05 ***
## x5             5.273      2.431   2.169   0.0384 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.708 on 29 degrees of freedom
## Multiple R-squared:  0.4292, Adjusted R-squared:  0.3898 
## F-statistic:  10.9 on 2 and 29 DF,  p-value: 0.0002948
regplot2(carb,mpg, xlab="Number of carburetors", ylab="mpg", col="red", pch=20)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.243 -3.325  0.000  2.360  8.557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   25.343      1.854  13.670 2.21e-13 ***
## x2            -2.943      2.417  -1.218  0.23435    
## x3            -9.043      3.385  -2.672  0.01285 *  
## x4            -9.553      2.417  -3.952  0.00053 ***
## x6            -5.643      5.243  -1.076  0.29174    
## x8           -10.343      5.243  -1.973  0.05927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.905 on 26 degrees of freedom
## Multiple R-squared:  0.4445, Adjusted R-squared:  0.3377 
## F-statistic: 4.161 on 5 and 26 DF,  p-value: 0.006546

Appendix 3

layout(matrix(c(1,2,3,4),2,2))
plot(modelo_final)