library(MASS)
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.1.3
The objective of this work is to analyze a data set of a collection of cars, (mtcars) to explore the relationship between a set of variables and miles per gallon (MPG) (outcome) and analyze which type of transmission automatic or manual transmission better for MPG. Also we are interested in Quantifying the MPG difference between automatic and manual transmissions.
Plan of analysis. 1. Read in data and search meaning of variables 2. Exploratory analysis 3. Select the best regression model 4. Do diagnostic of selected model 5. Conclusions.
Step 1. Read in data and search meaning of variables
library(MASS)
library(ISLR)
data(mtcars)
We see from appendix 1 that the data set that contains 32 observaciones and 11 variables from which cyl, am, vs, gear and carb are factor variables while mpg, disp, hp, drat, wt and qsec are continuous.
step 2 Exploratory analysis
Appendix 2 gives us a the exploratory analysis of each individual variable in relation with mpg. We present the scatter plots, simple linear regression between each variables with mpg. The continuous variables with significative regression coefficients with mpg are dspl, wt and horsepowers. The Manual transmission seems to have higher MPG than automatic transmission.
Step 3 Select the best regression model
To find the best fitting model a glm linear model is fit against the most important variables obtained in step2. Discrete variables are fed into the model as a factor to generate dummy variables, and we use a stepwise method to select the best model
stepfit <- glm(mpg ~ as.factor(am) + as.factor(cyl) + as.factor(gear) + disp + hp + drat + wt, data = mtcars)
modelfit <- stepAIC(stepfit, direction = 'both'); modelfit$anova
## Start: AIC=161.91
## mpg ~ as.factor(am) + as.factor(cyl) + as.factor(gear) + disp +
## hp + drat + wt
##
## Df Deviance AIC
## - as.factor(gear) 2 150.10 158.27
## - drat 1 148.41 159.91
## - disp 1 149.66 160.18
## - as.factor(am) 1 151.25 160.51
## <none> 148.40 161.91
## - as.factor(cyl) 2 174.77 163.14
## - hp 1 173.03 164.82
## - wt 1 174.42 165.07
##
## Step: AIC=158.27
## mpg ~ as.factor(am) + as.factor(cyl) + disp + hp + drat + wt
##
## Df Deviance AIC
## - drat 1 150.41 156.34
## - disp 1 150.81 156.42
## - as.factor(am) 1 157.42 157.79
## <none> 150.10 158.27
## - as.factor(cyl) 2 175.67 159.30
## + as.factor(gear) 2 148.40 161.91
## - wt 1 182.38 162.50
## - hp 1 182.68 162.56
##
## Step: AIC=156.34
## mpg ~ as.factor(am) + as.factor(cyl) + disp + hp + wt
##
## Df Deviance AIC
## - disp 1 151.03 154.47
## <none> 150.41 156.34
## - as.factor(am) 1 160.13 156.34
## - as.factor(cyl) 2 179.91 158.07
## + drat 1 150.10 158.27
## + as.factor(gear) 2 148.41 159.91
## - hp 1 182.87 160.59
## - wt 1 183.04 160.62
##
## Step: AIC=154.47
## mpg ~ as.factor(am) + as.factor(cyl) + hp + wt
##
## Df Deviance AIC
## <none> 151.03 154.47
## - as.factor(am) 1 160.78 154.47
## - as.factor(cyl) 2 180.29 156.13
## + disp 1 150.41 156.34
## + drat 1 150.81 156.42
## + as.factor(gear) 2 149.66 158.18
## - hp 1 182.97 158.61
## - wt 1 197.20 161.00
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## mpg ~ as.factor(am) + as.factor(cyl) + as.factor(gear) + disp +
## hp + drat + wt
##
## Final Model:
## mpg ~ as.factor(am) + as.factor(cyl) + hp + wt
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 22 148.4018 161.9061
## 2 - as.factor(gear) 2 1.6986725 24 150.1005 158.2703
## 3 - drat 1 0.3082973 25 150.4088 156.3359
## 4 - disp 1 0.6167861 26 151.0256 154.4669
The results of an ANOVA test from the stepwise fit algorithm suggest a final model based on the transmission type compunded with the number of cylinders in the car’s engine (cyl), the horsepower of the car (hp) and the weight of the car (wt).
modelo_final <- glm(mpg ~ as.factor(am) + as.factor(cyl) + hp + wt, data = mtcars)
summary(modelo_final)$coefficients; summary(modelo_final$residuals)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## as.factor(am)1 1.80921138 1.39630450 1.295714 2.064597e-01
## as.factor(cyl)6 -3.03134449 1.40728351 -2.154040 4.068272e-02
## as.factor(cyl)8 -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt -2.49682942 0.88558779 -2.819404 9.081408e-03
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.9390 -1.2560 -0.4013 0.0000 1.1250 5.0510
The results of the final model suggest a much lower relationship, with a manual transmission attributable to a 1.8 mile per gallon lift when controlled for the other factors. However, the high p-value of 0.21 for the transmission factor means that the relationship as observed in this data is not significant.
More significant factors are found among the engine size, horsepower, and the weight of the car. Behind this factor, is a likely significant factor that the presence of a 6 cylindar engine as compared to a 4 cylindar baseline attributes a reduction of mpg by 3.0 miles per gallon.
Examination of the model fit and residual fit (Appendix 3 show that given the data and variables for analysis, incorporating the additional variable results in a significantly better explantory and predictive model.
step 4 REGRESSION DIAGNOSIS
In the appendix 3 we show The residual plots for diagnostics. There is no particular pattern in residuals vs fitted, scale-location, and residuals vs leverage plots. For QQ-plot, it seems that the normal asumption is fullfilled
step 5 Conclusions
The final model for the outcome variable obtained with stepwise regression is mpg ~ 33/70+1.8am1 -3.03cyl6 -2.17cyl8 - 0.03hp -2.50 wt this model is high significant with R-squared:0.8659. The significant variables for this model are: hp, wt, and cyl6, they have negative sign. The am is positive but insignificant. The manual transmission gets higer relationship than automatic one attributable to a 1.8 mile per gallon lift when controlled for the other factors. However, the high p-value of 0.21 for the transmission factor means that the relationship as observed in this data is not significant.
Appendix 1
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Appendix 2: Single Variable Regression
attach(mtcars)
regplot=function(x,y,...){fit=lm(y~x,);plot(x,y,...);
abline(fit,col="green"); summary(fit)$coefficients: cor(x,y) }
regplot(disp,mpg, xlab="Displacement", ylab="mpg", title(sub="Rsquare= .7183433"), col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado
## [1] 29.5998548 28.5998548 27.5998548 26.5998548 25.5998548 24.5998548
## [7] 23.5998548 22.5998548 21.5998548 20.5998548 19.5998548 18.5998548
## [13] 17.5998548 16.5998548 15.5998548 14.5998548 13.5998548 12.5998548
## [19] 11.5998548 10.5998548 9.5998548 8.5998548 7.5998548 6.5998548
## [25] 5.5998548 4.5998548 3.5998548 2.5998548 1.5998548 0.5998548
## [31] -0.4001452
regplot(hp,mpg, xlab="Horsepower", ylab="mpg", title(sub="Rsquare= .6024373"),col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado
## [1] 30.09886054 29.09886054 28.09886054 27.09886054 26.09886054
## [6] 25.09886054 24.09886054 23.09886054 22.09886054 21.09886054
## [11] 20.09886054 19.09886054 18.09886054 17.09886054 16.09886054
## [16] 15.09886054 14.09886054 13.09886054 12.09886054 11.09886054
## [21] 10.09886054 9.09886054 8.09886054 7.09886054 6.09886054
## [26] 5.09886054 4.09886054 3.09886054 2.09886054 1.09886054
## [31] 0.09886054
regplot(drat,mpg, xlab="Rear axle ratio", ylab="mpg", title(sub="Rsquare= .4639952"),col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado
## [1] -7.5246184 -6.5246184 -5.5246184 -4.5246184 -3.5246184 -2.5246184
## [7] -1.5246184 -0.5246184 0.4753816
regplot(wt,mpg, xlab="Weight(lb/1000)", ylab="mpg", title(sub="Rsquare= .7528328"),col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado
## [1] 37.2851262 36.2851262 35.2851262 34.2851262 33.2851262 32.2851262
## [7] 31.2851262 30.2851262 29.2851262 28.2851262 27.2851262 26.2851262
## [13] 25.2851262 24.2851262 23.2851262 22.2851262 21.2851262 20.2851262
## [19] 19.2851262 18.2851262 17.2851262 16.2851262 15.2851262 14.2851262
## [25] 13.2851262 12.2851262 11.2851262 10.2851262 9.2851262 8.2851262
## [31] 7.2851262 6.2851262 5.2851262 4.2851262 3.2851262 2.2851262
## [37] 1.2851262 0.2851262 -0.7148738
regplot(qsec,mpg, xlab="1/4 mile time", ylab="mpg", title(sub="Rsquare= .1752963"),col="red", pch=20)
## Warning in summary(fit)$coefficients:cor(x, y): expresión numérica tiene 8
## elementos: solo el primero es utilizado
## [1] -5.1140382 -4.1140382 -3.1140382 -2.1140382 -1.1140382 -0.1140382
regplot2=function(x,y,...){x=factor(x);plot(x,y,...);fit=lm(y~x);summary(fit)}
regplot2(cyl,mpg, xlab="cyl", ylab="mpg", col="red", pch=20)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2636 -1.8357 0.0286 1.3893 7.2364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
## x6 -6.9208 1.5583 -4.441 0.000119 ***
## x8 -11.5636 1.2986 -8.905 8.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
## F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
regplot2(vs,mpg, xlab="V/S", ylab="mpg", col="red", pch=20)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.757 -3.082 -1.267 2.828 9.383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.617 1.080 15.390 8.85e-16 ***
## x1 7.940 1.632 4.864 3.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.581 on 30 degrees of freedom
## Multiple R-squared: 0.4409, Adjusted R-squared: 0.4223
## F-statistic: 23.66 on 1 and 30 DF, p-value: 3.416e-05
regplot2(am,mpg, xlab="Transmission (0 = automatic, 1 = manual)", ylab="mpg", col="red", pch=20)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## x1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
regplot2(gear,mpg, xlab="Number of forward gears", ylab="mpg", col="red", pch=20)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7333 -3.2333 -0.9067 2.8483 9.3667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.107 1.216 13.250 7.87e-14 ***
## x4 8.427 1.823 4.621 7.26e-05 ***
## x5 5.273 2.431 2.169 0.0384 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.708 on 29 degrees of freedom
## Multiple R-squared: 0.4292, Adjusted R-squared: 0.3898
## F-statistic: 10.9 on 2 and 29 DF, p-value: 0.0002948
regplot2(carb,mpg, xlab="Number of carburetors", ylab="mpg", col="red", pch=20)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.243 -3.325 0.000 2.360 8.557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.343 1.854 13.670 2.21e-13 ***
## x2 -2.943 2.417 -1.218 0.23435
## x3 -9.043 3.385 -2.672 0.01285 *
## x4 -9.553 2.417 -3.952 0.00053 ***
## x6 -5.643 5.243 -1.076 0.29174
## x8 -10.343 5.243 -1.973 0.05927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.905 on 26 degrees of freedom
## Multiple R-squared: 0.4445, Adjusted R-squared: 0.3377
## F-statistic: 4.161 on 5 and 26 DF, p-value: 0.006546
Appendix 3
layout(matrix(c(1,2,3,4),2,2))
plot(modelo_final)