You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, the magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG?”
“Quantify the MPG difference between automatic and manual transmissions.”
# loading the data
data(mtcars)
# Renaming the am variable to Transmission (0 = automatic, 1 = manual)
names(mtcars)[names(mtcars) == 'am'] <- 'Transmission'
mtcars$Transmission <- ifelse(mtcars$Transmission == 0, 'Automatic', 'Manual')
mtcars$Transmission <- as.factor(mtcars$Transmission)
knitr::kable(mtcars,align='c')
|  | mpg | cyl | disp | hp | drat | wt | qsec | vs | Transmission | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | Manual | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | Manual | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | Manual | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | Automatic | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | Automatic | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | Automatic | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | Automatic | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | Automatic | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | Automatic | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | Automatic | 4 | 4 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | Automatic | 4 | 4 |
| Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | Automatic | 3 | 3 |
| Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | Automatic | 3 | 3 |
| Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | Automatic | 3 | 3 |
| Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | Automatic | 3 | 4 |
| Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | Automatic | 3 | 4 |
| Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | Automatic | 3 | 4 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | Manual | 4 | 1 |
| Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | Manual | 4 | 2 |
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | Manual | 4 | 1 |
| Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | Automatic | 3 | 1 |
| Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | Automatic | 3 | 2 |
| AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | Automatic | 3 | 2 |
| Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | Automatic | 3 | 4 |
| Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | Automatic | 3 | 2 |
| Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | Manual | 4 | 1 |
| Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | Manual | 5 | 2 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | Manual | 5 | 2 |
| Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | Manual | 5 | 4 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | Manual | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | Manual | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | Manual | 4 | 2 |
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## Transmission gear carb
## Automatic:19 Min. :3.000 Min. :1.000
## Manual :13 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :2.000
## Mean :3.688 Mean :2.812
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :8.000
We run a two-sample t-test to determine whether the two population means are equal; the one-sided alternative ('less') tests whether the mean MPG of automatic cars is lower than that of manual cars.
t.test( mpg ~ Transmission, paired=F, var.equal=F, data=mtcars, alternative='less')
##
## Welch Two Sample t-test
##
## data: mpg by Transmission
## t = -3.7671, df = 18.332, p-value = 0.0006868
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -3.913256
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
The small p-value indicates that the test is significant: we reject the null hypothesis that the two means are equal and accept the alternative hypothesis that the mean MPG of the Automatic group is less than the mean MPG of the Manual group.
Since our outcome variable is continuous and not binary, we are going to use linear regression models.
fit <- lm(mpg~Transmission,data = mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ Transmission, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## TransmissionManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Our initial model, which uses just one predictor, explains only 36% of the variability in mpg, which is quite low, as seen from the R-squared value. This indicates that we should include more predictors in our model, since they may be related to the variability observed in mpg.
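Note that with a single two-level factor, the regression simply reproduces the group means from the t-test: the intercept is the Automatic mean and the intercept plus the slope is the Manual mean. A quick sketch to confirm this, reusing the fit object from above:
# the coefficients recover the per-group means
coef(fit)[1]                                  # mean mpg, Automatic (17.15)
coef(fit)[1] + coef(fit)[2]                   # mean mpg, Manual (24.39)
tapply(mtcars$mpg, mtcars$Transmission, mean) # same means, computed directly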
Since one variable is not sufficient, we are going to examine all the variables available in the dataset and look for correlations between them and the outcome variable, mpg.
require(GGally)
# g <- ggpairs(mtcars, lower = list(continuous = wrap('smooth')))
# pairs plot: scatterplots with a smoother in the upper panels,
# correlations in the lower panels, densities on the diagonal
ggpairs(mtcars,
        upper = list(continuous = wrap('smooth')),
        lower = list(continuous = 'cor'),
        diag = list(continuous = 'density'),
        axisLabels = 'none')
## alternative code similar results (much faster)
# pairs(~., data = mtcars)
Looking at our pairs plot we can visually observe many connections and correlations between our variables. Unfortunately we can also see high correlation values between the predictors, which may bias our analysis if we omit them. These high correlations also suggest that including additional regressors would inflate the variance of the coefficient estimates (a high variance inflation factor). Despite this consequence we might still want to include certain variables, even if they dramatically inflate our variance.
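As a numeric companion to the plot, the correlation of each numeric variable with mpg can be listed directly (a quick sketch; Transmission is excluded because it is now a factor):
# correlation of every numeric variable with mpg, strongest first
num <- sapply(mtcars, is.numeric)
sort(cor(mtcars[, num])[, 'mpg'], decreasing = TRUE)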
# fitting a model with all the predictors
all <- lm(mpg~., data=mtcars)
#summary(all)
require(car)
## Loading required package: car
vif(all)
## cyl disp hp drat wt
## 15.373833 21.620241 9.832037 3.374620 15.164887
## qsec vs Transmission gear carb
## 7.527958 4.965873 4.648487 5.357452 7.908747
The results above confirm what the pairs plot suggested: high correlation between the variables (also known as multicollinearity), which causes variance inflation, as seen from the high VIF values (values greater than 10). Multicollinearity increases the standard errors of the coefficients, which in turn means that coefficients for some independent variables may be found not to be significantly different from 0. In other words, by inflating the standard errors, multicollinearity can make variables appear statistically insignificant when they should be significant. Note: the VIF of the ith regressor is the factor by which its variance increases compared to the ideal setting where it is orthogonal to the other regressors.
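As a sanity check, a VIF can be computed by hand: regress one predictor on all the others and take 1/(1 - R^2). A minimal sketch for the wt predictor, which should reproduce vif(all)['wt'] above (about 15.16):
# manual VIF for wt: regress wt on the remaining predictors (mpg excluded)
r2.wt <- summary(lm(wt ~ . - mpg, data = mtcars))$r.squared
1 / (1 - r2.wt)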
One remedy is to remove highly correlated predictors from the model: if two or more predictors have a high VIF, remove one of them. Because they supply redundant information, removing one of the correlated predictors usually does not drastically reduce the R-squared. Another option is stepwise regression, which is the method we use below to arrive at our final model.
Moving on to linear models with more regressors. First I define a model with all the regressors and then find the most influential ones (predictors that make a significant contribution to the variability explained). We use the step function with direction “backward” to do this; see ?step in R for details.
all <- lm(mpg~., data=mtcars)
# stepping backwards through the model, removing regressors that don't contribute to the variability explained
step.model <- step(all, direction="backward")
#summary(step.model)
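The formula chosen by the backward search can be inspected directly; on this data it keeps just weight, quarter-mile time and transmission type:
# predictors retained by backward elimination
formula(step.model)   # mpg ~ wt + qsec + Transmission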
For our analysis we use the adjusted R-squared, a modified version of R-squared that accounts for the number of predictors in the model. This is mainly because every time you add a predictor to a model the R-squared increases, even if only by chance; it never decreases. Consequently, a model with more terms may appear to have a better fit simply because it has more terms.
The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is always lower than the R-squared.
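For reference, the adjusted R-squared is 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors; a quick sketch verifying it against summary() for the stepwise model:
# recompute the adjusted R-squared by hand
r2 <- summary(step.model)$r.squared
n <- nrow(mtcars)
p <- length(coef(step.model)) - 1      # number of predictors, intercept excluded
1 - (1 - r2) * (n - 1) / (n - p - 1)   # matches summary(step.model)$adj.r.squared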
data.frame(all.model = summary(all)$adj.r.squared, step.model = summary(step.model)$adj.r.squared, row.names = 'R squared')
## all.model step.model
## R squared 0.8066423 0.8335561
The all.model fit uses every available regressor (10 predictors) but still performs worse than our stepwise model, which uses just 3 predictors. This is easily seen from the adjusted R-squared values: the stepwise model explains 83% of the variability observed in the miles per gallon variable, compared to 81% for the model that uses all the predictors. Since fewer predictors are generally preferable in linear models, we choose the stepwise model for the rest of our analysis.
# for plotting all diagnostic plots
# for (i in 1:4){
# plot(step.model, which=i)
# }
plot(step.model, which = 1)
This plot shows if residuals have non-linear patterns. There could be a non-linear relationship between predictor variables and an outcome variable and the pattern could show up in this plot if the model doesn’t capture the non-linear relationship.
In our model the residuals are equally spread around a horizontal line without distinct patterns, which is a good indication that we don’t have non-linear relationships.
# Q-Q plot
plot(step.model, which=2)
This plot shows whether the residuals are normally distributed. Our residuals follow the straight line closely and do not deviate severely.
plot(step.model, which=3)
This plot, also called a Spread-Location plot, shows whether the residuals are spread equally along the range of the predictors; this is how you check the assumption of equal variance (homoscedasticity). In our model we see a horizontal line with equally (randomly) spread points, with the exception of a few outliers.
plot(step.model, which = 4)
This plot helps us find influential cases, if any. Not all outliers are influential in linear regression analysis: even though an observation has an extreme value, it might not affect the fitted regression line, meaning the results wouldn’t change much whether we include or exclude it. Such cases follow the trend of the majority of the data and don’t really matter. On the other hand, some cases can be very influential even if their values look reasonable: they go against the trend of the majority of the cases and can alter the results if excluded from the analysis.
In our case we can see that Chrysler Imperial, Merc 230 and Fiat 128 are influential, with Chrysler Imperial being the most extreme. The same cars stand out in the Scale-Location plot above and in the Residuals vs Leverage plot drawn below.
plot(step.model, which = 5)
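The influential cars flagged in these plots can also be extracted numerically from Cook’s distance (a quick sketch; the largest values correspond to the labelled points):
# three most influential observations by Cook's distance
head(sort(cooks.distance(step.model), decreasing = TRUE), 3)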
Using our stepwise model we reach the following conclusions. For our analysis we used 95% confidence intervals.
## [1] "We are 95% confident that a 1000 lb increase in weight will decrease the miles achieved with a gallon of gas by somewhere between 2.46 and 5.37, with an average decrease of 3.92 miles per gallon."
## [1] "We are 95% confident that each additional second of quarter-mile time (qsec, roughly 400 meters) will increase the miles achieved with a gallon of gas by somewhere between 0.63 and 1.82, with an average increase of 1.23 miles per gallon."
## [1] "We are 95% confident that using a manual transmission vehicle will increase the miles achieved with a gallon of gas by somewhere between 0.05 and 5.83, with an average increase of 2.94 miles per gallon."