This analysis explores the relationship between a set of variables and the miles per gallon (MPG) of motor cars, with MPG as the outcome. We are particularly interested in two questions about transmission mode: whether it is related to MPG, and how large that relationship is.
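# Work on a copy of the built-in mtcars data set
cars <- mtcars
library(ggplot2)
# Box plot of mpg by transmission (am: 0 = automatic, 1 = manual)
ggplot(cars, aes(as.factor(am), mpg)) + geom_boxplot(aes(fill = as.factor(am)))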
From this plot we get a rough idea that transmission mode is correlated with MPG. However, the plot alone is not enough to quantify that relationship, or to conclude that transmission mode is the only variable correlated with MPG.
To quantify how strongly transmission mode is correlated with MPG, we first need to find which other variables are also correlated with it. We can use the `cor` function for this.
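As a first numeric check on what the box plot suggests, the group means and a two-sample t-test can be computed directly. This is only a sketch, not part of the original analysis: it assumes the unmodified built-in `mtcars` data, and the names `group_means`, `mean_diff` and `welch` are introduced here for illustration.

```r
# Mean mpg per transmission group (am: 0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Difference in means: manual minus automatic
mean_diff <- unname(group_means["1"] - group_means["0"])

# Welch two-sample t-test (does not assume equal variances)
welch <- t.test(mpg ~ am, data = mtcars)

print(group_means)
print(mean_diff)
print(welch$p.value)
```

The difference in means equals the `am` coefficient of a simple regression of `mpg` on `am`, so this check ignores every other variable in the data.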
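# Correlation of mpg with every column (a 1 x 11 matrix)
correlation <- cor(cars$mpg, cars)
# Sort by absolute correlation, strongest first; the result is a named vector
order <- correlation[, order(abs(correlation), decreasing = TRUE)]
order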
##        mpg         wt        cyl       disp         hp       drat
##  1.0000000 -0.8676594 -0.8521620 -0.8475514 -0.7761684  0.6811719
##         vs         am       carb       gear       qsec
##  0.6640389  0.5998324 -0.5509251  0.4802848  0.4186840
Now, let's select only the variables that are at least as strongly correlated with MPG as transmission mode (`am`).
# The first 8 names are mpg itself plus the 7 variables down to and including am
variates <- names(order)[1:8]
relavantData <- cars[, names(cars) %in% variates]
head(relavantData)
##                    mpg cyl disp  hp drat    wt vs am
## Mazda RX4         21.0   6  160 110 3.90 2.620  0  1
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875  0  1
## Datsun 710        22.8   4  108  93 3.85 2.320  1  1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215  1  0
## Hornet Sportabout 18.7   8  360 175 3.15 3.440  0  0
## Valiant           18.1   6  225 105 2.76 3.460  1  0
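Selecting the first eight names by position works only because we already know where `am` falls in the ranking. A possible alternative, sketched here under the assumption that the built-in `mtcars` data is unchanged, is to select by a threshold on the absolute correlation. The names `r`, `keep` and `subset_data` are hypothetical and do not appear elsewhere in this analysis.

```r
# Correlation of every variable with mpg, as a named vector
r <- cor(mtcars)["mpg", ]

# Keep the variables whose absolute correlation with mpg is
# at least as large as that of am (mpg itself is always kept)
keep <- names(r)[abs(r) >= abs(r[["am"]])]

# Subset the data to those columns
subset_data <- mtcars[, keep]
print(keep)
```

This keeps the same eight columns as the positional selection, but would not need to be changed by hand if the ranking of `am` were different.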
Now that we have subsetted the data, let's fit a simple linear regression of MPG on transmission mode alone.
basicFit <- lm(mpg ~ am, relavantData)
summary(basicFit)
##
## Call:
## lm(formula = mpg ~ am, data = relavantData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
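Here the R-squared value is 0.36, so our basic fit explains only about 36% of the variance. The `am` coefficient says that manual cars average 7.245 MPG more than automatic cars (17.147 MPG), before accounting for any other variable. Let's fit a multivariate regression.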
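To make the "variance explained" reading concrete, R-squared can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes the built-in `mtcars` data; the names `fit`, `rss`, `tss` and `r2` are illustrative only.

```r
# Simple regression of mpg on transmission mode
fit <- lm(mpg ~ am, data = mtcars)

# Residual sum of squares: variation left unexplained by the model
rss <- sum(resid(fit)^2)

# Total sum of squares: variation of mpg around its mean
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R-squared is the share of total variation the model accounts for
r2 <- 1 - rss / tss

print(r2)
print(summary(fit)$r.squared)
```

Both printed values should agree, since this is the definition `summary` uses for a model with an intercept.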
multiFit <- lm(mpg ~ ., relavantData)
summary(multiFit)
##
## Call:
## lm(formula = mpg ~ ., data = relavantData)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.660 -1.678 -0.417  1.371  5.312
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.45671    9.02383   3.597  0.00145 **
## cyl         -0.63992    0.89674  -0.714  0.48235
## disp         0.01348    0.01212   1.112  0.27695
## hp          -0.03032    0.01469  -2.063  0.05005 .
## drat         0.54696    1.51009   0.362  0.72037
## wt          -3.24531    1.16754  -2.780  0.01041 *
## vs           1.39761    1.84843   0.756  0.45694
## am           1.95201    1.75665   1.111  0.27749
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.571 on 24 degrees of freedom
## Multiple R-squared: 0.8591, Adjusted R-squared: 0.818
## F-statistic: 20.91 on 7 and 24 DF, p-value: 9.089e-09
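Now the R-squared value has increased substantially: this model explains 86% of the variance (adjusted R-squared 0.82), which is considerably better. Holding the other variables constant, a manual transmission is associated with an estimated 1.95 MPG more than an automatic one. Note, however, that this coefficient is not statistically significant in this model (p = 0.277).
Let's also check the residual plots.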
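One way to express the uncertainty around the transmission estimate is a confidence interval for the `am` coefficient. The following is a sketch only: it refits the same model directly on `mtcars` so that it is self-contained, and the names `fit` and `ci_am` are introduced here for illustration.

```r
# Same predictors as the multivariate model, written out explicitly
fit <- lm(mpg ~ cyl + disp + hp + drat + wt + vs + am, data = mtcars)

# 95% confidence interval for the am coefficient (a 1 x 2 matrix)
ci_am <- confint(fit, "am", level = 0.95)

print(coef(fit)[["am"]])
print(ci_am)
```

An interval that includes zero is consistent with the non-significant p-value reported for `am` in the summary output.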
par(mfrow = c(2,2))
plot(multiFit)
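The residuals show mostly homoskedastic behaviour, so we can conclude that it is a fairly good model.
The analysis is built up step by step, from a basic exploratory plot that gives a rough idea to a fairly complex multivariate model in which the chosen variables explain 86% of the variance. It answers our questions quantitatively: manual cars average 7.245 MPG more than automatic cars in the simple model, and an estimated 1.95 MPG more once the other selected variables are taken into account.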
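The visual impression from the residual plots can be complemented by a numeric check. The sketch below hand-computes a studentized (Koenker) Breusch-Pagan statistic in base R: it regresses the squared residuals on the predictors and compares n times the auxiliary R-squared to a chi-squared distribution. This is an addition for illustration, not part of the original analysis; it assumes the built-in `mtcars` data, and the names `aux_data`, `aux`, `bp_stat` and `bp_p` are hypothetical.

```r
# Refit the multivariate model so the example is self-contained
predictors <- c("cyl", "disp", "hp", "drat", "wt", "vs", "am")
fit <- lm(mpg ~ cyl + disp + hp + drat + wt + vs + am, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- mtcars[, predictors]
aux_data$res_sq <- resid(fit)^2
aux <- lm(res_sq ~ ., data = aux_data)

# Test statistic n * R^2, chi-squared with one df per predictor
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = length(predictors), lower.tail = FALSE)

print(bp_stat)
print(bp_p)
```

A large p-value would give no evidence against constant residual variance; with only 32 observations the test has limited power, so it supplements the plots rather than replacing them.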