The purpose of this article is to demonstrate the regression analysis methods learned during the Regression Models course created by the Johns Hopkins Bloomberg School of Public Health and hosted by Coursera. I will use the mtcars toy dataset to answer the following questions:
Based on the analysis performed, I determined that transmission type is associated with miles per gallon (mpg). When all other variables are ignored, manual transmission cars average a little over 7 mpg more than automatic transmission cars.
I used a density histogram to explore the dataset in the figure below. Because `am` is coded with Automatic as its first factor level, the left panel shows the mpg density of automatic transmission cars and the right panel shows the density of manual transmission cars. Simply by comparing the modes of the two panels we can see that transmission type is likely related to mpg: the mode of the left panel is about 15 mpg, while the mode of the right panel looks to be about 22. We can also make the rough assumption that the data are approximately Gaussian. Both vehicle classes seem to be centered and roughly evenly distributed around the mean.
library(knitr)
library(dplyr)
library(ggplot2)
data(mtcars)
mtcars$am<- factor(mtcars$am, labels=c("Automatic","Manual"))
g<- ggplot(mtcars)
g<- g + aes(mpg)
g<- g + geom_histogram(aes(y=after_stat(density)),fill = "blue",binwidth=2)  # after_stat() replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
g<- g + geom_density()
g<- g + facet_grid(.~am)
g<- g + theme(plot.background= element_rect(fill="white"),
panel.background = element_rect(fill="white"))
g<- g + labs(title="Density Plot of Miles Per Gallon by Transmission Type")
g
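The rough Gaussian assumption made from the figure can also be checked numerically. The sketch below is not part of the original analysis: it assumes a Shapiro-Wilk test per transmission group is an acceptable check, and the name `shapiro_p` is introduced here only for illustration.

```r
# Assumption: a per-group Shapiro-Wilk test is one reasonable way to probe
# the normality claim; the original analysis relied on the plot alone.
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# p-value of the Shapiro-Wilk test for mpg within each transmission group
shapiro_p <- tapply(mtcars$mpg, mtcars$am,
                    function(x) shapiro.test(x)$p.value)
print(shapiro_p)
```

A large p-value in a group means the test finds no evidence against normality there. With only 19 and 13 cars per group the test has little power, so it is a sanity check and not a proof.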
Our hypothesis is further strengthened by calculating the mean mpg by transmission type. In Table 1 we see that the mean mpg for automatic transmission vehicles is lower than the mean mpg for manual transmission vehicles.
mtcars %>%
group_by(am) %>%
summarise(Mean= mean(mpg)) %>%
kable()
| am | Mean |
|---|---|
| Automatic | 17.14737 |
| Manual | 24.39231 |
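To judge whether the gap between the two means in the table is larger than chance alone would explain, one option is a two-sample t-test. This is a minimal sketch under the assumption that Welch's test (the default of `t.test`) is appropriate here; the object name `tt` is illustrative and not from the original analysis.

```r
# Assumption: Welch's two-sample t-test (unequal variances) is used here
# purely as an illustration of how the mean difference could be tested.
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

tt <- t.test(mpg ~ am, data = mtcars)

print(tt$estimate)   # group means, matching the table above
print(tt$conf.int)   # 95% interval for (Automatic mean - Manual mean)
print(tt$p.value)
```

Note that the interval is reported for automatic minus manual, so a manual advantage shows up as a negative interval.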
OK, we've informally established that manual transmission cars have higher mpg than automatic cars. Let's look at the other relationships to see if there are any potential confounding variables. In the pairs plot below, mpg is in the first row and each column is a different variable. Looking for patterns, we see that mpg is possibly related to displacement and horsepower.
pairs(mtcars,
pch=19)
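The visual impression from the pairs plot can be backed up with correlation coefficients. The sketch below assumes Pearson correlation is an adequate summary and reloads `mtcars` so that `am` is numeric again; `cors` is an illustrative name.

```r
# Reload the data so every column, including am, is numeric
data(mtcars)

# Pearson correlation of mpg with every other variable
cors <- cor(mtcars)["mpg", ]
cors <- cors[names(cors) != "mpg"]

# Sorted from most negative to most positive
print(round(sort(cors), 3))
```

Variables with a strong correlation to mpg that also differ between transmission types are the ones worth considering as confounders.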
fit1<- lm(mpg~am, data=mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
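With a single two-level factor as the predictor, the coefficients above have a direct reading: the intercept is the mean mpg of the reference group (Automatic) and `amManual` is the difference between the two group means. The sketch below checks this and adds a confidence interval; `group_means` and `diff_means` are names introduced only for this illustration.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit1 <- lm(mpg ~ am, data = mtcars)

# Mean mpg per transmission type and their difference
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
diff_means  <- unname(group_means["Manual"] - group_means["Automatic"])

# The slope on amManual equals the difference in group means
print(all.equal(unname(coef(fit1)["amManual"]), diff_means))

# 95% confidence intervals for both coefficients
print(confint(fit1))
```

This is why the simple model reproduces the numbers in Table 1: 17.147 is the automatic mean and 17.147 + 7.245 is the manual mean.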
df<- data.frame(Item= 1:nrow(mtcars), Residual = fit1$residuals)
g<- ggplot(df)
g<- g + aes(x=Item,y=Residual)
g<- g + geom_point()
g<- g + labs(x= "Item", y= "Residual", title= "Residuals from fit1")
g<- g + theme(plot.background= element_rect(fill="white"),
panel.background = element_rect(fill="white"))
g
fit2<- lm(mpg~am*disp, data=mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am * disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6056 -2.1022 -0.8681 2.2894 5.2315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.157064 1.925053 13.068 1.94e-13 ***
## amManual 7.709073 2.502677 3.080 0.00460 **
## disp -0.027584 0.006219 -4.435 0.00013 ***
## amManual:disp -0.031455 0.011457 -2.745 0.01044 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.907 on 28 degrees of freedom
## Multiple R-squared: 0.7899, Adjusted R-squared: 0.7674
## F-statistic: 35.09 on 3 and 28 DF, p-value: 1.27e-09
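Because `fit2` includes an interaction, the `amManual` coefficient is no longer a single transmission effect: it is the gap between the two fitted lines at a displacement of zero, and the gap shrinks as displacement grows. The sketch below, with illustrative names `lines_by_am` and `crossover`, rewrites the coefficients as one line per transmission type and finds where the lines cross.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit2 <- lm(mpg ~ am * disp, data = mtcars)
b <- coef(fit2)

# Intercept and slope of the fitted mpg ~ disp line for each transmission type
lines_by_am <- rbind(
  Automatic = c(intercept = b[["(Intercept)"]],
                slope     = b[["disp"]]),
  Manual    = c(intercept = b[["(Intercept)"]] + b[["amManual"]],
                slope     = b[["disp"]] + b[["amManual:disp"]])
)
print(round(lines_by_am, 4))

# Displacement (cubic inches) at which the two fitted lines cross
crossover <- -b[["amManual"]] / b[["amManual:disp"]]
print(crossover)
```

Under this model the predicted manual advantage applies to cars below the crossover displacement and reverses above it, so the 7.7 figure should not be read as a constant effect.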
df2<- data.frame(Item= 1:nrow(mtcars), Residual = fit2$residuals)
g<- ggplot(df2)
g<- g + aes(x=Item,y=Residual)
g<- g + geom_point()
g<- g + labs(x= "Item", y= "Residual", title= "Residuals from fit2")
g<- g + theme(plot.background= element_rect(fill="white"),
panel.background = element_rect(fill="white"))
g
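Since `fit1` is nested inside `fit2`, the two models can also be compared formally and not only by eye from their residual plots. The sketch below assumes a nested-model F-test via `anova` is a suitable comparison; it is an addition for illustration and not part of the original analysis, and `cmp` is an illustrative name.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

fit1 <- lm(mpg ~ am, data = mtcars)         # transmission only
fit2 <- lm(mpg ~ am * disp, data = mtcars)  # adds disp and the interaction

# F-test of whether the extra terms in fit2 reduce the residual
# sum of squares by more than chance would explain
cmp <- anova(fit1, fit2)
print(cmp)
```

A small p-value in the second row indicates that displacement and its interaction with transmission type explain variation in mpg that the transmission-only model leaves in the residuals.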