This project builds a regression model for the “mtcars” dataset. We are particularly interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. Our aim is to answer the following 2 questions: (1) is an automatic or a manual transmission better for MPG, and (2) how large is the MPG difference between automatic and manual transmissions?
Before we start our analysis, we need to load the dataset. The following R code does that.
data("mtcars")
We will start our analysis by doing some basic exploratory analysis.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
head(mtcars, 3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Let’s first make a simple boxplot to see if there are overall differences between the two types of transmission. We will use the “ggplot2” package to create the boxplot, but one could produce a similar plot with a different package.
require(ggplot2)
## Loading required package: ggplot2
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Recode am from 0/1 to a labeled factor (Automatic/Manual)
mtcars <- mtcars %>%
  mutate(am = ifelse(test = am == 1, yes = "Manual", no = "Automatic")) %>%
  mutate(am = as.factor(am))
g <- ggplot(data = mtcars, aes(x = am, y = mpg, fill = am))
g + geom_boxplot() +
  labs(fill = "Transmission Type") +
  ggtitle(label = "Miles Per Gallon relationship with Transmission types") +
  xlab("Transmission Type") + ylab("Miles Per Gallon") +
  theme(legend.position = c(.2, .8),
        plot.title = element_text(hjust = 0.5))
The boxplot suggests that MPG is generally higher for manual transmissions. However, we need a statistical analysis to determine whether the difference between the transmission types is statistically significant, and we also need to quantify that difference.
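As a quick formal check (a sketch, not part of the original analysis), a Welch two-sample t-test compares the mean MPG of the two groups; since “am” is already a factor, the formula interface works directly.
# Welch two-sample t-test of mpg by transmission type
t.test(mpg ~ am, data = mtcars)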
The following R code quantifies the mean difference between the two transmission types.
comparison <- mtcars %>%
  group_by(am) %>%
  summarise(meanMPG = mean(mpg)) %>%
  # compute the difference instead of hardcoding 7.24494
  add_row(am = "Difference (Manual - Automatic)", meanMPG = diff(.$meanMPG))
## `summarise()` ungrouping output (override with `.groups` argument)
data.frame(comparison)
## am meanMPG
## 1 Automatic 17.14737
## 2 Manual 24.39231
## 3 Difference (Manual - Automatic) 7.24494
The table shows that there is about a 7 mpg difference between the transmission types, in favor of manual.
Let’s first create a simple linear regression model with miles per gallon (mpg) as the outcome and transmission (am) as the predictor variable.
fit0 <- lm(formula = mpg ~ am, data = mtcars)
summary(fit0)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Our first linear model with a single predictor is statistically significant, meaning the variable “am” is a useful predictor of miles per gallon. Note that the coefficient (slope) for manual transmission is 7.245, indicating that manual transmissions average about 7 mpg more than automatic ones. Both coefficients have high t values and very small p values, confirming that the result is significant. However, the R-squared of the model is relatively low: only about 36% of the variation in mpg is explained by the model, and the residual standard error is 4.9. Next we will add other variables to see if we can improve the model.
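As a quick sanity check (not in the original write-up), the fitted coefficients reproduce the group means from the comparison table above: the intercept is the automatic-group mean and the slope is the Manual - Automatic difference.
# Intercept = mean mpg for automatics (17.147); amManual = mean difference (7.245)
coef(fit0)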
Now let’s add other variables to try to improve the model. We will fit nested models and compare them with an ANOVA; since added predictors can introduce multicollinearity, we will also check for that below.
fit1 <- lm(formula = mpg ~ am + wt, data = mtcars)
fit2 <- lm(formula = mpg ~ am + wt + qsec, data = mtcars)
anova(fit0, fit1, fit2)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 73.203 2.673e-09 ***
## 3 28 169.29 1 109.03 18.034 0.0002162 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see from the ANOVA table that adding these variables improves the model: the residual sum of squares (RSS) drops sharply at each step and the p values are very small. Therefore, the last model (fit2) can be chosen as the final model.
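To follow up on the multicollinearity concern, one could compute variance inflation factors for the final model, for example with the “car” package (a sketch; assumes the package is installed):
require(car)
# VIFs well below 5-10 would indicate that multicollinearity is not a serious problem
vif(fit2)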
Let’s see how our final model is performing.
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## amManual 2.9358 1.4109 2.081 0.046716 *
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Compared to our first model, which had only one predictor, the R-squared has improved drastically, from about 36% of explained variation to about 85%.
Before claiming that our model works well, we need to check the assumptions of linear regression. The following R code produces several residual diagnostic plots to detect whether any assumptions are violated.
require(ggfortify)
## Loading required package: ggfortify
autoplot(fit2)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
The plots above suggest that the assumptions of linear regression are reasonably well met: the residuals appear approximately normally distributed, and the residuals-vs-fitted plot shows no clear sign of heteroscedasticity.
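To supplement the visual normality check, one could also run a formal test on the residuals (a sketch, not part of the original analysis):
# Shapiro-Wilk test; a large p-value is consistent with normally distributed residuals
shapiro.test(residuals(fit2))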
It is also important to acknowledge the error and uncertainty associated with the chosen model. The following R code quantifies that uncertainty with 95% confidence intervals for the coefficients.
confint(fit2)
## 2.5 % 97.5 %
## (Intercept) -4.63829946 23.873860
## amManual 0.04573031 5.825944
## wt -5.37333423 -2.459673
## qsec 0.63457320 1.817199
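Note that the interval for amManual (roughly 0.05 to 5.83 mpg) excludes zero, which supports a positive manual-transmission effect, although the interval is wide. To see how this uncertainty carries over to the fitted values, one could compute a prediction with its confidence interval for a hypothetical car (the wt and qsec values below are made up for illustration):
newcar <- data.frame(am = factor("Manual", levels = levels(mtcars$am)), wt = 2.5, qsec = 18)
# 95% confidence interval for the expected mpg of this hypothetical manual car
predict(fit2, newdata = newcar, interval = "confidence")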