In this report, we examine the mtcars dataset and explore how miles per gallon (mpg) is affected by the other variables. Mainly, we aim to answer the following two questions:
Our analysis shows that:
# loading required libraries
library(ggplot2)
# loading dataset
data(mtcars)
# copying it for later operations
mt <- mtcars
# summarizing the variables
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
All variables are stored as numeric, but cyl, vs, gear, carb, and am are categorical, so we convert those five to factors.
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am,labels=c("Automatic","Manual"))
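As a quick sanity check on the conversion above, the following sketch rebuilds the am factor on a fresh copy of the data and tabulates it against the original codes. The names `cars` and `am_check` are introduced here for illustration only. `factor()` matches labels to the sorted unique values, so 0 should map to "Automatic" and 1 to "Manual".

```r
# Work on a fresh copy so the session's modified mtcars is left untouched
# (`cars` is a name introduced for this sketch only).
cars <- datasets::mtcars

# Labels are matched to the sorted unique values: 0 -> Automatic, 1 -> Manual.
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Cross-tabulate the original 0/1 codes against the new labels.
am_check <- table(code = datasets::mtcars$am, label = cars$am)
print(am_check)

# Show the class of every column after the conversion.
print(sapply(cars, class))
```

If the mapping is right, the off-diagonal cells of the table are zero.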
Exploring mpg vs am (Transmission type).
boxplot(mpg ~ am, data = mtcars,
        col = c("red", "blue"),
        xlab = "Type of Transmission",
        ylab = "Miles Per Gallon",
        main = "Miles per Gallon by Transmission Type")
The plot suggests a relation between transmission type and mpg: manual cars tend to achieve higher mpg than automatic cars.
tapply(mtcars$mpg, mtcars$am, mean)
## Automatic Manual
## 17.14737 24.39231
In this initial analysis, manual cars average about 7.2 mpg more than automatic cars (24.39 vs. 17.15). We explore this further with regression modelling.
fit <- lm(mpg~am, mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
A p-value of 0.000285 means we reject the null hypothesis that transmission type has no effect on mpg. At the same time, the R-squared value is only about 0.36, which means that am explains roughly 36% of the variance in mpg, so other variables are likely to matter as well.
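To make the reading of this single-predictor model concrete, here is a minimal sketch, assuming a fresh copy of the data. With one two-level factor as the only predictor, the amManual coefficient should equal the raw difference in group means. The names `cars`, `fit_am`, `slope`, `mean_diff`, and `r2` are introduced for this sketch only.

```r
# Fresh copy of the data (`cars` is a name used only in this sketch).
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit_am <- lm(mpg ~ am, data = cars)

# Compare the amManual coefficient with the raw difference in group means.
group_means <- tapply(cars$mpg, cars$am, mean)
slope <- unname(coef(fit_am)["amManual"])
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])
print(c(slope = slope, mean_diff = mean_diff))

# Share of the variance in mpg explained by transmission type alone.
r2 <- summary(fit_am)$r.squared
print(round(r2, 4))

# 95% confidence interval for the difference in mean mpg.
print(confint(fit_am, "amManual"))
```

The confidence interval gives a range for the difference rather than a single point estimate, which is often easier to interpret than the p-value alone.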
cor(mt)[1,]
## mpg cyl disp hp drat wt qsec
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.4186840
## vs am gear carb
## 0.6640389 0.5998324 0.4802848 -0.5509251
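The correlations above are easier to compare when sorted by absolute value. The following is a small sketch of that step; `cor_mpg` and `ranked` are names introduced here, and the all-numeric original data is used because `cor()` does not accept factor columns.

```r
# Correlation of every variable with mpg, computed on the all-numeric data.
cor_mpg <- cor(datasets::mtcars)["mpg", ]

# Drop mpg itself and sort the remaining variables by absolute strength.
ranked <- sort(abs(cor_mpg[names(cor_mpg) != "mpg"]), decreasing = TRUE)
print(round(ranked, 3))
```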
We see that cyl, disp, hp, and wt are strongly correlated with mpg (absolute correlation above 0.77 in each case), so we add them to the model alongside am.
fit2 <- lm(mpg~am + cyl + disp + hp + wt, mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am + cyl + disp + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9374 -1.3347 -0.3903 1.1910 5.0757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.864276 2.695416 12.564 2.67e-12 ***
## amManual 1.806099 1.421079 1.271 0.2155
## cyl6 -3.136067 1.469090 -2.135 0.0428 *
## cyl8 -2.717781 2.898149 -0.938 0.3573
## disp 0.004088 0.012767 0.320 0.7515
## hp -0.032480 0.013983 -2.323 0.0286 *
## wt -2.738695 1.175978 -2.329 0.0282 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.453 on 25 degrees of freedom
## Multiple R-squared: 0.8664, Adjusted R-squared: 0.8344
## F-statistic: 27.03 on 6 and 25 DF, p-value: 8.861e-10
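This model gives p-values below 0.05 for cyl6, hp, and wt. The R-squared value is about 0.87, so the model explains roughly 87% of the variance in mpg. Note that the amManual coefficient drops to about 1.8 mpg and is no longer significant (p = 0.2155) once these variables are included.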
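One way to check whether the extra predictors improve on the am-only model is an F-test on the two nested models using `anova()`. This is a sketch under the assumption that both models are refit on a fresh copy of the data; `cars`, `fit_small`, `fit_large`, `cmp`, and `adj` are names introduced here.

```r
# Fresh copy with the same factor conversions used by the two models.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

fit_small <- lm(mpg ~ am, data = cars)
fit_large <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)

# F-test: do the added terms reduce the residual sum of squares
# by more than chance alone would explain?
cmp <- anova(fit_small, fit_large)
print(cmp)

# Adjusted R-squared penalises the extra parameters, so it is a fairer
# comparison than plain R-squared.
adj <- c(small = summary(fit_small)$adj.r.squared,
         large = summary(fit_large)$adj.r.squared)
print(round(adj, 4))
```

A small p-value in the second row of the table indicates that the larger model fits significantly better than the am-only model.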
To check the residuals, we draw the standard diagnostic plots:
par(mfrow = c(2, 2))
plot(fit2)
As seen, the residuals appear approximately normally distributed, apart from a few outliers.
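The diagnostic plots give a visual impression only. A rough numerical complement, sketched here as an assumption rather than as part of the original analysis, is a Shapiro-Wilk test on the residuals together with a list of the cars with the largest standardized residuals. The names `cars`, `fit_large`, `sw`, and `std_res` are introduced for this sketch.

```r
# Refit the larger model on a fresh copy of the data.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)
fit_large <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals
# come from a normal distribution.
sw <- shapiro.test(resid(fit_large))
print(sw)

# Cars with the largest absolute standardized residuals,
# i.e. the candidates for the outliers seen in the plots.
std_res <- rstandard(fit_large)
print(head(sort(abs(std_res), decreasing = TRUE), 3))
```

With only 32 observations the test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.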
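We can conclude that mpg and am are related: on their own, manual cars average about 7.2 mpg more than automatic cars. However, wt, hp, and cyl confound this relation. Once they are included in the model, the estimated difference falls to about 1.8 mpg and is no longer statistically significant at the 0.05 level.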