For this report, we look at the R dataset “mtcars” to explore the relationship between miles per gallon (mpg) and the other variables collected. In particular, we want to investigate how the transmission type affects mpg. We also try to find a model that explains mpg well using as many predictors as is appropriate.
We load the dataset “mtcars” and set the categorical variables to a factor type of data.
data(mtcars)
head(mtcars) #Sample of data
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
attach(mtcars) # Avoids having to write "mtcars$" every time
We draw a boxplot to investigate the relationship between mpg and transmission type. (See point 1 of Appendix)
From the boxplot, we can see that mpg values are generally higher when we have a manual transmission compared to when we have an automatic transmission.
We quantify the difference by comparing the mean mpg of cars with automatic transmission with the mean mpg of cars with manual transmission.
automatic<-subset(mtcars, am==0)
manual<-subset(mtcars, am==1)
mean(automatic$mpg)-mean(manual$mpg)
## [1] -7.244939
The difference in means shows that cars with automatic transmission achieve, on average, about 7.24 mpg less than cars with manual transmission.
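The same comparison can be sketched in a single pass with `tapply`, which also shows the two group means behind the difference. This is only an illustrative alternative: the name `cars` and the factor labels "automatic" and "manual" are assumptions introduced here, and the block works on a fresh copy of the data so that it does not depend on the attached `mtcars`.

```r
# Fresh copy of the data; `cars` is an illustrative name, not from the report
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Mean mpg for each transmission type
group_means <- tapply(cars$mpg, cars$am, mean)
print(group_means)

# Automatic minus manual, the same sign convention as in the report
mean_diff <- unname(group_means["automatic"] - group_means["manual"])
print(mean_diff)
```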
We draw a pair graph to investigate the relationship between different pairs of variables. (See point 2 of Appendix)
From the pair graph, we see a correlation between “mpg” and “cyl”, “disp”, “hp”, “wt” and “vs”. We can also confirm the relationship previously observed in the boxplot.
We formulate the null hypothesis that the mean mpg is the same for cars with automatic and manual transmissions, assuming that mpg is approximately normally distributed within each group. We test this using a two-sample t-test.
ttest<-t.test(mpg ~ am)
ttest$p.value
## [1] 0.001373638
Since the p-value is smaller than 0.05, we reject the null hypothesis at the 5% significance level (95% confidence level). This indicates that the mean mpg differs between cars with manual and automatic transmissions.
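Besides the p-value, the test object also carries the group means and a confidence interval for their difference, which gives a sense of the size of the effect. The sketch below passes `data =` explicitly instead of relying on `attach`; the name `cars` and the factor labels are illustrative assumptions. Note that `t.test` runs Welch's unequal-variance test by default (`var.equal = FALSE`).

```r
# Fresh copy of the data; `cars` is an illustrative name, not from the report
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Two-sample t-test of mpg by transmission type (Welch test by default)
welch <- t.test(mpg ~ am, data = cars)

print(welch$estimate)  # mean mpg in each group
print(welch$conf.int)  # 95% confidence interval for automatic minus manual
print(welch$p.value)   # p-value of the test
```

If the confidence interval excludes zero, that agrees with rejecting the null hypothesis at the 5% level.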
We start off by fitting a full linear regression model, i.e. mpg against all of the variables that we found to have a correlation with mpg in the pair graph.
fullModel<-lm(mpg ~ cyl + disp + hp + wt + vs + am, data=mtcars)
summary(fullModel) #Results not shown to respect page length limit
We can see from the summary that the adjusted R-squared value is 0.836, meaning that the model explains about 83.6% of the variance in mpg (after adjusting for the number of predictors), though only a few of the coefficients are significant at the 95% confidence level, namely “cyl4”, “hp” and “wt”.
We apply stepwise selection to the full model to find a more parsimonious model. We use the BIC as the selection criterion by setting k = log(n), where n is the number of observations in mtcars.
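To make the meaning of the adjusted R-squared concrete, the following sketch recomputes it from the residual and total sums of squares and lists the coefficients whose p-values fall below 0.05. The names `cars`, `full`, and `adj_manual` are illustrative assumptions; the model formula is the one used above.

```r
# Fresh copy of the data with the categorical predictors as factors
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$vs  <- factor(cars$vs)
cars$am  <- factor(cars$am)

full <- lm(mpg ~ cyl + disp + hp + wt + vs + am, data = cars)
s <- summary(full)

# Adjusted R-squared by hand: 1 - (RSS / residual df) / (TSS / (n - 1))
rss <- sum(residuals(full)^2)
tss <- sum((cars$mpg - mean(cars$mpg))^2)
adj_manual <- 1 - (rss / df.residual(full)) / (tss / (nrow(cars) - 1))
print(c(manual = adj_manual, from_summary = s$adj.r.squared))

# Coefficients significant at the 5% level
coef_table <- coef(s)
print(coef_table[coef_table[, "Pr(>|t|)"] < 0.05, , drop = FALSE])
```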
stepModel<-step(fullModel, k=log(nrow(mtcars)))
summary(stepModel) #Results not shown to respect page length limit
Here we get an adjusted R-squared value of 0.8148, meaning that the model explains about 81.5% of the variance in mpg, and all coefficients are significant at the 95% confidence level.
Finally, we fit a model that uses only the transmission type as a predictor of mpg.
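As a rough check on the selection step, the sketch below reruns the stepwise search with the BIC penalty and compares the BIC of the full and reduced models. The names `cars`, `full`, and `reduced` are illustrative assumptions. `trace = 0` only suppresses the step-by-step log, and because no `scope` is given, `step` performs backward elimination.

```r
# Fresh copy of the data with the categorical predictors as factors
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$vs  <- factor(cars$vs)
cars$am  <- factor(cars$am)

full <- lm(mpg ~ cyl + disp + hp + wt + vs + am, data = cars)

# k = log(n) turns the AIC penalty used by step() into the BIC penalty
reduced <- step(full, k = log(nrow(cars)), trace = 0)

print(formula(reduced))    # predictors kept by the selection
print(BIC(full, reduced))  # lower BIC is preferred
```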
amModel<-lm(mpg ~ am, data=mtcars)
summary(amModel) #Results not shown to respect page length limit
This model explains only about 33.9% of the variance in mpg (adjusted R-squared), though its coefficients are all significant at the 95% confidence level. This is not a very good model.
We choose the model resulting from the stepwise selection as the best model. It has a high adjusted R-squared value and its coefficients are all significant.
We plot the diagnostic plots to carry out the residual analysis. (See point 3 of Appendix)
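The three candidate models can be put side by side in one small table, which makes the trade-off between fit and model size easier to see. In this sketch the stepwise result is written out directly as `mpg ~ hp + wt`, the formula reported in the conclusions; the names `cars`, `models`, and `comparison` are illustrative assumptions.

```r
# Fresh copy of the data with the categorical predictors as factors
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$vs  <- factor(cars$vs)
cars$am  <- factor(cars$am)

models <- list(
  full    = lm(mpg ~ cyl + disp + hp + wt + vs + am, data = cars),
  hp_wt   = lm(mpg ~ hp + wt, data = cars),  # stepwise result per the report
  am_only = lm(mpg ~ am, data = cars)
)

# One row per model: adjusted R-squared, residual standard error, model size
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  sigma  = sapply(models, function(m) summary(m)$sigma),
  n_coef = sapply(models, function(m) length(coef(m)))
)
print(round(comparison, 4))
```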
From the plots we can see that:
- In the Residuals vs Fitted plot, there is no discernible pattern, supporting the independence assumption.
- In the Normal Q-Q plot, the residuals are approximately normally distributed, as shown by their closeness to the line.
- In the Scale-Location plot, there is no discernible pattern, supporting the constant variance assumption.
- In the Residuals vs Leverage plot, all data points lie within the 0.5 Cook's distance contours, meaning no unduly influential points are present.
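The reading of the Residuals vs Leverage plot can be backed up numerically. The sketch below computes Cook's distance and leverage for the selected model, taken here to be `mpg ~ hp + wt` as reported in the conclusions, and lists any points beyond the 0.5 contour. The names `cars`, `fit`, `cook`, and `lev` are illustrative assumptions, and the leverage threshold of 2p/n is a common rule of thumb rather than something used in the report.

```r
# Fresh copy of the data; `cars` and `fit` are illustrative names
cars <- datasets::mtcars
fit <- lm(mpg ~ hp + wt, data = cars)

cook <- cooks.distance(fit)  # influence of each car on the fitted model
lev  <- hatvalues(fit)       # leverage of each car

# The three most influential cars and any beyond the 0.5 contour
print(head(sort(cook, decreasing = TRUE), 3))
print(names(cook)[cook > 0.5])

# Cars with leverage above the 2 * p / n rule of thumb
p <- length(coef(fit))
n <- nrow(cars)
print(names(lev)[lev > 2 * p / n])
```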
We can conclude that:
1. Cars with manual transmission can do, on average, 7 miles per gallon more than cars with automatic transmission.
2. There is a relationship between miles per gallon (mpg), gross horsepower (hp) and weight (wt), as shown by the best-fitting model we found for mtcars, i.e. formula = mpg ~ hp + wt.
boxplot(mpg ~ am, xlab="Transmission (0 = Automatic, 1 = Manual)", ylab="MPG",
main="MPG vs Transmission type")
pairs(mtcars, panel=panel.smooth, main="Pair Graph")
par(mfrow = c(2, 2))
plot(stepModel)