Loading the mtcars dataset

library(ggplot2) #for plots
## Warning: package 'ggplot2' was built under R version 3.4.3
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars$cyl  <- factor(mtcars$cyl)
mtcars$vs   <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am   <- factor(mtcars$am,labels=c("Automatic","Manual"))

To help us understand the data, we build exploratory plots. Plot 1 shows a clear difference in MPG by transmission type, with automatic transmissions having lower MPG.

Regression analysis

boxplot(mpg ~ am, data = mtcars, col = (c("red","blue")), ylab = "Miles Per Gallon", xlab = "Transmission Type")
aggregate(mpg~am, data = mtcars, mean)

The group means suggest that automatic cars average about 7.25 MPG less than manual cars. To determine whether this difference is significant, we use a t-test.

D_automatic <- mtcars[mtcars$am == "Automatic",]
D_manual <- mtcars[mtcars$am == "Manual",]
t.test(D_automatic$mpg, D_manual$mpg)

The p-value is 0.001374, so we can state that this is a significant difference. Next we quantify it.
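The figures quoted above can also be read directly from the test object rather than from the printed output. The following is a minimal sketch, not part of the original analysis; the names `cars`, `tt`, `mean_diff`, and `p_value` are illustrative, and it works on a fresh copy of the data so that it does not disturb the `mtcars` object used elsewhere.

```r
# Work on a copy so the factor conversions made earlier are left untouched.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Formula interface; t.test() defaults to Welch's unequal-variance test,
# which is the same test as t.test(D_automatic$mpg, D_manual$mpg).
tt <- t.test(mpg ~ am, data = cars)

# Group means and their difference (Manual minus Automatic).
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

p_value <- tt$p.value
ci <- tt$conf.int  # 95% interval for mean(Automatic) - mean(Manual)

print(round(mean_diff, 2))
print(signif(p_value, 4))
```

Pulling the values from `tt` avoids copying numbers by hand when the report is regenerated.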

init <- lm(mpg ~ am, data = mtcars)
summary(init)
pairs(mpg ~ ., data = mtcars)

This shows us that the average MPG for automatic cars is 17.1, while manual cars average 7.2 MPG higher. The R-squared value is 0.36, which tells us this model explains only 36% of the variance in MPG. As a result, we need to build a multiple linear regression.
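As a rough illustration of where these numbers come from, the intercept, the transmission coefficient, and R-squared can be extracted from the fitted model. This is a sketch under the same setup as above; `cars`, `fit_am`, `baseline`, `manual_gain`, and `r_squared` are names introduced here for illustration only.

```r
# Fresh copy of the data with the transmission factor rebuilt.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Same single-predictor model as `init`.
fit_am <- lm(mpg ~ am, data = cars)

# With "Automatic" as the reference level, the intercept is the automatic
# mean and the amManual coefficient is the manual-minus-automatic gap.
coefs <- coef(fit_am)
baseline <- unname(coefs["(Intercept)"])
manual_gain <- unname(coefs["amManual"])

# Share of MPG variance explained by transmission type alone.
r_squared <- summary(fit_am)$r.squared

print(round(c(baseline = baseline, manual_gain = manual_gain,
              r_squared = r_squared), 2))
```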

The new model uses the other variables to improve accuracy. We explore the other variables via a pairs plot (Plot 2) to see how each one correlates with mpg. From this we see that cyl, disp, hp, and wt have the strongest correlation with mpg. We build a new model using these variables and compare it to the initial model with the anova function.

betterFit <- lm(mpg~am + cyl + disp + hp + wt, data = mtcars)
anova(init, betterFit)
par(mfrow = c(2,2))
plot(betterFit)

This results in a p-value of 8.637e-08, so we can claim the betterFit model is significantly better than our simple init model. We check the residual diagnostics (Plot 3) and see that the residuals appear approximately normally distributed and homoskedastic.
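The visual check of the residuals can be complemented with a numerical one. The sketch below is an addition, not part of the original analysis: it extracts the F-test p-value from the model comparison and runs a Shapiro-Wilk normality test on the residuals. The names `cars`, `fit_small`, `fit_large`, `f_p_value`, and `sw_p_value` are illustrative.

```r
# Fresh copy of the data with the same factor coding used for the models.
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit_small <- lm(mpg ~ am, data = cars)
fit_large <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)

# Nested-model F-test; the second row holds the comparison statistics.
cmp <- anova(fit_small, fit_large)
f_p_value <- cmp[2, "Pr(>F)"]

# Shapiro-Wilk test on the residuals of the larger model.
# A large p-value means no evidence against normality; it does not prove it.
sw <- shapiro.test(residuals(fit_large))
sw_p_value <- sw$p.value

print(signif(f_p_value, 4))
print(signif(sw_p_value, 4))
```

With only 32 observations, such a test has limited power, so it is best read alongside the diagnostic plots rather than in place of them.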

summary(betterFit)

The model explains 86.64% of the variance, which indicates that cyl, disp, hp, and wt confound the relationship between mpg and am. Thus, we can say that, after adjusting for these variables, the estimated difference between manual and automatic transmissions is 1.81 MPG.
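Before treating the adjusted 1.81 MPG difference as established, it is worth looking at its uncertainty. The following sketch, which is an addition to the original analysis, pulls the transmission coefficient and its 95% confidence interval from the larger model; whether that interval excludes zero is for the reader to check. The names `cars`, `fit_large`, `am_row`, and `am_ci` are illustrative.

```r
# Fresh copy of the data with the same factor coding used for the models.
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit_large <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)

# Estimate, standard error, t value, and p-value for the transmission term.
am_row <- summary(fit_large)$coefficients["amManual", ]

# 95% confidence interval for the adjusted manual-vs-automatic difference.
am_ci <- confint(fit_large, "amManual", level = 0.95)

print(round(am_row, 4))
print(round(am_ci, 2))
```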