In this project we analyze the relationship between MPG (miles per gallon) and other characteristics of a car. We use the mtcars dataset, which records several characteristics (columns) for a number of cars (rows). The analysis focuses on two questions:
Here is a glimpse of the dataset.
data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars$vs <- factor(mtcars$vs)
mtcars$am.label <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
Here we have converted the desired fields to factors and added am.label, a labelled copy of the transmission variable am (0 = Automatic, 1 = Manual).
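As a quick sanity check on the conversion described above, the original 0/1 codes can be cross-tabulated against the new labels. This is only a sketch; it rebuilds am.label from a fresh copy of mtcars so that it runs on its own.

```r
data("mtcars")
mtcars$am.label <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Cross-tabulate the original codes against the new labels:
# every 0 should fall under "Automatic" and every 1 under "Manual".
label_check <- table(code = mtcars$am, label = mtcars$am.label)
print(label_check)

# List which columns are stored as factors at this point.
print(sapply(mtcars, is.factor))
```

The off-diagonal cells of the table should be zero if the labels were assigned in the intended order.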
library(ggplot2)
ggplot(data = mtcars, aes(x = am.label, y = mpg, color = am.label)) + geom_point()
From the above plot, cars with manual transmission tend to achieve better MPG than cars with automatic transmission, although the two groups overlap.
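The visual impression can be backed up with per-group summaries. The sketch below uses only base R; the names group_means and group_sds are illustrative and not part of the original analysis.

```r
data("mtcars")
mtcars$am.label <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Mean and standard deviation of mpg within each transmission group.
group_means <- tapply(mtcars$mpg, mtcars$am.label, mean)
group_sds   <- tapply(mtcars$mpg, mtcars$am.label, sd)
print(round(rbind(mean = group_means, sd = group_sds), 3))

# A boxplot shows the spread and overlap of the two groups
# more directly than overplotted points.
boxplot(mpg ~ am.label, data = mtcars,
        xlab = "Transmission", ylab = "Miles per gallon")
```

The difference between the two group means is the quantity that the single-predictor regression on transmission type estimates.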
test1 <- lm(mpg ~ am.label, data = mtcars)
summary(test1)
##
## Call:
## lm(formula = mpg ~ am.label, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)      17.147      1.125  15.247 1.13e-15 ***
## am.labelManual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The model above indicates that manual-transmission cars average about 7.2 MPG more than automatic ones (coefficient 7.245), and the difference is statistically significant (p = 0.000285). However, the R-squared value shows that transmission alone explains only about 36% of the variance in MPG.
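To quantify the uncertainty around that difference, a confidence interval can be read off the fitted model. The sketch below also runs a pooled-variance two-sample t-test, which is equivalent to a regression on a single two-level factor, as a cross-check; the object name tt is illustrative.

```r
data("mtcars")
mtcars$am.label <- factor(mtcars$am, labels = c("Automatic", "Manual"))
test1 <- lm(mpg ~ am.label, data = mtcars)

# 95% confidence intervals for the intercept (automatic mean)
# and for the manual-minus-automatic difference.
print(confint(test1))

# Pooled-variance t-test on the same two groups; its p-value
# should match the one reported for am.labelManual.
tt <- t.test(mpg ~ am.label, data = mtcars, var.equal = TRUE)
print(tt$p.value)
print(summary(test1)$coefficients["am.labelManual", "Pr(>|t|)"])
```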
So we next look for other variables that explain a significant share of the variance in MPG.
anova(lm(mpg ~ ., data = mtcars))
## Analysis of Variance Table
##
## Response: mpg
##           Df Sum Sq Mean Sq  F value    Pr(>F)
## cyl        1 817.71  817.71 102.5913 2.298e-08 ***
## disp       1  37.59   37.59   4.7166  0.045252 *
## hp         1   9.37    9.37   1.1757  0.294304
## drat       1  16.47   16.47   2.0660  0.169883
## wt         1  77.48   77.48   9.7202  0.006629 **
## qsec       1   3.95    3.95   0.4955  0.491609
## vs         1   0.13    0.13   0.0163  0.900058
## am         1  14.47   14.47   1.8160  0.196569
## gear       2   2.32    1.16   0.1454  0.865782
## carb       5  19.03    3.81   0.4774  0.787894
## Residuals 16 127.53    7.97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
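The analysis of variance above suggests that cyl, disp and wt significantly explain variance in MPG (p-value ≤ 0.05; cyl and wt strongly, disp only marginally at p ≈ 0.045), so these three are included in the final model along with the transmission variable. Note that anova() tests terms sequentially, so these p-values depend on the order in which the variables enter the model.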
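Because the sequential table credits each variable only with what earlier terms left unexplained, reordering the predictors changes the result. The sketch below illustrates this on the three selected variables and then uses drop1(), which tests each term given all the others; fit_a and fit_b are illustrative names.

```r
data("mtcars")

# Same predictors, entered in two different orders.
fit_a <- lm(mpg ~ cyl + disp + wt, data = mtcars)
fit_b <- lm(mpg ~ wt + disp + cyl, data = mtcars)

# Sequential (Type I) tables: the sum of squares credited to
# each term depends on which terms were entered before it.
aov_a <- anova(fit_a)
aov_b <- anova(fit_b)
print(aov_a)
print(aov_b)

# Order-independent alternative: test each term after
# adjusting for all the other terms in the model.
print(drop1(fit_a, test = "F"))
```

Both fits describe the same model, so their residual sums of squares agree; only the split of explained variance among the terms differs.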
mdl <- lm(mpg ~ am.label + cyl + disp + wt, data = mtcars)
summary(mdl)
##
## Call:
## lm(formula = mpg ~ am.label + cyl + disp + wt, data = mtcars)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.318 -1.362 -0.479  1.354  6.059
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    40.898313   3.601540  11.356 8.68e-12 ***
## am.labelManual  0.129066   1.321512   0.098  0.92292
## cyl            -1.784173   0.618192  -2.886  0.00758 **
## disp            0.007404   0.012081   0.613  0.54509
## wt             -3.583425   1.186504  -3.020  0.00547 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.642 on 27 degrees of freedom
## Multiple R-squared: 0.8327, Adjusted R-squared: 0.8079
## F-statistic: 33.59 on 4 and 27 DF, p-value: 4.038e-10
The summary of the above multivariable model shows an R-squared of over 0.83, meaning the included variables explain over 83% of the variance in MPG. The variables cyl (number of engine cylinders) and wt (weight of the car) have p-values below 0.05 and act as confounders in the relationship between transmission type and MPG: once they are accounted for, the estimated manual-versus-automatic difference shrinks to about 0.13 MPG and is no longer significant (p = 0.92).
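One way to check whether the added variables improve on the transmission-only model is a nested-model F-test. The sketch below refits both models so it runs standalone, and also extracts the confidence interval for the adjusted transmission effect; cmp and ci_am are illustrative names.

```r
data("mtcars")
mtcars$am.label <- factor(mtcars$am, labels = c("Automatic", "Manual"))

test1 <- lm(mpg ~ am.label, data = mtcars)
mdl   <- lm(mpg ~ am.label + cyl + disp + wt, data = mtcars)

# F-test comparing the nested models: does adding cyl, disp
# and wt reduce the residual sum of squares significantly?
cmp <- anova(test1, mdl)
print(cmp)

# 95% confidence interval for the manual-vs-automatic
# difference after adjusting for cyl, disp and wt.
ci_am <- confint(mdl)["am.labelManual", ]
print(ci_am)
```

An interval that spans zero is consistent with the non-significant adjusted coefficient reported in the summary.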
par(mfrow = c(2, 2)) #accommodate all plots
plot(mdl)
The Residuals vs Fitted plot above shows a few outliers, but the residuals look homoscedastic rather than heteroscedastic: their spread is roughly constant across the fitted values. Normality is judged from the Normal Q-Q panel, where the residuals appear approximately normally distributed.
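The visual reading of the diagnostic plots can be supplemented with numeric checks. The sketch below is one possible approach in base R, not part of the original analysis: a Shapiro-Wilk test on the residuals and Cook's distance to flag influential cars.

```r
data("mtcars")
mtcars$am.label <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mdl <- lm(mpg ~ am.label + cyl + disp + wt, data = mtcars)

# Shapiro-Wilk test: a small p-value would be evidence
# against normally distributed residuals.
res <- resid(mdl)
print(shapiro.test(res))

# Cook's distance measures how much each car influences the
# fitted coefficients; list the three most influential cars.
cd <- cooks.distance(mdl)
print(head(sort(cd, decreasing = TRUE), 3))
```

With only 32 observations, such tests have limited power, so they are best read alongside the plots rather than in place of them.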
From this whole analysis we found that: