This report explores the relationship between miles per gallon (MPG) and the other variables in the mtcars dataset, and builds a regression model for MPG. It aims to answer two questions:
1. Is an automatic or a manual transmission better for MPG?
2. How large is the MPG difference between automatic and manual transmissions?
library(ggplot2)
library(datasets)
library(dplyr)
library(GGally)
data(mtcars)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
An exploratory box plot of mpg by transmission type (automatic vs. manual) is given in Appendix A. A two-sample t-test then checks whether the mean mpg of the two groups differs:
d_manual <- mtcars$mpg[mtcars$am == 1]
d_automatic <- mtcars$mpg[mtcars$am == 0]
t_anyz <- t.test(d_manual, d_automatic)
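The test object `t_anyz` is stored above but never printed. A minimal sketch of how its main fields could be inspected follows. It assumes only base R's `t.test`, which runs Welch's unequal-variance test by default, and rebuilds the same two group vectors so that it runs on its own.

```r
library(datasets)
data(mtcars)

# Split mpg by transmission type (am: 0 = automatic, 1 = manual)
d_manual    <- mtcars$mpg[mtcars$am == 1]
d_automatic <- mtcars$mpg[mtcars$am == 0]

# Welch two-sample t-test (t.test default: var.equal = FALSE)
t_anyz <- t.test(d_manual, d_automatic)

t_anyz$estimate   # group means: manual first, automatic second
t_anyz$conf.int   # 95% confidence interval for the difference in means
t_anyz$p.value    # two-sided p-value
```

A p-value below 0.05 would support treating the difference in mean mpg between the two groups as statistically significant.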
With this in hand, we fit a simple linear regression of mpg on transmission type (am).
fit1 <- lm(mpg ~ am, data = mtcars)
s_fit <- summary(fit1)
s_fit
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
From this regression fit we can draw the following interpretations.
- The adjusted R-squared is 0.3385, so the model explains only about 34% of the variance in mpg. A multivariate regression is needed to improve the model.
- Automatic transmission (am = 0) is the reference level, so the intercept (17.147) is the mean mpg of automatic cars and the am coefficient is the manual minus automatic difference: 7.245 MPG.
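With a single 0/1 predictor, the regression coefficients are just a re-expression of the two group means. The sketch below illustrates that reading of fit1; the name `group_means` is introduced here for illustration only.

```r
library(datasets)
data(mtcars)

fit1 <- lm(mpg ~ am, data = mtcars)

# Mean mpg per transmission group (names "0" = automatic, "1" = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept = automatic mean; intercept + am coefficient = manual mean
c(automatic = unname(coef(fit1)["(Intercept)"]),
  manual    = unname(sum(coef(fit1))))
group_means
```

Both printed vectors should show the same two values, which is why the am coefficient can be read directly as the unadjusted difference in mean mpg.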
Because the R-squared is low, a better model is needed. Appendix B shows which predictors should be included.
Based on Appendix B, we add the variables most strongly correlated with mpg (cyl, disp, hp and wt) to the model.
fit2 <- lm(mpg ~ cyl + disp + hp + wt + am, data = mtcars)
To check whether the larger model improves on the first, we run an analysis of variance (ANOVA) test comparing the two models.
anova(fit2, fit1)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + wt + am
## Model 2: mpg ~ am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 163.12
## 2 30 720.90 -4 -557.78 22.226 4.507e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is 4.507e-08, which indicates that the multivariate model fits significantly better than the initial one.
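The F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares, which makes clear what the test compares. This is a sketch using base R only; the names `rss1`, `rss2`, `f_stat` and `p_val` are illustrative.

```r
library(datasets)
data(mtcars)

fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- lm(mpg ~ cyl + disp + hp + wt + am, data = mtcars)

rss1 <- deviance(fit1)      # residual sum of squares, small model
rss2 <- deviance(fit2)      # residual sum of squares, large model
df1  <- df.residual(fit1)   # residual degrees of freedom, small model
df2  <- df.residual(fit2)   # residual degrees of freedom, large model

# F = (drop in RSS per added predictor) / (residual variance of large model)
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_val  <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

The order in which the two models are passed to `anova` changes the sign of the Df and Sum of Sq columns but not the F statistic or its p-value.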
n_fit <- summary(fit2)
n_fit
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5952 -1.5864 -0.7157 1.2821 5.5725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.20280 3.66910 10.412 9.08e-11 ***
## cyl -1.10638 0.67636 -1.636 0.11393
## disp 0.01226 0.01171 1.047 0.30472
## hp -0.02796 0.01392 -2.008 0.05510 .
## wt -3.30262 1.13364 -2.913 0.00726 **
## am 1.55649 1.44054 1.080 0.28984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.505 on 26 degrees of freedom
## Multiple R-squared: 0.8551, Adjusted R-squared: 0.8273
## F-statistic: 30.7 on 5 and 26 DF, p-value: 4.029e-10
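In the summary above, the am coefficient falls to 1.556 with a p-value of 0.29 once cyl, disp, hp and wt are held fixed. One way to quantify the adjusted manual minus automatic difference is a confidence interval for that coefficient. The sketch below assumes the usual 95% level and uses base R's `confint`; the name `ci_am` is illustrative.

```r
library(datasets)
data(mtcars)

fit2 <- lm(mpg ~ cyl + disp + hp + wt + am, data = mtcars)

# Estimate, standard error, t value and p-value for the am coefficient
coef(summary(fit2))["am", ]

# 95% confidence interval for the adjusted transmission effect
ci_am <- confint(fit2, "am", level = 0.95)
ci_am
```

If the interval contains zero, the data do not separate the transmission effect from the effects of the other predictors, so the 7.245 MPG gap from the single-predictor model should be read as an unadjusted difference.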
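# Label transmission type as a factor (0 = automatic, 1 = manual)
mtcars$am_factor <- factor(mtcars$am, labels = c("automatic", "manual"))
# Use the factor for the x axis and fill so the two groups get discrete colours
g <- ggplot(data = mtcars, aes(x = am_factor, y = mpg, fill = am_factor))
g <- g + geom_boxplot()
g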
The plot suggests that mpg depends on transmission type, with manual cars showing higher mpg than automatic cars. A t-test can confirm this.
The relationship between mpg and the other variables is examined with the ggcorr function.
ggcorr(mtcars)
## Warning in ggcorr(mtcars): data in column(s) 'am_factor' are not numeric and
## were ignored
From the heat map and correlation table, mpg depends most strongly on the variables whose absolute correlation with it exceeds 0.75: cyl, disp, hp and wt. These are the predictors added to am in the multivariate model.
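The 0.75 cut-off can also be applied numerically rather than read off the heat map. The following is a sketch using base R's `cor`; the names `cors` and `strong` are illustrative, and the threshold is applied to the absolute correlation so that strong negative relationships are kept.

```r
library(datasets)
data(mtcars)

# Pearson correlation of every numeric column with mpg
num_cols <- mtcars[sapply(mtcars, is.numeric)]
cors <- cor(num_cols)["mpg", ]
cors <- cors[names(cors) != "mpg"]   # drop mpg's correlation with itself

# Keep variables whose absolute correlation with mpg exceeds 0.75
strong <- cors[abs(cors) > 0.75]
round(strong, 3)
```

All four selected variables are negatively correlated with mpg, so heavier and more powerful cars tend to have lower fuel economy.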
Residual plots. The following are the diagnostic plots for the simple linear regression model (fit1).
par(mfrow = c(2,2))
plot(fit1)
These are the plots for the linear regression of mpg on am.
The following are the residual plots for the multivariate regression (fit2).
par(mfrow = c(2,2))
plot(fit2)