Motor Cars Data Analysis

Johns Hopkins Data Science Project: Regression Model

Suhas. P. K

2023-07-17

Introduction

Motor Trend is a magazine about the automobile industry. This analysis explores a data set describing a collection of cars to understand the relationship between fuel economy, in miles per gallon (mpg), and transmission type.
The data set, mtcars, ships with R in the datasets package.

Libraries used.

library(ggplot2)
library(ggdark)

Reading the data set.

data(mtcars)

Basic checks of the data set.

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Check the structure of data set.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
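The structure shows that am is stored as a numeric 0/1 indicator. A labelled factor copy can make later tables and plots easier to read. The following is a minimal sketch, assuming the coding 0 = automatic and 1 = manual described in ?mtcars; the names cars and am_label are illustrative and are not used elsewhere in this analysis.

```r
data(mtcars)

# Work on a copy so the original data frame stays untouched
cars <- mtcars

# Labelled factor version of the 0/1 transmission indicator
cars$am_label <- factor(cars$am,
                        levels = c(0, 1),
                        labels = c("Automatic", "Manual"))

# Number of cars per transmission type
table(cars$am_label)
```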

Check for missing data.

colSums(is.na(mtcars))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0

There are no missing values.

Exploratory Data Analysis

Preliminary analysis

Understand the distribution of miles per gallon variable.

histplot <- ggplot(data = mtcars,
                   aes(x = mpg)) + geom_histogram(color = "black",fill = "lightgreen") +
    xlab("miles per gallon")+
    ggtitle("Histogram of Miles per gallon")+
    dark_theme_light()
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
histplot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
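The message above suggests choosing the bin width explicitly. One possible sketch follows; the value binwidth = 2.5 and the object name histplot_bw are assumptions made for illustration, not choices from the original analysis.

```r
library(ggplot2)
data(mtcars)

# Same histogram, but with an explicit bin width instead of the default 30 bins
histplot_bw <- ggplot(data = mtcars, aes(x = mpg)) +
    geom_histogram(binwidth = 2.5, color = "black", fill = "lightgreen") +
    xlab("miles per gallon") +
    ggtitle("Histogram of Miles per gallon (binwidth = 2.5)")
histplot_bw
```

With only 32 observations, a wider bin usually gives a smoother picture of the distribution than the default.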

Plotting a scatter plot based on the transmission type and mpg.

scttrplot <- ggplot(data = mtcars,
                    aes(x = am, y = mpg, color = factor(am)))+ 
    geom_point(size = 2)+geom_smooth(method=lm, color = "yellow")

scttrplot +
    xlab("Transmission")+
    ylab("miles per gallon")+
    scale_colour_discrete(
        name = "Transmission",
        limits = c("0","1"),
        labels = c("Automatic",
                   "Manual")
    ) + dark_theme_linedraw()
## `geom_smooth()` using formula = 'y ~ x'

Visualizing the ‘mpg’ vs ‘transmission’ using boxplot.

bxplot <- ggplot(mtcars, aes(x=factor(am),y = mpg, color = factor(am)))+
    geom_boxplot() +
    geom_point(stat = "summary",
              fun = "mean",
              color = "white", label = "mean")+
    xlab("Transmission")+
    ylab("Miles per gallon")
## Warning in geom_point(stat = "summary", fun = "mean", color = "white", label =
## "mean"): Ignoring unknown parameters: `label`
bxplot +
    scale_colour_discrete(
    name = "Transmission",
    limits = c("0","1"),
    labels = c("Automatic","Manual")
) + dark_theme_light()
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().

To read the plots more precisely, the mean mpg for each transmission type is calculated below using tapply().

mean_am <- with(mtcars,
               tapply(mpg, am, mean))
mean_am 
##        0        1 
## 17.14737 24.39231

Now take the difference between the means of the two transmission types. This shows which transmission type has the better mpg.

mean_am[2]-mean_am[1]
##        1 
## 7.244939

In this case, the means show that cars with manual transmission travel 7.24 more miles per gallon on average than cars with automatic transmission.
Thus, judged by average mpg alone, manual transmission is better than automatic.
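The group comparison can be extended with a few more summary statistics per transmission type. This is a small sketch using base R only; the names stats_by_am and mean_diff are illustrative.

```r
data(mtcars)

# Split mpg by transmission type (0 = automatic, 1 = manual) and summarise
stats_by_am <- sapply(
    split(mtcars$mpg, mtcars$am),
    function(x) c(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
)
stats_by_am

# Difference in mean mpg, manual minus automatic
mean_diff <- stats_by_am["mean", "1"] - stats_by_am["mean", "0"]
mean_diff
```

Reporting the group sizes and standard deviations alongside the means helps judge how much weight the raw difference can carry.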

A slightly more advanced analysis

Performing a t-test comparing the means of the two transmission groups.

am_auto <- mtcars$mpg[mtcars$am == 0]
am_man <- mtcars$mpg[mtcars$am == 1]
t.test(
    am_auto, am_man,
    paired = FALSE,
    alternative = "two.sided",
    var.equal = FALSE
)
## 
##  Welch Two Sample t-test
## 
## data:  am_auto and am_man
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

The 95% confidence interval (-11.28, -3.21) does not contain zero and the p-value (0.001374) is less than 0.005. We can therefore conclude that the average fuel economy, in miles per gallon, of cars with automatic transmission is lower than that of cars with manual transmission. The difference in means quantifies the gap: manual cars average 7.24 mpg more than automatic cars.
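The same Welch test can also be written with the formula interface, which makes it easy to extract the quantities discussed here. This is a sketch; the object name tt is illustrative.

```r
data(mtcars)

# Welch two-sample t-test (var.equal = FALSE is the default);
# group 0 (automatic) is compared against group 1 (manual)
tt <- t.test(mpg ~ am, data = mtcars)

tt$p.value    # two-sided p-value
tt$conf.int   # 95% confidence interval for mean(automatic) - mean(manual)
tt$estimate   # the two group means
```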

Regression analysis

Simple linear model.

This model is fitted to compare its results with the analysis of means. The null hypothesis is that the slope of am is zero, that is, mean mpg does not differ between the transmission types.

single_model <- 
    lm(mtcars$mpg ~ mtcars$am)
summary(single_model)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## mtcars$am    7.244939   1.764422  4.106127 2.850207e-04

The results show that the p-value of the slope is less than 0.005. The null hypothesis can therefore be rejected, and the results of the exploratory analysis are confirmed: cars with manual transmission average 7.245 more miles per gallon. Because am is coded 0 for automatic and 1 for manual, a slope greater than zero means that manual transmission has the higher mpg.
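With a single 0/1 predictor, the intercept is the mean of the group coded 0 and the slope is the difference between the two group means. The sketch below checks this numerically; it uses the data argument of lm() rather than mtcars$ references, and the names fit_am and group_means are illustrative.

```r
data(mtcars)

fit_am <- lm(mpg ~ am, data = mtcars)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept equals the mean mpg of automatic cars (am = 0)
all.equal(unname(coef(fit_am)["(Intercept)"]), unname(group_means["0"]))

# Slope equals mean(manual) - mean(automatic)
all.equal(unname(coef(fit_am)["am"]),
          unname(group_means["1"] - group_means["0"]))
```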

Multivariable analysis.

require(MASS)
## Loading required package: MASS
multi_model <- stepAIC(
    lm(mpg~. , data = mtcars),
    direction = "both",
    trace = FALSE
)
multi_model$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
## Final Model:
## mpg ~ wt + qsec + am
## 
## 
##     Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                             21   147.4944 70.89774
## 2  - cyl  1 0.07987121        22   147.5743 68.91507
## 3   - vs  1 0.26852280        23   147.8428 66.97324
## 4 - carb  1 0.68546077        24   148.5283 65.12126
## 5 - gear  1 1.56497053        25   150.0933 63.45667
## 6 - drat  1 3.34455117        26   153.4378 62.16190
## 7 - disp  1 6.62865369        27   160.0665 61.51530
## 8   - hp  1 9.21946935        28   169.2859 61.30730

The best model indicated by the automated analysis uses wt, qsec and am as predictors, with mpg as the outcome.
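One way to check that the selected model improves on the transmission-only model is a nested-model F-test together with the adjusted R-squared. This is a sketch of such a comparison and was not part of the original analysis; the names fit_am, fit_final and cmp are illustrative.

```r
data(mtcars)

fit_am    <- lm(mpg ~ am, data = mtcars)              # transmission only
fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)  # model chosen by stepAIC

# F-test: do wt and qsec explain additional variance beyond am?
cmp <- anova(fit_am, fit_final)
cmp

# Adjusted R-squared of both models
c(am_only = summary(fit_am)$adj.r.squared,
  final   = summary(fit_final)$adj.r.squared)
```

A small p-value in the second row of the table indicates that adding wt and qsec significantly reduces the residual sum of squares.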

final_model <- lm(mtcars$mpg ~
                      mtcars$wt + mtcars$qsec + mtcars$am)
summary(final_model)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## mtcars$wt   -3.916504  0.7112016 -5.506882 6.952711e-06
## mtcars$qsec  1.225886  0.2886696  4.246676 2.161737e-04
## mtcars$am    2.935837  1.4109045  2.080819 4.671551e-02

Then, the regression equation is \(mpg = 9.618 - 3.917 wt + 1.226 qsec + 2.936 am\). The errors are assumed to have mean zero. As the two-sided p-value for the am coefficient is 0.04672, smaller than 0.05, we can reject the null hypothesis that this coefficient is zero. The diagnostic plots of the final model are drawn next.

par(mfrow = c(2,2))
plot(final_model)

Final Model Residuals: visual analysis of the diagnostic plots suggests that the final model behaves adequately, with approximately normal residuals and roughly constant variance. Leverage stays within a reasonable upper limit.
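The visual reading of the diagnostic plots can be complemented with numerical checks. The sketch below uses a Shapiro-Wilk test on the residuals and the common 2p/n rule of thumb for leverage; both are choices made here for illustration and were not part of the original analysis.

```r
data(mtcars)

fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(residuals(fit_final))
sw$p.value

# Leverage (hat values) against the 2p/n rule of thumb,
# where p is the number of coefficients and n the number of cars
h <- hatvalues(fit_final)
threshold <- 2 * length(coef(fit_final)) / nrow(mtcars)
h[h > threshold]   # cars with comparatively high leverage, if any
```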

Conclusion

  • Manual transmission gives better fuel economy (mpg) than automatic in this data set.
  • Without adjusting for other variables, cars with manual transmission travel 7.24 more miles per gallon on average than cars with automatic transmission.
  • There is an association between mpg and transmission, but other variables, such as qsec and wt, should also be considered beyond the type of transmission.
  • The obtained regression equation is mpg = 9.618 - 3.917 wt + 1.226 qsec + 2.936 am. Then, for the same weight (wt) and quarter mile time (qsec), manual transmission cars get 2.936 miles per gallon more than automatic transmission cars.
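The last point can be checked directly from the fitted model: for two cars with identical wt and qsec, the predicted mpg values differ by exactly the am coefficient (2.935837 in the coefficient table). The car specifications below (wt = 3, qsec = 18) are hypothetical values chosen for illustration.

```r
data(mtcars)

fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Two hypothetical cars, identical except for the transmission type
new_cars <- data.frame(wt = c(3, 3), qsec = c(18, 18), am = c(0, 1))
pred <- predict(fit_final, newdata = new_cars)
pred

# Predicted gain of manual over automatic equals the am coefficient
unname(diff(pred))
unname(coef(fit_final)["am"])

# 95% confidence interval for the adjusted transmission effect
confint(fit_final, "am")
```

The confidence interval shows how uncertain the adjusted effect is once weight and quarter mile time are taken into account.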