Overview

This is a project for Regression Models course, which is a part of Coursera’s Data Science and Data Science: Statistics and Machine Learning Specializations.

The report aims to explore the relationship between a set of variables and miles per gallon (\(MPG\)) in a data set mtcars, and answers the following questions:

  • Whether an automatic or manual transmission better for MPG?
  • What is the quantitative difference in MPG between automatic and manual transmissions?

  • Code chunks can be displayed by clicking Code button

Executive Summary

The mtcars is extracted from the 1974 Motor Trend US magazine: fuel consumption on 10 aspects for 32 automobiles.

Quantitative difference in MPG between automatic and manual transmissions:

  • initially, t.test found manual gearbox MPG is significantly more than automatic one:
    • 7.2449 MPG in mean difference between automatic and manual gearbox, p-value of 0.0014;
  • then, linear regression of MPG on gearbox performed pretty weak Adj.R-squared of 0.3385,
  • finally, a step wise regression lead to significant bestfit.model conclusions:
    • cylinders, weight, horsepower and gearbox are able to explain 84.01\(\%\) of variance,
    • the least influence of gearbox on MPG among the model variables,
    • manual gearbox MPG is only 1.8092 miles over automatic one.

Which transmission better for MPG: it seems impossible to give a correct answer while not specified who is meant “better” for:

  • for millionaires, there is no difference in their car gas mileage,
  • for gasoline manufactures, high fuel consumption is “better” ,
  • while for our planet, no one drives cars is “better” (and uses fuel at all).

Exploratory Data Analyses (EDA)

From mtcars R help page: there is a data frame with 32 observations on 11 (numeric) variables, particularly:

mpg cyl hp wt am
miles/(US) gallon nr. of cylinders gross horsepower weight (1000 lbs) transmission

Some variables are given clearer names, levels vars (incl. am: 0 = automatic, 1 = manual) are changed to factor type. Then, structure and summary of the data are drawn, and correlation matrix constructed

Structure/ Summary of the data

?mtcars
data("mtcars")
mycars <- data.table(mtcars, model = rownames(mtcars))
mycars <- mycars %>%
  transmute(model, mpg, ncyl = factor(cyl), disp, horsepower = hp, drat,
         weight=wt, qsec, engine = factor(vs, labels = c("Vshaped","straight")),
         gearbox = factor(am, labels=c("automatic","manual")),
         gear = factor(gear), ncarb = factor(carb))
str(mycars)
Classes 'data.table' and 'data.frame':  32 obs. of  12 variables:
 $ model     : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ mpg       : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ ncyl      : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp      : num  160 160 108 258 360 ...
 $ horsepower: num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat      : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ weight    : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec      : num  16.5 17 18.6 19.4 17 ...
 $ engine    : Factor w/ 2 levels "Vshaped","straight": 1 1 2 2 1 2 1 2 2 2 ...
 $ gearbox   : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
 $ gear      : Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
 $ ncarb     : Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
summary(mycars)
    model                mpg        ncyl        disp         horsepower   
 Length:32          Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0  
 Class :character   1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5  
 Mode  :character   Median :19.20   8:14   Median :196.3   Median :123.0  
                    Mean   :20.09          Mean   :230.7   Mean   :146.7  
                    3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0  
                    Max.   :33.90          Max.   :472.0   Max.   :335.0  
      drat           weight           qsec            engine        gearbox  
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Vshaped :18   automatic:19  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   straight:14   manual   :13  
 Median :3.695   Median :3.325   Median :17.71                               
 Mean   :3.597   Mean   :3.217   Mean   :17.85                               
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90                               
 Max.   :4.930   Max.   :5.424   Max.   :22.90                               
 gear   ncarb 
 3:15   1: 7  
 4:12   2:10  
 5: 5   3: 3  
        4:10  
        6: 1  
        8: 1  

Correlogram

res <- cor.mtest(mtcars, conf.level = .95)
corrmat<- corrplot.mixed(cor(mtcars), diag = NULL, order="FPC", p.mat = res$p,
                         number.font = 4, number.cex=.8,
                         tl.col = "purple3", mar = c(0, 0, 1, 0))

Interpretation:

  • strong negative correlations between mpg and ncyl/disp/weight/horsepower (alarming!)
  • high positive relation between ncyl and disp (could be an issue if both appear in a model).

Boxplots

There is no way to answer, which transmission is “better” for MPG unless it’s said who that should be better for.

Though, a close question can be answered: boxplots comparing MPG by gearbox show manual gearbox MPG is notably more than automatic one:

boxplot <- ggplot(mycars, aes(x = gearbox, y = mpg)) + 
  geom_boxplot(aes(fill=gearbox), color="darkgrey") +
  geom_jitter(width = 0.2, colour="cornflowerblue") +
  scale_fill_manual(name="gearbox", values = c("violet","purple"))+
  theme(axis.ticks = element_blank(), axis.text.x = element_blank(),
        axis.title.x = element_blank()) +
  scale_x_discrete(name="")+
ggtitle("MPG by gearbox type (interactive)")
ggplotly(boxplot)

Hypotheses testing

Statistics evidence: t-test for difference in gearbox means, \(H_0:\mu_{auto}=\mu_{manual}\):

ttest<-t.test(mpg ~ gearbox, data = mycars)
tval<- round(ttest$p.value, 5)
tmeans <- ttest$estimate
tdiff<- round(tmeans[[2]] - tmeans[[1]],5)
tint<- round(ttest$conf.int[1:2],5)
ttest

    Welch Two Sample t-test

data:  mpg by gearbox
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean in group automatic    mean in group manual 
               17.14737                24.39231 

Interpretation:

  • \(95\%\) conf. interval (-11.2802, -3.2097): the prob. of observing mean differences in it is \(~95\%\)
  • p-value=0.0014 is less than \(\alpha=0.05\): the probability of rejection true \(H_0\) less than \(5\%\)

So, the t-test confirms rejecting \(H_0\) in favour of \(H_a:\mu_{auto}\neq\mu_{manual}\), i.e. significant difference \(\mu_{manual}-\mu_{auto}=\) 7.2449, with confidence \(1-\)p-value=99.86\(\%\).

Regression Analysis

Regression models

Although there is a statistically significant difference in MPG between automatic and manual gearbox, the correlation matrix looks alarming. It seems, the large difference in mpg could be also explained by other predictors rather than only gearbox. So it’s time to compare gearbox.model (linear model of MPG on gearbox) to some multiple models: all.model (initial regression of MPG on all other variables) with further selection of the best.model by step-wise backward method. Comparison tools:

  • AIC (Akaike information criterion, both overfitting/ underfitting risk),
  • Adjusted R-squared (variance in MPG around its mean accounted by model),
  • ANOVA F-statistic p-value (significance level of the changes in model)
gearbox.model <- lm(mpg ~ gearbox, data = mycars)
all.model <- lm(mpg ~ gearbox + ncyl + weight + horsepower + ncarb + gear +
                   drat + disp + qsec + engine, data = mycars)
best.model <- step(all.model, direction = "backward" , data = mycars, trace = 0)
bdiff <- round(best.model$coefficients[[2]],5)

gearrsq<- round(summary(gearbox.model)$adj.r.squared,5)
bestrsq<- round(summary(best.model)$adj.r.squared,5)
allrsq<- round(summary(all.model)$adj.r.squared,5)

aic<- round(c(AIC(gearbox.model), AIC(best.model), AIC(all.model)),5)
fpvalue <- round(anova(gearbox.model, best.model, all.model)$`Pr(>F)`,5)
compar.mdls<- cbind(AIC=aic, `Adj.R-sq`=c(gearrsq, bestrsq, allrsq),`Pr(>F)`= fpvalue)
row.names(compar.mdls)<-c("gearbox.model", "best.model","all.model")

best.model

Call:
lm(formula = mpg ~ gearbox + ncyl + weight + horsepower, data = mycars)

Coefficients:
  (Intercept)  gearboxmanual          ncyl6          ncyl8         weight  
     33.70832        1.80921       -3.03134       -2.16368       -2.49683  
   horsepower  
     -0.03211  
anova(gearbox.model, best.model, all.model)
Analysis of Variance Table

Model 1: mpg ~ gearbox
Model 2: mpg ~ gearbox + ncyl + weight + horsepower
Model 3: mpg ~ gearbox + ncyl + weight + horsepower + ncarb + gear + drat + 
    disp + qsec + engine
  Res.Df    RSS Df Sum of Sq       F     Pr(>F)    
1     30 720.90                                    
2     26 151.03  4    569.87 17.7489 0.00001476 ***
3     15 120.40 11     30.62  0.3468     0.9588    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
compar.mdls
                   AIC Adj.R-sq  Pr(>F)
gearbox.model 196.4844  0.33846      NA
best.model    154.4669  0.84009 0.00001
all.model     169.2155  0.77902 0.95882

Interpretation:

  • step() function finds that best.model includes ncyl, weight, horsepower (the leaders of correlation matrix excepting disp, highly correlating with ncyl), and gearbox - the weakest regressor in this model
  • best.model demonstrates best results in all 3 nominations
    • AIC: improves a goodness of gearbox.model by \(AIC_{gearbox}-AIC_{best}=\) 42.0175, and of all.model by \(AIC_{all}-AIC_{best}=\) 14.7485
    • Adj.R-squared: improves explained variance to 84.01\(\%\) comparing to 33.85\(\%\) of gearbox.model, and of 77.9\(\%\) all.model
    • F-statistic: significantly, with p-value=0, improves gearbox.model, and doesn’t give all.model with its p-value=0.9588, a single chance to improve the result
  • all the best.model predictors except gearbox show negative relationship with mpg, i.e. gas mileage decreases with increasing number of cylinders, weight, horsepower
    • and increases with changing gearbox from automatic to manual
  • changing number of cylinders from 4 to 6 decreases average mpg by 3.0313, and to 8 - by 2.1637
    • adding a lb/1000 (weight) decreases average mpg by 2.4968
    • adding a horsepower decreases average mpg by 0.0321
  • changing gearbox from automatic to manual increases average mpg by 1.8092

Residual Analysis and Diagnostics

Residual plots demonstrate outliers with high leverage/ influence:

par(mfrow=c(2,2), oma=c(0,0,2,0))
bestres<-plot(best.model, col="cornflowerblue", sub.caption = "Residual plots for 'best.model': mpg ~ gearbox + ncyl + weight + horsepower")

Diagnostic tests identify potential influencers (high leverage/ outlier) with Deletion diagnostics and Bonferroni test:

infmes<-names(which(influence.measures(best.model)$is.inf,arr.ind = TRUE)[,1])
out <- names(outlierTest(best.model)$bonf.p)
influence.points <-as.integer(c(infmes,out))
influence.cars<-mycars$model[influence.points]
influence.cars <- cbind(influence.points,influence.cars)
influence.cars
     influence.points influence.cars       
[1,] "16"             "Lincoln Continental"
[2,] "31"             "Maserati Bora"      
[3,] "20"             "Toyota Corolla"     

Interpretation

  • Residuals vs. Fitted: residuals appear random, no consistent pattern \(=>\) independent
  • Normal Q-Q: points mostly fall on the line \(=>\) normally distributed
  • Scale-Location: points indicate some level of homoscedasticity (constant variance)
  • Residuals vs. Leverage: the model appears to have some outliers/ leverage points
  • Potential Influencers: observations №16 - Lincoln Continental, №31 - Maserati Bora, and №20 - Toyota Corolla,
    • additional analysis is required to decide whether these points should be deleted.

SUMMARISING

  • number of cylinders, weight, horsepower have higher influence on MPG than \(gearbox\) type
  • manual gearbox over automatic enhances mileage by only \(1.8092\) miles per gallon