This is a project for Regression Models course, which is a part of Coursera’s Data Science and Data Science: Statistics and Machine Learning Specializations.
The report aims to explore the relationship between a set of variables and miles per gallon (\(MPG\)) in a data set mtcars, and answers the following questions:
Code buttonThe mtcars is extracted from the 1974 Motor Trend US magazine: fuel consumption on 10 aspects for 32 automobiles.
Quantitative difference in MPG between automatic and manual transmissions:
t.test found manual gearbox MPG is significantly more than automatic one:
p-value of 0.0014;bestfit.model conclusions:
Which transmission better for MPG: it seems impossible to give a correct answer while not specified who is meant “better” for:
From mtcars R help page: there is a data frame with 32 observations on 11 (numeric) variables, particularly:
mpg |
cyl |
hp |
wt |
am |
|---|---|---|---|---|
| miles/(US) gallon | nr. of cylinders | gross horsepower | weight (1000 lbs) | transmission |
Some variables are given clearer names, levels vars (incl. am: 0 = automatic, 1 = manual) are changed to factor type. Then, structure and summary of the data are drawn, and correlation matrix constructed
?mtcars
data("mtcars")
mycars <- data.table(mtcars, model = rownames(mtcars))
mycars <- mycars %>%
transmute(model, mpg, ncyl = factor(cyl), disp, horsepower = hp, drat,
weight=wt, qsec, engine = factor(vs, labels = c("Vshaped","straight")),
gearbox = factor(am, labels=c("automatic","manual")),
gear = factor(gear), ncarb = factor(carb))
str(mycars)Classes 'data.table' and 'data.frame': 32 obs. of 12 variables:
$ model : chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ ncyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ disp : num 160 160 108 258 360 ...
$ horsepower: num 110 110 93 110 175 105 245 62 95 123 ...
$ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ weight : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec : num 16.5 17 18.6 19.4 17 ...
$ engine : Factor w/ 2 levels "Vshaped","straight": 1 1 2 2 1 2 1 2 2 2 ...
$ gearbox : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
$ gear : Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
$ ncarb : Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
summary(mycars) model mpg ncyl disp horsepower
Length:32 Min. :10.40 4:11 Min. : 71.1 Min. : 52.0
Class :character 1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5
Mode :character Median :19.20 8:14 Median :196.3 Median :123.0
Mean :20.09 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :472.0 Max. :335.0
drat weight qsec engine gearbox
Min. :2.760 Min. :1.513 Min. :14.50 Vshaped :18 automatic:19
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 straight:14 manual :13
Median :3.695 Median :3.325 Median :17.71
Mean :3.597 Mean :3.217 Mean :17.85
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
Max. :4.930 Max. :5.424 Max. :22.90
gear ncarb
3:15 1: 7
4:12 2:10
5: 5 3: 3
4:10
6: 1
8: 1
res <- cor.mtest(mtcars, conf.level = .95)
corrmat<- corrplot.mixed(cor(mtcars), diag = NULL, order="FPC", p.mat = res$p,
number.font = 4, number.cex=.8,
tl.col = "purple3", mar = c(0, 0, 1, 0))Interpretation:
mpg and ncyl/disp/weight/horsepower (alarming!)ncyl and disp (could be an issue if both appear in a model).There is no way to answer, which transmission is “better” for MPG unless it’s said who that should be better for.
Though, a close question can be answered: boxplots comparing MPG by gearbox show manual gearbox MPG is notably more than automatic one:
boxplot <- ggplot(mycars, aes(x = gearbox, y = mpg)) +
geom_boxplot(aes(fill=gearbox), color="darkgrey") +
geom_jitter(width = 0.2, colour="cornflowerblue") +
scale_fill_manual(name="gearbox", values = c("violet","purple"))+
theme(axis.ticks = element_blank(), axis.text.x = element_blank(),
axis.title.x = element_blank()) +
scale_x_discrete(name="")+
ggtitle("MPG by gearbox type (interactive)")
ggplotly(boxplot)Statistics evidence: t-test for difference in gearbox means, \(H_0:\mu_{auto}=\mu_{manual}\):
ttest<-t.test(mpg ~ gearbox, data = mycars)
tval<- round(ttest$p.value, 5)
tmeans <- ttest$estimate
tdiff<- round(tmeans[[2]] - tmeans[[1]],5)
tint<- round(ttest$conf.int[1:2],5)
ttest
Welch Two Sample t-test
data: mpg by gearbox
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group automatic mean in group manual
17.14737 24.39231
Interpretation:
p-value=0.0014 is less than \(\alpha=0.05\): the probability of rejection true \(H_0\) less than \(5\%\)So, the t-test confirms rejecting \(H_0\) in favour of \(H_a:\mu_{auto}\neq\mu_{manual}\), i.e. significant difference \(\mu_{manual}-\mu_{auto}=\) 7.2449, with confidence \(1-\)p-value=99.86\(\%\).
Regression models
Although there is a statistically significant difference in MPG between automatic and manual gearbox, the correlation matrix looks alarming. It seems, the large difference in mpg could be also explained by other predictors rather than only gearbox. So it’s time to compare gearbox.model (linear model of MPG on gearbox) to some multiple models: all.model (initial regression of MPG on all other variables) with further selection of the best.model by step-wise backward method. Comparison tools:
gearbox.model <- lm(mpg ~ gearbox, data = mycars)
all.model <- lm(mpg ~ gearbox + ncyl + weight + horsepower + ncarb + gear +
drat + disp + qsec + engine, data = mycars)
best.model <- step(all.model, direction = "backward" , data = mycars, trace = 0)
bdiff <- round(best.model$coefficients[[2]],5)
gearrsq<- round(summary(gearbox.model)$adj.r.squared,5)
bestrsq<- round(summary(best.model)$adj.r.squared,5)
allrsq<- round(summary(all.model)$adj.r.squared,5)
aic<- round(c(AIC(gearbox.model), AIC(best.model), AIC(all.model)),5)
fpvalue <- round(anova(gearbox.model, best.model, all.model)$`Pr(>F)`,5)
compar.mdls<- cbind(AIC=aic, `Adj.R-sq`=c(gearrsq, bestrsq, allrsq),`Pr(>F)`= fpvalue)
row.names(compar.mdls)<-c("gearbox.model", "best.model","all.model")
best.model
Call:
lm(formula = mpg ~ gearbox + ncyl + weight + horsepower, data = mycars)
Coefficients:
(Intercept) gearboxmanual ncyl6 ncyl8 weight
33.70832 1.80921 -3.03134 -2.16368 -2.49683
horsepower
-0.03211
anova(gearbox.model, best.model, all.model)Analysis of Variance Table
Model 1: mpg ~ gearbox
Model 2: mpg ~ gearbox + ncyl + weight + horsepower
Model 3: mpg ~ gearbox + ncyl + weight + horsepower + ncarb + gear + drat +
disp + qsec + engine
Res.Df RSS Df Sum of Sq F Pr(>F)
1 30 720.90
2 26 151.03 4 569.87 17.7489 0.00001476 ***
3 15 120.40 11 30.62 0.3468 0.9588
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
compar.mdls AIC Adj.R-sq Pr(>F)
gearbox.model 196.4844 0.33846 NA
best.model 154.4669 0.84009 0.00001
all.model 169.2155 0.77902 0.95882
Interpretation:
step() function finds that best.model includes ncyl, weight, horsepower (the leaders of correlation matrix excepting disp, highly correlating with ncyl), and gearbox - the weakest regressor in this modelbest.model demonstrates best results in all 3 nominations
gearbox.model by \(AIC_{gearbox}-AIC_{best}=\) 42.0175, and of all.model by \(AIC_{all}-AIC_{best}=\) 14.7485gearbox.model, and of 77.9\(\%\) all.modelp-value=0, improves gearbox.model, and doesn’t give all.model with its p-value=0.9588, a single chance to improve the resultbest.model predictors except gearbox show negative relationship with mpg, i.e. gas mileage decreases with increasing number of cylinders, weight, horsepower
Residual plots demonstrate outliers with high leverage/ influence:
par(mfrow=c(2,2), oma=c(0,0,2,0))
bestres<-plot(best.model, col="cornflowerblue", sub.caption = "Residual plots for 'best.model': mpg ~ gearbox + ncyl + weight + horsepower")Diagnostic tests identify potential influencers (high leverage/ outlier) with Deletion diagnostics and Bonferroni test:
infmes<-names(which(influence.measures(best.model)$is.inf,arr.ind = TRUE)[,1])
out <- names(outlierTest(best.model)$bonf.p)
influence.points <-as.integer(c(infmes,out))
influence.cars<-mycars$model[influence.points]
influence.cars <- cbind(influence.points,influence.cars)
influence.cars influence.points influence.cars
[1,] "16" "Lincoln Continental"
[2,] "31" "Maserati Bora"
[3,] "20" "Toyota Corolla"
Interpretation