Loading the mtcars data and some exploratory data analysis

# Load mtcars data
#library(UsingR)
data(mtcars)
head(mtcars)
mtcars <- transform(mtcars, vs=factor(vs), am=factor(am))
levels(mtcars$am) = c("Automa", "Manual")
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automa","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

First, we compare automatic and manual transmissions using PCA

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.4.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
X <- mtcars[,-c(8)]
res <- PCA(X, scale.unit = TRUE, quali.sup = 8, graph = FALSE)
plot.PCA(res, axes=c(1, 2), choix="var")

plot.PCA(res, axes=c(1, 2), choix="ind", habillage=8)

The PCA plots show that cars with manual transmission are positively associated with mpg, so manual transmission seems better for mpg than automatic transmission. The plots also suggest that the two transmission groups differ on qsec, the time in seconds to travel 1/4 mile from a standstill, which is a measure of acceleration (a lower qsec means a faster car).
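The reading of the PCA plots can be cross-checked numerically with group means. This is a minimal sketch in base R, not part of the original analysis; the names `cars` and `group_means` are introduced here for illustration, and the block works on a fresh copy of the built-in dataset so the `mtcars` object used in the rest of the report is left untouched.

```r
# Fresh copy of the built-in data (assumption: we do not reuse the modified mtcars).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean mpg and mean quarter-mile time (qsec) per transmission type.
group_means <- aggregate(cbind(mpg, qsec) ~ am, data = cars, FUN = mean)
print(group_means)
```

If the PCA reading is right, the Manual row should show the higher mean mpg.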

Mean of MPG per type of transmission

A boxplot can tell us more

g <- ggplot(data = mtcars, aes(x=am, y=mpg, fill = am))
g <- g+geom_boxplot()
g <- g+ggtitle("Boxplot of MPG by type of transmission")
g

We can now use t.test to see whether the difference in means is significant

T_autom <- mtcars[mtcars$am == "Automa",]
T_manual <- mtcars[mtcars$am == "Manual",]
t.test(T_autom$mpg, T_manual$mpg)
## 
##  Welch Two Sample t-test
## 
## data:  T_autom$mpg and T_manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

The p-value is 0.001374 < 0.05, so the means of the two groups are significantly different.
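The same Welch test can also be written with the formula interface, which avoids splitting the data frame by hand and makes it easy to pull out individual results. This is a sketch assuming a fresh copy of the data; `cars` and `tt` are illustrative names.

```r
# Fresh copy of the built-in data with a labelled transmission factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Welch two-sample t-test (unequal variances is the default in t.test).
tt <- t.test(mpg ~ am, data = cars)

tt$estimate   # group means
tt$conf.int   # 95% confidence interval for the difference in means
tt$p.value    # p-value of the test
```

The confidence interval is for the Automatic mean minus the Manual mean, so an interval entirely below zero points to higher mpg for manual cars.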

Exploring the relationship between a set of variables and miles per gallon (MPG)

ggpairs of the data

require(GGally)
## Loading required package: GGally
## Warning: package 'GGally' was built under R version 3.4.4
g = ggpairs(mtcars[,1:7])
g

We see here that all numeric variables are strongly correlated with mpg, except qsec
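The correlations shown in the ggpairs plot can also be listed directly. The following is a small base-R sketch; `cars` and `mpg_cor` are names introduced here for illustration.

```r
# Fresh copy of the built-in data; columns 1 to 7 are the numeric variables
# mpg, cyl, disp, hp, drat, wt and qsec.
cars <- datasets::mtcars

# Pearson correlation of every other numeric variable with mpg.
mpg_cor <- cor(cars[, 1:7])["mpg", -1]
round(sort(mpg_cor), 2)
```

Sorting the values makes it easy to see which variables are most strongly tied to mpg and that qsec has the weakest correlation in absolute value.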

Relationship between mpg and the other variables

Relationship between mpg and am

To study the relationship between mpg and the other variables, let’s first see whether transmission alone can fit mpg

fit <- lm(mpg~am, data = mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The adjusted R-squared is small (0.3385). Although the am coefficient is significant, am alone explains only about a third of the variance in mpg.
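The R-squared values can be extracted from the model summary rather than read off the printout. This is a minimal sketch on a fresh copy of the data; `cars`, `fit_am` and `s` are illustrative names.

```r
# Fresh copy of the built-in data with a labelled transmission factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Simple regression of mpg on transmission type only.
fit_am <- lm(mpg ~ am, data = cars)
s <- summary(fit_am)

# Share of variance in mpg explained by am, raw and adjusted.
c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)

# The intercept is the Automatic mean; the am coefficient is the Manual minus Automatic gap.
coef(fit_am)
```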

Let’s look for the best linear model to fit mpg, using anova to assess the significance of the added regressors

model <- lm(mpg~., data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs1          0.31776    2.10451   0.151   0.8814  
## amManual     2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

From the coefficients, wt is the only variable that comes close to significance (p = 0.0633); none is significant at the 0.05 level. Including all variables will likely result in overfitting, so we need to test different models with different explanatory variables.

We will use anova to assess the significance of the added regressors. The null hypothesis is that the added regressors are not significant.

# Nested models, each adding one regressor to the previous one
fit1 <- lm(mpg~wt, data = mtcars)
fit2 <- lm(mpg~wt+am, data = mtcars)
fit3 <- lm(mpg~wt+am+qsec, data = mtcars)
fit4 <- lm(mpg~wt+am+qsec+hp, data = mtcars)
anova(fit1, fit2, fit3, fit4)

The third model, fit3, seems to be the best model to predict mpg: the p-value for its added regressor is 0.0002055.
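The p-values of such a nested comparison can be pulled out of the anova table programmatically. The sketch below refits the nested models on a fresh copy of the data; the names `cars`, `m1` to `m4`, `tab` and `p_added` are introduced here for illustration.

```r
# Fresh copy of the built-in data with a labelled transmission factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Nested models: each one adds a single regressor to the previous model.
m1 <- lm(mpg ~ wt, data = cars)
m2 <- lm(mpg ~ wt + am, data = cars)
m3 <- lm(mpg ~ wt + am + qsec, data = cars)
m4 <- lm(mpg ~ wt + am + qsec + hp, data = cars)

# Sequential F-tests; row k tests the regressor added in model k.
tab <- anova(m1, m2, m3, m4)
p_added <- tab[["Pr(>F)"]]
names(p_added) <- c("wt only", "+ am", "+ qsec", "+ hp")
p_added
```

The first entry is NA because the first model has nothing to be compared against.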

fit3 <- lm(mpg~wt+am+qsec, data = mtcars)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ wt + am + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The adjusted R-squared of fit3 is 0.8336.

fit4 explains more variance than fit3, but we choose fit3 because the variance inflation factors of fit4 are larger than those of fit3

library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
vif(fit3)
##       wt       am     qsec 
## 2.482952 2.541437 1.364339
vif(fit4)
##       wt       am     qsec       hp 
## 3.964515 2.541527 3.216021 4.922129
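For readers who want to see where these numbers come from, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. Below is a minimal base-R sketch of that definition; `vif_manual` is a hypothetical helper written for this illustration, not a function from the car package, and am is kept as a 0/1 numeric column, which gives the same value for a two-level variable.

```r
# Fresh copy of the built-in data; am stays numeric (0/1) for this sketch.
cars <- datasets::mtcars

# Hypothetical helper: VIF of each predictor from its regression on the others.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif3 <- vif_manual(cars, c("wt", "am", "qsec"))
vif4 <- vif_manual(cars, c("wt", "am", "qsec", "hp"))
round(vif3, 3)
round(vif4, 3)
```

Adding hp inflates the VIFs of wt and qsec because hp is correlated with both.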

To finish our work, let’s check whether the assumptions of the regression are satisfied

par(mfrow=c(2,2))
plot(fit3)

Conclusion and interpretation of coefficients of our best linear model

It looks like the best model is the one that includes wt, qsec and am, which means that besides transmission type, weight and acceleration also need to be considered. Weight is negatively associated with mpg, while qsec and am are positively associated with it. Holding the other variables fixed, every 1000 lb increase in weight is associated with a decrease of roughly 4 mpg, every additional second of 1/4 mile time is associated with an increase of about 1.2 mpg, and on average manual transmission is 2.9 mpg better than automatic transmission. The model explains about 83% of the variance (adjusted R-squared). The residuals also seem to be randomly scattered in the diagnostic plots.
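Confidence intervals give a sense of how uncertain these coefficient estimates are. This is a sketch that refits the chosen model on a fresh copy of the data; `cars`, `m3` and `ci` are illustrative names.

```r
# Fresh copy of the built-in data with a labelled transmission factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# The selected model: weight, transmission and quarter-mile time.
m3 <- lm(mpg ~ wt + am + qsec, data = cars)

# Point estimates next to their 95% confidence intervals.
ci <- confint(m3, level = 0.95)
round(cbind(estimate = coef(m3), ci), 3)
```

An interval that does not contain zero corresponds to a coefficient that is significant at the 5% level in the summary table.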

However, there are some poorly fitted or influential points (128 and 230). Examining the leverage of these points would be an important next step to improve our analysis.
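Leverage and influence can be examined with base R diagnostics. The sketch below is one possible starting point, not the original author's procedure; the cutoff of 4 / n for Cook's distance is a common rule of thumb and is an assumption here, and `cars`, `m3`, `diag_tab` and `flagged` are illustrative names.

```r
# Fresh copy of the built-in data with a labelled transmission factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
m3 <- lm(mpg ~ wt + am + qsec, data = cars)

# Leverage (hat values) and Cook's distance for every car.
diag_tab <- data.frame(leverage = hatvalues(m3), cooks_d = cooks.distance(m3))

# The five cars with the largest Cook's distance.
head(diag_tab[order(-diag_tab$cooks_d), ], 5)

# Cars above the rule-of-thumb cutoff 4 / n (assumed threshold).
n <- nrow(cars)
flagged <- rownames(diag_tab)[diag_tab$cooks_d > 4 / n]
flagged
```

Refitting the model without the flagged cars and comparing coefficients is one way to judge how much they drive the conclusions.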

Thanks