# Load mtcars data
#library(UsingR)
data(mtcars)
head(mtcars)
mtcars <- transform(mtcars, vs=factor(vs), am=factor(am))
levels(mtcars$am) = c("Automa", "Manual")
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automa","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.4.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
X <- mtcars[,-c(8)]
res <- PCA(X, scale.unit = TRUE, quali.sup = 8, graph = FALSE)
plot.PCA(res, axes=c(1, 2), choix="var")
plot.PCA(res, axes=c(1, 2), choix="ind", habillage=8)
The PCA plots show that cars with manual transmission are fairly strongly and positively associated with mpg, so manual transmission seems better for mpg than automatic transmission. However, it also appears that the cars with automatic transmission, compared with those with manual transmission, stand out on a performance measure that mainly reflects acceleration: qsec, the fastest time to travel 1/4 mile from a standstill (in seconds).
A boxplot can tell us more.
g <- ggplot(data = mtcars, aes(x=am, y=mpg, fill = am))
g <- g+geom_boxplot()
g <- g+ggtitle("Boxplot of MPG by type of transmission")
g
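As a numeric complement to the PCA and boxplot readings, the group means can be tabulated directly. This is only a minimal base-R sketch and not part of the original analysis; the names cars and group_means are ours, and the factor labels follow the ones used above.

```r
# Sketch: mean mpg and qsec per transmission type, using base R only.
data(mtcars)

# Work on a copy so the original data frame stays untouched.
cars <- transform(mtcars, am = factor(am, labels = c("Automa", "Manual")))

# One row per transmission type, one column per averaged variable.
group_means <- aggregate(cbind(mpg, qsec) ~ am, data = cars, FUN = mean)
group_means
```

Reading the mpg and qsec columns side by side shows whether the group with the higher mpg is also the one with the shorter quarter-mile time.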
We can now use t.test to see whether the difference in means is significant.
T_autom <- mtcars[mtcars$am == "Automa",]
T_manual <- mtcars[mtcars$am == "Manual",]
t.test(T_autom$mpg, T_manual$mpg)
##
## Welch Two Sample t-test
##
## data: T_autom$mpg and T_manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
The p-value is 0.001374 < 0.05, so the means of the two groups are significantly different.
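The same test can also be written with the formula interface, which avoids splitting the data by hand and makes it easy to pull out individual results. A minimal sketch, assuming the same factor labels as above; cars and tt are names introduced here for illustration.

```r
# Sketch: Welch two-sample t-test via the formula interface.
data(mtcars)
cars <- transform(mtcars, am = factor(am, labels = c("Automa", "Manual")))

# t.test() uses var.equal = FALSE by default, i.e. the Welch test.
tt <- t.test(mpg ~ am, data = cars)

tt$p.value    # p-value of the test
tt$conf.int   # 95% confidence interval for the difference in means
tt$estimate   # the two group means
```

The result is an object of class "htest", so the components above can be reused in later text or tables instead of being copied from the printed output.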
(MPG)
require(GGally)
## Loading required package: GGally
## Warning: package 'GGally' was built under R version 3.4.4
g = ggpairs(mtcars[,1:7])
g
We see here that all the numeric variables are strongly correlated with mpg, except qsec.
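To put numbers on what the pairs plot shows, the correlations with mpg can be extracted from the correlation matrix. This is a small base-R sketch over the same seven columns; the name r is ours.

```r
# Sketch: correlation of mpg with the other numeric variables in columns 1 to 7.
data(mtcars)

# Take the "mpg" row of the correlation matrix and drop mpg itself (column 1).
r <- cor(mtcars[, 1:7])["mpg", -1]

# Rank the variables by the strength of their correlation with mpg.
round(sort(abs(r), decreasing = TRUE), 3)
```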
To study the relationship between mpg and the other variables, let's first see whether transmission alone can fit mpg.
fit <- lm(mpg~am, data = mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The adjusted R-squared is small (0.3385). The am coefficient is significant, but am alone explains only about a third of the variance in mpg.
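With a single two-level factor as regressor, the coefficients have a direct reading: the intercept is the mean mpg of the reference level and the amManual coefficient is the difference between the two group means. The sketch below checks this; cars, group_means, intercept and slope are illustrative names.

```r
# Sketch: lm(mpg ~ am) reproduces the two group means.
data(mtcars)
cars <- transform(mtcars, am = factor(am, labels = c("Automa", "Manual")))

fit <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

intercept <- unname(coef(fit)["(Intercept)"])  # mean of the reference level
slope <- unname(coef(fit)["amManual"])         # difference between the means

# Both comparisons should return TRUE.
all.equal(intercept, unname(group_means["Automa"]))
all.equal(intercept + slope, unname(group_means["Manual"]))
```

This is also why the estimates 17.147 and 7.245 add up to the manual mean reported by the t-test.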
model <- lm(mpg~., data = mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs1 0.31776 2.10451 0.151 0.8814
## amManual 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
From the coefficients, it looks like wt is the only variable that comes close to a significant association with mpg (p = 0.0633, significant only at the 10% level). However, including all the variables may result in overfitting, so we need to test different models with different explanatory variables.
We will use anova to assess the significance of the added regressors. The null hypothesis is that the added regressors are not significant.
# Nested models: each one adds a single regressor to the previous one
fit1 <- lm(mpg~wt, data = mtcars)
fit2 <- lm(mpg~wt+am, data = mtcars)
fit3 <- lm(mpg~wt+am+qsec, data = mtcars)
fit4 <- lm(mpg~wt+am+qsec+hp, data = mtcars)
anova(fit1, fit2, fit3, fit4)
The third model, fit3, seems to be the best model for predicting mpg: the p-value for its added regressor is 0.0002055.
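The F-test that anova applies to two nested models can be reproduced by hand from the residual sums of squares. The sketch below does this for the step that adds qsec; it compares only two models, whereas a call with several models uses the residual variance of the largest one, so the p-values can differ slightly. All variable names apart from fit2 and fit3 are ours.

```r
# Sketch: the nested-model F-test behind anova(), for mpg ~ wt + am versus + qsec.
data(mtcars)
cars <- transform(mtcars, am = factor(am, labels = c("Automa", "Manual")))

fit2 <- lm(mpg ~ wt + am, data = cars)
fit3 <- lm(mpg ~ wt + am + qsec, data = cars)

rss2 <- deviance(fit2)   # residual sum of squares, smaller model
rss3 <- deviance(fit3)   # residual sum of squares, larger model
df_added <- df.residual(fit2) - df.residual(fit3)

# Drop in RSS per added parameter, relative to the larger model's residual variance.
f_stat <- ((rss2 - rss3) / df_added) / (rss3 / df.residual(fit3))
p_val <- pf(f_stat, df_added, df.residual(fit3), lower.tail = FALSE)

tab <- anova(fit2, fit3)
c(by_hand = f_stat, anova = tab$F[2])
c(by_hand = p_val, anova = tab$`Pr(>F)`[2])
```

Because only one regressor is added, this F statistic is the square of the t value of qsec in summary(fit3).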
fit3 <- lm(mpg~wt+am+qsec, data = mtcars)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ wt + am + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## amManual 2.9358 1.4109 2.081 0.046716 *
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The adjusted R-squared of fit3 is 0.8336, so the model explains about 83% of the variance in mpg.
fit4 explains more variance than fit3, but we choose fit3 because the variance inflation factors of fit4 are larger than those of fit3.
library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
vif(fit3)
## wt am qsec
## 2.482952 2.541437 1.364339
vif(fit4)
## wt am qsec hp
## 3.964515 2.541527 3.216021 4.922129
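For a term with a single coefficient, the variance inflation factor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other predictors of the model. The sketch below computes it by hand for wt in fit3 with base R only; r2_wt and vif_wt are illustrative names, and the result should agree with the wt entry reported by vif(fit3).

```r
# Sketch: variance inflation factor of wt in mpg ~ wt + am + qsec, by hand.
data(mtcars)
cars <- transform(mtcars, am = factor(am, labels = c("Automa", "Manual")))

# Regress wt on the other predictors of fit3 and keep the R-squared.
r2_wt <- summary(lm(wt ~ am + qsec, data = cars))$r.squared

# The better the other predictors explain wt, the larger the inflation.
vif_wt <- 1 / (1 - r2_wt)
vif_wt
```

Repeating the same regression with hp added to the right-hand side gives the corresponding value for fit4, which makes the source of the extra inflation visible.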
To finish our work, let's check whether the assumptions of the regression hold.
par(mfrow=c(2,2))
plot(fit3)
It looks like the best model is the one that includes wt, qsec and am, which means that, besides transmission type, weight and acceleration also need to be considered. mpg decreases with weight and increases with qsec and with manual transmission. Holding the other variables fixed, every 1000 lb increase in weight is associated with a decrease of roughly 3.9 mpg, every additional second of quarter-mile time with an increase of about 1.2 mpg, and, on average, manual transmission is 2.9 mpg better than automatic transmission. The model explains about 83% of the variance (adjusted R-squared). The residuals also seem to be randomly scattered in the diagnostic plots.
However, there are some outlying or influential points (128 and 230, presumably the Fiat 128 and the Merc 230). Their leverage should be examined more closely to complete the analysis.
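A first look at leverage and influence needs only base R. The sketch below is one possible starting point and not a full diagnostic study; the 2p/n cutoff is a common rule of thumb rather than a formal test, and h, d, p, n and high_leverage are names introduced here.

```r
# Sketch: leverage and Cook's distance for mpg ~ wt + am + qsec.
data(mtcars)
cars <- transform(mtcars, am = factor(am, labels = c("Automa", "Manual")))
fit3 <- lm(mpg ~ wt + am + qsec, data = cars)

h <- hatvalues(fit3)        # leverage of each car
d <- cooks.distance(fit3)   # influence of each car on the fitted values

p <- length(coef(fit3))     # number of estimated coefficients
n <- nrow(cars)             # number of observations

# Cars whose leverage exceeds twice the average leverage p / n.
high_leverage <- names(h)[h > 2 * p / n]
high_leverage

# The three cars with the largest Cook's distance.
head(sort(d, decreasing = TRUE), 3)
```

Refitting the model without the flagged cars and comparing the coefficients is a simple way to see how much the conclusions depend on them.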
Thanks