library(ggplot2)
library(ggthemes)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
The “mtcars” dataset was extracted from the 1974 Motor Trend US magazine. This report analyzes the relationship between transmission type and miles per gallon (MPG).
The analysis addresses two questions: Is an automatic or manual transmission better for MPG? How large is the MPG difference between automatic and manual transmissions?
The dataset contains 11 columns:
* mpg: Miles per (US) gallon, the distance traveled on one gallon of fuel.
* cyl: Number of cylinders.
* disp: Engine displacement (cu. in.).
* hp: Gross horsepower.
* drat: Rear axle ratio.
* wt: Weight (1000 lbs).
* qsec: 1/4 mile time.
* vs: V/S, the cylinder arrangement (0 = V-shaped, 1 = straight).
* am: Transmission (0 = automatic, 1 = manual).
* gear: Number of forward gears.
* carb: Number of carburetors.
We need to transform some columns from numeric to factor type.
data("mtcars")
# Convert the categorical columns from numeric to factor.
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
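The same conversion can be written in one step. The following is only a sketch of an equivalent approach; the names `cars` and `factor_cols` are introduced here for illustration and are not part of the original analysis.

```r
# Work on a copy so the mtcars object used in the report is left untouched.
cars <- datasets::mtcars

# Hypothetical helper vector: the same five columns converted above.
factor_cols <- c("cyl", "vs", "am", "gear", "carb")

# lapply() returns a list of factors, which is assigned back column by column.
cars[factor_cols] <- lapply(cars[factor_cols], as.factor)

# Confirm the new column classes.
sapply(cars[factor_cols], class)
```

Keeping the column names in a single vector makes it easier to see at a glance which variables are treated as categorical.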
Below is the structure of the dataset, followed by a call that prints its first 6 and last 6 rows.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars);tail(mtcars)
The Welch two-sample t-test below gives a p-value of 0.001374, which is below the 0.05 significance level, so we reject the null hypothesis in favor of the alternative: the true difference in mean MPG between the two groups is not equal to 0.
This means that one transmission type has a greater mean MPG. In this case it is group 1 (manual transmission), with a mean of 24.39 MPG against 17.15 MPG for automatic.
t.test(mpg~am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
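To quantify the difference, the estimate and its confidence interval can be read from the object returned by `t.test`. This is a minimal sketch; the names `cars`, `tt`, `mpg_diff`, and `ci_diff` are illustrative assumptions, and the sign is flipped so that the difference reads as manual minus automatic.

```r
# Work on a copy with am converted to a factor, as in the report.
cars <- datasets::mtcars
cars$am <- as.factor(cars$am)

tt <- t.test(mpg ~ am, data = cars)

# tt$estimate holds the two group means (group 0 first, then group 1).
group_means <- unname(tt$estimate)
mpg_diff <- group_means[2] - group_means[1]  # manual minus automatic

# tt$conf.int is for (group 0 - group 1); negate and reverse it so the
# interval matches the manual-minus-automatic direction.
ci_diff <- -rev(as.numeric(tt$conf.int))

round(c(difference = mpg_diff, lower = ci_diff[1], upper = ci_diff[2]), 2)
```

Read against the output above, this corresponds to manual cars averaging roughly 7.2 MPG more than automatic cars, with a 95% interval of about 3.2 to 11.3 MPG, before adjusting for any other variable.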
The t-test suggests that am is related to mpg. In a simple regression of mpg on am, the am coefficient is significant, but the R-squared shows that the model explains only about 36% of the variance in mpg, so a multivariable regression is needed.
lm(mpg ~ am, data = mtcars)->lm1; summary(lm1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The intercept is kept in the regression because it is the mean MPG of the automatic group, which serves as the baseline for comparing the two transmission types.
The multivariable regression explains about 89% of the variance in mpg (adjusted R-squared about 78%), which is a clear improvement over the simple model.
However, none of the coefficients has a p-value below 0.05, so this model does not allow us to conclude which variables are the most significant.
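The role of the intercept in the simple model can be checked directly: with a two-level factor as the only predictor, the intercept is the mean of the baseline group and the slope is the difference between the group means. The sketch below uses illustrative names (`cars`, `fit_am`, `means_by_am`) that are not part of the original code.

```r
# Work on a copy with am converted to a factor, as in the report.
cars <- datasets::mtcars
cars$am <- as.factor(cars$am)

fit_am <- lm(mpg ~ am, data = cars)

# Mean MPG per transmission type ("0" = automatic, "1" = manual).
means_by_am <- tapply(cars$mpg, cars$am, mean)

# Intercept versus the automatic mean.
intercept_ok <- all.equal(unname(coef(fit_am)["(Intercept)"]),
                          unname(means_by_am["0"]))

# am1 coefficient versus the difference between the two means.
slope_ok <- all.equal(unname(coef(fit_am)["am1"]),
                      unname(means_by_am["1"] - means_by_am["0"]))

c(intercept = intercept_ok, slope = slope_ok)
```

Dropping the intercept would change what the coefficients mean, so the manual versus automatic comparison would no longer be read from a single coefficient.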
lm(mpg ~ ., data = mtcars )->lm2; summary(lm2)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## am1 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
We use the step function to search for the model with the lowest AIC and, with it, the most relevant combination of variables. The selected model is shown below.
step(lm2, trace = 0)->sp1
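To see what the selection kept, the chosen formula and the AIC of both models can be printed. This is a sketch under the same setup as the report, rebuilt on a copy of the data; the names `cars`, `full`, and `selected` are illustrative. Note that `step` ranks models with `extractAIC`, whose values differ from `AIC()` by a constant for models fitted to the same data, so the ordering is the same.

```r
# Rebuild the data with the same factor conversions used in the report.
cars <- datasets::mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], as.factor)

# Full model and the model chosen by stepwise selection.
full <- lm(mpg ~ ., data = cars)
selected <- step(full, trace = 0)

# Which terms survived the selection?
formula(selected)

# Compare the AIC of the two models (lower is better).
c(full = AIC(full), selected = AIC(selected))
```

Setting `trace = 1` instead of `trace = 0` would print each elimination step, which can help when checking why a variable was dropped.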
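The new model has 4 variables: number of cylinders, gross horsepower, weight (1000 lbs), and transmission (0 = automatic, 1 = manual). The R-squared is about 87% (adjusted R-squared about 84%), which is a good sign, and the most significant variables are now easier to identify. With cylinders, horsepower, and weight held constant, manual transmission is associated with about 1.81 more MPG than automatic, but this difference is not significant (p = 0.206).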
summary(sp1)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## am1 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
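Two further checks can support the reading of this model: a nested F-test comparing it with the simple model, and a confidence interval for the am1 coefficient. The sketch below is an illustration rather than part of the original analysis; `cars`, `fit_am`, `fit_sel`, `cmp`, and `ci_am` are assumed names.

```r
# Rebuild the data with the factor columns used by the two models.
cars <- datasets::mtcars
cars$cyl <- as.factor(cars$cyl)
cars$am <- as.factor(cars$am)

fit_am <- lm(mpg ~ am, data = cars)                   # simple model
fit_sel <- lm(mpg ~ cyl + hp + wt + am, data = cars)  # selected model

# Nested F-test: do cyl, hp, and wt improve on the am-only model?
cmp <- anova(fit_am, fit_sel)
cmp

# 95% confidence interval for the adjusted manual vs automatic difference.
ci_am <- confint(fit_sel)["am1", ]
ci_am
```

If the interval for am1 includes 0, the adjusted transmission effect cannot be distinguished from no effect at the 5% level, which matches the p-value reported in the summary above.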
The plots show that:
* There is no trend in the Residuals vs Fitted plot.
* The points in the Q-Q plot are consistent with normally distributed residuals.
* The Scale-Location plot shows roughly constant variance.
* In the Residuals vs Leverage plot, no points fall outside the Cook's distance lines, so there is no sign of influential outliers.
ggplot(data = mtcars, aes(x = am, y = mpg, color = mpg)) +
  theme_tufte() +
  geom_boxplot(fill = "gray") +
  geom_jitter() +
  ylab("Miles Per Gallon") +
  xlab("Transmission Type (0 = automatic, 1 = manual)") +
  ggtitle("MPG vs Transmission type")
ggpairs(mtcars[c(1,2,4,6,9)], aes(colour = cyl, alpha=0.4))
par(mfrow = c(2,2))
plot(sp1 )
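As a numeric complement to the diagnostic plots, the residuals and influence measures of the selected model can be inspected directly. This is a hedged sketch: the model is refitted on a copy of the data under the illustrative name `fit_sel`, and the Shapiro-Wilk test is an added check that the original report does not use.

```r
# Refit the selected model on a copy of the data.
cars <- datasets::mtcars
cars$cyl <- as.factor(cars$cyl)
cars$am <- as.factor(cars$am)
fit_sel <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Normality of the residuals (complements the Q-Q plot).
shapiro.test(residuals(fit_sel))

# Cook's distance per car; the largest values point at the most
# influential observations (complements Residuals vs Leverage).
cooks <- cooks.distance(fit_sel)
head(sort(cooks, decreasing = TRUE), 3)

# Leverage (hat values) of the same observations.
hatvalues(fit_sel)[names(head(sort(cooks, decreasing = TRUE), 3))]
```

The Cook's distance values can be compared with the 0.5 and 1 contours that `plot` draws on the Residuals vs Leverage panel.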