In this report we use a data set of cars to try to answer the following two questions: does MPG differ between automatic and manual transmissions, and if so, how large is the difference?
This report includes five sections:
Exploratory Analysis
Model Selection Analysis
Residual Plot and Diagnostic
Conclusions
Annexes
We will use the mtcars data set. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
The data set is a data frame with 32 observations on 11 variables.
Let’s see some of the values included in the data set:
data(mtcars)
print.data.frame(mtcars[1:3,])
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
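To confirm the shape and column types of the data before any conversion, a quick inspection along the following lines may help. This is only a sketch: `mt_copy` is a hypothetical name for a fresh copy of the data set, used so that the report's own `mtcars` object is left untouched.

```r
# Fresh copy of the built-in data set (mt_copy is a hypothetical name).
mt_copy <- datasets::mtcars

dim(mt_copy)                           # rows (cars) and columns (variables)
names(mt_copy)                         # the variable names
col_classes <- sapply(mt_copy, class)  # class of every column
col_classes
```

All columns are stored as numeric at this point, which is why the categorical ones are converted to factors in the next step.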
We will mainly focus on the relationship between mpg and am. Let’s convert some variables from numeric to factor class and make a plot to check whether a relationship exists:
library(ggplot2)
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
# Box plot of MPG by transmission type
p <- ggplot(mtcars, aes(x = am, y = mpg, fill = am, group = am)) +
  labs(title = "Plot of MPG per Transmission Type",
       x = "Transmission Type (0=automatic, 1=manual)",
       y = "Miles/gallon") +
  geom_boxplot() + theme(legend.position = "none")
## Output in Annexes Section
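As an optional readability aid, the 0/1 transmission codes can be given descriptive labels. The sketch below is an illustration rather than part of the original analysis; `mt_copy` and `am_label` are hypothetical names, and the label strings follow the coding stated in the plot's axis title (0 = automatic, 1 = manual).

```r
# Fresh copy so the report's mtcars object is not modified
# (mt_copy and am_label are hypothetical names).
mt_copy <- datasets::mtcars

# Map the numeric codes to descriptive factor levels.
mt_copy$am_label <- factor(mt_copy$am,
                           levels = c(0, 1),
                           labels = c("automatic", "manual"))

# Number of cars in each transmission group.
group_sizes <- table(mt_copy$am_label)
group_sizes
```

Using `am_label` on the x axis would let the plot show the labels directly instead of relying on the axis title to explain the codes.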
According to the plot, there appears to be a difference in MPG between automatic and manual cars. Let’s run an inference test to validate that.
t_test <- t.test(mpg ~ am, data=mtcars)
## Output in Annexes Section
Because the p-value (0.0014) is below 0.05, we can reject the null hypothesis: transmission type has a statistically significant association with MPG. The mean MPG for manual transmission cars (am=1) is 24.39, about 7.2 higher than for automatic cars (17.15).
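The quantities behind this test can be reproduced by hand, which may make the result easier to follow. The sketch below computes the group means, their difference, and the Welch t statistic from its textbook formula, then compares it with `t.test`. The names `mt_copy`, `group_means`, `mean_diff` and `t_by_hand` are hypothetical and used only for this illustration.

```r
# Fresh copy of the data (all names below are hypothetical).
mt_copy <- datasets::mtcars

# Per-group mean, variance and size of mpg (names are "0" and "1").
group_means <- tapply(mt_copy$mpg, mt_copy$am, mean)
group_vars  <- tapply(mt_copy$mpg, mt_copy$am, var)
group_n     <- tapply(mt_copy$mpg, mt_copy$am, length)

# Difference in mean MPG: manual minus automatic.
mean_diff <- unname(group_means["1"] - group_means["0"])

# Welch t statistic: difference in means over its standard error.
std_err   <- sqrt(group_vars["0"] / group_n["0"] +
                  group_vars["1"] / group_n["1"])
t_by_hand <- unname((group_means["0"] - group_means["1"]) / std_err)

# The same test via t.test (Welch is the default for two samples).
welch <- t.test(mpg ~ am, data = mt_copy)
c(by_hand = t_by_hand, t_test = unname(welch$statistic))
welch$p.value
```

The sign of the statistic is negative only because group 0 (automatic) is subtracted first.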
To choose the best model, we will consider a simple linear regression model and two multiple linear regression models (the full model and a stepwise backward selection).
Let’s check the validity of a simple linear regression model using mpg ~ am:
s_model <- lm(mpg ~ am, data=mtcars)
## Output in Annexes Section
This model has a residual standard error of 4.902 on 30 degrees of freedom and an adjusted R-squared of 0.3385, which means that it explains only about 34% of the variance in MPG. So let’s consider including other variables in order to build a better model.
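The fit statistics quoted for this model can be read directly from the object returned by `summary()`, instead of being copied from the printed output. A minimal sketch, assuming a fresh copy of the data; `mt_copy`, `simple_fit` and `fit_summary` are hypothetical names:

```r
# Fresh copy with am as a factor (names are hypothetical).
mt_copy <- datasets::mtcars
mt_copy$am <- factor(mt_copy$am)

simple_fit  <- lm(mpg ~ am, data = mt_copy)
fit_summary <- summary(simple_fit)

fit_summary$sigma          # residual standard error
fit_summary$df[2]          # residual degrees of freedom
fit_summary$r.squared      # multiple R-squared
fit_summary$adj.r.squared  # adjusted R-squared
```

Reading the values programmatically avoids transcription slips between the multiple and the adjusted R-squared.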
Next, let’s fit the full model and then apply stepwise regression to find the best one.
The full model including all the variables has the following results:
f_model <- lm(mpg ~ . , data=mtcars)
## Output in Annexes Section
This model has a residual standard error of 2.833 on 15 degrees of freedom and an adjusted R-squared of 0.779, which means that it explains about 78% of the variance in MPG. That is much better than the previous model.
Finally, let’s use the stepwise backward regression method:
b_model <- step(f_model, direction="backward", trace=0)
## Output in Annexes Section
According to this, the best model is mpg ~ cyl + hp + wt + am, which has a residual standard error of 2.41 on 26 degrees of freedom and an adjusted R-squared of 0.8401, so it explains about 84% of the variance in MPG. We choose this model because it has the highest adjusted R-squared of the three.
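One way to put the three candidate models side by side is to collect their adjusted R-squared and AIC values in a single table (AIC being the criterion `step` uses by default). The sketch below refits the models on a fresh copy of the data. All object names are hypothetical, and the selected formula is typed in directly rather than re-derived with `step`.

```r
# Fresh copy with the same factor conversions as in the report
# (all names below are hypothetical).
mt_copy <- datasets::mtcars
for (v in c("am", "cyl", "vs", "gear", "carb")) {
  mt_copy[[v]] <- factor(mt_copy[[v]])
}

fits <- list(
  simple   = lm(mpg ~ am, data = mt_copy),
  full     = lm(mpg ~ ., data = mt_copy),
  selected = lm(mpg ~ cyl + hp + wt + am, data = mt_copy)
)

# One row per model: adjusted R-squared and AIC.
comparison <- data.frame(
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  aic    = sapply(fits, AIC)
)
comparison

# F-test of the simple model against the selected one (nested models).
anova(fits$simple, fits$selected)
```

A higher adjusted R-squared and a lower AIC both point toward the same model here, but the two criteria do not have to agree in general.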
par(mfrow = c(2, 2))
##plot(b_model)
## Output in Annexes Section
Based on the residual plots, we can verify the following assumptions:
The Residuals vs Fitted plot shows no consistent pattern, supporting the independence assumption
The Normal Q-Q plot indicates that the residuals are approximately normally distributed, because the points fall close to a line
The Scale-Location plot supports the constant variance assumption, as the points are randomly distributed
The Residuals vs Leverage plot suggests that no influential outliers are present, as all points fall well within the 0.5 bands.
The above analysis meets all the basic assumptions of linear regression and answers the questions.
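The visual checks can be complemented with a few numeric summaries. The sketch below is not part of the original analysis: it applies a Shapiro-Wilk test to the residuals and computes Cook's distances and leverages for the selected model. The choice of these particular checks is an assumption, and all object names are hypothetical.

```r
# Fresh copy with the factors used by the selected model
# (all names below are hypothetical).
mt_copy <- datasets::mtcars
mt_copy$am  <- factor(mt_copy$am)
mt_copy$cyl <- factor(mt_copy$cyl)

selected_fit <- lm(mpg ~ cyl + hp + wt + am, data = mt_copy)

# Normality of residuals: a small p-value would cast doubt on it.
normality <- shapiro.test(residuals(selected_fit))
normality$p.value

# Influence: Cook's distance per car, largest values first.
cooks <- cooks.distance(selected_fit)
head(sort(cooks, decreasing = TRUE), 3)

# Leverage: hat values sum to the number of fitted coefficients.
leverage <- hatvalues(selected_fit)
sum(leverage)
```

The largest Cook's distances can be compared against the 0.5 contour drawn in the Residuals vs Leverage plot.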
Looking at the selected model (please check 5.2.2.b Stepwise Backward Model), we can see how mpg is affected by changes in am, cyl, hp and wt, holding the other variables constant:
Cars with manual transmission get about 1.8 MPG more than cars with automatic transmission, although this coefficient is not significant at the 0.05 level (p = 0.206).
MPG decreases by about 2.5 for every 1000 pound increase in weight.
MPG decreases only marginally with horsepower, by about 3 MPG for every 100 horsepower.
Compared with 4 cylinder engines, MPG decreases by about 3 for 6 cylinder engines and by about 2 for 8 cylinder engines.
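The effect sizes listed above come straight from the model coefficients, and confidence intervals show how precisely each one is estimated. A minimal sketch, assuming the selected formula and a fresh copy of the data; `mt_copy`, `selected_fit`, `coefs` and `ci` are hypothetical names:

```r
# Fresh copy with the factors used by the selected model
# (all names below are hypothetical).
mt_copy <- datasets::mtcars
mt_copy$am  <- factor(mt_copy$am)
mt_copy$cyl <- factor(mt_copy$cyl)

selected_fit <- lm(mpg ~ cyl + hp + wt + am, data = mt_copy)
coefs <- coef(selected_fit)

coefs["am1"]                # manual vs automatic, other terms held fixed
coefs["wt"]                 # change in MPG per 1000 lb of weight
coefs["hp"] * 100           # change in MPG per 100 horsepower
coefs[c("cyl6", "cyl8")]    # 6 and 8 cylinders vs the 4 cylinder baseline

# 95% confidence intervals for all coefficients.
ci <- confint(selected_fit, level = 0.95)
ci
```

An interval that contains zero, as the one for `am1` does, corresponds to a coefficient that is not significant at the 0.05 level.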
Here you can find the output of the different functions and plots:
p
t_test
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
##        17.14737        24.39231
summary(s_model)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
summary(f_model)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5087 -1.3584 -0.0948  0.7745  4.6251
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913   20.06582   1.190   0.2525
## cyl6        -2.64870    3.04089  -0.871   0.3975
## cyl8        -0.33616    7.15954  -0.047   0.9632
## disp         0.03555    0.03190   1.114   0.2827
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997
## vs1          1.93085    2.87126   0.672   0.5115
## am1          1.21212    3.21355   0.377   0.7113
## gear4        1.11435    3.79952   0.293   0.7733
## gear5        2.52840    3.73636   0.677   0.5089
## carb2       -0.97935    2.31797  -0.423   0.6787
## carb3        2.99964    4.29355   0.699   0.4955
## carb4        1.09142    4.44962   0.245   0.8096
## carb6        4.47757    6.38406   0.701   0.4938
## carb8        7.25041    8.36057   0.867   0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
summary(b_model)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9387 -1.2560 -0.4013  1.1253  5.0513
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *
## cyl8        -2.16368    2.28425  -0.947  0.35225
## hp          -0.03211    0.01369  -2.345  0.02693 *
## wt          -2.49683    0.88559  -2.819  0.00908 **
## am1          1.80921    1.39630   1.296  0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
par(mfrow = c(2, 2))
plot(b_model)