Executive summary

The present analysis was done to explore the relationship between a set of variables and miles per gallon (MPG). The main questions are: (1) is an automatic or a manual transmission better for MPG, and (2) how large is the MPG difference between the two transmission types?

There is evidence of a relationship between the type of transmission and fuel consumption. A simple regression model of mpg (miles/gallon) on am (transmission) detects an increase of 7.245 miles per gallon when using a manual instead of an automatic transmission. However, including other variables available in the data reveals a substantial reduction in this difference. For instance, the automatic cars in this data set are heavier than the manual ones, which probably makes them less efficient. At the same time, the manual cars tend to have lower qsec values (quarter-mile times), which suggests more horsepower relative to their weight. The model with just these three variables (am, qsec and wt) performs well and estimates an increase of 2.936 mpg for manual transmission.

Project

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG).

Criteria:

- Did the student interpret the coefficients correctly?
- Did the student do some exploratory data analyses?
- Did the student fit multiple models and detail their strategy for model selection?
- Did the student answer the questions of interest or detail why the question(s) is (are) not answerable?
- Did the student do a residual plot and some diagnostics?
- Did the student quantify the uncertainty in their conclusions and/or perform an inference correctly?
- Was the report brief (about 2 pages long) for the main body of the report and no longer than 5 with supporting appendix of figures?
- Did the report include an executive summary?
- Was the report done in Rmd (knitr)?


Load libraries:

library(ggplot2)
library(dplyr)
library(GGally)
library(car)
library(cowplot)
data(mtcars)

Now, let's see the data:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The main variables of interest are mpg (miles per gallon) and am (transmission type: 0 = automatic, 1 = manual).

Boxplot to compare mpg between auto and manual:

data <- select(mtcars, "mpg", "am")
bp <- ggplot(data, aes(factor(am), mpg)) +
    geom_boxplot(fill = "blue", colour="black", alpha = 0.5,
                     outlier.colour = "#1F3552", outlier.shape = 20) +
    ggtitle("Boxplot of mpg by type of transmission")+
    scale_x_discrete(name = "Transmission")+
    scale_y_continuous(name= "Miles per gallon")+
    theme_bw()
bp

The boxplot shows that automatic cars get fewer miles per gallon than manual cars. Let's look at the mean and standard deviation of each group:

group_by(data, am) %>%
  summarise(
    count = n(),
    mean = mean(mpg, na.rm = TRUE),
    sd = sd(mpg, na.rm = TRUE)
  )
## # A tibble: 2 x 4
##      am count  mean    sd
##   <dbl> <int> <dbl> <dbl>
## 1  0       19  17.1  3.83
## 2  1.00    13  24.4  6.17

Are there significant differences between the variances?

data.vtest <- var.test(mpg ~ am, data = data)
data.vtest
## 
##  F test to compare two variances
## 
## data:  mpg by am
## F = 0.38656, num df = 18, denom df = 12, p-value = 0.06691
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1243721 1.0703429
## sample estimates:
## ratio of variances 
##          0.3865615
data.vtest.p.value <- round(as.numeric(data.vtest["p.value"]),4)

In this case, the p-value (0.0669) is greater than the significance level alpha = 0.05, so there is no significant difference between the variances of the two groups. This justifies using var.equal = TRUE in the t-test below.
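
As a quick check (a minimal sketch), the F statistic above is simply the ratio of the two sample variances:

with(mtcars, var(mpg[am == 0]) / var(mpg[am == 1]))  # ~0.3866, as reported by var.test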

Are there significant differences between the means? (t-test)

# Compute t-test
data.ttest <- t.test(mpg ~ am, data = data, var.equal = TRUE)
data.ttest
## 
##  Two Sample t-test
## 
## data:  mpg by am
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.84837  -3.64151
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231
data.ttest.p.value <- round(as.numeric(data.ttest["p.value"]),4)

In this case, the p-value (about 3 × 10^-4, i.e. 0.000285) is less than the significance level alpha = 0.05, so there is a significant difference between the means of the two groups.
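
For reference, a minimal sketch reproducing the pooled two-sample t statistic by hand:

x0 <- mtcars$mpg[mtcars$am == 0]; x1 <- mtcars$mpg[mtcars$am == 1]
n0 <- length(x0); n1 <- length(x1)
sp2 <- ((n0 - 1) * var(x0) + (n1 - 1) * var(x1)) / (n0 + n1 - 2)  # pooled variance
(mean(x0) - mean(x1)) / sqrt(sp2 * (1 / n0 + 1 / n1))             # ~ -4.106, as in t.test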

Regression Models:

Simple linear model

First, let's see how accurately a simple linear regression between the two groups performs:

fit1 = lm(mpg~am, data=data)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Here we can see (as we found in the inference analysis) that changing from automatic to manual implies an increase of 7.245 mpg. However, this model is not very accurate: R-squared = 0.36.
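
To interpret these coefficients: with a single binary regressor, the intercept equals the mean mpg of the automatic group and the slope equals the difference in group means. A quick check (sketch):

with(mtcars, c(intercept = mean(mpg[am == 0]),
               slope = mean(mpg[am == 1]) - mean(mpg[am == 0])))  # 17.147 and 7.245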

Plot the linear regression (with 2 variables):

lmplot <- ggplot(data, aes(mpg, am)) + 
      geom_point() +
      # family applies only to glm; dropping it avoids the "extra argument" warning
      geom_smooth(method = "lm", fullrange = TRUE, se = FALSE) +
    theme_bw()
lmplot

GLM

Let's see now what happens if we model these two variables with a logistic regression:

fit2 = glm(am ~  mpg, family=binomial, data=mtcars, trace=0)
summary(fit2)
## 
## Call:
## glm(formula = am ~ mpg, family = binomial, data = mtcars, trace = 0)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5701  -0.7531  -0.4245   0.5866   2.0617  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -6.6035     2.3514  -2.808  0.00498 **
## mpg           0.3070     0.1148   2.673  0.00751 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 29.675  on 30  degrees of freedom
## AIC: 33.675
## 
## Number of Fisher Scoring iterations: 5

Plot of the logistic model (with 2 variables):

glmplot <- ggplot(data, aes(mpg, am)) + 
      geom_point() +
      geom_smooth(method="glm", method.args=list(family="binomial"), 
                  fullrange=TRUE, se=FALSE)+
    theme_bw()
glmplot

Note: be careful with the interpretation of am: it is a factor, not a continuous variable.
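
The logistic coefficients are on the log-odds scale; exponentiating gives odds ratios (a quick sketch):

exp(coef(fit2))  # each additional mpg multiplies the odds of manual transmission by ~1.36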

Multivariable linear model

It is interesting to study how adding other variables changes the estimated effect of automatic versus manual transmission.

Prepare the data (change classes to factor):

mtcars2 <- mtcars
mtcars2$cyl <- factor(mtcars2$cyl)
mtcars2$vs <- factor(mtcars2$vs)
mtcars2$gear <- factor(mtcars2$gear)
mtcars2$carb <- factor(mtcars2$carb)
mtcars2$am <- factor(mtcars2$am,labels=c('Automatic','Manual'))
str(mtcars2)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Let's look at the possible correlations between variables with a pairs plot:

mlmplot <- ggpairs(mtcars,lower=list(continuous="smooth"))
mlmplot

As the plot above shows, some variables are highly correlated. Using all of these variables inflates the standard errors and produces a less-than-ideal model.
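
For a numeric view of the same correlations (a small sketch over the continuous variables):

round(cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]), 2)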

Including variables that we shouldn't increases the actual standard errors of the regression coefficients. Let's see the coefficients when all the variables are included at the same time:

fit3 = lm(mpg~., data=mtcars2)
summary(fit3)$coef
##                Estimate  Std. Error     t value   Pr(>|t|)
## (Intercept) 23.87913244 20.06582026  1.19004018 0.25252548
## cyl6        -2.64869528  3.04089041 -0.87102622 0.39746642
## cyl8        -0.33616298  7.15953951 -0.04695316 0.96317000
## disp         0.03554632  0.03189920  1.11433290 0.28267339
## hp          -0.07050683  0.03942556 -1.78835344 0.09393155
## drat         1.18283018  2.48348458  0.47627845 0.64073922
## wt          -4.52977584  2.53874584 -1.78425732 0.09461859
## qsec         0.36784482  0.93539569  0.39325050 0.69966720
## vs1          1.93085054  2.87125777  0.67247551 0.51150791
## amManual     1.21211570  3.21354514  0.37718957 0.71131573
## gear4        1.11435494  3.79951726  0.29328856 0.77332027
## gear5        2.52839599  3.73635801  0.67670068 0.50889747
## carb2       -0.97935432  2.31797446 -0.42250436 0.67865093
## carb3        2.99963875  4.29354611  0.69863900 0.49546781
## carb4        1.09142288  4.44961992  0.24528452 0.80956031
## carb6        4.47756921  6.38406242  0.70136677 0.49381268
## carb8        7.25041126  8.36056638  0.86721532 0.39948495

The variance inflation factor (VIF) is the increase in the variance for the ith regressor compared to the ideal setting where it is orthogonal to the other regressors. The square root of the VIF is the corresponding increase in the standard deviation rather than the variance.1
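
As a rough illustration of this definition (a minimal sketch, using only the continuous regressors so plain VIFs apply rather than the generalized VIFs reported below):

# VIF_j = 1 / (1 - R^2_j), where R^2_j is from regressing x_j on the other regressors
r2.wt <- summary(lm(wt ~ disp + hp + drat + qsec, data = mtcars))$r.squared
1 / (1 - r2.wt)  # VIF of wt among the continuous regressors only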

There is a built-in way to detect these correlations, using the vif() function from the car package:

sqrt(vif(fit3))
##           GVIF       Df GVIF^(1/(2*Df))
## cyl  11.319053 1.414214        1.834225
## disp  7.769536 1.000000        2.787389
## hp    5.312210 1.000000        2.304823
## drat  2.609533 1.000000        1.615405
## wt    4.881683 1.000000        2.209453
## qsec  3.284842 1.000000        1.812413
## vs    2.843970 1.000000        1.686407
## am    3.151269 1.000000        1.775181
## gear  7.131081 1.414214        1.634138
## carb 22.432384 2.236068        1.364858

Now we will pick the variables in order to perform a nested comparison. The procedure is: take the variables from low to high values of sqrt(VIF), picking just one when the values are very close. So, for instance, since we have to include “am” in the model, we will not consider the values closest to it (drat, vs, gear). The next variable will be “qsec”, then wt, hp and finally disp.

Model selection (nested likelihood ratio test)

lm1 <- lm(mpg ~ am, data = mtcars2)
lm2 <- update(lm1, mpg ~ am + qsec)
lm3 <- update(lm1, mpg ~ am + qsec + wt)
lm4 <- update(lm1, mpg ~ am + qsec + wt + hp)
lm5 <- update(lm1, mpg ~ am + qsec + wt + hp + disp)
anova(lm1, lm2, lm3, lm4, lm5)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + qsec
## Model 3: mpg ~ am + qsec + wt
## Model 4: mpg ~ am + qsec + wt + hp
## Model 5: mpg ~ am + qsec + wt + hp + disp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 352.63  1    368.26 62.4021 2.240e-08 ***
## 3     28 169.29  1    183.35 31.0682 7.442e-06 ***
## 4     27 160.07  1      9.22  1.5622    0.2225    
## 5     26 153.44  1      6.63  1.1232    0.2990    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you look at the p-values, you will find that adding any variable beyond “am”, “qsec” and “wt” does not significantly reduce the residual variance (the F-test p-values become non-significant), so lm3 is the preferred model.
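
For reference, anova() computes each F statistic against the residual mean square of the largest model in the sequence; a minimal sketch reproducing the F for adding wt:

rss <- sapply(list(lm1, lm2, lm3, lm4, lm5), deviance)  # residual sums of squares
((rss[2] - rss[3]) / 1) / (rss[5] / df.residual(lm5))   # ~31.07, the F for adding wt

The coefficients of the selected model: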

summary(lm3)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## amManual     2.935837  1.4109045  2.080819 4.671551e-02
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06

We can confirm here that all the variables are statistically significant and the R-squared is 0.85. This model is much better than the simple linear model fitted above. With this model we find that cars with manual transmission get 2.936 more mpg than automatic ones, holding qsec and wt constant.
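
To quantify the uncertainty of this estimate (a quick sketch), a 95% confidence interval for the manual-transmission coefficient:

confint(lm3)["amManual", ]  # roughly 0.05 to 5.8 mpg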

Let's see how these four variables are related:

lm3data <- select(mtcars2, "mpg", "am", "qsec", "wt")
lm3.1 <- ggplot(lm3data, aes(am, mpg))+
    geom_boxplot() + 
    ylab ("mpg") +
    ggtitle("mpg - Manueal vs Auto") +
    theme_bw()

lm3.2 <- ggplot(lm3data, aes(am, qsec))+
    geom_boxplot() + 
    ylab ("Time(s) to do 1/4 mile") +
    ggtitle("qsec - Manual vs Auto") +
    theme_bw()

lm3.3 <- ggplot(lm3data, aes(am, wt))+
    geom_boxplot() + 
    ylab ("Weight (1000 lbs)") +
    ggtitle("wt - Manual vs Auto") +
    theme_bw()

plot_grid(lm3.1, lm3.2, lm3.3,
          ncol = 3, nrow = 1)

Here we can confirm what the coefficients told us: manual cars consume less fuel (higher mpg), but this happens mainly because they are lighter (among other factors). We can read the “time to do 1/4 mile” (qsec) as a proxy for horsepower: manual transmission shows a slight decrease in this time (which means more horsepower), yet in our model qsec has a positive coefficient (slower quarter-mile times are associated with higher mpg, holding the other variables fixed).
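
As a quick check of the horsepower reading (sketch), qsec is strongly negatively correlated with hp, and the simple regression below confirms the positive marginal association between qsec and mpg:

cor(mtcars$qsec, mtcars$hp)  # about -0.71: faster quarter miles go with more horsepower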

# Marginal regression of mpg on qsec alone
lmqsec = lm(mpg~qsec, data=mtcars)
summary(lmqsec)$coef
##              Estimate Std. Error    t value   Pr(>|t|)
## (Intercept) -5.114038 10.0295433 -0.5098974 0.61385436
## qsec         1.412125  0.5592101  2.5252133 0.01708199

Diagnostic plots: analysis of residuals

plot1 <- ggplot(lm3, aes(.fitted, .resid))+
    geom_point() + 
    stat_smooth(method="loess") +
    geom_hline(yintercept=0, col="red", linetype="dashed") +
    xlab("Fitted values") + ylab ("Residuals") +
    ggtitle("Residual vs Fitted Plot") +
    theme_bw()

plot2 <- ggplot(lm3, aes(.fitted, sqrt(abs(.stdresid)))) +
    geom_point(na.rm=T) + 
    stat_smooth(method="loess", na.rm=T) +
    xlab("Fitted values") + ylab(expression(sqrt("|Standarized residuals"))) + 
    ggtitle("Scale-Location") + 
    theme_bw()

plot3 <- ggplot(lm3, aes(qqnorm(.stdresid)[[1]], .stdresid)) +
    geom_point(na.rm=T) + 
    stat_smooth(method="loess", na.rm=T) +
    xlab("Theoretical Quantiles") + ylab("Standardized Residuals") + 
    ggtitle("Normal Q-Q") + 
    theme_bw()

plot_grid(plot1, plot2, plot3,
          ncol = 3, nrow = 1)

Looking at the residuals (plots 1 and 2), we find no pattern or other clue that would call the validity of the model into question. It is notable in the Normal Q-Q plot that the residuals are distributed fairly evenly along the line, although they seem to follow a slight curve in the tails.
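
To complement the visual check (a sketch), a Shapiro-Wilk test of the residuals; a non-significant p-value is consistent with approximately normal residuals:

shapiro.test(resid(lm3))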


  1. Quote from “Regression Models for Data Science in R”, Brian Caffo.