Regression Models

Synopsis

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

“Is an automatic or manual transmission better for MPG” “Quantify the MPG difference between automatic and manual transmissions”

Goodness of fit Test for mpg

There are, 25 values in mpg. * H0 = The mpg is consistent with a specified reference distribution. * H1 = The mpg is NOT consistent with a specified reference distribution

library('zoo')
p1 <-hist(mtcars$mpg, include.lowest=FALSE, right=FALSE)

x=seq(min(mtcars$mpg), max(mtcars$mpg),length = length(p1$counts))
breaks_cdf <- dnorm(x, mean(mtcars$mpg), sd(mtcars$mpg))
a <- chisq.test(x = p1$counts, p=breaks_cdf, rescale.p=TRUE, simulate.p.value=TRUE)
a$p.value

## [1] 0.005497251374

Data Explorations

Correlations with Significance

library("Hmisc")
library("dplyr")

library(kableExtra)
library(formattable)
flat_cor_mat <- function(cor_r, cor_p){
  #This function provides a simple formatting of a correlation matrix
  #into a table with 4 columns containing :
    # Column 1 : row names (variable 1 for the correlation test)
    # Column 2 : column names (variable 2 for the correlation test)
    # Column 3 : the correlation coefficients
    # Column 4 : the p-values of the correlations
  library(tidyr)
  library(tibble)
  cor_r <- rownames_to_column(as.data.frame(round(cor_r,2)), var = "row")
  cor_r <- gather(cor_r, column, cor, -1)
  cor_p <- rownames_to_column(as.data.frame( round(cor_p,2)), var = "row")
  cor_p <- gather(cor_p, column, p, -1)
  cor_p_matrix <- left_join(cor_r, cor_p, by = c("row", "column"))
  cor_p_matrix
}

cor_3 <- rcorr(as.matrix(mtcars[, 1:7]))

my_cor_matrix <- flat_cor_mat(cor_3$r, cor_3$P)
kable(my_cor_matrix,"latex",longtable = T, caption = "Correlation Table") %>%
   kable_styling(fixed_thead = T, latex_options = c("striped", "repeat_header"), full_width = F, position = "center")

Opacity of each label is dependent on the absolute value of the coefficient of correlation.

library(GGally)
ggcorr(mtcars, label = TRUE, label_alpha = TRUE)

Figure 1. Correlations

Variance Inflation Factors

Variances of the estimated coefficients increase as more regressors with higher correlation are included. To get a more accurate result, it is important to leave these regressors out

lm_all <- lm(mpg ~ ., data = mtcars)
summary(lm_all)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.4506441 -1.6044018 -0.1196051  1.2192678  4.6270942 
## 
## Coefficients:
##                Estimate  Std. Error  t value Pr(>|t|)  
## (Intercept) 12.30337416 18.71788443  0.65731 0.518124  
## cyl         -0.11144048  1.04502336 -0.10664 0.916087  
## disp         0.01333524  0.01785750  0.74676 0.463489  
## hp          -0.02148212  0.02176858 -0.98684 0.334955  
## drat         0.78711097  1.63537307  0.48130 0.635278  
## wt          -3.71530393  1.89441430 -1.96119 0.063252 .
## qsec         0.82104075  0.73084480  1.12341 0.273941  
## vs           0.31776281  2.10450861  0.15099 0.881423  
## am           2.52022689  2.05665055  1.22540 0.233990  
## gear         0.65541302  1.49325996  0.43891 0.665206  
## carb        -0.19941925  0.82875250 -0.24063 0.812179  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.650197 on 21 degrees of freedom
## Multiple R-squared:  0.8690158,  Adjusted R-squared:  0.8066423 
## F-statistic: 13.93246 on 10 and 21 DF,  p-value: 0.0000003793152

library(car)
makelms <- function(){
  # Simulate a dependent variable, y, as x1
  # plus a normally distributed error of mean 0 and 
  # standard deviation .3.
  col_len <- dim(mtcars)[2]
  viff <- c()
  for(i in 2: col_len){
    temp <- capture.output(cat(na.omit(colnames(mtcars)[-which(colnames(mtcars)=="am")][i:col_len]),sep="+"))
    if(length(temp)){
      formula =  paste0("mpg ~ factor(am)+",temp)
    mdl <- lm(formula = formula, data = mtcars)
   print(vif(mdl))
   viff = c(viff, vif(mdl))
    }
  }
  viff
}
#compute bias - overfitting and underfitting

vifs <- makelms()

##   factor(am)          cyl         disp           hp         drat 
##  4.648487456 15.373833403 21.620241029  9.832036844  3.374620008 
##           wt         qsec           vs         gear         carb 
## 15.164886964  7.527958225  4.965873466  5.357452106  7.908746751 
##   factor(am)         disp           hp         drat           wt 
##  4.332285930 20.088643356  9.499794893  3.118061510 14.971794853 
##         qsec           vs         gear         carb 
##  6.960353472  4.454934513  4.691535866  7.497053712 
##  factor(am)          hp        drat          wt        qsec          vs 
## 4.285814561 6.015787688 3.111500773 6.051126642 5.918682154 4.270956073 
##        gear        carb 
## 4.690187139 4.290467707 
##  factor(am)        drat          wt        qsec          vs        gear 
## 4.258479132 3.043073328 5.104823316 4.139107281 4.191817892 4.688164464 
##        carb 
## 3.826242795 
##  factor(am)          wt        qsec          vs        gear        carb 
## 3.974216136 4.779644615 4.068140457 4.145482549 4.519156734 3.768859449 
##  factor(am)        qsec          vs        gear        carb 
## 3.402052744 3.740871649 3.641623856 3.999256760 2.453053992 
##  factor(am)          vs        gear        carb 
## 3.109049828 1.990702430 3.985747305 2.228001604 
##  factor(am)        gear        carb 
## 2.926043153 3.153213924 1.168889369 
##  factor(am)        carb 
## 1.003321195 1.003321195

Correlation Between Regressors

The gear and drat should be left out.

#find correlation between variables
cor.test(mtcars$am, mtcars$gear, method=c("pearson", "kendall", "spearman"))

Pearson's product-moment correlation

data: mtcars\(am and mtcars\)gear t = 7.1552247, df = 30, p-value = 0.00000005834043 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.6158963198 0.8949545741 sample estimates: cor 0.7940587603

cor.test(mtcars$am, mtcars$drat, method=c("pearson", "kendall", "spearman"))

Pearson's product-moment correlation

data: mtcars\(am and mtcars\)drat t = 5.5650966, df = 30, p-value = 0.00000472679 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.4843990837 0.8501319450 sample estimates: cor 0.7127111272

cor.test(mtcars$drat, mtcars$carb, method=c("pearson", "kendall", "spearman"))

Pearson's product-moment correlation

data: mtcars\(drat and mtcars\)carb t = -0.49933844, df = 30, p-value = 0.6211834 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.4259975986 0.2663357912 sample estimates: cor -0.09078979887 ### Shapiro-Wilk test can be performed as follow: Null hypothesis: the data are normally distributed Alternative hypothesis: the data are not normally distributed

#plot(mtcars$mpg,mtcars$am,pch=19,col="darkgrey",xlab="MPG",ylab="Automatic Transmission")
shapiro.test(mtcars$mpg)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756473, p-value = 0.1228814

shapiro.test(mtcars$am)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$am
## W = 0.62507437, p-value = 0.00000007836354

Conclusion: Normality cannot be assumed for am column

Model Comparisons for Linear Regression

anova(fit_no_wt, fit_simple, fit_interact)

## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ wt + factor(am)
## Model 3: mpg ~ wt * factor(am)
##   Res.Df       RSS Df Sum of Sq        F          Pr(>F)    
## 1     30 720.89660                                          
## 2     29 278.31970  1 442.57690 65.91302 0.0000000077171 ***
## 3     28 188.00767  1  90.31203 13.45018       0.0010171 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear vs. logistic regression

Linear

\[ RW_i = b_0 + b_1 RS_i + e_i \]

\[ E[RW_i | RS_i, b_0, b_1] = b_0 + b_1 RS_i\]

Logistic

\[ \rm{Pr}(RW_i | RS_i, b_0, b_1) = \frac{\exp(b_0 + b_1 RS_i)}{1 + \exp(b_0 + b_1 RS_i)}\]

\[ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i \]

Applying Logistic Regression

##            Test_MPG pred_test
## Mazda RX4      21.0    23.075
## Valiant        18.1    17.300
## Duster 360     14.3    17.300
## Merc 240D      24.4    17.300
## Merc 230       22.8    17.300
## Merc 280C      17.8    17.300

Plots

### Odds ratios and confidence intervals

exp(mod_glm$coeff)

##  (Intercept)           am 
## 17.300000000  1.333815029

exp(confint(mod_glm))

##                    2.5 %       97.5 %
## (Intercept) 13.830637458 20.769182899
## am           1.043921992  1.735534932

mod_glm2 <-  glm(mpg ~ factor(am)*wt, family=gaussian(link="log"), data= train_cars) 
summary(mod_glm2)

## 
## Call:
## glm(formula = mpg ~ factor(am) * wt, family = gaussian(link = "log"), 
##     data = train_cars)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.5332767  -1.4022925  -0.1536827   0.9869617   4.0139082  
## 
## Coefficients:
##                   Estimate  Std. Error  t value       Pr(>|t|)    
## (Intercept)     3.44003032  0.29068633 11.83417 0.000000056461 ***
## factor(am)1     0.72227921  0.31492523  2.29349       0.040674 *  
## wt             -0.15562829  0.07749517 -2.00823       0.067670 .  
## factor(am)1:wt -0.26121984  0.09368849 -2.78817       0.016400 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4.069524889)
## 
##     Null deviance: 484.29750  on 15  degrees of freedom
## Residual deviance:  48.83429  on 12  degrees of freedom
## AIC: 73.259537
## 
## Number of Fisher Scoring iterations: 5

pred_test2 <- predict(mod_glm,newdata=test_cars, type = "response")

Final Model Selection

ANOVA for logistic regression

par(mar=c(10,5,2,1), mfcol=c(1,2), cex=1, ps=12)
plot(mod_glm2, which = 1)
plot(mod_glm2, which = 2)

Figure 3. Residual Plots with Interactions

anova(mod_glm,mod_glm2,test="LRT")

## Analysis of Deviance Table
## 
## Model 1: mpg ~ (am)
## Model 2: mpg ~ factor(am) * wt
##   Resid. Df Resid. Dev Df  Deviance               Pr(>Chi)    
## 1        14  350.89500                                        
## 2        12   48.83429  2 302.06071 < 0.000000000000000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual deviance and AIC of, mpg ~ factor(am) * wt are, 48.83 and 73.26 which are much lower than those of mpg ~ (am) which are, 350.89 and 100.81

Model Selection on AIC

## Start:  AIC=163.71
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df  Deviance       AIC
## - cyl   1 147.57430 161.72713
## - vs    1 147.65456 161.74453
## - carb  1 147.90110 161.79792
## - gear  1 148.84749 162.00203
## - drat  1 149.12146 162.06087
## - disp  1 151.41110 162.54847
## - hp    1 154.33434 163.16040
## - qsec  1 156.35855 163.57737
## <none>    147.49443 163.70981
## - am    1 158.04108 163.91988
## - wt    1 174.50881 167.09173
## 
## Step:  AIC=161.73
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df  Deviance       AIC
## - vs    1 147.84282 159.78531
## - carb  1 148.09444 159.83972
## - gear  1 149.39545 160.11962
## - drat  1 149.55692 160.15418
## - disp  1 151.47524 160.56203
## - hp    1 154.93747 161.28521
## <none>    147.57430 161.72713
## - qsec  1 157.66756 161.84416
## - am    1 159.41025 162.19591
## + cyl   1 147.49443 163.70981
## - wt    1 174.60235 165.10887
## 
## Step:  AIC=159.79
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df  Deviance       AIC
## - carb  1 148.52829 157.93333
## - gear  1 149.98651 158.24597
## - drat  1 150.05672 158.26094
## - disp  1 151.48947 158.56503
## - hp    1 154.94878 159.28755
## <none>    147.84282 159.78531
## - am    1 159.41221 160.19630
## - qsec  1 163.52587 161.01160
## + vs    1 147.57430 161.72713
## + cyl   1 147.65456 161.74453
## - wt    1 175.22277 163.22238
## 
## Step:  AIC=157.93
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df  Deviance       AIC
## - gear  1 150.09325 156.26873
## - drat  1 150.46041 156.34692
## <none>    148.52829 157.93333
## - disp  1 158.63855 158.04062
## - am    1 160.85150 158.48393
## - hp    1 163.35392 158.97793
## + carb  1 147.84282 159.78531
## + vs    1 148.09444 159.83972
## + cyl   1 148.11386 159.84392
## - qsec  1 174.93634 161.17003
## - wt    1 217.65521 168.16171
## 
## Step:  AIC=156.27
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df  Deviance       AIC
## - drat  1 153.43781 154.97397
## - disp  1 158.63861 156.04064
## <none>    150.09325 156.26873
## - hp    1 163.37791 156.98263
## + gear  1 148.52829 157.93333
## + cyl   1 149.08986 158.05409
## + vs    1 149.44777 158.13082
## + carb  1 149.98651 158.24597
## - am    1 170.12913 158.27837
## - qsec  1 175.66766 159.30352
## - wt    1 217.66522 166.16318
## 
## Step:  AIC=154.97
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df  Deviance       AIC
## - disp  1 160.06646 154.32737
## <none>    153.43781 154.97397
## - hp    1 166.00986 155.49403
## + drat  1 150.09325 156.26873
## + gear  1 150.46041 156.34692
## + cyl   1 150.99111 156.45959
## + vs    1 152.31700 156.73936
## + carb  1 153.42638 156.97158
## - qsec  1 179.90760 158.06671
## - am    1 185.63532 159.06961
## - wt    1 222.48085 164.86343
## 
## Step:  AIC=154.33
## mpg ~ hp + wt + qsec + am
## 
##        Df  Deviance       AIC
## - hp    1 169.28593 154.11937
## <none>    160.06646 154.32737
## + disp  1 153.43781 154.97397
## + carb  1 156.83927 155.67561
## + drat  1 158.63861 156.04064
## - qsec  1 180.29107 156.13484
## + cyl   1 159.81748 156.27756
## + vs    1 159.81791 156.27764
## + gear  1 159.89534 156.29314
## - am    1 186.05930 157.14261
## - wt    1 238.56023 165.09642
## 
## Step:  AIC=154.12
## mpg ~ wt + qsec + am
## 
##        Df  Deviance       AIC
## <none>    169.28593 154.11937
## + hp    1 160.06646 154.32737
## + carb  1 161.24999 154.56311
## + disp  1 166.00986 155.49403
## + cyl   1 167.78487 155.83436
## + drat  1 167.88631 155.85370
## + gear  1 169.16321 156.09617
## + vs    1 169.28546 156.11928
## - am    1 195.46363 156.72050
## - qsec  1 278.31970 168.02917
## - wt    1 352.63319 175.60223

## 
## Call:
## glm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -3.4810670  -1.5555254  -0.7257309   1.4110122   4.6609983  
## 
## Coefficients:
##               Estimate Std. Error  t value     Pr(>|t|)    
## (Intercept)  9.6177805  6.9595930  1.38195   0.17791517    
## wt          -3.9165037  0.7112016 -5.50688 0.0000069527 ***
## qsec         1.2258860  0.2886696  4.24668   0.00021617 ***
## am           2.9358372  1.4109045  2.08082   0.04671551 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 6.045926055)
## 
##     Null deviance: 1126.04719  on 31  degrees of freedom
## Residual deviance:  169.28593  on 28  degrees of freedom
## AIC: 154.11937
## 
## Number of Fisher Scoring iterations: 2

The residual plots for the model selected by the AIC are:

par(mar=c(10,5,2,1), mfcol=c(1,2), cex=1, ps=12)
plot(result.step.1, which = 1)
plot(result.step.1, which = 2)

Figure 1. Plots

Executive Summary

The mpg ~ wt + qsec + am with the residual plots at: @ref(fig:residual_plots_final_model) , is the most apt. model for predicting the mpg with the am as one of the predictors. The Automatic transmission is better for the MPG. I have used the family=gaussian(link=“log”) this is the interpretation of the coefficient of the transmission on the MPG. Keeping the rest of the coefficients constant, a change, from manual to automatic in the transmission, would result in a change of mpg by \(exp(coef(am))\) which is, 18.84