Question 1

The built-in data sets of R include one called “mtcars,” which stands for Motor Trend cars. Motor Trend was the name of an automotive magazine and this data set contains information on cars from the 1970s. Use “?mtcars” to display help about the data set. The data set includes a dichotomous variable called vs, which is coded as 0 for an engine with cylinders in a v-shape and 1 for so-called “straight” engines. Use logistic regression to predict vs, using two metric variables in the data set, gear (number of forward gears) and hp (horsepower). Interpret the resulting null hypothesis significance tests.

dfCars <- data.frame(mtcars)  # copy the built-in data set into a data frame
summary(dfCars)               # descriptive statistics for every variable
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
# Logistic regression: predict engine shape (vs) from gear and hp
glmOut <- glm(formula = vs ~ gear + hp, family = binomial(link = "logit"), data = dfCars)
plot(glmOut)  # diagnostic plots (not shown)

summary(glmOut)
## 
## Call:
## glm(formula = vs ~ gear + hp, family = binomial(link = "logit"), 
##     data = dfCars)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.76095  -0.20263  -0.00889   0.38030   1.37305  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 13.43752    7.18161   1.871   0.0613 .
## gear        -0.96825    1.12809  -0.858   0.3907  
## hp          -0.08005    0.03261  -2.455   0.0141 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.860  on 31  degrees of freedom
## Residual deviance: 16.013  on 29  degrees of freedom
## AIC: 22.013
## 
## Number of Fisher Scoring iterations: 7
exp(coef(glmOut))  # exponentiate the coefficients to obtain odds ratios
##  (Intercept)         gear           hp 
## 6.852403e+05 3.797461e-01 9.230734e-01
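
As a supplementary check, not part of the original output, confidence intervals on the odds-ratio scale can be obtained by exponentiating the profile-likelihood confidence limits; an interval for hp that excludes 1 would corroborate the significance test below (a minimal sketch; output not shown):

# Profile-likelihood CIs; exp() puts them on the odds-ratio scale
exp(confint(glmOut))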

Analysis:

Predicting engine shape (vs) from horsepower and number of gears produced mixed results under null hypothesis significance testing. Using the conventional alpha of .05, the p-values for the intercept (.0613) and for gear (.3907) exceed the threshold, so the null hypothesis that those coefficients are zero cannot be rejected; gear does not appear to contribute meaningfully to the model. Horsepower, on the other hand, is a significant predictor, with a p-value of .0141, well below .05. The exponentiated coefficients express these effects as odds ratios: holding the other predictor constant, each additional forward gear multiplies the odds of a straight engine by about 0.38, and each additional unit of horsepower multiplies those odds by about 0.92, roughly an 8 percent reduction per horsepower.
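
To make the odds-ratio interpretation concrete, the fitted model can also generate predicted probabilities of a straight engine (a sketch; the example cars and the name newCars are illustrative, and output is not shown):

# Predicted probability of vs = 1 for two hypothetical cars with four gears
newCars <- data.frame(gear = c(4, 4), hp = c(100, 150))
predict(glmOut, newdata = newCars, type = "response")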

Question 5

As noted in the chapter, the BaylorEdPsych add-in package contains a procedure for generating pseudo-R-squared values from the output of the glm() procedure. Use the results of Exercise 1 to generate, report, and interpret a Nagelkerke pseudo-R-squared value.

library(BaylorEdPsych)
PseudoR2(glmOut)
##         McFadden     Adj.McFadden        Cox.Snell       Nagelkerke 
##        0.6349042        0.4525061        0.5811397        0.7789526 
## McKelvey.Zavoina           Effron            Count        Adj.Count 
##        0.8972195        0.6445327        0.8125000        0.5714286 
##              AIC    Corrected.AIC 
##       22.0131402       22.8702830

A pseudo-R-squared approximates, for models with a categorical outcome, the variance-accounted-for interpretation of R-squared in linear models. There is no standard for which pseudo-R-squared measure is most accurate, and the Nagelkerke value is typically higher than the others because it rescales the Cox and Snell measure so that its maximum is 1. The Nagelkerke value for this model was .78, suggesting a strong model, and a stronger overall fit than the individual significance tests in Exercise 1 might lead one to expect. This could be due to the small sample size of mtcars (n = 32).
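
The BaylorEdPsych package has since been archived on CRAN, so as a fallback the Nagelkerke value can be computed directly from the glm output using the standard formula (a sketch; the function name nagelkerke is my own, and it reproduces the .78 reported above):

# Nagelkerke pseudo-R-squared from model deviances:
# Cox and Snell R2 = 1 - exp((residual deviance - null deviance) / n),
# rescaled by its maximum attainable value, 1 - exp(-null deviance / n)
nagelkerke <- function(model) {
  n <- nobs(model)
  coxSnell <- 1 - exp((model$deviance - model$null.deviance) / n)
  coxSnell / (1 - exp(-model$null.deviance / n))
}
nagelkerke(glmOut)  # approximately 0.779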

Question 6

Continue the analysis of the Chile data set described in this chapter. The data set is in the “car” package, so you will have to install.packages() and library() that package first, and then use the data(Chile) command to get access to the data set. Pay close attention to the transformations needed to isolate cases with the Yes and No votes as shown in this chapter. Add a new predictor, statusquo, into the model and remove the income variable. Your new model specification should be vote ~ age + statusquo. The statusquo variable is a rating that each respondent gave indicating whether they preferred change or maintaining the status quo. Conduct a generalized linear model analysis and a Bayesian analysis of this model and report and interpret all relevant results. Compare the AIC from this model to the AIC from the model that was developed in the chapter (using income and age as predictors).

library(car)
## Loading required package: carData
data(Chile)
dfChile <- data.frame(Chile)
ChileN <- dfChile[dfChile$vote == 'N',]  # isolate the No voters
ChileY <- dfChile[dfChile$vote == 'Y',]  # isolate the Yes voters
ChileYN <- rbind(ChileY, ChileN)         # combine Yes and No cases
ChileYN <- ChileYN[complete.cases(ChileYN),]  # drop rows with missing values
ChileYN$vote <- factor(ChileYN$vote, levels = c('N','Y'))  # drop unused factor levels
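
A quick check, not in the original, confirms that only the Yes and No cases remain after these transformations (output not shown):

table(ChileYN$vote)  # should list counts for N and Y only
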
# Logistic regression with statusquo replacing income as a predictor
chiLM <- glm(vote ~ age + statusquo, family = binomial(), data = ChileYN)
summary(chiLM)
## 
## Call:
## glm(formula = vote ~ age + statusquo, family = binomial(), data = ChileYN)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2095  -0.2830  -0.1840   0.1889   2.8789  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.193759   0.270708  -0.716   0.4741    
## age          0.011322   0.006826   1.659   0.0972 .  
## statusquo    3.174487   0.143921  22.057   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2360.29  on 1702  degrees of freedom
## Residual deviance:  734.52  on 1700  degrees of freedom
## AIC: 740.52
## 
## Number of Fisher Scoring iterations: 6
library(MCMCpack)
## Loading required package: coda
## Loading required package: MASS
## ##
## ## Markov Chain Monte Carlo Package (MCMCpack)
## ## Copyright (C) 2003-2019 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
## ##
## ## Support provided by the U.S. National Science Foundation
## ## (Grants SES-0350646 and SES-0350613)
## ##
ChileYN$vote <- as.numeric(ChileYN$vote) - 1  # recode the factor to 0/1 for MCMClogit
# MCMClogit() always uses the logistic link, so no family argument is needed
chiBayes <- MCMClogit(formula = vote ~ age + statusquo, data = ChileYN)
summary(chiBayes)
## 
## Iterations = 1001:11000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                 Mean       SD  Naive SE Time-series SE
## (Intercept) -0.18272 0.272640 2.726e-03       0.008938
## age          0.01123 0.006817 6.817e-05       0.000223
## statusquo    3.19061 0.145853 1.459e-03       0.004993
## 
## 2. Quantiles for each variable:
## 
##                  2.5%       25%      50%        75%   97.5%
## (Intercept) -0.742761 -0.365241 -0.17552 -0.0003872 0.34439
## age         -0.002005  0.006733  0.01121  0.0157683 0.02499
## statusquo    2.914442  3.087259  3.18546  3.2847388 3.48698

Both the logistic model and the Bayesian model indicate that statusquo is the strongest predictor in this model. The logistic regression returns a p-value for statusquo below 2e-16, far smaller than alpha and highly significant, while age falls short of significance (p = .0972). The Bayesian results support these findings, as is evident in the 95% credible interval for statusquo reported in the quantiles above: the lower bound is about 2.9 and the upper bound about 3.5. Since this interval does not include 0, it can be inferred that statusquo is a strong predictor of how an individual will vote; the corresponding interval for age (-0.002 to 0.025) does overlap 0. The AIC of this model is 740.52; to compare it with the chapter's model, which used income and age as predictors, that model can be refit on the same cases, as shown in the sketch below, with the lower AIC indicating the better fit.
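
The chapter's AIC value is not reproduced here, but the original model can be refit for a direct comparison (a sketch; chiOld is an illustrative name, and output is not shown):

# Refit the chapter's model (income and age as predictors) and compare AIC
chiOld <- glm(vote ~ age + income, family = binomial(), data = ChileYN)
AIC(chiOld, chiLM)  # the model with the lower AIC fits better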

Question 7

Bonus R code question: Develop your own custom function that will take the posterior distribution of a coefficient from the output object from an MCMClogit() analysis and automatically create a histogram of the posterior distributions of the coefficient in terms of regular odds (instead of log-odds). Make sure to mark vertical lines on the histogram indicating the boundaries of the 95% HDI.

statusQuoLogOdds <- as.matrix(chiBayes[,"statusquo"])  # posterior draws (log-odds)
statusOdds <- apply(statusQuoLogOdds, 1, exp)          # convert each draw to regular odds
hist(statusOdds)                                       # posterior distribution in odds
abline(v = quantile(statusOdds, c(.025)), col = 'black')  # lower 95% boundary
abline(v = quantile(statusOdds, c(.975)), col = 'black')  # upper 95% boundary
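
Since the question asks for a custom function, the script above can be generalized (a sketch; the function name oddsHist and its arguments are my own, and the 2.5% and 97.5% quantiles are used to approximate the HDI boundaries, as in the script above):

# Histogram of a coefficient's posterior in regular odds, with 95% boundary lines
oddsHist <- function(mcmcOut, coefName) {
  logOdds <- as.matrix(mcmcOut[, coefName])  # posterior draws of the log-odds
  odds <- exp(logOdds)                       # convert each draw to regular odds
  hist(odds, main = paste("Posterior odds:", coefName), xlab = "Odds")
  abline(v = quantile(odds, c(.025, .975)), col = 'black')  # 95% boundaries
}
oddsHist(chiBayes, "statusquo")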