9.27 Titanic: Survival and age, CHOOSE and FIT

The Titanic was a British luxury ocean liner that famously sank in the icy North Atlantic Ocean on its maiden voyage in April 1912. Of the approximately 2200 passengers on board, 1500 died. The high death rate was blamed largely on the inadequate supply of lifeboats, a result of the manufacturer’s claim that the ship was “unsinkable.” A partial dataset of the passenger list was compiled by Philip Hinde in his Encyclopedia Titanic and is given in the datafile Titanic. Two questions of interest are the relationship between survival and age and the relationship between survival and sex. The following variables will be useful for your work on the questions below:

Name Description
Age which gives the passenger’s age in years
Sex which gives the passenger’s sex (male or female)
Survived A binary variable, where 1 indicates the passenger survived and 0 indicates death
SexCode which numerically codes male as 0 and female as 1
  a. Use a plot to explore whether there is a relationship between survival and the passenger’s age. What do you conclude from this graph alone?
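library(Stat2Data)  # assumed source of the Titanic data frame (Stat2 companion package)
library(ggplot2)    # provides ggplot() and geom_boxplot()
data("Titanic")
ggplot(Titanic) + geom_boxplot(aes(x=factor(Survived), y=Age))  # age by survival status (0 = died, 1 = survived)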
## Warning: Removed 557 rows containing non-finite values (stat_boxplot).

  b. Use software to fit a logistic model to the survival and age variables to decide whether there is a statistically significant relationship between age and survival, and if there is, what its direction and magnitude are. Write the estimated logistic model using the output and interpret the output in light of the question.
logm = glm(Survived~Age, data=Titanic, family=binomial)
summary(logm)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial, data = Titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1418  -1.0489  -0.9792   1.3039   1.4801  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.081428   0.173862  -0.468   0.6395  
## Age         -0.008795   0.005232  -1.681   0.0928 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1025.6  on 755  degrees of freedom
## Residual deviance: 1022.7  on 754  degrees of freedom
##   (557 observations deleted due to missingness)
## AIC: 1026.7
## 
## Number of Fisher Scoring iterations: 4
exp(coef(logm))
## (Intercept)         Age 
##   0.9217992   0.9912439
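
Reading the coefficients off the output, the estimated model is

\[ \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -0.0814 - 0.0088\,\text{Age} \]

The slope is negative, so the estimated odds of survival are multiplied by about 0.991 for each additional year of age, but the P-value of 0.0928 means the relationship is not significant at the 5% level. As a rough sketch, the fitted probabilities at a few ages can be computed directly from the printed coefficients. The helper `p_survive` and the ages chosen are illustrative assumptions, not part of the original solution.

```r
# Coefficients copied (rounded) from the summary(logm) output above
b0 <- -0.081428
b1 <- -0.008795

# Hypothetical helper: fitted probability of survival at a given age.
# plogis() is the inverse logit, exp(x) / (1 + exp(x)).
p_survive <- function(age) plogis(b0 + b1 * age)

ages  <- c(10, 30, 50, 70)   # illustrative ages only
probs <- p_survive(ages)
print(round(probs, 3))

# Odds ratio for one additional year and for ten additional years of age
or_1yr  <- exp(b1)
or_10yr <- exp(10 * b1)
print(c(or_1yr = or_1yr, or_10yr = or_10yr))
```

Because the slope is small and its P-value exceeds 0.05, the fitted probabilities change only slightly across this age range.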

9.36 Red states and blue states in 2016: Compare models

Can we use state-level variables to predict whether a state votes for the Democratic versus the Republican presidential nominee? The file Election16 contains data from 50 states plus the District of Columbia.

Name Description
State state
Abr abbreviation for the state
Income per capita income as of 2007
HS percentage of adults with at least a high school education
BA percentage of adults with at least a college education
Avd percentage of adults with advanced degrees
Dem.Rep %Democrat minus %Republican in a state, including those who lean toward either party, according to a 2015 Gallup poll
TrumpWin 1 or 0 indicating whether the Republican candidate Donald Trump did or did not win a majority of votes in the state

Fit separate logistic regression models to predict TrumpWin using each of the predictors Income, HS, BA, and Dem.Rep. Which of these variables does the most effective job of predicting this response? Which is the least effective? Explain the criteria you use to make these decisions.

data(Election16)
#TrumpWin with Income
loginc = glm(TrumpWin~Income, data=Election16, family=binomial)
summary(loginc)
## 
## Call:
## glm(formula = TrumpWin ~ Income, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2049  -0.7510   0.4074   0.6566   2.5000  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.118e+01  3.076e+00   3.635 0.000277 ***
## Income      -1.967e-04  5.582e-05  -3.523 0.000426 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 45.923  on 48  degrees of freedom
## AIC: 49.923
## 
## Number of Fisher Scoring iterations: 5
#TrumpWin with HS
loghs = glm(TrumpWin~HS, data=Election16, family=binomial)
summary(loghs)
## 
## Call:
## glm(formula = TrumpWin ~ HS, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6411  -1.2802   0.8704   1.0441   1.1918  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  9.06905    8.64223   1.049    0.294
## HS          -0.09809    0.09768  -1.004    0.315
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 66.262  on 48  degrees of freedom
## AIC: 70.262
## 
## Number of Fisher Scoring iterations: 4
#TrumpWin with BA
logba = glm(TrumpWin~BA, data=Election16, family=binomial)
summary(logba)
## 
## Call:
## glm(formula = TrumpWin ~ BA, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9138  -0.3059   0.1718   0.5829   1.4483  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  17.9973     5.1098   3.522 0.000428 ***
## BA           -0.5985     0.1735  -3.449 0.000562 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 34.433  on 48  degrees of freedom
## AIC: 38.433
## 
## Number of Fisher Scoring iterations: 6
#TrumpWin with Dem.Rep
logdem = glm(TrumpWin~Dem.Rep, data=Election16, family=binomial)
summary(logdem)
## 
## Call:
## glm(formula = TrumpWin ~ Dem.Rep, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3018  -0.3360   0.1430   0.5468   1.4758  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.32273    0.43466   0.742 0.457788    
## Dem.Rep     -0.25034    0.07447  -3.361 0.000775 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 33.596  on 48  degrees of freedom
## AIC: 37.596
## 
## Number of Fisher Scoring iterations: 6
#TrumpWin with all four predictors
logall = glm(TrumpWin~Income+HS+BA+Dem.Rep, data=Election16, family=binomial)
summary(logall)
## 
## Call:
## glm(formula = TrumpWin ~ Income + HS + BA + Dem.Rep, family = binomial, 
##     data = Election16)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44654  -0.05377   0.06772   0.25970   1.35391  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 11.8785550 19.3235173   0.615   0.5387  
## Income      -0.0001785  0.0001436  -1.243   0.2139  
## HS           0.0663105  0.2509851   0.264   0.7916  
## BA          -0.2753913  0.2733136  -1.008   0.3136  
## Dem.Rep     -0.2520207  0.1125109  -2.240   0.0251 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 19.555  on 45  degrees of freedom
## AIC: 29.555
## 
## Number of Fisher Scoring iterations: 7
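
One possible criterion for comparing the four single-predictor models is the drop in deviance (the G statistic) relative to the null model. All four models share the same null deviance and each uses one degree of freedom, so a larger drop indicates a more effective predictor. The sketch below recomputes G from the deviances printed above; the object names are illustrative. By this measure Dem.Rep gives the largest drop and HS the smallest, which agrees with the ordering of the AIC values.

```r
# Deviances copied from the four single-predictor summaries above
null_dev  <- 67.301
resid_dev <- c(Income = 45.923, HS = 66.262, BA = 34.433, Dem.Rep = 33.596)

# Drop-in-deviance statistic and its chi-square P-value (1 df per model)
G       <- null_dev - resid_dev
p_value <- pchisq(G, df = 1, lower.tail = FALSE)

comparison <- data.frame(G = G, p_value = p_value)
# Sort from most to least effective predictor
comparison <- comparison[order(-comparison$G), ]
print(comparison)
```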

9.37 Red states and blue states in 2016, income: Odds ratio

Refer to the data in Election16 that are described in Exercise 9.36. Run a logistic regression model to predict TrumpWin for each state using the per capita Income of the states.

data(Election16)
loginc = glm(TrumpWin~Income, data=Election16, family=binomial)
summary(loginc)
## 
## Call:
## glm(formula = TrumpWin ~ Income, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2049  -0.7510   0.4074   0.6566   2.5000  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.118e+01  3.076e+00   3.635 0.000277 ***
## Income      -1.967e-04  5.582e-05  -3.523 0.000426 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 45.923  on 48  degrees of freedom
## AIC: 49.923
## 
## Number of Fisher Scoring iterations: 5
  a. Use the estimated slope from the logistic regression to compute an estimated odds ratio and write a sentence that interprets this value in the context of this problem.
exp(coef(loginc)[2])
##    Income 
## 0.9998033
  b. Find a 95% confidence interval for the odds ratio in (a).
exp(confint(loginc))
## Waiting for profiling to be done...
##                   2.5 %       97.5 %
## (Intercept) 366.9418556 7.917787e+07
## Income        0.9996761 9.998988e-01
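
An odds ratio of 0.9998 per additional dollar of per capita income is hard to interpret because a one-dollar change is tiny. A common rescaling, sketched below under the assumption that a $1,000 increase is the unit of interest, raises the per-dollar odds ratio and its interval endpoints to the 1000th power. The values are copied from the output above and the object names are illustrative.

```r
# Slope and profile-likelihood interval copied from the output above
b_income      <- -1.967e-04
ci_per_dollar <- c(lower = 0.9996761, upper = 0.9998988)

# Odds ratio for a $1 and for a $1,000 increase in per capita income
or_per_dollar <- exp(b_income)
or_per_1000   <- exp(1000 * b_income)

# exp(1000 * log(x)) equals x^1000, so the interval endpoints rescale the same way
ci_per_1000 <- ci_per_dollar^1000

print(round(c(or_per_1000 = or_per_1000, ci_per_1000), 3))
```

On this scale the estimated odds of a Trump win are multiplied by roughly 0.82 for each additional $1,000 of per capita income, and the whole interval lies below 1.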

9.39 Gunnels

The dataset Gunnels comes from a study on the habitat preferences of a species of eel called a gunnel. Biologist Jake Shorty sampled quadrats along a coastline and recorded a binary variable, Gunnel, where 1 indicates that the species was found in the quadrat and 0 that it was absent. Below is output from a logistic model fit with Time as the explanatory variable, where Time represents the time of day as “minutes from midnight,” with a minimum of 383 minutes (about 6:23 a.m.) and a maximum of 983 minutes (about 4:23 p.m.).

            Estimate  Std. Error z value Pr(>|z|)
(Intercept)  0.371980   0.644047   0.578    0.564
Time        -0.005899   0.001049  -5.624 1.86e-08 ***
  a. State the null hypothesis that the P-value of 1.86e-08 allows you to test.

\[ \begin{aligned} H_0 &: \beta_1 = 0 \\ H_A &: \beta_1 \ne 0 \end{aligned} \]
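
As a quick check on how that P-value arises, the Wald z statistic is the estimated slope divided by its standard error, and the two-sided P-value comes from the standard normal distribution. The sketch below uses only the numbers in the table above; the variable names are illustrative.

```r
# Estimate and standard error for Time, copied from the table above
b_time  <- -0.005899
se_time <- 0.001049

# Wald z statistic and two-sided P-value for H0: beta_1 = 0
z_stat  <- b_time / se_time
p_value <- 2 * pnorm(-abs(z_stat))

print(c(z = z_stat, p = p_value))
```

Both values should agree with the table up to rounding of the printed estimate and standard error.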

  b. What happens to the probability of finding a gunnel as you get later in the day? Does it get smaller or larger?
  c. Find a 95% confidence interval for the slope parameter of the logistic model.
-0.005899 + qnorm(0.05/2)*0.001049
## [1] -0.007955002
-0.005899 - qnorm(0.05/2)*0.001049
## [1] -0.003842998
  d. What quantity does 0.371980 + (-0.005899)(600) = -3.16742 estimate?
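
The expression plugs Time = 600 (10:00 a.m.) into the fitted linear predictor, so it estimates the log odds of finding a gunnel at that time. As a sketch, the log odds can be converted to odds and to a probability, and the same calculation over the observed range of Time shows the direction of the relationship: the slope is negative, so the fitted probability shrinks as the day goes on. The grid of times is an illustrative assumption.

```r
# Coefficients copied from the table above
b0 <- 0.371980
b1 <- -0.005899

# Estimated log odds, odds, and probability of finding a gunnel at Time = 600
log_odds_600 <- b0 + b1 * 600
odds_600     <- exp(log_odds_600)
prob_600     <- plogis(log_odds_600)   # inverse logit
print(c(log_odds = log_odds_600, odds = odds_600, prob = prob_600))

# Fitted probabilities across the observed range of Time (383 to 983 minutes)
times      <- seq(383, 983, by = 100)
prob_times <- plogis(b0 + b1 * times)
print(round(data.frame(Time = times, prob = prob_times), 4))
```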
  e. Find the estimated odds ratio for an additional minute of time after midnight.
exp(-0.005899)
## [1] 0.9941184
  f. Give a 95% confidence interval for the odds ratio found in part (e).
# Exponentiate the endpoints of the slope interval; do not add the margin of error to exp(slope)
exp(-0.005899 + qnorm(0.05/2)*0.001049)
## [1] 0.9920766
exp(-0.005899 - qnorm(0.05/2)*0.001049)
## [1] 0.9961644
  g. Find an estimate for the odds ratio for an additional hour, or 60 minutes, of time passing. Give a sentence interpreting the meaning of this number in simple language using the term “odds.”
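
Because the model is linear in the log odds, the odds ratio for a 60-minute increase is the per-minute odds ratio raised to the 60th power, or equivalently the exponential of 60 times the slope. A minimal sketch using the slope from the table above (variable names are illustrative):

```r
# Slope for Time (per minute), copied from the table above
b_time <- -0.005899

# Odds ratio for one additional minute and for one additional hour
or_minute <- exp(b_time)
or_hour   <- exp(60 * b_time)

print(c(or_minute = or_minute, or_hour = or_hour))
```

The hourly odds ratio comes out at about 0.70, so the estimated odds of finding a gunnel in a quadrat fall by roughly 30% for each hour later in the day.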