The Titanic was a British luxury ocean liner that famously sank in the icy North Atlantic Ocean on its maiden voyage in April 1912. Of the approximately 2,200 passengers on board, 1,500 died. The high death rate was blamed largely on the inadequate supply of lifeboats, a result of the manufacturer’s claim that the ship was “unsinkable.” A partial dataset of the passenger list was compiled by Philip Hinde in his Encyclopedia Titanica and is given in the datafile Titanic. Two questions of interest are the relationship between survival and age and the relationship between survival and sex. The following variables will be useful for the questions below:
Name | Description |
---|---|
Age | which gives the passenger’s age in years |
Sex | which gives the passenger’s sex (male or female) |
Survived | A binary variable, where 1 indicates the passenger survived and 0 indicates death |
SexCode | which numerically codes male as 0 and female as 1 |
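library(Stat2Data)  # assumed source of the Titanic passenger data (not the base R table of the same name)
library(ggplot2)    # provides ggplot() and geom_boxplot()
data("Titanic")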
ggplot(Titanic) + geom_boxplot(aes(x=factor(Survived), y=Age))
## Warning: Removed 557 rows containing non-finite values (stat_boxplot).
logm = glm(Survived~Age, data=Titanic, family=binomial)
summary(logm)
##
## Call:
## glm(formula = Survived ~ Age, family = binomial, data = Titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1418 -1.0489 -0.9792 1.3039 1.4801
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.081428 0.173862 -0.468 0.6395
## Age -0.008795 0.005232 -1.681 0.0928 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1025.6 on 755 degrees of freedom
## Residual deviance: 1022.7 on 754 degrees of freedom
## (557 observations deleted due to missingness)
## AIC: 1026.7
##
## Number of Fisher Scoring iterations: 4
exp(coef(logm))
## (Intercept) Age
## 0.9217992 0.9912439
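The exponentiated slope is an odds ratio per one year of age. As a rough sketch of how to read it, the snippet below re-enters the rounded coefficients from the summary above by hand (so the results are approximate) and computes the odds ratio for a ten-year age difference and the fitted survival probability for a 30-year-old. The names `b0`, `b1`, `or_10yr`, and `p_age30`, and the choice of age 30, are illustrative assumptions, not part of the original analysis.

```r
# Rounded coefficients copied by hand from the summary(logm) output above
b0 <- -0.081428   # intercept: log-odds of survival at Age = 0
b1 <- -0.008795   # change in log-odds per additional year of age

# Odds ratio for a 10-year increase in age
or_10yr <- exp(10 * b1)

# Fitted probability of survival for a hypothetical 30-year-old passenger;
# plogis() is the inverse logit, 1 / (1 + exp(-x))
p_age30 <- plogis(b0 + b1 * 30)

print(c(or_10yr = or_10yr, p_age30 = p_age30))
```

With the fitted object available, `predict(logm, newdata = data.frame(Age = 30), type = "response")` should give the same probability without rounding.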
Can we use state-level variables to predict whether a state votes for the Democratic versus the Republican presidential nominee? The file Election16 contains data from 50 states plus the District of Columbia.
Name | Description |
---|---|
State | state |
Abr | abbreviation for the state |
Income | per capita income as of 2007 |
HS | percentage of adults with at least a high school education |
BA | percentage of adults with at least a college education |
Adv | percentage of adults with advanced degrees |
Dem.Rep | %Democrat-%Republican in a state, including those who lean toward either party, according to a 2015 Gallup poll |
TrumpWin | 1 or 0 indicating whether the Republican candidate Donald Trump did or did not win a majority of votes in the state |
Fit separate logistic regression models to predict TrumpWin using each of the predictors Income, HS, BA, and Dem.Rep. Which of these variables does the most effective job of predicting this response? Which is the least effective? Explain the criteria you use to make these decisions.
data(Election16)
#TrumpWin with Income
loginc = glm(TrumpWin~Income, data=Election16, family=binomial)
summary(loginc)
##
## Call:
## glm(formula = TrumpWin ~ Income, family = binomial, data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2049 -0.7510 0.4074 0.6566 2.5000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.118e+01 3.076e+00 3.635 0.000277 ***
## Income -1.967e-04 5.582e-05 -3.523 0.000426 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 45.923 on 48 degrees of freedom
## AIC: 49.923
##
## Number of Fisher Scoring iterations: 5
#TrumpWin with HS
loghs = glm(TrumpWin~HS, data=Election16, family=binomial)
summary(loghs)
##
## Call:
## glm(formula = TrumpWin ~ HS, family = binomial, data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6411 -1.2802 0.8704 1.0441 1.1918
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.06905 8.64223 1.049 0.294
## HS -0.09809 0.09768 -1.004 0.315
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 66.262 on 48 degrees of freedom
## AIC: 70.262
##
## Number of Fisher Scoring iterations: 4
#TrumpWin with BA
logba = glm(TrumpWin~BA, data=Election16, family=binomial)
summary(logba)
##
## Call:
## glm(formula = TrumpWin ~ BA, family = binomial, data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9138 -0.3059 0.1718 0.5829 1.4483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 17.9973 5.1098 3.522 0.000428 ***
## BA -0.5985 0.1735 -3.449 0.000562 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 34.433 on 48 degrees of freedom
## AIC: 38.433
##
## Number of Fisher Scoring iterations: 6
#TrumpWin with Dem.Rep
logdem = glm(TrumpWin~Dem.Rep, data=Election16, family=binomial)
summary(logdem)
##
## Call:
## glm(formula = TrumpWin ~ Dem.Rep, family = binomial, data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3018 -0.3360 0.1430 0.5468 1.4758
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.32273 0.43466 0.742 0.457788
## Dem.Rep -0.25034 0.07447 -3.361 0.000775 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 33.596 on 48 degrees of freedom
## AIC: 37.596
##
## Number of Fisher Scoring iterations: 6
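One possible criterion for ranking the four single-predictor models is the drop in deviance from the null model (the likelihood ratio statistic, G), since all four share the same null deviance and each adds one parameter. The sketch below re-enters the deviances from the outputs above by hand; the names `null_dev`, `resid_dev`, and `comparison` are illustrative.

```r
# Deviances copied by hand from the four summary() outputs above
null_dev  <- 67.301
resid_dev <- c(Income = 45.923, HS = 66.262, BA = 34.433, Dem.Rep = 33.596)

# Drop-in-deviance (likelihood ratio) statistic; each model adds 1 parameter
G <- null_dev - resid_dev
p_value <- pchisq(G, df = 1, lower.tail = FALSE)

# Sort so the largest drop in deviance (best by this criterion) comes first
comparison <- data.frame(G = G, p_value = p_value)
comparison <- comparison[order(-comparison$G), ]
print(comparison)
```

By this criterion Dem.Rep gives the largest drop in deviance (33.705), closely followed by BA (32.868), and HS gives the smallest (1.039). Because every model has the same number of parameters, ranking by AIC gives the same order.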
logall = glm(TrumpWin~Income+HS+BA+Dem.Rep, data=Election16, family=binomial)
summary(logall)
##
## Call:
## glm(formula = TrumpWin ~ Income + HS + BA + Dem.Rep, family = binomial,
## data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.44654 -0.05377 0.06772 0.25970 1.35391
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.8785550 19.3235173 0.615 0.5387
## Income -0.0001785 0.0001436 -1.243 0.2139
## HS 0.0663105 0.2509851 0.264 0.7916
## BA -0.2753913 0.2733136 -1.008 0.3136
## Dem.Rep -0.2520207 0.1125109 -2.240 0.0251 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 19.555 on 45 degrees of freedom
## AIC: 29.555
##
## Number of Fisher Scoring iterations: 7
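To judge whether Income, HS, and BA add anything once Dem.Rep is in the model, one option is a nested drop-in-deviance test comparing the Dem.Rep-only model with the four-predictor model. This is a sketch using the deviances and degrees of freedom re-entered by hand from the outputs above; the variable names are illustrative.

```r
# Residual deviances and residual df copied by hand from the outputs above
dev_reduced <- 33.596; df_reduced <- 48   # TrumpWin ~ Dem.Rep
dev_full    <- 19.555; df_full    <- 45   # TrumpWin ~ Income + HS + BA + Dem.Rep

# Nested likelihood ratio test: difference in deviance on difference in df
G_nested  <- dev_reduced - dev_full
df_nested <- df_reduced - df_full
p_nested  <- pchisq(G_nested, df = df_nested, lower.tail = FALSE)

print(c(G = G_nested, df = df_nested, p_value = p_nested))
```

With both fitted objects in the workspace, `anova(logdem, logall, test = "Chisq")` should report the same comparison.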
Refer to the data in Election16 that are described in Exercise 9.36. Run a logistic regression model to predict TrumpWin for each state using the per capita Income of the states.
data(Election16)
loginc = glm(TrumpWin~Income, data=Election16, family=binomial)
summary(loginc)
##
## Call:
## glm(formula = TrumpWin ~ Income, family = binomial, data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2049 -0.7510 0.4074 0.6566 2.5000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.118e+01 3.076e+00 3.635 0.000277 ***
## Income -1.967e-04 5.582e-05 -3.523 0.000426 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 45.923 on 48 degrees of freedom
## AIC: 49.923
##
## Number of Fisher Scoring iterations: 5
exp(coef(loginc)[2])
## Income
## 0.9998033
exp(confint(loginc))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 366.9418556 7.917787e+07
## Income 0.9996761 9.998988e-01
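An odds ratio per single dollar of income is hard to read because it is so close to 1. The sketch below rescales it to a $1,000 difference and also computes a Wald interval by hand for comparison. It assumes the rounded estimate and standard error from the summary above; note that `confint()` reported a profile-likelihood interval, so the Wald endpoints will differ slightly from the ones printed above. The $1,000 rescaling and the variable names are illustrative choices.

```r
# Rounded estimate and standard error copied by hand from summary(loginc)
b_income  <- -1.967e-04
se_income <- 5.582e-05

# Odds ratio for a $1,000 increase in per capita income
or_per_1000 <- exp(1000 * b_income)

# 95% Wald interval: build it on the log-odds scale, then exponentiate
z_crit <- qnorm(0.975)
wald_ci_logodds <- b_income + c(-1, 1) * z_crit * se_income
wald_ci_or <- exp(wald_ci_logodds)

print(or_per_1000)
print(wald_ci_or)
```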
The dataset Gunnels comes from a study on the habitat preferences of a species of eel, called a gunnel. Biologist Jake Shorty sampled quadrats along a coastline and recorded a binary variable, Gunnel, where 1 indicates that the species was found in the quadrat and 0 that it was absent. Below is output from a logistic model fit with Time as the explanatory variable, where Time represents the time of day as “minutes from midnight,” with a minimum of 383 minutes (about 6:23 a.m.) and a maximum of 983 minutes (about 4:23 p.m.).
Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | 0.371980 | 0.644047 | 0.578 | 0.564 |
Time | -0.005899 | 0.001049 | -5.624 | 1.86e-08 *** |
\[ H_0: \beta_1=0\\ H_A: \beta_1\ne0 \]
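The Wald test of these hypotheses divides the estimate by its standard error and refers the result to a standard normal distribution. A minimal sketch, assuming the rounded values from the table above (so the statistic can differ from the tabled one in the last digit):

```r
# Rounded estimate and standard error for Time, copied from the table above
b_time  <- -0.005899
se_time <- 0.001049

# Wald z statistic and two-sided p-value under H0: beta_1 = 0
z_time <- b_time / se_time
p_time <- 2 * pnorm(-abs(z_time))

print(c(z = z_time, p_value = p_time))
```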
-0.005899 + qnorm(0.05/2)*0.001049
## [1] -0.007955002
-0.005899 - qnorm(0.05/2)*0.001049
## [1] -0.003842998
exp(-0.005899)
## [1] 0.9941184
# Exponentiate the endpoints of the log-odds interval to get the odds-ratio interval
exp(-0.005899 + qnorm(0.05/2)*0.001049)
## [1] 0.9920766
exp(-0.005899 - qnorm(0.05/2)*0.001049)
## [1] 0.9961644
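To see what the negative Time coefficient means on the probability scale, the fitted model can be evaluated at the earliest and latest sampling times. This is a sketch using the rounded coefficients from the table above; the names `b0`, `b1`, `times`, and `p_gunnel` are illustrative.

```r
# Rounded coefficients copied by hand from the Gunnels model table above
b0 <- 0.371980
b1 <- -0.005899

# Earliest and latest observed times, in minutes from midnight
times <- c(earliest = 383, latest = 983)

# Fitted probability of finding a gunnel at each time (inverse logit)
p_gunnel <- plogis(b0 + b1 * times)
print(p_gunnel)
```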