Hw #9

Reading: Ch. 9, 10
Exercises to hand in: 9.27, 9.36, 9.37, 9.39

9.27 Titanic: Survival and age, CHOOSE and FIT

The Titanic was a British luxury ocean liner that sank famously in the icy North Atlantic Ocean on its maiden voyage in April 1912. Of the approximately 2200 passengers on board, 1500 died. The high death rate was blamed largely on the inadequate supply of lifeboats, a result of the manufacturer’s claim that the ship was “unsinkable.” A partial dataset of the passenger list was compiled by Philip Hinde in his Encyclopedia Titanic and is given in the datafile Titanic. Two questions of interest are the relationship between survival and age and the relationship between survival and sex. The following variables will be useful for your work on the following questions:

Name	Description
Age	which gives the passenger’s age in years
Sex	which gives the passenger’s sex (male or female)
Survived	A binary variable, where 1 indicates the passenger survived and 0 indicates death
SexCode	which numerically codes male as 0 and female as 1

Use a plot to explore whether there is a relationship between survival and the passenger’s age. What do you conclude from this graph alone?

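library(Stat2Data)  # assumption: Titanic comes from Stat2Data (base R's Titanic is a table with no Age column)
library(ggplot2)    # provides ggplot() and geom_boxplot()
data("Titanic")
ggplot(Titanic) + geom_boxplot(aes(x=factor(Survived), y=Age))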

## Warning: Removed 557 rows containing non-finite values (stat_boxplot).

The boxplots show that the median ages of those who survived and those who did not are fairly similar. From this graph alone, there does not appear to be a strong relationship between age and survival.

Use software to fit a logistic model to the survival and age variables to decide whether there is a statistically significant relationship between age and survival, and if there is, what its direction and magnitude are. Write the estimated logistic model using the output and interpret the output in light of the question.

logm = glm(Survived~Age, data=Titanic, family=binomial)
summary(logm)

## 
## Call:
## glm(formula = Survived ~ Age, family = binomial, data = Titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1418  -1.0489  -0.9792   1.3039   1.4801  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.081428   0.173862  -0.468   0.6395  
## Age         -0.008795   0.005232  -1.681   0.0928 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1025.6  on 755  degrees of freedom
## Residual deviance: 1022.7  on 754  degrees of freedom
##   (557 observations deleted due to missingness)
## AIC: 1026.7
## 
## Number of Fisher Scoring iterations: 4

exp(coef(logm))

## (Intercept)         Age 
##   0.9217992   0.9912439

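The p-value for Age is 0.0928, so at the 0.05 significance level we do not have enough evidence to reject the null hypothesis that the slope for Age is zero; the relationship between age and survival is not statistically significant. The estimated slope is negative: each additional year of age multiplies the estimated odds of survival by 0.9912, a decrease of about 0.9%. The estimated logistic model is $\text{logit}(\hat{\pi}) = -0.081428 - 0.008795\,\text{Age}$.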
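As a cross-check on the Wald test, the drop in deviance reported in the summary can be compared to a chi-square distribution with 1 degree of freedom. The sketch below uses only the numbers printed in the output above; the variable names and the three ages are illustrative choices, not part of the original analysis.

```r
# Coefficients and deviances copied from the summary(logm) output above
b0 <- -0.081428
b1 <- -0.008795
null_dev  <- 1025.6
resid_dev <- 1022.7

# Drop-in-deviance (likelihood ratio) test for Age, 1 degree of freedom
G <- null_dev - resid_dev
p_lrt <- pchisq(G, df = 1, lower.tail = FALSE)

# Estimated survival probabilities at illustrative ages (chosen arbitrarily)
ages <- c(10, 30, 50)
p_hat <- plogis(b0 + b1 * ages)

round(c(G = G, p_value = p_lrt), 4)
round(setNames(p_hat, paste0("Age", ages)), 3)
```

The likelihood ratio p-value also falls between 0.05 and 0.10, so it agrees with the Wald test, and the fitted probabilities decline only slowly with age.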

9.36 Red states and blue states in 2016: Compare models

Can we use state-level variables to predict whether a state votes for the Democratic versus the Republican presidential nominee? The file Election16 contains data from 50 states plus the District of Columbia.

Name	Description
State	state
Abr	abbreviation for the state
Income	per capita income as of 2007
HS	percentage of adults with at least a high school education
BA	percentage of adults with at least a college education
Adv	percentage of adults with advanced degrees
Dem.Rep	%Democrat - %Republican in a state, including those who lean toward either party, according to a 2015 Gallup poll
TrumpWin	1 or 0 indicating whether the Republican candidate Donald Trump did or did not win a majority of votes in the state

Fit separate logistic regression models to predict TrumpWin using each of the predictors Income, HS, BA, and Dem.Rep. Which of these variables does the most effective job of predicting this response? Which is the least effective? Explain the criteria you use to make these decisions.

data(Election16)
#TrumpWin with Income
loginc = glm(TrumpWin~Income, data=Election16, family=binomial)
summary(loginc)

## 
## Call:
## glm(formula = TrumpWin ~ Income, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2049  -0.7510   0.4074   0.6566   2.5000  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.118e+01  3.076e+00   3.635 0.000277 ***
## Income      -1.967e-04  5.582e-05  -3.523 0.000426 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 45.923  on 48  degrees of freedom
## AIC: 49.923
## 
## Number of Fisher Scoring iterations: 5

#TrumpWin with HS
loghs = glm(TrumpWin~HS, data=Election16, family=binomial)
summary(loghs)

## 
## Call:
## glm(formula = TrumpWin ~ HS, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6411  -1.2802   0.8704   1.0441   1.1918  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  9.06905    8.64223   1.049    0.294
## HS          -0.09809    0.09768  -1.004    0.315
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 66.262  on 48  degrees of freedom
## AIC: 70.262
## 
## Number of Fisher Scoring iterations: 4

#TrumpWin with BA
logba = glm(TrumpWin~BA, data=Election16, family=binomial)
summary(logba)

## 
## Call:
## glm(formula = TrumpWin ~ BA, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9138  -0.3059   0.1718   0.5829   1.4483  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  17.9973     5.1098   3.522 0.000428 ***
## BA           -0.5985     0.1735  -3.449 0.000562 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 34.433  on 48  degrees of freedom
## AIC: 38.433
## 
## Number of Fisher Scoring iterations: 6

#TrumpWin with Dem.Rep
logdem = glm(TrumpWin~Dem.Rep, data=Election16, family=binomial)
summary(logdem)

## 
## Call:
## glm(formula = TrumpWin ~ Dem.Rep, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3018  -0.3360   0.1430   0.5468   1.4758  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.32273    0.43466   0.742 0.457788    
## Dem.Rep     -0.25034    0.07447  -3.361 0.000775 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 33.596  on 48  degrees of freedom
## AIC: 37.596
## 
## Number of Fisher Scoring iterations: 6

logall = glm(TrumpWin~Income+HS+BA+Dem.Rep, data=Election16, family=binomial)
summary(logall)

## 
## Call:
## glm(formula = TrumpWin ~ Income + HS + BA + Dem.Rep, family = binomial, 
##     data = Election16)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44654  -0.05377   0.06772   0.25970   1.35391  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 11.8785550 19.3235173   0.615   0.5387  
## Income      -0.0001785  0.0001436  -1.243   0.2139  
## HS           0.0663105  0.2509851   0.264   0.7916  
## BA          -0.2753913  0.2733136  -1.008   0.3136  
## Dem.Rep     -0.2520207  0.1125109  -2.240   0.0251 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 19.555  on 45  degrees of freedom
## AIC: 29.555
## 
## Number of Fisher Scoring iterations: 7

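The Wald p-values for the slopes are 0.000426 for Income, 0.315 for HS, 0.000562 for BA, and 0.000775 for Dem.Rep. By this criterion Income is the most effective predictor because its p-value is the smallest, and HS is the least effective because its p-value is by far the largest and is not significant at any usual level. The three small p-values are close together, however, and the residual deviances (all four models share the null deviance of 67.301) give a different ordering: Dem.Rep (33.596) and BA (34.433) fit noticeably better than Income (45.923), with HS (66.262) again last. Judged by drop in deviance or AIC, Dem.Rep is therefore the most effective single predictor; under either criterion HS is the least effective.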
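An alternative to comparing Wald p-values is to compare each model's drop in deviance from the intercept-only model (the G statistic), since all four models have one predictor and the same null deviance. A minimal sketch, using only the deviances printed in the summaries above (the object names are illustrative):

```r
# Deviances copied from the four single-predictor summaries above
null_dev  <- 67.301
resid_dev <- c(Income = 45.923, HS = 66.262, BA = 34.433, Dem.Rep = 33.596)

# G statistic: drop in deviance relative to the intercept-only model
G <- null_dev - resid_dev
p_value <- pchisq(G, df = 1, lower.tail = FALSE)

# Rank predictors from largest to smallest drop in deviance
ranking <- data.frame(G = G, p_value = signif(p_value, 3))
ranking[order(-ranking$G), ]
```

A larger G means the predictor explains more of the deviance, so the first row of the sorted table is the most effective predictor by this criterion and the last row the least effective.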

9.37 Red states and blue states in 2016, income: Odds ratio

Refer to the data in Election16 that are described in Exercise 9.36. Run a logistic regression model to predict TrumpWin for each state using the per capita Income of the states.

data(Election16)
loginc = glm(TrumpWin~Income, data=Election16, family=binomial)
summary(loginc)

## 
## Call:
## glm(formula = TrumpWin ~ Income, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2049  -0.7510   0.4074   0.6566   2.5000  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.118e+01  3.076e+00   3.635 0.000277 ***
## Income      -1.967e-04  5.582e-05  -3.523 0.000426 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 45.923  on 48  degrees of freedom
## AIC: 49.923
## 
## Number of Fisher Scoring iterations: 5

Use the estimated slope from the logistic regression to compute an estimated odds ratio and write a sentence that interprets this value in the context of this problem.

exp(coef(loginc)[2])

##    Income 
## 0.9998033

The estimated odds ratio is 0.9998033. Each additional $1 of per capita income is associated with multiplying the estimated odds of a Trump win by 0.9998033, a decrease of about 0.02%; states with higher income have lower odds of a Trump win.
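Because a $1 change in per capita income is tiny, the odds ratio is easier to read on a larger scale. The sketch below rescales the slope printed above to a $1,000 increase; the $1,000 step and the variable names are illustrative choices, not part of the exercise.

```r
# Slope for Income copied from the summary(loginc) output above
b_income <- -1.967e-04

# Odds ratio for a $1 increase, and for an illustrative $1,000 increase
or_1    <- exp(b_income)
or_1000 <- exp(1000 * b_income)

round(c(per_1 = or_1, per_1000 = or_1000), 4)
```

The odds ratio for a k-dollar increase is the one-dollar odds ratio raised to the power k, which is why the slope is multiplied by 1000 before exponentiating.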

Find a 95% confidence interval for the odds ratio in (a).

exp(confint(loginc))

## Waiting for profiling to be done...

##                   2.5 %       97.5 %
## (Intercept) 366.9418556 7.917787e+07
## Income        0.9996761 9.998988e-01

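The 95% confidence interval for the odds ratio is (0.9996761, 0.9998988). We are 95% confident that a $1 increase in per capita income is associated with multiplying the odds of Trump winning by a factor between 0.9996761 and 0.9998988. Because the interval lies entirely below 1, higher income is associated with lower odds of a Trump win.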
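The interval above comes from `confint()`, which for a `glm` fit uses profile likelihood (hence the "Waiting for profiling to be done..." message). For comparison, a Wald interval can be computed by hand from the printed estimate and standard error. This is a sketch with illustrative variable names, and the $1,000 rescaling is an added assumption for readability.

```r
# Estimate and standard error for Income, copied from summary(loginc)
b_income  <- -1.967e-04
se_income <- 5.582e-05

# Wald 95% interval on the log-odds scale, then exponentiate the endpoints
z <- qnorm(0.975)
wald_ci <- exp(b_income + c(-1, 1) * z * se_income)

# The same interval rescaled to an illustrative $1,000 increase in Income
wald_ci_1000 <- exp(1000 * (b_income + c(-1, 1) * z * se_income))

wald_ci
wald_ci_1000
```

The Wald and profile-likelihood intervals are close but not identical; both lie entirely below 1.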

9.39 Gunnels

The dataset Gunnels comes from a study on the habitat preferences of a species of eel, called a gunnel. Biologist Jake Shorty sampled quadrats along a coastline and recorded a binary variable, Gunnel, where 1 indicates that the species was found in the quadrat and 0 that it was not. Below is output from a logistic model fit with Time as the explanatory variable, where Time represents the time of day as “minutes from midnight,” with a minimum of 383 minutes (about 6:23 a.m.) and a maximum of 983 minutes (about 4:23 p.m.).

	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	0.371980	0.644047	0.578	0.564
Time	-0.005899	0.001049	-5.624	1.86e-08 ***

State the null hypothesis that the P-value of 1.86e-08 allows you to test.

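\[ H_0: \beta_1 = 0 \quad \text{versus} \quad H_A: \beta_1 \ne 0 \]

Here $\beta_1$ is the slope for Time, so the null hypothesis is that the log-odds of finding a gunnel do not depend on the time of day.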

What happens to the probability of finding a gunnel as you get later in the day? Does it get smaller or larger?

The probability of finding a gunnel gets smaller as it gets later in the day, because the estimated slope for Time is negative.
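To see the size of this decrease, the fitted model can be evaluated across the observed range of Time. The sketch below uses the coefficients from the output table above; the 100-minute grid and the variable names are illustrative choices.

```r
# Coefficients copied from the Gunnels output table above
b0 <- 0.371980
b1 <- -0.005899

# Times from the earliest (383) to the latest (983) observed minute, in 100-minute steps
time <- seq(383, 983, by = 100)
p_hat <- plogis(b0 + b1 * time)

round(setNames(p_hat, time), 4)
```

Each successive fitted probability is smaller than the one before it, as the negative slope implies.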

Find a 95% confidence interval for the slope parameter in the logistic model.

-0.005899 + qnorm(0.05/2)*0.001049

## [1] -0.007955002

-0.005899 - qnorm(0.05/2)*0.001049

## [1] -0.003842998

The 95% confidence interval for the slope parameter in the logistic model is (-0.007955002, -0.003842998). We are 95% confident that a 1-minute increase in Time is associated with a change in the log-odds of finding a gunnel of between -0.007955002 and -0.003842998. Since the slope is on the log-odds scale, these endpoints are not themselves odds multipliers.

What quantity does 0.371980 + (-0.005899)(600) = -3.16742 estimate?

This quantity estimates the log-odds of finding a gunnel at Time = 600, that is, at 10:00 a.m.
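The log-odds value can be converted to odds and then to a probability, which is usually easier to interpret. A short sketch with illustrative variable names, using the coefficients from the output table:

```r
# Coefficients copied from the Gunnels output table above
b0 <- 0.371980
b1 <- -0.005899

# Log-odds, odds, and probability of finding a gunnel at Time = 600 (10:00 a.m.)
log_odds <- b0 + b1 * 600
odds     <- exp(log_odds)
prob     <- odds / (1 + odds)

round(c(log_odds = log_odds, odds = odds, prob = prob), 5)
```

The last step is the inverse logit, so `prob` matches what `plogis(log_odds)` would return.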

Find the estimated odds ratio for an additional minute of time after midnight.

exp(-0.005899)

## [1] 0.9941184

The estimated odds ratio for an additional minute of time after midnight is 0.9941184.

Give a 95% confidence interval for the odds ratio found in part (e).

exp(-0.005899 + qnorm(0.05/2)*0.001049)

## [1] 0.9920766

exp(-0.005899 - qnorm(0.05/2)*0.001049)

## [1] 0.9961644

The 95% confidence interval for the odds ratio found in (e) is (0.9920766, 0.9961644), obtained by exponentiating the endpoints of the slope interval from part (c). We are 95% confident that an additional minute of time after midnight is associated with multiplying the odds of finding a gunnel by a factor between 0.9920766 and 0.9961644.

Find an estimate for the odds ratio for an additional hour or 60 minutes of time passing. Give a sentence interpreting the meaning of this number in simple language using the term “odds.”

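An estimate for the odds ratio for an additional hour of time passing is $e^{60(-0.005899)} \approx 0.7019$. For each hour later in the day, the odds of finding a gunnel are multiplied by about 0.70, that is, they drop by roughly 30%.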
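The hourly odds ratio comes from multiplying the per-minute slope by 60 before exponentiating. The sketch below reproduces that calculation and, as an addition not asked for in the exercise, rescales the Wald interval for the slope in the same way; variable names are illustrative.

```r
# Slope and standard error for Time, copied from the Gunnels output table
b1 <- -0.005899
se <- 0.001049

# Odds ratio for a 60-minute increase in Time
or_60 <- exp(60 * b1)

# Wald 95% interval for the 60-minute odds ratio (endpoints exponentiated)
ci_60 <- exp(60 * (b1 + c(-1, 1) * qnorm(0.975) * se))

round(or_60, 4)
round(ci_60, 4)
```

Scaling on the log-odds scale first and exponentiating afterwards keeps the interval consistent with the per-minute results.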

Due Monday, November 18