Categorical Data Analysis

Homework 4: Logistic Regression

Patrick Oster

2019-03-15

(1) LI Data1 Pg 113-114 Exercise 4.1

A study investigated characteristics associated with y = whether a cancer patient achieved remission (1 = yes, 0 = no). An important explanatory variable was a labeling index (LI = percentage of โ€œlabeledโ€ cells) that measures proliferative activity of cells after a patient receives an injection of tritiated thymidine. Table 4.5 shows the data and R output for a logistic regression model.

# Importing LI Data
LI <- c(8,8,10,10,12,12,12,14,14,14,16,16,16,18,20,20,20,22,22,24,26,28,32,34,38,38,38)
y <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,1,1,0,1,1,1,0)
df <- data.frame(LI, y)
# Modeling
fit <- glm(y ~ LI, family = binomial, data = df)
gam.fit <- gam(y ~ s(LI), family = binomial, data = df)
# Summarizing
summary(fit)
## 
## Call:
## glm(formula = y ~ LI, family = binomial, data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9448  -0.6465  -0.4947   0.6571   1.6971  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -3.77714    1.37862  -2.740  0.00615 **
## LI           0.14486    0.05934   2.441  0.01464 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.372  on 26  degrees of freedom
## Residual deviance: 26.073  on 25  degrees of freedom
## AIC: 30.073
## 
## Number of Fisher Scoring iterations: 4
confint(fit)
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) -6.9951909 -1.4098443
## LI           0.0425232  0.2846668

a. Plot the jittered data (as on p.92) with the GAM and logistic models fits superimposed (and with a legend for the two plots in the top left corner of the plot) โ€“ use two distinct colors for the plots.

plot(jitter(y, 0.08) ~ LI, data = df)
curve(predict(gam.fit, data.frame(LI = x), type = "resp"), add = TRUE, col = "blue")
curve(predict(fit, data.frame(LI = x), type = "resp"), add = TRUE, col = "green")
legend(x = 7.5, y = .9, legend = c("Blue: GAM", "Green: logistic"))

# Attempting to overlay models on Jittered data with GGplot
#ggplot(df, aes(x = LI, y = y)) + 
#geom_smooth(method = "glm", data = fit, formula = y ~ LI, n = 27, aes(color = "red")) + 
#geom_smooth(method = "gam", data = gam.fit, formula = y ~ LI, n = 27, aes(color = "blue")) + 
#geom_jitter(width = .01, height = 0.01, inherit.aes = TRUE) 

b. Showing all needed calculations, find, report and interpret the estimated \(LD_{75}\) โ€“ the point on the x axis where the predicted response reaches 75% and clearly indicate the units.

findInt <- function(model, value){
  function(x) {
    predict(model, data.frame(LI = x), type = "resp") - value
  }
}
# LD 75
LD75 <- uniroot(findInt(fit, .75), range(df$LI))$root; LD75
## [1] 33.65764
# Updating Plot
plot(jitter(y, 0.08) ~ LI, data = df)
curve(predict(fit, data.frame(LI = x), type = "resp"), add = TRUE, col = "green")
abline(v = LD75, col = "red")
legend(x = 7.5, y = .9, legend = c("Green: logistic", "Red: LD75"))

The predicted response reaches 75% at a labeling index value of 33.6576408. That is, at an LI of 33.6576408 the estimated probability of a cancer patient achieving remission is 75%.

c. Showing all needed calculations, find, report and interpret the estimated \(LD_{10}\) โ€“ the point on the x axis where the predicted response reaches 10% and clearly indicate the units.

# LD 10
LD10 <- uniroot(findInt(fit, .1), range(df$LI))$root; LD10
## [1] 10.90626
plot(jitter(y, 0.08) ~ LI, data = df)
curve(predict(fit, data.frame(LI = x), type = "resp"), add = TRUE, col = "red")
abline(v = LD75, col = "red")
abline(v = LD10, col = "purple")
legend(x = 7.5, y = .9, legend = c("Green: logistic", "Red: LD75", "Purple: LD10"))

The predicted response reaches 10% at a labeling index value of 10.9062595. That is, at an LI of 10.9062595 the estimated probability of a cancer patient achieving remission is 10%.

(2) Crabs Data2 Pg 115 Exercise 4.8

Fit the logistic regression model for the probability of a satellite (y = 1) using x = weight as the sole explanatory variable.

# Importing Crabs Data
df <- read.table(file = "http://users.stat.ufl.edu/~aa/intro-cda/data/Crabs.dat(Table%203.2)",
                 header = TRUE)
kable(head(df, n = 5))
crab sat y weight width color spine
1 8 1 3.05 28.3 2 3
2 0 0 1.55 22.5 3 3
3 9 1 2.30 26.0 1 1
4 0 0 2.10 24.8 3 3
5 4 1 2.60 26.0 3 3
fit <- glm(y ~ weight, family = binomial, data = df); fit
## 
## Call:  glm(formula = y ~ weight, family = binomial, data = df)
## 
## Coefficients:
## (Intercept)       weight  
##      -3.695        1.815  
## 
## Degrees of Freedom: 172 Total (i.e. Null);  171 Residual
## Null Deviance:       225.8 
## Residual Deviance: 195.7     AIC: 199.7
s.fit <- summary(fit); s.fit
## 
## Call:
## glm(formula = y ~ weight, family = binomial, data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1108  -1.0749   0.5426   0.9122   1.6285  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.6947     0.8802  -4.198 2.70e-05 ***
## weight        1.8151     0.3767   4.819 1.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 195.74  on 171  degrees of freedom
## AIC: 199.74
## 
## Number of Fisher Scoring iterations: 4
c <- fit$coefficients

a. Report the ML prediction equation. At the mean weight value of 2.437 kg, give a linear approximation for the estimated effect of (i) a 1-kg increase in weight. This represents a relatively large increase, so convert this to the effect of (ii) a 0.10-kg increase, (iii) a standard deviation increase in weight (0.58 kg).

ML Prediction Equation
\[logit[\hat{\pi(x)}] = -3.6947264 + 1.8151446x\]
\[\hat{\pi(x)} = \frac{e^{-3.6947264 + 1.8151446x}}{1 + e^{-3.6947264 + 1.8151446x}}\]

i. Linear approximation for estimated effect of a 1kg increase

c[2]*1
##   weight 
## 1.815145

ii. Linear approximation for estimated effect of a 0.1kg increase

c[2]*0.1
##    weight 
## 0.1815145

iii. Linear approximation for estimated effect of a 0.58kg increase

c[2]*0.58
##   weight 
## 1.052784

b. Find and interpret the average marginal effect of weight per 0.10-kg increase.

m <- margins(fit); m
## Average marginal effects
## glm(formula = y ~ weight, family = binomial, data = df)
##  weight
##  0.3495
# Alternative function for average marginal effect
# logitmfx(fit, atmean = FALSE, data = df)

The average marginal effect of weight per 1kg increase is represented by 0.3495341, thus the average marginal effect of weight per .1kg increase is one tenth of that effect, 0.0349534. The average marginal effect of a .1kg increase in weight can be interpreted as the average change in the probability of satellite as weight increases by .1kg is 0.0349534.

c. Construct the classification table using the sample proportion of y = 1 as the cutoff. Report the sensitivity and specificity. Interpret.

# Sample proportion of y = 1
prop <- sum(df$y)/nrow(df); prop
## [1] 0.6416185
# Classification table
predicted <- as.numeric(fitted(fit) > prop)
tabs <- xtabs(~ df$y + predicted); tabs
##     predicted
## df$y  0  1
##    0 45 17
##    1 43 68
# Sensitivity P[y]
sens <- tabs[4]/(tabs[2]+tabs[4]); sens
## [1] 0.6126126
# Specificity
spef <- tabs[1]/(tabs[1]+tabs[3]); spef
## [1] 0.7258065

A sensitivity of 0.6126126 means that the probability of predicting a crab having a satellite using the model given that a crab truly has a satellite is 61.2612613%.

A specificity of 0.7258065 means that the probability of predicting a crab does not have a satellite using the model given that a crab truly does not have a satellite is 72.5806452%.

d. Construct an ROC curve, and report and interpret the area under it.

rocplot <- roc(y ~ fitted(fit), data = df)
plot.roc(rocplot, legacy.axes=FALSE)

AUC <- auc(rocplot); AUC
## Area under the curve: 0.7379
concordance <- AUC[1]

The area under the ROC curve is a measure of the predictive power of a model also known as the concordance index. This model has a concordance index of 0.7379396, indicating that the estimated probability that predictions and outcomes are concordant is 73.7939552%. A concordance index of 1 represents a model which perfectly predicts the presence of a satellite, whereas a concordance index of 0.5 is the equivalent of random guessing. The concordance index of this model using crab width as the only predictor indicates that it does a fair job of correctly predicting outcomes.

(3) MBTI Data3 Pg 116-117 Exercise 4.12

The MBTI data file cross-classifies a sample of people from the MBTI Step II National Sample on whether they report drinking alcohol frequently and on the four binary scales of the Myersโ€“Briggs personality test: Extroversion/Introversion (E/I), Sensing/iNtuitive (S/N), Thinking/Feeling (T/F) and Judging/Perceiving (J/P). The 16 predictor combinations correspond to the 16 personality types: ESTJ, ESTP, ESFJ, ESFP, ENTJ, ENTP, ENFJ, ENFP, ISTJ, ISTP, ISFJ, ISFP, INTJ, INTP, INFJ, INFP. (e.g., of the 77 people of type ESTJ, 13 reported smoking frequently.) Fit a model using the four scales as predictors of the probability of drinking alcohol frequently. Report the prediction equation, specifying how you set up the indicator variables. Based on the model parameter estimates, explain why the personality type with the highest estimated probability of drinking alcohol is ENTP.

df <- read.table(file = "http://users.stat.ufl.edu/~aa/intro-cda/data/MBTI.dat",
                 header = TRUE)
df$no <- df$n - df$drink; df$yes <- df$drink
kable(df)
EI SN TF JP smoke drink n no yes
e s t j 13 10 77 67 10
e s t p 11 8 42 34 8
e s f j 16 5 106 101 5
e s f p 19 7 79 72 7
e n t j 6 3 23 20 3
e n t p 4 2 18 16 2
e n f j 6 4 31 27 4
e n f p 23 15 80 65 15
i s t j 32 17 140 123 17
i s t p 9 3 52 49 3
i s f j 34 6 138 132 6
i s f p 29 4 106 102 4
i n t j 4 1 13 12 1
i n t p 9 5 35 30 5
i n f j 4 1 31 30 1
i n f p 22 6 79 73 6

b. Report the fitted/prediction equation and clearly indicate how you (or R) set up the 4 variables

Prediction Equation
\[logit[\hat{\pi(x)}] = -2.114047 + -0.5550115I + -0.4291508S + 0.6873349T + 0.2022295P\]
\[\hat{\pi(x)} = \frac{e^{-2.114047 + -0.5550115I + -0.4291508S + 0.6873349T + 0.2022295P}}{1 + e^{-2.114047 + -0.5550115I + -0.4291508S + 0.6873349T + 0.2022295P}}\]
By default, R set E, N, F & J to be built into the intercept term leaving estimated regression coefficients for I = 1, S = 1, T = 1 & P = 1.

c. Clearly/thoroughly explain why the highest predicted probability is associated with the ENTP run

It follows logically why the predicted probability of drinking alcohol frequently is associated with the ENTP run; the coefficients for I = S = 1 are negative while the coefficients for T = P = 1 are positive. When I = S = 0 and T = P = 1, the negative coefficients have no impact while the positive coefficients are impacting the predicted probability.

d. Using the prediction equation, give the predicted probability associated with the ENTP run showing all relevant work

\(\hat{\pi(ENTP)} = \frac{e^{-2.114047 + -0.5550115I + -0.4291508S + 0.6873349T + 0.2022295P}}{1 + e^{-2.114047 + -0.5550115I + -0.4291508S + 0.6873349T + 0.2022295P}}\)

\(\frac{e^{-2.114047 + -0.5550115(0) + -0.4291508(0) + 0.6873349(1) + 0.2022295(1)}}{1 + e^{-2.114047 + -0.5550115(0) + -0.4291508(0) + 0.6873349(1) + 0.2022295(1)}} = \frac{0.2939097}{1 + 0.2939097}\)

\(0.2271486\)

e. Using R (and providing your R program and output), provide a table (i.e., the R output) with each of the ๐Ÿ๐Ÿ” cases here as rows โ€“ include the raw data, the predicted probabilities (also called fitted values), and the ๐Ÿ—๐Ÿ“% confidence intervals for the predicted probabilities. Hint: see p.96.

pred.prob <- fitted(fit)
lp <- predict(fit, se.fit = TRUE)
LB <- lp$fit - 1.96*lp$se.fit
UB <- lp$fit + 1.96*lp$se.fit
LB.p <- exp(LB)/(1+exp(LB))
UB.p <- exp(UB)/(1+exp(UB))
out <- data.frame(df$EI, df$SN, df$TF, df$JP, 
                  pred.prob,
                  LB.p,
                  UB.p)
kable(out)
df.EI df.SN df.TF df.JP pred.prob LB.p UB.p
e s t j 0.1351860 0.0931166 0.1922332
e s t p 0.1606185 0.1046828 0.2384813
e s f j 0.0728848 0.0475777 0.1100969
e s f p 0.0877863 0.0569453 0.1329754
e n t j 0.1936115 0.1222293 0.2927765
e n t p 0.2271486 0.1504295 0.3278934
e n f j 0.1077390 0.0662294 0.1705142
e n f p 0.1287768 0.0873674 0.1858168
i s t j 0.0823472 0.0557648 0.1199910
i s t p 0.0989769 0.0633996 0.1512937
i s f j 0.0431812 0.0272289 0.0678276
i s f p 0.0523527 0.0329641 0.0821759
i n t j 0.1211352 0.0727629 0.1949056
i n t p 0.1443656 0.0911432 0.2211060
i n f j 0.0648240 0.0376312 0.1094325
i n f p 0.0782166 0.0500682 0.1201877

(4) Infection Data4 Pg 117-118 Exercise 4.14

Table 4.8 shows results of an eight-center clinical trial to compare a drug to placebo for curing an infection. At each center, subjectswere randomly assigned to treatments. Analyze these data, available in the Infection data file at the text website. Using logistic regression, describe and make inference about the treatment effect.

df <- read.table(file = "http://users.stat.ufl.edu/~aa/intro-cda/data/Infection.dat(Table%204.16)",
                 header = TRUE)
kable(head(df, n = 5))
center treat y n
1 1 11 36
1 0 10 37
2 1 16 20
2 0 22 32
3 1 14 19

a. Using the likelihood-based test, test whether one can accept that the odds ratios are equal across the 8 centers. Here, โ€˜oddsโ€™ refers to the Success as response comparing Drug and Control. Clearly give the hypotheses, test statistic, degrees of freedom, p-value, and your clear conclusion.

b. In the logistic model involving only center (as a factor) and treatment/drug (as a dummy variable), obtain and clearly interpret the estimated odds ratio associated with:

i. Center 2

ii. Center 4

iii. Treatment/lung

c. In the logistic model involving only center (as a factor) and treatment/drug (as a dummy variable), use the likelihood-based test to test whether treatment/drug is significant in this model. Give the hypotheses, test statistic, p-value, and your clear conclusion.

(5) Sore Throat Data5 Pg 118-119 Exercise 4.16

Table 4.10 shows results of a study about y = whether a patient having surgerywith general anesthesia experienced a sore throat on waking (1 = yes, 0 = no) as a function of d = duration of the surgery (in minutes) and t = type of device used to secure the airway (0 = laryngeal mask airway, 1 = tracheal tube).

df <- read.table(file = "http://users.stat.ufl.edu/~aa/intro-cda/data/SoreThroat.dat(Table%204.19)",
                 header = TRUE, 
                 skip = 5)
kable(head(df, n = 5), caption = "D = surgery duration (in min), T = type of device (0 = laryngeal mask airway, 1 = tracheal tube), Y = sore throat on waking (1=yes, 0=no)")

D = surgery duration (in min), T = type of device (0 = laryngeal mask airway, 1 = tracheal tube), Y = sore throat on waking (1=yes, 0=no)

D T Y
45 0 0
15 0 0
40 0 1
83 1 1
90 1 1

a. Fit a model permitting interaction between the explanatory variables. Report and interpret the prediction equation for the effect of d when (i) t = 1, (ii) t = 0. Conduct inference about whether you need the interaction term.

b. Compare the predictive power of models with and without the interaction term by finding the correlation R between the observed and fitted values for each model.

c. In the model with no interaction term (between d and t), obtain and clearly interpret the odds ratios for both d (duration) and t (type of device)

d. For the model with the d x t interaction term, use R to plot the fitted logistic curves for the t = 0 and the t = 1 cases juxtaposed on the same plot. Submit the code and the plot.

e. For the model without the d x t interaction term, use R to plot the fitted logistic curves for the t = 0 and the t = 1 cases juxtaposed on the same plot. Submit the code and the plot.

f. For the model in part (e), give the estimated LD_50_โ€™s for d in both the t = 0 and t = 1 cases.

(6) Relationship Between Age & Facebook Membership6 Pg 120 Exercise 4.22

You plan to study the relation between x = age and y = whether a member of Facebook (1 = yes, 0 = no). A priori, you predict that P(Y = 1) is currently about 0.80 at x = 18 and about 0.20 at x = 65. Assuming that the logistic regression model describes this relation well, approximate the value for the effect \(\beta\) of x in the model.