Homework 8 Basic Homework Solutions (5:30 PM - 6:15 PM)
- Homework 8
Bayesian model averaging of binary logistic regression models:
- Background talk on Model Uncertainty and Averaging for Categorical Data Analysis
  - Based on an American Statistical Association talk by Chris Franck (Virginia Tech, 6/21/22)
  - Chapter 21: Multinomial logistic regression (Andrews 2021 is a great guide)
Sleuth Chapter 22: Poisson regression
Frank Harrell’s Regression Modeling Strategies: Highlights
- Harrell’s Modeling principles
  - Comparing two models
Examples of non-linear models, first restricted cubic splines.
Plot the data
Save knot locations
rcspline.eval is the rcs workhorse
- Now fit seasonal patterns with a cosine function, 12-month period
Next add more nodes near intervention to allow for sudden change
Now make the slow trend simpler (6 knots) and more finely control times at which predictions are requested, to handle discontinuity.
Final plot
GPT-4 Sicilian cardiovascular event example.
Note only 3 classes left

Homework 8 Basic Homework Solutions (5:30 PM - 6:15 PM)

Homework 8

ex13.17: Dinosaur Extinctions - An Observational Study Skyla Agostono

ex13.20 : Gender Differences in Performance on Mathematics Acheivement Tests Kara Falvey

ex13.19 Nature-Nurture Anant Misra

ex13.21 & 13.22 Pygmalion old and new methods, with Master: Mixed Effects models Quinn Pham

Bayesian model averaging of binary logistic regression models:

Background talk on Model Uncertainty and Averaging for Categorical Data Analysis

Based on an American Statistical Association talk by Chris Franck (Virginia Tech, 6/21/22)

Franck Webinar, Slide 1.

Franck Webinar, Slide 2.

Overview of crab data * Data obtained from Alan Agresti’s Introduction to Categorical Data Analysis book (Agresti 2018). * The data measure whether or not a female horseshoe crab has additional male suitors beyond their current mate, essentially. The outcome is binary. * There are four candidate predictors including: shell width, weight, shell color, spine condition.

Franck Webinar, Slide 3.

The data contain four potential predictors

Model selection involves choosing single “best” model, and all subsequent inference is based on that model. Many selection approaches available.
There are $2^{p}$ possible models. Sixteen for the crab example. Model space grows exponentially with number of new predictors.
The above approach ignores the uncertainty in the model selection process. Model averaging is one way to sensibly conduct inference in the context of all available models.
In cases where p is large, the model space becomes too large to computationally assess every model. Algorithms such as Markov Chain Monte Carlo Model Composition ($MC^{2}$) can be used to search the space effectively. See (Hoeting et al. 1999) for a good overview.

#R demo for ADA SDNS webinar
#An Introduction to Model Uncertainty and Averaging for Categorical Data Analysis
#Chris Franck, June 21, 2022

#Import data
Crabs<-read.table("http://www.stat.ufl.edu/~aa/cat/data/Crabs.dat",
                  header=TRUE)

#Probability predictions for simple logistic regression
#maximum likelihood
grid<-seq(0,35,.1)
plot(Crabs$width,jitter(Crabs$y,.1),pch=16,ylab='Satellites?',xlab="Shell width")
fit<-glm(y~width,family=binomial,data=Crabs)
grid.frame<-data.frame(width=grid)
grid.pred<-predict.glm(fit,newdata=grid.frame,type='response')
lines(grid, grid.pred,lwd=2,col='red')

#plot the data
par(mfrow=c(2,2),las=1)
plot(Crabs$width,jitter(Crabs$y,.1),pch=16,ylab='Satellites?',xlab="Shell width")
plot(jitter(Crabs$color,.3),jitter(Crabs$y,.1),pch=16,ylab='Satellites?',xlab="Shell color")
plot(jitter(Crabs$spine,.3),jitter(Crabs$y,.1),pch=16,ylab='Satellites?',xlab="Spine condition")
plot(Crabs$weight,jitter(Crabs$y,.1),pch=16,ylab="Satellites?")

Multicollinearity is a concern

#A look at potential multicollinearity
y.j<-Crabs$y+rnorm(173,0,.1)
color.j<-Crabs$color+rnorm(173,0,.1)
spine.j<-Crabs$spine+rnorm(173,0,.1)
pairs(data.frame(y=y.j,width=Crabs$width, color=color.j, spine=spine.j, weight=Crabs$weight),
      pch=16)

Chapter 21: Multinomial logistic regression (Andrews 2021 is a great guide)

SHALLOW WATER OBJECT DETECTION, CHARACTERIZATION, AND LOCALIZATION THROUGH REFLECTIVITY BACKSCATTER FROM PHASE-MEASURING SIDESCAN SONAR by Bryan P. McCormack SFE M.Sc. 2022

Bryan McCormack 2022 M.Sc. Figure 7

Bryan McCormack 2022 M.Sc. Figure 1

Bryan McCormack 2022 M.Sc.

Sleuth Chapter 22: Poisson regression

$Sleuth3 Display 22.1$ Fit the Poisson Log-Linear Regression Model

Poisson Log-Linear Regression Model Sleuth3 p 677

Statistical Conclusion

The average number of successful matings increased with age, for male elephants bettween 25 and 50 years old

$Sleuth3 Display 22.2 2nd edition$

Sleuth3 Display 22.2 2nd edition

Sleuth3 Display 22.2 3rd edition Sleuth3 Display 22.2 Gallagher’s Matlab program Sleuth3 Display 22.2 OLS Regression with Gallagher’s Matlab program Sleuth3 Display 22.2 Adam Loy’s tidysleuth

The data provided no evidence of a leveling off of mating with age (quadratic term 1-sided p value = 0.33)
The estimated mean number of successful matings (in 8 years) for 27-year-old males was 1.31, and the mean increased by a factor of 2.00 for each 10 year increase in age up to about 50 years (95% confidence interval for the mulitplicative factor: 1.52 to 2.60)

Scope of Inference

There may be bias due to measurement error in estimating the age of each elephant and lack of independence.
Bias would also accrue if successful matings weren’t successfully attributed.
Attempting to apply to a larger population would requie careful attention to the method of sampling.

Sleuth 3 Code for Case Study 20.1

str(case2201)

## 'data.frame':    41 obs. of  2 variables:
##  $ Age    : int  27 28 28 28 28 29 29 29 29 29 ...
##  $ Matings: int  0 1 1 1 3 0 0 0 2 2 ...

attach(case2201)

EXPLORATION AND MODEL BUILDING

plot(Matings ~ Age,  log="y")

## Warning in xy.coords(x, y, xlabel, ylabel, log): 5 y values <= 0 omitted from
## logarithmic plot

ageSquared  <- Age^2
myGlm1 <- glm(Matings ~ Age + ageSquared, family=poisson)
summary(myGlm1)  # No evidence of a need for ageSquared

## 
## Call:
## glm(formula = Matings ~ Age + ageSquared, family = poisson)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060  3.0356383  -0.941    0.347
## Age          0.1359544  0.1580095   0.860    0.390
## ageSquared  -0.0008595  0.0020124  -0.427    0.669
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.27
## 
## Number of Fisher Scoring iterations: 5

INFERENCE AND INTERPRETATION

myGlm2  <- update(myGlm1, ~ . - ageSquared)
summary(myGlm2)

## 
## Call:
## glm(formula = Matings ~ Age, family = poisson)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## Age          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5

anova(myGlm2,myGlm1,test="Chisq") # May not be valid, many counts < 5

## Analysis of Deviance Table
## 
## Model 1: Matings ~ Age
## Model 2: Matings ~ Age + ageSquared
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        39     51.012                     
## 2        38     50.826  1  0.18544   0.6667

beta  <- myGlm2$coef
exp(beta[2])  #1.071107

##      Age 
## 1.071107

exp(confint(myGlm2,2))  #1.042558 1.100360

## Waiting for profiling to be done...

##    2.5 %   97.5 % 
## 1.042558 1.100360

Interpretation: Associated with each 1 year increase in age is a 7% increase in the mean number of matings (95% confidence interval 4% to 10% increase).

GRAPHICAL DISPLAY FOR PRESENTATION

plot(Matings ~ Age, ylab="Number of Successful Matings",
     xlab="Age of Male Elephant (Years)",
     main="Age and Number of Successful Matings for 41 African Elephants",
     pch=21, bg="green", cex=2, lwd=2)
dummyAge <- seq(min(Age),max(Age), length=50)
lp <- beta[1] + beta[2]*dummyAge
curve <- exp(lp)
lines(curve ~ dummyAge,lwd=2)

detach(case2201)

Adam Loy’s tidysleuth code

options(digits = 4, show.signif.stars = FALSE)

Plot the jittered data on a log scale

gf_jitter(log(Matings + 0.5) ~ Age, data = case2201, height = 0.25, width = 0.25, pch = 1) %>%
  gf_labs(x = "Age (years) -- Slightly Jittered", y = "Number of Matings (log scale)")

Fit the Poisson glm, with quadratic terms.

glm1 <- glm(Matings ~ Age + I(Age^2), data = case2201, family = poisson)
summary(glm1)

## 
## Call:
## glm(formula = Matings ~ Age + I(Age^2), family = poisson, data = case2201)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.85741    3.03564   -0.94     0.35
## Age          0.13595    0.15801    0.86     0.39
## I(Age^2)    -0.00086    0.00201   -0.43     0.67
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.3
## 
## Number of Fisher Scoring iterations: 5

summary(fitted(glm1))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.21    1.44    2.16    2.68    3.81    6.61

tidy(glm1)

## # A tibble: 3 × 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -2.86       3.04       -0.941   0.347
## 2 Age          0.136      0.158       0.860   0.390
## 3 I(Age^2)    -0.000860   0.00201    -0.427   0.669

Eliminate the quadratic term

reduced <- update(glm1, . ~ . - I(Age^2))
summary(reduced)

## 
## Call:
## glm(formula = Matings ~ Age, family = poisson, data = case2201)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.5820     0.5446    -2.9   0.0037
## Age           0.0687     0.0137     5.0  5.8e-07
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.5
## 
## Number of Fisher Scoring iterations: 5

Calculate Wald confidence interval

beta1 <- coef(reduced)[2]
se <- sqrt(vcov(reduced)[2,2])
beta1 + c(-1, 1) * qnorm(.975) * se

## [1] 0.04175 0.09563

exp(beta1 + c(-1, 1) * qnorm(.975) * se)

## [1] 1.043 1.100

Profile-likelihood confidence interval

confint(glm1)

## Waiting for profiling to be done...

##                 2.5 %   97.5 %
## (Intercept) -9.044944 2.893178
## Age         -0.163750 0.457734
## I(Age^2)    -0.004968 0.002948

Compare the full and reduced models

anova(reduced, glm1, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Matings ~ Age
## Model 2: Matings ~ Age + I(Age^2)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        39       51.0                     
## 2        38       50.8  1    0.185     0.67

Gallagher’s Matlab analysis of OLS, Poisson & Poisson quadratic

Case Study 22.2 Characteristics Associated with Salamander Habitat_–An Observational Study

The Del Norte Salamander (plethodon elongates) is a small (5-7 cm) salamander found among rock rubble, rock outcrops, and moss-covered talus in a narrow range of norwest California.
To study the relationship to old-growth forest, researchers selected 47 sites from plausible salamander habitat in national forest and parkland. At each site, a 7 m x 7 m search area was examined for the number of salamanders. Display 22.3 shows the number of salamanders and characteristics of the forest cover.

Sleuth2 Display 22.3

Statistical Conclusion
After accounting for canopy cover, there is no evidence that the mean number os salamanders depends on forest age ( p = 0.78 from F = 0.53 with 6 and 35 df from a quasilikelihood fit to a log-linear model).
Two distinct canopy cover conditions were evident: closed canopies with percentages > 70% and open canopy with percentages < 70%. The salamanders followed different curved abundance patterns under the two conditions.

Sleuth2 Display 22.4 Gallagher’s Matlab model display

Gallagher’s Matlab Quasilikelihood model display

Sleuth2 Display 22.5

Plot of Deviance residuals

Sleuth3 Display 22.7

Sleuth 3 Code for Case Study 20.2

EXPLORATION AND MODEL BUILDING

attach(case2202)

logSalamanders <- log(Salamanders + .5)
logForestAge   <- log(ForestAge + .5)
myMatrix       <- cbind(PctCover,logForestAge,logSalamanders)

if (require(car)) { # Use car library
  scatterplotMatrix(myMatrix, diagonal="histogram", reg.line=FALSE, spread=FALSE)
}

Fit the Poisson log-linear model

myGlm1  <- glm(Salamanders ~ PctCover + logForestAge + PctCover:logForestAge, 
               family=poisson)
summary(myGlm1)   # Backward elimination...

## 
## Call:
## glm(formula = Salamanders ~ PctCover + logForestAge + PctCover:logForestAge, 
##     family = poisson)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)           -0.96945    0.82093   -1.18   0.2376
## PctCover               0.02907    0.01118    2.60   0.0094
## logForestAge          -0.21311    0.29301   -0.73   0.4670
## PctCover:logForestAge  0.00196    0.00342    0.57   0.5674
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 190.22  on 46  degrees of freedom
## Residual deviance: 120.70  on 43  degrees of freedom
## AIC: 213.8
## 
## Number of Fisher Scoring iterations: 5

myGlm2 <- update(myGlm1, ~ . - PctCover:logForestAge)
summary(myGlm2)

## 
## Call:
## glm(formula = Salamanders ~ PctCover + logForestAge, family = poisson)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.37255    0.49574   -2.77   0.0056
## PctCover      0.03443    0.00653    5.27  1.3e-07
## logForestAge -0.05470    0.10035   -0.55   0.5857
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 190.22  on 46  degrees of freedom
## Residual deviance: 121.01  on 44  degrees of freedom
## AIC: 212.1
## 
## Number of Fisher Scoring iterations: 5

myGlm3  <- update(myGlm2, ~ . - logForestAge)
summary(myGlm3)   # PctCover is the only explanatory variable remaining

## 
## Call:
## glm(formula = Salamanders ~ PctCover, family = poisson)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.48196    0.45578   -3.25   0.0011
## PctCover     0.03241    0.00539    6.01  1.8e-09
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 190.22  on 46  degrees of freedom
## Residual deviance: 121.31  on 45  degrees of freedom
## AIC: 210.4
## 
## Number of Fisher Scoring iterations: 5

Plot the penultimate model

plot(Salamanders ~ PctCover)  # It appears that there are 2 distributions

# of Salamander counts; one for PctCover < 70 and one for PctCover > 70

See if PctCover is associated Salamanders in each subset

myGlm4 <- glm(Salamanders ~ PctCover, family=poisson,subset=(PctCover > 70))
summary(myGlm4)           # No evidence of an effect for this subset

## 
## Call:
## glm(formula = Salamanders ~ PctCover, family = poisson, subset = (PctCover > 
##     70))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.8240     2.0652   -0.88     0.38
## PctCover      0.0364     0.0236    1.54     0.12
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 105.46  on 27  degrees of freedom
## Residual deviance: 102.98  on 26  degrees of freedom
## AIC: 177.4
## 
## Number of Fisher Scoring iterations: 6

myGlm5 <- glm(Salamanders ~ PctCover, family=poisson,subset=(PctCover < 70))
summary(myGlm5)           # No evidence on this subset either

## 
## Call:
## glm(formula = Salamanders ~ PctCover, family = poisson, subset = (PctCover < 
##     70))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.96968    0.55081   -1.76    0.078
## PctCover     0.00561    0.02180    0.26    0.797
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 16.613  on 18  degrees of freedom
## Residual deviance: 16.548  on 17  degrees of freedom
## AIC: 35.16
## 
## Number of Fisher Scoring iterations: 6

INFERENCE (2 means)

Group <- ifelse(PctCover > 70,"High","Low")
Group <- factor(Group, levels=c("Low","High") )  # Make "Low Cover" the ref group
myGlm6 <- glm(Salamanders ~ Group, family=poisson)
summary(myGlm6)

## 
## Call:
## glm(formula = Salamanders ~ Group, family = poisson)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -0.865      0.354   -2.45    0.014
## GroupHigh      2.215      0.366    6.05  1.5e-09
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 190.22  on 46  degrees of freedom
## Residual deviance: 122.07  on 45  degrees of freedom
## AIC: 211.1
## 
## Number of Fisher Scoring iterations: 5

GRAPHICAL DISPLAY FOR PRESENTATION

plot(Salamanders ~ PctCover, ylab="Number of Salamanders",
     xlab="Percentage of Canopy Covered",
     main="Number of Salamanders versus Percent Canopy Cover",
     pch=21,bg="green", cex=2, lwd=2)

beta <- myGlm6$coef
lines(c(0,55),exp(c(beta[1],beta[1])),lwd=2)
text(56,exp(beta[1]),paste("mean= ",round(exp(beta[1]),1)),adj=0)
lines(c(76,93),exp(c(beta[1]+beta[2],beta[1]+beta[2])),lwd=2)
text(56,exp(beta[1]+beta[2]),paste("mean=",round((beta[1]+beta[2]),1)),adj=-1)

detach(case2202)

What a poor R graphic

Gallagher’s Matlab Quasilikelihood model display

Adam Loy’s (Carleton) tidySleuth case 22.2 code

options(digits = 4, show.signif.stars = FALSE)
summary(case2202)

## EXPLORATION AND MODEL BUILDING

logSalamanders <- log(Salamanders + .5)
logForestAge   <- log(ForestAge + .5)
myMatrix       <- cbind(PctCover,logForestAge,logSalamanders)
if (require(car)) { # Use car library
  scatterplotMatrix(myMatrix, diagonal="histogram", reg.line=FALSE, spread=FALSE)
}

_Fit the Poisson glm__

myGlm1  <- glm(Salamanders ~ PctCover + logForestAge + PctCover:logForestAge, 
               family=poisson)
summary(myGlm1)   # Backward elimination...
myGlm2 <- update(myGlm1, ~ . - PctCover:logForestAge)
summary(myGlm2)
myGlm3  <- update(myGlm2, ~ . - logForestAge)
summary(myGlm3)   # PctCover is the only explanatory variable remaining

Plot the data

plot(Salamanders ~ PctCover)  # It appears that there are 2 distributions

Number of Salamander counts; one for PctCover < 70 and one for PctCover > 70

See if PctCover is associated Salamanders in each subset

myGlm4 <- glm(Salamanders ~ PctCover, family=poisson,subset=(PctCover > 70))
summary(myGlm4)           # No evidence of an effect for this subset
myGlm5 <- glm(Salamanders ~ PctCover, family=poisson,subset=(PctCover < 70))
summary(myGlm5)           # No evidence on this subset either

INFERENCE (2 means)

Group <- ifelse(PctCover > 70,"High","Low")
Group <- factor(Group, levels=c("Low","High") )  # Make "Low Cover" the ref group
myGlm6 <- glm(Salamanders ~ Group, family=poisson)
summary(myGlm6)

# Using 70% as the dividing line
case2202$Closed <- ifelse(case2202$PctCover > 70,"closed", "open")

ssom <- glm(Salamanders ~ PctCover * ForestAge * Closed + I(PctCover^2) + I(ForestAge^2) + 
              I(PctCover^2):Closed + I(ForestAge^2):Closed, 
            data = case2202, family = poisson)
summary(ssom)

1 - pchisq(89.178, df = 35)

gf_point(residuals(ssom, type = "deviance") ~ fitted(ssom)) %>%
  gf_hline(yintercept = 0, color = "blue") %>%
  gf_hline(yintercept = 2, color = "gray60", linetype = 2)  %>%
  gf_hline(yintercept = -2, color = "gray60", linetype = 2) %>%
  gf_labs(x = "Fitted means", y = "Deviance residual")

# The full model via quasi-likelihood
ssom <- update(ssom, . ~ ., family = quasipoisson)

# The reduced (inferential) model
inferential_model <- glm(Salamanders ~ PctCover * Closed + I(PctCover^2) + 
                           I(PctCover^2):Closed, data = case2202, family = quasipoisson)

# drop-in-deviance F test
anova(inferential_model, ssom, test = "F")

Plot the model

aug <- augment(inferential_model)
gf_point(Salamanders ~ PctCover, data = aug) %>%
  gf_line(exp(.fitted) ~ PctCover, color = ~Closed) %>%
  gf_labs(x = "Percentage of Canopy Cover",
          y = "Salamander Count")

tidySleuth Case Study 22.2

RScs2202 Final Model from Matlab Gallagher’s Matlab Quasilikelihood model display

Melanie Jetter’s Poisson Regression of New England Tornados

Jetter Poisson Regression EEOS611

Jetter Poisson Regression Model EEOS611

Another example of the need for negative binomial glms

VerHoef Seals 2007

Another Case Study Migrating Elk, a Poisson regression with minutes spent vigilant as the response variable.

Robinson et al. (2013) AIC

Frank Harrell’s Regression Modeling Strategies: Highlights

Harrell’s Modeling principles

What can keep a sample of data from being appropriate for modeling

Most important predictor or response variables not collected
Subjects in the dataset are ill-defined or not representative of the population to which inferences are needed
Data collection sites do not represent the population of sites
Key variables missing in large numbers of subjects
Data not missing at random
No operational definitions for key variables and/or measurement errors severe
No observer variability studies done

What else can go wrong in modeling?

The process generating the data is not stable.
The model is misspecified with regard to nonlinearities or interactions, or there are predictors missing.
The model is misspecified in terms of the transformation of the response variable or the model’s distributional assumptions.
The model contains discontinuities (e.g., by categorizing continuous predictors or setting regression shapes with sudden changes) that can be gamed by users.
Correlations among subjects are not specified, or the correlation structure is misspecified, resulting in inefficient parameter estimates and overconfident inference.
The model is overfitted, resulting in predictions that are too extreme or positive associations that are false.
The user of the model relies on predictions obtained by extrapolating to combinations of predictor values well outside the range of the dataset used to develop the model.
Accurate and discriminating predictions can lead to behavior changes that make future predictions inaccurate.

Comparing two models

Level playing field (independent datasets, same number of candidate d.f., careful bootstrapping)

Criteria
1. calibration
2. discrimination
3. face validity
4. measurement errors in required predictors
5. use of continuous predictors (which are usually better defined than categorical ones)
6. omission of “insignicant”variables that nonetheless make sense as risk factors
7. simplicity (though this is less important with the availability of computers)
8. lack of fit for specific types of subjects

** Preferred Modeling Strategy in a Nutshell**

Decide how many d.f. can be spent
Decide where to spend them
Spend them
Don’t reconsider, especially if inference needed

Examples of non-linear models, first restricted cubic splines.

Harrell’s Titanic analysis using binomial logistic regression and rcs

Harrell’s Titanic Analysis

Obtain and Inspect the Titanic Data

## Gallagher changed these 3 statements
#titanic_data <- getHdata(titanic3)
#str(titanic_data)
#summary(titanic_data)
getHdata(titanic3)
str(titanic3)

## 'data.frame':    1309 obs. of  14 variables:
##  $ pclass   : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
##  $ survived : 'labelled' int  1 1 0 0 0 1 1 0 1 0 ...
##   ..- attr(*, "label")= chr "Survived"
##  $ name     : 'labelled' chr  "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Crei" ...
##   ..- attr(*, "label")= chr "Name"
##  $ sex      : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age      : 'labelled' num  29 0.917 2 30 25 ...
##   ..- attr(*, "units")= chr "Year"
##   ..- attr(*, "label")= chr "Age"
##  $ sibsp    : 'labelled' int  0 1 1 1 1 0 1 0 2 0 ...
##   ..- attr(*, "label")= chr "Number of Siblings/Spouses Aboard"
##  $ parch    : 'labelled' int  0 2 2 2 2 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "Number of Parents/Children Aboard"
##  $ ticket   : 'labelled' chr  "24160" "113781" "113781" "113781" ...
##   ..- attr(*, "label")= chr "Ticket Number"
##  $ fare     : 'labelled' num  211 152 152 152 152 ...
##   ..- attr(*, "units")= chr "British Pound (\\243)"
##   ..- attr(*, "label")= chr "Passenger Fare"
##  $ cabin    : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
##  $ embarked : Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ boat     : Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
##  $ body     : 'labelled' int  NA NA NA 135 NA NA NA NA NA 22 ...
##   ..- attr(*, "label")= chr "Body Identification Number"
##  $ home.dest: 'labelled' chr  "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
##   ..- attr(*, "label")= chr "Home/Destination"

summary(titanic3)

##  pclass       survived         name               sex           age        
##  1st:323   Min.   :0.000   Length:1309        female:466   Min.   : 0.167  
##  2nd:277   1st Qu.:0.000   Class :labelled    male  :843   1st Qu.:21.000  
##  3rd:709   Median :0.000   Mode  :character                Median :28.000  
##            Mean   :0.382                                   Mean   :29.881  
##            3rd Qu.:1.000                                   3rd Qu.:39.000  
##            Max.   :1.000                                   Max.   :80.000  
##                                                            NA's   :263     
##      sibsp           parch          ticket               fare      
##  Min.   :0.000   Min.   :0.000   Length:1309        Min.   :  0.0  
##  1st Qu.:0.000   1st Qu.:0.000   Class :labelled    1st Qu.:  7.9  
##  Median :0.000   Median :0.000   Mode  :character   Median : 14.5  
##  Mean   :0.499   Mean   :0.385                      Mean   : 33.3  
##  3rd Qu.:1.000   3rd Qu.:0.000                      3rd Qu.: 31.3  
##  Max.   :8.000   Max.   :9.000                      Max.   :512.3  
##                                                     NA's   :1      
##              cabin             embarked        boat          body     
##                 :1014   Cherbourg  :270          :823   Min.   :  1   
##  C23 C25 C27    :   6   Queenstown :123   13     : 39   1st Qu.: 72   
##  B57 B59 B63 B66:   5   Southampton:914   C      : 38   Median :155   
##  G6             :   5   NA's       :  2   15     : 37   Mean   :161   
##  B96 B98        :   4                     14     : 33   3rd Qu.:256   
##  C22 C26        :   4                     4      : 31   Max.   :328   
##  (Other)        : 271                     (Other):308   NA's   :1188  
##   home.dest        
##  Length:1309       
##  Class :labelled   
##  Mode  :character  
##                    
##                    
##                    
##

Impute Missing Values

set.seed(123)  # for reproducibility
## Author changed this line:
# impute_data <- aregImpute(~ survived + sex + age + pclass, data = titanic_data, n.impute = 5)
imp <- aregImpute(~ survived + sex + age + pclass, data = titanic3, n.impute = 5)

## Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5

Fit the Binary Logistic Regression Model

dd <- datadist(titanic3)
options(datadist="dd")
fit <- fit.mult.impute(survived ~ (sex + pclass + rcs(age,5))^2,lrm, 
                       imp, data=titanic3, pr=FALSE)

Display the odds ratios of survival and p-values for effects

summary(fit)

##              Effects              Response : survived 
## 
##  Factor            Low High Diff. Effect  S.E.   Lower 0.95 Upper 0.95
##  age               21  39   18    -0.7211 0.4921 -1.6856    0.2433    
##   Odds Ratio       21  39   18     0.4862     NA  0.1853    1.2755    
##  sex - female:male  2   1   NA     1.3399 0.3930  0.5696    2.1102    
##   Odds Ratio        2   1   NA     3.8187     NA  1.7676    8.2500    
##  pclass - 1st:3rd   3   1   NA     1.1200 0.4415  0.2547    1.9853    
##   Odds Ratio        3   1   NA     3.0649     NA  1.2901    7.2814    
##  pclass - 2nd:3rd   3   2   NA    -0.5871 0.4390 -1.4476    0.2735    
##   Odds Ratio        3   2   NA     0.5560     NA  0.2351    1.3145    
## 
## Adjusted to: sex=male pclass=3rd age=28

print(anova(fit)) # To test for the significance of the interactions.

##                 Wald Statistics          Response: survived 
## 
##  Factor                                      Chi-Square d.f. P     
##  sex  (Factor+Higher Order Factors)          239.24      7   <.0001
##   All Interactions                            49.36      6   <.0001
##  pclass  (Factor+Higher Order Factors)       120.11     12   <.0001
##   All Interactions                            38.92     10   <.0001
##  age  (Factor+Higher Order Factors)           30.86     16   0.0140
##   All Interactions                            11.50     12   0.4869
##   Nonlinear (Factor+Higher Order Factors)     12.66     12   0.3940
##  sex * pclass  (Factor+Higher Order Factors)  28.86      2   <.0001
##  sex * age  (Factor+Higher Order Factors)      4.45      4   0.3488
##   Nonlinear                                    3.77      3   0.2873
##   Nonlinear Interaction : f(A,B) vs. AB        3.77      3   0.2873
##  pclass * age  (Factor+Higher Order Factors)   7.66      8   0.4674
##   Nonlinear                                    7.33      6   0.2911
##   Nonlinear Interaction : f(A,B) vs. AB        7.33      6   0.2911
##  TOTAL NONLINEAR                              12.66     12   0.3940
##  TOTAL INTERACTION                            54.69     14   <.0001
##  TOTAL NONLINEAR + INTERACTION                57.70     17   <.0001
##  TOTAL                                       274.06     21   <.0001

Create Predicted Probabilities for Each Passenger Class

#newdata1 <- expand.grid(age = seq(min(titanic_data$age, na.rm = TRUE), 
#                                  max(titanic_data$age, na.rm = TRUE), 
#                                  length = 100), 
#                        sex = c("male", "female"), 
#                        pclass = "1st")
## Author replaced this statement with Harrell's Predict
# newdata1$predict <- predict(fit, newdata = newdata1, type = "fitted", se.fit = TRUE)
pred <- Predict(fit, age = seq(0, 80, length = 100), sex = c('male', 'female'),
                pclass = c('1st','2nd','3rd'), fun=plogis)

Generate Plots

pred_df <- as.data.frame(pred)

pclass1_df <- subset(pred_df, pclass == "1st")
pclass2_df <- subset(pred_df, pclass == "2nd")
pclass3_df <- subset(pred_df, pclass == "3rd")
plot1 <- ggplot(pclass1_df, aes(x = age, y = yhat, color = sex)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  labs(x = "Age", y = "Probability of Survival", color = "Sex", title = "Pclass 1") +
  theme_minimal()

plot2 <- ggplot(pclass2_df, aes(x = age, y = yhat, color = sex)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  labs(x = "Age", y = "Probability of Survival", color = "Sex", title = "Pclass 2") +
  theme_minimal()

plot3 <- ggplot(pclass3_df, aes(x = age, y = yhat, color = sex)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  labs(x = "Age", y = "Probability of Survival", color = "Sex", title = "Pclass 3") +
  theme_minimal()

# Arrange the plots vertically in a column
grid.arrange(plot1, plot2, plot3, nrow = 3)

Gallagher’s Donner Party Analysis with Restricted Cubic Splines & GAMs

# Read the data from Gallagher's github site (public has read, not edit access)

Donner <- read.csv("https://raw.githubusercontent.com/EugeneGall/donner-data-analysis/main/Donner.csv")

# Calculate family size based on Last_Name, but Family_Group_Size from Grayson
# (1990, 2018) Table will be used in this code's analyses.
Donner$Family_Size <- as.integer(ave(Donner$Last_Name, Donner$Last_Name, FUN =length))

str(Donner)

## 'data.frame':    87 obs. of  13 variables:
##  $ ID               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age              : num  23 13 1 5 14 40 51 9 3 7 ...
##  $ Sex              : chr  "Male" "Male" "Female" "Male" ...
##  $ Status           : chr  "Died" "Survived" "Survived" "Survived" ...
##  $ Last_Name        : chr  "Antonio" "Breen" "Breen" "Breen" ...
##  $ First_Name       : chr  "" "Edward" "Isabella" "James" ...
##  $ Family_Group_Size: int  1 9 9 9 9 9 9 9 9 9 ...
##  $ First_Snow       : chr  "1846-10-28" "1846-10-28" "1846-10-28" "1846-10-28" ...
##  $ Last_Date        : chr  "1846-12-25" "1847-04-29" "1847-04-29" "1847-04-29" ...
##  $ Delete           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Employee         : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Comments         : chr  "Cattleherder" "Familysurvivedcompletely" "Familysurvivedcompletely" "Familysurvivedcompletely" ...
##  $ Family_Size      : int  1 9 9 9 9 9 9 9 9 9 ...

# This produced slightly different Family Sizes than Grayson (1990 Table 1, 
# 2018 Table 2.1. Used Grayson (1990, 2018) Family Group Sizes with included
# family links using data from Stewart's (1960, p 363-364) Donner Party Roster.

# Convert Status to binary
Donner$Status <- ifelse(Donner$Status == "Survived", 1, 0)

# Calculate the Survival_Time which is Death Date or last rescue (Lewis 
# Keseberg 4/29/1847) - First_Snow, October 28 1846.
# Convert the character strings to Date objects
Donner$First_Snow_Date <- as.Date(Donner$First_Snow, format="%Y-%m-%d")
Donner$Last_Date_Date <- as.Date(Donner$Last_Date, format="%Y-%m-%d")

# Calculate the difference in days for individual survivorship
Donner$Survival_Time <- as.numeric(Donner$Last_Date_Date - Donner$First_Snow_Date)

# View the first few rows of the data frame to confirm the results
head(Donner)

##   ID Age    Sex Status Last_Name First_Name Family_Group_Size First_Snow
## 1  1  23   Male      0   Antonio                            1 1846-10-28
## 2  2  13   Male      1     Breen     Edward                 9 1846-10-28
## 3  3   1 Female      1     Breen   Isabella                 9 1846-10-28
## 4  4   5   Male      1     Breen      James                 9 1846-10-28
## 5  5  14   Male      1     Breen       John                 9 1846-10-28
## 6  6  40 Female      1     Breen       Mary                 9 1846-10-28
##    Last_Date Delete Employee                 Comments Family_Size
## 1 1846-12-25      0        1             Cattleherder           1
## 2 1847-04-29      0        0 Familysurvivedcompletely           9
## 3 1847-04-29      0        0 Familysurvivedcompletely           9
## 4 1847-04-29      0        0 Familysurvivedcompletely           9
## 5 1847-04-29      0        0 Familysurvivedcompletely           9
## 6 1847-04-29      0        0 Familysurvivedcompletely           9
##   First_Snow_Date Last_Date_Date Survival_Time
## 1      1846-10-28     1846-12-25            58
## 2      1846-10-28     1847-04-29           183
## 3      1846-10-28     1847-04-29           183
## 4      1846-10-28     1847-04-29           183
## 5      1846-10-28     1847-04-29           183
## 6      1846-10-28     1847-04-29           183

### Delete 8 cases for individuals who died before the first Snowstorm
# on 1846-10-28. This will reduce the number of cases to 87 - 8 = 79

Donner <- Donner %>%
  filter(Delete != 1)
# View the structure to confirm the filtering
str(Donner)

## 'data.frame':    79 obs. of  16 variables:
##  $ ID               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age              : num  23 13 1 5 14 40 51 9 3 7 ...
##  $ Sex              : chr  "Male" "Male" "Female" "Male" ...
##  $ Status           : num  0 1 1 1 1 1 1 1 1 1 ...
##  $ Last_Name        : chr  "Antonio" "Breen" "Breen" "Breen" ...
##  $ First_Name       : chr  "" "Edward" "Isabella" "James" ...
##  $ Family_Group_Size: int  1 9 9 9 9 9 9 9 9 9 ...
##  $ First_Snow       : chr  "1846-10-28" "1846-10-28" "1846-10-28" "1846-10-28" ...
##  $ Last_Date        : chr  "1846-12-25" "1847-04-29" "1847-04-29" "1847-04-29" ...
##  $ Delete           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Employee         : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Comments         : chr  "Cattleherder" "Familysurvivedcompletely" "Familysurvivedcompletely" "Familysurvivedcompletely" ...
##  $ Family_Size      : int  1 9 9 9 9 9 9 9 9 9 ...
##  $ First_Snow_Date  : Date, format: "1846-10-28" "1846-10-28" ...
##  $ Last_Date_Date   : Date, format: "1846-12-25" "1847-04-29" ...
##  $ Survival_Time    : num  58 183 183 183 183 183 183 183 183 183 ...

# Prepare data for rms, Harrell's 'Regression Modeling Strategies' These
# two statements are required for the summary of the Harrell's Glm
ddist <- datadist(Donner)

## Warning in datadist(Donner): Delete is constant

## Warning in datadist(Donner): First_Snow_Date is constant

options(datadist = "ddist")

# Fit the model with Age, Sex, and Grayson's (1990, 2018) Family_Group_Size
# In deciding on knot size, I used the rule that the minimum knot size should
# be chosen (rcs has a min of 3) unless there is another larger number of knots
# with an AIC which is more than than 4 AIC units lower (Burnham & Anderson
# 2004, p 271)

# Two methods will be coded for finding the appropriate number of knots:
# 1) Brute force by manually plugging in different knot numbers in Glm
# 2) An optimization routine based on AIC

##### First approach: brute force fitting to find minimal AICs.
# 3 is the minimum knot size permitted with Harrell's rcs function

# Fit Model 1:    Status ~ rcs(Age, 3) * Sex
Mod1  <- glm(Status ~ rcs(Age,3) * Sex, data = Donner, family = binomial(), x = TRUE, y = TRUE)

# Need Harrell's Glm for effect sizes and graphic with rms::Predict
mod1  <- Glm(Status ~ rcs(Age,3) * Sex, data = Donner, family = binomial(), x = TRUE, y = TRUE)
# Where are the knots located?
k <- attr (rcs(Donner$Age,3) ,'parms')
k

## [1]  1 14 46

# knots located at Ages 1, 14, and 46
# Null model
mod_null <- glm(Status ~ 1, data = Donner, family = binomial())
# Additive model
mod2  <- Glm(Status ~ rcs(Age,3) + Sex, data = Donner, family = binomial(), x = TRUE, y = TRUE)
Mod2  <- glm(Status ~ rcs(Age,3) + Sex, data = Donner, family = binomial(), x = TRUE, y = TRUE)
# Test the models
chi_sq_n1 <- anova(mod_null, Mod1, test="Chisq")
# Print the results
print(chi_sq_n1)

## Analysis of Deviance Table
## 
## Model 1: Status ~ 1
## Model 2: Status ~ rcs(Age, 3) * Sex
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        78      109.2                     
## 2        73       83.4  5     25.8  9.8e-05

# Summary and ANOVA for Mod1
# Test whether the interaction model adds value.
chi_sq_21 <- anova(Mod2, Mod1, test="Chisq")
# Print the results
print(chi_sq_21)

## Analysis of Deviance Table
## 
## Model 1: Status ~ rcs(Age, 3) + Sex
## Model 2: Status ~ rcs(Age, 3) * Sex
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        75       92.7                     
## 2        73       83.4  2     9.24   0.0098

# Summaries and effect sizes of the Age(rcs4) * Sex model
summary(Mod1)

## 
## Call:
## glm(formula = Status ~ rcs(Age, 3) * Sex, family = binomial(), 
##     data = Donner, x = TRUE, y = TRUE)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)               -0.836      0.811   -1.03   0.3026
## rcs(Age, 3)Age             0.391      0.143    2.73   0.0064
## rcs(Age, 3)Age'           -0.938      0.336   -2.79   0.0052
## SexMale                    0.853      1.076    0.79   0.4275
## rcs(Age, 3)Age:SexMale    -0.403      0.158   -2.55   0.0106
## rcs(Age, 3)Age':SexMale    0.897      0.359    2.50   0.0124
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109.201  on 78  degrees of freedom
## Residual deviance:  83.408  on 73  degrees of freedom
## AIC: 95.41
## 
## Number of Fisher Scoring iterations: 5

anova(Mod1)

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Status
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                               78      109.2         
## rcs(Age, 3)      2     8.10        76      101.1   0.0174
## Sex              1     8.45        75       92.7   0.0037
## rcs(Age, 3):Sex  2     9.24        73       83.4   0.0098

AIC(Mod1)

## [1] 95.41

#Summary of Brute force fitting of AICs for interaction model
#   mod1     AIC
# 3 knots  95.40752 * Chosen because within 4 of lowest AIC
# 4 knots  92.3582 ** Not chosen because only 3.04932 less than 3 knot solution
# 5 knots  92.88625 
# 6 knots Apparently Singular Matrix, no estimation possible

# Fit Family_Group_Size fit with base glm and Harrell's rms::Glm
Mod3 <- glm(Status ~ rcs(Family_Group_Size,5), data = Donner, 
            family = binomial(), x = TRUE, y = TRUE)
mod3 <- Glm(Status ~ rcs(Family_Group_Size,5), data = Donner, 
            family = binomial(), x = TRUE, y = TRUE)
# Null model
# Test the models
chi_sq_n2 <- anova(mod_null, Mod3, test="Chisq")
# Print the results
print(chi_sq_n2)

## Analysis of Deviance Table
## 
## Model 1: Status ~ 1
## Model 2: Status ~ rcs(Family_Group_Size, 5)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        78      109.2                     
## 2        74       90.4  4     18.8  0.00087

# Summary and ANOVA for mod
summary(Mod3)

## 
## Call:
## glm(formula = Status ~ rcs(Family_Group_Size, 5), family = binomial(), 
##     data = Donner, x = TRUE, y = TRUE)
## 
## Coefficients:
##                                               Estimate Std. Error z value
## (Intercept)                                     -1.927      0.985   -1.95
## rcs(Family_Group_Size, 5)Family_Group_Size       0.542      0.384    1.41
## rcs(Family_Group_Size, 5)Family_Group_Size'      0.783      2.430    0.32
## rcs(Family_Group_Size, 5)Family_Group_Size''    -3.640      4.762   -0.76
## rcs(Family_Group_Size, 5)Family_Group_Size'''   63.836     36.130    1.77
##                                               Pr(>|z|)
## (Intercept)                                      0.051
## rcs(Family_Group_Size, 5)Family_Group_Size       0.158
## rcs(Family_Group_Size, 5)Family_Group_Size'      0.747
## rcs(Family_Group_Size, 5)Family_Group_Size''     0.445
## rcs(Family_Group_Size, 5)Family_Group_Size'''    0.077
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109.201  on 78  degrees of freedom
## Residual deviance:  90.414  on 74  degrees of freedom
## AIC: 100.4
## 
## Number of Fisher Scoring iterations: 4

anova(Mod3)

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Status
## 
## Terms added sequentially (first to last)
## 
## 
##                           Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                                         78      109.2         
## rcs(Family_Group_Size, 5)  4     18.8        74       90.4  0.00087

summary(mod3)  # For effect size

##              Effects              Response : Status 
## 
##  Factor            Low High Diff. Effect  S.E.   Lower 0.95 Upper 0.95
##  Family_Group_Size 4   13   9     -0.7111 0.7485 -2.203     0.7803

AIC(mod3)

## [1] 100.4

AIC(Mod3)

## [1] 100.4

mod4 <- Glm(Status ~ Age + Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)
Mod4 <- glm(Status ~ Age + Sex + rcs(Family_Group_Size,4), data = Donner, family = binomial(), x = TRUE, y = TRUE)
# for mod5, knot sizes of 3 & 5 determined by k-fold cross validation, code below
mod5 <- Glm(Status ~ rcs(Age,3) * Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)
Mod5 <- glm(Status ~ rcs(Age,3) * Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)

##### Second approach: GPT4 Streamlined the above code block with functions:

# Function to fit models with one rcs term, updated on 8/17/23 by GPT-4
fit_best_rcs <- function(formula_template, data, knot_range) {
  
  aic_values <- numeric(length(knot_range))
  models <- vector("list", length(knot_range))
  
  # Calculate AIC for each knot size
  for (k in knot_range) {
    formula <- as.formula(gsub("KNOTS", k, formula_template))
    model <- glm(formula, data = data, family = binomial())
    aic_values[k - knot_range[1] + 1] <- AIC(model)
    models[[k - knot_range[1] + 1]] <- model
  }
  
  min_aic <- min(aic_values)
  best_knots <- knot_range[which.min(aic_values)]
  
  # Iterate through the aic_values and apply the rules
  for (i in seq_along(aic_values)) {
    if (aic_values[i] - min_aic <= 4 && knot_range[i] < best_knots) {
      best_knots <- knot_range[i]
    }
  }
  
  return(list(model = models[[which(knot_range == best_knots)]], knots = best_knots, results = setNames(aic_values, knot_range)))
}

# Function to fit model with two rcs terms, updated by GPT-4 on 8/17/23
fit_best_double_rcs <- function(formula, data, knot_range, family = binomial()) {
  best_aic <- Inf
  best_k_age <- max(knot_range)
  best_k_fgs <- max(knot_range)
  aic_matrix <- matrix(NA, nrow = length(knot_range), ncol = length(knot_range), dimnames = list(knot_range, knot_range))
  
  for (k_age in knot_range) {
    for (k_fgs in knot_range) {
      tryCatch({
        current_formula <- as.formula(
          gsub("knots_age", as.character(k_age), 
               gsub("knots_fgs", as.character(k_fgs), formula)))
        current_model <- glm(current_formula, data = data, family = family)
        current_aic <- AIC(current_model)
        
        aic_matrix[as.character(k_age), as.character(k_fgs)] <- current_aic
        
        # Update if current AIC is smaller than the best so far
        if (current_aic < best_aic) {
          best_aic <- current_aic
          best_k_age <- k_age
          best_k_fgs <- k_fgs
        }
      }, error = function(e) {}) # Handle potential errors
    }
  }
  
  min_aic <- best_aic  # best AIC among all combinations
  best_sum_knots <- best_k_age + best_k_fgs  # initialize with the sum of best knots
  
  # Find combinations with the smallest sum of knots but AIC within 4 of min_aic
  for (k_age in knot_range) {
    for (k_fgs in knot_range) {
      current_sum_knots <- k_age + k_fgs
      if ((aic_matrix[as.character(k_age), as.character(k_fgs)] - min_aic <= 4) && current_sum_knots < best_sum_knots) {
        best_k_age <- k_age
        best_k_fgs <- k_fgs
        best_sum_knots <- current_sum_knots
      }
    }
  }
  
  best_formula <- as.formula(
    gsub("knots_age", as.character(best_k_age), 
         gsub("knots_fgs", as.character(best_k_fgs), formula)))
  best_model <- glm(best_formula, data = data, family = family)
  
  return(list(model = best_model, knots_age = best_k_age, knots_fgs = best_k_fgs))
}

# Extract and print details
print_details <- function(model_result, model_name) {
  cat("\n---", model_name, "---\n")
  if ("knots_age" %in% names(model_result)) {
    cat("Knots: ", model_result$knots_age, " & ", model_result$knots_fgs, "\n")
  } else {
    cat("Knots: ", model_result$knots, "\n")
  }
  cat("AIC: ", AIC(model_result$model), "\n")
  cat("\nModel Summary:\n")
  print(summary(model_result$model))
  cat("\nModel Anova:\n")
  print(anova(model_result$model))
}

# Fitting models
# Note that 3 is the minimum knot size for rcs, dozens of warnings with knots>6
knot_range <- 3:6
mod1_results <- fit_best_rcs("Status ~ rcs(Age, KNOTS) * Sex", Donner, knot_range)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

mod3_results <- fit_best_rcs("Status ~ rcs(Family_Group_Size, KNOTS)", Donner, knot_range)
mod4_results <- fit_best_rcs("Status ~ rcs(Family_Group_Size, KNOTS) * Sex", Donner, knot_range)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

mod5_results <- fit_best_double_rcs("Status ~ rcs(Age, knots_age) * Sex + rcs(Family_Group_Size, knots_fgs)", Donner, knot_range)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Printing the details
print_details(mod1_results, "mod1")

## 
## --- mod1 ---
## Knots:  3 
## AIC:  95.41 
## 
## Model Summary:
## 
## Call:
## glm(formula = formula, family = binomial(), data = data)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)               -0.836      0.811   -1.03   0.3026
## rcs(Age, 3)Age             0.391      0.143    2.73   0.0064
## rcs(Age, 3)Age'           -0.938      0.336   -2.79   0.0052
## SexMale                    0.853      1.076    0.79   0.4275
## rcs(Age, 3)Age:SexMale    -0.403      0.158   -2.55   0.0106
## rcs(Age, 3)Age':SexMale    0.897      0.359    2.50   0.0124
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109.201  on 78  degrees of freedom
## Residual deviance:  83.408  on 73  degrees of freedom
## AIC: 95.41
## 
## Number of Fisher Scoring iterations: 5
## 
## 
## Model Anova:
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Status
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                               78      109.2         
## rcs(Age, 3)      2     8.10        76      101.1   0.0174
## Sex              1     8.45        75       92.7   0.0037
## rcs(Age, 3):Sex  2     9.24        73       83.4   0.0098

print_details(mod3_results, "mod3")

## 
## --- mod3 ---
## Knots:  4 
## AIC:  101 
## 
## Model Summary:
## 
## Call:
## glm(formula = formula, family = binomial(), data = data)
## 
## Coefficients:
##                                              Estimate Std. Error z value
## (Intercept)                                    -2.613      0.944   -2.77
## rcs(Family_Group_Size, 4)Family_Group_Size      0.912      0.301    3.03
## rcs(Family_Group_Size, 4)Family_Group_Size'    -1.959      0.801   -2.45
## rcs(Family_Group_Size, 4)Family_Group_Size''    3.948      1.946    2.03
##                                              Pr(>|z|)
## (Intercept)                                    0.0056
## rcs(Family_Group_Size, 4)Family_Group_Size     0.0024
## rcs(Family_Group_Size, 4)Family_Group_Size'    0.0145
## rcs(Family_Group_Size, 4)Family_Group_Size''   0.0424
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109.201  on 78  degrees of freedom
## Residual deviance:  92.962  on 75  degrees of freedom
## AIC: 101
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Model Anova:
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Status
## 
## Terms added sequentially (first to last)
## 
## 
##                           Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                                         78        109         
## rcs(Family_Group_Size, 4)  3     16.2        75         93    0.001

print_details(mod4_results, "mod4")

## 
## --- mod4 ---
## Knots:  3 
## AIC:  96.68 
## 
## Model Summary:
## 
## Call:
## glm(formula = formula, family = binomial(), data = data)
## 
## Coefficients:
##                                                     Estimate Std. Error z value
## (Intercept)                                           0.1411     1.2874    0.11
## rcs(Family_Group_Size, 3)Family_Group_Size            0.1167     0.2376    0.49
## rcs(Family_Group_Size, 3)Family_Group_Size'          -0.0767     0.2229   -0.34
## SexMale                                              -2.3309     1.5287   -1.52
## rcs(Family_Group_Size, 3)Family_Group_Size:SexMale    0.4964     0.3039    1.63
## rcs(Family_Group_Size, 3)Family_Group_Size':SexMale  -0.7009     0.3355   -2.09
##                                                     Pr(>|z|)
## (Intercept)                                            0.913
## rcs(Family_Group_Size, 3)Family_Group_Size             0.623
## rcs(Family_Group_Size, 3)Family_Group_Size'            0.731
## SexMale                                                0.127
## rcs(Family_Group_Size, 3)Family_Group_Size:SexMale     0.102
## rcs(Family_Group_Size, 3)Family_Group_Size':SexMale    0.037
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109.201  on 78  degrees of freedom
## Residual deviance:  84.681  on 73  degrees of freedom
## AIC: 96.68
## 
## Number of Fisher Scoring iterations: 5
## 
## 
## Model Anova:
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Status
## 
## Terms added sequentially (first to last)
## 
## 
##                               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                                             78      109.2         
## rcs(Family_Group_Size, 3)      2    12.10        76       97.1   0.0024
## Sex                            1     5.93        75       91.2   0.0149
## rcs(Family_Group_Size, 3):Sex  2     6.49        73       84.7   0.0389

print_details(mod5_results, "mod5")

## 
## --- mod5 ---
## Knots:  6  &  6 
## AIC:  56.33 
## 
## Model Summary:
## 
## Call:
## glm(formula = best_formula, family = family, data = data)
## 
## Coefficients:
##                                                 Estimate Std. Error z value
## (Intercept)                                    -2.89e+02   8.58e+05    0.00
## rcs(Age, 6)Age                                  1.04e+02   3.29e+05    0.00
## rcs(Age, 6)Age'                                 5.32e+03   8.07e+06    0.00
## rcs(Age, 6)Age''                               -1.43e+04   1.41e+07    0.00
## rcs(Age, 6)Age'''                               1.75e+04   7.69e+06    0.00
## rcs(Age, 6)Age''''                             -1.05e+04   1.96e+06   -0.01
## SexMale                                         2.75e+02   8.58e+05    0.00
## rcs(Family_Group_Size, 6)Family_Group_Size     -1.65e+01   2.37e+03   -0.01
## rcs(Family_Group_Size, 6)Family_Group_Size'     4.77e+02   5.91e+04    0.01
## rcs(Family_Group_Size, 6)Family_Group_Size''   -1.31e+03   1.60e+05   -0.01
## rcs(Family_Group_Size, 6)Family_Group_Size'''   2.77e+03   3.36e+05    0.01
## rcs(Family_Group_Size, 6)Family_Group_Size'''' -5.37e+03   6.62e+05   -0.01
## rcs(Age, 6)Age:SexMale                         -9.85e+01   3.29e+05    0.00
## rcs(Age, 6)Age':SexMale                        -5.42e+03   8.07e+06    0.00
## rcs(Age, 6)Age'':SexMale                        1.45e+04   1.41e+07    0.00
## rcs(Age, 6)Age''':SexMale                      -1.76e+04   7.69e+06    0.00
## rcs(Age, 6)Age'''':SexMale                      1.05e+04   1.96e+06    0.01
##                                                Pr(>|z|)
## (Intercept)                                        1.00
## rcs(Age, 6)Age                                     1.00
## rcs(Age, 6)Age'                                    1.00
## rcs(Age, 6)Age''                                   1.00
## rcs(Age, 6)Age'''                                  1.00
## rcs(Age, 6)Age''''                                 1.00
## SexMale                                            1.00
## rcs(Family_Group_Size, 6)Family_Group_Size         0.99
## rcs(Family_Group_Size, 6)Family_Group_Size'        0.99
## rcs(Family_Group_Size, 6)Family_Group_Size''       0.99
## rcs(Family_Group_Size, 6)Family_Group_Size'''      0.99
## rcs(Family_Group_Size, 6)Family_Group_Size''''     0.99
## rcs(Age, 6)Age:SexMale                             1.00
## rcs(Age, 6)Age':SexMale                            1.00
## rcs(Age, 6)Age'':SexMale                           1.00
## rcs(Age, 6)Age''':SexMale                          1.00
## rcs(Age, 6)Age'''':SexMale                         1.00
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109.201  on 78  degrees of freedom
## Residual deviance:  22.328  on 62  degrees of freedom
## AIC: 56.33
## 
## Number of Fisher Scoring iterations: 25
## 
## 
## Model Anova:
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Status
## 
## Terms added sequentially (first to last)
## 
## 
##                           Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                                         78      109.2         
## rcs(Age, 6)                5     17.4        73       91.8  0.00379
## Sex                        1     13.1        72       78.7  0.00030
## rcs(Family_Group_Size, 6)  5     32.6        67       46.2  4.6e-06
## rcs(Age, 6):Sex            5     23.8        62       22.3  0.00023

###### End of GPT-4 AIC rcs optimization code ######

# Generate 4 ggplot2 graphics
# Define new levels for the predictors
new_age <- seq(min(Donner$Age), max(Donner$Age), length.out = 100)
new_family_group_size <- seq(min(Donner$Family_Group_Size), 
                             max(Donner$Family_Group_Size), length.out = 100)
new_sex <- unique(Donner$Sex)

# Make predictions on the link scale (logit scale)
# The Predict function require fit by Glm not glm
link_pred1 <- Predict(mod1, Age = new_age, Sex = new_sex)
link_pred3 <- Predict(mod3, Family_Group_Size = new_family_group_size)
link_pred4 <- Predict(mod4, Age = new_age, Sex = new_sex,
                      Family_Group_Size = new_family_group_size)
link_pred5 <- Predict(mod5, Age = new_age, Sex = new_sex, 
                      Family_Group_Size = new_family_group_size)

# Transform predictions back to the original scale (probability scale)
pred1 <- data.frame(
  Age = rep(new_age, times = length(new_sex)),
  Sex = rep(new_sex, each = length(new_age)),
  fit = plogis(link_pred1$yhat),
  lower = plogis(link_pred1$lower),
  upper = plogis(link_pred1$upper)
)

pred3 <- data.frame(
  Sex = rep(new_sex, each = length(new_age)),
  Family_Group_Size = new_family_group_size,
  fit = plogis(link_pred3$yhat),
  lower = plogis(link_pred3$lower),
  upper = plogis(link_pred3$upper)
)

# Transform predictions back to the original scale (probability scale)
pred4 <- data.frame(
  Age = rep(new_age, times = length(new_sex)),
  Sex = rep(new_sex, each = length(new_age)),
  Family_Group_Size = rep(new_family_group_size, each = length(new_family_group_size)),
  fit = plogis(link_pred4$yhat),
  lower = plogis(link_pred4$lower),
  upper = plogis(link_pred4$upper)
)

pred5 <-  data.frame(
  Age = rep(new_age, times = length(new_sex)),
  Sex = rep(new_sex, each = length(new_age)),
  Family_Group_Size = rep(new_family_group_size, each = length(new_family_group_size)),
  fit = plogis(link_pred5$yhat),
  lower = plogis(link_pred5$lower),
  upper = plogis(link_pred5$upper)
)
# Adjust y-values for jittered points
Donner$AdjustedStatus <- ifelse(Donner$Status == 0, -0.03, 1.03)

# Plot the results for Figure 1 rcs(Age,3) * Sex model
set.seed(13) 
ggplot(pred1, aes(x = Age, y = fit, color = Sex)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  geom_jitter(data = Donner, aes(x = Age, y = AdjustedStatus,
                                 shape = Sex, color = Sex), width = 0.3, height = 0.03, size = 1.5) +
  labs(x = "Age (Years)", y = "Estimated Probability of Survival",
       title = "Fig. 1. rcs(Age, 3) * Sex with 95% confidence intervals",
       color = "Sex") +
  theme_minimal() +
  scale_y_continuous(limits = c(-0.06, 1.06), breaks = seq(0, 1, 0.2))

# Plot the results for the rcs(Family_Group_Size,4) model
ggplot(pred3, aes(x = Family_Group_Size, y = fit, color = Sex)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  geom_jitter(data = Donner, aes(x = Family_Group_Size, y = AdjustedStatus,
                                 shape = Sex, color = Sex), width = 0.3, height = 0.03,
              size = 1.5) +
  labs(x = "Family Group Size", y = "Estimated Probability of Survival",
       title = "Fig. 3. rcs(Family Group Size, 5) with 95% confidence intervals",
       color = "Sex") +
  theme_minimal() +
  scale_y_continuous(limits = c(-0.06, 1.06), breaks = seq(0, 1, 0.2))

# Plot the results for the additive model, mod2 (graphic not used in manuscript)
plot_3d_mod4 <- plot_ly(data = subset(pred4, Sex == "Male"), x = ~Age, y = ~Family_Group_Size, z = ~fit, 
                        type = "mesh3d", opacity = 0.6, name = "Male", showscale = FALSE) %>%
  add_trace(data = subset(pred4, Sex == "Female"), x = ~Age, y = ~Family_Group_Size, z = ~fit, 
            type = "mesh3d", opacity = 0.6, name = "Female", showscale = FALSE) %>%
  layout(scene = list(zaxis = list(range = c(0, 1)),
                      xaxis = list(title = "Age"),
                      yaxis = list(title = "Family Group Size"),
                      zaxis = list(title = "Estimated Probability of Survival")),
         #         title = "Age + Sex + rcs(Family_Group_Size,6)")
         title = "Age + Sex + rcs(Family_Group_Size,6)")
plot_3d_mod4

# Plot the results for the rcs(Age,3) * Sex * rcs(Family_Group-Size,5) 3-d model
# Graphic not used in the manuscript (too curvy)
plot_3d_mod5 <- plot_ly(data = subset(pred5, Sex == "Male"), x = ~Age, y = ~Family_Group_Size, z = ~fit, 
                        type = "mesh3d", opacity = 0.6, name = "Male", showscale = FALSE) %>%
  add_trace(data = subset(pred5, Sex == "Female"), x = ~Age, y = ~Family_Group_Size, z = ~fit, 
            type = "mesh3d", opacity = 0.6, name = "Female", showscale = FALSE) %>%
  layout(scene = list(zaxis = list(range = c(0, 1)),
                      xaxis = list(title = "Age"),
                      yaxis = list(title = "Family Group Size"),
                      zaxis = list(title = "Estimated Probability of Survival")),
         title = "rcs(Age,3) * Sex + rcs(Family_Group_Size,5)")
plot_3d_mod5

### 8/20/23 addendum. Fitting rcs knots with k-fold cross validation ##########
# The AIC approach produces a badly overfit mod5 showing
# a 3-d plot that was far too flexible. I sent a 3-p prompt on the problem to
# GPT-4 and it provided this solution

cv_rcs <- function(k_age, k_fgs, data, n_folds = 10){
  
  # Create a vector to store CV errors
  cv_errors <- numeric(n_folds)
  
  # Create fold indices
  fold_indices <- sample(rep(1:n_folds, length.out = nrow(data)))
  
  for(i in 1:n_folds){
    
    # Split the data into training and test sets
    training_data <- data[fold_indices != i, ]
    test_data <- data[fold_indices == i, ]
    
    # Fit the GLM model on the training data with rcs terms
    glm_model <- glm(Status ~ rcs(Age, k_age) * Sex + rcs(Family_Group_Size, k_fgs), 
                     data = training_data, family = binomial())
    
    # Predict on the test data
    predictions <- predict(glm_model, newdata = test_data, type = "response")
    
    # Compute the binomial deviance (log loss) for the current fold and store it
    cv_errors[i] <- -2 * sum(test_data$Status * log(predictions) + (1 - test_data$Status) * log(1 - predictions))
    
  }
  
  # Return the mean CV error
  return(mean(cv_errors))
}

# Perform k-fold CV for a range of knot sizes
k_values <- 3:6
cv_results_matrix <- matrix(NA, nrow = length(k_values), ncol = length(k_values),
                            dimnames = list(k_values, k_values))

for(k_age in k_values){
  for(k_fgs in k_values){
    cv_results_matrix[as.character(k_age), as.character(k_fgs)] <- cv_rcs(k_age, k_fgs, Donner)
  }
}

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning in rcspline.eval(x, nk = nknots, inclx = TRUE, pc = pc, fractied =
## fractied): 6 knots requested with 8 unique values of x.  knots set to 6
## interior values.

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Determine the optimal k_age and k_fgs based on minimum CV error
optimal_indices <- which(cv_results_matrix == min(cv_results_matrix), arr.ind = TRUE)
optimal_k_age <- as.numeric(rownames(cv_results_matrix)[optimal_indices[1,1]])
optimal_k_fgs <- as.numeric(colnames(cv_results_matrix)[optimal_indices[1,2]])

print(paste("Optimal k for Age: ", optimal_k_age))

## [1] "Optimal k for Age:  4"

print(paste("Optimal k for Family_Group_Size: ", optimal_k_fgs))

## [1] "Optimal k for Family_Group_Size:  3"

# Fit the optimal model
best_glm_model <- glm(Status ~ rcs(Age, optimal_k_age) * Sex + 
                        rcs(Family_Group_Size, optimal_k_fgs), 
                      data = Donner, family = binomial())
summary(best_glm_model)

## 
## Call:
## glm(formula = Status ~ rcs(Age, optimal_k_age) * Sex + rcs(Family_Group_Size, 
##     optimal_k_fgs), family = binomial(), data = Donner)
## 
## Coefficients:
##                                                         Estimate Std. Error
## (Intercept)                                               -5.334      2.005
## rcs(Age, optimal_k_age)Age                                 2.012      0.811
## rcs(Age, optimal_k_age)Age'                              -12.151      5.203
## rcs(Age, optimal_k_age)Age''                              21.164      9.295
## SexMale                                                    1.878      2.004
## rcs(Family_Group_Size, optimal_k_fgs)Family_Group_Size     0.511      0.194
## rcs(Family_Group_Size, optimal_k_fgs)Family_Group_Size'   -0.673      0.220
## rcs(Age, optimal_k_age)Age:SexMale                        -1.646      0.803
## rcs(Age, optimal_k_age)Age':SexMale                        9.947      5.155
## rcs(Age, optimal_k_age)Age'':SexMale                     -17.303      9.214
##                                                         z value Pr(>|z|)
## (Intercept)                                               -2.66   0.0078
## rcs(Age, optimal_k_age)Age                                 2.48   0.0131
## rcs(Age, optimal_k_age)Age'                               -2.34   0.0195
## rcs(Age, optimal_k_age)Age''                               2.28   0.0228
## SexMale                                                    0.94   0.3489
## rcs(Family_Group_Size, optimal_k_fgs)Family_Group_Size     2.64   0.0084
## rcs(Family_Group_Size, optimal_k_fgs)Family_Group_Size'   -3.06   0.0022
## rcs(Age, optimal_k_age)Age:SexMale                        -2.05   0.0404
## rcs(Age, optimal_k_age)Age':SexMale                        1.93   0.0537
## rcs(Age, optimal_k_age)Age'':SexMale                      -1.88   0.0604
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109.201  on 78  degrees of freedom
## Residual deviance:  61.443  on 69  degrees of freedom
## AIC: 81.44
## 
## Number of Fisher Scoring iterations: 7

## This optimization produces Age: 3 knots, Family Group Size 5 knots.

# Summary and ANOVA for mod5 with knot sizes determined by k-fold cross validation
# Just repeating thise calculations for clarity.
mod5 <- Glm(Status ~ rcs(Age,3) * Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)
Mod5 <- glm(Status ~ rcs(Age,3) * Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)
summary(Mod5)

## 
## Call:
## glm(formula = Status ~ rcs(Age, 3) * Sex + rcs(Family_Group_Size, 
##     5), family = binomial(), data = Donner, x = TRUE, y = TRUE)
## 
## Coefficients:
##                                               Estimate Std. Error z value
## (Intercept)                                     -3.854      1.840   -2.09
## rcs(Age, 3)Age                                   0.541      0.193    2.80
## rcs(Age, 3)Age'                                 -1.334      0.480   -2.78
## SexMale                                         -0.500      1.407   -0.36
## rcs(Family_Group_Size, 5)Family_Group_Size       0.514      0.519    0.99
## rcs(Family_Group_Size, 5)Family_Group_Size'      3.109      3.150    0.99
## rcs(Family_Group_Size, 5)Family_Group_Size''    -9.172      6.151   -1.49
## rcs(Family_Group_Size, 5)Family_Group_Size'''  123.583     47.027    2.63
## rcs(Age, 3)Age:SexMale                          -0.358      0.209   -1.72
## rcs(Age, 3)Age':SexMale                          0.959      0.496    1.93
##                                               Pr(>|z|)
## (Intercept)                                     0.0362
## rcs(Age, 3)Age                                  0.0051
## rcs(Age, 3)Age'                                 0.0055
## SexMale                                         0.7223
## rcs(Family_Group_Size, 5)Family_Group_Size      0.3220
## rcs(Family_Group_Size, 5)Family_Group_Size'     0.3237
## rcs(Family_Group_Size, 5)Family_Group_Size''    0.1359
## rcs(Family_Group_Size, 5)Family_Group_Size'''   0.0086
## rcs(Age, 3)Age:SexMale                          0.0862
## rcs(Age, 3)Age':SexMale                         0.0533
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109.201  on 78  degrees of freedom
## Residual deviance:  58.369  on 69  degrees of freedom
## AIC: 78.37
## 
## Number of Fisher Scoring iterations: 6

anova(Mod5)

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Status
## 
## Terms added sequentially (first to last)
## 
## 
##                           Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                                         78      109.2         
## rcs(Age, 3)                2     8.10        76      101.1   0.0174
## Sex                        1     8.45        75       92.7   0.0037
## rcs(Family_Group_Size, 5)  4    27.74        71       64.9  1.4e-05
## rcs(Age, 3):Sex            2     6.55        69       58.4   0.0379

summary(mod5)  # For effect size

##              Effects              Response : Status 
## 
##  Factor            Low High Diff. Effect S.E.   Lower 0.95 Upper 0.95
##  Age               6   28.5 22.5   1.078 1.0241 -0.9647    3.1215    
##  Family_Group_Size 4   13.0  9.0  -1.042 0.9906 -3.0184    0.9341    
##  Sex - Female:Male 2    1.0   NA   4.473 1.7069  1.0674    7.8778    
## 
## Adjusted to: Age=14 Sex=Male

AIC(mod5)

## [1] 78.37

### end of 8/20/23 GPT-4 code. #################################################

##### GAM analysis of the 79 Travelers #####
######## k-fold cross validation to find optimal k for the GAMs

# Set a seed for reproducibility
set.seed(123)

# Create a function to perform cross-validation for a given k
cv_gam <- function(k_value){
  
  # Number of folds
  n_folds <- 5
  
  # Create a vector to store CV errors
  cv_errors <- numeric(n_folds)
  
  # Create fold indices
  fold_indices <- sample(rep(1:n_folds, length.out = nrow(Donner)))
  
  for(i in 1:n_folds){
    
    # Split the data into training and test sets
    training_data <- Donner[fold_indices != i, ]
    test_data <- Donner[fold_indices == i, ]
    
    # Fit the GAM model on the training data
    gam_model <- gam(Status ~ s(Age, by = as.numeric(Sex == "Male"), k=k_value) +
                       s(Age, by = as.numeric(Sex == "Female"), k=k_value),
                     data = training_data, family = binomial())
    
    # Predict on the test data
    predictions <- predict(gam_model, newdata = test_data, type = "response")
    
    # Compute the binomial deviance (log loss) for the current fold and store it
    cv_errors[i] <- -2 * sum(test_data$Status * log(predictions) + (1 - test_data$Status) * log(1 - predictions))
    
  }
  
  # Return the mean CV error
  return(mean(cv_errors))
}

# Test a range of k values
# This statement produced many warnings, reduced max to 8 and increased min to 3
# k_values <- 2:10
k_values <- 3:6
cv_results <- sapply(k_values, cv_gam)

## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L = G$L,
## : Iteration limit reached without full convergence - check carefully

cv_results

## [1] 21.07 25.94 21.98 44.70

# Determine the optimal k based on minimum CV error
optimal_k <- k_values[which.min(cv_results)]

print(optimal_k)

## [1] 3

## optimal_k is 2, so redo the model

mod_gam_k2 <- gam(Status ~ s(Age, by = as.numeric(Sex == "Male"), k=2) +
                    s(Age, by = as.numeric(Sex == "Female"), k=2), 
                  data = Donner, family = binomial())

## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible

## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible

mod_gam_k2

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Status ~ s(Age, by = as.numeric(Sex == "Male"), k = 2) + s(Age, 
##     by = as.numeric(Sex == "Female"), k = 2)
## 
## Estimated degrees of freedom:
## 2.0 1.9  total = 4.9 
## 
## UBRE score: 0.2035     rank: 6/7

# Check the summary
summary(mod_gam_k2)

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Status ~ s(Age, by = as.numeric(Sex == "Male"), k = 2) + s(Age, 
##     by = as.numeric(Sex == "Female"), k = 2)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)     9.19       3.13    2.94   0.0033
## 
## Approximate significance of smooth terms:
##                                    edf Ref.df Chi.sq p-value
## s(Age):as.numeric(Sex == "Male")   2.0   2.00   11.3  0.0036
## s(Age):as.numeric(Sex == "Female") 1.9   1.99    7.5  0.0229
## 
## Rank: 6/7
## R-sq.(adj) =  0.229   Deviance explained = 21.9%
## UBRE = 0.20352  Scale est. = 1         n = 79

# Create a prediction dataset
new_age <- seq(min(Donner$Age), max(Donner$Age), length.out = 100)
new_sex <- unique(Donner$Sex)
pred_data <- expand.grid(Age = new_age, Sex = new_sex)

# Predict using the GAM model
predictions <- predict(mod_gam_k2, newdata = pred_data, type = "link", se.fit = TRUE)
pred_data$fit <- plogis(predictions$fit)
pred_data$lower <- plogis(predictions$fit - 1.96 * predictions$se.fit)
pred_data$upper <- plogis(predictions$fit + 1.96 * predictions$se.fit)

# plot the GAM analysis for Sex (Figure 2 in manuscript)
# Adjust y-values for jittered points
Donner$AdjustedStatus <- ifelse(Donner$Status == 0, -0.03, 1.03)

# Plot Fig. 2, GAM analysis of Sex
set.seed(3)
ggplot(pred_data, aes(x = Age, y = fit, color = Sex, shape = Sex)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  geom_jitter(data = Donner, aes(x = Age, y = AdjustedStatus),
              width = 0.3, height = 0.03, size = 1.5) +
  labs(x = "Age (Years)", y = "Estimated Probability of Survival",
       title = "Fig. 2. GAM (k=2) with Age smooths by Sex with 95% confidence intervals") +
  theme_minimal() +
  scale_y_continuous(limits = c(-0.06, 1.06), breaks = seq(0, 1, 0.2)) +
  scale_color_manual(values = c("Male" = "blue", "Female" = "pink2"), name = "Sex") +
  scale_shape_manual(values = c("Male" = 17, "Female" = 16), name = "Sex") + 
  guides(color = guide_legend(override.aes = list(shape = c(17, 16))))

### GAMs for Family Group Size, code from GPT-4 ################################

# Define the k-fold cross-validation manually for GAM:
set.seed(123) # for reproducibility
folds <- createFolds(Donner$Status, k = 10)

cv_errors <- data.frame(k = integer(), RMSE = numeric(), Rsquared = numeric())

for (k_value in 3:6) {
  rmse_values <- numeric()
  rsq_values <- numeric()
  
  # Attempt to fit models and catch errors
  for (fold in folds) {
    train_data <- Donner[-fold, ]
    test_data <- Donner[fold, ]
    success <- TRUE
    
    model <- tryCatch({
      gam(Status ~ s(Family_Group_Size, k = k_value), data = train_data, family = binomial())
    }, error = function(e) {
      message(paste("Error with k =", k_value, ":", e$message))
      success <- FALSE
      return(NULL)
    }, warning = function(w) {
      message(paste("Warning with k =", k_value, ":", w$message))
    })
    
    if(success && !is.null(model)){
      preds <- predict(model, newdata = test_data, type = "response")
      
      rmse <- sqrt(mean((test_data$Status - preds)^2))
      rsq <- 1 - (sum((test_data$Status - preds)^2) / sum((test_data$Status - mean(test_data$Status))^2))
      
      rmse_values <- c(rmse_values, rmse)
      rsq_values <- c(rsq_values, rsq)
    }
  }
  
  if (length(rmse_values) > 0 && length(rsq_values) > 0) {
    cv_errors <- rbind(cv_errors, data.frame(k = k_value, RMSE = mean(rmse_values), Rsquared = mean(rsq_values)))
  }
}

# Check the results
cv_errors

##   k   RMSE Rsquared
## 1 3 0.4685 -0.02279
## 2 4 0.4705 -0.03109
## 3 5 0.4658 -0.02021
## 4 6 0.4703 -0.04122

# The value of k with the smallest RMSE would be the optimal choice
best_k <- cv_errors[which.min(cv_errors$RMSE),]$k
best_k

## [1] 5

# Fit the GAM with optimal k

mod_gam <- gam(Status ~ s(Family_Group_Size, k = best_k), data = Donner, family = binomial())

summary(mod_gam)

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Status ~ s(Family_Group_Size, k = best_k)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    0.308      0.294    1.05     0.29
## 
## Approximate significance of smooth terms:
##                       edf Ref.df Chi.sq p-value
## s(Family_Group_Size) 3.77   3.96   11.6    0.02
## 
## R-sq.(adj) =  0.173   Deviance explained = 19.2%
## UBRE = 0.23719  Scale est. = 1         n = 79

# Generate ggplot graphics for GAM
# Create a prediction dataset
new_family_group_size <- seq(min(Donner$Family_Group_Size), 
                             max(Donner$Family_Group_Size), length.out = 100)
# Create prediction data
# Predict on the link (logit) scale
link_predictions <- predict(mod_gam, newdata = data.frame(Family_Group_Size = new_family_group_size), type = "link", se.fit = TRUE)

# Compute confidence intervals on the link scale
link_lower <- link_predictions$fit - 1.96 * link_predictions$se.fit
link_upper <- link_predictions$fit + 1.96 * link_predictions$se.fit

# Back-transform to the probability scale
pred_data <- data.frame(
  Family_Group_Size = new_family_group_size,
  fit = plogis(link_predictions$fit),
  lower = plogis(link_lower),
  upper = plogis(link_upper)
)

# Plot the GAM (k = best_k = 5) model.
ggplot(pred_data, aes(x = Family_Group_Size, y = fit)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  geom_jitter(data = Donner, aes(x = Family_Group_Size, y = AdjustedStatus,
                                 shape = Sex, color = Sex), width = 0.3, height = 0.03,
              size = 1.5) +
  labs(x = "Family Group Size", y = "Estimated Probability of Survival",
       # title = paste0("GAM (k=", best_k, ") with Family Group Size with 95% confidence intervals"),
       title = paste0("Fig. 4. GAM (k=", best_k, ") with Family Group Size with 95% confidence intervals"),
       color = "Sex") +
  theme_minimal() +
  scale_y_continuous(limits = c(-0.06, 1.06), breaks = seq(0, 1, 0.2))

##### Cox Proportional Hazards Model for Sex

Donner$Death <- abs(Donner$Status-1)
# Code initially from GPT-4

# Assuming the data frame is already loaded as `Donner`
# 2. Fit the Cox proportional hazards model
cox_model <- coxph(Surv(Survival_Time, Death) ~ Sex, data = Donner)

# Print the summary of the model
print(summary(cox_model))

## Call:
## coxph(formula = Surv(Survival_Time, Death) ~ Sex, data = Donner)
## 
##   n= 79, number of events= 37 
## 
##          coef exp(coef) se(coef)    z Pr(>|z|)
## SexMale 1.037     2.819    0.371 2.79   0.0052
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## SexMale      2.82      0.355      1.36      5.83
## 
## Concordance= 0.633  (se = 0.036 )
## Likelihood ratio test= 8.82  on 1 df,   p=0.003
## Wald test            = 7.81  on 1 df,   p=0.005
## Score (logrank) test = 8.52  on 1 df,   p=0.004

# Check the proportional hazards assumption with 
ph_test <- cox.zph(cox_model)

# Print the results
# print(ph_test)

# Plot the Schoenfeld residuals
# plot(ph_test)

# 3. Visualize the results 

# Fit Kaplan-Meier survival curves for Sex
km_fit <- survfit(Surv(Survival_Time, Death) ~ Sex, data = Donner)

# Define the days of arrival of the relief parties
relief_days <- c(83, 124, 136, 171)
colors <- c("red", "blue", "magenta", "purple")
linetypes <- c("dashed", "dotted", "dotdash", "longdash")
labels <- c("1st", "2nd", "3rd", "4th")

# Plot Fig. 6 the KM curve & Risk Table for Sex

# Generate KM plot without the risk table
p1 <- ggsurvplot(km_fit, data = Donner, risk.table = FALSE,
                 pval = TRUE, pval.coord = c(0.8, 0.25),
                 legend = "right", legend.title = "Sex",
                 title = "Fig. 6. Kaplan-Meier Survivorship by Sex",
                 legend.labs = c("Female", "Male"), conf.int = TRUE,
                 break.x.by = 20)

# Add the relief party vertical lines and annotations
for (i in 1:4) {
  p1$plot <- p1$plot + 
    annotate("segment", x = relief_days[i], xend = relief_days[i], 
             y = 0, yend = 1, colour = colors[i], linetype = linetypes[i]) +
    annotate("text", x = relief_days[i], y = 0.2, 
             label = labels[i], angle = 90, vjust = 0, color = colors[i])
}

# Generate risk table separately
p2 <- ggsurvplot(km_fit, data = Donner, risk.table = TRUE,
                 pval = FALSE, legend = "none",
                 break.x.by = 20)

# Use patchwork to combine the two plots

# Use patchwork to combine the two plots with specified heights
combined_plot <- p1$plot / p2$table + plot_layout(heights = c(5, 1))

# Print the combined plot
combined_plot

#### Hazard ratio plot
# Convert model coefficients to hazard ratios
hr <- exp(coef(cox_model))
ci <- exp(confint(cox_model))

# Create a data frame for plotting
df <- data.frame(
  Variable = names(hr),
  HR = hr,
  lower = ci[, 1],
  upper = ci[, 2]
)

# Organize data for forestplot
labeltext <- list(
  c("", df$Variable),
  c("HR", as.character(round(df$HR, 2))),
  c("Lower 95% CI", as.character(round(df$lower, 2))),
  c("Upper 95% CI", as.character(round(df$upper, 2)))
)

# Plot using forestplot
forestplot(
  labeltext = labeltext,
  graph.pos = 3,
  mean = c(NA, df$HR),
  lower = c(NA, df$lower),
  upper = c(NA, df$upper),
  xlog = TRUE, # because hazard ratios are typically plotted on a log scale
  clip = c(0.5, 2), # you can adjust these values if necessary
  xticks = c(0.5, 1, 2),
  title = "Hazard Ratios by Sex (95% CI)"
)

# 4. Employee (Teamsters & 2 Servants) Survival Analysis
# 4.1 Cox PH model

cox_model_2 <- coxph(Surv(Survival_Time, Death) ~ Employee, data = Donner)

# Print the summary of the model
print(summary(cox_model_2))

## Call:
## coxph(formula = Surv(Survival_Time, Death) ~ Employee, data = Donner)
## 
##   n= 79, number of events= 37 
## 
##           coef exp(coef) se(coef)    z Pr(>|z|)
## Employee 1.047     2.849    0.422 2.48    0.013
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## Employee      2.85      0.351      1.25      6.52
## 
## Concordance= 0.571  (se = 0.035 )
## Likelihood ratio test= 4.93  on 1 df,   p=0.03
## Wald test            = 6.15  on 1 df,   p=0.01
## Score (logrank) test = 6.73  on 1 df,   p=0.009

# Check the proportional hazards assumption
ph_test_2 <- cox.zph(cox_model_2)

# Print the results
print(ph_test_2)

##          chisq df     p
## Employee  4.17  1 0.041
## GLOBAL    4.17  1 0.041

# Some evidence that the constant proportional hazard assumption is violated.
# p = 0.041 for Employee and Global chisq 4.17 with 1 df

# Plot the Schoenfeld residuals
plot(ph_test_2)

# 4.2 Kaplan-Meier survival curve for Employment

km_fit <- survfit(Surv(Survival_Time, Death) ~ Employee, data = Donner)

# Define the days of arrival of the relief parties
relief_days <- c(83, 124, 136, 171)
colors <- c("red", "blue", "magenta", "purple")
linetypes <- c("dashed", "dotted", "dotdash", "longdash")
labels <- c("1st", "2nd", "3rd", "4th")

# Generate Fig. 7 KM plot for employment with the risk table
p1 <- ggsurvplot(km_fit, data = Donner, risk.table = FALSE,
                 pval = TRUE, pval.coord = c(0.8, 0.25),
                 legend = "right", legend.title = "Sex",
                 title = "Fig. 7. Kaplan-Meier Survivorship by Employment",
                 legend.labs = c("Family Member", "Employee"), conf.int = TRUE,
                 break.x.by = 20)

# Add the relief party vertical lines and annotations
for (i in 1:4) {
  p1$plot <- p1$plot + 
    annotate("segment", x = relief_days[i], xend = relief_days[i], 
             y = 0, yend = 1, colour = colors[i], linetype = linetypes[i]) +
    annotate("text", x = relief_days[i], y = 0.05, 
             label = labels[i], angle = 90, vjust = 0, color = colors[i])
}

# Generate risk table separately
p2 <- ggsurvplot(km_fit, data = Donner, risk.table = TRUE,
                 pval = FALSE, legend = "none",
                 break.x.by = 20)

# Use patchwork to combine the two plots

# Use patchwork to combine the two plots with specified heights
combined_plot <- p1$plot / p2$table + plot_layout(heights = c(5, 1))

# Print the combined plot
combined_plot

# GPT-4 suggested that I try stratified KM plots to show the change in
# Proportional Hazard ratio with time. Here is the codefrom GPT-4

# Split the Survival_Time into intervals

Donner$TimeGroup <- cut(Donner$Survival_Time, breaks = c(0, 50, 100, max(Donner$Survival_Time)), 
                        labels = c("Early", "Middle", "Late"), include.lowest = TRUE)

# KM fit using Employee and stratified by TimeGroup
km_fit_time <- survfit(Surv(Survival_Time, Death) ~ Employee + strata(TimeGroup), data = Donner)

# Plot Stratified Kaplan_Meier curves.
ggsurvplot(km_fit_time, data = Donner, risk.table = FALSE, 
           legend = "right", legend.title = "Sex",
           title = "Fig. 8. Kaplan-Meir Employee Survivorship Stratified by Time Intervals", conf.int = TRUE,
           break.x.by = 20)

# 4.3 Calculate and plot the hazard ratio for Employment

# Convert model coefficients to hazard ratios
hr <- exp(coef(cox_model_2))
ci <- exp(confint(cox_model_2))

# Create a data frame for plotting
df <- data.frame(
  Variable = names(hr),
  HR = hr,
  lower = ci[, 1],
  upper = ci[, 2]
)

# Organize data for forestplot
labeltext <- list(
  c("", df$Variable),
  c("HR", as.character(round(df$HR, 2))),
  c("Lower 95% CI", as.character(round(df$lower, 2))),
  c("Upper 95% CI", as.character(round(df$upper, 2)))
)

# Plot using forestplot
forestplot(
  labeltext = labeltext,
  graph.pos = 3,
  mean = c(NA, df$HR),
  lower = c(NA, df$lower),
  upper = c(NA, df$upper),
  xlog = TRUE, # because hazard ratios are typically plotted on a log scale
  clip = c(0.5, 2), # you can adjust these values if necessary
  xticks = c(0.5, 1, 2.5),
  title = "Hazard Ratios (95% CI)"
)

# The proportional hazards assumption is clearly violated, so GPT-4 suggested
# an alternative:
Donner$SurvTime_Employee <- with(Donner, Survival_Time * Employee)
cox_time_dep <- coxph(Surv(Survival_Time, Death) ~ Employee + SurvTime_Employee, data = Donner)
summary(cox_time_dep)

## Call:
## coxph(formula = Surv(Survival_Time, Death) ~ Employee + SurvTime_Employee, 
##     data = Donner)
## 
##   n= 79, number of events= 37 
## 
##                       coef exp(coef) se(coef)     z Pr(>|z|)
## Employee            6.1680  477.2294   1.2615  4.89    1e-06
## SurvTime_Employee  -0.0456    0.9554   0.0143 -3.18   0.0015
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## Employee            477.229     0.0021    40.263  5656.520
## SurvTime_Employee     0.955     1.0466     0.929     0.983
## 
## Concordance= 0.621  (se = 0.035 )
## Likelihood ratio test= 27  on 2 df,   p=1e-06
## Wald test            = 32.1  on 2 df,   p=1e-07
## Score (logrank) test = 63.6  on 2 df,   p=2e-14

# Print the summary of the model
print(summary(cox_time_dep))

## Call:
## coxph(formula = Surv(Survival_Time, Death) ~ Employee + SurvTime_Employee, 
##     data = Donner)
## 
##   n= 79, number of events= 37 
## 
##                       coef exp(coef) se(coef)     z Pr(>|z|)
## Employee            6.1680  477.2294   1.2615  4.89    1e-06
## SurvTime_Employee  -0.0456    0.9554   0.0143 -3.18   0.0015
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## Employee            477.229     0.0021    40.263  5656.520
## SurvTime_Employee     0.955     1.0466     0.929     0.983
## 
## Concordance= 0.621  (se = 0.035 )
## Likelihood ratio test= 27  on 2 df,   p=1e-06
## Wald test            = 32.1  on 2 df,   p=1e-07
## Score (logrank) test = 63.6  on 2 df,   p=2e-14

# Check the proportional hazards assumption
ph_test_3 <- cox.zph(cox_time_dep)

# Print the results
print(ph_test_3)

##                   chisq df     p
## Employee           4.97  1 0.026
## SurvTime_Employee  2.70  1 0.100
## GLOBAL             5.14  2 0.077

# Evidence for the violation of the constant hazard ratio assumption for Employee
# But not for Global

# Plot the Schoenfeld residuals
plot(ph_test_3)

# Fairly level distribution for Betas for SurvTime_Employees

### 6. Forlorn Hope Fisher's exact test:
# Create the data matrix
# The matrix format is:
#        Survived Died
# Women      5     0
# Men        2     8

donner_matrix <- matrix(c(5, 0, 2, 8), nrow=2, byrow=TRUE)
colnames(donner_matrix) <- c("Survived", "Died")
rownames(donner_matrix) <- c("Women", "Men")
print(donner_matrix)

##       Survived Died
## Women        5    0
## Men          2    8

# Fisher's exact test
test_result <- fisher.test(donner_matrix)

# Print the results
print(test_result)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  donner_matrix
## p-value = 0.007
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.688   Inf
## sample estimates:
## odds ratio 
##        Inf

# Odds ratio
print(paste("Odds Ratio:", test_result$estimate))

## [1] "Odds Ratio: Inf"

# Confidence interval for the odds ratio
print(paste("95% Confidence Interval:", round(test_result$conf.int[1], 3), "to", round(test_result$conf.int[2], 3)))

## [1] "95% Confidence Interval: 1.688 to Inf"

# Produces an SPSS-like table.  Note that with the Essex data, warnings are
# issued because the minimum sample sizes for 2 x 2 have expectations less
# than 5. The proportions and Fisher's exact test are correct, hence CrossTable
CrossTable(donner_matrix,digits=3,fisher = TRUE, chisq = TRUE, expected = TRUE,
           format = "SPSS")

## Warning in chisq.test(t, correct = TRUE, ...): Chi-squared approximation may be
## incorrect

## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation may
## be incorrect

## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |         Expected Values |
## | Chi-square contribution |
## |             Row Percent |
## |          Column Percent |
## |           Total Percent |
## |-------------------------|
## 
## Total Observations in Table:  15 
## 
##              |  
##              | Survived  |     Died  | Row Total | 
## -------------|-----------|-----------|-----------|
##        Women |        5  |        0  |        5  | 
##              |    2.333  |    2.667  |           | 
##              |    3.048  |    2.667  |           | 
##              |  100.000% |    0.000% |   33.333% | 
##              |   71.429% |    0.000% |           | 
##              |   33.333% |    0.000% |           | 
## -------------|-----------|-----------|-----------|
##          Men |        2  |        8  |       10  | 
##              |    4.667  |    5.333  |           | 
##              |    1.524  |    1.333  |           | 
##              |   20.000% |   80.000% |   66.667% | 
##              |   28.571% |  100.000% |           | 
##              |   13.333% |   53.333% |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        7  |        8  |       15  | 
##              |   46.667% |   53.333% |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  8.571     d.f. =  1     p =  0.003415 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  5.658     d.f. =  1     p =  0.01737 
## 
##  
## Fisher's Exact Test for Count Data
## ------------------------------------------------------------
## Sample estimate odds ratio:  Inf 
## 
## Alternative hypothesis: true odds ratio is not equal to 1
## p =  0.006993 
## 95% confidence interval:  1.688 Inf 
## 
## Alternative hypothesis: true odds ratio is less than 1
## p =  1 
## 95% confidence interval:  0 Inf 
## 
## Alternative hypothesis: true odds ratio is greater than 1
## p =  0.006993 
## 95% confidence interval:  2.365 Inf 
## 
## 
##  
##        Minimum expected frequency: 2.333 
## Cells with Expected Frequency < 5: 3 of 4 (75%)

Harrell’s Example of Poisson Regression & Restricted Cubic Splines: Sicilian hospital admissions for acute coronary events after a smoking ban in Sicily 10 Jan 2005 (Bernal et al. 2017)

Bernal et al. 2017

# options (prType ='latex') # applies to printingmodelfits
getHdata (sicily ) #fetch dataset from hbiostat.org/data
d<-sicily
dd<-datadist (d); options (datadist = 'dd')
g <- function (x) exp(x) * 100000
off <-list ( stdpop = mean (d$ stdpop )) # offset for prediction (383464.4)
w <- geom_point (aes(x=time , y= rate ), data =d)
v <- geom_vline (aes( xintercept =37 , col=I('red')))
yl <- ylab ('Acute Coronary Cases Per 100,000')
f <- Glm(aces ~ offset(log(stdpop)) + rcs(time,6), data=d, family = poisson)
f$aic

## [1] 721.5

Plot the data

ggplot(Predict(f, fun=g, offset =off)) + w + v + yl

Save knot locations

k <- attr (rcs(d$time, 6), 'parms')
k

## [1]  5.00 14.34 24.78 35.22 45.66 55.00

kn <- k

rcspline.eval is the rcs workhorse

Now fit seasonal patterns with a cosine function, 12-month period

h <-  function (x) cbind(rcspline.eval (x, kn),
                         sin=sin (2*pi*x/12), cos=cos (2*pi*x/12))
f <- Glm( aces ~ offset (log( stdpop )) + gTrans (time, h),
         data =d, family = poisson )
f$aic

## [1] 674.1

ggplot (Predict(f, fun=g, offset =off)) + w + v + yl

Next add more nodes near intervention to allow for sudden change

kn <- sort (c(k, c(36 , 37, 38)))
f <-Glm( aces ~ offset (log(stdpop)) + gTrans (time , h),
         data =d, family = poisson )
f$aic

## [1] 661.8

ggplot(Predict (f, fun=g, offset =off)) + w + v + yl

Now make the slow trend simpler (6 knots) and more finely control times at which predictions are requested, to handle discontinuity.

# Harrell uses greater than or equal 37, but I don't know how to code that
h <- function (x) cbind ( rcspline.eval (x, k),
                         sin=sin (2*pi*x/12) , cos=cos (2*pi*x/12),
                         jump =x > 36.999)
f <-  Glm( aces ~ offset (log( stdpop )) + gTrans (time, h),
         data =d, family = poisson)
f$aic

## [1] 659.6

times <- sort (c(seq (0, 60, length =200), 36.998 , 37, 37.001 ))

Final plot

ggplot(Predict(f, time =times ,fun=g, offset =off)) + w + v + yl

# Look at fit statistics, especially evidence for the jump
# summary(f)
# summary generates Latex code which is unreadable, i'll regenrate f with glm
f2 <-  glm( aces ~ offset (log( stdpop )) + gTrans (time, h),
           data =d, family = poisson)
# These two lines not in Harrell's code, but match his output.
summary(f2)

## 
## Call:
## glm(formula = aces ~ offset(log(stdpop)) + gTrans(time, h), family = poisson, 
##     data = d)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)            -6.21185    0.00947 -656.01  < 2e-16
## gTrans(time, h)time     0.06355    0.01128    5.63  1.8e-08
## gTrans(time, h)time'   -0.19119    0.04334   -4.41  1.0e-05
## gTrans(time, h)time''   0.26531    0.07600    3.49  0.00048
## gTrans(time, h)time''' -0.24086    0.09245   -2.61  0.00918
## gTrans(time, h)sin      0.03434    0.00672    5.11  3.2e-07
## gTrans(time, h)cos      0.03796    0.00647    5.86  4.5e-09
## gTrans(time, h)jump    -0.12681    0.03127   -4.06  5.0e-05
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 308.52  on 58  degrees of freedom
## Residual deviance: 138.88  on 51  degrees of freedom
## AIC: 659.6
## 
## Number of Fisher Scoring iterations: 3

anova(f2,test="Chisq")

## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: aces
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                               58        308         
## gTrans(time, h)  7      170        51        139   <2e-16

# Strong evidence for a jump effect
# gTrans(time, h)jump    -0.126814   0.031270   -4.055 5.00e-05 ***
exp(-0.126814) # 0.8808975

## [1] 0.8809

There was a decrease of 12% in cardiovascular events resulting from the cigarette ban with strong evidence for seasonality

GPT-4 Sicilian cardiovascular event example.

# Sicily_redux_Working
# Prompt by Eugene.Gallagher@umb.edu, 7/10/23 8:20 AM
# GPT-4 Prompt
# Sicily Smoking Ban
# Using R, analyze the interrupted monthly time series of Acute Cardiac Events
# (aces) from Sicily and the effects of a ban of smoking at all indoor locations
# at the start of 2005 (time = 37). The data are available at 
# rms::getHdata(sicily). The analysis is described in Bernal et al. 2017
# (International Journal of  Epidemiology 46: 348-355, 
# academic.oup.com/ije/article/46/1/348/2622842).  Fit the aces response with
# a Poisson generalized linear model with an offset of population size,
# log(stdpop). Use a restricted cubic spline regression with 6 knots to assess
# seasonal effects on aces. Model the annual pattern in aces with sin and cos
# functions of time. Assess the intervention of a smoking ban at month 37 on
# aces and report model estimates and standard errors. Document the percentage
# change of the intervention on aces with a 95% confidence interval. Use ggplot2
# to graphically display the results with 95% confidence intervals and a legend,
# noting the intervention point at month 37 and actual data plotted with points.

# Load the required libraries and data
# library(rms)
# library(ggplot2)

# Load the data
getHdata(sicily)

# Create the variables necessary for analysis
sicily$time <- 1:nrow(sicily)
sicily$year <- sicily$time / 12
sicily$smoking_ban <- ifelse(sicily$time >= 37, 1, 0)

# Fit the Poisson generalized linear model
fit <- glm(aces ~ rcs(time, 6) + I(sin(2 * pi * year)) + I(cos(2 * pi * year)) + offset(log(stdpop)) + smoking_ban,
           family = poisson, data = sicily)

# Extract the estimated effects and standard errors
coefs <- summary(fit)$coefficients
smoking_ban_effect <- coefs["smoking_ban", "Estimate"]
smoking_ban_se <- coefs["smoking_ban", "Std. Error"]

# Compute the percentage change of the intervention on ACES
percentage_change <- exp(smoking_ban_effect) - 1
percentage_change_ci <- exp(smoking_ban_effect + qnorm(c(0.025, 0.975)) * smoking_ban_se) - 1

# Print the results
print(coefs)

##                        Estimate Std. Error    z value  Pr(>|z|)
## (Intercept)           -6.287547   0.021198 -296.60774 0.000e+00
## rcs(time, 6)time       0.009869   0.002453    4.02277 5.752e-05
## rcs(time, 6)time'     -0.032573   0.026369   -1.23529 2.167e-01
## rcs(time, 6)time''     0.061630   0.076209    0.80869 4.187e-01
## rcs(time, 6)time'''    0.005871   0.099571    0.05896 9.530e-01
## rcs(time, 6)time''''  -0.097929   0.099016   -0.98902 3.227e-01
## I(sin(2 * pi * year))  0.040166   0.006866    5.84955 4.929e-09
## I(cos(2 * pi * year))  0.037582   0.006474    5.80532 6.424e-09
## smoking_ban           -0.146324   0.031578   -4.63380 3.590e-06

print(smoking_ban_effect)

## [1] -0.1463

print(smoking_ban_se)

## [1] 0.03158

print(percentage_change)

## [1] -0.1361

print(percentage_change_ci)

## [1] -0.18797 -0.08097

# Graphically display the results
sicily$fitted <- predict(fit, type = "response")
ggplot(sicily, aes(x = time)) +
  geom_point(aes(y = aces)) +
  geom_line(aes(y = fitted), color = "blue") +
  geom_vline(xintercept = 37, linetype = "dashed", color = "red") +
  geom_ribbon(aes(ymin = fitted - 1.96 * sqrt(fitted), ymax = fitted + 1.96 * sqrt(fitted)),
              alpha = 0.1) +
  labs(title = "Effects of Smoking Ban on ACES in Sicily",
       x = "Time (months)",
       y = "Acute Cardiac Events",
       caption = paste0("Estimated smoking ban effect: ", round(percentage_change * 100, 2),
                        "% (95% CI: ", round(percentage_change_ci[1] * 100, 2), "% to ",
                        round(percentage_change_ci[2] * 100, 2), "%)")) +
  theme_minimal()

Note only 3 classes left

Next Tuesday (11/21/23): Multilevel (mixed) models

Readings for next week on Multilevel (Mixed) models

Hurlbert (1984) on pseudoreplication
Chapter 14 Multifactor studies without replication
Chapter 16 Nested, repeated measures (longitudinal), mixed, and other multilevel/hierarchical response models.
Andrews Chapter 12 on Multilevel models

Week 11 GLM Poisson and other GLMs: Sleuth Chapter 22 & Harrell on Regression Modeling Stratgies

Eugene D. Gallagher

11/14/23