Overview of crab data * Data obtained from Alan Agresti’s Introduction to Categorical Data Analysis book (Agresti 2018). * The data measure whether or not a female horseshoe crab has additional male suitors beyond their current mate, essentially. The outcome is binary. * There are four candidate predictors including: shell width, weight, shell color, spine condition.
The data contain four potential predictors
#R demo for ADA SDNS webinar
#An Introduction to Model Uncertainty and Averaging for Categorical Data Analysis
#Chris Franck, June 21, 2022
#Import data
Crabs<-read.table("http://www.stat.ufl.edu/~aa/cat/data/Crabs.dat",
header=TRUE)
#Probability predictions for simple logistic regression
#maximum likelihood
grid<-seq(0,35,.1)
plot(Crabs$width,jitter(Crabs$y,.1),pch=16,ylab='Satellites?',xlab="Shell width")
fit<-glm(y~width,family=binomial,data=Crabs)
grid.frame<-data.frame(width=grid)
grid.pred<-predict.glm(fit,newdata=grid.frame,type='response')
lines(grid, grid.pred,lwd=2,col='red')
#plot the data
par(mfrow=c(2,2),las=1)
plot(Crabs$width,jitter(Crabs$y,.1),pch=16,ylab='Satellites?',xlab="Shell width")
plot(jitter(Crabs$color,.3),jitter(Crabs$y,.1),pch=16,ylab='Satellites?',xlab="Shell color")
plot(jitter(Crabs$spine,.3),jitter(Crabs$y,.1),pch=16,ylab='Satellites?',xlab="Spine condition")
plot(Crabs$weight,jitter(Crabs$y,.1),pch=16,ylab="Satellites?")
Multicollinearity is a concern
#A look at potential multicollinearity
y.j<-Crabs$y+rnorm(173,0,.1)
color.j<-Crabs$color+rnorm(173,0,.1)
spine.j<-Crabs$spine+rnorm(173,0,.1)
pairs(data.frame(y=y.j,width=Crabs$width, color=color.j, spine=spine.j, weight=Crabs$weight),
pch=16)
Fit the Poisson Log-Linear Regression Model
Statistical Conclusion
The data provided no evidence of a leveling off of mating with age (quadratic term 1-sided p value = 0.33)
The estimated mean number of successful matings (in 8 years) for 27-year-old males was 1.31, and the mean increased by a factor of 2.00 for each 10 year increase in age up to about 50 years (95% confidence interval for the mulitplicative factor: 1.52 to 2.60)
Scope of Inference
There may be bias due to measurement error in estimating the age of each elephant and lack of independence.
Bias would also accrue if successful matings weren’t successfully attributed.
Attempting to apply to a larger population would requie careful attention to the method of sampling.
Sleuth 3 Code for Case Study 20.1
str(case2201)
## 'data.frame': 41 obs. of 2 variables:
## $ Age : int 27 28 28 28 28 29 29 29 29 29 ...
## $ Matings: int 0 1 1 1 3 0 0 0 2 2 ...
attach(case2201)
EXPLORATION AND MODEL BUILDING
plot(Matings ~ Age, log="y")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 5 y values <= 0 omitted from
## logarithmic plot
ageSquared <- Age^2
myGlm1 <- glm(Matings ~ Age + ageSquared, family=poisson)
summary(myGlm1) # No evidence of a need for ageSquared
##
## Call:
## glm(formula = Matings ~ Age + ageSquared, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.0356383 -0.941 0.347
## Age 0.1359544 0.1580095 0.860 0.390
## ageSquared -0.0008595 0.0020124 -0.427 0.669
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
INFERENCE AND INTERPRETATION
myGlm2 <- update(myGlm1, ~ . - ageSquared)
summary(myGlm2)
##
## Call:
## glm(formula = Matings ~ Age, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## Age 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
anova(myGlm2,myGlm1,test="Chisq") # May not be valid, many counts < 5
## Analysis of Deviance Table
##
## Model 1: Matings ~ Age
## Model 2: Matings ~ Age + ageSquared
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 39 51.012
## 2 38 50.826 1 0.18544 0.6667
beta <- myGlm2$coef
exp(beta[2]) #1.071107
## Age
## 1.071107
exp(confint(myGlm2,2)) #1.042558 1.100360
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 1.042558 1.100360
Interpretation: Associated with each 1 year increase in age is a 7% increase in the mean number of matings (95% confidence interval 4% to 10% increase).
GRAPHICAL DISPLAY FOR PRESENTATION
plot(Matings ~ Age, ylab="Number of Successful Matings",
xlab="Age of Male Elephant (Years)",
main="Age and Number of Successful Matings for 41 African Elephants",
pch=21, bg="green", cex=2, lwd=2)
dummyAge <- seq(min(Age),max(Age), length=50)
lp <- beta[1] + beta[2]*dummyAge
curve <- exp(lp)
lines(curve ~ dummyAge,lwd=2)
detach(case2201)
Adam Loy’s tidysleuth code
options(digits = 4, show.signif.stars = FALSE)
Plot the jittered data on a log scale
gf_jitter(log(Matings + 0.5) ~ Age, data = case2201, height = 0.25, width = 0.25, pch = 1) %>%
gf_labs(x = "Age (years) -- Slightly Jittered", y = "Number of Matings (log scale)")
Fit the Poisson glm, with quadratic terms.
glm1 <- glm(Matings ~ Age + I(Age^2), data = case2201, family = poisson)
summary(glm1)
##
## Call:
## glm(formula = Matings ~ Age + I(Age^2), family = poisson, data = case2201)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.85741 3.03564 -0.94 0.35
## Age 0.13595 0.15801 0.86 0.39
## I(Age^2) -0.00086 0.00201 -0.43 0.67
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.3
##
## Number of Fisher Scoring iterations: 5
summary(fitted(glm1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.21 1.44 2.16 2.68 3.81 6.61
tidy(glm1)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.86 3.04 -0.941 0.347
## 2 Age 0.136 0.158 0.860 0.390
## 3 I(Age^2) -0.000860 0.00201 -0.427 0.669
Eliminate the quadratic term
reduced <- update(glm1, . ~ . - I(Age^2))
summary(reduced)
##
## Call:
## glm(formula = Matings ~ Age, family = poisson, data = case2201)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5820 0.5446 -2.9 0.0037
## Age 0.0687 0.0137 5.0 5.8e-07
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.5
##
## Number of Fisher Scoring iterations: 5
Calculate Wald confidence interval
beta1 <- coef(reduced)[2]
se <- sqrt(vcov(reduced)[2,2])
beta1 + c(-1, 1) * qnorm(.975) * se
## [1] 0.04175 0.09563
exp(beta1 + c(-1, 1) * qnorm(.975) * se)
## [1] 1.043 1.100
Profile-likelihood confidence interval
confint(glm1)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -9.044944 2.893178
## Age -0.163750 0.457734
## I(Age^2) -0.004968 0.002948
Compare the full and reduced models
anova(reduced, glm1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Matings ~ Age
## Model 2: Matings ~ Age + I(Age^2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 39 51.0
## 2 38 50.8 1 0.185 0.67
The Del Norte Salamander (plethodon elongates) is a small (5-7 cm) salamander found among rock rubble, rock outcrops, and moss-covered talus in a narrow range of norwest California.
To study the relationship to old-growth forest, researchers selected 47 sites from plausible salamander habitat in national forest and parkland. At each site, a 7 m x 7 m search area was examined for the number of salamanders. Display 22.3 shows the number of salamanders and characteristics of the forest cover.
Statistical Conclusion
After accounting for canopy cover, there is no evidence that the mean number os salamanders depends on forest age ( p = 0.78 from F = 0.53 with 6 and 35 df from a quasilikelihood fit to a log-linear model).
Two distinct canopy cover conditions were evident: closed canopies with percentages > 70% and open canopy with percentages < 70%. The salamanders followed different curved abundance patterns under the two conditions.
Plot of Deviance residuals
Sleuth 3 Code for Case Study 20.2
EXPLORATION AND MODEL BUILDING
attach(case2202)
logSalamanders <- log(Salamanders + .5)
logForestAge <- log(ForestAge + .5)
myMatrix <- cbind(PctCover,logForestAge,logSalamanders)
if (require(car)) { # Use car library
scatterplotMatrix(myMatrix, diagonal="histogram", reg.line=FALSE, spread=FALSE)
}
Fit the Poisson log-linear model
myGlm1 <- glm(Salamanders ~ PctCover + logForestAge + PctCover:logForestAge,
family=poisson)
summary(myGlm1) # Backward elimination...
##
## Call:
## glm(formula = Salamanders ~ PctCover + logForestAge + PctCover:logForestAge,
## family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.96945 0.82093 -1.18 0.2376
## PctCover 0.02907 0.01118 2.60 0.0094
## logForestAge -0.21311 0.29301 -0.73 0.4670
## PctCover:logForestAge 0.00196 0.00342 0.57 0.5674
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 190.22 on 46 degrees of freedom
## Residual deviance: 120.70 on 43 degrees of freedom
## AIC: 213.8
##
## Number of Fisher Scoring iterations: 5
myGlm2 <- update(myGlm1, ~ . - PctCover:logForestAge)
summary(myGlm2)
##
## Call:
## glm(formula = Salamanders ~ PctCover + logForestAge, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.37255 0.49574 -2.77 0.0056
## PctCover 0.03443 0.00653 5.27 1.3e-07
## logForestAge -0.05470 0.10035 -0.55 0.5857
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 190.22 on 46 degrees of freedom
## Residual deviance: 121.01 on 44 degrees of freedom
## AIC: 212.1
##
## Number of Fisher Scoring iterations: 5
myGlm3 <- update(myGlm2, ~ . - logForestAge)
summary(myGlm3) # PctCover is the only explanatory variable remaining
##
## Call:
## glm(formula = Salamanders ~ PctCover, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.48196 0.45578 -3.25 0.0011
## PctCover 0.03241 0.00539 6.01 1.8e-09
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 190.22 on 46 degrees of freedom
## Residual deviance: 121.31 on 45 degrees of freedom
## AIC: 210.4
##
## Number of Fisher Scoring iterations: 5
Plot the penultimate model
plot(Salamanders ~ PctCover) # It appears that there are 2 distributions
# of Salamander counts; one for PctCover < 70 and one for PctCover > 70
See if PctCover is associated Salamanders in each subset
myGlm4 <- glm(Salamanders ~ PctCover, family=poisson,subset=(PctCover > 70))
summary(myGlm4) # No evidence of an effect for this subset
##
## Call:
## glm(formula = Salamanders ~ PctCover, family = poisson, subset = (PctCover >
## 70))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8240 2.0652 -0.88 0.38
## PctCover 0.0364 0.0236 1.54 0.12
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 105.46 on 27 degrees of freedom
## Residual deviance: 102.98 on 26 degrees of freedom
## AIC: 177.4
##
## Number of Fisher Scoring iterations: 6
myGlm5 <- glm(Salamanders ~ PctCover, family=poisson,subset=(PctCover < 70))
summary(myGlm5) # No evidence on this subset either
##
## Call:
## glm(formula = Salamanders ~ PctCover, family = poisson, subset = (PctCover <
## 70))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.96968 0.55081 -1.76 0.078
## PctCover 0.00561 0.02180 0.26 0.797
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 16.613 on 18 degrees of freedom
## Residual deviance: 16.548 on 17 degrees of freedom
## AIC: 35.16
##
## Number of Fisher Scoring iterations: 6
INFERENCE (2 means)
Group <- ifelse(PctCover > 70,"High","Low")
Group <- factor(Group, levels=c("Low","High") ) # Make "Low Cover" the ref group
myGlm6 <- glm(Salamanders ~ Group, family=poisson)
summary(myGlm6)
##
## Call:
## glm(formula = Salamanders ~ Group, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.865 0.354 -2.45 0.014
## GroupHigh 2.215 0.366 6.05 1.5e-09
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 190.22 on 46 degrees of freedom
## Residual deviance: 122.07 on 45 degrees of freedom
## AIC: 211.1
##
## Number of Fisher Scoring iterations: 5
GRAPHICAL DISPLAY FOR PRESENTATION
plot(Salamanders ~ PctCover, ylab="Number of Salamanders",
xlab="Percentage of Canopy Covered",
main="Number of Salamanders versus Percent Canopy Cover",
pch=21,bg="green", cex=2, lwd=2)
beta <- myGlm6$coef
lines(c(0,55),exp(c(beta[1],beta[1])),lwd=2)
text(56,exp(beta[1]),paste("mean= ",round(exp(beta[1]),1)),adj=0)
lines(c(76,93),exp(c(beta[1]+beta[2],beta[1]+beta[2])),lwd=2)
text(56,exp(beta[1]+beta[2]),paste("mean=",round((beta[1]+beta[2]),1)),adj=-1)
detach(case2202)
What a poor R graphic
Adam Loy’s (Carleton) tidySleuth case 22.2 code
options(digits = 4, show.signif.stars = FALSE)
summary(case2202)
## EXPLORATION AND MODEL BUILDING
logSalamanders <- log(Salamanders + .5)
logForestAge <- log(ForestAge + .5)
myMatrix <- cbind(PctCover,logForestAge,logSalamanders)
if (require(car)) { # Use car library
scatterplotMatrix(myMatrix, diagonal="histogram", reg.line=FALSE, spread=FALSE)
}
_Fit the Poisson glm__
myGlm1 <- glm(Salamanders ~ PctCover + logForestAge + PctCover:logForestAge,
family=poisson)
summary(myGlm1) # Backward elimination...
myGlm2 <- update(myGlm1, ~ . - PctCover:logForestAge)
summary(myGlm2)
myGlm3 <- update(myGlm2, ~ . - logForestAge)
summary(myGlm3) # PctCover is the only explanatory variable remaining
Plot the data
plot(Salamanders ~ PctCover) # It appears that there are 2 distributions
Number of Salamander counts; one for PctCover < 70 and one for PctCover > 70
See if PctCover is associated Salamanders in each subset
myGlm4 <- glm(Salamanders ~ PctCover, family=poisson,subset=(PctCover > 70))
summary(myGlm4) # No evidence of an effect for this subset
myGlm5 <- glm(Salamanders ~ PctCover, family=poisson,subset=(PctCover < 70))
summary(myGlm5) # No evidence on this subset either
INFERENCE (2 means)
Group <- ifelse(PctCover > 70,"High","Low")
Group <- factor(Group, levels=c("Low","High") ) # Make "Low Cover" the ref group
myGlm6 <- glm(Salamanders ~ Group, family=poisson)
summary(myGlm6)
# Using 70% as the dividing line
case2202$Closed <- ifelse(case2202$PctCover > 70,"closed", "open")
ssom <- glm(Salamanders ~ PctCover * ForestAge * Closed + I(PctCover^2) + I(ForestAge^2) +
I(PctCover^2):Closed + I(ForestAge^2):Closed,
data = case2202, family = poisson)
summary(ssom)
1 - pchisq(89.178, df = 35)
gf_point(residuals(ssom, type = "deviance") ~ fitted(ssom)) %>%
gf_hline(yintercept = 0, color = "blue") %>%
gf_hline(yintercept = 2, color = "gray60", linetype = 2) %>%
gf_hline(yintercept = -2, color = "gray60", linetype = 2) %>%
gf_labs(x = "Fitted means", y = "Deviance residual")
# The full model via quasi-likelihood
ssom <- update(ssom, . ~ ., family = quasipoisson)
# The reduced (inferential) model
inferential_model <- glm(Salamanders ~ PctCover * Closed + I(PctCover^2) +
I(PctCover^2):Closed, data = case2202, family = quasipoisson)
# drop-in-deviance F test
anova(inferential_model, ssom, test = "F")
Plot the model
aug <- augment(inferential_model)
gf_point(Salamanders ~ PctCover, data = aug) %>%
gf_line(exp(.fitted) ~ PctCover, color = ~Closed) %>%
gf_labs(x = "Percentage of Canopy Cover",
y = "Salamander Count")
What can keep a sample of data from being appropriate for modeling
Most important predictor or response variables not collected
Subjects in the dataset are ill-defined or not representative of the population to which inferences are needed
Data collection sites do not represent the population of sites
Key variables missing in large numbers of subjects
Data not missing at random
No operational definitions for key variables and/or measurement errors severe
No observer variability studies done
What else can go wrong in modeling?
The process generating the data is not stable.
The model is misspecified with regard to nonlinearities or interactions, or there are predictors missing.
The model is misspecified in terms of the transformation of the response variable or the model’s distributional assumptions.
The model contains discontinuities (e.g., by categorizing continuous predictors or setting regression shapes with sudden changes) that can be gamed by users.
Correlations among subjects are not specified, or the correlation structure is misspecified, resulting in inefficient parameter estimates and overconfident inference.
The model is overfitted, resulting in predictions that are too extreme or positive associations that are false.
The user of the model relies on predictions obtained by extrapolating to combinations of predictor values well outside the range of the dataset used to develop the model.
Accurate and discriminating predictions can lead to behavior changes that make future predictions inaccurate.
Level playing field (independent datasets, same number of candidate d.f., careful bootstrapping)
Criteria
calibration
discrimination
face validity
measurement errors in required predictors
use of continuous predictors (which are usually better defined than categorical ones)
omission of “insignicant”variables that nonetheless make sense as risk factors
simplicity (though this is less important with the availability of computers)
lack of fit for specific types of subjects
** Preferred Modeling Strategy in a Nutshell**
Decide how many d.f. can be spent
Decide where to spend them
Spend them
Don’t reconsider, especially if inference needed
## Gallagher changed these 3 statements
#titanic_data <- getHdata(titanic3)
#str(titanic_data)
#summary(titanic_data)
getHdata(titanic3)
str(titanic3)
## 'data.frame': 1309 obs. of 14 variables:
## $ pclass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : 'labelled' int 1 1 0 0 0 1 1 0 1 0 ...
## ..- attr(*, "label")= chr "Survived"
## $ name : 'labelled' chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Crei" ...
## ..- attr(*, "label")= chr "Name"
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : 'labelled' num 29 0.917 2 30 25 ...
## ..- attr(*, "units")= chr "Year"
## ..- attr(*, "label")= chr "Age"
## $ sibsp : 'labelled' int 0 1 1 1 1 0 1 0 2 0 ...
## ..- attr(*, "label")= chr "Number of Siblings/Spouses Aboard"
## $ parch : 'labelled' int 0 2 2 2 2 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "Number of Parents/Children Aboard"
## $ ticket : 'labelled' chr "24160" "113781" "113781" "113781" ...
## ..- attr(*, "label")= chr "Ticket Number"
## $ fare : 'labelled' num 211 152 152 152 152 ...
## ..- attr(*, "units")= chr "British Pound (\\243)"
## ..- attr(*, "label")= chr "Passenger Fare"
## $ cabin : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
## $ embarked : Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ boat : Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
## $ body : 'labelled' int NA NA NA 135 NA NA NA NA NA 22 ...
## ..- attr(*, "label")= chr "Body Identification Number"
## $ home.dest: 'labelled' chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
## ..- attr(*, "label")= chr "Home/Destination"
summary(titanic3)
## pclass survived name sex age
## 1st:323 Min. :0.000 Length:1309 female:466 Min. : 0.167
## 2nd:277 1st Qu.:0.000 Class :labelled male :843 1st Qu.:21.000
## 3rd:709 Median :0.000 Mode :character Median :28.000
## Mean :0.382 Mean :29.881
## 3rd Qu.:1.000 3rd Qu.:39.000
## Max. :1.000 Max. :80.000
## NA's :263
## sibsp parch ticket fare
## Min. :0.000 Min. :0.000 Length:1309 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:0.000 Class :labelled 1st Qu.: 7.9
## Median :0.000 Median :0.000 Mode :character Median : 14.5
## Mean :0.499 Mean :0.385 Mean : 33.3
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.: 31.3
## Max. :8.000 Max. :9.000 Max. :512.3
## NA's :1
## cabin embarked boat body
## :1014 Cherbourg :270 :823 Min. : 1
## C23 C25 C27 : 6 Queenstown :123 13 : 39 1st Qu.: 72
## B57 B59 B63 B66: 5 Southampton:914 C : 38 Median :155
## G6 : 5 NA's : 2 15 : 37 Mean :161
## B96 B98 : 4 14 : 33 3rd Qu.:256
## C22 C26 : 4 4 : 31 Max. :328
## (Other) : 271 (Other):308 NA's :1188
## home.dest
## Length:1309
## Class :labelled
## Mode :character
##
##
##
##
set.seed(123) # for reproducibility
## Author changed this line:
# impute_data <- aregImpute(~ survived + sex + age + pclass, data = titanic_data, n.impute = 5)
imp <- aregImpute(~ survived + sex + age + pclass, data = titanic3, n.impute = 5)
## Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5
dd <- datadist(titanic3)
options(datadist="dd")
fit <- fit.mult.impute(survived ~ (sex + pclass + rcs(age,5))^2,lrm,
imp, data=titanic3, pr=FALSE)
summary(fit)
## Effects Response : survived
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## age 21 39 18 -0.7211 0.4921 -1.6856 0.2433
## Odds Ratio 21 39 18 0.4862 NA 0.1853 1.2755
## sex - female:male 2 1 NA 1.3399 0.3930 0.5696 2.1102
## Odds Ratio 2 1 NA 3.8187 NA 1.7676 8.2500
## pclass - 1st:3rd 3 1 NA 1.1200 0.4415 0.2547 1.9853
## Odds Ratio 3 1 NA 3.0649 NA 1.2901 7.2814
## pclass - 2nd:3rd 3 2 NA -0.5871 0.4390 -1.4476 0.2735
## Odds Ratio 3 2 NA 0.5560 NA 0.2351 1.3145
##
## Adjusted to: sex=male pclass=3rd age=28
print(anova(fit)) # To test for the significance of the interactions.
## Wald Statistics Response: survived
##
## Factor Chi-Square d.f. P
## sex (Factor+Higher Order Factors) 239.24 7 <.0001
## All Interactions 49.36 6 <.0001
## pclass (Factor+Higher Order Factors) 120.11 12 <.0001
## All Interactions 38.92 10 <.0001
## age (Factor+Higher Order Factors) 30.86 16 0.0140
## All Interactions 11.50 12 0.4869
## Nonlinear (Factor+Higher Order Factors) 12.66 12 0.3940
## sex * pclass (Factor+Higher Order Factors) 28.86 2 <.0001
## sex * age (Factor+Higher Order Factors) 4.45 4 0.3488
## Nonlinear 3.77 3 0.2873
## Nonlinear Interaction : f(A,B) vs. AB 3.77 3 0.2873
## pclass * age (Factor+Higher Order Factors) 7.66 8 0.4674
## Nonlinear 7.33 6 0.2911
## Nonlinear Interaction : f(A,B) vs. AB 7.33 6 0.2911
## TOTAL NONLINEAR 12.66 12 0.3940
## TOTAL INTERACTION 54.69 14 <.0001
## TOTAL NONLINEAR + INTERACTION 57.70 17 <.0001
## TOTAL 274.06 21 <.0001
#newdata1 <- expand.grid(age = seq(min(titanic_data$age, na.rm = TRUE),
# max(titanic_data$age, na.rm = TRUE),
# length = 100),
# sex = c("male", "female"),
# pclass = "1st")
## Author replaced this statement with Harrell's Predict
# newdata1$predict <- predict(fit, newdata = newdata1, type = "fitted", se.fit = TRUE)
pred <- Predict(fit, age = seq(0, 80, length = 100), sex = c('male', 'female'),
pclass = c('1st','2nd','3rd'), fun=plogis)
pred_df <- as.data.frame(pred)
pclass1_df <- subset(pred_df, pclass == "1st")
pclass2_df <- subset(pred_df, pclass == "2nd")
pclass3_df <- subset(pred_df, pclass == "3rd")
plot1 <- ggplot(pclass1_df, aes(x = age, y = yhat, color = sex)) +
geom_line() +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
labs(x = "Age", y = "Probability of Survival", color = "Sex", title = "Pclass 1") +
theme_minimal()
plot2 <- ggplot(pclass2_df, aes(x = age, y = yhat, color = sex)) +
geom_line() +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
labs(x = "Age", y = "Probability of Survival", color = "Sex", title = "Pclass 2") +
theme_minimal()
plot3 <- ggplot(pclass3_df, aes(x = age, y = yhat, color = sex)) +
geom_line() +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
labs(x = "Age", y = "Probability of Survival", color = "Sex", title = "Pclass 3") +
theme_minimal()
# Arrange the plots vertically in a column
grid.arrange(plot1, plot2, plot3, nrow = 3)
# Read the data from Gallagher's github site (public has read, not edit access)
Donner <- read.csv("https://raw.githubusercontent.com/EugeneGall/donner-data-analysis/main/Donner.csv")
# Calculate family size based on Last_Name, but Family_Group_Size from Grayson
# (1990, 2018) Table will be used in this code's analyses.
Donner$Family_Size <- as.integer(ave(Donner$Last_Name, Donner$Last_Name, FUN =length))
str(Donner)
## 'data.frame': 87 obs. of 13 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num 23 13 1 5 14 40 51 9 3 7 ...
## $ Sex : chr "Male" "Male" "Female" "Male" ...
## $ Status : chr "Died" "Survived" "Survived" "Survived" ...
## $ Last_Name : chr "Antonio" "Breen" "Breen" "Breen" ...
## $ First_Name : chr "" "Edward" "Isabella" "James" ...
## $ Family_Group_Size: int 1 9 9 9 9 9 9 9 9 9 ...
## $ First_Snow : chr "1846-10-28" "1846-10-28" "1846-10-28" "1846-10-28" ...
## $ Last_Date : chr "1846-12-25" "1847-04-29" "1847-04-29" "1847-04-29" ...
## $ Delete : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Employee : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Comments : chr "Cattleherder" "Familysurvivedcompletely" "Familysurvivedcompletely" "Familysurvivedcompletely" ...
## $ Family_Size : int 1 9 9 9 9 9 9 9 9 9 ...
# This produced slightly different Family Sizes than Grayson (1990 Table 1,
# 2018 Table 2.1. Used Grayson (1990, 2018) Family Group Sizes with included
# family links using data from Stewart's (1960, p 363-364) Donner Party Roster.
# Convert Status to binary
Donner$Status <- ifelse(Donner$Status == "Survived", 1, 0)
# Calculate the Survival_Time which is Death Date or last rescue (Lewis
# Keseberg 4/29/1847) - First_Snow, October 28 1846.
# Convert the character strings to Date objects
Donner$First_Snow_Date <- as.Date(Donner$First_Snow, format="%Y-%m-%d")
Donner$Last_Date_Date <- as.Date(Donner$Last_Date, format="%Y-%m-%d")
# Calculate the difference in days for individual survivorship
Donner$Survival_Time <- as.numeric(Donner$Last_Date_Date - Donner$First_Snow_Date)
# View the first few rows of the data frame to confirm the results
head(Donner)
## ID Age Sex Status Last_Name First_Name Family_Group_Size First_Snow
## 1 1 23 Male 0 Antonio 1 1846-10-28
## 2 2 13 Male 1 Breen Edward 9 1846-10-28
## 3 3 1 Female 1 Breen Isabella 9 1846-10-28
## 4 4 5 Male 1 Breen James 9 1846-10-28
## 5 5 14 Male 1 Breen John 9 1846-10-28
## 6 6 40 Female 1 Breen Mary 9 1846-10-28
## Last_Date Delete Employee Comments Family_Size
## 1 1846-12-25 0 1 Cattleherder 1
## 2 1847-04-29 0 0 Familysurvivedcompletely 9
## 3 1847-04-29 0 0 Familysurvivedcompletely 9
## 4 1847-04-29 0 0 Familysurvivedcompletely 9
## 5 1847-04-29 0 0 Familysurvivedcompletely 9
## 6 1847-04-29 0 0 Familysurvivedcompletely 9
## First_Snow_Date Last_Date_Date Survival_Time
## 1 1846-10-28 1846-12-25 58
## 2 1846-10-28 1847-04-29 183
## 3 1846-10-28 1847-04-29 183
## 4 1846-10-28 1847-04-29 183
## 5 1846-10-28 1847-04-29 183
## 6 1846-10-28 1847-04-29 183
### Delete 8 cases for individuals who died before the first Snowstorm
# on 1846-10-28. This will reduce the number of cases to 87 - 8 = 79
Donner <- Donner %>%
filter(Delete != 1)
# View the structure to confirm the filtering
str(Donner)
## 'data.frame': 79 obs. of 16 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num 23 13 1 5 14 40 51 9 3 7 ...
## $ Sex : chr "Male" "Male" "Female" "Male" ...
## $ Status : num 0 1 1 1 1 1 1 1 1 1 ...
## $ Last_Name : chr "Antonio" "Breen" "Breen" "Breen" ...
## $ First_Name : chr "" "Edward" "Isabella" "James" ...
## $ Family_Group_Size: int 1 9 9 9 9 9 9 9 9 9 ...
## $ First_Snow : chr "1846-10-28" "1846-10-28" "1846-10-28" "1846-10-28" ...
## $ Last_Date : chr "1846-12-25" "1847-04-29" "1847-04-29" "1847-04-29" ...
## $ Delete : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Employee : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Comments : chr "Cattleherder" "Familysurvivedcompletely" "Familysurvivedcompletely" "Familysurvivedcompletely" ...
## $ Family_Size : int 1 9 9 9 9 9 9 9 9 9 ...
## $ First_Snow_Date : Date, format: "1846-10-28" "1846-10-28" ...
## $ Last_Date_Date : Date, format: "1846-12-25" "1847-04-29" ...
## $ Survival_Time : num 58 183 183 183 183 183 183 183 183 183 ...
# Prepare data for rms, Harrell's 'Regression Modeling Strategies' These
# two statements are required for the summary of the Harrell's Glm
ddist <- datadist(Donner)
## Warning in datadist(Donner): Delete is constant
## Warning in datadist(Donner): First_Snow_Date is constant
options(datadist = "ddist")
# Fit the model with Age, Sex, and Grayson's (1990, 2018) Family_Group_Size
# In deciding on knot size, I used the rule that the minimum knot size should
# be chosen (rcs has a min of 3) unless there is another larger number of knots
# with an AIC which is more than than 4 AIC units lower (Burnham & Anderson
# 2004, p 271)
# Two methods will be coded for finding the appropriate number of knots:
# 1) Brute force by manually plugging in different knot numbers in Glm
# 2) An optimization routine based on AIC
##### First approach: brute force fitting to find minimal AICs.
# 3 is the minimum knot size permitted with Harrell's rcs function
# Fit Model 1: Status ~ rcs(Age, 3) * Sex
Mod1 <- glm(Status ~ rcs(Age,3) * Sex, data = Donner, family = binomial(), x = TRUE, y = TRUE)
# Need Harrell's Glm for effect sizes and graphic with rms::Predict
mod1 <- Glm(Status ~ rcs(Age,3) * Sex, data = Donner, family = binomial(), x = TRUE, y = TRUE)
# Where are the knots located?
k <- attr (rcs(Donner$Age,3) ,'parms')
k
## [1] 1 14 46
# knots located at Ages 1, 14, and 46
# Null model
mod_null <- glm(Status ~ 1, data = Donner, family = binomial())
# Additive model
mod2 <- Glm(Status ~ rcs(Age,3) + Sex, data = Donner, family = binomial(), x = TRUE, y = TRUE)
Mod2 <- glm(Status ~ rcs(Age,3) + Sex, data = Donner, family = binomial(), x = TRUE, y = TRUE)
# Test the models
chi_sq_n1 <- anova(mod_null, Mod1, test="Chisq")
# Print the results
print(chi_sq_n1)
## Analysis of Deviance Table
##
## Model 1: Status ~ 1
## Model 2: Status ~ rcs(Age, 3) * Sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 78 109.2
## 2 73 83.4 5 25.8 9.8e-05
# Summary and ANOVA for Mod1
# Test whether the interaction model adds value.
chi_sq_21 <- anova(Mod2, Mod1, test="Chisq")
# Print the results
print(chi_sq_21)
## Analysis of Deviance Table
##
## Model 1: Status ~ rcs(Age, 3) + Sex
## Model 2: Status ~ rcs(Age, 3) * Sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 75 92.7
## 2 73 83.4 2 9.24 0.0098
# Summaries and effect sizes of the Age(rcs4) * Sex model
summary(Mod1)
##
## Call:
## glm(formula = Status ~ rcs(Age, 3) * Sex, family = binomial(),
## data = Donner, x = TRUE, y = TRUE)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.836 0.811 -1.03 0.3026
## rcs(Age, 3)Age 0.391 0.143 2.73 0.0064
## rcs(Age, 3)Age' -0.938 0.336 -2.79 0.0052
## SexMale 0.853 1.076 0.79 0.4275
## rcs(Age, 3)Age:SexMale -0.403 0.158 -2.55 0.0106
## rcs(Age, 3)Age':SexMale 0.897 0.359 2.50 0.0124
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109.201 on 78 degrees of freedom
## Residual deviance: 83.408 on 73 degrees of freedom
## AIC: 95.41
##
## Number of Fisher Scoring iterations: 5
anova(Mod1)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Status
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 78 109.2
## rcs(Age, 3) 2 8.10 76 101.1 0.0174
## Sex 1 8.45 75 92.7 0.0037
## rcs(Age, 3):Sex 2 9.24 73 83.4 0.0098
AIC(Mod1)
## [1] 95.41
#Summary of Brute force fitting of AICs for interaction model
# mod1 AIC
# 3 knots 95.40752 * Chosen because within 4 of lowest AIC
# 4 knots 92.3582 ** Not chosen because only 3.04932 less than 3 knot solution
# 5 knots 92.88625
# 6 knots Apparently Singular Matrix, no estimation possible
# Fit Family_Group_Size fit with base glm and Harrell's rms::Glm
Mod3 <- glm(Status ~ rcs(Family_Group_Size,5), data = Donner,
family = binomial(), x = TRUE, y = TRUE)
mod3 <- Glm(Status ~ rcs(Family_Group_Size,5), data = Donner,
family = binomial(), x = TRUE, y = TRUE)
# Null model
# Test the models
chi_sq_n2 <- anova(mod_null, Mod3, test="Chisq")
# Print the results
print(chi_sq_n2)
## Analysis of Deviance Table
##
## Model 1: Status ~ 1
## Model 2: Status ~ rcs(Family_Group_Size, 5)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 78 109.2
## 2 74 90.4 4 18.8 0.00087
# Summary and ANOVA for mod
summary(Mod3)
##
## Call:
## glm(formula = Status ~ rcs(Family_Group_Size, 5), family = binomial(),
## data = Donner, x = TRUE, y = TRUE)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.927 0.985 -1.95
## rcs(Family_Group_Size, 5)Family_Group_Size 0.542 0.384 1.41
## rcs(Family_Group_Size, 5)Family_Group_Size' 0.783 2.430 0.32
## rcs(Family_Group_Size, 5)Family_Group_Size'' -3.640 4.762 -0.76
## rcs(Family_Group_Size, 5)Family_Group_Size''' 63.836 36.130 1.77
## Pr(>|z|)
## (Intercept) 0.051
## rcs(Family_Group_Size, 5)Family_Group_Size 0.158
## rcs(Family_Group_Size, 5)Family_Group_Size' 0.747
## rcs(Family_Group_Size, 5)Family_Group_Size'' 0.445
## rcs(Family_Group_Size, 5)Family_Group_Size''' 0.077
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109.201 on 78 degrees of freedom
## Residual deviance: 90.414 on 74 degrees of freedom
## AIC: 100.4
##
## Number of Fisher Scoring iterations: 4
anova(Mod3)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Status
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 78 109.2
## rcs(Family_Group_Size, 5) 4 18.8 74 90.4 0.00087
summary(mod3) # For effect size
## Effects Response : Status
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## Family_Group_Size 4 13 9 -0.7111 0.7485 -2.203 0.7803
AIC(mod3)
## [1] 100.4
AIC(Mod3)
## [1] 100.4
mod4 <- Glm(Status ~ Age + Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)
Mod4 <- glm(Status ~ Age + Sex + rcs(Family_Group_Size,4), data = Donner, family = binomial(), x = TRUE, y = TRUE)
# for mod5, knot sizes of 3 & 5 determined by k-fold cross validation, code below
mod5 <- Glm(Status ~ rcs(Age,3) * Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)
Mod5 <- glm(Status ~ rcs(Age,3) * Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)
##### Second approach: GPT4 Streamlined the above code block with functions:
# Function to fit models with one rcs term, updated on 8/17/23 by GPT-4
fit_best_rcs <- function(formula_template, data, knot_range) {
aic_values <- numeric(length(knot_range))
models <- vector("list", length(knot_range))
# Calculate AIC for each knot size
for (k in knot_range) {
formula <- as.formula(gsub("KNOTS", k, formula_template))
model <- glm(formula, data = data, family = binomial())
aic_values[k - knot_range[1] + 1] <- AIC(model)
models[[k - knot_range[1] + 1]] <- model
}
min_aic <- min(aic_values)
best_knots <- knot_range[which.min(aic_values)]
# Iterate through the aic_values and apply the rules
for (i in seq_along(aic_values)) {
if (aic_values[i] - min_aic <= 4 && knot_range[i] < best_knots) {
best_knots <- knot_range[i]
}
}
return(list(model = models[[which(knot_range == best_knots)]], knots = best_knots, results = setNames(aic_values, knot_range)))
}
# Function to fit model with two rcs terms, updated by GPT-4 on 8/17/23
fit_best_double_rcs <- function(formula, data, knot_range, family = binomial()) {
best_aic <- Inf
best_k_age <- max(knot_range)
best_k_fgs <- max(knot_range)
aic_matrix <- matrix(NA, nrow = length(knot_range), ncol = length(knot_range), dimnames = list(knot_range, knot_range))
for (k_age in knot_range) {
for (k_fgs in knot_range) {
tryCatch({
current_formula <- as.formula(
gsub("knots_age", as.character(k_age),
gsub("knots_fgs", as.character(k_fgs), formula)))
current_model <- glm(current_formula, data = data, family = family)
current_aic <- AIC(current_model)
aic_matrix[as.character(k_age), as.character(k_fgs)] <- current_aic
# Update if current AIC is smaller than the best so far
if (current_aic < best_aic) {
best_aic <- current_aic
best_k_age <- k_age
best_k_fgs <- k_fgs
}
}, error = function(e) {}) # Handle potential errors
}
}
min_aic <- best_aic # best AIC among all combinations
best_sum_knots <- best_k_age + best_k_fgs # initialize with the sum of best knots
# Find combinations with the smallest sum of knots but AIC within 4 of min_aic
for (k_age in knot_range) {
for (k_fgs in knot_range) {
current_sum_knots <- k_age + k_fgs
if ((aic_matrix[as.character(k_age), as.character(k_fgs)] - min_aic <= 4) && current_sum_knots < best_sum_knots) {
best_k_age <- k_age
best_k_fgs <- k_fgs
best_sum_knots <- current_sum_knots
}
}
}
best_formula <- as.formula(
gsub("knots_age", as.character(best_k_age),
gsub("knots_fgs", as.character(best_k_fgs), formula)))
best_model <- glm(best_formula, data = data, family = family)
return(list(model = best_model, knots_age = best_k_age, knots_fgs = best_k_fgs))
}
# Extract and print details
print_details <- function(model_result, model_name) {
cat("\n---", model_name, "---\n")
if ("knots_age" %in% names(model_result)) {
cat("Knots: ", model_result$knots_age, " & ", model_result$knots_fgs, "\n")
} else {
cat("Knots: ", model_result$knots, "\n")
}
cat("AIC: ", AIC(model_result$model), "\n")
cat("\nModel Summary:\n")
print(summary(model_result$model))
cat("\nModel Anova:\n")
print(anova(model_result$model))
}
# Fitting models
# Note that 3 is the minimum knot size for rcs, dozens of warnings with knots>6
knot_range <- 3:6
mod1_results <- fit_best_rcs("Status ~ rcs(Age, KNOTS) * Sex", Donner, knot_range)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
mod3_results <- fit_best_rcs("Status ~ rcs(Family_Group_Size, KNOTS)", Donner, knot_range)
mod4_results <- fit_best_rcs("Status ~ rcs(Family_Group_Size, KNOTS) * Sex", Donner, knot_range)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
mod5_results <- fit_best_double_rcs("Status ~ rcs(Age, knots_age) * Sex + rcs(Family_Group_Size, knots_fgs)", Donner, knot_range)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Printing the details
print_details(mod1_results, "mod1")
##
## --- mod1 ---
## Knots: 3
## AIC: 95.41
##
## Model Summary:
##
## Call:
## glm(formula = formula, family = binomial(), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.836 0.811 -1.03 0.3026
## rcs(Age, 3)Age 0.391 0.143 2.73 0.0064
## rcs(Age, 3)Age' -0.938 0.336 -2.79 0.0052
## SexMale 0.853 1.076 0.79 0.4275
## rcs(Age, 3)Age:SexMale -0.403 0.158 -2.55 0.0106
## rcs(Age, 3)Age':SexMale 0.897 0.359 2.50 0.0124
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109.201 on 78 degrees of freedom
## Residual deviance: 83.408 on 73 degrees of freedom
## AIC: 95.41
##
## Number of Fisher Scoring iterations: 5
##
##
## Model Anova:
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Status
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 78 109.2
## rcs(Age, 3) 2 8.10 76 101.1 0.0174
## Sex 1 8.45 75 92.7 0.0037
## rcs(Age, 3):Sex 2 9.24 73 83.4 0.0098
print_details(mod3_results, "mod3")
##
## --- mod3 ---
## Knots: 4
## AIC: 101
##
## Model Summary:
##
## Call:
## glm(formula = formula, family = binomial(), data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -2.613 0.944 -2.77
## rcs(Family_Group_Size, 4)Family_Group_Size 0.912 0.301 3.03
## rcs(Family_Group_Size, 4)Family_Group_Size' -1.959 0.801 -2.45
## rcs(Family_Group_Size, 4)Family_Group_Size'' 3.948 1.946 2.03
## Pr(>|z|)
## (Intercept) 0.0056
## rcs(Family_Group_Size, 4)Family_Group_Size 0.0024
## rcs(Family_Group_Size, 4)Family_Group_Size' 0.0145
## rcs(Family_Group_Size, 4)Family_Group_Size'' 0.0424
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109.201 on 78 degrees of freedom
## Residual deviance: 92.962 on 75 degrees of freedom
## AIC: 101
##
## Number of Fisher Scoring iterations: 4
##
##
## Model Anova:
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Status
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 78 109
## rcs(Family_Group_Size, 4) 3 16.2 75 93 0.001
print_details(mod4_results, "mod4")
##
## --- mod4 ---
## Knots: 3
## AIC: 96.68
##
## Model Summary:
##
## Call:
## glm(formula = formula, family = binomial(), data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.1411 1.2874 0.11
## rcs(Family_Group_Size, 3)Family_Group_Size 0.1167 0.2376 0.49
## rcs(Family_Group_Size, 3)Family_Group_Size' -0.0767 0.2229 -0.34
## SexMale -2.3309 1.5287 -1.52
## rcs(Family_Group_Size, 3)Family_Group_Size:SexMale 0.4964 0.3039 1.63
## rcs(Family_Group_Size, 3)Family_Group_Size':SexMale -0.7009 0.3355 -2.09
## Pr(>|z|)
## (Intercept) 0.913
## rcs(Family_Group_Size, 3)Family_Group_Size 0.623
## rcs(Family_Group_Size, 3)Family_Group_Size' 0.731
## SexMale 0.127
## rcs(Family_Group_Size, 3)Family_Group_Size:SexMale 0.102
## rcs(Family_Group_Size, 3)Family_Group_Size':SexMale 0.037
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109.201 on 78 degrees of freedom
## Residual deviance: 84.681 on 73 degrees of freedom
## AIC: 96.68
##
## Number of Fisher Scoring iterations: 5
##
##
## Model Anova:
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Status
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 78 109.2
## rcs(Family_Group_Size, 3) 2 12.10 76 97.1 0.0024
## Sex 1 5.93 75 91.2 0.0149
## rcs(Family_Group_Size, 3):Sex 2 6.49 73 84.7 0.0389
print_details(mod5_results, "mod5")
##
## --- mod5 ---
## Knots: 6 & 6
## AIC: 56.33
##
## Model Summary:
##
## Call:
## glm(formula = best_formula, family = family, data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -2.89e+02 8.58e+05 0.00
## rcs(Age, 6)Age 1.04e+02 3.29e+05 0.00
## rcs(Age, 6)Age' 5.32e+03 8.07e+06 0.00
## rcs(Age, 6)Age'' -1.43e+04 1.41e+07 0.00
## rcs(Age, 6)Age''' 1.75e+04 7.69e+06 0.00
## rcs(Age, 6)Age'''' -1.05e+04 1.96e+06 -0.01
## SexMale 2.75e+02 8.58e+05 0.00
## rcs(Family_Group_Size, 6)Family_Group_Size -1.65e+01 2.37e+03 -0.01
## rcs(Family_Group_Size, 6)Family_Group_Size' 4.77e+02 5.91e+04 0.01
## rcs(Family_Group_Size, 6)Family_Group_Size'' -1.31e+03 1.60e+05 -0.01
## rcs(Family_Group_Size, 6)Family_Group_Size''' 2.77e+03 3.36e+05 0.01
## rcs(Family_Group_Size, 6)Family_Group_Size'''' -5.37e+03 6.62e+05 -0.01
## rcs(Age, 6)Age:SexMale -9.85e+01 3.29e+05 0.00
## rcs(Age, 6)Age':SexMale -5.42e+03 8.07e+06 0.00
## rcs(Age, 6)Age'':SexMale 1.45e+04 1.41e+07 0.00
## rcs(Age, 6)Age''':SexMale -1.76e+04 7.69e+06 0.00
## rcs(Age, 6)Age'''':SexMale 1.05e+04 1.96e+06 0.01
## Pr(>|z|)
## (Intercept) 1.00
## rcs(Age, 6)Age 1.00
## rcs(Age, 6)Age' 1.00
## rcs(Age, 6)Age'' 1.00
## rcs(Age, 6)Age''' 1.00
## rcs(Age, 6)Age'''' 1.00
## SexMale 1.00
## rcs(Family_Group_Size, 6)Family_Group_Size 0.99
## rcs(Family_Group_Size, 6)Family_Group_Size' 0.99
## rcs(Family_Group_Size, 6)Family_Group_Size'' 0.99
## rcs(Family_Group_Size, 6)Family_Group_Size''' 0.99
## rcs(Family_Group_Size, 6)Family_Group_Size'''' 0.99
## rcs(Age, 6)Age:SexMale 1.00
## rcs(Age, 6)Age':SexMale 1.00
## rcs(Age, 6)Age'':SexMale 1.00
## rcs(Age, 6)Age''':SexMale 1.00
## rcs(Age, 6)Age'''':SexMale 1.00
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109.201 on 78 degrees of freedom
## Residual deviance: 22.328 on 62 degrees of freedom
## AIC: 56.33
##
## Number of Fisher Scoring iterations: 25
##
##
## Model Anova:
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Status
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 78 109.2
## rcs(Age, 6) 5 17.4 73 91.8 0.00379
## Sex 1 13.1 72 78.7 0.00030
## rcs(Family_Group_Size, 6) 5 32.6 67 46.2 4.6e-06
## rcs(Age, 6):Sex 5 23.8 62 22.3 0.00023
###### End of GPT-4 AIC rcs optimization code ######
# Generate 4 ggplot2 graphics
# Define new levels for the predictors
new_age <- seq(min(Donner$Age), max(Donner$Age), length.out = 100)
new_family_group_size <- seq(min(Donner$Family_Group_Size),
max(Donner$Family_Group_Size), length.out = 100)
new_sex <- unique(Donner$Sex)
# Make predictions on the link scale (logit scale)
# The Predict function require fit by Glm not glm
link_pred1 <- Predict(mod1, Age = new_age, Sex = new_sex)
link_pred3 <- Predict(mod3, Family_Group_Size = new_family_group_size)
link_pred4 <- Predict(mod4, Age = new_age, Sex = new_sex,
Family_Group_Size = new_family_group_size)
link_pred5 <- Predict(mod5, Age = new_age, Sex = new_sex,
Family_Group_Size = new_family_group_size)
# Transform predictions back to the original scale (probability scale)
pred1 <- data.frame(
Age = rep(new_age, times = length(new_sex)),
Sex = rep(new_sex, each = length(new_age)),
fit = plogis(link_pred1$yhat),
lower = plogis(link_pred1$lower),
upper = plogis(link_pred1$upper)
)
pred3 <- data.frame(
Sex = rep(new_sex, each = length(new_age)),
Family_Group_Size = new_family_group_size,
fit = plogis(link_pred3$yhat),
lower = plogis(link_pred3$lower),
upper = plogis(link_pred3$upper)
)
# Transform predictions back to the original scale (probability scale)
pred4 <- data.frame(
Age = rep(new_age, times = length(new_sex)),
Sex = rep(new_sex, each = length(new_age)),
Family_Group_Size = rep(new_family_group_size, each = length(new_family_group_size)),
fit = plogis(link_pred4$yhat),
lower = plogis(link_pred4$lower),
upper = plogis(link_pred4$upper)
)
pred5 <- data.frame(
Age = rep(new_age, times = length(new_sex)),
Sex = rep(new_sex, each = length(new_age)),
Family_Group_Size = rep(new_family_group_size, each = length(new_family_group_size)),
fit = plogis(link_pred5$yhat),
lower = plogis(link_pred5$lower),
upper = plogis(link_pred5$upper)
)
# Adjust y-values for jittered points
Donner$AdjustedStatus <- ifelse(Donner$Status == 0, -0.03, 1.03)
# Plot the results for Figure 1 rcs(Age,3) * Sex model
set.seed(13)
ggplot(pred1, aes(x = Age, y = fit, color = Sex)) +
geom_line() +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
geom_jitter(data = Donner, aes(x = Age, y = AdjustedStatus,
shape = Sex, color = Sex), width = 0.3, height = 0.03, size = 1.5) +
labs(x = "Age (Years)", y = "Estimated Probability of Survival",
title = "Fig. 1. rcs(Age, 3) * Sex with 95% confidence intervals",
color = "Sex") +
theme_minimal() +
scale_y_continuous(limits = c(-0.06, 1.06), breaks = seq(0, 1, 0.2))
# Plot the results for the rcs(Family_Group_Size,4) model
ggplot(pred3, aes(x = Family_Group_Size, y = fit, color = Sex)) +
geom_line() +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
geom_jitter(data = Donner, aes(x = Family_Group_Size, y = AdjustedStatus,
shape = Sex, color = Sex), width = 0.3, height = 0.03,
size = 1.5) +
labs(x = "Family Group Size", y = "Estimated Probability of Survival",
title = "Fig. 3. rcs(Family Group Size, 5) with 95% confidence intervals",
color = "Sex") +
theme_minimal() +
scale_y_continuous(limits = c(-0.06, 1.06), breaks = seq(0, 1, 0.2))
# Plot the results for the additive model, mod2 (graphic not used in manuscript)
plot_3d_mod4 <- plot_ly(data = subset(pred4, Sex == "Male"), x = ~Age, y = ~Family_Group_Size, z = ~fit,
type = "mesh3d", opacity = 0.6, name = "Male", showscale = FALSE) %>%
add_trace(data = subset(pred4, Sex == "Female"), x = ~Age, y = ~Family_Group_Size, z = ~fit,
type = "mesh3d", opacity = 0.6, name = "Female", showscale = FALSE) %>%
layout(scene = list(zaxis = list(range = c(0, 1)),
xaxis = list(title = "Age"),
yaxis = list(title = "Family Group Size"),
zaxis = list(title = "Estimated Probability of Survival")),
# title = "Age + Sex + rcs(Family_Group_Size,6)")
title = "Age + Sex + rcs(Family_Group_Size,6)")
plot_3d_mod4
# Plot the results for the rcs(Age,3) * Sex * rcs(Family_Group-Size,5) 3-d model
# Graphic not used in the manuscript (too curvy)
plot_3d_mod5 <- plot_ly(data = subset(pred5, Sex == "Male"), x = ~Age, y = ~Family_Group_Size, z = ~fit,
type = "mesh3d", opacity = 0.6, name = "Male", showscale = FALSE) %>%
add_trace(data = subset(pred5, Sex == "Female"), x = ~Age, y = ~Family_Group_Size, z = ~fit,
type = "mesh3d", opacity = 0.6, name = "Female", showscale = FALSE) %>%
layout(scene = list(zaxis = list(range = c(0, 1)),
xaxis = list(title = "Age"),
yaxis = list(title = "Family Group Size"),
zaxis = list(title = "Estimated Probability of Survival")),
title = "rcs(Age,3) * Sex + rcs(Family_Group_Size,5)")
plot_3d_mod5
### 8/20/23 addendum. Fitting rcs knots with k-fold cross validation ##########
# The AIC approach produces a badly overfit mod5 showing
# a 3-d plot that was far too flexible. I sent a 3-p prompt on the problem to
# GPT-4 and it provided this solution
cv_rcs <- function(k_age, k_fgs, data, n_folds = 10){
# Create a vector to store CV errors
cv_errors <- numeric(n_folds)
# Create fold indices
fold_indices <- sample(rep(1:n_folds, length.out = nrow(data)))
for(i in 1:n_folds){
# Split the data into training and test sets
training_data <- data[fold_indices != i, ]
test_data <- data[fold_indices == i, ]
# Fit the GLM model on the training data with rcs terms
glm_model <- glm(Status ~ rcs(Age, k_age) * Sex + rcs(Family_Group_Size, k_fgs),
data = training_data, family = binomial())
# Predict on the test data
predictions <- predict(glm_model, newdata = test_data, type = "response")
# Compute the binomial deviance (log loss) for the current fold and store it
cv_errors[i] <- -2 * sum(test_data$Status * log(predictions) + (1 - test_data$Status) * log(1 - predictions))
}
# Return the mean CV error
return(mean(cv_errors))
}
# Perform k-fold CV for a range of knot sizes
k_values <- 3:6
cv_results_matrix <- matrix(NA, nrow = length(k_values), ncol = length(k_values),
dimnames = list(k_values, k_values))
for(k_age in k_values){
for(k_fgs in k_values){
cv_results_matrix[as.character(k_age), as.character(k_fgs)] <- cv_rcs(k_age, k_fgs, Donner)
}
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in rcspline.eval(x, nk = nknots, inclx = TRUE, pc = pc, fractied =
## fractied): 6 knots requested with 8 unique values of x. knots set to 6
## interior values.
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Determine the optimal k_age and k_fgs based on minimum CV error
optimal_indices <- which(cv_results_matrix == min(cv_results_matrix), arr.ind = TRUE)
optimal_k_age <- as.numeric(rownames(cv_results_matrix)[optimal_indices[1,1]])
optimal_k_fgs <- as.numeric(colnames(cv_results_matrix)[optimal_indices[1,2]])
print(paste("Optimal k for Age: ", optimal_k_age))
## [1] "Optimal k for Age: 4"
print(paste("Optimal k for Family_Group_Size: ", optimal_k_fgs))
## [1] "Optimal k for Family_Group_Size: 3"
# Fit the optimal model
best_glm_model <- glm(Status ~ rcs(Age, optimal_k_age) * Sex +
rcs(Family_Group_Size, optimal_k_fgs),
data = Donner, family = binomial())
summary(best_glm_model)
##
## Call:
## glm(formula = Status ~ rcs(Age, optimal_k_age) * Sex + rcs(Family_Group_Size,
## optimal_k_fgs), family = binomial(), data = Donner)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -5.334 2.005
## rcs(Age, optimal_k_age)Age 2.012 0.811
## rcs(Age, optimal_k_age)Age' -12.151 5.203
## rcs(Age, optimal_k_age)Age'' 21.164 9.295
## SexMale 1.878 2.004
## rcs(Family_Group_Size, optimal_k_fgs)Family_Group_Size 0.511 0.194
## rcs(Family_Group_Size, optimal_k_fgs)Family_Group_Size' -0.673 0.220
## rcs(Age, optimal_k_age)Age:SexMale -1.646 0.803
## rcs(Age, optimal_k_age)Age':SexMale 9.947 5.155
## rcs(Age, optimal_k_age)Age'':SexMale -17.303 9.214
## z value Pr(>|z|)
## (Intercept) -2.66 0.0078
## rcs(Age, optimal_k_age)Age 2.48 0.0131
## rcs(Age, optimal_k_age)Age' -2.34 0.0195
## rcs(Age, optimal_k_age)Age'' 2.28 0.0228
## SexMale 0.94 0.3489
## rcs(Family_Group_Size, optimal_k_fgs)Family_Group_Size 2.64 0.0084
## rcs(Family_Group_Size, optimal_k_fgs)Family_Group_Size' -3.06 0.0022
## rcs(Age, optimal_k_age)Age:SexMale -2.05 0.0404
## rcs(Age, optimal_k_age)Age':SexMale 1.93 0.0537
## rcs(Age, optimal_k_age)Age'':SexMale -1.88 0.0604
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109.201 on 78 degrees of freedom
## Residual deviance: 61.443 on 69 degrees of freedom
## AIC: 81.44
##
## Number of Fisher Scoring iterations: 7
## This optimization produces Age: 3 knots, Family Group Size 5 knots.
# Summary and ANOVA for mod5 with knot sizes determined by k-fold cross validation
# Just repeating thise calculations for clarity.
mod5 <- Glm(Status ~ rcs(Age,3) * Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)
Mod5 <- glm(Status ~ rcs(Age,3) * Sex + rcs(Family_Group_Size,5), data = Donner, family = binomial(), x = TRUE, y = TRUE)
summary(Mod5)
##
## Call:
## glm(formula = Status ~ rcs(Age, 3) * Sex + rcs(Family_Group_Size,
## 5), family = binomial(), data = Donner, x = TRUE, y = TRUE)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -3.854 1.840 -2.09
## rcs(Age, 3)Age 0.541 0.193 2.80
## rcs(Age, 3)Age' -1.334 0.480 -2.78
## SexMale -0.500 1.407 -0.36
## rcs(Family_Group_Size, 5)Family_Group_Size 0.514 0.519 0.99
## rcs(Family_Group_Size, 5)Family_Group_Size' 3.109 3.150 0.99
## rcs(Family_Group_Size, 5)Family_Group_Size'' -9.172 6.151 -1.49
## rcs(Family_Group_Size, 5)Family_Group_Size''' 123.583 47.027 2.63
## rcs(Age, 3)Age:SexMale -0.358 0.209 -1.72
## rcs(Age, 3)Age':SexMale 0.959 0.496 1.93
## Pr(>|z|)
## (Intercept) 0.0362
## rcs(Age, 3)Age 0.0051
## rcs(Age, 3)Age' 0.0055
## SexMale 0.7223
## rcs(Family_Group_Size, 5)Family_Group_Size 0.3220
## rcs(Family_Group_Size, 5)Family_Group_Size' 0.3237
## rcs(Family_Group_Size, 5)Family_Group_Size'' 0.1359
## rcs(Family_Group_Size, 5)Family_Group_Size''' 0.0086
## rcs(Age, 3)Age:SexMale 0.0862
## rcs(Age, 3)Age':SexMale 0.0533
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109.201 on 78 degrees of freedom
## Residual deviance: 58.369 on 69 degrees of freedom
## AIC: 78.37
##
## Number of Fisher Scoring iterations: 6
anova(Mod5)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Status
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 78 109.2
## rcs(Age, 3) 2 8.10 76 101.1 0.0174
## Sex 1 8.45 75 92.7 0.0037
## rcs(Family_Group_Size, 5) 4 27.74 71 64.9 1.4e-05
## rcs(Age, 3):Sex 2 6.55 69 58.4 0.0379
summary(mod5) # For effect size
## Effects Response : Status
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## Age 6 28.5 22.5 1.078 1.0241 -0.9647 3.1215
## Family_Group_Size 4 13.0 9.0 -1.042 0.9906 -3.0184 0.9341
## Sex - Female:Male 2 1.0 NA 4.473 1.7069 1.0674 7.8778
##
## Adjusted to: Age=14 Sex=Male
AIC(mod5)
## [1] 78.37
### end of 8/20/23 GPT-4 code. #################################################
##### GAM analysis of the 79 Travelers #####
######## k-fold cross validation to find optimal k for the GAMs
# Set a seed for reproducibility
set.seed(123)
# Create a function to perform cross-validation for a given k
cv_gam <- function(k_value){
# Number of folds
n_folds <- 5
# Create a vector to store CV errors
cv_errors <- numeric(n_folds)
# Create fold indices
fold_indices <- sample(rep(1:n_folds, length.out = nrow(Donner)))
for(i in 1:n_folds){
# Split the data into training and test sets
training_data <- Donner[fold_indices != i, ]
test_data <- Donner[fold_indices == i, ]
# Fit the GAM model on the training data
gam_model <- gam(Status ~ s(Age, by = as.numeric(Sex == "Male"), k=k_value) +
s(Age, by = as.numeric(Sex == "Female"), k=k_value),
data = training_data, family = binomial())
# Predict on the test data
predictions <- predict(gam_model, newdata = test_data, type = "response")
# Compute the binomial deviance (log loss) for the current fold and store it
cv_errors[i] <- -2 * sum(test_data$Status * log(predictions) + (1 - test_data$Status) * log(1 - predictions))
}
# Return the mean CV error
return(mean(cv_errors))
}
# Test a range of k values
# This statement produced many warnings, reduced max to 8 and increased min to 3
# k_values <- 2:10
k_values <- 3:6
cv_results <- sapply(k_values, cv_gam)
## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L = G$L,
## : Iteration limit reached without full convergence - check carefully
cv_results
## [1] 21.07 25.94 21.98 44.70
# Determine the optimal k based on minimum CV error
optimal_k <- k_values[which.min(cv_results)]
print(optimal_k)
## [1] 3
## optimal_k is 2, so redo the model
mod_gam_k2 <- gam(Status ~ s(Age, by = as.numeric(Sex == "Male"), k=2) +
s(Age, by = as.numeric(Sex == "Female"), k=2),
data = Donner, family = binomial())
## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible
## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible
mod_gam_k2
##
## Family: binomial
## Link function: logit
##
## Formula:
## Status ~ s(Age, by = as.numeric(Sex == "Male"), k = 2) + s(Age,
## by = as.numeric(Sex == "Female"), k = 2)
##
## Estimated degrees of freedom:
## 2.0 1.9 total = 4.9
##
## UBRE score: 0.2035 rank: 6/7
# Check the summary
summary(mod_gam_k2)
##
## Family: binomial
## Link function: logit
##
## Formula:
## Status ~ s(Age, by = as.numeric(Sex == "Male"), k = 2) + s(Age,
## by = as.numeric(Sex == "Female"), k = 2)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.19 3.13 2.94 0.0033
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Age):as.numeric(Sex == "Male") 2.0 2.00 11.3 0.0036
## s(Age):as.numeric(Sex == "Female") 1.9 1.99 7.5 0.0229
##
## Rank: 6/7
## R-sq.(adj) = 0.229 Deviance explained = 21.9%
## UBRE = 0.20352 Scale est. = 1 n = 79
# Create a prediction dataset
new_age <- seq(min(Donner$Age), max(Donner$Age), length.out = 100)
new_sex <- unique(Donner$Sex)
pred_data <- expand.grid(Age = new_age, Sex = new_sex)
# Predict using the GAM model
predictions <- predict(mod_gam_k2, newdata = pred_data, type = "link", se.fit = TRUE)
pred_data$fit <- plogis(predictions$fit)
pred_data$lower <- plogis(predictions$fit - 1.96 * predictions$se.fit)
pred_data$upper <- plogis(predictions$fit + 1.96 * predictions$se.fit)
# plot the GAM analysis for Sex (Figure 2 in manuscript)
# Adjust y-values for jittered points
Donner$AdjustedStatus <- ifelse(Donner$Status == 0, -0.03, 1.03)
# Plot Fig. 2, GAM analysis of Sex
set.seed(3)
ggplot(pred_data, aes(x = Age, y = fit, color = Sex, shape = Sex)) +
geom_line() +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
geom_jitter(data = Donner, aes(x = Age, y = AdjustedStatus),
width = 0.3, height = 0.03, size = 1.5) +
labs(x = "Age (Years)", y = "Estimated Probability of Survival",
title = "Fig. 2. GAM (k=2) with Age smooths by Sex with 95% confidence intervals") +
theme_minimal() +
scale_y_continuous(limits = c(-0.06, 1.06), breaks = seq(0, 1, 0.2)) +
scale_color_manual(values = c("Male" = "blue", "Female" = "pink2"), name = "Sex") +
scale_shape_manual(values = c("Male" = 17, "Female" = 16), name = "Sex") +
guides(color = guide_legend(override.aes = list(shape = c(17, 16))))
### GAMs for Family Group Size, code from GPT-4 ################################
# Define the k-fold cross-validation manually for GAM:
set.seed(123) # for reproducibility
folds <- createFolds(Donner$Status, k = 10)
cv_errors <- data.frame(k = integer(), RMSE = numeric(), Rsquared = numeric())
for (k_value in 3:6) {
rmse_values <- numeric()
rsq_values <- numeric()
# Attempt to fit models and catch errors
for (fold in folds) {
train_data <- Donner[-fold, ]
test_data <- Donner[fold, ]
success <- TRUE
model <- tryCatch({
gam(Status ~ s(Family_Group_Size, k = k_value), data = train_data, family = binomial())
}, error = function(e) {
message(paste("Error with k =", k_value, ":", e$message))
success <- FALSE
return(NULL)
}, warning = function(w) {
message(paste("Warning with k =", k_value, ":", w$message))
})
if(success && !is.null(model)){
preds <- predict(model, newdata = test_data, type = "response")
rmse <- sqrt(mean((test_data$Status - preds)^2))
rsq <- 1 - (sum((test_data$Status - preds)^2) / sum((test_data$Status - mean(test_data$Status))^2))
rmse_values <- c(rmse_values, rmse)
rsq_values <- c(rsq_values, rsq)
}
}
if (length(rmse_values) > 0 && length(rsq_values) > 0) {
cv_errors <- rbind(cv_errors, data.frame(k = k_value, RMSE = mean(rmse_values), Rsquared = mean(rsq_values)))
}
}
# Check the results
cv_errors
## k RMSE Rsquared
## 1 3 0.4685 -0.02279
## 2 4 0.4705 -0.03109
## 3 5 0.4658 -0.02021
## 4 6 0.4703 -0.04122
# The value of k with the smallest RMSE would be the optimal choice
best_k <- cv_errors[which.min(cv_errors$RMSE),]$k
best_k
## [1] 5
# Fit the GAM with optimal k
mod_gam <- gam(Status ~ s(Family_Group_Size, k = best_k), data = Donner, family = binomial())
summary(mod_gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## Status ~ s(Family_Group_Size, k = best_k)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.308 0.294 1.05 0.29
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Family_Group_Size) 3.77 3.96 11.6 0.02
##
## R-sq.(adj) = 0.173 Deviance explained = 19.2%
## UBRE = 0.23719 Scale est. = 1 n = 79
# Generate ggplot graphics for GAM
# Create a prediction dataset
new_family_group_size <- seq(min(Donner$Family_Group_Size),
max(Donner$Family_Group_Size), length.out = 100)
# Create prediction data
# Predict on the link (logit) scale
link_predictions <- predict(mod_gam, newdata = data.frame(Family_Group_Size = new_family_group_size), type = "link", se.fit = TRUE)
# Compute confidence intervals on the link scale
link_lower <- link_predictions$fit - 1.96 * link_predictions$se.fit
link_upper <- link_predictions$fit + 1.96 * link_predictions$se.fit
# Back-transform to the probability scale
pred_data <- data.frame(
Family_Group_Size = new_family_group_size,
fit = plogis(link_predictions$fit),
lower = plogis(link_lower),
upper = plogis(link_upper)
)
# Plot the GAM (k = best_k = 5) model.
ggplot(pred_data, aes(x = Family_Group_Size, y = fit)) +
geom_line() +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
geom_jitter(data = Donner, aes(x = Family_Group_Size, y = AdjustedStatus,
shape = Sex, color = Sex), width = 0.3, height = 0.03,
size = 1.5) +
labs(x = "Family Group Size", y = "Estimated Probability of Survival",
# title = paste0("GAM (k=", best_k, ") with Family Group Size with 95% confidence intervals"),
title = paste0("Fig. 4. GAM (k=", best_k, ") with Family Group Size with 95% confidence intervals"),
color = "Sex") +
theme_minimal() +
scale_y_continuous(limits = c(-0.06, 1.06), breaks = seq(0, 1, 0.2))
##### Cox Proportional Hazards Model for Sex
Donner$Death <- abs(Donner$Status-1)
# Code initially from GPT-4
# Assuming the data frame is already loaded as `Donner`
# 2. Fit the Cox proportional hazards model
cox_model <- coxph(Surv(Survival_Time, Death) ~ Sex, data = Donner)
# Print the summary of the model
print(summary(cox_model))
## Call:
## coxph(formula = Surv(Survival_Time, Death) ~ Sex, data = Donner)
##
## n= 79, number of events= 37
##
## coef exp(coef) se(coef) z Pr(>|z|)
## SexMale 1.037 2.819 0.371 2.79 0.0052
##
## exp(coef) exp(-coef) lower .95 upper .95
## SexMale 2.82 0.355 1.36 5.83
##
## Concordance= 0.633 (se = 0.036 )
## Likelihood ratio test= 8.82 on 1 df, p=0.003
## Wald test = 7.81 on 1 df, p=0.005
## Score (logrank) test = 8.52 on 1 df, p=0.004
# Check the proportional hazards assumption with
ph_test <- cox.zph(cox_model)
# Print the results
# print(ph_test)
# Plot the Schoenfeld residuals
# plot(ph_test)
# 3. Visualize the results
# Fit Kaplan-Meier survival curves for Sex
km_fit <- survfit(Surv(Survival_Time, Death) ~ Sex, data = Donner)
# Define the days of arrival of the relief parties
relief_days <- c(83, 124, 136, 171)
colors <- c("red", "blue", "magenta", "purple")
linetypes <- c("dashed", "dotted", "dotdash", "longdash")
labels <- c("1st", "2nd", "3rd", "4th")
# Plot Fig. 6 the KM curve & Risk Table for Sex
# Generate KM plot without the risk table
p1 <- ggsurvplot(km_fit, data = Donner, risk.table = FALSE,
pval = TRUE, pval.coord = c(0.8, 0.25),
legend = "right", legend.title = "Sex",
title = "Fig. 6. Kaplan-Meier Survivorship by Sex",
legend.labs = c("Female", "Male"), conf.int = TRUE,
break.x.by = 20)
# Add the relief party vertical lines and annotations
for (i in 1:4) {
p1$plot <- p1$plot +
annotate("segment", x = relief_days[i], xend = relief_days[i],
y = 0, yend = 1, colour = colors[i], linetype = linetypes[i]) +
annotate("text", x = relief_days[i], y = 0.2,
label = labels[i], angle = 90, vjust = 0, color = colors[i])
}
# Generate risk table separately
p2 <- ggsurvplot(km_fit, data = Donner, risk.table = TRUE,
pval = FALSE, legend = "none",
break.x.by = 20)
# Use patchwork to combine the two plots
# Use patchwork to combine the two plots with specified heights
combined_plot <- p1$plot / p2$table + plot_layout(heights = c(5, 1))
# Print the combined plot
combined_plot
#### Hazard ratio plot
# Convert model coefficients to hazard ratios
hr <- exp(coef(cox_model))
ci <- exp(confint(cox_model))
# Create a data frame for plotting
df <- data.frame(
Variable = names(hr),
HR = hr,
lower = ci[, 1],
upper = ci[, 2]
)
# Organize data for forestplot
labeltext <- list(
c("", df$Variable),
c("HR", as.character(round(df$HR, 2))),
c("Lower 95% CI", as.character(round(df$lower, 2))),
c("Upper 95% CI", as.character(round(df$upper, 2)))
)
# Plot using forestplot
forestplot(
labeltext = labeltext,
graph.pos = 3,
mean = c(NA, df$HR),
lower = c(NA, df$lower),
upper = c(NA, df$upper),
xlog = TRUE, # because hazard ratios are typically plotted on a log scale
clip = c(0.5, 2), # you can adjust these values if necessary
xticks = c(0.5, 1, 2),
title = "Hazard Ratios by Sex (95% CI)"
)
# 4. Employee (Teamsters & 2 Servants) Survival Analysis
# 4.1 Cox PH model
cox_model_2 <- coxph(Surv(Survival_Time, Death) ~ Employee, data = Donner)
# Print the summary of the model
print(summary(cox_model_2))
## Call:
## coxph(formula = Surv(Survival_Time, Death) ~ Employee, data = Donner)
##
## n= 79, number of events= 37
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Employee 1.047 2.849 0.422 2.48 0.013
##
## exp(coef) exp(-coef) lower .95 upper .95
## Employee 2.85 0.351 1.25 6.52
##
## Concordance= 0.571 (se = 0.035 )
## Likelihood ratio test= 4.93 on 1 df, p=0.03
## Wald test = 6.15 on 1 df, p=0.01
## Score (logrank) test = 6.73 on 1 df, p=0.009
# Check the proportional hazards assumption
ph_test_2 <- cox.zph(cox_model_2)
# Print the results
print(ph_test_2)
## chisq df p
## Employee 4.17 1 0.041
## GLOBAL 4.17 1 0.041
# Some evidence that the constant proportional hazard assumption is violated.
# p = 0.041 for Employee and Global chisq 4.17 with 1 df
# Plot the Schoenfeld residuals
plot(ph_test_2)
# 4.2 Kaplan-Meier survival curve for Employment
km_fit <- survfit(Surv(Survival_Time, Death) ~ Employee, data = Donner)
# Define the days of arrival of the relief parties
relief_days <- c(83, 124, 136, 171)
colors <- c("red", "blue", "magenta", "purple")
linetypes <- c("dashed", "dotted", "dotdash", "longdash")
labels <- c("1st", "2nd", "3rd", "4th")
# Generate Fig. 7 KM plot for employment with the risk table
p1 <- ggsurvplot(km_fit, data = Donner, risk.table = FALSE,
pval = TRUE, pval.coord = c(0.8, 0.25),
legend = "right", legend.title = "Sex",
title = "Fig. 7. Kaplan-Meier Survivorship by Employment",
legend.labs = c("Family Member", "Employee"), conf.int = TRUE,
break.x.by = 20)
# Add the relief party vertical lines and annotations
for (i in 1:4) {
p1$plot <- p1$plot +
annotate("segment", x = relief_days[i], xend = relief_days[i],
y = 0, yend = 1, colour = colors[i], linetype = linetypes[i]) +
annotate("text", x = relief_days[i], y = 0.05,
label = labels[i], angle = 90, vjust = 0, color = colors[i])
}
# Generate risk table separately
p2 <- ggsurvplot(km_fit, data = Donner, risk.table = TRUE,
pval = FALSE, legend = "none",
break.x.by = 20)
# Use patchwork to combine the two plots
# Use patchwork to combine the two plots with specified heights
combined_plot <- p1$plot / p2$table + plot_layout(heights = c(5, 1))
# Print the combined plot
combined_plot
# GPT-4 suggested that I try stratified KM plots to show the change in
# Proportional Hazard ratio with time. Here is the codefrom GPT-4
# Split the Survival_Time into intervals
Donner$TimeGroup <- cut(Donner$Survival_Time, breaks = c(0, 50, 100, max(Donner$Survival_Time)),
labels = c("Early", "Middle", "Late"), include.lowest = TRUE)
# KM fit using Employee and stratified by TimeGroup
km_fit_time <- survfit(Surv(Survival_Time, Death) ~ Employee + strata(TimeGroup), data = Donner)
# Plot Stratified Kaplan_Meier curves.
ggsurvplot(km_fit_time, data = Donner, risk.table = FALSE,
legend = "right", legend.title = "Sex",
title = "Fig. 8. Kaplan-Meir Employee Survivorship Stratified by Time Intervals", conf.int = TRUE,
break.x.by = 20)
# 4.3 Calculate and plot the hazard ratio for Employment
# Convert model coefficients to hazard ratios
hr <- exp(coef(cox_model_2))
ci <- exp(confint(cox_model_2))
# Create a data frame for plotting
df <- data.frame(
Variable = names(hr),
HR = hr,
lower = ci[, 1],
upper = ci[, 2]
)
# Organize data for forestplot
labeltext <- list(
c("", df$Variable),
c("HR", as.character(round(df$HR, 2))),
c("Lower 95% CI", as.character(round(df$lower, 2))),
c("Upper 95% CI", as.character(round(df$upper, 2)))
)
# Plot using forestplot
forestplot(
labeltext = labeltext,
graph.pos = 3,
mean = c(NA, df$HR),
lower = c(NA, df$lower),
upper = c(NA, df$upper),
xlog = TRUE, # because hazard ratios are typically plotted on a log scale
clip = c(0.5, 2), # you can adjust these values if necessary
xticks = c(0.5, 1, 2.5),
title = "Hazard Ratios (95% CI)"
)
# The proportional hazards assumption is clearly violated, so GPT-4 suggested
# an alternative:
Donner$SurvTime_Employee <- with(Donner, Survival_Time * Employee)
cox_time_dep <- coxph(Surv(Survival_Time, Death) ~ Employee + SurvTime_Employee, data = Donner)
summary(cox_time_dep)
## Call:
## coxph(formula = Surv(Survival_Time, Death) ~ Employee + SurvTime_Employee,
## data = Donner)
##
## n= 79, number of events= 37
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Employee 6.1680 477.2294 1.2615 4.89 1e-06
## SurvTime_Employee -0.0456 0.9554 0.0143 -3.18 0.0015
##
## exp(coef) exp(-coef) lower .95 upper .95
## Employee 477.229 0.0021 40.263 5656.520
## SurvTime_Employee 0.955 1.0466 0.929 0.983
##
## Concordance= 0.621 (se = 0.035 )
## Likelihood ratio test= 27 on 2 df, p=1e-06
## Wald test = 32.1 on 2 df, p=1e-07
## Score (logrank) test = 63.6 on 2 df, p=2e-14
# Print the summary of the model
print(summary(cox_time_dep))
## Call:
## coxph(formula = Surv(Survival_Time, Death) ~ Employee + SurvTime_Employee,
## data = Donner)
##
## n= 79, number of events= 37
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Employee 6.1680 477.2294 1.2615 4.89 1e-06
## SurvTime_Employee -0.0456 0.9554 0.0143 -3.18 0.0015
##
## exp(coef) exp(-coef) lower .95 upper .95
## Employee 477.229 0.0021 40.263 5656.520
## SurvTime_Employee 0.955 1.0466 0.929 0.983
##
## Concordance= 0.621 (se = 0.035 )
## Likelihood ratio test= 27 on 2 df, p=1e-06
## Wald test = 32.1 on 2 df, p=1e-07
## Score (logrank) test = 63.6 on 2 df, p=2e-14
# Check the proportional hazards assumption
ph_test_3 <- cox.zph(cox_time_dep)
# Print the results
print(ph_test_3)
## chisq df p
## Employee 4.97 1 0.026
## SurvTime_Employee 2.70 1 0.100
## GLOBAL 5.14 2 0.077
# Evidence for the violation of the constant hazard ratio assumption for Employee
# But not for Global
# Plot the Schoenfeld residuals
plot(ph_test_3)
# Fairly level distribution for Betas for SurvTime_Employees
### 6. Forlorn Hope Fisher's exact test:
# Create the data matrix
# The matrix format is:
# Survived Died
# Women 5 0
# Men 2 8
donner_matrix <- matrix(c(5, 0, 2, 8), nrow=2, byrow=TRUE)
colnames(donner_matrix) <- c("Survived", "Died")
rownames(donner_matrix) <- c("Women", "Men")
print(donner_matrix)
## Survived Died
## Women 5 0
## Men 2 8
# Fisher's exact test
test_result <- fisher.test(donner_matrix)
# Print the results
print(test_result)
##
## Fisher's Exact Test for Count Data
##
## data: donner_matrix
## p-value = 0.007
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.688 Inf
## sample estimates:
## odds ratio
## Inf
# Odds ratio
print(paste("Odds Ratio:", test_result$estimate))
## [1] "Odds Ratio: Inf"
# Confidence interval for the odds ratio
print(paste("95% Confidence Interval:", round(test_result$conf.int[1], 3), "to", round(test_result$conf.int[2], 3)))
## [1] "95% Confidence Interval: 1.688 to Inf"
# Produces an SPSS-like table. Note that with the Essex data, warnings are
# issued because the minimum sample sizes for 2 x 2 have expectations less
# than 5. The proportions and Fisher's exact test are correct, hence CrossTable
CrossTable(donner_matrix,digits=3,fisher = TRUE, chisq = TRUE, expected = TRUE,
format = "SPSS")
## Warning in chisq.test(t, correct = TRUE, ...): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation may
## be incorrect
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 15
##
## |
## | Survived | Died | Row Total |
## -------------|-----------|-----------|-----------|
## Women | 5 | 0 | 5 |
## | 2.333 | 2.667 | |
## | 3.048 | 2.667 | |
## | 100.000% | 0.000% | 33.333% |
## | 71.429% | 0.000% | |
## | 33.333% | 0.000% | |
## -------------|-----------|-----------|-----------|
## Men | 2 | 8 | 10 |
## | 4.667 | 5.333 | |
## | 1.524 | 1.333 | |
## | 20.000% | 80.000% | 66.667% |
## | 28.571% | 100.000% | |
## | 13.333% | 53.333% | |
## -------------|-----------|-----------|-----------|
## Column Total | 7 | 8 | 15 |
## | 46.667% | 53.333% | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 8.571 d.f. = 1 p = 0.003415
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 5.658 d.f. = 1 p = 0.01737
##
##
## Fisher's Exact Test for Count Data
## ------------------------------------------------------------
## Sample estimate odds ratio: Inf
##
## Alternative hypothesis: true odds ratio is not equal to 1
## p = 0.006993
## 95% confidence interval: 1.688 Inf
##
## Alternative hypothesis: true odds ratio is less than 1
## p = 1
## 95% confidence interval: 0 Inf
##
## Alternative hypothesis: true odds ratio is greater than 1
## p = 0.006993
## 95% confidence interval: 2.365 Inf
##
##
##
## Minimum expected frequency: 2.333
## Cells with Expected Frequency < 5: 3 of 4 (75%)
# options (prType ='latex') # applies to printingmodelfits
getHdata (sicily ) #fetch dataset from hbiostat.org/data
d<-sicily
dd<-datadist (d); options (datadist = 'dd')
g <- function (x) exp(x) * 100000
off <-list ( stdpop = mean (d$ stdpop )) # offset for prediction (383464.4)
w <- geom_point (aes(x=time , y= rate ), data =d)
v <- geom_vline (aes( xintercept =37 , col=I('red')))
yl <- ylab ('Acute Coronary Cases Per 100,000')
f <- Glm(aces ~ offset(log(stdpop)) + rcs(time,6), data=d, family = poisson)
f$aic
## [1] 721.5
ggplot(Predict(f, fun=g, offset =off)) + w + v + yl
k <- attr (rcs(d$time, 6), 'parms')
k
## [1] 5.00 14.34 24.78 35.22 45.66 55.00
kn <- k
h <- function (x) cbind(rcspline.eval (x, kn),
sin=sin (2*pi*x/12), cos=cos (2*pi*x/12))
f <- Glm( aces ~ offset (log( stdpop )) + gTrans (time, h),
data =d, family = poisson )
f$aic
## [1] 674.1
ggplot (Predict(f, fun=g, offset =off)) + w + v + yl
kn <- sort (c(k, c(36 , 37, 38)))
f <-Glm( aces ~ offset (log(stdpop)) + gTrans (time , h),
data =d, family = poisson )
f$aic
## [1] 661.8
ggplot(Predict (f, fun=g, offset =off)) + w + v + yl
# Harrell uses greater than or equal 37, but I don't know how to code that
h <- function (x) cbind ( rcspline.eval (x, k),
sin=sin (2*pi*x/12) , cos=cos (2*pi*x/12),
jump =x > 36.999)
f <- Glm( aces ~ offset (log( stdpop )) + gTrans (time, h),
data =d, family = poisson)
f$aic
## [1] 659.6
times <- sort (c(seq (0, 60, length =200), 36.998 , 37, 37.001 ))
ggplot(Predict(f, time =times ,fun=g, offset =off)) + w + v + yl
# Look at fit statistics, especially evidence for the jump
# summary(f)
# summary generates Latex code which is unreadable, i'll regenrate f with glm
f2 <- glm( aces ~ offset (log( stdpop )) + gTrans (time, h),
data =d, family = poisson)
# These two lines not in Harrell's code, but match his output.
summary(f2)
##
## Call:
## glm(formula = aces ~ offset(log(stdpop)) + gTrans(time, h), family = poisson,
## data = d)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.21185 0.00947 -656.01 < 2e-16
## gTrans(time, h)time 0.06355 0.01128 5.63 1.8e-08
## gTrans(time, h)time' -0.19119 0.04334 -4.41 1.0e-05
## gTrans(time, h)time'' 0.26531 0.07600 3.49 0.00048
## gTrans(time, h)time''' -0.24086 0.09245 -2.61 0.00918
## gTrans(time, h)sin 0.03434 0.00672 5.11 3.2e-07
## gTrans(time, h)cos 0.03796 0.00647 5.86 4.5e-09
## gTrans(time, h)jump -0.12681 0.03127 -4.06 5.0e-05
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 308.52 on 58 degrees of freedom
## Residual deviance: 138.88 on 51 degrees of freedom
## AIC: 659.6
##
## Number of Fisher Scoring iterations: 3
anova(f2,test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: aces
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 58 308
## gTrans(time, h) 7 170 51 139 <2e-16
# Strong evidence for a jump effect
# gTrans(time, h)jump -0.126814 0.031270 -4.055 5.00e-05 ***
exp(-0.126814) # 0.8808975
## [1] 0.8809
There was a decrease of 12% in cardiovascular events resulting from the cigarette ban with strong evidence for seasonality
# Sicily_redux_Working
# Prompt by Eugene.Gallagher@umb.edu, 7/10/23 8:20 AM
# GPT-4 Prompt
# Sicily Smoking Ban
# Using R, analyze the interrupted monthly time series of Acute Cardiac Events
# (aces) from Sicily and the effects of a ban of smoking at all indoor locations
# at the start of 2005 (time = 37). The data are available at
# rms::getHdata(sicily). The analysis is described in Bernal et al. 2017
# (International Journal of Epidemiology 46: 348-355,
# academic.oup.com/ije/article/46/1/348/2622842). Fit the aces response with
# a Poisson generalized linear model with an offset of population size,
# log(stdpop). Use a restricted cubic spline regression with 6 knots to assess
# seasonal effects on aces. Model the annual pattern in aces with sin and cos
# functions of time. Assess the intervention of a smoking ban at month 37 on
# aces and report model estimates and standard errors. Document the percentage
# change of the intervention on aces with a 95% confidence interval. Use ggplot2
# to graphically display the results with 95% confidence intervals and a legend,
# noting the intervention point at month 37 and actual data plotted with points.
# Load the required libraries and data
# library(rms)
# library(ggplot2)
# Load the data
getHdata(sicily)
# Create the variables necessary for analysis
sicily$time <- 1:nrow(sicily)
sicily$year <- sicily$time / 12
sicily$smoking_ban <- ifelse(sicily$time >= 37, 1, 0)
# Fit the Poisson generalized linear model
fit <- glm(aces ~ rcs(time, 6) + I(sin(2 * pi * year)) + I(cos(2 * pi * year)) + offset(log(stdpop)) + smoking_ban,
family = poisson, data = sicily)
# Extract the estimated effects and standard errors
coefs <- summary(fit)$coefficients
smoking_ban_effect <- coefs["smoking_ban", "Estimate"]
smoking_ban_se <- coefs["smoking_ban", "Std. Error"]
# Compute the percentage change of the intervention on ACES
percentage_change <- exp(smoking_ban_effect) - 1
percentage_change_ci <- exp(smoking_ban_effect + qnorm(c(0.025, 0.975)) * smoking_ban_se) - 1
# Print the results
print(coefs)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.287547 0.021198 -296.60774 0.000e+00
## rcs(time, 6)time 0.009869 0.002453 4.02277 5.752e-05
## rcs(time, 6)time' -0.032573 0.026369 -1.23529 2.167e-01
## rcs(time, 6)time'' 0.061630 0.076209 0.80869 4.187e-01
## rcs(time, 6)time''' 0.005871 0.099571 0.05896 9.530e-01
## rcs(time, 6)time'''' -0.097929 0.099016 -0.98902 3.227e-01
## I(sin(2 * pi * year)) 0.040166 0.006866 5.84955 4.929e-09
## I(cos(2 * pi * year)) 0.037582 0.006474 5.80532 6.424e-09
## smoking_ban -0.146324 0.031578 -4.63380 3.590e-06
print(smoking_ban_effect)
## [1] -0.1463
print(smoking_ban_se)
## [1] 0.03158
print(percentage_change)
## [1] -0.1361
print(percentage_change_ci)
## [1] -0.18797 -0.08097
# Graphically display the results
sicily$fitted <- predict(fit, type = "response")
ggplot(sicily, aes(x = time)) +
geom_point(aes(y = aces)) +
geom_line(aes(y = fitted), color = "blue") +
geom_vline(xintercept = 37, linetype = "dashed", color = "red") +
geom_ribbon(aes(ymin = fitted - 1.96 * sqrt(fitted), ymax = fitted + 1.96 * sqrt(fitted)),
alpha = 0.1) +
labs(title = "Effects of Smoking Ban on ACES in Sicily",
x = "Time (months)",
y = "Acute Cardiac Events",
caption = paste0("Estimated smoking ban effect: ", round(percentage_change * 100, 2),
"% (95% CI: ", round(percentage_change_ci[1] * 100, 2), "% to ",
round(percentage_change_ci[2] * 100, 2), "%)")) +
theme_minimal()
Hurlbert (1984) on pseudoreplication
Chapter 14 Multifactor studies without replication
Chapter 16 Nested, repeated measures (longitudinal), mixed, and other multilevel/hierarchical response models.
Andrews Chapter 12 on Multilevel models