Day 3/4: Regression Models
09-2024
Regression analysis is a fundamental statistical method that models relationships between variables:
Key Concepts:
Dependent Variable (\(Y\)): The outcome we want to predict or understand
Independent Variables (\(X, X_1,X_2, \dots\)): The predictors or explanatory variables
Simplest Functional Relationship: linear
Effects (explained) versus Error Term (unexplained variation)
Applications in Medical Research: Predicting patient outcomes based on clinical measurements, Studying dose-response relationships, Adjusting for known confounding factors
\[Y = \alpha + \beta X + \varepsilon, \quad i = 1,\ldots,n\]
Components:
\(Y\): Dependent variable observation - Must be continuous
\(X\): Independent variable observation
Parameters:
\(\varepsilon_i\): Random error
Least Squares Method: Minimize the sum of squared residuals with observations \(x_i, y_i\) \[Q(\alpha,\beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2\]
Normal Equations: \[\frac{\partial Q}{\partial \alpha} = 0: \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0\] \[\frac{\partial Q}{\partial \beta} = 0: \sum_{i=1}^n x_i(y_i - \alpha - \beta x_i) = 0\]
Resulting Estimators: \[\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\] \[\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}\]
Unbiasedness: \[E[\hat{\beta}] = \beta\text{, }E[\hat{\alpha}] = \alpha\]
Variance: \[Var(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\] \[Var(\hat{\alpha}) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\right)\]
Efficiency:
Achieve Cramér-Rao lower bound under normality
BLUE (Best Linear Unbiased Estimator)
# Basic summary statistics
body_dat <- read_excel("body_dat.xlsx") %>%
dplyr::select(Age,Height,Weight,Gender) %>%
mutate(Gender_fac = factor(Gender, labels = c("Female","Male")))
describe(body_dat,ranges=T,skew=F) vars n mean sd median min max range se
Age 1 507 30.18 9.61 27.0 18.0 67.0 49.0 0.43
Height 2 507 171.14 9.41 170.3 147.2 198.1 50.9 0.42
Weight 3 507 69.15 13.35 68.2 42.0 116.4 74.4 0.59
Gender 4 507 0.49 0.50 0.0 0.0 1.0 1.0 0.02
Gender_fac* 5 507 1.49 0.50 1.0 1.0 2.0 1.0 0.02
# Fit simple linear model
model_simple <- lm(Weight ~ Height, data = body_dat)
# Detailed summary
summary(model_simple)
Call:
lm(formula = Weight ~ Height, data = body_dat)
Residuals:
Min 1Q Median 3Q Max
-18.743 -6.402 -1.231 5.059 41.103
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
Height 1.01762 0.04399 23.14 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.308 on 505 degrees of freedom
Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
Additional Information from the Model: Significance of Results
\(R^2\): The square of the correlation between both variables. Indicates what proportion of the total variance is explained by the linear model. The F-test (p-value) tests the null hypothesis that the true correlation is zero - meaning no relationship exists.
t-Test: For each regression parameter, a p-value (Pr(>|t|)) is reported. The t-test tests the null hypothesis that the respective parameter is actually zero.
Sampling Distribution of Estimators: Under normality assumptions: \[\hat{\beta} \sim N(\beta, \frac{\sigma^2}{\sum(X_i - \bar{X})^2})\]
Standard Error Estimation: \[s_{\hat{\beta}} = \sqrt{\frac{\sum e_i^2/(n-2)}{\sum(x_i - \bar{x})^2}}\] where \(e_i\) are residuals
t-Statistics: \[t = \frac{\hat{\beta} - \beta_0}{s_{\hat{\beta}}} \sim t_{n-2}\]
2.5 % 97.5 %
(Intercept) -119.8237251 -90.198783
Height 0.9311971 1.104036
Confidence Interval for mean response: \[\hat{y} \pm t_{n-2,\alpha/2} \cdot s_{\hat{y}_h}\] where: \[s_{\hat{y}}^2 = \hat{\sigma}^2[1/n + (x - \bar{x})^2/\sum(x_i - \bar{x})^2]\]
Prediction Interval for individual response: \[\hat{y} \pm t_{n-2,\alpha/2} \cdot s_{pred}\] where: \[s_{pred}^2 = \hat{\sigma}^2[1 + 1/n + (x - \bar{x})^2/\sum(x_i - \bar{x})^2]\]
# Create new data for prediction
new_Weights <- data.frame(Height = c(150,160,170,180,190))
# Calculate both types of intervals
predict(model_simple, new_Weights,
interval = "confidence") fit lwr upr
1 47.63126 45.63166 49.63087
2 57.80743 56.54764 59.06722
3 67.98360 67.16544 68.80176
4 78.15977 77.04380 79.27573
5 88.33593 86.51521 90.15666
fit lwr upr
1 47.63126 29.23501 66.02751
2 57.80743 39.47684 76.13802
3 67.98360 49.67805 86.28914
4 78.15977 59.83849 96.48104
5 88.33593 69.95827 106.71360
Model with p predictors: \[Y = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + ... + \beta_pX_{p} + \varepsilon\]
Matrix Form: With \(n\) observations \(y_i, x_{i1},\dots,x_{ip}\) \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
where:
Types of Prediction Intervals:
Mean Response: \[\hat{Y}_h \pm t_{n-p-1,\alpha/2}\sqrt{\mathbf{x}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_h\hat{\sigma}^2}\]
Individual Response: \[\hat{Y}_h \pm t_{n-p-1,\alpha/2}\sqrt{\mathbf{x}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_h\hat{\sigma}^2 + \hat{\sigma}^2}\]
Least Squares Estimator: \[\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
Properties:
# Fit multiple regression model
mult_model <- lm(Weight ~ Height + Age, data=body_dat)
# Detailed summary
summary(mult_model)
Call:
lm(formula = Weight ~ Height + Age, data = body_dat)
Residuals:
Min 1Q Median 3Q Max
-19.996 -5.909 -1.321 5.055 41.285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -109.06383 7.38820 -14.76 < 2e-16 ***
Height 1.00227 0.04297 23.33 < 2e-16 ***
Age 0.22127 0.04207 5.26 2.14e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.072 on 504 degrees of freedom
Multiple R-squared: 0.5398, Adjusted R-squared: 0.538
F-statistic: 295.6 on 2 and 504 DF, p-value: < 2.2e-16
2.5 % 97.5 %
(Intercept) -123.5792876 -94.5483638
Height 0.9178550 1.0866946
Age 0.1386185 0.3039214
Sums of Squares Decomposition: \[\sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^n (Y_i - \hat{Y}_i)^2\]
Components: 1. Total Sum of Squares (SST) 2. Regression Sum of Squares (SSR) 3. Error Sum of Squares (SSE)
Degrees of freedom:
Analysis of Variance Table
Response: Weight
Df Sum Sq Mean Sq F value Pr(>F)
Height 1 46370 46370 563.469 < 2.2e-16 ***
Age 1 2277 2277 27.665 2.136e-07 ***
Residuals 504 41476 82
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Model 1: Weight ~ Height
Model 2: Weight ~ Height + Age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 505 43753
2 504 41476 1 2276.7 27.665 2.136e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Weight ~ Gender_fac, data = body_dat)
Residuals:
Min 1Q Median 3Q Max
-24.245 -6.345 -1.400 6.355 44.600
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.6004 0.6241 97.11 <2e-16 ***
Gender_facMale 17.5441 0.8941 19.62 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.06 on 505 degrees of freedom
Multiple R-squared: 0.4326, Adjusted R-squared: 0.4315
F-statistic: 385 on 1 and 505 DF, p-value: < 2.2e-16
Definition: An interaction occurs when the effect of one predictor variable on the response depends on the level of another predictor.
ANOVA: Interactions between several categorical variables. ANCOVA: Interaction between categorical and metric variables
Mathematical Representation: For two predictors X and Z: \[Y_i = \beta_0 + \beta_1X + \beta_2Z + \beta_3(X \cdot Z) + \varepsilon_i\]
Interpretation:
# Create interaction model with birth weight data
interact_model <- lm(Weight ~ Height*Gender_fac, data = body_dat)
summary(interact_model)
Call:
lm(formula = Weight ~ Height * Gender_fac, data = body_dat)
Residuals:
Min 1Q Median 3Q Max
-20.187 -5.957 -1.439 4.955 43.355
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -43.81929 13.77877 -3.180 0.00156 **
Height 0.63334 0.08351 7.584 1.63e-13 ***
Gender_facMale -17.13407 19.56250 -0.876 0.38152
Height:Gender_facMale 0.14923 0.11431 1.305 0.19233
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.795 on 503 degrees of freedom
Multiple R-squared: 0.5682, Adjusted R-squared: 0.5657
F-statistic: 220.7 on 3 and 503 DF, p-value: < 2.2e-16
# Compare with additive model
additive_model <- lm(Weight ~ Height + Gender_fac, data = body_dat)
anova(additive_model, interact_model)Analysis of Variance Table
Model 1: Weight ~ Height + Gender_fac
Model 2: Weight ~ Height * Gender_fac
Res.Df RSS Df Sum of Sq F Pr(>F)
1 504 39043
2 503 38912 1 131.84 1.7043 0.1923
Interpretation: 1. Top-left: Check for linearity and equal variance 2. Top-right: Check normality assumption 3. Bottom-left: Check homoscedasticity 4. Bottom-right: Check for influential points
Types of residuals:
Raw Residuals: \[e_i = y_i - \hat{y}_i\]
Standardized Residuals: \[r_i = \frac{e_i}{s\sqrt{1-h_{ii}}}\]
Studentized Residuals: \[t_i = \frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}\]
# Calculate different residuals
res_raw <- residuals(mult_model)
res_stand <- rstandard(mult_model)
res_stud <- rstudent(mult_model)
# Compare distributions
par(mfrow=c(1,3))
hist(res_raw, main="Raw Residuals")
hist(res_stand, main="Standardized Residuals")
hist(res_stud, main="Studentized Residuals")Methods for Detection:
Height Age
Height 1.00000000 0.06788349
Age 0.06788349 1.00000000
Interpretation: - VIF > 5 or 10 indicates potential problems - High correlations (> 0.8) warrant investigation
Goals of Model Selection: Parsimony (simplicity),Prediction accuracy
Adjusted R²: \[R^2_{adj} = 1 - \frac{n-1}{n-p-1}(1-R^2)\]
AIC (Akaike Information Criterion): \[AIC = -2\log L + 2p\]
BIC (Bayesian Information Criterion): \[BIC = -2\log L + p\log(n)\]
where: - n = sample size, p = number of parameters, L = (maximum) likelihood
1. Forward Selection: - Start with no predictors - Add most significant variable at each step - Stop when no significant improvement
2. Backward Elimination: - Start with all predictors - Remove least significant variable at each step - Stop when all remaining variables significant
3. Stepwise Selection: - Combination of forward and backward - Can add or remove variables at each step
# Full model with all predictors
full_model <- lm(Weight ~ Gender_fac*Height*Age, data = body_dat)
# Stepwise selection
library(MASS)
step_model <- stepAIC(full_model, direction = "both")Start: AIC=2196.29
Weight ~ Gender_fac * Height * Age
Df Sum of Sq RSS AIC
- Gender_fac:Height:Age 1 14.091 37394 2194.5
<none> 37380 2196.3
Step: AIC=2194.48
Weight ~ Gender_fac + Height + Age + Gender_fac:Height + Gender_fac:Age +
Height:Age
Df Sum of Sq RSS AIC
- Gender_fac:Age 1 16.814 37410 2192.7
- Height:Age 1 61.110 37455 2193.3
- Gender_fac:Height 1 134.454 37528 2194.3
<none> 37394 2194.5
+ Gender_fac:Height:Age 1 14.091 37380 2196.3
Step: AIC=2192.71
Weight ~ Gender_fac + Height + Age + Gender_fac:Height + Height:Age
Df Sum of Sq RSS AIC
- Height:Age 1 48.064 37459 2191.4
- Gender_fac:Height 1 126.243 37537 2192.4
<none> 37410 2192.7
+ Gender_fac:Age 1 16.814 37394 2194.5
Step: AIC=2191.36
Weight ~ Gender_fac + Height + Age + Gender_fac:Height
Df Sum of Sq RSS AIC
- Gender_fac:Height 1 121.63 37580 2191.0
<none> 37459 2191.4
+ Height:Age 1 48.06 37410 2192.7
+ Gender_fac:Age 1 3.77 37455 2193.3
- Age 1 1453.02 38912 2208.7
Step: AIC=2191
Weight ~ Gender_fac + Height + Age
Df Sum of Sq RSS AIC
<none> 37580 2191.0
+ Gender_fac:Height 1 121.6 37459 2191.4
+ Height:Age 1 43.5 37537 2192.4
+ Gender_fac:Age 1 6.3 37574 2192.9
- Age 1 1463.2 39043 2208.4
- Gender_fac 1 3896.2 41476 2239.0
- Height 1 12480.1 50060 2334.4
Analysis of Variance Table
Model 1: Weight ~ Gender_fac + Height + Age
Model 2: Weight ~ Gender_fac * Height * Age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 503 37580
2 499 37380 4 200.6 0.6695 0.6134
Call:
lm(formula = Weight ~ Gender_fac + Height + Age, data = body_dat)
Residuals:
Min 1Q Median 3Q Max
-18.989 -5.707 -1.238 4.154 40.784
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -64.12138 9.39615 -6.824 2.55e-11 ***
Gender_facMale 7.68906 1.06475 7.221 1.91e-12 ***
Height 0.72520 0.05611 12.925 < 2e-16 ***
Age 0.17925 0.04050 4.425 1.18e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.644 on 503 degrees of freedom
Multiple R-squared: 0.583, Adjusted R-squared: 0.5805
F-statistic: 234.4 on 3 and 503 DF, p-value: < 2.2e-16
K-fold Cross-Validation Process: 1. Train on n-k sampled observations, 2. Test on remaining k observations, 4. Repeat several times and calculate average performance metrics
# Implement k-fold CV
library(caret)
ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(Weight ~ Gender_fac + Height + Age,
data = body_dat,
method = "lm",
trControl = ctrl)
print(cv_model)Linear Regression
507 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 455, 457, 457, 457, 455, 455, ...
Resampling results:
RMSE Rsquared MAE
8.605987 0.5863791 6.597972
Tuning parameter 'intercept' was held constant at a value of TRUE
Model Form: \[Y = \beta_0 + \beta_1X + \beta_2X^2 + ... + \beta_pX^p + \varepsilon\]
Example with Quadratic Term:
# Fit polynomial model
poly_model <- lm(Weight ~ Height + I(Height^2), data = body_dat)
summary(poly_model)
Call:
lm(formula = Weight ~ Height + I(Height^2), data = body_dat)
Residuals:
Min 1Q Median 3Q Max
-18.536 -6.281 -1.275 5.107 41.206
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.494326 111.850417 -0.308 0.758
Height 0.194491 1.303371 0.149 0.881
I(Height^2) 0.002395 0.003790 0.632 0.528
Residual standard error: 9.314 on 504 degrees of freedom
Multiple R-squared: 0.5149, Adjusted R-squared: 0.513
F-statistic: 267.5 on 2 and 504 DF, p-value: < 2.2e-16
Purpose: Less sensitive to outliers than OLS
Methods: 1. M-estimation 2. MM-estimation 3. LTS (Least Trimmed Squares) 4. LAD (Least Absolute Deviation)
# Compare OLS and robust regression
library(MASS)
robust_model <- rlm(Weight ~ Gender_fac + Height + Age,
data = body_dat)
# Compare coefficients
cbind(
OLS = coef(step_model),
Robust = coef(robust_model)
) OLS Robust
(Intercept) -64.1213821 -66.2329710
Gender_facMale 7.6890571 8.0784915
Height 0.7251970 0.7355451
Age 0.1792494 0.1614021
Essential Elements:
Model specification
Parameter estimates with CIs
Model fit statistics, significance of effects
Diagnostic results
Prediction performance
1. Multicollinearity: Correlated predictors
Solution: VIF analysis, variable selection, or dimension reduction
2. Outliers: Influential observations
Solution: Robust methods, careful examination of unusual cases
3. Non-linearity: Misspecified relationships
Solution: Transformation, polynomial terms, non-parametric methods
4. Heteroscedasticity: Non-constant variance
Solution: Weighted least squares, transformation
ggplot2: advanced plotting
effects: effect displays
visreg: regression visualization
2. Systematic Component: Linear predictor \[\theta = \beta\cdot x = \beta_0 + \beta_1x_1 + ... + \beta_px_p\]
3. Link Function: Connects expectation \(E(Y|x)=\mu(x)\) to linear predictor \(\theta\): \[g(\mu) = \theta\]
The Complete Chain: \[ A'(\theta) = E[Y|x] = \mu(x) = g^{-1}(\beta\cdot x) \]
For canonical links: η = θ and \(b' = g^{-1}\)
| Model | Response Distribution | Link Function | Natural Parameter θ | Mean μ = b’(θ) | Variance b’’(θ) |
|---|---|---|---|---|---|
| Linear | \(Y \sim N(\mu,\sigma^2)\) | \(g(\mu)=\mu\) | \(\theta=\mu\) | \(\mu=\theta\) | 1 |
| Logistic | \(Y \sim \text{Bin}(n,p)\) | \(g(\mu)=\log(\frac{\mu}{1-\mu})\) | \(\theta=\log(\frac{p}{1-p})\) | \(\mu=\frac{e^\theta}{1+e^\theta}\) | \(\frac{e^\theta}{(1+e^\theta)^2}\) |
| Poisson | \(Y \sim \text{Pois}(\lambda)\) | \(g(\mu)=\log(\mu)\) | \(\theta=\log(\lambda)\) | \(\mu=e^\theta\) | \(e^\theta\) |
| Gamma | \(Y \sim \text{Gamma}(\nu,\lambda)\) | \(g(\mu)=\frac{1}{\mu}\) | \(\theta=-\frac{1}{\mu}\) | \(\mu=-\frac{1}{\theta}\) | \(\frac{1}{\theta^2}\) |
Often the dependent variable is binary (success/no success): Leads to problems with linear regression:
Let \(p=p(x)\) denote the probability of a success. It depends on the independent variables \(x\). In simplified notation this dependency is skipped.
Model Form: \(p\) is estimated indirectly \[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + ... + \beta_px_p\]
Key Properties of the logit function:
Regression parameters are interpretable as logarithmized odds ratios!
The output (for given \(x\)) is interpretable as log-oddsreta\[odds = e^{logit}\]
success probabilities (for given \(x\)) can be calculated as\[p=e^{\frac{e^{logit}}{1+e^{logit}}}\]
Regression parameters are interpretable as logarithmized odds ratios.
Positive parameter: logit/probability increases when variable increases
Negative parameters: probability decreases if the variable increases
If an independent variable \(x_i\) changes by \(\Delta\), the new odds are\[odds_{new}=e^{\beta_i \Delta}\cdot odds_{old}\]
The case \(\Delta=1\) is especially important for factor variables.
# Load Pima Indians Diabetes dataset
data(Pima.tr)
# Fit logistic regression with key risk factors
model_logistic <- glm(type ~ glu + bmi + age + bp,
family = binomial(link = "logit"),
data = Pima.tr)
summary(model_logistic)
Call:
glm(formula = type ~ glu + bmi + age + bp, family = binomial(link = "logit"),
data = Pima.tr)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.144169 1.647351 -5.551 2.84e-08 ***
glu 0.031224 0.006560 4.760 1.94e-06 ***
bmi 0.093564 0.032645 2.866 0.00416 **
age 0.054793 0.018166 3.016 0.00256 **
bp -0.006076 0.017345 -0.350 0.72612
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 256.41 on 199 degrees of freedom
Residual deviance: 188.27 on 195 degrees of freedom
AIC: 198.27
Number of Fisher Scoring iterations: 5
interpretation of parameters is completely different to case
The exponentiated parameters \(\\e^{\beta_i}\) interpretable as odds ratios of a change in \(X_i\) by one unit.
If \(X_{i}\) is a categorical variable, \(\\e^{\beta_{ij}}\) interpreted as odds ratio for j-th category relative reference (control)
\(\\e^{\beta_0}\) : odds when all variables are at reference value
Choose based on: Cost of false positives and missed cases
Often the dependent variable represents counts (e.g. number of events).
Model Form: \[\log(\lambda) = \beta_0 + \beta_1x_1 + ... + \beta_px_p\]
Properties:
Count responses: 0,1,2,… events over a time period
Response is Poisson distributed with parameter \(\lambda\).
\(\lambda\) is interpreted as the expected number of events over the time period
Mean = Variance (equidispersion)
Log link ensures positive means - Interpretable as log-rates
Study of anti-epileptic drug progabide (n=59 patients)
Variables:
y: Number of seizures per 2-week period
trt: Treatment (0=placebo, 1=progabide)
base: Baseline seizures (8 weeks before trial)
age: Patient’s age in years
period: Measurement period (1-4)
id: Patient identifier (1-59)
4 repeated measurements per patient (236 total observations)
# Fit Poisson model for seizure counts
model_seizures <- glm(y ~ period+ period:trt + I(age - 18) + I(base -6),
family = poisson(link = "log"),
data = epil)
summary(model_seizures)
Call:
glm(formula = y ~ period + period:trt + I(age - 18) + I(base -
6), family = poisson(link = "log"), data = epil)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1977758 0.0862054 13.894 < 2e-16 ***
period2 -0.1216071 0.0901506 -1.349 0.1774
period3 -0.0711763 0.0889669 -0.800 0.4237
period4 -0.1611727 0.0911104 -1.769 0.0769 .
I(age - 18) 0.0223476 0.0040274 5.549 2.88e-08 ***
I(base - 6) 0.0226352 0.0005094 44.436 < 2e-16 ***
period1:trtTreatment -0.1634243 0.0883773 -1.849 0.0644 .
period2:trtTreatment -0.0607931 0.0915211 -0.664 0.5065
period3:trtTreatment -0.1463153 0.0911093 -1.606 0.1083
period4:trtTreatment -0.2482098 0.0976030 -2.543 0.0110 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2517.83 on 235 degrees of freedom
Residual deviance: 944.53 on 226 degrees of freedom
AIC: 1730.4
Number of Fisher Scoring iterations: 5
(Intercept) period2 period3
3.3127404 0.8854962 0.9312977
period4 I(age - 18) I(base - 6)
0.8511450 1.0225992 1.0228934
period1:trtTreatment period2:trtTreatment period3:trtTreatment
0.8492308 0.9410179 0.8638853
period4:trtTreatment
0.7801962
Overdispersion test
data: model_seizures
z = 2.6993, p-value = 0.003474
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
4.920118
When Poisson variance assumption is violated: \(Var[Y] = \mu + \alpha\mu^2\)
Call:
glm.nb(formula = y ~ period + period:trt + I(age - 18) + I(base -
6), data = epil, init.theta = 2.45404119, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1102375 0.1814082 6.120 9.35e-10 ***
period2 -0.1046903 0.2018153 -0.519 0.6039
period3 -0.0952657 0.2016820 -0.472 0.6367
period4 -0.1421672 0.2023559 -0.703 0.4823
I(age - 18) 0.0180364 0.0081967 2.200 0.0278 *
I(base - 6) 0.0269269 0.0017455 15.426 < 2e-16 ***
period1:trtTreatment -0.3245884 0.2003595 -1.620 0.1052
period2:trtTreatment 0.0002938 0.1985914 0.001 0.9988
period3:trtTreatment -0.1506198 0.2004423 -0.751 0.4524
period4:trtTreatment -0.2968729 0.2042469 -1.453 0.1461
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(2.454) family taken to be 1)
Null deviance: 559.97 on 235 degrees of freedom
Residual deviance: 264.87 on 226 degrees of freedom
AIC: 1332.5
Number of Fisher Scoring iterations: 1
Theta: 2.454
Std. Err.: 0.331
2 x log-likelihood: -1310.468
df AIC
model_seizures 10 1730.433
model_nb 11 1332.468
Response: Count per unit exposure
Link: \(\log(\lambda) = X\beta + \log(t)\)
No parameter is estimated for the last term, “offset”
If some rates are zero: zero inflated model
R Packages:
Example Datasets:
lung: Advanced lung cancer data
melanoma: Melanoma survival study
BMT: Bone marrow transplant data
Time-to-Event Data:
Clear starting point (e.g., diagnosis, surgery)
Well-defined endpoint (e.g., death, recurrence)
Time to event (waiting time)
Typical Applications:
Clinical trials: Time to death/progression
Reliability: Time to component failure
Unique Feature: Censored observations - not all observations observed until event
Incomplete information about survival time
Types of Censoring:
Right Censoring: Event occurs after observation period
e.g., Patient lost to follow-up
Left Censoring: Event occurs before first observation
e.g., Disease onset before study
Interval Censoring Exact time to event unknown. Only a time interval known.
e.g., Disease detected at check-up
lung <- lung %>%
mutate(
status = as.numeric(status == 2), # This will create 0 for censored, 1 for dead
sex = factor(sex, labels = c("female", "male"))
)
head(lung[, c("time", "status", "sex")]) time status sex
1 306 1 female
2 455 1 female
3 1010 0 female
4 210 1 female
5 883 1 female
6 1022 0 female
For each subject \(i\):
True survival time: \(T_i\)
Censoring time: \(C_i\)
Observed time: \(Y_i = \min(T_i, C_i)\)
Event indicator: \(\delta_i = I(T_i \leq C_i)\)
Key Assumption: Independent censoring: \(C_i\) independent of \(T_i\)
Impact on Analysis: Cannot use standard regression methods
Definition: \[S(t) = P(T > t) = 1 - F(t)\]
Interpretation:
Probability of surviving beyond time t
Basis for survival curves
Properties:
Monotone decreasing
\(S(0) = 1\): All participants are alive at start time
\(\lim_{t \to \infty} S(t) = 0\): In the long run everybody is dead …
Right-continuous
Definition: \[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t}\]
Interpretation:
Instantaneous risk of event
Rate of failure at time t
Conditional on survival to time t
Probability for event exactly at time \(t\): \(P(T=t)=S(t)\cdot h(t)\)
Characteristics:
Non-negative
No upper bound
Key Relationships: \[h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t)\]
Cumulative Hazard: \[H(t) = \int_0^t h(u)du = -\log S(t)\]
Product-Limit Estimator (nonparametric): \[\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\]
Components:
\(n_i\) = number at risk at time \(t_i\)
\(d_i\) = number of events at time \(t_i\)
Product over all event times ≤ t
Key Features: - Consistent estimator of S(t) - Asymptotically unbiased - Confidence intervals widen over time - More precise with larger samples
Assumptions: - Independent censoring - No secular trends - Similar prospects for censored/uncensored subjects
# Create survival object and fit
km_fit <- survfit2(Surv(time, status) ~ 1, data = lung)
# Basic summary
summary(km_fit)Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 228 1 0.9956 0.00438 0.9871 1.000
11 227 3 0.9825 0.00869 0.9656 1.000
12 224 1 0.9781 0.00970 0.9592 0.997
13 223 2 0.9693 0.01142 0.9472 0.992
15 221 1 0.9649 0.01219 0.9413 0.989
26 220 1 0.9605 0.01290 0.9356 0.986
30 219 1 0.9561 0.01356 0.9299 0.983
31 218 1 0.9518 0.01419 0.9243 0.980
53 217 2 0.9430 0.01536 0.9134 0.974
54 215 1 0.9386 0.01590 0.9079 0.970
59 214 1 0.9342 0.01642 0.9026 0.967
60 213 2 0.9254 0.01740 0.8920 0.960
61 211 1 0.9211 0.01786 0.8867 0.957
62 210 1 0.9167 0.01830 0.8815 0.953
65 209 2 0.9079 0.01915 0.8711 0.946
71 207 1 0.9035 0.01955 0.8660 0.943
79 206 1 0.8991 0.01995 0.8609 0.939
81 205 2 0.8904 0.02069 0.8507 0.932
88 203 2 0.8816 0.02140 0.8406 0.925
92 201 1 0.8772 0.02174 0.8356 0.921
93 199 1 0.8728 0.02207 0.8306 0.917
95 198 2 0.8640 0.02271 0.8206 0.910
105 196 1 0.8596 0.02302 0.8156 0.906
107 194 2 0.8507 0.02362 0.8056 0.898
110 192 1 0.8463 0.02391 0.8007 0.894
116 191 1 0.8418 0.02419 0.7957 0.891
118 190 1 0.8374 0.02446 0.7908 0.887
122 189 1 0.8330 0.02473 0.7859 0.883
131 188 1 0.8285 0.02500 0.7810 0.879
132 187 2 0.8197 0.02550 0.7712 0.871
135 185 1 0.8153 0.02575 0.7663 0.867
142 184 1 0.8108 0.02598 0.7615 0.863
144 183 1 0.8064 0.02622 0.7566 0.859
145 182 2 0.7975 0.02667 0.7469 0.852
147 180 1 0.7931 0.02688 0.7421 0.848
153 179 1 0.7887 0.02710 0.7373 0.844
156 178 2 0.7798 0.02751 0.7277 0.836
163 176 3 0.7665 0.02809 0.7134 0.824
166 173 2 0.7577 0.02845 0.7039 0.816
167 171 1 0.7532 0.02863 0.6991 0.811
170 170 1 0.7488 0.02880 0.6944 0.807
175 167 1 0.7443 0.02898 0.6896 0.803
176 165 1 0.7398 0.02915 0.6848 0.799
177 164 1 0.7353 0.02932 0.6800 0.795
179 162 2 0.7262 0.02965 0.6704 0.787
180 160 1 0.7217 0.02981 0.6655 0.783
181 159 2 0.7126 0.03012 0.6559 0.774
182 157 1 0.7081 0.03027 0.6511 0.770
183 156 1 0.7035 0.03041 0.6464 0.766
186 154 1 0.6989 0.03056 0.6416 0.761
189 152 1 0.6943 0.03070 0.6367 0.757
194 149 1 0.6897 0.03085 0.6318 0.753
197 147 1 0.6850 0.03099 0.6269 0.749
199 145 1 0.6803 0.03113 0.6219 0.744
201 144 2 0.6708 0.03141 0.6120 0.735
202 142 1 0.6661 0.03154 0.6071 0.731
207 139 1 0.6613 0.03168 0.6020 0.726
208 138 1 0.6565 0.03181 0.5970 0.722
210 137 1 0.6517 0.03194 0.5920 0.717
212 135 1 0.6469 0.03206 0.5870 0.713
218 134 1 0.6421 0.03218 0.5820 0.708
222 132 1 0.6372 0.03231 0.5769 0.704
223 130 1 0.6323 0.03243 0.5718 0.699
226 126 1 0.6273 0.03256 0.5666 0.694
229 125 1 0.6223 0.03268 0.5614 0.690
230 124 1 0.6172 0.03280 0.5562 0.685
239 121 2 0.6070 0.03304 0.5456 0.675
245 117 1 0.6019 0.03316 0.5402 0.670
246 116 1 0.5967 0.03328 0.5349 0.666
267 112 1 0.5913 0.03341 0.5294 0.661
268 111 1 0.5860 0.03353 0.5239 0.656
269 110 1 0.5807 0.03364 0.5184 0.651
270 108 1 0.5753 0.03376 0.5128 0.645
283 104 1 0.5698 0.03388 0.5071 0.640
284 103 1 0.5642 0.03400 0.5014 0.635
285 101 2 0.5531 0.03424 0.4899 0.624
286 99 1 0.5475 0.03434 0.4841 0.619
288 98 1 0.5419 0.03444 0.4784 0.614
291 97 1 0.5363 0.03454 0.4727 0.608
293 94 1 0.5306 0.03464 0.4669 0.603
301 91 1 0.5248 0.03475 0.4609 0.597
303 89 1 0.5189 0.03485 0.4549 0.592
305 87 1 0.5129 0.03496 0.4488 0.586
306 86 1 0.5070 0.03506 0.4427 0.581
310 85 2 0.4950 0.03523 0.4306 0.569
320 82 1 0.4890 0.03532 0.4244 0.563
329 81 1 0.4830 0.03539 0.4183 0.558
337 79 1 0.4768 0.03547 0.4121 0.552
340 78 1 0.4707 0.03554 0.4060 0.546
345 77 1 0.4646 0.03560 0.3998 0.540
348 76 1 0.4585 0.03565 0.3937 0.534
350 75 1 0.4524 0.03569 0.3876 0.528
351 74 1 0.4463 0.03573 0.3815 0.522
353 73 2 0.4340 0.03578 0.3693 0.510
361 70 1 0.4278 0.03581 0.3631 0.504
363 69 2 0.4154 0.03583 0.3508 0.492
364 67 1 0.4092 0.03582 0.3447 0.486
371 65 2 0.3966 0.03581 0.3323 0.473
387 60 1 0.3900 0.03582 0.3258 0.467
390 59 1 0.3834 0.03582 0.3193 0.460
394 58 1 0.3768 0.03580 0.3128 0.454
426 55 1 0.3700 0.03580 0.3060 0.447
428 54 1 0.3631 0.03579 0.2993 0.440
429 53 1 0.3563 0.03576 0.2926 0.434
433 52 1 0.3494 0.03573 0.2860 0.427
442 51 1 0.3426 0.03568 0.2793 0.420
444 50 1 0.3357 0.03561 0.2727 0.413
450 48 1 0.3287 0.03555 0.2659 0.406
455 47 1 0.3217 0.03548 0.2592 0.399
457 46 1 0.3147 0.03539 0.2525 0.392
460 44 1 0.3076 0.03530 0.2456 0.385
473 43 1 0.3004 0.03520 0.2388 0.378
477 42 1 0.2933 0.03508 0.2320 0.371
519 39 1 0.2857 0.03498 0.2248 0.363
520 38 1 0.2782 0.03485 0.2177 0.356
524 37 2 0.2632 0.03455 0.2035 0.340
533 34 1 0.2554 0.03439 0.1962 0.333
550 32 1 0.2475 0.03423 0.1887 0.325
558 30 1 0.2392 0.03407 0.1810 0.316
567 28 1 0.2307 0.03391 0.1729 0.308
574 27 1 0.2221 0.03371 0.1650 0.299
583 26 1 0.2136 0.03348 0.1571 0.290
613 24 1 0.2047 0.03325 0.1489 0.281
624 23 1 0.1958 0.03297 0.1407 0.272
641 22 1 0.1869 0.03265 0.1327 0.263
643 21 1 0.1780 0.03229 0.1247 0.254
654 20 1 0.1691 0.03188 0.1169 0.245
655 19 1 0.1602 0.03142 0.1091 0.235
687 18 1 0.1513 0.03090 0.1014 0.226
689 17 1 0.1424 0.03034 0.0938 0.216
705 16 1 0.1335 0.02972 0.0863 0.207
707 15 1 0.1246 0.02904 0.0789 0.197
728 14 1 0.1157 0.02830 0.0716 0.187
731 13 1 0.1068 0.02749 0.0645 0.177
735 12 1 0.0979 0.02660 0.0575 0.167
765 10 1 0.0881 0.02568 0.0498 0.156
791 9 1 0.0783 0.02462 0.0423 0.145
814 7 1 0.0671 0.02351 0.0338 0.133
883 4 1 0.0503 0.02285 0.0207 0.123
Hazard Model: \[h(t|X_i) = h_0(t)\exp(\beta^T X_i)\]
Components:
\(h_0(t)\) = baseline hazard (nonparametric)
\(X_i\) = covariate vector
\(\beta\) = regression coefficients (parametric)
Key Features:
Semi-parametric approach
Assumes proportional hazards
Handles multiple predictors
Definition: Hazard ratio constant over time
Mathematical Expression: \[\frac{h_1(t)}{h_2(t)} = \exp(\beta) \text{ for all } t\]
Key Implications: - Parallel log-survival curves - Non-crossing hazard functions - Time-independent effect sizes
Common Violations: - Early vs. late treatment effects - Delayed treatment onset - Changing risk factors over time
Sir David Cox : - British statistician who revolutionized survival analysis - Developed the Cox proportional hazards model (1972) - “Regression Models and Life Tables” - one of statistics’ most cited papers
The Innovation: - Combined flexibility of non-parametric methods with regression analysis - Introduced partial likelihood estimation - Enabled analysis of censored data with multiple predictors
Impact: - Standard method in medical research - Enables evidence-based clinical decisions - Applications across many scientific fields
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)
n= 228, number of events= 165
coef exp(coef) se(coef) z Pr(>|z|)
sexmale -0.5310 0.5880 0.1672 -3.176 0.00149 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
sexmale 0.588 1.701 0.4237 0.816
Concordance= 0.579 (se = 0.021 )
Likelihood ratio test= 10.63 on 1 df, p=0.001
Wald test = 10.09 on 1 df, p=0.001
Score (logrank) test = 10.33 on 1 df, p=0.001
sexmale
0.5880028
sexmale age
0.598566 1.017191
Kaplan-Meier Approach:
Non-parametric estimation
Limited to categorical predictors
Cox Model Advantages:
Semi-parametric estimation
Handles multiple predictors (adjusts for confounders)
Continuous variables allowed
Multiple possible event types that are mutually exclusive
Cumulative Incidence Function: \[F_k(t) = P(T \leq t, D = k) = \int_0^t S(u)h_k(u)du\]
Definition: \[CS(y|x) = P(T > x + y | T > x) = \frac{S(x + y)}{S(x)}\]
Probability to survive time \(x+y\), if already survived until time \(x\)
Increases with time \(x\) already survived
Copyright: Raimund Kovacevic