##| include: false
library(ggplot2)
library(mosaic)
library(MASS)
library(readxl)
library(psych)

library(dplyr)
library(effects)
library(pROC)
library(ordinal)
library(pscl)
library(nnet)

Introduction to Regression Analysis

  • Regression analysis is a fundamental statistical method that models relationships between variables:

  • Key Concepts:

    • Dependent Variable (\(Y\)): The outcome we want to predict or understand

    • Independent Variables (\(X, X_1,X_2, \dots\)): The predictors or explanatory variables

    • Simplest Functional Relationship: linear

    • Effects (explained) versus Error Term (unexplained variation)

  • Applications in Medical Research: Predicting patient outcomes based on clinical measurements, Studying dose-response relationships, Adjusting for known confounding factors

Historical Development in Detail

  • 1805 - Legendre and Least Squares:
    • First published account of method of least squares (astronomy)
  • 1809 - Gauss’s Contributions:
    • Connected least squares to probability theory
    • Developed normal distribution properties
  • 1886 - Galton’s Regression:
    • Studied parent-child height relationships (“regression to mediocrity”)
  • 20th Century Developments:
    • 1922: Fisher’s statistical framework
    • 1950s: Computational methods
    • 1970s: Robust methods
    • 1980s: Generalized linear models

Regression Line

Simple Linear Regression

\[Y = \alpha + \beta X + \varepsilon, \quad i = 1,\ldots,n\]

  • Components:

    • \(Y\): Dependent variable observation - Must be continuous

    • \(X\): Independent variable observation

    • Parameters:

      • \(\alpha\): Y-intercept (baseline value, \(X=0\))
      • \(\beta\): Slope (change in Y per unit change in X)
    • \(\varepsilon_i\): Random error

      • Assumption: i.i.d. \(N(0,\sigma^2)\)
      • Critical for inference

Basic Assumptions

  • Linear relation between predictor(s) and outcome
  • Independence of residuals
  • Normal distribution of residuals
  • Equal variance of residuals

Parameter Estimation

Least Squares Method: Minimize the sum of squared residuals with observations \(x_i, y_i\) \[Q(\alpha,\beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2\]

Normal Equations: \[\frac{\partial Q}{\partial \alpha} = 0: \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0\] \[\frac{\partial Q}{\partial \beta} = 0: \sum_{i=1}^n x_i(y_i - \alpha - \beta x_i) = 0\]

Parameter Estimation

Resulting Estimators: \[\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\] \[\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}\]

Properties of Least Squares Estimators

  • Unbiasedness: \[E[\hat{\beta}] = \beta\text{, }E[\hat{\alpha}] = \alpha\]

  • Variance: \[Var(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\] \[Var(\hat{\alpha}) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\right)\]

  • Efficiency:

    • Achieve Cramér-Rao lower bound under normality

    • BLUE (Best Linear Unbiased Estimator)

Dataset body_dat

# Basic summary statistics
body_dat <- read_excel("body_dat.xlsx") %>% 
  dplyr::select(Age,Height,Weight,Gender) %>% 
  mutate(Gender_fac = factor(Gender, labels = c("Female","Male")))
describe(body_dat,ranges=T,skew=F)
            vars   n   mean    sd median   min   max range   se
Age            1 507  30.18  9.61   27.0  18.0  67.0  49.0 0.43
Height         2 507 171.14  9.41  170.3 147.2 198.1  50.9 0.42
Weight         3 507  69.15 13.35   68.2  42.0 116.4  74.4 0.59
Gender         4 507   0.49  0.50    0.0   0.0   1.0   1.0 0.02
Gender_fac*    5 507   1.49  0.50    1.0   1.0   2.0   1.0 0.02
plot(body_dat$Height,body_dat$Weight,xlab = "Height", ylab="Weight")

Simple Linear Regression

# Fit simple linear model
model_simple <- lm(Weight ~ Height, data = body_dat)

# Detailed summary
summary(model_simple)

Call:
lm(formula = Weight ~ Height, data = body_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.743  -6.402  -1.231   5.059  41.103 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
Height         1.01762    0.04399   23.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.308 on 505 degrees of freedom
Multiple R-squared:  0.5145,    Adjusted R-squared:  0.5136 
F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

Simple Regression

  • Additional Information from the Model: Significance of Results

  • \(R^2\): The square of the correlation between both variables. Indicates what proportion of the total variance is explained by the linear model. The F-test (p-value) tests the null hypothesis that the true correlation is zero - meaning no relationship exists.

  • t-Test: For each regression parameter, a p-value (Pr(>|t|)) is reported. The t-test tests the null hypothesis that the respective parameter is actually zero.

Inference in Simple Regression

Sampling Distribution of Estimators: Under normality assumptions: \[\hat{\beta} \sim N(\beta, \frac{\sigma^2}{\sum(X_i - \bar{X})^2})\]

Standard Error Estimation: \[s_{\hat{\beta}} = \sqrt{\frac{\sum e_i^2/(n-2)}{\sum(x_i - \bar{x})^2}}\] where \(e_i\) are residuals

t-Statistics: \[t = \frac{\hat{\beta} - \beta_0}{s_{\hat{\beta}}} \sim t_{n-2}\]

Example: Inference for Height and Weight

# Model with confidence intervals
confint(model_simple, level = 0.95)
                   2.5 %     97.5 %
(Intercept) -119.8237251 -90.198783
Height         0.9311971   1.104036
# Visualization with confidence bands
ggplot(body_dat, aes(x = Height, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.95) +
  labs(title = "Birth Weight vs Height with 95% Confidence Bands")

Prediction in Simple Regression

  • Confidence Interval for mean response: \[\hat{y} \pm t_{n-2,\alpha/2} \cdot s_{\hat{y}_h}\] where: \[s_{\hat{y}}^2 = \hat{\sigma}^2[1/n + (x - \bar{x})^2/\sum(x_i - \bar{x})^2]\]

  • Prediction Interval for individual response: \[\hat{y} \pm t_{n-2,\alpha/2} \cdot s_{pred}\] where: \[s_{pred}^2 = \hat{\sigma}^2[1 + 1/n + (x - \bar{x})^2/\sum(x_i - \bar{x})^2]\]

Prediction Intervals Example

# Create new data for prediction
new_Weights <- data.frame(Height = c(150,160,170,180,190))

# Calculate both types of intervals
predict(model_simple, new_Weights, 
                    interval = "confidence")
       fit      lwr      upr
1 47.63126 45.63166 49.63087
2 57.80743 56.54764 59.06722
3 67.98360 67.16544 68.80176
4 78.15977 77.04380 79.27573
5 88.33593 86.51521 90.15666
predict(model_simple, new_Weights, 
                    interval = "prediction")
       fit      lwr       upr
1 47.63126 29.23501  66.02751
2 57.80743 39.47684  76.13802
3 67.98360 49.67805  86.28914
4 78.15977 59.83849  96.48104
5 88.33593 69.95827 106.71360

Multiple Regression

  • Model with p predictors: \[Y = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + ... + \beta_pX_{p} + \varepsilon\]

  • Matrix Form: With \(n\) observations \(y_i, x_{i1},\dots,x_{ip}\) \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]

  • where:

    • \(\mathbf{y}\) is n × 1 response vector
    • \(\mathbf{X}\) is n × (p+1) design matrix
    • \(\boldsymbol{\beta}\) is (p+1) × 1 parameter vector
    • \(\boldsymbol{\varepsilon}\) is n × 1 error vector

Assumptions

  1. Linearity:
    • \(E[\mathbf{Y}|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}\)
    • Relationship is linear in parameters
  2. Full Rank:
    • \(\mathbf{X}\) has full column rank
    • No perfect multicollinearity, all predictors provide unique information
  3. Exogeneity:
    • \(E[\boldsymbol{\varepsilon}|\mathbf{X}] = \mathbf{0}\)
    • Errors independent of predictors, no endogeneity
  4. Homoscedasticity and Independence:
    • \(Var[\boldsymbol{\varepsilon}|\mathbf{X}] = \sigma^2\mathbf{I}\)
    • Constant error variance, no autocorrelation

Prediction Methods

Types of Prediction Intervals:

  • Mean Response: \[\hat{Y}_h \pm t_{n-p-1,\alpha/2}\sqrt{\mathbf{x}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_h\hat{\sigma}^2}\]

  • Individual Response: \[\hat{Y}_h \pm t_{n-p-1,\alpha/2}\sqrt{\mathbf{x}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_h\hat{\sigma}^2 + \hat{\sigma}^2}\]

Multiple Regression Estimation

  • Least Squares Estimator: \[\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]

  • Properties:

    • Unbiasedness: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)
    • Variance-Covariance Matrix: \[Var[\hat{\boldsymbol{\beta}}] = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]
    • BLUE Property: Minimum variance among all linear unbiased estimators, “best linear unbiased estimator”

Multiple Predictors of Weight

# Fit multiple regression model
mult_model <- lm(Weight ~ Height + Age, data=body_dat)

# Detailed summary
summary(mult_model)

Call:
lm(formula = Weight ~ Height + Age, data = body_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.996  -5.909  -1.321   5.055  41.285 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -109.06383    7.38820  -14.76  < 2e-16 ***
Height         1.00227    0.04297   23.33  < 2e-16 ***
Age            0.22127    0.04207    5.26 2.14e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.072 on 504 degrees of freedom
Multiple R-squared:  0.5398,    Adjusted R-squared:  0.538 
F-statistic: 295.6 on 2 and 504 DF,  p-value: < 2.2e-16
confint(mult_model)
                   2.5 %      97.5 %
(Intercept) -123.5792876 -94.5483638
Height         0.9178550   1.0866946
Age            0.1386185   0.3039214

Variance (ANOVA) Decomposition

Sums of Squares Decomposition: \[\sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^n (Y_i - \hat{Y}_i)^2\]

  • Components: 1. Total Sum of Squares (SST) 2. Regression Sum of Squares (SSR) 3. Error Sum of Squares (SSE)

  • Degrees of freedom:

    • Total: n - 1
    • Regression: p
    • Error: n - p - 1

ANOVA Table Construction

# Generate ANOVA table
anova(mult_model)
Analysis of Variance Table

Response: Weight
           Df Sum Sq Mean Sq F value    Pr(>F)    
Height      1  46370   46370 563.469 < 2.2e-16 ***
Age         1   2277    2277  27.665 2.136e-07 ***
Residuals 504  41476      82                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_simple,mult_model)
Analysis of Variance Table

Model 1: Weight ~ Height
Model 2: Weight ~ Height + Age
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    505 43753                                  
2    504 41476  1    2276.7 27.665 2.136e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Interpretation
    • Sequential sums of squares
    • Contribution of each predictor

Analysis of Variance (ANOVA)

  • We have already addressed differences between means through t-tests.
  • Here we examine the same question using a regression model.
  • The advantage of regression models is:
    • mean comparisons for variables with more than two levels.
mod3 <- lm(Weight ~ Gender_fac, data=body_dat) 
summary(mod3)

Call:
lm(formula = Weight ~ Gender_fac, data = body_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.245  -6.345  -1.400   6.355  44.600 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     60.6004     0.6241   97.11   <2e-16 ***
Gender_facMale  17.5441     0.8941   19.62   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.06 on 505 degrees of freedom
Multiple R-squared:  0.4326,    Adjusted R-squared:  0.4315 
F-statistic:   385 on 1 and 505 DF,  p-value: < 2.2e-16

Interaction Effects: Theory

  • Definition: An interaction occurs when the effect of one predictor variable on the response depends on the level of another predictor.

  • ANOVA: Interactions between several categorical variables. ANCOVA: Interaction between categorical and metric variables

  • Mathematical Representation: For two predictors X and Z: \[Y_i = \beta_0 + \beta_1X + \beta_2Z + \beta_3(X \cdot Z) + \varepsilon_i\]

  • Interpretation:

    • \(\beta_1\): Effect of X when Z = 0
    • \(\beta_2\): Effect of Z when X = 0
    • \(\beta_3\): Change in X’s effect for one unit change in Z

Visualizing Interactions ANCOVA

#Alternative visualization with ggplot2
ggplot(body_dat, aes(x = Height, y = Weight, color = Gender_fac)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Gender-Height Interacion",
       x = "Height",
       y = "Weight",
       color = "Gender")

Analysis of Covariance

# Create interaction model with birth weight data
interact_model <- lm(Weight ~ Height*Gender_fac, data = body_dat)
summary(interact_model)

Call:
lm(formula = Weight ~ Height * Gender_fac, data = body_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.187  -5.957  -1.439   4.955  43.355 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -43.81929   13.77877  -3.180  0.00156 ** 
Height                  0.63334    0.08351   7.584 1.63e-13 ***
Gender_facMale        -17.13407   19.56250  -0.876  0.38152    
Height:Gender_facMale   0.14923    0.11431   1.305  0.19233    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.795 on 503 degrees of freedom
Multiple R-squared:  0.5682,    Adjusted R-squared:  0.5657 
F-statistic: 220.7 on 3 and 503 DF,  p-value: < 2.2e-16

Analysis of Covariance

# Compare with additive model
additive_model <- lm(Weight ~ Height + Gender_fac, data = body_dat)
anova(additive_model, interact_model)
Analysis of Variance Table

Model 1: Weight ~ Height + Gender_fac
Model 2: Weight ~ Height * Gender_fac
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    504 39043                           
2    503 38912  1    131.84 1.7043 0.1923

Model Diagnostics

  1. Linearity:
    • Relationship is linear
    • Check: Residuals vs. fitted values plot
  2. Independence:
    • Errors are independent
    • Check: Residuals vs. order/time plot
  3. Homoscedasticity:
    • Constant error variance
    • Check: Scale-location plot
  4. Normality:
    • Errors are normally distributed
    • Check: Q-Q plot

Diagnostic Plots

par(mfrow = c(2,2))
plot(mult_model)

Interpretation: 1. Top-left: Check for linearity and equal variance 2. Top-right: Check normality assumption 3. Bottom-left: Check homoscedasticity 4. Bottom-right: Check for influential points

Residual Analysis

Types of residuals:

  1. Raw Residuals: \[e_i = y_i - \hat{y}_i\]

  2. Standardized Residuals: \[r_i = \frac{e_i}{s\sqrt{1-h_{ii}}}\]

  3. Studentized Residuals: \[t_i = \frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}\]

Residual Analysis

# Calculate different residuals
res_raw <- residuals(mult_model)
res_stand <- rstandard(mult_model)
res_stud <- rstudent(mult_model)

# Compare distributions
par(mfrow=c(1,3))
hist(res_raw, main="Raw Residuals")
hist(res_stand, main="Standardized Residuals")
hist(res_stud, main="Studentized Residuals")

Multicollinearity Detection

Methods for Detection:

  1. Correlation Matrix:
# Correlation matrix for continuous predictors
cor(body_dat[c("Height", "Age")])
           Height        Age
Height 1.00000000 0.06788349
Age    0.06788349 1.00000000
  1. Variance Inflation Factors (VIF):
library(car)
vif(mult_model)
 Height     Age 
1.00463 1.00463 

Interpretation: - VIF > 5 or 10 indicates potential problems - High correlations (> 0.8) warrant investigation

Model Selection

Goals of Model Selection: Parsimony (simplicity),Prediction accuracy

  1. Adjusted R²: \[R^2_{adj} = 1 - \frac{n-1}{n-p-1}(1-R^2)\]

  2. AIC (Akaike Information Criterion): \[AIC = -2\log L + 2p\]

  3. BIC (Bayesian Information Criterion): \[BIC = -2\log L + p\log(n)\]

where: - n = sample size, p = number of parameters, L = (maximum) likelihood

Stepwise Selection Procedures

1. Forward Selection: - Start with no predictors - Add most significant variable at each step - Stop when no significant improvement

2. Backward Elimination: - Start with all predictors - Remove least significant variable at each step - Stop when all remaining variables significant

3. Stepwise Selection: - Combination of forward and backward - Can add or remove variables at each step

Model Selection

# Full model with all predictors
full_model <- lm(Weight ~ Gender_fac*Height*Age, data = body_dat)

# Stepwise selection
library(MASS)
step_model <- stepAIC(full_model, direction = "both")
Start:  AIC=2196.29
Weight ~ Gender_fac * Height * Age

                        Df Sum of Sq   RSS    AIC
- Gender_fac:Height:Age  1    14.091 37394 2194.5
<none>                               37380 2196.3

Step:  AIC=2194.48
Weight ~ Gender_fac + Height + Age + Gender_fac:Height + Gender_fac:Age + 
    Height:Age

                        Df Sum of Sq   RSS    AIC
- Gender_fac:Age         1    16.814 37410 2192.7
- Height:Age             1    61.110 37455 2193.3
- Gender_fac:Height      1   134.454 37528 2194.3
<none>                               37394 2194.5
+ Gender_fac:Height:Age  1    14.091 37380 2196.3

Step:  AIC=2192.71
Weight ~ Gender_fac + Height + Age + Gender_fac:Height + Height:Age

                    Df Sum of Sq   RSS    AIC
- Height:Age         1    48.064 37459 2191.4
- Gender_fac:Height  1   126.243 37537 2192.4
<none>                           37410 2192.7
+ Gender_fac:Age     1    16.814 37394 2194.5

Step:  AIC=2191.36
Weight ~ Gender_fac + Height + Age + Gender_fac:Height

                    Df Sum of Sq   RSS    AIC
- Gender_fac:Height  1    121.63 37580 2191.0
<none>                           37459 2191.4
+ Height:Age         1     48.06 37410 2192.7
+ Gender_fac:Age     1      3.77 37455 2193.3
- Age                1   1453.02 38912 2208.7

Step:  AIC=2191
Weight ~ Gender_fac + Height + Age

                    Df Sum of Sq   RSS    AIC
<none>                           37580 2191.0
+ Gender_fac:Height  1     121.6 37459 2191.4
+ Height:Age         1      43.5 37537 2192.4
+ Gender_fac:Age     1       6.3 37574 2192.9
- Age                1    1463.2 39043 2208.4
- Gender_fac         1    3896.2 41476 2239.0
- Height             1   12480.1 50060 2334.4

Model Selection

# Compare models
anova(step_model, full_model)
Analysis of Variance Table

Model 1: Weight ~ Gender_fac + Height + Age
Model 2: Weight ~ Gender_fac * Height * Age
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    503 37580                           
2    499 37380  4     200.6 0.6695 0.6134
# Show selected model
summary(step_model)

Call:
lm(formula = Weight ~ Gender_fac + Height + Age, data = body_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.989  -5.707  -1.238   4.154  40.784 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -64.12138    9.39615  -6.824 2.55e-11 ***
Gender_facMale   7.68906    1.06475   7.221 1.91e-12 ***
Height           0.72520    0.05611  12.925  < 2e-16 ***
Age              0.17925    0.04050   4.425 1.18e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.644 on 503 degrees of freedom
Multiple R-squared:  0.583, Adjusted R-squared:  0.5805 
F-statistic: 234.4 on 3 and 503 DF,  p-value: < 2.2e-16

Cross-Validation

K-fold Cross-Validation Process: 1. Train on n-k sampled observations, 2. Test on remaining k observations, 4. Repeat several times and calculate average performance metrics

# Implement k-fold CV
library(caret)
ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(Weight ~ Gender_fac + Height + Age, 
                 data = body_dat,
                 method = "lm",
                 trControl = ctrl)

print(cv_model)
Linear Regression 

507 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 455, 457, 457, 457, 455, 455, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  8.605987  0.5863791  6.597972

Tuning parameter 'intercept' was held constant at a value of TRUE

Polynomial Regression

Model Form: \[Y = \beta_0 + \beta_1X + \beta_2X^2 + ... + \beta_pX^p + \varepsilon\]

Example with Quadratic Term:

# Fit polynomial model
poly_model <- lm(Weight ~ Height + I(Height^2), data = body_dat)
summary(poly_model)

Call:
lm(formula = Weight ~ Height + I(Height^2), data = body_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.536  -6.281  -1.275   5.107  41.206 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.494326 111.850417  -0.308    0.758
Height        0.194491   1.303371   0.149    0.881
I(Height^2)   0.002395   0.003790   0.632    0.528

Residual standard error: 9.314 on 504 degrees of freedom
Multiple R-squared:  0.5149,    Adjusted R-squared:  0.513 
F-statistic: 267.5 on 2 and 504 DF,  p-value: < 2.2e-16

Robust Regression

Purpose: Less sensitive to outliers than OLS

Methods: 1. M-estimation 2. MM-estimation 3. LTS (Least Trimmed Squares) 4. LAD (Least Absolute Deviation)

# Compare OLS and robust regression
library(MASS)
robust_model <- rlm(Weight ~ Gender_fac + Height + Age, 
                   data = body_dat)

# Compare coefficients
cbind(
  OLS = coef(step_model),
  Robust = coef(robust_model)
)
                       OLS      Robust
(Intercept)    -64.1213821 -66.2329710
Gender_facMale   7.6890571   8.0784915
Height           0.7251970   0.7355451
Age              0.1792494   0.1614021

Reporting Results

Essential Elements:

  • Model specification

  • Parameter estimates with CIs

  • Model fit statistics, significance of effects

  • Diagnostic results

  • Prediction performance

Common Pitfalls and Solutions

1. Multicollinearity: Correlated predictors
Solution: VIF analysis, variable selection, or dimension reduction

2. Outliers: Influential observations
Solution: Robust methods, careful examination of unusual cases

3. Non-linearity: Misspecified relationships
Solution: Transformation, polynomial terms, non-parametric methods

4. Heteroscedasticity: Non-constant variance
Solution: Weighted least squares, transformation

Packages for Regression Analysis:

  • Base R:
    • lm() for basic linear regression
    • summary(), anova() for analysis
  • Specialized Packages:
    • MASS: robust regression
    • car: regression diagnostics
    • effects: visualization
    • lme4: mixed models
  • Visualization:
    • ggplot2: advanced plotting

    • effects: effect displays

    • visreg: regression visualization

Generalized Linear Models

Generalized Linear Models

  • Classical Linear Model Limitations:
    • Assumes normal distribution
    • Assumes constant variance
    • Not suitable for discrete responses
  • GLM Extension:
    • Accounts for these problems …
    • but maintains linearity as far as possible
  • Core Innovation:
    • Separate treatment of random variation, systematic effects (linear predictor), and the link between them

GLM Components

  • Random Component: Response Y follows exponential family \[f_X (y | \theta)=h(y)\mathrm{~exp}\left[\mathrm{~}\theta\cdot y-A(\theta)\mathrm{~}\right]\] where:
  • θ: canonical parameter
  • A(θ): Cumulant function determining mean and variance:
    • \(E[Y] = A'(θ) = μ\)
    • \(Var[Y] = A''(θ)\)
  • A huge variety of distributions falls into this category, among them the Normal, Poisson, Binomial …

GLM Components

  • 2. Systematic Component: Linear predictor \[\theta = \beta\cdot x = \beta_0 + \beta_1x_1 + ... + \beta_px_p\]

  • 3. Link Function: Connects expectation \(E(Y|x)=\mu(x)\) to linear predictor \(\theta\): \[g(\mu) = \theta\]

  • The Complete Chain: \[ A'(\theta) = E[Y|x] = \mu(x) = g^{-1}(\beta\cdot x) \]

For canonical links: η = θ and \(b' = g^{-1}\)

Common GLMs

Model Response Distribution Link Function Natural Parameter θ Mean μ = b’(θ) Variance b’’(θ)
Linear \(Y \sim N(\mu,\sigma^2)\) \(g(\mu)=\mu\) \(\theta=\mu\) \(\mu=\theta\) 1
Logistic \(Y \sim \text{Bin}(n,p)\) \(g(\mu)=\log(\frac{\mu}{1-\mu})\) \(\theta=\log(\frac{p}{1-p})\) \(\mu=\frac{e^\theta}{1+e^\theta}\) \(\frac{e^\theta}{(1+e^\theta)^2}\)
Poisson \(Y \sim \text{Pois}(\lambda)\) \(g(\mu)=\log(\mu)\) \(\theta=\log(\lambda)\) \(\mu=e^\theta\) \(e^\theta\)
Gamma \(Y \sim \text{Gamma}(\nu,\lambda)\) \(g(\mu)=\frac{1}{\mu}\) \(\theta=-\frac{1}{\mu}\) \(\mu=-\frac{1}{\theta}\) \(\frac{1}{\theta^2}\)

John Nelder (1924-2010)

  • The Man:
    • From farming background to Cambridge mathematician
    • Spent career at agricultural research stations
    • Fellow of the Royal Society (1976)
  • The Innovation (1972):
    • Unified theory of Generalized Linear Models (+ Wedderburn)
    • Linear, logistic, and Poisson regression as special cases
    • GLIM software made GLMs practically applicable

Motivation Binary

Often the dependent variable is binary (success/no success): Leads to problems with linear regression:

  1. Probability Boundaries Violated
    • Linear model can predict < 0 or > 1
    • Probabilities must be between 0 and 1 - success probability
  2. Non-Linear Relationships
    • Binary outcomes often have S-shaped relationship
    • Linear model assumes straight line
    • Misses crucial threshold effects
  3. Non-Normal Errors
    • Binary data violates normality assumption
    • Error variance not constant (heteroscedasticity)

Binary Logistic Regression

  • Let \(p=p(x)\) denote the probability of a success. It depends on the independent variables \(x\). In simplified notation this dependency is skipped.

  • Model Form: \(p\) is estimated indirectly \[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + ... + \beta_px_p\]

  • Key Properties of the logit function:

    • Probabilities \(p\) are bounded, \(p\in [0,1]\)
    • S-shaped response curve
  • Regression parameters are interpretable as logarithmized odds ratios!

Interpretion

  • The output (for given \(x\)) is interpretable as log-oddsreta\[odds = e^{logit}\]

  • success probabilities (for given \(x\)) can be calculated as\[p=e^{\frac{e^{logit}}{1+e^{logit}}}\]

  • Regression parameters are interpretable as logarithmized odds ratios.

    • Positive parameter: logit/probability increases when variable increases

    • Negative parameters: probability decreases if the variable increases

    • If an independent variable \(x_i\) changes by \(\Delta\), the new odds are\[odds_{new}=e^{\beta_i \Delta}\cdot odds_{old}\]

    • The case \(\Delta=1\) is especially important for factor variables.

Logistic Regression in R

# Load Pima Indians Diabetes dataset
data(Pima.tr)
# Fit logistic regression with key risk factors
model_logistic <- glm(type ~ glu + bmi + age + bp, 
                     family = binomial(link = "logit"),
                     data = Pima.tr)
summary(model_logistic)

Call:
glm(formula = type ~ glu + bmi + age + bp, family = binomial(link = "logit"), 
    data = Pima.tr)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -9.144169   1.647351  -5.551 2.84e-08 ***
glu          0.031224   0.006560   4.760 1.94e-06 ***
bmi          0.093564   0.032645   2.866  0.00416 ** 
age          0.054793   0.018166   3.016  0.00256 ** 
bp          -0.006076   0.017345  -0.350  0.72612    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 256.41  on 199  degrees of freedom
Residual deviance: 188.27  on 195  degrees of freedom
AIC: 198.27

Number of Fisher Scoring iterations: 5

https://rdrr.io/cran/MASS/man/Pima.tr.html

Calculating odds ratios and CI

  • interpretation of parameters is completely different to case

    • The exponentiated parameters \(\\e^{\beta_i}\) interpretable as odds ratios of a change in \(X_i\) by one unit.

    • If \(X_{i}\) is a categorical variable, \(\\e^{\beta_{ij}}\) interpreted as odds ratio for j-th category relative reference (control)

    • \(\\e^{\beta_0}\) : odds when all variables are at reference value

# Convert to odds ratios with CI
exp(cbind(OR = coef(model_logistic),
          confint(model_logistic))) %>%
  round(3)
               OR 2.5 % 97.5 %
(Intercept) 0.000 0.000  0.002
glu         1.032 1.019  1.046
bmi         1.098 1.032  1.173
age         1.056 1.020  1.096
bp          0.994 0.960  1.028

Visualizing Logistic Effects

plot(allEffects(model_logistic), type="response",ylab="probability")

Prediction

  • Predicting the odds for given values of the independent variables
p <- data.frame(glu=100,bmi=40,age=56, bp=90)
predict(model_logistic,p)
        1 
0.2423494 
  • Predicting success probabilities for given values of the independent variables
p <- data.frame(glu=100,bmi=40,age=56, bp=90)
predict(model_logistic,p, type="response")
        1 
0.5602925 

ROC Curve

  • ROC = Receiver Operating Characteristic
  • Plots Sensitivity vs Specificity
  • Shows classifier performance across all thresholds
  • Key interpretation points:
    • Always passes through (0,0) and (1,1)
    • Better curve: higher towards upper-right corner
    • Area Under Curve: 1.0 = perfect, 0.5 = random, 0.8 typically considered good
  • High threshold:
    • ↑ Specificity
    • ↓ Sensitivity
    • Better at ruling in
    • Misses more cases
  • Low threshold:
    • ↑ Sensitivity
    • ↓ Specificity
    • Better at ruling out
    • More false positives

Choose based on: Cost of false positives and missed cases

ROC Analysis for Logistic Models

# Calculate predicted probabilities
pred_prob <- predict(model_logistic, type = "response")

# Create and plot ROC curve
roc_curve <- roc(Pima.tr$type, pred_prob)
plot(roc_curve, main="ROC Curve for Diabetes Prediction",xlim=c(0,1))
text(0.6, 0.2, paste("AUC =", round(auc(roc_curve), 3)))

Motivation: Count Data

Often the dependent variable represents counts (e.g. number of events).

  1. Count Boundaries Violated
    • Linear model can predict negative values
    • Counts must be non-negative integers (0, 1, 2, …)
  2. Non-Linear Relationships
    • Count processes often show exponential growth
    • Linear model misses multiplicative effects
  3. Non-Normal, Non-Constant Variance
    • Count data follows Poisson distribution
    • Variance increases with the mean

Poisson Regression for Count Data

  • Model Form: \[\log(\lambda) = \beta_0 + \beta_1x_1 + ... + \beta_px_p\]

  • Properties:

    • Count responses: 0,1,2,… events over a time period

    • Response is Poisson distributed with parameter \(\lambda\).

    • \(\lambda\) is interpreted as the expected number of events over the time period

    • Mean = Variance (equidispersion)

    • Log link ensures positive means - Interpretable as log-rates

Epilepsy Clinical Trial Data

  • Study of anti-epileptic drug progabide (n=59 patients)

  • Variables:

    • y: Number of seizures per 2-week period

    • trt: Treatment (0=placebo, 1=progabide)

    • base: Baseline seizures (8 weeks before trial)

    • age: Patient’s age in years

    • period: Measurement period (1-4)

    • id: Patient identifier (1-59)

  • 4 repeated measurements per patient (236 total observations)

# Load epilepsy dataset
data(epil)
epil$period <- factor(epil$period)
epil$trt <- factor(epil$trt, labels=c("Placebo", "Treatment"))

Poisson Regression in R

# Fit Poisson model for seizure counts
model_seizures <- glm(y ~  period+ period:trt + I(age - 18) + I(base -6), 
                      family = poisson(link = "log"),
                      data = epil)
summary(model_seizures)

Call:
glm(formula = y ~ period + period:trt + I(age - 18) + I(base - 
    6), family = poisson(link = "log"), data = epil)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           1.1977758  0.0862054  13.894  < 2e-16 ***
period2              -0.1216071  0.0901506  -1.349   0.1774    
period3              -0.0711763  0.0889669  -0.800   0.4237    
period4              -0.1611727  0.0911104  -1.769   0.0769 .  
I(age - 18)           0.0223476  0.0040274   5.549 2.88e-08 ***
I(base - 6)           0.0226352  0.0005094  44.436  < 2e-16 ***
period1:trtTreatment -0.1634243  0.0883773  -1.849   0.0644 .  
period2:trtTreatment -0.0607931  0.0915211  -0.664   0.5065    
period3:trtTreatment -0.1463153  0.0911093  -1.606   0.1083    
period4:trtTreatment -0.2482098  0.0976030  -2.543   0.0110 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2517.83  on 235  degrees of freedom
Residual deviance:  944.53  on 226  degrees of freedom
AIC: 1730.4

Number of Fisher Scoring iterations: 5

Poisson Regression in R

library(effects)
plot(allEffects(model_seizures))

Poisson Regression in R

exp(coef(model_seizures))
         (Intercept)              period2              period3 
           3.3127404            0.8854962            0.9312977 
             period4          I(age - 18)          I(base - 6) 
           0.8511450            1.0225992            1.0228934 
period1:trtTreatment period2:trtTreatment period3:trtTreatment 
           0.8492308            0.9410179            0.8638853 
period4:trtTreatment 
           0.7801962 
library(AER)
dispersiontest(model_seizures)

    Overdispersion test

data:  model_seizures
z = 2.6993, p-value = 0.003474
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  4.920118 

Negative Binomial Model

When Poisson variance assumption is violated: \(Var[Y] = \mu + \alpha\mu^2\)

model_nb <- glm.nb(y ~  period+period:trt + I(age-18) +I(base-6), data = epil)
summary(model_nb)

Call:
glm.nb(formula = y ~ period + period:trt + I(age - 18) + I(base - 
    6), data = epil, init.theta = 2.45404119, link = log)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           1.1102375  0.1814082   6.120 9.35e-10 ***
period2              -0.1046903  0.2018153  -0.519   0.6039    
period3              -0.0952657  0.2016820  -0.472   0.6367    
period4              -0.1421672  0.2023559  -0.703   0.4823    
I(age - 18)           0.0180364  0.0081967   2.200   0.0278 *  
I(base - 6)           0.0269269  0.0017455  15.426  < 2e-16 ***
period1:trtTreatment -0.3245884  0.2003595  -1.620   0.1052    
period2:trtTreatment  0.0002938  0.1985914   0.001   0.9988    
period3:trtTreatment -0.1506198  0.2004423  -0.751   0.4524    
period4:trtTreatment -0.2968729  0.2042469  -1.453   0.1461    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.454) family taken to be 1)

    Null deviance: 559.97  on 235  degrees of freedom
Residual deviance: 264.87  on 226  degrees of freedom
AIC: 1332.5

Number of Fisher Scoring iterations: 1

              Theta:  2.454 
          Std. Err.:  0.331 

 2 x log-likelihood:  -1310.468 
# Compare models
AIC(model_seizures, model_nb)
               df      AIC
model_seizures 10 1730.433
model_nb       11 1332.468

Rate Models and Offsets

  • Structure:
    • Response: Count per unit exposure

    • Link: \(\log(\lambda) = X\beta + \log(t)\)

    • No parameter is estimated for the last term, “offset”

    • If some rates are zero: zero inflated model

Gamma Regression

  • Response (independent) variable is continuous and positive
  • Response variance increases linearly with the mean
  • Response follows a Gamma-distribution
  • Model Form: \[log(\mu) = \beta_0 + \beta_1x_1 + ... + \beta_px_p\] where g is often log or inverse link
  • Used for
    • right skewed data
    • e.g. medical costs/expenses, duration
  • use glm(formula, family=Gamma(link=“log”), data=…)

Survival Analysis

Used packages

R Packages:

library(survival)    # Core survival analysis
library(ggsurvfit)  # Enhanced survival plots
library(dplyr)      # Data manipulation
library(gtsummary)  # Summary tables

Example Datasets:

  • lung: Advanced lung cancer data

  • melanoma: Melanoma survival study

  • BMT: Bone marrow transplant data

Introduction - Key Concepts

  • Time-to-Event Data:

    • Clear starting point (e.g., diagnosis, surgery)

    • Well-defined endpoint (e.g., death, recurrence)

    • Time to event (waiting time)

  • Typical Applications:

    • Clinical trials: Time to death/progression

    • Reliability: Time to component failure

Unique Feature: Censored observations - not all observations observed until event

Censoring

  • Incomplete information about survival time

  • Types of Censoring:

    1. Right Censoring: Event occurs after observation period
      e.g., Patient lost to follow-up

    2. Left Censoring: Event occurs before first observation
      e.g., Disease onset before study

    3. Interval Censoring Exact time to event unknown. Only a time interval known.
      e.g., Disease detected at check-up

Censoring Data

  • Event status
    • Binary indicator
    • Event occurred (1) or censored (0)
lung <- lung %>% 
  mutate(
    status = as.numeric(status == 2),  # This will create 0 for censored, 1 for dead
    sex = factor(sex, labels = c("female", "male"))
  )
head(lung[, c("time", "status", "sex")])
  time status    sex
1  306      1 female
2  455      1 female
3 1010      0 female
4  210      1 female
5  883      1 female
6 1022      0 female

lung dataset: description

Censoring - Notation

  • For each subject \(i\):

    • True survival time: \(T_i\)

    • Censoring time: \(C_i\)

    • Observed time: \(Y_i = \min(T_i, C_i)\)

    • Event indicator: \(\delta_i = I(T_i \leq C_i)\)

Key Assumption: Independent censoring: \(C_i\) independent of \(T_i\)

Impact on Analysis: Cannot use standard regression methods

Survival Function

  • Definition: \[S(t) = P(T > t) = 1 - F(t)\]

  • Interpretation:

    • Probability of surviving beyond time t

    • Basis for survival curves

  • Properties:

    • Monotone decreasing

    • \(S(0) = 1\): All participants are alive at start time

    • \(\lim_{t \to \infty} S(t) = 0\): In the long run everybody is dead …

    • Right-continuous

Hazard Function

  • Definition: \[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t}\]

  • Interpretation:

    • Instantaneous risk of event

    • Rate of failure at time t

    • Conditional on survival to time t

  • Probability for event exactly at time \(t\): \(P(T=t)=S(t)\cdot h(t)\)

  • Characteristics:

    • Non-negative

    • No upper bound

Relationship Between Functions

  • Key Relationships: \[h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t)\]

  • Cumulative Hazard: \[H(t) = \int_0^t h(u)du = -\log S(t)\]

Kaplan-Meier Method

  • Product-Limit Estimator (nonparametric): \[\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\]

  • Components:

    • \(n_i\) = number at risk at time \(t_i\)

    • \(d_i\) = number of events at time \(t_i\)

    • Product over all event times ≤ t

Kaplan-Meier - Statistical Properties

  • Variance of estimate (Greenwood): \[\text{Var}[\hat{S}(t)] = [\hat{S}(t)]^2 \sum_{t_i \leq t} \frac{d_i}{n_i(n_i - d_i)}\]

Key Features: - Consistent estimator of S(t) - Asymptotically unbiased - Confidence intervals widen over time - More precise with larger samples

Assumptions: - Independent censoring - No secular trends - Similar prospects for censored/uncensored subjects

Kaplan-Meier estimate in R

# Create survival object and fit
km_fit <- survfit2(Surv(time, status) ~ 1, data = lung)

# Basic summary
summary(km_fit)
Call: survfit(formula = Surv(time, status) ~ 1, data = lung)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    5    228       1   0.9956 0.00438       0.9871        1.000
   11    227       3   0.9825 0.00869       0.9656        1.000
   12    224       1   0.9781 0.00970       0.9592        0.997
   13    223       2   0.9693 0.01142       0.9472        0.992
   15    221       1   0.9649 0.01219       0.9413        0.989
   26    220       1   0.9605 0.01290       0.9356        0.986
   30    219       1   0.9561 0.01356       0.9299        0.983
   31    218       1   0.9518 0.01419       0.9243        0.980
   53    217       2   0.9430 0.01536       0.9134        0.974
   54    215       1   0.9386 0.01590       0.9079        0.970
   59    214       1   0.9342 0.01642       0.9026        0.967
   60    213       2   0.9254 0.01740       0.8920        0.960
   61    211       1   0.9211 0.01786       0.8867        0.957
   62    210       1   0.9167 0.01830       0.8815        0.953
   65    209       2   0.9079 0.01915       0.8711        0.946
   71    207       1   0.9035 0.01955       0.8660        0.943
   79    206       1   0.8991 0.01995       0.8609        0.939
   81    205       2   0.8904 0.02069       0.8507        0.932
   88    203       2   0.8816 0.02140       0.8406        0.925
   92    201       1   0.8772 0.02174       0.8356        0.921
   93    199       1   0.8728 0.02207       0.8306        0.917
   95    198       2   0.8640 0.02271       0.8206        0.910
  105    196       1   0.8596 0.02302       0.8156        0.906
  107    194       2   0.8507 0.02362       0.8056        0.898
  110    192       1   0.8463 0.02391       0.8007        0.894
  116    191       1   0.8418 0.02419       0.7957        0.891
  118    190       1   0.8374 0.02446       0.7908        0.887
  122    189       1   0.8330 0.02473       0.7859        0.883
  131    188       1   0.8285 0.02500       0.7810        0.879
  132    187       2   0.8197 0.02550       0.7712        0.871
  135    185       1   0.8153 0.02575       0.7663        0.867
  142    184       1   0.8108 0.02598       0.7615        0.863
  144    183       1   0.8064 0.02622       0.7566        0.859
  145    182       2   0.7975 0.02667       0.7469        0.852
  147    180       1   0.7931 0.02688       0.7421        0.848
  153    179       1   0.7887 0.02710       0.7373        0.844
  156    178       2   0.7798 0.02751       0.7277        0.836
  163    176       3   0.7665 0.02809       0.7134        0.824
  166    173       2   0.7577 0.02845       0.7039        0.816
  167    171       1   0.7532 0.02863       0.6991        0.811
  170    170       1   0.7488 0.02880       0.6944        0.807
  175    167       1   0.7443 0.02898       0.6896        0.803
  176    165       1   0.7398 0.02915       0.6848        0.799
  177    164       1   0.7353 0.02932       0.6800        0.795
  179    162       2   0.7262 0.02965       0.6704        0.787
  180    160       1   0.7217 0.02981       0.6655        0.783
  181    159       2   0.7126 0.03012       0.6559        0.774
  182    157       1   0.7081 0.03027       0.6511        0.770
  183    156       1   0.7035 0.03041       0.6464        0.766
  186    154       1   0.6989 0.03056       0.6416        0.761
  189    152       1   0.6943 0.03070       0.6367        0.757
  194    149       1   0.6897 0.03085       0.6318        0.753
  197    147       1   0.6850 0.03099       0.6269        0.749
  199    145       1   0.6803 0.03113       0.6219        0.744
  201    144       2   0.6708 0.03141       0.6120        0.735
  202    142       1   0.6661 0.03154       0.6071        0.731
  207    139       1   0.6613 0.03168       0.6020        0.726
  208    138       1   0.6565 0.03181       0.5970        0.722
  210    137       1   0.6517 0.03194       0.5920        0.717
  212    135       1   0.6469 0.03206       0.5870        0.713
  218    134       1   0.6421 0.03218       0.5820        0.708
  222    132       1   0.6372 0.03231       0.5769        0.704
  223    130       1   0.6323 0.03243       0.5718        0.699
  226    126       1   0.6273 0.03256       0.5666        0.694
  229    125       1   0.6223 0.03268       0.5614        0.690
  230    124       1   0.6172 0.03280       0.5562        0.685
  239    121       2   0.6070 0.03304       0.5456        0.675
  245    117       1   0.6019 0.03316       0.5402        0.670
  246    116       1   0.5967 0.03328       0.5349        0.666
  267    112       1   0.5913 0.03341       0.5294        0.661
  268    111       1   0.5860 0.03353       0.5239        0.656
  269    110       1   0.5807 0.03364       0.5184        0.651
  270    108       1   0.5753 0.03376       0.5128        0.645
  283    104       1   0.5698 0.03388       0.5071        0.640
  284    103       1   0.5642 0.03400       0.5014        0.635
  285    101       2   0.5531 0.03424       0.4899        0.624
  286     99       1   0.5475 0.03434       0.4841        0.619
  288     98       1   0.5419 0.03444       0.4784        0.614
  291     97       1   0.5363 0.03454       0.4727        0.608
  293     94       1   0.5306 0.03464       0.4669        0.603
  301     91       1   0.5248 0.03475       0.4609        0.597
  303     89       1   0.5189 0.03485       0.4549        0.592
  305     87       1   0.5129 0.03496       0.4488        0.586
  306     86       1   0.5070 0.03506       0.4427        0.581
  310     85       2   0.4950 0.03523       0.4306        0.569
  320     82       1   0.4890 0.03532       0.4244        0.563
  329     81       1   0.4830 0.03539       0.4183        0.558
  337     79       1   0.4768 0.03547       0.4121        0.552
  340     78       1   0.4707 0.03554       0.4060        0.546
  345     77       1   0.4646 0.03560       0.3998        0.540
  348     76       1   0.4585 0.03565       0.3937        0.534
  350     75       1   0.4524 0.03569       0.3876        0.528
  351     74       1   0.4463 0.03573       0.3815        0.522
  353     73       2   0.4340 0.03578       0.3693        0.510
  361     70       1   0.4278 0.03581       0.3631        0.504
  363     69       2   0.4154 0.03583       0.3508        0.492
  364     67       1   0.4092 0.03582       0.3447        0.486
  371     65       2   0.3966 0.03581       0.3323        0.473
  387     60       1   0.3900 0.03582       0.3258        0.467
  390     59       1   0.3834 0.03582       0.3193        0.460
  394     58       1   0.3768 0.03580       0.3128        0.454
  426     55       1   0.3700 0.03580       0.3060        0.447
  428     54       1   0.3631 0.03579       0.2993        0.440
  429     53       1   0.3563 0.03576       0.2926        0.434
  433     52       1   0.3494 0.03573       0.2860        0.427
  442     51       1   0.3426 0.03568       0.2793        0.420
  444     50       1   0.3357 0.03561       0.2727        0.413
  450     48       1   0.3287 0.03555       0.2659        0.406
  455     47       1   0.3217 0.03548       0.2592        0.399
  457     46       1   0.3147 0.03539       0.2525        0.392
  460     44       1   0.3076 0.03530       0.2456        0.385
  473     43       1   0.3004 0.03520       0.2388        0.378
  477     42       1   0.2933 0.03508       0.2320        0.371
  519     39       1   0.2857 0.03498       0.2248        0.363
  520     38       1   0.2782 0.03485       0.2177        0.356
  524     37       2   0.2632 0.03455       0.2035        0.340
  533     34       1   0.2554 0.03439       0.1962        0.333
  550     32       1   0.2475 0.03423       0.1887        0.325
  558     30       1   0.2392 0.03407       0.1810        0.316
  567     28       1   0.2307 0.03391       0.1729        0.308
  574     27       1   0.2221 0.03371       0.1650        0.299
  583     26       1   0.2136 0.03348       0.1571        0.290
  613     24       1   0.2047 0.03325       0.1489        0.281
  624     23       1   0.1958 0.03297       0.1407        0.272
  641     22       1   0.1869 0.03265       0.1327        0.263
  643     21       1   0.1780 0.03229       0.1247        0.254
  654     20       1   0.1691 0.03188       0.1169        0.245
  655     19       1   0.1602 0.03142       0.1091        0.235
  687     18       1   0.1513 0.03090       0.1014        0.226
  689     17       1   0.1424 0.03034       0.0938        0.216
  705     16       1   0.1335 0.02972       0.0863        0.207
  707     15       1   0.1246 0.02904       0.0789        0.197
  728     14       1   0.1157 0.02830       0.0716        0.187
  731     13       1   0.1068 0.02749       0.0645        0.177
  735     12       1   0.0979 0.02660       0.0575        0.167
  765     10       1   0.0881 0.02568       0.0498        0.156
  791      9       1   0.0783 0.02462       0.0423        0.145
  814      7       1   0.0671 0.02351       0.0338        0.133
  883      4       1   0.0503 0.02285       0.0207        0.123

Kaplan-Meier - Visualization

survfit2(Surv(time, status) ~ 1, data = lung) %>% 
  ggsurvfit() +
  labs(x = "Days", y = "Overall survival probability") + 
  add_confidence_interval() +
  add_risktable()

Kaplan-Meier - Group Comparisons

  • Log-rank Test: Non-parametric test for group differences. Null hypothesis: no difference between curves
# Compare survival by sex
survdiff(Surv(time, status) ~ sex, data = lung)
Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)

             N Observed Expected (O-E)^2/E (O-E)^2/V
sex=female 138      112     91.6      4.55      10.3
sex=male    90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.001 

Kaplan-Meier - Group Comparisons

# Visualize group comparison
survfit2(Surv(time, status) ~ sex, data = lung) %>%
  ggsurvfit() +
  add_pvalue()

Cox Proportional Hazards - Model

  • Hazard Model: \[h(t|X_i) = h_0(t)\exp(\beta^T X_i)\]

  • Components:

    • \(h_0(t)\) = baseline hazard (nonparametric)

    • \(X_i\) = covariate vector

    • \(\beta\) = regression coefficients (parametric)

  • Key Features:

    • Semi-parametric approach

    • Assumes proportional hazards

    • Handles multiple predictors

Proportional Hazards - Concept

Definition: Hazard ratio constant over time

Mathematical Expression: \[\frac{h_1(t)}{h_2(t)} = \exp(\beta) \text{ for all } t\]

Key Implications: - Parallel log-survival curves - Non-crossing hazard functions - Time-independent effect sizes

Common Violations: - Early vs. late treatment effects - Delayed treatment onset - Changing risk factors over time

Sir David Cox (1924-2022)

Sir David Cox : - British statistician who revolutionized survival analysis - Developed the Cox proportional hazards model (1972) - “Regression Models and Life Tables” - one of statistics’ most cited papers

The Innovation: - Combined flexibility of non-parametric methods with regression analysis - Introduced partial likelihood estimation - Enabled analysis of censored data with multiple predictors

Impact: - Standard method in medical research - Enables evidence-based clinical decisions - Applications across many scientific fields

Cox Model in R

# Fit model
cox_fit <- coxph(Surv(time, status) ~ sex, data = lung)
summary(cox_fit)
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)

  n= 228, number of events= 165 

           coef exp(coef) se(coef)      z Pr(>|z|)   
sexmale -0.5310    0.5880   0.1672 -3.176  0.00149 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
sexmale     0.588      1.701    0.4237     0.816

Concordance= 0.579  (se = 0.021 )
Likelihood ratio test= 10.63  on 1 df,   p=0.001
Wald test            = 10.09  on 1 df,   p=0.001
Score (logrank) test = 10.33  on 1 df,   p=0.001
exp(coefficients(cox_fit))
  sexmale 
0.5880028 
  • Interpretation: Coefficients are log hazard ratios, - exp(β) is hazard ratio

Cox Model in R

  • The same base hazard rate for all categories:
cox_fit1 <- coxph(Surv(time, status) ~ sex + age, data = lung)
exp(coefficients(cox_fit1))
 sexmale      age 
0.598566 1.017191 
  • Different hazard rates for each category:
cox_fit2 <- coxph(Surv(time, status) ~ strata(sex) + age, data = lung)
exp(coefficients(cox_fit2))
     age 
1.016347 

Cox Model in R

plot(survfit(cox_fit1,data=lung), col = c("steelblue","red"),
     main = "Base hazard independent of gender")

Cox Model in R

plot(survfit(cox_fit2,data=lung), col = c("steelblue","red"),
     main = "Base hazard for gender strata")

Comparing Kaplan-Meier and Cox Models

  • Kaplan-Meier Approach:

    • Non-parametric estimation

    • Limited to categorical predictors

  • Cox Model Advantages:

    • Semi-parametric estimation

    • Handles multiple predictors (adjusts for confounders)

    • Continuous variables allowed

Competing Risks - Framework

  • Multiple possible event types that are mutually exclusive

    • Standard KM is biased
  • Cumulative Incidence Function: \[F_k(t) = P(T \leq t, D = k) = \int_0^t S(u)h_k(u)du\]

Conditional Survival

  • Definition: \[CS(y|x) = P(T > x + y | T > x) = \frac{S(x + y)}{S(x)}\]

  • Probability to survive time \(x+y\), if already survived until time \(x\)

  • Increases with time \(x\) already survived