source("Pre-Processing.R")
source("Cox Proportional Hazard Modeling.R")
Since the survival package is relatively straightforward and predicting bank failure is well suited to Cox-PH modeling, I first want to play around with fitting all the data we have for a particular pre-crisis date. Comparing univariate and multivariate modeling, as we’ve seen, might show (1) that certain CAMEL variables are more statistically significant predictors of failure than others, and (2) that there are interactions at play between these variables.
Some conventions I’m asserting about the data at first are that:
The time variable will measure the days between the measure date and the event date. For example, the San Joaquin Bank (rssd_id = 23266), which failed on 6/26/2009, would have a (rounded integer) time of 727 if we collected data from 2007-06-30.
build_univariate_models()
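As a quick check of the time convention, here is a minimal sketch of the day count for the San Joaquin Bank example. The variable names are only illustrative and are not taken from Pre-Processing.R.

```r
# Dates taken from the San Joaquin Bank example in the text.
measure_date <- as.Date("2007-06-30")  # date the CAMEL ratios are measured
event_date   <- as.Date("2009-06-26")  # failure date

# difftime() with units = "days" gives the elapsed days between the two dates.
time_days <- as.integer(difftime(event_date, measure_date, units = "days"))
print(time_days)  # 727
```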
Chart 1: Measuring predictors on Q2 2007
Some custom code I’ve adapted produces the table above for the five CAMEL variables in question. Each row represents a CoxPH model fitted with only the CAMEL variable in that row, along with the beta and the hazard ratio, the Wald test statistic, and the p-value for that beta’s significance. This is a sanity check, since the results are very intuitive.
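For readers without the sourced scripts, the following is a rough sketch of how a table like this could be assembled. It is not the actual build_univariate_models(); the helper univariate_table() and the simulated predictors x1 and x2 are hypothetical stand-ins for the bank data.

```r
library(survival)

# Simulated stand-in data: two made-up predictors, not the bank data set.
set.seed(1)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
event_time  <- rexp(n, rate = 0.01 * exp(0.8 * toy$x1))  # only x1 drives the hazard
censor_time <- 200
toy$time   <- pmin(event_time, censor_time)
toy$failed <- as.integer(event_time <= censor_time)

# Hypothetical helper: fit one Cox model per predictor, return one row per model.
univariate_table <- function(data, predictors) {
  rows <- lapply(predictors, function(v) {
    fit <- coxph(as.formula(paste("Surv(time, failed) ~", v)), data = data)
    s <- summary(fit)
    data.frame(
      variable     = v,
      beta         = s$coefficients[1, "coef"],
      hazard_ratio = s$coefficients[1, "exp(coef)"],
      wald_test    = unname(s$waldtest["test"]),
      p_value      = s$coefficients[1, "Pr(>|z|)"]
    )
  })
  do.call(rbind, rows)
}

result <- univariate_table(toy, c("x1", "x2"))
print(result)
```

Each row comes from its own single-covariate fit, so the betas here say nothing about how the predictors behave jointly.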
build_univariate_models()
Chart 2: Measuring predictors on Q4 2006
Interesting finding: these betas, and their significance levels, are highly dependent on the selected timeframe. In Chart 1, 4/5 variables are significant by themselves at the 0.01 level. When we consider measurements taken on 12/31/2006, only 2/5 are.
Finally, out of curiosity, I wonder if this type of model is sensitive to the scaling of the time variable. Instead of days, let’s bucket the failures (or lack of failures) by the number of weeks since the zero date. With the Q2 2007 zero-date again:
build_univariate_models()
Chart 3: Measuring predictors on Q2 2007, weeks (instead of days)
Literally identical results! Forget I said anything.
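This is what one would expect: the Cox partial likelihood depends only on the order in which failures occur, not on the units of the time axis. A minimal sketch on simulated data (the data frame toy and its columns are hypothetical) shows the same thing. It assumes the weeks are an exact rescaling of days; rounding to whole weeks would introduce ties and could shift the estimates slightly.

```r
library(survival)

set.seed(42)
n <- 300
x <- rnorm(n)
event_time  <- rexp(n, rate = 0.002 * exp(0.5 * x))  # simulated days to failure
censor_time <- 1000                                   # end of observation window
toy <- data.frame(
  x      = x,
  days   = pmin(event_time, censor_time),
  failed = as.integer(event_time <= censor_time)
)
toy$weeks <- toy$days / 7  # exact rescaling, no rounding

fit_days  <- coxph(Surv(days, failed) ~ x, data = toy)
fit_weeks <- coxph(Surv(weeks, failed) ~ x, data = toy)

# The ordering of event times is unchanged, so the coefficients should match.
print(coef(fit_days))
print(coef(fit_weeks))
print(all.equal(coef(fit_days), coef(fit_weeks)))
```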
Here, we’ll try the same zero-date of Q2 2007 and run the Cox-PH model on all 5 variables to see how these variables might interact. The summary() function provides most of what we’d want to know:
all_variable_model = coxph(Surv(time, failed) ~ ., data = dataset)
summary(all_variable_model)
Call:
coxph(formula = Surv(time, failed) ~ ., data = dataset)
n= 7731, number of events= 462
                                    coef exp(coef)  se(coef)      z Pr(>|z|)
rb_tier_1_capital_ratio       -5.391e+00 4.558e-03 7.248e-01 -7.438 1.02e-13 ***
non_performing_loan_ratio      2.998e+00 2.004e+01 9.023e-01  3.322 0.000893 ***
cost_to_income_ratio          -4.731e-05 1.000e+00 8.924e-04 -0.053 0.957716
return_on_assets              -3.846e+01 1.982e-17 4.151e+00 -9.265  < 2e-16 ***
liquid_assets_on_total_assets -1.908e+00 1.484e-01 3.845e-01 -4.962 6.97e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
                              exp(coef) exp(-coef) lower .95 upper .95
rb_tier_1_capital_ratio       4.558e-03  2.194e+02 1.101e-03 1.887e-02
non_performing_loan_ratio     2.004e+01  4.990e-02 3.419e+00 1.175e+02
cost_to_income_ratio          1.000e+00  1.000e+00 9.982e-01 1.002e+00
return_on_assets              1.982e-17  5.045e+16 5.806e-21 6.768e-14
liquid_assets_on_total_assets 1.484e-01  6.738e+00 6.986e-02 3.153e-01
Concordance= 0.681 (se = 0.013 )
Rsquare= 0.018 (max possible= 0.656 )
Likelihood ratio test= 138.2 on 5 df, p=<2e-16
Wald test = 187.1 on 5 df, p=<2e-16
Score (logrank) test = 99.52 on 5 df, p=<2e-16
Chart 4: Measuring predictors all at once, from Q2 2007 to 2011
Interesting to note how the coefficients remain constant for some variables (Tier 1 capital ratio) and shift for others (ROA).
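The exp(coef) values in the summary look extreme (1.982e-17 for ROA) because they describe a change of one full unit in the ratio. A small sketch, assuming the ratios are stored as fractions so that 0.01 is one percentage point (an assumption, not something the output confirms), rescales the multivariate betas to a more readable step:

```r
# Coefficients copied from the summary() output of the all-variable model.
beta <- c(
  rb_tier_1_capital_ratio       = -5.391,
  non_performing_loan_ratio     =  2.998,
  return_on_assets              = -38.46,
  liquid_assets_on_total_assets = -1.908
)

# Assumption: ratios are fractions, so a step of 0.01 is one percentage point.
hr_per_point <- exp(beta * 0.01)
print(round(hr_per_point, 3))
```

Under that assumption, one extra percentage point of ROA multiplies the hazard by roughly 0.68, and one extra point of Tier 1 capital by roughly 0.95.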
It appears that MSE and traditional regression-related metrics are trickier to quantify for the Cox-PH models, but there appears to be some value in comparing the likelihood-ratio tests produced by the fits of the different models as a measure of goodness of fit.
Below are the likelihood-ratio (LR) test results for models fitted to an n = 100 training set pulled at random from the full set available at that time. I’m a little unsure about how to interpret them. The full model has the highest LR statistic (4.47), but because it nests each single-variable model, its statistic can never be lower than theirs, so that alone doesn’t show a better fit; the p-values, which account for the degrees of freedom, are the fairer comparison. With only 10 events in the sample, none of the LR tests is significant at the 0.05 level.
just_rb_tier_1_capital_ratio = coxph(Surv(time, failed) ~ rb_tier_1_capital_ratio, data = training_set)
just_non_performing_loan_ratio = coxph(Surv(time, failed) ~ non_performing_loan_ratio, data = training_set)
just_cost_to_income_ratio = coxph(Surv(time, failed) ~ cost_to_income_ratio, data = training_set)
just_return_on_assets = coxph(Surv(time, failed) ~ return_on_assets, data = training_set)
just_liquid_assets_on_total_assets = coxph(Surv(time, failed) ~ liquid_assets_on_total_assets, data = training_set)
full_model = coxph(Surv(time, failed) ~ ., data = training_set)
print(just_rb_tier_1_capital_ratio)
Call:
coxph(formula = Surv(time, failed) ~ rb_tier_1_capital_ratio,
data = training_set)
                         coef exp(coef) se(coef)    z    p
rb_tier_1_capital_ratio 1.148     3.153    0.585 1.96 0.05
Likelihood ratio test=2.28 on 1 df, p=0.1
n= 100, number of events= 10
print(just_non_performing_loan_ratio)
Call:
coxph(formula = Surv(time, failed) ~ non_performing_loan_ratio,
data = training_set)
                          coef exp(coef) se(coef)    z    p
non_performing_loan_ratio 8.82   6787.10    12.39 0.71 0.48
Likelihood ratio test=0.4 on 1 df, p=0.5
n= 100, number of events= 10
print(just_cost_to_income_ratio)
Call:
coxph(formula = Surv(time, failed) ~ cost_to_income_ratio, data = training_set)
                        coef exp(coef) se(coef)     z    p
cost_to_income_ratio -0.0315    0.9690   0.0266 -1.18 0.24
Likelihood ratio test=1.09 on 1 df, p=0.3
n= 100, number of events= 10
print(just_return_on_assets)
Call:
coxph(formula = Surv(time, failed) ~ return_on_assets, data = training_set)
                      coef exp(coef) se(coef)     z    p
return_on_assets -4.46e+01  4.40e-20 3.39e+01 -1.31 0.19
Likelihood ratio test=1.4 on 1 df, p=0.2
n= 100, number of events= 10
print(just_liquid_assets_on_total_assets)
Call:
coxph(formula = Surv(time, failed) ~ liquid_assets_on_total_assets,
data = training_set)
                                 coef exp(coef) se(coef)     z     p
liquid_assets_on_total_assets -2.8173    0.0598   1.3932 -2.02 0.043
Likelihood ratio test=2.8 on 1 df, p=0.09
n= 100, number of events= 10
print(full_model)
Call:
coxph(formula = Surv(time, failed) ~ ., data = training_set)
                                   coef exp(coef) se(coef)     z    p
rb_tier_1_capital_ratio        3.06e-02  1.03e+00 1.45e+00  0.02 0.98
non_performing_loan_ratio      1.28e+01  3.51e+05 1.54e+01  0.83 0.41
cost_to_income_ratio          -3.26e-02  9.68e-01 3.32e-02 -0.98 0.33
return_on_assets              -3.40e+00  3.33e-02 6.28e+01 -0.05 0.96
liquid_assets_on_total_assets -2.68e+00  6.87e-02 3.15e+00 -0.85 0.40
Likelihood ratio test=4.47 on 5 df, p=0.5
n= 100, number of events= 10
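To make the likelihood-ratio comparison more concrete, here is a hedged sketch on simulated data (toy, x1 and x2 are hypothetical, not the bank variables). It recomputes the printed LR statistic from the stored log-likelihoods, then uses anova() to test directly whether the larger model improves on the smaller one, and pulls out the concordance as an alternative fit summary.

```r
library(survival)

set.seed(7)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
event_time  <- rexp(n, rate = 0.01 * exp(0.7 * toy$x1))  # only x1 matters
censor_time <- 60
toy$time   <- pmin(event_time, censor_time)
toy$failed <- as.integer(event_time <= censor_time)

just_x1 <- coxph(Surv(time, failed) ~ x1, data = toy)
full    <- coxph(Surv(time, failed) ~ x1 + x2, data = toy)

# fit$loglik holds c(null model, fitted model); the LR test is twice the gap.
lr_stat <- function(fit) 2 * (fit$loglik[2] - fit$loglik[1])
print(c(just_x1 = lr_stat(just_x1), full = lr_stat(full)))

# Nested comparison: does adding x2 improve on the x1-only model?
print(anova(just_x1, full))

# Concordance (C-index) as reported by summary().
print(summary(full)$concordance)
```

The anova() p-value is the relevant one for asking whether the extra variables earn their degrees of freedom; comparing raw LR statistics across models with different numbers of covariates does not answer that.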