IUR, C-Index and Variable selection for SMR/SHR using R

Yuan Yang
04/16/2019

Main contents

IUR for SMR and SHR
C statistics (C index) for SMR and SHR
Variable selection for SHR

IUR by bootstrap for SMR-type measure

The SMR-type measure is calculated by \[ SMR_j = \frac{\sum_i^{n_j} O_{ji}}{\sum_i^{n_j} E_{ji}}, \] where

\( j \): \( 1,\cdots,J \), facility index
\( n_j \): the number of patients in facility \( j \)
\( i \): patient index
\( O_{ji} \): observed value for patient \( i \) in facility \( j \)
\( E_{ji} \): expected value for patient \( i \) in facility \( j \)
\( SMR_j \): SMR for facility \( j \).

Some notation

\( b \): bootstrap index
\( B \): total number of bootstraps
\( O_{ji}^{(b)} \): observed value for patient \( i \) in facility \( j \) for \( b \)-th bootstrap data
\( E_{ji}^{(b)} \): expected value for patient \( i \) in facility \( j \) for \( b \)-th bootstrap data
\( SMR_j^{(b)} \): SMR for facility \( j \) for \( b \)-th bootstrap data, \[ SMR_j^{(b)} = \frac{\sum_i^{n_j} O_{ji}^{(b)}}{\sum_i^{n_j} E_{ji}^{(b)}}. \]

Algorithm

For \( b = 1,\cdots, B \), sample data from original data using facility-stratified sampling.
Calculate \( SMR^{(b)} \) for each bootstrap data.
Calculate within-facility variance: \[ \sigma^2_w = \frac{\sum_j \sum_b (SMR_j^{(b)}-\overline{SMR}_j)^2}{\sum_j(n_j-1)}, \] where \( \overline{SMR}_j= \sum_b SMR_j^{(b)}/n_j \).
\[ n' = \frac{\sum_j n_j - \sum_j n_j^2/\sum_j n_j}{\sum_j n_j-1}. \]

Algorithm (cont'd)

Calculate total variance: \[ \sigma^2_t = \frac{\sum_j n_j(SMR_j-\overline{SMR})^2}{n'(\sum_j n_j -1)}, \] where \( \overline{SMR} = \sum_j n_jSMR_j/\sum_j n_j \).
Between facility variance \( \sigma^2_b = \sigma^2_t- \sigma^2_w \).
\[ IUR = \frac{\sigma^2_b}{\sigma^2_t}. \]
Facility IUR, \[ IUR_j = \frac{\sigma^2_b}{\sigma^2_b+\sigma^2_w/n_j}. \]

Efficiency comparison

Real Data

Measure	Input	Data requirement	stratify.var
SMR	obs_death, exp_death, facility, stratify.var, stratify_cutoff, year	Remove ‘Short’ and small facilities (facility expected death < 3)	facility size
SHR	obs_admission, exp_admission, facility, stratify.var, stratify_cutoff, year	Remove ‘Short’ and small facilities (facility patient year < 5 )	facility size
STrR	obs_transyr, exp_transyr, facility, stratify.var, stratify_cutoff, year	Remove ‘Short’ and small facilities (facility trans_yar<10)	facility size

R code example

SMR

rm(list=ls())
uniqname = 'yuanyang'
output_path = 'K:/Users/kecc-yuanyang/IUR0321'
data_path = 'K:/Projects/Dialysis_Reports_Shared/Data/SMR/special_request'
raw_data_name = 'SMRSHR_2014to2017.sas7bdat'

obs_death = "dial_drd" #observed death
exp_death = "expectda" #expeceted death
facility = "provfs" #facility id
death_yar = "DIAL_yar" #death year
year = "year" #data year

Results

Results

IUR for 2014-2017 SMR data:

SHR

rm(list=ls())
uniqname = 'yuanyang'
output_path = 'K:/Users/kecc-yuanyang/IUR0321'
data_path = 'K:/Projects/Dialysis_Reports_Shared/Data/SMR/special_request'
raw_data_name = 'SMRSHR_2014to2017.sas7bdat'

obs_admission = "h_admits" #observed admission
exp_admission = "expectta" #expeceted admission
hosp_yar = "h_dy_yar" #hospital YAR
facility = "provfs" #facility id
year = "year" #data year

Results

IUR for 2017 SHR data:

General C statistic

C statistic is to measure the proportion of concordance pairs of observed and expected outcomes to all possible pairs.
Normally, C statistic is defined as \[ C_{stat}=\frac{\rm number \ of\ concordances}{\rm number\ of\ pairs}= \frac{ \sum_{i < j} I(y_i < y_j\ \&\ \hat{y}_i < \hat{y}_j) }{\sum_{i < j}1}. \] where
- \( y_i \): the obeserved outcome;
- \( \hat{y}_i \): the fitted value.
\( 0 \le C \le 1 \). Higher C indicates higher concordance between y and y_hat, and thus better model fitting.

A Simple Example

Data:

ID y y_hat

1 1.1 4.5

2 2.3 3.4

3 4.3 5.3
- Numerator = \( \sum_{i < j} I(y_i < y_j\ \&\ \hat{y}_i < \hat{y}_j) \)= 2
- Denominator = \( \sum_{i < j}1 \)=3
- Therefore, C=2/3.

ID	y	y_hat
1	1.1	4.5
2	2.3	3.4
3	4.3	5.3

C statistic for SHR

Data preparation:

\( m_i \)(h_admits): observed number of events.
\( T_i \)(h_dy_yar): year-at-risk.
\( {\rm year}_i \): patient hospitalization year. For example, pyf4_mod1 data is from 2010 to 2013.
\( e_i \)(expectta): expected number of events.
\( r_i=m_i/T_i \): observed event rate.
\( \hat{r}_i = e_i/T_i \): expected event rate.
\( SHR_i \)(shrty1_f, shrty2_f, shrty3_f, shrty4_f): SHR from 2010 to 2013 in pyf4_mod1 data.
\( \hat{r}_{2i}=\hat{r}_i*SHR_i \): SHR adjusted expected rate.

C statistics for SHR:

Define category by \( {\rm year}_i \).
Sort the data by \( r_i \).
Within category \( k \), \( \rm number \ of\ concordances = N_k = \sum_i \sum_{j:j > i} I((r_i < r_j)\ \&\ (\hat{r}_{2i} < \hat{r}_{2j})) \), \( \rm number \ of\ pairs = D_k = \sum_i \sum_{j:j > i} 1 \).
Then \( C_{stat} = \frac{\sum_k N_k}{\sum_k D_k} \).

Efficiency comparison (Rcpp vs. R)

R package

rm(list=ls())
uniqname = 'yuanyang'
output_path = 'K:/Users/kecc-yuanyang/C_statistics0214'
data_path = 'K:/Projects/Dialysis_Reports_Shared/Data/SMR/special_request'
raw_data_name = 'SMRSHR_2014to2017.sas7bdat'

SHR = 'shrty'
obs_admission = "h_admits" #observed admission
exp_admission = "expectta" #expeceted admission
hosp_yar = "h_dy_yar" #hospital YAR
facility = "provfs" #facility id
year = "year" #data year

Results

C Statistic for Survival model

Sort data by event time, then the C statistic for Survival model is defined as \[ C_{stat} = \frac{ \rm number \ of\ concordances}{\rm number\ of\ pairs} =\frac{\sum_i \sum_{j:j > i} I(T_i < T_j\ \&\ ( X \hat{\beta})_i>( X \hat{\beta})_j)}{\sum_i \sum_{j:j > i}\delta_i}, \] where

\( T_i \): the obeserved survival time
\( \delta_i \): 1= event; 0 = censored; \( \Delta \): \( \{ i:\delta_i=1 \} \)
\( (X\hat{\beta})_i=X_i \hat{\beta} \)

C statistic for SMR

Data:

\( T_i (\rm DIAL\_yar) \): observed survival time.
\( (X\hat{\beta})_i (pred\_xbeta) \): \( X \) is the covariates and \( \hat{\beta} \) is from the model.
\( k (\rm provfs) \): facility number.
\( \delta_i (\rm dial\_drd) \): event indicator.

Concordance for SMR

if \( \delta_i = 1 \) and \( \delta_j = 1 \),
- if \( |T_i-T_j|< 0.0001 \) (tie), then concordance = 0.5;
- else if \( |(X\hat{\beta})_i-(X\hat{\beta})_j|<0.0001 \), then concordance=0.5; else if \( (X\hat{\beta})_i>(X\hat{\beta})_j \), concordance = 1.
if \( \delta_i = 1 \) and \( \delta_j=0 \),
- if \( |(X\hat{\beta})_i-(X\hat{\beta})_j|<0.0001 \), then concordance=0.5;
- else if \( (X\hat{\beta})_i>(X\hat{\beta})_j \), then concordance = 1.

Algorithm

Sort data by \( T_i \) (increasing order).
Within facility \( k \), \[ \rm number \ of\ concordances = N_k = \sum_i \sum_{j:j>i} concordance(i,j), \] \[ \rm number \ of\ pairs = D_k = \sum_i \sum_{j:j>i} \delta_i. \]
Then \( C_{stat} = \frac{\sum_k N_k}{\sum_k D_k} \).

Efficiency comparison (Rcpp vs. SAS)

On a toy data with 1000 obervations:
About 1000 times faster.

R package

rm(list=ls())
uniqname = 'yuanyang'
output_path = 'K:/Users/kecc-yuanyang/C_statistics0214'
data_path = 'K:/Projects/Dialysis_Reports_Shared/Data/SMR/special_request'
raw_data_name = 'SMRSHR_2014to2017.sas7bdat'


obs_death = "dial_drd" #observed death
exp_xbeta = "xbeta_smr" #predicted Xbeta
death_yar = "DIAL_yar" #dialysis YAR
facility = "provfs" #facility id

Results

Results for SMR C Statistic

Variable selection for SHR model

Following Liu 2012, we first introduce some notation:

\( k \): facility index, \( 1,\dots,K \).
\( l \): interval index, \( 1, \dots, L \). The intervals are defined by \( a_0,\cdots,a_L \), e.g. \( [a_0,a_1],\cdots, [a_{L-1},a_{L}] \). In ED_SHR data, we use T_start and T_stop to define the intervals.
\( i \): patient index, \( 1,\dots, n_{kl} \).
\( \mathbf{Z}_{kli} \): covarites of \( i \)-th patient in \( k \)-th facility \( l \)-th interval. \( p*1 \) vector.

\( d_{kli} \): \( \int_{a_{l-1}}^{a_l}dN_{ki}(t) \), the observed number of events experienced in interval \( l \). In SHR data, it is ed_visits; in SHR, it is h_admits.
\( t_{kli} \): \( \int_{a_{l-1}}^{a_l}Y_{ki}(t)dt \), at-tisk time of \( i \)-th patient in \( k \)-th facility \( l \)-th interval. In ED_SHR data, it is ed_dy_yar; in SHR data, it is h_dy_yar.
\( o_{kli} \): \( o_{kli}=\frac{t_{kli}}{w_{kli}} \), where \( w_{kli}=\max(d_{kli},1) \).
\( \mathbf{\beta} \): coefficients. \( p*1 \) vector.

Partial likelihood

The partial likelihood for SHR model is \[ PL_f = \prod_{k=1}^{K} \prod_{l=1}^{L} \prod_{i=1}^{n_{kl}} \Big[ \frac{e^{ \{\mathbf{\beta}^T\mathbf{Z}_{kli}+o_{kli}\}}}{\sum_{j=1}^{n_{kl}}t_{kli}e^{\{\mathbf{\beta}^T\mathbf{Z}_{kli}\}}} \Big]^{d_{kli}}. \]

Then the log-partial likelihood is \[ l(\mathbf{\beta})=\sum_{j=1}^{K} \sum_{l=1}^{L} \sum_{i=1}^{n_{kl}} \Big[ \mathbf{\beta}^T\mathbf{Z}_{kli}+o_{kli} - log\{ \sum_{j=1}^{n_{kl}}t_{kli}e^{\{\mathbf{\beta}^T\mathbf{Z}_{kli}\}} \} \Big] {d_{kli}}. \]

First derivative: \[ U(\mathbf{\beta})=\frac{\partial l}{\partial \mathbf{\beta}}= \sum_{j=1}^{K} \sum_{l=1}^{L} \sum_{i=1}^{n_{kl}} \Big[ \mathbf{Z}_{kji} - \frac{\mathbf{S}_{kl}^{(1)}}{S_{kl}^{(0)}} \Big] {d_{kli}}, \] where \( S_{kl}^{(0)}=\sum_{i=1}^{n_{kl}} t_{kli}e^{\mathbf{\beta}^T\mathbf{Z}_{kli}} \) and \( \mathbf{S}_{kl}^{(1)}=\sum_{i=1}^{n_{kl}} t_{kli}e^{\mathbf{\beta}^T\mathbf{Z}_{kli}} \mathbf{Z}_{kli} \).

Second derivative: \[ G(\mathbf{\beta})=\frac{\partial^2 l}{\partial \mathbf{\beta}^T\partial \mathbf{\beta}}= \sum_{k=1}^{K} \sum_{l=1}^{L} \sum_{i=1}^{n_{kl}} \Big[ - \frac{\mathbf{S}_{kl}^{(2)} S_{kl}^{(0)} -\mathbf{S}_{kl}^{(1)} {\mathbf{S}_{kl}^{(1)}}^T}{{S_{kl}^{(0)}}^2} \Big] {d_{kli}}, \] where \( \mathbf{S}_{kl}^{(2)}=\sum_{i=1}^{n_{kl}}t_{kli}e^{\mathbf{\beta}^T\mathbf{Z}_{kli}}\mathbf{Z}_{kli}\mathbf{Z}_{kli}^T \).

Boosting

For \( j=1,\dots,p \), we calculate the first and second derivative of \( l(\mathbf{\beta}) \) with respect to \( \beta_j \), denoted as \( U(\mathbf{\beta})_j \) and \( G(\mathbf{\beta})_j \), respectively.
Find \( j^* \), where \( U(\mathbf{\beta})_{j^*} = \max\{U(\mathbf{\beta})_{j},j=1,\cdots,p\} \).
Update \( \beta_{j^*}=\beta_{j^*} - \frac{U(\mathbf{\beta})_j}{G(\mathbf{\beta})_j} \),

The goal is to select important comorbidity groupers. At each step, we update the coefficients for all adjustment variables and one comorbidity grouper using Boosting Algorithm. The groupers should be normalized before boosting.

Efficiency of Rcpp

Both R and Rcpp version are implemented for boosting algorithm.
We used a subdata from SHR data to test the efficiency. The subdata includes 5, 000 observations with 2,573 strata. The efficiency of Rcpp version is about 115 times than that of R version.

Results for SMR IUR

Fdr control

The goal of this task is to reduce the number of false positives.
Fdr is an abbreviation of empirical Bayes false discovery rate, which is defined as the probability we have made a false discovery given this variable is selected.
Selection rule: for a prespecified q (0<q<1), we select variables with their Fdr <= q. This procedure will ensure that at most q proportion of the selected variables would be false positives.
- For example, q=0.1, if 10 variables are selected, at most 1 of them is false positive.

Fdr control

\( S_0=\{ j:\beta_j \ne 0\} \): the set of indexes of true variables
\( \hat{S}=\{ j:\hat{\beta}_j \ne 0\} \): the set of indexes of selected variables
\( N_+ = |\hat{S}| \), \( e_+ = {\rm E}N_+ \)
\( N_0=|\hat{S}\setminus S_0| \), \( e_0={\rm E}N_0 \)
Empirical FDR \( \overline{Fdr} = \frac{e_0}{N_+} \)

Algorithm

On the original dataset, we bootstrap B times. Then we calculate \( \hat{\pi}_j=1/B \sum_{b=1}^{B}I\{ \hat{\beta}_j^{(b)} \ne 0\} \). \( \hat{\pi}_{(1)}\ge \hat{\pi}_{(2)} \ge \dots \ge \hat{\pi}_{(n)} \) are the ordered \( \hat{\pi}_i \)'s.
We permute orginal data M times.
- For \( m \)-th data, bootstrap B times; calculate \( \tilde{\pi}_j^m=\frac{1}{B} \sum_{b=1}^{B}I\{ \tilde{\beta}_j^{(m,b)} \ne 0\} \); get ordered \( \tilde{\pi}_{(i)}^m \)'s.
- \( \tilde{\pi}_{(j)}^* = \frac{1}{M} \sum_{m=1}^{M} \tilde{\pi}_{(j)}^{m} \).

Algorithm (cont'd)

Transform \( \pi \) to be a Normal variable \( z \).
- \( \triangle_{(j)} = \hat{z}_{(j)} - \tilde{z}_{(j)}^* \) and \( c_{(j)}=min \{ \hat{\pi}_{(l)}: \triangle_{(l)} \ge \triangle_{(j)} ,\quad l=1,\dots,p\} \).
- \( \overline{Fdr}_{(j)} = \underset{1\le j' \le j}{max} \frac{\sum_{m=1}^{M} \sum_{l=1}^{p} I(\tilde{\pi}_{(l)}^m>c_{(j')})/M}{\sum_{l=1}^{p} I(\hat{\pi}_{(l)}^0>c_{(j')})} \).

Computation burden

Results for SMR IUR

If boosting takes 3 hrs to work on one dataset, the total time = 1050*3/24 = 131 days.
For now, we can use server to reduce the burden. But a more efficient computation is needed.

Conclusion

Project	Status	Efficiency Improvement
IUR	Completed. User friendly code available on Gitlab.	new R vs. old R ~ 144 times
C Index	Completed. User friendly code available on Gitlab.	Rcpp vs. R ~60 times; Rcpp vs. SAS ~1000 times
Variable selection	80% completed. Need to test on a real data.	Rcpp vs. R ~ 115 times

Thank you and questions!

Reference

Liu, D., Schaubel, D. E., and Kalbfleisch, J. D. (2012). Computationally efficient marginal models for clustered recurrent event data. Biometrics, 68(2), 637-647.