1 Previous Analysis

Analysis from February 9, 2018

Curves from February 9, 2018

2 Replicating Previous Analysis

3 Univariate Predictors

3.1 KM plots and Cox PH tables for the univariate predictors under consideration

The ‘short list’ of univariate predictors is: . These were selected on the basis of having an FDR <= - .

3.2 Multivariate Model Selection

Fitting Proportional Hazards Regression Model: Surv(a_emr_tdiag, a_emr_crecur == 1) ~ v002_bn_crtnn_rt_3097_3_info + e_opd_anlgscs
                                     coef      exp(coef)  se(coef)  z        p
v002_bn_crtnn_rt_3097_3_infoNormal   -1.118    0.3271     0.4644    -2.406   0.01611
e_opd_anlgscsTRUE                    0.6246    1.867      0.3407    1.833    0.06675

Likelihood ratio test = 10.87 on 2 df, p = 0.004357139, n = 540, number of events = 75

3.3 Cox Proportional Hazard

Full model from last time

Fitting Proportional Hazards Regression Model: Surv(a_emr_tdiag, a_emr_crecur == 1) ~ e_hisp + e_rdwrbc + e_opd_anlgscs
                    coef      exp(coef)  se(coef)  z        p
e_hispTRUE          0.2062    1.229      0.2792    0.7383   0.4603
e_rdwrbc            0.06473   1.067      0.0549    1.179    0.2383
e_opd_anlgscsTRUE   0.8273    2.287      0.4724    1.751    0.07988

Likelihood ratio test = 4.33 on 3 df, p = 0.2275098, n = 350, number of events = 52

This version only uses e_hisp as a predictor

Fitting Proportional Hazards Regression Model: Surv(a_emr_tdiag, a_emr_crecur == 1) ~ e_hisp
             coef     exp(coef)  se(coef)  z        p
e_hispTRUE   0.2008   1.222      0.278     0.7224   0.47

Likelihood ratio test = 0.52 on 1 df, p = 0.4698551, n = 350, number of events = 52

This version omits e_hisp

Fitting Proportional Hazards Regression Model: Surv(a_emr_tdiag, a_emr_crecur == 1) ~ e_rdwrbc + e_opd_anlgscs
                    coef     exp(coef)  se(coef)  z       p
e_rdwrbc            0.0612   1.063      0.0546    1.121   0.2624
e_opd_anlgscsTRUE   0.842    2.321      0.4719    1.784   0.07439

Likelihood ratio test = 3.79 on 2 df, p = 0.1503818, n = 350, number of events = 52
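
Because the model without e_hisp is nested in the full model and both were fit to the same 350 patients, the contribution of e_hisp can be checked with a likelihood ratio test. The sketch below is illustrative only: the first part uses the likelihood ratio statistics reported above, while the second part uses simulated stand-in data (the column names mirror this report, the values are made up) to show how the same comparison could be run on fitted models.

```r
library(survival)

# Part 1: from the reported statistics. Both tests are against the same null
# model, so their difference is the LR statistic for dropping e_hisp (1 df).
lr_drop_hisp <- 4.33 - 3.79
p_drop <- pchisq(lr_drop_hisp, df = 1, lower.tail = FALSE)
print(p_drop)  # about 0.46, in line with the Wald p-value for e_hisp

# Part 2: the same comparison on SIMULATED data (not the study data).
set.seed(1)
n <- 350
dat <- data.frame(
  e_hisp        = rbinom(n, 1, 0.5) == 1,
  e_rdwrbc      = rnorm(n, mean = 14, sd = 2),
  e_opd_anlgscs = rbinom(n, 1, 0.8) == 1,
  a_emr_tdiag   = rexp(n, rate = 1 / 1000),
  a_emr_crecur  = rbinom(n, 1, 0.15)
)

full      <- coxph(Surv(a_emr_tdiag, a_emr_crecur == 1) ~
                     e_hisp + e_rdwrbc + e_opd_anlgscs, data = dat)
hisp_only <- coxph(Surv(a_emr_tdiag, a_emr_crecur == 1) ~ e_hisp, data = dat)
no_hisp   <- update(full, . ~ . - e_hisp)

# Analysis-of-deviance table comparing the nested models
lrt <- anova(no_hisp, full)
print(lrt)
```

On the real data, the second part would only need `dat` replaced by the analytic data frame.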

3.4 KM Plot Using Linear Predictor from Cox Proportional Hazard Models

Full model

Ethnicity-only

No Ethnicity
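
For reference, one common way to draw a KM plot from a Cox model's linear predictor is to split patients into risk groups and plot each group's curve. The sketch below assumes a median split and uses simulated stand-in data; the cutpoint, the group labels, and the data values are assumptions for illustration, not a record of how the plots in this section were produced.

```r
library(survival)

# SIMULATED stand-in data; column names mirror this report, values do not.
set.seed(2)
n <- 350
dat <- data.frame(
  e_hisp        = rbinom(n, 1, 0.5) == 1,
  e_rdwrbc      = rnorm(n, mean = 14, sd = 2),
  e_opd_anlgscs = rbinom(n, 1, 0.8) == 1,
  a_emr_tdiag   = rexp(n, rate = 1 / 1000),
  a_emr_crecur  = rbinom(n, 1, 0.15)
)

fit <- coxph(Surv(a_emr_tdiag, a_emr_crecur == 1) ~
               e_hisp + e_rdwrbc + e_opd_anlgscs, data = dat)

# Linear predictor for each patient
dat$lp <- predict(fit, type = "lp")

# Assumption: dichotomize at the median of the linear predictor
dat$risk_group <- ifelse(dat$lp > median(dat$lp), "High", "Low")

# KM curves by risk group
km <- survfit(Surv(a_emr_tdiag, a_emr_crecur == 1) ~ risk_group, data = dat)
plot(km, col = c("red", "blue"),
     xlab = "Time since diagnosis (a_emr_tdiag)",
     ylab = "Proportion without recurrence")
legend("bottomleft", legend = names(km$strata),
       col = c("red", "blue"), lty = 1)
```

The same steps apply to the ethnicity-only and no-ethnicity models by changing the formula passed to `coxph`.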

4 Next milestones:

4.1 Re-run the original analysis that showed the disparity in our data (UT Med + UHS data)

The following variables were used by the earlier analysis:

  • v029_Hspnc_or_Ltn is the Hispanic ethnicity indicator
    • corresponds to e_hisp for EMR/i2b2
    • corresponds to a_hisp_naaccr for our NAACCR registry.
  • v050_RDW_RBC_At_Rt_GENERIC_KUH_COMPONENT_ID_5629_numnona, the red blood cell distribution width (RDW).
    • Corresponds to v113_rdw_rbc_at_rt_788_0_num in EMR/i2b2
    • NO EQUIVALENT AVAILABLE IN NAACCR
  • v037_CN_ANLGSCS, analgesics.
    • Split between v110_opd_anlgscs (opioid) and v123_nn_opd_anlgscs (non-opioid) in EMR/i2b2
    • NO EQUIVALENT AVAILABLE IN NAACCR
  • a_cens_1 is the event indicator variable, which in this case is the earliest occurrence of a secondary neoplasm as represented by any of these variables: v003_Scndr_nrndcrn, v004_mlgnt_unspcfd, v005_rsprtr_dgstv, and v006_unspcfd_mlgnt.
    • In the EMR/i2b2 data the corresponding variables are: v008_scndr_nrndcrn, v008_scndr_nrndcrn_inactive, v009_mlgnt_unspcfd, v009_mlgnt_unspcfd_inactive, v010_rsprtr_dgstv, v010_rsprtr_dgstv_inactive, v011_unspcfd_mlgnt, v011_unspcfd_mlgnt_inactive, v012_unspcfd_mlgnt, v012_unspcfd_mlgnt_inactive, v013_rsprtr_dgstv, v013_rsprtr_dgstv_inactive, v014_mlgnt_spcfd, v014_mlgnt_spcfd_inactive.
    • In the NAACCR data this information can be obtained from the a_n_recur variable.
  • a_dxage3 was obtained by taking the patient’s age at the earliest of their first secondary neoplasm (i.e. a_cens_1 is equal to 1), their last followup, or 2723 days (in the old data, the last recurrence to be observed), and then subtracting that patient’s age at initial diagnosis. Thus a_dxage3 is (supposed to be) the number of days from initial diagnosis to first recurrence or censoring.
    • As I showed previously, deriving initial diagnosis from EMR data is problematic because it disagrees with the NAACCR records for the same patients. Nevertheless, a_e_kc is the analytic variable in the EMR/i2b2 data that has this information.
    • The time of diagnosis for NAACCR is encoded by a_tdiag.

So, the following model should reproduce the old results:

Surv(a_emr_tdiag,a_emr_crecur == 1) ~ e_hisp + v113_rdw_rbc_at_rt_788_0_num + v110_opd_anlgscs

The following model just shows the effect of Hispanic ethnicity:

Surv(a_emr_tdiag,a_emr_crecur == 1) ~ e_hisp

TODO:

  • Create analytic variables to fill in the TIME and EVENT placeholders in the above models using a_e_kc and the secondary tumor codes in data.R
  • Run analysis and survival plots in this script
  • Use the automatic variable formatting used in exploration.R in this script as well so that it’s readable
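
As a starting point for the first TODO item, the TIME and EVENT derivation described for a_dxage3 and a_cens_1 could look roughly like the sketch below. The function name, its argument names, and the toy ages are hypothetical, and it reads the 2723-day limit as a cap on days since diagnosis, which is an interpretation rather than something confirmed against data.R.

```r
# Hypothetical helper: all ages are in days.
#   age_dx     age at initial diagnosis
#   age_recur  age at first secondary neoplasm (NA if none recorded)
#   age_lastfu age at last followup
#   cap_days   administrative cap on followup (2723 in the old data)
derive_time_event <- function(age_dx, age_recur, age_lastfu, cap_days = 2723) {
  t_recur  <- age_recur - age_dx    # NA when there is no recurrence
  t_lastfu <- age_lastfu - age_dx
  # TIME: earliest of recurrence, last followup, or the cap
  time <- pmin(t_recur, t_lastfu, cap_days, na.rm = TRUE)
  # EVENT: 1 only if the recurrence is what ended the observation
  event <- as.integer(!is.na(t_recur) & t_recur <= time)
  data.frame(time = time, event = event)
}

# Toy patients (made-up values)
pts <- data.frame(
  age_dx     = c(20000, 21000, 22000),
  age_recur  = c(20500,    NA, 26000),
  age_lastfu = c(21000, 22500, 27000)
)
out <- derive_time_event(pts$age_dx, pts$age_recur, pts$age_lastfu)
print(out)
# time:  500 1500 2723
# event:   1    0    0
```

The third toy patient has a recurrence after the cap, so it is censored at 2723 days rather than counted as an event.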

Time: 2-3 days

4.2 Run the same analysis in the local (UT Med) TX Cancer Registry data

TODO:

Only the second model can be run – Surv(TIME,EVENT) ~ a_hisp_naaccr.

Time: 1 day

4.3 Run the same analysis in UT Med data enhanced with the TX Cancer Registry data

TODO:

One source should be treated as the authoritative one and back-filled from the other source when missing/invalid. What standards should we use for missing or invalid?
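
To make the question concrete, the back-fill itself could be a small helper that takes the validity rule as a parameter, so that the rule can be changed once it is decided. In the sketch below the function name, the choice of NAACCR as the authoritative source, and both example rules are assumptions for illustration only.

```r
# Hypothetical helper: keep the authoritative value unless it is invalid,
# in which case fall back to the other source.
backfill <- function(primary, secondary,
                     is_invalid = function(x) is.na(x)) {
  ifelse(is_invalid(primary), secondary, primary)
}

# Toy values (made up): days-since-reference diagnosis dates
naaccr_dx <- c(100, NA, 300, NA)
emr_dx    <- c(110, 210, NA, NA)

# Rule 1: only missing values count as invalid
merged <- backfill(naaccr_dx, emr_dx)
print(merged)  # 100 210 300 NA

# Rule 2 (one possible stricter standard): negative values are also invalid
merged_strict <- backfill(c(-5, NA, 300), c(110, 210, NA),
                          is_invalid = function(x) is.na(x) | x < 0)
print(merged_strict)  # 110 210 300
```

Note that `ifelse` drops attributes such as the Date class, so date columns would need to be converted back after merging.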

Time: 1 day after above two are done and this question is answered.

4.4 Request the TX Cancer Registry data and run the analysis there (depending on time the request takes, this may not be possible for this meeting but you should have the data by then)

TODO:

  • Re-read online instructions for requesting de-id data from TCR
  • Make list of all relevant non-id variables and then fill out the request
  • Submit request

Time: 1-2 days

The results are saved and available for use by other scriports if you place 'disparity.R' among the values in their .deps variables.


  1. UT Health San Antonio