1 Setup

There are three sets of derived data that results and analyses depend on:

  • tab.Surv is the preprocessed data ready for survival analyses
  • missingTab is the same data but formatted for an analysis of missing data and comparisons of e.g. differences in variables between groups
  • allIDs is simply a vector of participant IDs

We have two preprocessing scripts that work directly on the CATIE source data:

  1. TRS_CATIE_v3.R, which does some initial data-munging work to get the raw CATIE source data into a format that can be processed, including things like assembling tables of medications and doses (which is not a trivial task with the CATIE data)

  2. KM_surivival_v3.R – takes the output of TRS_CATIE_v3.R and processes each participant’s trajectory/history through the trial (symptoms, social-occupational functioning, trials of treatment) and outputs the tables above.

I’m happy to deliver these scripts on request (they are long, idiosyncratic, not very readable and will require some time to make them more friendly). They are not included here because they cannot be run independently of the source CATIE data set, and I do not have rights to package/publish these datasets – they can be obtained by applying to to the NIMH for access : https://www.nimh.nih.gov/funding/clinical-research/datasets/index.shtml

This notebook uses only the derived data from TRS_CATIE_v3.R and KM_surivival_v3.R so the logic of the analyses can be followed.

rm( list = ls() )
require(knitr)
## Loading required package: knitr
missingTab <- read.csv("missingData_tabulated_v3.csv")
tab.Surv   <- read.csv("survivalData_tabulated_v3.csv")
allIDs     <- read.csv("all_participant_IDs.csv")

options(knitr.kable.NA = '')

2 Demographics Tables for Loss-To-Follow-Up

Some participants will be missing data which means that for establishing TRS cases, there may be bias introduced because – e.g. very unwell patients might have been lost to follow up more so than those who were relatively stable or well at recruitment.

To establish patterns in loss-to-followup, we first interrogate the data with multiple univariate analyses (t-tests / Komogorov-Smifnoff (KS) tests for continuous or Chi-square tests for proportions / ordinal data).

2.1 Establish Data Types

We use missingTab to create a vector of data types (for each variable/column), which will enable some automation to build the table summarising missing data.

# create vector of data types (numeric, ordinal/factor)
dataType <- as.numeric( unlist( lapply(missingTab, function(x) if (!is.numeric(x)) NA else max(x, na.rm = TRUE))) )

# shrink dataType so we don't test missingFU.rx, sx, sof : these three are dependent variables in the missing data 
# analysis, so don't need including
dataType <- dataType[ 1:(length( dataType ) -4 ) ]

# parse dataType and decide on a relevant univariate test - e.g. Chi square for categorical proportions and t and KS for continuous
testFlag <- rep(NA, length( dataType ) )

for( i in 1:length( dataType ) ) {
  if( dataType[i] == 1 | is.na( dataType[i] ) ) {
    # categorical (factor) or binary : proportion test - Chi-square : code = 1
    testFlag[i] <- 1
  } else {
    if( dataType[i] > 1 ) {
      # continuous, therefore KS and t-test
      testFlag[i] <- 2
    }
  }
}

2.2 Data Storage

Create a data frame that storing results:

testTable <- data.frame( Var = rep("", length( dataType ) ),
                         M.Missing          = rep(NA, length( dataType ) ),
                         Spread.Missing     = rep(NA, length( dataType ) ),
                         M.Present          = rep(NA, length( dataType ) ),
                         Spread.Present     = rep(NA, length( dataType ) ),
                         Perc.Missing       = rep(NA, length( dataType ) ),
                         Perc.Present       = rep(NA, length( dataType ) ),
                         P.Value            = rep(NA, length( dataType ) ),
                         Test.Used          = rep("", length( dataType ) ),
                         stringsAsFactors = FALSE
)

2.3 Populate Table

Note : we use suppressWarnings() because the Chi-square and KS test report warnings (due to e.g. sample sizes or ties) – normally, these would require attention but here we don’t need the warnings or output in the markup.

# for all variables in testFlag ...
for( i in 1:length( testFlag ) ) {
  thisVar <- names( missingTab )[i]
  testTable$Var[i] <- thisVar
  
  # Use missingFU.any (i.e. SOF, Sx or Rx)
  # Select out data for missing overall group and non-missing group
  grpMissing    <- missingTab[ which( missingTab$missingFU.any == 1 ), i ]
  grpNotMissing <- missingTab[ which( missingTab$missingFU.any == 0 ), i ]
  
  # if this variable's testFlag == 1, it's ordinal / categorical data
  if( testFlag[i] == 1 ) {
    test.tab <- rbind( table( grpNotMissing ), table( grpMissing ) )
    rownames(test.tab) <- c("Present","Missing")
    
    suppressWarnings(
      xs <- chisq.test( test.tab )
    )
    testTable$Perc.Present[i]  <- round( length( which( grpNotMissing == 1 ) ) / length( grpNotMissing ) * 100, 2 )     ## proportion of 1's to 0s in subjects with present data 
    testTable$Perc.Missing[i]  <- round( length( which( grpMissing == 1 ) ) / length( grpMissing ) * 100, 2 )   
    ## proportion of 1's to 0s in subjects with Missing data 
    testTable$P.Value[i]       <- round( xs$p.value, 3 )
    testTable$Test.Used[i]     <- "ChiSq"
    
  } else {
    # This variable is continuous
    # We need either KS and t-test 
    
    # test normality with Shapiro-Wilks
    this.SW.test.missing    <- shapiro.test( grpMissing )
    this.SW.test.notmissing <- shapiro.test( grpNotMissing )
    
    if( this.SW.test.missing$p.value < 0.05 | this.SW.test.notmissing$p.value < 0.05 ) {
      # if p.value of either SW test < 0.05, then use KS (i.e. at least one distribution is non-normal)        
      suppressWarnings(
        this.test <- ks.test( grpMissing, grpNotMissing, alternative = "two.sided" )      
      )
    } else {
      # both missing and non-missing data are normal enough
      suppressWarnings(
        this.test  <- t.test( grpMissing, grpNotMissing, alternative = "two.sided" )
      )
    }
    
    if( this.test$method == "Two-sample Kolmogorov-Smirnov test" ) {
      # record median and IQR
      testTable$M.Missing[i]      <- median( grpMissing, na.rm = TRUE )
      testTable$Spread.Missing[i] <- IQR( grpMissing, na.rm = TRUE )
      testTable$M.Present[i]      <- median( grpNotMissing, na.rm = TRUE )
      testTable$Spread.Present[i] <- IQR( grpNotMissing, na.rm = TRUE )
      testTable$P.Value[i]        <- round( this.test$p.value, 3 )
      testTable$Test.Used[i]      <- "KS"
    }
    
    if( this.test$method == "Welch Two Sample t-test" ) {
      # record median and IQR
      testTable$M.Missing[i]      <- mean( grpMissing, na.rm = TRUE )
      testTable$Spread.Missing[i] <- sd( grpMissing, na.rm = TRUE )
      testTable$M.Present[i]      <- mean( grpNotMissing, na.rm = TRUE )
      testTable$Spread.Present[i] <- sd( grpNotMissing, na.rm = TRUE )
      testTable$P.Value[i]        <- round( this.test$p.value, 3 )
      testTable$Test.Used[i]      <- "T"
      
    }
    
  }
}

# add a significance flag (purely cosmetic)
testTable$signif <- ifelse( testTable$P.Value < 0.05, "*", " " )

2.4 Ancillary Tables

Above, the proportions for missing categorical/ordinal data where only counted for binary variables, not those for e.g. factors > 2 levels.

Two variables – race and marital status – have many levels, so we create two ancillary tables for these variables alone (the “format” will be different and not compatible with the larger table)

grpMissing    <- missingTab[ which( missingTab$missingFU.any == 1 ), "marital" ]
grpNotMissing <- missingTab[ which( missingTab$missingFU.any == 0 ), "marital" ]

prop.marital <- rbind( table( grpMissing ),
                       table( grpNotMissing ) )

rownames( prop.marital ) <- c( "Missing", "Not Missing")
prop.marital <- data.frame( t(prop.marital) )
prop.marital$N <- rowSums( prop.marital )

grpMissing    <- missingTab[ which( missingTab$missingFU.any == 1 ), "race" ]
grpNotMissing <- missingTab[ which( missingTab$missingFU.any == 0 ), "race" ]
prop.race     <- rbind( table( grpMissing ),
                        table( grpNotMissing ) )

rownames( prop.race ) <- c( "Missing", "Not Missing")
prop.race <- data.frame( t(prop.race ) )
prop.race$N <- rowSums( prop.race )

(prop.race)
(prop.marital)

2.5 Summary

testTable now contains multiple univariate tests for differences in missing data at follow-up for each of the demographics / baseline characteristics. These are highlighted only for use in adjusting the subsequent regressions/survival analysis – not because they are hypotheses that are worthy of formal testing. Hence, there’s no claim any difference survives post-hoc multiple correction – it’s just we need to know if there are differences, and to account for these variables later (i.e. it is better to tolerate Type I error and include a variable in adjusted analyses than not).

3 Tabular Output

require(xtable)
## Loading required package: xtable
testTable.latex <- print( xtable(testTable, 
                                 caption = "Differences in Missing Follow-Up Data"),
                          include.rownames=FALSE)
## % latex table generated in R 3.4.3 by xtable 1.8-2 package
## % Sat Mar 31 18:43:14 2018
## \begin{table}[ht]
## \centering
## \begin{tabular}{lrrrrrrrll}
##   \hline
## Var & M.Missing & Spread.Missing & M.Present & Spread.Present & Perc.Missing & Perc.Present & P.Value & Test.Used & signif \\ 
##   \hline
## P & 19.00 & 7.00 & 18.00 & 8.00 &  &  & 0.38 & KS &   \\ 
##   N & 20.00 & 9.00 & 20.00 & 9.00 &  &  & 0.79 & KS &   \\ 
##   G & 37.00 & 12.00 & 37.00 & 13.00 &  &  & 0.57 & KS &   \\ 
##   age & 40.00 & 17.00 & 42.00 & 16.00 &  &  & 0.10 & KS &   \\ 
##   sex &  &  &  &  & 76.72 & 73.43 & 0.31 & ChiSq &   \\ 
##   exac3mo &  &  &  &  & 35.11 & 25.80 & 0.00 & ChiSq & * \\ 
##   yrsEduc & 12.00 & 2.00 & 12.00 & 1.00 &  &  & 0.04 & KS & * \\ 
##   CGIsev & 4.00 & 1.00 & 4.00 & 2.00 &  &  & 0.74 & KS &   \\ 
##   employFT &  &  &  &  & 8.02 & 6.43 & 0.44 & ChiSq &   \\ 
##   yrsFirstTx & 14.00 & 18.25 & 16.00 & 18.00 &  &  & 0.23 & KS &   \\ 
##   yrsFrstAntiPsyRx & 12.00 & 17.00 & 13.00 & 18.00 &  &  & 0.40 & KS &   \\ 
##   olzB0 &  &  &  &  & 21.37 & 27.83 & 0.04 & ChiSq & * \\ 
##   quetB0 &  &  &  &  & 11.07 & 9.56 & 0.53 & ChiSq &   \\ 
##   rispB0 &  &  &  &  & 21.37 & 22.84 & 0.67 & ChiSq &   \\ 
##   zipB0 &  &  &  &  & 3.05 & 5.25 & 0.18 & ChiSq &   \\ 
##   halB0 &  &  &  &  & 5.73 & 5.84 & 1.00 & ChiSq &   \\ 
##   decaB0 &  &  &  &  & 1.15 & 1.61 & 0.78 & ChiSq &   \\ 
##   perB0 &  &  &  &  & 1.91 & 2.12 & 1.00 & ChiSq &   \\ 
##   otherB0 &  &  &  &  & 3.44 & 9.39 & 0.00 & ChiSq & * \\ 
##   race &  &  &  &  & 0.00 & 0.00 & 0.00 & ChiSq & * \\ 
##   marital &  &  &  &  & 0.00 & 0.00 & 0.65 & ChiSq &   \\ 
##   COPD &  &  &  &  & 2.67 & 2.12 & 0.75 & ChiSq &   \\ 
##   DM &  &  &  &  & 8.40 & 11.17 & 0.23 & ChiSq &   \\ 
##   HepABC &  &  &  &  & 8.78 & 5.50 & 0.06 & ChiSq &   \\ 
##   Lipid &  &  &  &  & 9.54 & 15.14 & 0.02 & ChiSq & * \\ 
##   HTN &  &  &  &  & 20.99 & 19.80 & 0.72 & ChiSq &   \\ 
##   IHD &  &  &  &  & 1.15 & 0.76 & 0.81 & ChiSq &   \\ 
##   OsteoArth &  &  &  &  & 3.05 & 5.50 & 0.14 & ChiSq &   \\ 
##   Osteopor &  &  &  &  & 0.38 & 0.85 & 0.70 & ChiSq &   \\ 
##   STI &  &  &  &  & 1.91 & 0.42 & 0.03 & ChiSq & * \\ 
##   Depression &  &  &  &  & 30.53 & 26.99 & 0.28 & ChiSq &   \\ 
##   alcDep\_5yrs &  &  &  &  & 19.47 & 13.03 & 0.01 & ChiSq & * \\ 
##   alcAbuse\_5yrs &  &  &  &  & 22.90 & 16.92 & 0.03 & ChiSq & * \\ 
##   drugDep\_5yrs &  &  &  &  & 24.43 & 15.06 & 0.00 & ChiSq & * \\ 
##   drugAbuse\_5yrs &  &  &  &  & 24.81 & 19.46 & 0.06 & ChiSq &   \\ 
##   OCD\_5yrs &  &  &  &  & 2.67 & 5.50 & 0.08 & ChiSq &   \\ 
##   anxDis\_5yrs &  &  &  &  & 12.21 & 13.96 & 0.52 & ChiSq &   \\ 
##   time.inTrialYrs & 0.18 & 0.25 & 1.38 & 0.59 &  &  & 0.00 & KS & * \\ 
##    \hline
## \end{tabular}
## \caption{Differences in Missing Follow-Up Data} 
## \end{table}
# output to markup
kable(testTable, digits = 3)
Var M.Missing Spread.Missing M.Present Spread.Present Perc.Missing Perc.Present P.Value Test.Used signif
P 19.000 7.000 18.000 8.000 0.384 KS
N 20.000 9.000 20.000 9.000 0.793 KS
G 37.000 12.000 37.000 13.000 0.571 KS
age 40.000 17.000 42.000 16.000 0.097 KS
sex 76.72 73.43 0.308 ChiSq
exac3mo 35.11 25.80 0.003 ChiSq *
yrsEduc 12.000 2.000 12.000 1.000 0.035 KS *
CGIsev 4.000 1.000 4.000 2.000 0.738 KS
employFT 8.02 6.43 0.438 ChiSq
yrsFirstTx 14.000 18.250 16.000 18.000 0.230 KS
yrsFrstAntiPsyRx 12.000 17.000 13.000 18.000 0.404 KS
olzB0 21.37 27.83 0.039 ChiSq *
quetB0 11.07 9.56 0.530 ChiSq
rispB0 21.37 22.84 0.665 ChiSq
zipB0 3.05 5.25 0.182 ChiSq
halB0 5.73 5.84 1.000 ChiSq
decaB0 1.15 1.61 0.784 ChiSq
perB0 1.91 2.12 1.000 ChiSq
otherB0 3.44 9.39 0.002 ChiSq *
race 0.00 0.00 0.002 ChiSq *
marital 0.00 0.00 0.650 ChiSq
COPD 2.67 2.12 0.748 ChiSq
DM 8.40 11.17 0.229 ChiSq
HepABC 8.78 5.50 0.062 ChiSq
Lipid 9.54 15.14 0.024 ChiSq *
HTN 20.99 19.80 0.725 ChiSq
IHD 1.15 0.76 0.808 ChiSq
OsteoArth 3.05 5.50 0.139 ChiSq
Osteopor 0.38 0.85 0.697 ChiSq
STI 1.91 0.42 0.027 ChiSq *
Depression 30.53 26.99 0.278 ChiSq
alcDep_5yrs 19.47 13.03 0.009 ChiSq *
alcAbuse_5yrs 22.90 16.92 0.028 ChiSq *
drugDep_5yrs 24.43 15.06 0.000 ChiSq *
drugAbuse_5yrs 24.81 19.46 0.063 ChiSq
OCD_5yrs 2.67 5.50 0.081 ChiSq
anxDis_5yrs 12.21 13.96 0.519 ChiSq
time.inTrialYrs 0.178 0.251 1.381 0.586 0.000 KS *

4 Summary Values and Tables

4.1 Summary Differences in Missing Group

For the multiple univariable analyses of differences in baseline/demographics, we summarise those variables that were different at p-value < 0.05:

kable( testTable[ which( testTable$P.Value < 0.05), ], digits = 3 )
Var M.Missing Spread.Missing M.Present Spread.Present Perc.Missing Perc.Present P.Value Test.Used signif
6 exac3mo 35.11 25.80 0.003 ChiSq *
7 yrsEduc 12.000 2.000 12.000 1.000 0.035 KS *
12 olzB0 21.37 27.83 0.039 ChiSq *
19 otherB0 3.44 9.39 0.002 ChiSq *
20 race 0.00 0.00 0.002 ChiSq *
25 Lipid 9.54 15.14 0.024 ChiSq *
30 STI 1.91 0.42 0.027 ChiSq *
32 alcDep_5yrs 19.47 13.03 0.009 ChiSq *
33 alcAbuse_5yrs 22.90 16.92 0.028 ChiSq *
34 drugDep_5yrs 24.43 15.06 0.000 ChiSq *
38 time.inTrialYrs 0.178 0.251 1.381 0.586 0.000 KS *

race requires more breaking-down as multiple levels (so the proportions are not described in the above table):

prop.race

As does marital because of multiple levels. Again, here’s the breakdown of numbers:

prop.marital

5 Clean-Up

Save testTable alongside tab.Surv and missingTab for later use. This is packaged as .RData because the it simplifies loading for the dependent notebooks.

save(missingTab, tab.Surv, testTable, prop.marital, prop.race, allIDs, file = "survival_tabulated_data_v3.RData")