1 Setup

There are three sets of derived data that results and analyses depend on:

tab.Surv is the preprocessed data ready for survival analyses
missingTab is the same data but formatted for an analysis of missing data and comparisons of e.g. differences in variables between groups
allIDs is simply a vector of participant IDs

We have two preprocessing scripts that work directly on the CATIE source data:

TRS_CATIE_v3.R, which does some initial data-munging work to get the raw CATIE source data into a format that can be processed, including things like assembling tables of medications and doses (which is not a trivial task with the CATIE data)
KM_surivival_v3.R – takes the output of TRS_CATIE_v3.R and processes each participant’s trajectory/history through the trial (symptoms, social-occupational functioning, trials of treatment) and outputs the tables above.

I’m happy to deliver these scripts on request (they are long, idiosyncratic, not very readable and will require some time to make them more friendly). They are not included here because they cannot be run independently of the source CATIE data set, and I do not have rights to package/publish these datasets – they can be obtained by applying to to the NIMH for access : https://www.nimh.nih.gov/funding/clinical-research/datasets/index.shtml

This notebook uses only the derived data from TRS_CATIE_v3.R and KM_surivival_v3.R so the logic of the analyses can be followed.

rm( list = ls() )
require(knitr)

## Loading required package: knitr

missingTab <- read.csv("missingData_tabulated_v3.csv")
tab.Surv   <- read.csv("survivalData_tabulated_v3.csv")
allIDs     <- read.csv("all_participant_IDs.csv")

options(knitr.kable.NA = '')

2 Demographics Tables for Loss-To-Follow-Up

Some participants will be missing data which means that for establishing TRS cases, there may be bias introduced because – e.g. very unwell patients might have been lost to follow up more so than those who were relatively stable or well at recruitment.

To establish patterns in loss-to-followup, we first interrogate the data with multiple univariate analyses (t-tests / Komogorov-Smifnoff (KS) tests for continuous or Chi-square tests for proportions / ordinal data).

2.1 Establish Data Types

We use missingTab to create a vector of data types (for each variable/column), which will enable some automation to build the table summarising missing data.

# create vector of data types (numeric, ordinal/factor)
dataType <- as.numeric( unlist( lapply(missingTab, function(x) if (!is.numeric(x)) NA else max(x, na.rm = TRUE))) )

# shrink dataType so we don't test missingFU.rx, sx, sof : these three are dependent variables in the missing data 
# analysis, so don't need including
dataType <- dataType[ 1:(length( dataType ) -4 ) ]

# parse dataType and decide on a relevant univariate test - e.g. Chi square for categorical proportions and t and KS for continuous
testFlag <- rep(NA, length( dataType ) )

for( i in 1:length( dataType ) ) {
  if( dataType[i] == 1 | is.na( dataType[i] ) ) {
    # categorical (factor) or binary : proportion test - Chi-square : code = 1
    testFlag[i] <- 1
  } else {
    if( dataType[i] > 1 ) {
      # continuous, therefore KS and t-test
      testFlag[i] <- 2
    }
  }
}

2.2 Data Storage

Create a data frame that storing results:

testTable <- data.frame( Var = rep("", length( dataType ) ),
                         M.Missing          = rep(NA, length( dataType ) ),
                         Spread.Missing     = rep(NA, length( dataType ) ),
                         M.Present          = rep(NA, length( dataType ) ),
                         Spread.Present     = rep(NA, length( dataType ) ),
                         Perc.Missing       = rep(NA, length( dataType ) ),
                         Perc.Present       = rep(NA, length( dataType ) ),
                         P.Value            = rep(NA, length( dataType ) ),
                         Test.Used          = rep("", length( dataType ) ),
                         stringsAsFactors = FALSE
)

2.3 Populate Table

Note : we use suppressWarnings() because the Chi-square and KS test report warnings (due to e.g. sample sizes or ties) – normally, these would require attention but here we don’t need the warnings or output in the markup.

# for all variables in testFlag ...
for( i in 1:length( testFlag ) ) {
  thisVar <- names( missingTab )[i]
  testTable$Var[i] <- thisVar
  
  # Use missingFU.any (i.e. SOF, Sx or Rx)
  # Select out data for missing overall group and non-missing group
  grpMissing    <- missingTab[ which( missingTab$missingFU.any == 1 ), i ]
  grpNotMissing <- missingTab[ which( missingTab$missingFU.any == 0 ), i ]
  
  # if this variable's testFlag == 1, it's ordinal / categorical data
  if( testFlag[i] == 1 ) {
    test.tab <- rbind( table( grpNotMissing ), table( grpMissing ) )
    rownames(test.tab) <- c("Present","Missing")
    
    suppressWarnings(
      xs <- chisq.test( test.tab )
    )
    testTable$Perc.Present[i]  <- round( length( which( grpNotMissing == 1 ) ) / length( grpNotMissing ) * 100, 2 )     ## proportion of 1's to 0s in subjects with present data 
    testTable$Perc.Missing[i]  <- round( length( which( grpMissing == 1 ) ) / length( grpMissing ) * 100, 2 )   
    ## proportion of 1's to 0s in subjects with Missing data 
    testTable$P.Value[i]       <- round( xs$p.value, 3 )
    testTable$Test.Used[i]     <- "ChiSq"
    
  } else {
    # This variable is continuous
    # We need either KS and t-test 
    
    # test normality with Shapiro-Wilks
    this.SW.test.missing    <- shapiro.test( grpMissing )
    this.SW.test.notmissing <- shapiro.test( grpNotMissing )
    
    if( this.SW.test.missing$p.value < 0.05 | this.SW.test.notmissing$p.value < 0.05 ) {
      # if p.value of either SW test < 0.05, then use KS (i.e. at least one distribution is non-normal)        
      suppressWarnings(
        this.test <- ks.test( grpMissing, grpNotMissing, alternative = "two.sided" )      
      )
    } else {
      # both missing and non-missing data are normal enough
      suppressWarnings(
        this.test  <- t.test( grpMissing, grpNotMissing, alternative = "two.sided" )
      )
    }
    
    if( this.test$method == "Two-sample Kolmogorov-Smirnov test" ) {
      # record median and IQR
      testTable$M.Missing[i]      <- median( grpMissing, na.rm = TRUE )
      testTable$Spread.Missing[i] <- IQR( grpMissing, na.rm = TRUE )
      testTable$M.Present[i]      <- median( grpNotMissing, na.rm = TRUE )
      testTable$Spread.Present[i] <- IQR( grpNotMissing, na.rm = TRUE )
      testTable$P.Value[i]        <- round( this.test$p.value, 3 )
      testTable$Test.Used[i]      <- "KS"
    }
    
    if( this.test$method == "Welch Two Sample t-test" ) {
      # record median and IQR
      testTable$M.Missing[i]      <- mean( grpMissing, na.rm = TRUE )
      testTable$Spread.Missing[i] <- sd( grpMissing, na.rm = TRUE )
      testTable$M.Present[i]      <- mean( grpNotMissing, na.rm = TRUE )
      testTable$Spread.Present[i] <- sd( grpNotMissing, na.rm = TRUE )
      testTable$P.Value[i]        <- round( this.test$p.value, 3 )
      testTable$Test.Used[i]      <- "T"
      
    }
    
  }
}

# add a significance flag (purely cosmetic)
testTable$signif <- ifelse( testTable$P.Value < 0.05, "*", " " )

2.4 Ancillary Tables

Above, the proportions for missing categorical/ordinal data where only counted for binary variables, not those for e.g. factors > 2 levels.

Two variables – race and marital status – have many levels, so we create two ancillary tables for these variables alone (the “format” will be different and not compatible with the larger table)

grpMissing    <- missingTab[ which( missingTab$missingFU.any == 1 ), "marital" ]
grpNotMissing <- missingTab[ which( missingTab$missingFU.any == 0 ), "marital" ]

prop.marital <- rbind( table( grpMissing ),
                       table( grpNotMissing ) )

rownames( prop.marital ) <- c( "Missing", "Not Missing")
prop.marital <- data.frame( t(prop.marital) )
prop.marital$N <- rowSums( prop.marital )

grpMissing    <- missingTab[ which( missingTab$missingFU.any == 1 ), "race" ]
grpNotMissing <- missingTab[ which( missingTab$missingFU.any == 0 ), "race" ]
prop.race     <- rbind( table( grpMissing ),
                        table( grpNotMissing ) )

rownames( prop.race ) <- c( "Missing", "Not Missing")
prop.race <- data.frame( t(prop.race ) )
prop.race$N <- rowSums( prop.race )

(prop.race)

(prop.marital)

2.5 Summary

testTable now contains multiple univariate tests for differences in missing data at follow-up for each of the demographics / baseline characteristics. These are highlighted only for use in adjusting the subsequent regressions/survival analysis – not because they are hypotheses that are worthy of formal testing. Hence, there’s no claim any difference survives post-hoc multiple correction – it’s just we need to know if there are differences, and to account for these variables later (i.e. it is better to tolerate Type I error and include a variable in adjusted analyses than not).

3 Tabular Output

require(xtable)

## Loading required package: xtable

testTable.latex <- print( xtable(testTable, 
                                 caption = "Differences in Missing Follow-Up Data"),
                          include.rownames=FALSE)

## % latex table generated in R 3.4.3 by xtable 1.8-2 package
## % Sat Mar 31 18:43:14 2018
## \begin{table}[ht]
## \centering
## \begin{tabular}{lrrrrrrrll}
##   \hline
## Var & M.Missing & Spread.Missing & M.Present & Spread.Present & Perc.Missing & Perc.Present & P.Value & Test.Used & signif \\ 
##   \hline
## P & 19.00 & 7.00 & 18.00 & 8.00 &  &  & 0.38 & KS &   \\ 
##   N & 20.00 & 9.00 & 20.00 & 9.00 &  &  & 0.79 & KS &   \\ 
##   G & 37.00 & 12.00 & 37.00 & 13.00 &  &  & 0.57 & KS &   \\ 
##   age & 40.00 & 17.00 & 42.00 & 16.00 &  &  & 0.10 & KS &   \\ 
##   sex &  &  &  &  & 76.72 & 73.43 & 0.31 & ChiSq &   \\ 
##   exac3mo &  &  &  &  & 35.11 & 25.80 & 0.00 & ChiSq & * \\ 
##   yrsEduc & 12.00 & 2.00 & 12.00 & 1.00 &  &  & 0.04 & KS & * \\ 
##   CGIsev & 4.00 & 1.00 & 4.00 & 2.00 &  &  & 0.74 & KS &   \\ 
##   employFT &  &  &  &  & 8.02 & 6.43 & 0.44 & ChiSq &   \\ 
##   yrsFirstTx & 14.00 & 18.25 & 16.00 & 18.00 &  &  & 0.23 & KS &   \\ 
##   yrsFrstAntiPsyRx & 12.00 & 17.00 & 13.00 & 18.00 &  &  & 0.40 & KS &   \\ 
##   olzB0 &  &  &  &  & 21.37 & 27.83 & 0.04 & ChiSq & * \\ 
##   quetB0 &  &  &  &  & 11.07 & 9.56 & 0.53 & ChiSq &   \\ 
##   rispB0 &  &  &  &  & 21.37 & 22.84 & 0.67 & ChiSq &   \\ 
##   zipB0 &  &  &  &  & 3.05 & 5.25 & 0.18 & ChiSq &   \\ 
##   halB0 &  &  &  &  & 5.73 & 5.84 & 1.00 & ChiSq &   \\ 
##   decaB0 &  &  &  &  & 1.15 & 1.61 & 0.78 & ChiSq &   \\ 
##   perB0 &  &  &  &  & 1.91 & 2.12 & 1.00 & ChiSq &   \\ 
##   otherB0 &  &  &  &  & 3.44 & 9.39 & 0.00 & ChiSq & * \\ 
##   race &  &  &  &  & 0.00 & 0.00 & 0.00 & ChiSq & * \\ 
##   marital &  &  &  &  & 0.00 & 0.00 & 0.65 & ChiSq &   \\ 
##   COPD &  &  &  &  & 2.67 & 2.12 & 0.75 & ChiSq &   \\ 
##   DM &  &  &  &  & 8.40 & 11.17 & 0.23 & ChiSq &   \\ 
##   HepABC &  &  &  &  & 8.78 & 5.50 & 0.06 & ChiSq &   \\ 
##   Lipid &  &  &  &  & 9.54 & 15.14 & 0.02 & ChiSq & * \\ 
##   HTN &  &  &  &  & 20.99 & 19.80 & 0.72 & ChiSq &   \\ 
##   IHD &  &  &  &  & 1.15 & 0.76 & 0.81 & ChiSq &   \\ 
##   OsteoArth &  &  &  &  & 3.05 & 5.50 & 0.14 & ChiSq &   \\ 
##   Osteopor &  &  &  &  & 0.38 & 0.85 & 0.70 & ChiSq &   \\ 
##   STI &  &  &  &  & 1.91 & 0.42 & 0.03 & ChiSq & * \\ 
##   Depression &  &  &  &  & 30.53 & 26.99 & 0.28 & ChiSq &   \\ 
##   alcDep\_5yrs &  &  &  &  & 19.47 & 13.03 & 0.01 & ChiSq & * \\ 
##   alcAbuse\_5yrs &  &  &  &  & 22.90 & 16.92 & 0.03 & ChiSq & * \\ 
##   drugDep\_5yrs &  &  &  &  & 24.43 & 15.06 & 0.00 & ChiSq & * \\ 
##   drugAbuse\_5yrs &  &  &  &  & 24.81 & 19.46 & 0.06 & ChiSq &   \\ 
##   OCD\_5yrs &  &  &  &  & 2.67 & 5.50 & 0.08 & ChiSq &   \\ 
##   anxDis\_5yrs &  &  &  &  & 12.21 & 13.96 & 0.52 & ChiSq &   \\ 
##   time.inTrialYrs & 0.18 & 0.25 & 1.38 & 0.59 &  &  & 0.00 & KS & * \\ 
##    \hline
## \end{tabular}
## \caption{Differences in Missing Follow-Up Data} 
## \end{table}

# output to markup
kable(testTable, digits = 3)

Var	M.Missing	Spread.Missing	M.Present	Spread.Present	Perc.Missing	Perc.Present	P.Value	Test.Used	signif
P	19.000	7.000	18.000	8.000			0.384	KS
N	20.000	9.000	20.000	9.000			0.793	KS
G	37.000	12.000	37.000	13.000			0.571	KS
age	40.000	17.000	42.000	16.000			0.097	KS
sex					76.72	73.43	0.308	ChiSq
exac3mo					35.11	25.80	0.003	ChiSq	*
yrsEduc	12.000	2.000	12.000	1.000			0.035	KS	*
CGIsev	4.000	1.000	4.000	2.000			0.738	KS
employFT					8.02	6.43	0.438	ChiSq
yrsFirstTx	14.000	18.250	16.000	18.000			0.230	KS
yrsFrstAntiPsyRx	12.000	17.000	13.000	18.000			0.404	KS
olzB0					21.37	27.83	0.039	ChiSq	*
quetB0					11.07	9.56	0.530	ChiSq
rispB0					21.37	22.84	0.665	ChiSq
zipB0					3.05	5.25	0.182	ChiSq
halB0					5.73	5.84	1.000	ChiSq
decaB0					1.15	1.61	0.784	ChiSq
perB0					1.91	2.12	1.000	ChiSq
otherB0					3.44	9.39	0.002	ChiSq	*
race					0.00	0.00	0.002	ChiSq	*
marital					0.00	0.00	0.650	ChiSq
COPD					2.67	2.12	0.748	ChiSq
DM					8.40	11.17	0.229	ChiSq
HepABC					8.78	5.50	0.062	ChiSq
Lipid					9.54	15.14	0.024	ChiSq	*
HTN					20.99	19.80	0.725	ChiSq
IHD					1.15	0.76	0.808	ChiSq
OsteoArth					3.05	5.50	0.139	ChiSq
Osteopor					0.38	0.85	0.697	ChiSq
STI					1.91	0.42	0.027	ChiSq	*
Depression					30.53	26.99	0.278	ChiSq
alcDep_5yrs					19.47	13.03	0.009	ChiSq	*
alcAbuse_5yrs					22.90	16.92	0.028	ChiSq	*
drugDep_5yrs					24.43	15.06	0.000	ChiSq	*
drugAbuse_5yrs					24.81	19.46	0.063	ChiSq
OCD_5yrs					2.67	5.50	0.081	ChiSq
anxDis_5yrs					12.21	13.96	0.519	ChiSq
time.inTrialYrs	0.178	0.251	1.381	0.586			0.000	KS	*

4 Summary Values and Tables

4.1 Summary Differences in Missing Group

For the multiple univariable analyses of differences in baseline/demographics, we summarise those variables that were different at p-value < 0.05:

kable( testTable[ which( testTable$P.Value < 0.05), ], digits = 3 )

	Var	M.Missing	Spread.Missing	M.Present	Spread.Present	Perc.Missing	Perc.Present	P.Value	Test.Used	signif
6	exac3mo					35.11	25.80	0.003	ChiSq	*
7	yrsEduc	12.000	2.000	12.000	1.000			0.035	KS	*
12	olzB0					21.37	27.83	0.039	ChiSq	*
19	otherB0					3.44	9.39	0.002	ChiSq	*
20	race					0.00	0.00	0.002	ChiSq	*
25	Lipid					9.54	15.14	0.024	ChiSq	*
30	STI					1.91	0.42	0.027	ChiSq	*
32	alcDep_5yrs					19.47	13.03	0.009	ChiSq	*
33	alcAbuse_5yrs					22.90	16.92	0.028	ChiSq	*
34	drugDep_5yrs					24.43	15.06	0.000	ChiSq	*
38	time.inTrialYrs	0.178	0.251	1.381	0.586			0.000	KS	*

race requires more breaking-down as multiple levels (so the proportions are not described in the above table):

prop.race

As does marital because of multiple levels. Again, here’s the breakdown of numbers:

prop.marital

5 Clean-Up

Save testTable alongside tab.Surv and missingTab for later use. This is packaged as .RData because the it simplifies loading for the dependent notebooks.

save(missingTab, tab.Surv, testTable, prop.marital, prop.race, allIDs, file = "survival_tabulated_data_v3.RData")

CATIE TRS : Missing Data Analysis - v3.0

Dan W Joyce

23rd March 2018