There are three sets of derived data that results and analyses depend on:
tab.Surv is the preprocessed data ready for survival analysesmissingTab is the same data but formatted for an analysis of missing data and comparisons of e.g. differences in variables between groupsallIDs is simply a vector of participant IDsWe have two preprocessing scripts that work directly on the CATIE source data:
TRS_CATIE_v3.R, which does some initial data-munging work to get the raw CATIE source data into a format that can be processed, including things like assembling tables of medications and doses (which is not a trivial task with the CATIE data)
KM_surivival_v3.R – takes the output of TRS_CATIE_v3.R and processes each participant’s trajectory/history through the trial (symptoms, social-occupational functioning, trials of treatment) and outputs the tables above.
I’m happy to deliver these scripts on request (they are long, idiosyncratic, not very readable and will require some time to make them more friendly). They are not included here because they cannot be run independently of the source CATIE data set, and I do not have rights to package/publish these datasets – they can be obtained by applying to to the NIMH for access : https://www.nimh.nih.gov/funding/clinical-research/datasets/index.shtml
This notebook uses only the derived data from TRS_CATIE_v3.R and KM_surivival_v3.R so the logic of the analyses can be followed.
rm( list = ls() )
require(knitr)## Loading required package: knitr
missingTab <- read.csv("missingData_tabulated_v3.csv")
tab.Surv <- read.csv("survivalData_tabulated_v3.csv")
allIDs <- read.csv("all_participant_IDs.csv")
options(knitr.kable.NA = '')Some participants will be missing data which means that for establishing TRS cases, there may be bias introduced because – e.g. very unwell patients might have been lost to follow up more so than those who were relatively stable or well at recruitment.
To establish patterns in loss-to-followup, we first interrogate the data with multiple univariate analyses (t-tests / Komogorov-Smifnoff (KS) tests for continuous or Chi-square tests for proportions / ordinal data).
We use missingTab to create a vector of data types (for each variable/column), which will enable some automation to build the table summarising missing data.
# create vector of data types (numeric, ordinal/factor)
dataType <- as.numeric( unlist( lapply(missingTab, function(x) if (!is.numeric(x)) NA else max(x, na.rm = TRUE))) )
# shrink dataType so we don't test missingFU.rx, sx, sof : these three are dependent variables in the missing data
# analysis, so don't need including
dataType <- dataType[ 1:(length( dataType ) -4 ) ]
# parse dataType and decide on a relevant univariate test - e.g. Chi square for categorical proportions and t and KS for continuous
testFlag <- rep(NA, length( dataType ) )
for( i in 1:length( dataType ) ) {
if( dataType[i] == 1 | is.na( dataType[i] ) ) {
# categorical (factor) or binary : proportion test - Chi-square : code = 1
testFlag[i] <- 1
} else {
if( dataType[i] > 1 ) {
# continuous, therefore KS and t-test
testFlag[i] <- 2
}
}
}Create a data frame that storing results:
testTable <- data.frame( Var = rep("", length( dataType ) ),
M.Missing = rep(NA, length( dataType ) ),
Spread.Missing = rep(NA, length( dataType ) ),
M.Present = rep(NA, length( dataType ) ),
Spread.Present = rep(NA, length( dataType ) ),
Perc.Missing = rep(NA, length( dataType ) ),
Perc.Present = rep(NA, length( dataType ) ),
P.Value = rep(NA, length( dataType ) ),
Test.Used = rep("", length( dataType ) ),
stringsAsFactors = FALSE
)Note : we use suppressWarnings() because the Chi-square and KS test report warnings (due to e.g. sample sizes or ties) – normally, these would require attention but here we don’t need the warnings or output in the markup.
# for all variables in testFlag ...
for( i in 1:length( testFlag ) ) {
thisVar <- names( missingTab )[i]
testTable$Var[i] <- thisVar
# Use missingFU.any (i.e. SOF, Sx or Rx)
# Select out data for missing overall group and non-missing group
grpMissing <- missingTab[ which( missingTab$missingFU.any == 1 ), i ]
grpNotMissing <- missingTab[ which( missingTab$missingFU.any == 0 ), i ]
# if this variable's testFlag == 1, it's ordinal / categorical data
if( testFlag[i] == 1 ) {
test.tab <- rbind( table( grpNotMissing ), table( grpMissing ) )
rownames(test.tab) <- c("Present","Missing")
suppressWarnings(
xs <- chisq.test( test.tab )
)
testTable$Perc.Present[i] <- round( length( which( grpNotMissing == 1 ) ) / length( grpNotMissing ) * 100, 2 ) ## proportion of 1's to 0s in subjects with present data
testTable$Perc.Missing[i] <- round( length( which( grpMissing == 1 ) ) / length( grpMissing ) * 100, 2 )
## proportion of 1's to 0s in subjects with Missing data
testTable$P.Value[i] <- round( xs$p.value, 3 )
testTable$Test.Used[i] <- "ChiSq"
} else {
# This variable is continuous
# We need either KS and t-test
# test normality with Shapiro-Wilks
this.SW.test.missing <- shapiro.test( grpMissing )
this.SW.test.notmissing <- shapiro.test( grpNotMissing )
if( this.SW.test.missing$p.value < 0.05 | this.SW.test.notmissing$p.value < 0.05 ) {
# if p.value of either SW test < 0.05, then use KS (i.e. at least one distribution is non-normal)
suppressWarnings(
this.test <- ks.test( grpMissing, grpNotMissing, alternative = "two.sided" )
)
} else {
# both missing and non-missing data are normal enough
suppressWarnings(
this.test <- t.test( grpMissing, grpNotMissing, alternative = "two.sided" )
)
}
if( this.test$method == "Two-sample Kolmogorov-Smirnov test" ) {
# record median and IQR
testTable$M.Missing[i] <- median( grpMissing, na.rm = TRUE )
testTable$Spread.Missing[i] <- IQR( grpMissing, na.rm = TRUE )
testTable$M.Present[i] <- median( grpNotMissing, na.rm = TRUE )
testTable$Spread.Present[i] <- IQR( grpNotMissing, na.rm = TRUE )
testTable$P.Value[i] <- round( this.test$p.value, 3 )
testTable$Test.Used[i] <- "KS"
}
if( this.test$method == "Welch Two Sample t-test" ) {
# record median and IQR
testTable$M.Missing[i] <- mean( grpMissing, na.rm = TRUE )
testTable$Spread.Missing[i] <- sd( grpMissing, na.rm = TRUE )
testTable$M.Present[i] <- mean( grpNotMissing, na.rm = TRUE )
testTable$Spread.Present[i] <- sd( grpNotMissing, na.rm = TRUE )
testTable$P.Value[i] <- round( this.test$p.value, 3 )
testTable$Test.Used[i] <- "T"
}
}
}
# add a significance flag (purely cosmetic)
testTable$signif <- ifelse( testTable$P.Value < 0.05, "*", " " )Above, the proportions for missing categorical/ordinal data where only counted for binary variables, not those for e.g. factors > 2 levels.
Two variables – race and marital status – have many levels, so we create two ancillary tables for these variables alone (the “format” will be different and not compatible with the larger table)
grpMissing <- missingTab[ which( missingTab$missingFU.any == 1 ), "marital" ]
grpNotMissing <- missingTab[ which( missingTab$missingFU.any == 0 ), "marital" ]
prop.marital <- rbind( table( grpMissing ),
table( grpNotMissing ) )
rownames( prop.marital ) <- c( "Missing", "Not Missing")
prop.marital <- data.frame( t(prop.marital) )
prop.marital$N <- rowSums( prop.marital )
grpMissing <- missingTab[ which( missingTab$missingFU.any == 1 ), "race" ]
grpNotMissing <- missingTab[ which( missingTab$missingFU.any == 0 ), "race" ]
prop.race <- rbind( table( grpMissing ),
table( grpNotMissing ) )
rownames( prop.race ) <- c( "Missing", "Not Missing")
prop.race <- data.frame( t(prop.race ) )
prop.race$N <- rowSums( prop.race )
(prop.race)(prop.marital)testTable now contains multiple univariate tests for differences in missing data at follow-up for each of the demographics / baseline characteristics. These are highlighted only for use in adjusting the subsequent regressions/survival analysis – not because they are hypotheses that are worthy of formal testing. Hence, there’s no claim any difference survives post-hoc multiple correction – it’s just we need to know if there are differences, and to account for these variables later (i.e. it is better to tolerate Type I error and include a variable in adjusted analyses than not).
require(xtable)## Loading required package: xtable
testTable.latex <- print( xtable(testTable,
caption = "Differences in Missing Follow-Up Data"),
include.rownames=FALSE)## % latex table generated in R 3.4.3 by xtable 1.8-2 package
## % Sat Mar 31 18:43:14 2018
## \begin{table}[ht]
## \centering
## \begin{tabular}{lrrrrrrrll}
## \hline
## Var & M.Missing & Spread.Missing & M.Present & Spread.Present & Perc.Missing & Perc.Present & P.Value & Test.Used & signif \\
## \hline
## P & 19.00 & 7.00 & 18.00 & 8.00 & & & 0.38 & KS & \\
## N & 20.00 & 9.00 & 20.00 & 9.00 & & & 0.79 & KS & \\
## G & 37.00 & 12.00 & 37.00 & 13.00 & & & 0.57 & KS & \\
## age & 40.00 & 17.00 & 42.00 & 16.00 & & & 0.10 & KS & \\
## sex & & & & & 76.72 & 73.43 & 0.31 & ChiSq & \\
## exac3mo & & & & & 35.11 & 25.80 & 0.00 & ChiSq & * \\
## yrsEduc & 12.00 & 2.00 & 12.00 & 1.00 & & & 0.04 & KS & * \\
## CGIsev & 4.00 & 1.00 & 4.00 & 2.00 & & & 0.74 & KS & \\
## employFT & & & & & 8.02 & 6.43 & 0.44 & ChiSq & \\
## yrsFirstTx & 14.00 & 18.25 & 16.00 & 18.00 & & & 0.23 & KS & \\
## yrsFrstAntiPsyRx & 12.00 & 17.00 & 13.00 & 18.00 & & & 0.40 & KS & \\
## olzB0 & & & & & 21.37 & 27.83 & 0.04 & ChiSq & * \\
## quetB0 & & & & & 11.07 & 9.56 & 0.53 & ChiSq & \\
## rispB0 & & & & & 21.37 & 22.84 & 0.67 & ChiSq & \\
## zipB0 & & & & & 3.05 & 5.25 & 0.18 & ChiSq & \\
## halB0 & & & & & 5.73 & 5.84 & 1.00 & ChiSq & \\
## decaB0 & & & & & 1.15 & 1.61 & 0.78 & ChiSq & \\
## perB0 & & & & & 1.91 & 2.12 & 1.00 & ChiSq & \\
## otherB0 & & & & & 3.44 & 9.39 & 0.00 & ChiSq & * \\
## race & & & & & 0.00 & 0.00 & 0.00 & ChiSq & * \\
## marital & & & & & 0.00 & 0.00 & 0.65 & ChiSq & \\
## COPD & & & & & 2.67 & 2.12 & 0.75 & ChiSq & \\
## DM & & & & & 8.40 & 11.17 & 0.23 & ChiSq & \\
## HepABC & & & & & 8.78 & 5.50 & 0.06 & ChiSq & \\
## Lipid & & & & & 9.54 & 15.14 & 0.02 & ChiSq & * \\
## HTN & & & & & 20.99 & 19.80 & 0.72 & ChiSq & \\
## IHD & & & & & 1.15 & 0.76 & 0.81 & ChiSq & \\
## OsteoArth & & & & & 3.05 & 5.50 & 0.14 & ChiSq & \\
## Osteopor & & & & & 0.38 & 0.85 & 0.70 & ChiSq & \\
## STI & & & & & 1.91 & 0.42 & 0.03 & ChiSq & * \\
## Depression & & & & & 30.53 & 26.99 & 0.28 & ChiSq & \\
## alcDep\_5yrs & & & & & 19.47 & 13.03 & 0.01 & ChiSq & * \\
## alcAbuse\_5yrs & & & & & 22.90 & 16.92 & 0.03 & ChiSq & * \\
## drugDep\_5yrs & & & & & 24.43 & 15.06 & 0.00 & ChiSq & * \\
## drugAbuse\_5yrs & & & & & 24.81 & 19.46 & 0.06 & ChiSq & \\
## OCD\_5yrs & & & & & 2.67 & 5.50 & 0.08 & ChiSq & \\
## anxDis\_5yrs & & & & & 12.21 & 13.96 & 0.52 & ChiSq & \\
## time.inTrialYrs & 0.18 & 0.25 & 1.38 & 0.59 & & & 0.00 & KS & * \\
## \hline
## \end{tabular}
## \caption{Differences in Missing Follow-Up Data}
## \end{table}
# output to markup
kable(testTable, digits = 3)| Var | M.Missing | Spread.Missing | M.Present | Spread.Present | Perc.Missing | Perc.Present | P.Value | Test.Used | signif |
|---|---|---|---|---|---|---|---|---|---|
| P | 19.000 | 7.000 | 18.000 | 8.000 | 0.384 | KS | |||
| N | 20.000 | 9.000 | 20.000 | 9.000 | 0.793 | KS | |||
| G | 37.000 | 12.000 | 37.000 | 13.000 | 0.571 | KS | |||
| age | 40.000 | 17.000 | 42.000 | 16.000 | 0.097 | KS | |||
| sex | 76.72 | 73.43 | 0.308 | ChiSq | |||||
| exac3mo | 35.11 | 25.80 | 0.003 | ChiSq | * | ||||
| yrsEduc | 12.000 | 2.000 | 12.000 | 1.000 | 0.035 | KS | * | ||
| CGIsev | 4.000 | 1.000 | 4.000 | 2.000 | 0.738 | KS | |||
| employFT | 8.02 | 6.43 | 0.438 | ChiSq | |||||
| yrsFirstTx | 14.000 | 18.250 | 16.000 | 18.000 | 0.230 | KS | |||
| yrsFrstAntiPsyRx | 12.000 | 17.000 | 13.000 | 18.000 | 0.404 | KS | |||
| olzB0 | 21.37 | 27.83 | 0.039 | ChiSq | * | ||||
| quetB0 | 11.07 | 9.56 | 0.530 | ChiSq | |||||
| rispB0 | 21.37 | 22.84 | 0.665 | ChiSq | |||||
| zipB0 | 3.05 | 5.25 | 0.182 | ChiSq | |||||
| halB0 | 5.73 | 5.84 | 1.000 | ChiSq | |||||
| decaB0 | 1.15 | 1.61 | 0.784 | ChiSq | |||||
| perB0 | 1.91 | 2.12 | 1.000 | ChiSq | |||||
| otherB0 | 3.44 | 9.39 | 0.002 | ChiSq | * | ||||
| race | 0.00 | 0.00 | 0.002 | ChiSq | * | ||||
| marital | 0.00 | 0.00 | 0.650 | ChiSq | |||||
| COPD | 2.67 | 2.12 | 0.748 | ChiSq | |||||
| DM | 8.40 | 11.17 | 0.229 | ChiSq | |||||
| HepABC | 8.78 | 5.50 | 0.062 | ChiSq | |||||
| Lipid | 9.54 | 15.14 | 0.024 | ChiSq | * | ||||
| HTN | 20.99 | 19.80 | 0.725 | ChiSq | |||||
| IHD | 1.15 | 0.76 | 0.808 | ChiSq | |||||
| OsteoArth | 3.05 | 5.50 | 0.139 | ChiSq | |||||
| Osteopor | 0.38 | 0.85 | 0.697 | ChiSq | |||||
| STI | 1.91 | 0.42 | 0.027 | ChiSq | * | ||||
| Depression | 30.53 | 26.99 | 0.278 | ChiSq | |||||
| alcDep_5yrs | 19.47 | 13.03 | 0.009 | ChiSq | * | ||||
| alcAbuse_5yrs | 22.90 | 16.92 | 0.028 | ChiSq | * | ||||
| drugDep_5yrs | 24.43 | 15.06 | 0.000 | ChiSq | * | ||||
| drugAbuse_5yrs | 24.81 | 19.46 | 0.063 | ChiSq | |||||
| OCD_5yrs | 2.67 | 5.50 | 0.081 | ChiSq | |||||
| anxDis_5yrs | 12.21 | 13.96 | 0.519 | ChiSq | |||||
| time.inTrialYrs | 0.178 | 0.251 | 1.381 | 0.586 | 0.000 | KS | * |
For the multiple univariable analyses of differences in baseline/demographics, we summarise those variables that were different at p-value < 0.05:
kable( testTable[ which( testTable$P.Value < 0.05), ], digits = 3 )| Var | M.Missing | Spread.Missing | M.Present | Spread.Present | Perc.Missing | Perc.Present | P.Value | Test.Used | signif | |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | exac3mo | 35.11 | 25.80 | 0.003 | ChiSq | * | ||||
| 7 | yrsEduc | 12.000 | 2.000 | 12.000 | 1.000 | 0.035 | KS | * | ||
| 12 | olzB0 | 21.37 | 27.83 | 0.039 | ChiSq | * | ||||
| 19 | otherB0 | 3.44 | 9.39 | 0.002 | ChiSq | * | ||||
| 20 | race | 0.00 | 0.00 | 0.002 | ChiSq | * | ||||
| 25 | Lipid | 9.54 | 15.14 | 0.024 | ChiSq | * | ||||
| 30 | STI | 1.91 | 0.42 | 0.027 | ChiSq | * | ||||
| 32 | alcDep_5yrs | 19.47 | 13.03 | 0.009 | ChiSq | * | ||||
| 33 | alcAbuse_5yrs | 22.90 | 16.92 | 0.028 | ChiSq | * | ||||
| 34 | drugDep_5yrs | 24.43 | 15.06 | 0.000 | ChiSq | * | ||||
| 38 | time.inTrialYrs | 0.178 | 0.251 | 1.381 | 0.586 | 0.000 | KS | * |
race requires more breaking-down as multiple levels (so the proportions are not described in the above table):
prop.raceAs does marital because of multiple levels. Again, here’s the breakdown of numbers:
prop.maritalSave testTable alongside tab.Surv and missingTab for later use. This is packaged as .RData because the it simplifies loading for the dependent notebooks.
save(missingTab, tab.Surv, testTable, prop.marital, prop.race, allIDs, file = "survival_tabulated_data_v3.RData")