library(vtree)
library(Hmisc)

## Warning: package 'survival' was built under R version 3.5.3

packageVersion("vtree")

## [1] '1.1.1'

Data obtained from http://biostat.mc.vanderbilt.edu/DataSets

getHdata(abm)

Acute Bacterial Meningitis Dataset: Description from Website

Note that the S datasets are called ABM internally so as to not cause confusion with the variable abm.

These are data on 581 patients having either acute viral (abm=0) or acute bacterial (abm=1) meningitis, from a study done at Duke University Medical Center that was published in Spanos A, Harrell FE, Durack DT (1989): Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA 262: 2700-2707. Note that this is the complete dataset, not the subset of observations having complete data on key variables that was used to fit the multivariable model in the article. Expressions for computing key derived variables are stored as an attribute named derived on the data frame, and a vector of the names of the variables used in the final model in the article is stored in an attribute named main.analysis.variables. To create the derived variables, run something like:

attach(ABM)
eval(attr(ABM,'derived'))

To list only the formulas for the derived variables, type:

attr(ABM, 'derived')
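The mechanism described here (expressions stored in an attribute, then evaluated against the columns) can be sketched on a toy data frame without downloading ABM. Everything in this sketch is an assumption made for illustration: the values are made up, glratio is a hypothetical name, and the real expressions in attr(ABM, 'derived') may differ.

```r
# Toy data; gl and bloodgl only mimic ABM column names, the values are made up
toy <- data.frame(gl = c(40, 60, 20), bloodgl = c(100, 120, 80))

# Hypothetical derived-variable expression, stored as an attribute named "derived"
attr(toy, "derived") <- expression(glratio <- gl / bloodgl)

# List the stored formulas without evaluating them
attr(toy, "derived")

# Evaluate them in a scratch environment that holds the columns
env <- list2env(as.list(toy))
eval(attr(toy, "derived"), envir = env)
toy$glratio <- env$glratio
toy
```

Evaluating in a scratch environment instead of calling attach() keeps the derived variables out of the global workspace; that is a design choice of this sketch, not something the dataset description prescribes.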

If you want to use the CSF/blood glucose ratio as a variable, you will need to create this derived variable before fitting models, because it is derived from more than one input variable (this will also allow multiple imputation on this derived variable). For derived variables involving only a single input variable, it is best to derive them during the model fit. Here is an example:

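library(rms)  # provides lrm() and rcs(), which vtree and Hmisc do not supply
# Function to compute no. of months from the peak of summer (month 8)
dsummer <- function(x) pmin(abs(x-8), abs(x+12-8))
# Function to compute cube root
cr <- function(x) x^(1/3)
# Logistic model with single-variable derived terms computed during the fit
f <- lrm(abm ~ dsummer(month) + rcs(cr(wbc), 4) + rcs(log(gl+1), 5))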
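As a quick check of what dsummer returns, the sketch below evaluates it over the twelve calendar months. It assumes months are coded 1 to 12, as the month column suggests; the function body is repeated so the block runs by itself.

```r
# Months from the peak of summer, with the peak fixed at month 8 as in the text
dsummer <- function(x) pmin(abs(x - 8), abs(x + 12 - 8))

months <- 1:12  # assumption: calendar months coded 1..12
data.frame(month = months, dsummer = dsummer(months))
# month 8 gives 0; months 1 and 12 give 5 and 4, wrapping across the year end
```

The second term, abs(x + 12 - 8), is what handles the wrap-around: for January it measures the distance to the previous August (5 months) rather than the coming one (7 months).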

Sometimes it is a good idea to create multiple imputations using only basic variables. The derived variables then need to be recomputed for each imputation, which can be done using the derived parameter of the Hmisc fit.mult.impute function. If you are using only the pre-defined derived variables, you can specify derived=attr(ABM, 'derived') to fit.mult.impute. Note that if a variable is derived from a single variable through a function such as the dsummer function above, it does not need to be supplied to derived= in the call to fit.mult.impute. But if you compute a variable named dsummer, as is done in attr(ABM,'derived'), you do need to specify the expression in fit.mult.impute(..., derived=...).

In the modeling that was done in the article, the gram smear result was not used in deriving the model, but if the gram smear was known and positive (gram > 0), the predicted Prob(abm) was overridden to 1.0.
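The override rule can be written as a small helper. This is a minimal sketch under stated assumptions: override_gram is a hypothetical name, the probabilities are made-up values rather than model output, and a missing gram value is treated as "not known".

```r
# Hypothetical helper: force the predicted probability to 1.0 when the
# gram smear is known (not NA) and positive (gram > 0); otherwise keep p
override_gram <- function(p, gram) {
  ifelse(!is.na(gram) & gram > 0, 1.0, p)
}

p    <- c(0.20, 0.70, 0.05)  # made-up predicted Prob(abm)
gram <- c(4, 0, NA)          # positive, negative, unknown
override_gram(p, gram)
# first value becomes 1.0; the other two are returned unchanged
```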

Analysis

head(ABM)

##   casenum year month  age  race    sex dx priordx priorrx wbc pmn bands
## 1       1   78     1  4.0 black female  1       0       0 6.5  50     4
## 2       2   78    12  1.0 black   male  1       0       0 3.7  62     5
## 3       3   78     3  0.8 black   male  0       1       1  NA  NA    NA
## 4       4   78     8 54.0 black   male  6       2       0 7.5  73     7
## 5       5   NA    NA   NA  <NA>   <NA> NA      NA      NA  NA  NA    NA
## 6       6   NA    NA   NA  <NA>   <NA> NA      NA      NA  NA  NA    NA
##   compns daysrx offrx lptodc lpgap morelabs bloodgl gl  pr reds whites
## 1      0     10     0      0    NA       NA     165  3 304  440   4000
## 2      0     10     2      0    NA        0     150 92  NA  450   5490
## 3      6     10     1      1    NA       NA     183 36  NA    0   4500
## 4      6     10    NA     NA    NA        0      NA 52  43   27      0
## 5     NA     NA    NA     NA    NA       NA      NA NA  NA   NA     NA
## 6     NA     NA    NA     NA    NA       NA      NA NA  NA   NA     NA
##   polys lymphs monos others gram culture cie bloodclt bloodgl2 gl2 pr2
## 1   100      0     0      0    4       1   1        1       NA  58  46
## 2    97      3     0      0    5       1  NA        1       NA  60  NA
## 3   100      0     0      0    0       0   0       NA       NA  52  51
## 4    NA     NA    NA     NA    0       6  NA        6       NA  NA  NA
## 5    NA     NA    NA     NA   NA      NA  NA       NA       NA  NA  NA
## 6    NA     NA    NA     NA   NA      NA  NA       NA       NA  NA  NA
##   reds2 whites2 polys2 lymphs2 monos2 others2 sumbands   subset abm
## 1     1      47      0     100      0       0 2.000000     test   1
## 2    NA      NA     NA      NA     NA      NA 3.099000 training   1
## 3   335     230      0     100      0       0       NA     test  NA
## 4    NA      NA     NA      NA     NA      NA 5.108999 training   1
## 5    NA      NA     NA      NA     NA      NA       NA     test   0
## 6    NA      NA     NA      NA     NA      NA       NA training   0

The check.is.na parameter lets you examine the missing data pattern in a group of variables. Darker shades represent missing values. Let’s look at the missing data pattern for race, age, sex, wbc, and abm:

vtree(ABM,"race age sex wbc abm",check.is.na=TRUE,height=500,width=850)

For a simplified view of all the unique patterns (“sequences”) of missingness, specify seq=TRUE:

vtree(ABM,"race age sex wbc abm",check.is.na=TRUE,height=500,width=850,seq=TRUE)
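The counts behind these trees can be cross-checked in base R. The sketch below is not how vtree computes them; it uses a made-up toy data frame (so it runs without downloading ABM) and simply tabulates is.na() per variable and per row pattern.

```r
# Toy data with a few missing values (values are made up)
toy <- data.frame(age = c(4, 1, NA, 54), wbc = c(6.5, NA, NA, 7.5))

miss <- is.na(toy)  # logical matrix: TRUE where a value is missing

# Number of missing values in each variable
colSums(miss)

# One label per row describing its missingness pattern, then a tally
pattern <- apply(miss, 1, function(r) {
  paste(ifelse(r, "NA", "not NA"), collapse = " / ")
})
table(pattern)
```

The tally of row patterns corresponds to the idea of the seq=TRUE view, where each distinct sequence of missing and non-missing variables is shown once with its frequency.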

Data exploration with vtree: Missing Values

Nick Barrowman

March 25, 2019
