Data exploration with vtree: diabetes dataset

library(vtree)
library(Hmisc)

## Warning: package 'survival' was built under R version 3.5.3

packageVersion("vtree")

## [1] '1.1.1'

Data obtained from http://biostat.mc.vanderbilt.edu/DataSets

getHdata(diabetes)

These data are courtesy of Dr John Schorling, Department of Medicine, University of Virginia School of Medicine. The data consist of 19 variables on 403 of the 1046 subjects who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia for African Americans. According to Dr John Hong, Diabetes Mellitus Type II (adult onset diabetes) is associated most strongly with obesity. The waist/hip ratio may be a predictor in diabetes and heart disease. DM II is also associated with hypertension; they may both be part of “Syndrome X”. The 403 subjects were the ones who were actually screened for diabetes. Glycosylated hemoglobin > 7.0 is usually taken as a positive diagnosis of diabetes. For more information about this study see

Willems JP, Saunders JT, Hunt DE, Schorling JB: Prevalence of coronary heart disease risk factors among rural blacks: A community-based study. Southern Medical Journal 90:814-820; 1997.

and

Schorling JB, Roach J, Siegel M, Baturka N, Hunt DE, Guterbock TM, Stewart HL: A trial of church-based smoking cessation interventions for rural African Americans. Preventive Medicine 26:92-101; 1997.

head(diabetes)

##     id chol stab.glu hdl ratio glyhb   location age gender height weight
## 1 1000  203       82  56   3.6  4.31 Buckingham  46 female     62    121
## 2 1001  165       97  24   6.9  4.44 Buckingham  29 female     64    218
## 3 1002  228       92  37   6.2  4.64 Buckingham  58 female     61    256
## 4 1003   78       93  12   6.5  4.63 Buckingham  67   male     67    119
## 5 1005  249       90  28   8.9  7.72 Buckingham  64   male     68    183
## 6 1008  248       94  69   3.6  4.81 Buckingham  34   male     71    190
##    frame bp.1s bp.1d bp.2s bp.2d waist hip time.ppn
## 1 medium   118    59    NA    NA    29  38      720
## 2  large   112    68    NA    NA    46  48      360
## 3  large   190    92   185    92    49  57      180
## 4  large   110    50    NA    NA    33  38      480
## 5 medium   138    80    NA    NA    44  41      300
## 6  large   132    86    NA    NA    36  42      195

The variables location, gender, and frame are factors.
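
As a small illustration of the 7.0 threshold mentioned in the data description, the sketch below flags rows whose glycosylated hemoglobin exceeds 7.0. It rebuilds only the id and glyhb columns of the six rows printed above, so it runs without downloading anything. The names first_six and diabetic are illustrative choices, not part of the dataset.

```r
# The id and glyhb values of the six rows printed by head(diabetes) above
first_six <- data.frame(
  id    = c(1000, 1001, 1002, 1003, 1005, 1008),
  glyhb = c(4.31, 4.44, 4.64, 4.63, 7.72, 4.81)
)

# Flag glycosylated hemoglobin above 7.0 (hypothetical column name)
first_six$diabetic <- first_six$glyhb > 7.0

# Count flagged and unflagged rows, keeping any missing values visible
table(first_six$diabetic, useNA = "ifany")
```

Applied to the full diabetes$glyhb column, the same comparison yields NA wherever glyhb is missing.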

Running vtree on a single variable is equivalent to a 1-way contingency table:

vtree(diabetes,"frame",horiz=FALSE,height=250,width=850)

Note that frame has 12 missing values. “Valid” percentages are calculated after removing these missing values. Specifying vp=FALSE lets you calculate percentages without removing the missing values.
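
To make the difference between the two kinds of percentage concrete, here is a minimal base R sketch. It uses a small made-up vector (not the diabetes data) and assumes that a "valid" percentage simply means a percentage computed after dropping missing values, as described above.

```r
# Hypothetical toy vector standing in for a factor with missing values
frame <- factor(c("small", "medium", "large", "medium", NA, "small", "medium", NA),
                levels = c("small", "medium", "large"))

# One-way table of counts, with missing values shown as their own category
counts <- table(frame, useNA = "ifany")

# "Valid" percentages: missing values are removed before dividing
valid_pct <- 100 * prop.table(table(frame))

# Percentages over all rows: missing values stay in the denominator
all_pct <- 100 * prop.table(counts)

round(valid_pct, 1)
round(all_pct, 1)
```

In this toy vector, "medium" accounts for 3 of the 6 non-missing values but only 3 of the 8 values overall, so the two percentages differ whenever anything is missing.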

Running vtree on two variables is equivalent to a 2-way contingency table. Note that the variables can be listed in a single string, separated by spaces.

vtree(diabetes,"frame location",horiz=FALSE,height=250,width=850)
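
For comparison, the corresponding 2-way contingency table can be sketched in base R. The data frame toy below is hypothetical and its values are made up for illustration. The row percentages should correspond roughly to what the second level of the tree reports, on the assumption that each node's percentage is taken within its parent node.

```r
# Hypothetical toy data; the values are made up for illustration
toy <- data.frame(
  frame    = factor(c("small", "medium", "large", "medium", NA, "small")),
  location = factor(c("Buckingham", "Louisa", "Louisa",
                      "Buckingham", "Louisa", "Louisa"))
)

# Two-way contingency table; useNA = "ifany" keeps a row for missing frame values
two_way <- table(toy$frame, toy$location, useNA = "ifany")
two_way

# Row percentages: the split by location within each frame category
round(100 * prop.table(two_way, margin = 1), 1)
```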

If we don’t need to see the variable names on the left-hand side, we can specify showlevels=FALSE:

vtree(diabetes,"frame location",horiz=FALSE,height=250,width=850,showlevels=FALSE)

Now let’s use the summary parameter to show some information about a continuous variable, glyhb. Let’s specify summary="glyhb \n\nglyhb\nmean=%mean%\nSD=%SD%\nmv=%mv% %leafonly%". Here’s what it means:

glyhb at the beginning means we want a summary of that variable.
Next there is a separating space.
The rest of the string describes the format of the output.
\n is a line break.
%mean% is a code for the mean.
%SD% is a code for the standard deviation.
%mv% is a code for the number of missing values.
%leafonly% requests that the summary information be shown only in leaf nodes.

vtree(diabetes,"frame location",horiz=FALSE,height=250,width=850,
  showlevels=FALSE,summary="glyhb \n\nglyhb\nmean=%mean%\nSD=%SD%\nmv=%mv% %leafonly%")
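
The quantities requested by the summary string can also be computed directly in base R, which may help when checking the numbers shown in the leaf nodes. This is only a sketch under stated assumptions: the data frame toy and the helper summarise_glyhb are hypothetical, the values are made up, and it assumes that the mean and SD are computed after excluding missing values.

```r
# Hypothetical toy data; the values are made up for illustration
toy <- data.frame(
  frame    = c("small", "small", "large", "large", "large", "small"),
  location = c("Buckingham", "Buckingham", "Louisa", "Louisa", "Louisa", "Louisa"),
  glyhb    = c(4.0, 6.0, 5.0, NA, 7.0, 5.5)
)

# Summary of one numeric vector: mean and SD ignore missing values, mv counts them
summarise_glyhb <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    SD   = sd(x, na.rm = TRUE),
    mv   = sum(is.na(x)))
}

# One summary per combination of frame and location that occurs in the data;
# na.action = na.pass keeps rows with missing glyhb so they can be counted
leaf_summary <- aggregate(glyhb ~ frame + location, data = toy,
                          FUN = summarise_glyhb, na.action = na.pass)
leaf_summary
```

Each row of leaf_summary plays the role of one leaf node. A group with a single non-missing value has an SD of NA, because a standard deviation needs at least two values.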

Nick Barrowman

February 5, 2019