Note that excluding.missing is not working as I expect it to (see population totals below). Need to understand this.

Running CommonSetup with stack CompTwoVar, CreateSurveyObjects, LoadData, CommonSetup

Common Setup

Running LoadData with stack CompTwoVar, CreateSurveyObjects, LoadData

Load Data

Initial version just loads the saved data.

Note that the RData version sets some variables (e.g. testRun) it would be better not to.

#load("~/Health/NHANES/ajdamico/ajdamico_usgsd/National Health and Nutrition Examination Survey/reference_range_files/NHANES_ABCDEFG.RData")
load("~/Health/NHANES/ajdamico/ajdamico_usgsd/National Health and Nutrition Examination Survey/reference_range_files/NHANES_ABCDEFG_20141127.rda")
ls()
##  [1] "cdp"                           "currentLevel"                  "displayReferenceRange"        
##  [4] "displayScreen"                 "displaySuppliedReferenceRange" "excludeMissing"               
##  [7] "levelStack"                    "NHANES_Common_Setup_Loaded"    "NHANESRunFile"                
## [10] "thisLevel"                     "topLevel"                      "useWeightsName"               
## [13] "varName1"                      "varName2"                      "x"
#colnames(x)

Running CreateSurveyObjects with stack CompTwoVar, CreateSurveyObjects

Create Survey Objects

Creating survey objects using weights WTMEC2YR and excluding records missing ~ Urine.SG + LDL.

Survey design for Taylor-series linearization. More detail available in source code.

Note that the warning is caused by exclusion of under 20 subjects from our data file since we focus only on adults.

# create survey design object with NHANES design information
# using the final data frame of NHANES data
nhanes.tsl.design <- 
  svydesign(
    id = ~SDMVPSU , 
    strata = ~SDMVSTRA ,
    nest = TRUE ,
    weights = ~get(useWeightsName) ,
    data = x,
    # We usually have complete cases, but support running on subsets 
    excluding.missing = as.formula(excludeMissing) # analysis var(s) of interest
    )

# notice the 'nhanes.tsl.design' object used in all subsequent analysis commands

# 2000 census populations to adjust to
# from excel file: http://www.cdc.gov/nchs/tutorials/nhanes/Downloads/Continuous/ageadjwt.xls

# create a data frame containing the four population strata
# RES: The convention appears to be population in thousands.  For example, see:
# http://www.cdc.gov/nchs/tutorials/NHANES/NHANESAnalyses/AgeStandardization/Task1a.htm

pop.by.age <- 
  data.frame( 
        agecat = 1:4 , 
        Freq = c( 55901 , 77670 , 72816 , 45364 ) 
    )   

# print that data frame to the screen
#pop.by.age

# this data frame contains the information that the year 2000 united states population was approximately
# 56 million 0-19 year olds
# 77 million 20-39 year olds
# 73 million 40-59 year olds
# 45 million 60+ year olds

# create a new survey object stratified to the census counts, broken out by both race / ethnicity and gender #
nhanes.by.race.and.gender <-
  svystandardize(
    nhanes.tsl.design , 
    by = ~agecat ,                    # stratification variable
    over = ~racef + sex ,             # break out variable(s)
    population = pop.by.age,          # data frame containing census populations
    # We usually have complete cases, but support running on subsets 
    excluding.missing = as.formula(excludeMissing) # analysis var(s) of interest
    )
## Warning in postStratify.survey.design(design, combined, allmargins, partial = TRUE): Some strata
## absent from sample: ignored

First some general information about this set of NHANES data.

There are 33259 total (unweighted) records. Broken out by race/ethnicity category:

##       racef counts se
## White White  15943  0
## Black Black   6590  0
## Hisp   Hisp   8748  0
## Other Other   1978  0

Confirm that the sum of the weights we are using ( 1.36706510^{9} ) equals the sum calculated by svytotal:

##         total       SE
## one 1.367e+09 30161105

The civilian, non-institutionalized population of the United States by race/ethnicity category:

##       racef       one       se
## White White 964688060 35558398
## Black Black 144660386  7520211
## Hisp   Hisp 179493449 11241672
## Other Other  78222740  4696539

The civilian, non-institutionalized population of the United States by age category:

##       agetext       one       se
## 20-39   20-39 526070779 11402207
## 40-59   40-59 524817290 14418451
## 60+       60+ 316176567 10343969

Running CompTwoVar with stack CompTwoVar

Compare Urine.SG and LDL

For these two variables the data contains 4831 subjects with data for both corresponding to a population of 1.83813510^{8} Americans. These subjects are distributed demographically (population represented) as follows (note the contrast to population numbers above, I think this indicates an issue with my use of excluding.missing in svystandardize):

# Does not display well inline
tapply(x[completeCases, useWeightsName], x[completeCases, c("sex", "agetext")], sum)
##         agetext
## sex      0-19    20-39    40-59      60+
##   Male     NA 33752335 36177078 19570154
##   Female   NA 34002546 37766559 22544818
tapply(x[completeCases, useWeightsName], x[completeCases, c("sex", "racef")], sum)
##         racef
## sex         White    Black     Hisp   Other
##   Male   61746344  8851698 12938914 5962611
##   Female 65961762 10836920 12010921 5504319

First look at the individual variables.

Urine.SG

displayReferenceRange(varName1)
## 
## 
## Calculating reference ranges for Urine.SG
## Urine.SG appears to be discrete with interval 0.001
##              0  0.05  0.25   0.5  0.75  0.95   1
## Urine.SG 1.001 1.005 1.011 1.017 1.022 1.028 1.2

##              racef    sex 0.05 0.25  0.5 0.75 0.95  se.0.05  se.0.25   se.0.5  se.0.75  se.0.95
## White.Male   White   Male 1.01 1.01 1.02 1.02 1.03 0.000510 0.000255 0.000255 0.000255 0.000510
## Black.Male   Black   Male 1.01 1.01 1.02 1.02 1.03 0.000792 0.000510 0.000510 0.000510 0.001020
## Hisp.Male     Hisp   Male 1.01 1.01 1.02 1.02 1.03 0.000765 0.000765 0.000510 0.000255 0.000765
## Other.Male   Other   Male 1.01 1.01 1.02 1.02 1.03 0.001010 0.000755 0.001020 0.000765 0.001880
## White.Female White Female 1.00 1.01 1.01 1.02 1.03 0.000510 0.000255 0.000510 0.000255 0.001020
## Black.Female Black Female 1.01 1.01 1.02 1.02 1.03 0.000765 0.000510 0.000510 0.000255 0.000829
## Hisp.Female   Hisp Female 1.00 1.01 1.02 1.02 1.03 0.000255 0.000510 0.000510 0.000510 0.000255
## Other.Female Other Female 1.00 1.01 1.01 1.02 1.03 0.000966 0.000765 0.001020 0.001440 0.001700

LDL

displayReferenceRange(varName2)
## 
## 
## Calculating reference ranges for LDL
## LDL appears to be discrete with interval 0.2
##        0    0.05 0.25   0.5  0.75    0.95     1
## LDL -172 59.7172 88.8 111.4 136.4 176.389 355.6

##              racef    sex 0.05 0.25 0.5 0.75 0.95 se.0.05 se.0.25 se.0.5 se.0.75 se.0.95
## White.Male   White   Male 54.9 87.0 110  135  174    1.33    2.03  1.730    2.19    4.78
## Black.Male   Black   Male 62.2 87.5 109  137  174    3.99    1.89  1.550    2.10    6.36
## Hisp.Male     Hisp   Male 58.3 91.8 113  142  179    2.62    1.44  1.260    3.04    1.99
## Other.Male   Other   Male 54.1 93.7 114  133  175   13.50    3.83  4.860    7.21   17.60
## White.Female White Female 66.2 91.0 114  139  181    2.25    1.01  0.969    1.53    5.00
## Black.Female Black Female 59.1 85.2 109  131  175    2.97    2.40  2.400    3.35    7.00
## Hisp.Female   Hisp Female 61.0 88.5 113  136  174    2.90    2.01  1.320    1.91    1.99
## Other.Female Other Female 49.0 74.3 104  126  158   10.10    3.78  8.600    6.33   53.60

Compare Urine.SG and LDL

Try different types of plots to see which is most meaningful.

First try each of the svyplot styles.

I don’t find the bubble and transparent styles that useful for NHANES data.

## Loading required package: hexbin

The hexbin plots aren’t terribly useful with default ranges.

See NHANES_analyze_tissue_cholesterol.R for attempts at limiting ranges for survey hexbin plots.

Now try some ggplot alternatives.

Now let’s look at relationship by subgroups (here age groups and race) and sex.

http://sape.inf.usi.ch/quick-reference/ggplot2 looks like a useful ggplot2 reference.

Try ggplot binhex/bin2d plots.
http://stackoverflow.com/questions/7714677/r-scatterplot-with-too-many-points
http://stackoverflow.com/questions/13167531/ggplot2-multiple-stat-binhex-plots-with-different-color-gradients-in-one-image

It appears geom_hex, stat_binhex, and stat_bin2d do not support weights.
geom_bin2d appears to understand weights so use that.

Note that this does not work well for discrete variables currently. The bin size used can contain differing number of discrete values (e.g. for urine SG a bin of 1.009-1.0021 includes 1.001 and 1.002 while a bin of 1.0021-1.0033 only includes 1.003) which introduces a banding effect into the plots. (see comment in To Do section below)

A similar example with geom_hex. Notice that the Frequency values are unweighted despite the weights being used (weights are not supported by geom_hex)

To Do

Use discretized vs continuous knowledge for deciding whether or not to jitter plots or try to fix binwidths to the size of the discretization interval.