Note that excluding.missing is not working as I expect it to (see population totals below). Need to understand this.
Running CommonSetup with stack CompTwoVar, CreateSurveyObjects, LoadData, CommonSetup
Running LoadData with stack CompTwoVar, CreateSurveyObjects, LoadData
Initial version just loads the saved data.
Note that the RData version sets some variables (e.g. testRun) it would be better not to.
#load("~/Health/NHANES/ajdamico/ajdamico_usgsd/National Health and Nutrition Examination Survey/reference_range_files/NHANES_ABCDEFG.RData")
load("~/Health/NHANES/ajdamico/ajdamico_usgsd/National Health and Nutrition Examination Survey/reference_range_files/NHANES_ABCDEFG_20141127.rda")
ls()
## [1] "cdp" "currentLevel" "displayReferenceRange"
## [4] "displayScreen" "displaySuppliedReferenceRange" "excludeMissing"
## [7] "levelStack" "NHANES_Common_Setup_Loaded" "NHANESRunFile"
## [10] "thisLevel" "topLevel" "useWeightsName"
## [13] "varName1" "varName2" "x"
#colnames(x)
Running CreateSurveyObjects with stack CompTwoVar, CreateSurveyObjects
Creating survey objects using weights WTMEC2YR and excluding records missing ~ Urine.SG + LDL.
Survey design for Taylor-series linearization. More detail available in source code.
Note that the warning is caused by exclusion of under 20 subjects from our data file since we focus only on adults.
# create survey design object with NHANES design information
# using the final data frame of NHANES data
nhanes.tsl.design <-
svydesign(
id = ~SDMVPSU ,
strata = ~SDMVSTRA ,
nest = TRUE ,
weights = ~get(useWeightsName) ,
data = x,
# We usually have complete cases, but support running on subsets
excluding.missing = as.formula(excludeMissing) # analysis var(s) of interest
)
# notice the 'nhanes.tsl.design' object used in all subsequent analysis commands
# 2000 census populations to adjust to
# from excel file: http://www.cdc.gov/nchs/tutorials/nhanes/Downloads/Continuous/ageadjwt.xls
# create a data frame containing the four population strata
# RES: The convention appears to be population in thousands. For example, see:
# http://www.cdc.gov/nchs/tutorials/NHANES/NHANESAnalyses/AgeStandardization/Task1a.htm
pop.by.age <-
data.frame(
agecat = 1:4 ,
Freq = c( 55901 , 77670 , 72816 , 45364 )
)
# print that data frame to the screen
#pop.by.age
# this data frame contains the information that the year 2000 united states population was approximately
# 56 million 0-19 year olds
# 77 million 20-39 year olds
# 73 million 40-59 year olds
# 45 million 60+ year olds
# create a new survey object stratified to the census counts, broken out by both race / ethnicity and gender #
nhanes.by.race.and.gender <-
svystandardize(
nhanes.tsl.design ,
by = ~agecat , # stratification variable
over = ~racef + sex , # break out variable(s)
population = pop.by.age, # data frame containing census populations
# We usually have complete cases, but support running on subsets
excluding.missing = as.formula(excludeMissing) # analysis var(s) of interest
)
## Warning in postStratify.survey.design(design, combined, allmargins, partial = TRUE): Some strata
## absent from sample: ignored
First some general information about this set of NHANES data.
There are 33259 total (unweighted) records. Broken out by race/ethnicity category:
## racef counts se
## White White 15943 0
## Black Black 6590 0
## Hisp Hisp 8748 0
## Other Other 1978 0
Confirm that the sum of the weights we are using ( 1.36706510^{9} ) equals the sum calculated by svytotal:
## total SE
## one 1.367e+09 30161105
The civilian, non-institutionalized population of the United States by race/ethnicity category:
## racef one se
## White White 964688060 35558398
## Black Black 144660386 7520211
## Hisp Hisp 179493449 11241672
## Other Other 78222740 4696539
The civilian, non-institutionalized population of the United States by age category:
## agetext one se
## 20-39 20-39 526070779 11402207
## 40-59 40-59 524817290 14418451
## 60+ 60+ 316176567 10343969
Running CompTwoVar with stack CompTwoVar
For these two variables the data contains 4831 subjects with data for both corresponding to a population of 1.83813510^{8} Americans. These subjects are distributed demographically (population represented) as follows (note the contrast to population numbers above, I think this indicates an issue with my use of excluding.missing in svystandardize):
# Does not display well inline
tapply(x[completeCases, useWeightsName], x[completeCases, c("sex", "agetext")], sum)
## agetext
## sex 0-19 20-39 40-59 60+
## Male NA 33752335 36177078 19570154
## Female NA 34002546 37766559 22544818
tapply(x[completeCases, useWeightsName], x[completeCases, c("sex", "racef")], sum)
## racef
## sex White Black Hisp Other
## Male 61746344 8851698 12938914 5962611
## Female 65961762 10836920 12010921 5504319
First look at the individual variables.
displayReferenceRange(varName1)
##
##
## Calculating reference ranges for Urine.SG
## Urine.SG appears to be discrete with interval 0.001
## 0 0.05 0.25 0.5 0.75 0.95 1
## Urine.SG 1.001 1.005 1.011 1.017 1.022 1.028 1.2
## racef sex 0.05 0.25 0.5 0.75 0.95 se.0.05 se.0.25 se.0.5 se.0.75 se.0.95
## White.Male White Male 1.01 1.01 1.02 1.02 1.03 0.000510 0.000255 0.000255 0.000255 0.000510
## Black.Male Black Male 1.01 1.01 1.02 1.02 1.03 0.000792 0.000510 0.000510 0.000510 0.001020
## Hisp.Male Hisp Male 1.01 1.01 1.02 1.02 1.03 0.000765 0.000765 0.000510 0.000255 0.000765
## Other.Male Other Male 1.01 1.01 1.02 1.02 1.03 0.001010 0.000755 0.001020 0.000765 0.001880
## White.Female White Female 1.00 1.01 1.01 1.02 1.03 0.000510 0.000255 0.000510 0.000255 0.001020
## Black.Female Black Female 1.01 1.01 1.02 1.02 1.03 0.000765 0.000510 0.000510 0.000255 0.000829
## Hisp.Female Hisp Female 1.00 1.01 1.02 1.02 1.03 0.000255 0.000510 0.000510 0.000510 0.000255
## Other.Female Other Female 1.00 1.01 1.01 1.02 1.03 0.000966 0.000765 0.001020 0.001440 0.001700
displayReferenceRange(varName2)
##
##
## Calculating reference ranges for LDL
## LDL appears to be discrete with interval 0.2
## 0 0.05 0.25 0.5 0.75 0.95 1
## LDL -172 59.7172 88.8 111.4 136.4 176.389 355.6
## racef sex 0.05 0.25 0.5 0.75 0.95 se.0.05 se.0.25 se.0.5 se.0.75 se.0.95
## White.Male White Male 54.9 87.0 110 135 174 1.33 2.03 1.730 2.19 4.78
## Black.Male Black Male 62.2 87.5 109 137 174 3.99 1.89 1.550 2.10 6.36
## Hisp.Male Hisp Male 58.3 91.8 113 142 179 2.62 1.44 1.260 3.04 1.99
## Other.Male Other Male 54.1 93.7 114 133 175 13.50 3.83 4.860 7.21 17.60
## White.Female White Female 66.2 91.0 114 139 181 2.25 1.01 0.969 1.53 5.00
## Black.Female Black Female 59.1 85.2 109 131 175 2.97 2.40 2.400 3.35 7.00
## Hisp.Female Hisp Female 61.0 88.5 113 136 174 2.90 2.01 1.320 1.91 1.99
## Other.Female Other Female 49.0 74.3 104 126 158 10.10 3.78 8.600 6.33 53.60
Try different types of plots to see which is most meaningful.
First try each of the svyplot styles.
I don’t find the bubble and transparent styles that useful for NHANES data.
## Loading required package: hexbin
The hexbin plots aren’t terribly useful with default ranges.
See NHANES_analyze_tissue_cholesterol.R for attempts at limiting ranges for survey hexbin plots.
Now try some ggplot alternatives.
Now let’s look at relationship by subgroups (here age groups and race) and sex.
http://sape.inf.usi.ch/quick-reference/ggplot2 looks like a useful ggplot2 reference.
Try ggplot binhex/bin2d plots.
http://stackoverflow.com/questions/7714677/r-scatterplot-with-too-many-points
http://stackoverflow.com/questions/13167531/ggplot2-multiple-stat-binhex-plots-with-different-color-gradients-in-one-image
It appears geom_hex, stat_binhex, and stat_bin2d do not support weights.
geom_bin2d appears to understand weights so use that.
Note that this does not work well for discrete variables currently. The bin size used can contain differing number of discrete values (e.g. for urine SG a bin of 1.009-1.0021 includes 1.001 and 1.002 while a bin of 1.0021-1.0033 only includes 1.003) which introduces a banding effect into the plots. (see comment in To Do section below)
A similar example with geom_hex. Notice that the Frequency values are unweighted despite the weights being used (weights are not supported by geom_hex)
Use discretized vs continuous knowledge for deciding whether or not to jitter plots or try to fix binwidths to the size of the discretization interval.