Hospital records and survey data are both key resources when tabulating public health metrics. However, the question remains whether self-reported surveys reflect the reality of current disease burdens. For this project, I'll be comparing the NHANES dataset (containing survey data) with the SPARCS dataset, containing surveillance hospitalization data in the state of New York. The health outcome of interest is opioid-dependence annd whether age differs between prescription and illicit opioid use.
Here we can preview the SPARCS (surveillance) data. I queried this using an API token. This totals to 16k hospitalizations:
app.token = 'B8MOgX9ChGjrpoBEzo4Kdmqpt'
inpatient.2016 = 'https://health.data.ny.gov/resource/gnzp-ekau.csv'
x = paste0(inpatient.2016, '?apr_drg_code=773&$limit=100000&$$app_token=', app.token)
dt.sparcs = read.csv(url(x))
dt.sparcs = as.data.table(dt.sparcs)
dt.sparcs[, 1:5] %>% head() %>% kable() %>% kableExtra::kable_styling()
| health_service_area | hospital_county | operating_certificate_number | facility_id | facility_name |
|---|---|---|---|---|
| Western NY | Cattaraugus | 401001 | 66 | Olean General Hospital |
| Western NY | Cattaraugus | 401001 | 66 | Olean General Hospital |
| Western NY | Cattaraugus | 401001 | 66 | Olean General Hospital |
| Western NY | Cattaraugus | 401001 | 66 | Olean General Hospital |
| Western NY | Cattaraugus | 401001 | 66 | Olean General Hospital |
| Western NY | Cattaraugus | 401001 | 66 | Olean General Hospital |
# http://health.data.ny.gov/resource/xdss-u53e.csv?$limit=5000&$$app_token=YOURAPPTOKENHERE
For NHANES, we'll be reading these into CSV format and importing multiple tables:
dt.hsq = read.xport(file.path(dir.NHANES, 'HSQ_I.XPT')) %>% as.data.table()
dt.demo.NHANES = read.xport(file.path(dir.NHANES, 'DEMO_I.XPT')) %>% as.data.table()
dt.drugs.NHANES = read.xport(file.path(dir.NHANES, "RXQ_RX_I.XPT")) %>% as.data.table()
dt.drug.info.NHANES = read.xport(file.path(dir.NHANES, "RXQ_DRUG.xpt")) %>% as.data.table()
dt.drugs.NHANES[,1:5] %>% head() %>% kable() %>% kableExtra::kable_styling()
| SEQN | RXDUSE | RXDDRUG | RXDDRGID | RXQSEEN |
|---|---|---|---|---|
| 83732 | 1 | CETIRIZINE | d03827 | 2 |
| 83732 | 1 | FLUTICASONE NASAL | d04283 | 2 |
| 83732 | 1 | HYDROXYCHLOROQUINE | d00817 | 2 |
| 83732 | 1 | INSULIN GLARGINE | d04538 | 2 |
| 83732 | 1 | LIRAGLUTIDE | d07466 | 1 |
| 83732 | 1 | LISINOPRIL | d00732 | 1 |
One aspect to bear in mind here is that the NHANES dataset has multiple one-to-many relationships. The demographics table contains information on participants, but participants can have multiple drug orderes. The drug information table (similarly) has core information per drug but can be joined to multiple drug orders.
Here, the demographic and drug order data has a one-to-many relationship:
dt.demo.drugs.merged = merge(dt.demo.NHANES, dt.drugs.NHANES, by = 'SEQN')
dt.demo.drugs.merged[1:5,1:5] %>% head() %>% kable() %>% kableExtra::kable_styling()
| SEQN | SDDSRVYR | RIDSTATR | RIAGENDR | RIDAGEYR |
|---|---|---|---|---|
| 83732 | 9 | 2 | 1 | 62 |
| 83732 | 9 | 2 | 1 | 62 |
| 83732 | 9 | 2 | 1 | 62 |
| 83732 | 9 | 2 | 1 | 62 |
| 83732 | 9 | 2 | 1 | 62 |
Similarly, we can merge the drug orders to the drug metadata and visualize accordingly:
dt.drugs.NHANES.merged = merge(dt.drugs.NHANES, dt.drug.info.NHANES, by = 'RXDDRGID')
dt.drugs.NHANES.merged[1:5, 1:7] %>% head() %>% kable() %>% kableExtra::kable_styling()
| RXDDRGID | SEQN | RXDUSE | RXDDRUG.x | RXQSEEN | RXDDAYS | RXDRSC1 |
|---|---|---|---|---|---|---|
| a10900 | 86039 | 1 | ESTRADIOL; ESTRIOL | 1 | 1825 | Z79.890 |
| c00001 | 83847 | 1 | ANTI-INFECTIVES - UNSPECIFIED | 2 | 5 | A49.9P |
| c00001 | 85430 | 1 | ANTI-INFECTIVES - UNSPECIFIED | 2 | 7 | J01.9 |
| c00001 | 88234 | 1 | ANTI-INFECTIVES - UNSPECIFIED | 2 | 14 | N76.0 |
| c00001 | 88487 | 1 | ANTI-INFECTIVES - UNSPECIFIED | 2 | 10 | 55555 |
From the API query, we've subsetted based on opioid-misuse. We'll need to subset the NHANES dataset to only contain folks who use opioids (via the drug names "oxycodone" and "fentanyl":
dt.drugs.NHANES.sub = dt.drugs.NHANES[grepl("oxycodone|fentanyl", RXDDRUG, ignore.case = T)]
vec.NHANES.ids = dt.drugs.NHANES.sub[, SEQN]
From here, we'll subset all the NHANES datasets with these ids (N = 72):
dt.demo.NHANES.sub = dt.demo.NHANES[SEQN %in% vec.NHANES.ids]
dt.hsq.sub = dt.hsq[SEQN %in% vec.NHANES.ids]
Here I'll convert a continuous value (age) to a binned category. The SPARCS dataset has age categories as follows:
dt.sparcs[,.N, by = 'age_group'] %>% kable() %>% kableExtra::kable_styling()
| age_group | N |
|---|---|
| 18 to 29 | 3874 |
| 50 to 69 | 4380 |
| 30 to 49 | 7533 |
| 70 or Older | 133 |
| 0 to 17 | 17 |
Meanwhile the NHANES dataset only has age as a continuous variable:
ggplot(aes(as.numeric(RIDAGEYR)), data = dt.demo.NHANES.sub) + geom_histogram(color = 'white', fill = 'lightblue') + theme_minimal() + labs(title = 'Age Distribution: NHANES', x = 'Age (years)')
Therefore, we'll apply this to the demographics table of the NHANES dataset:
dt.demo.NHANES.sub[, age_group := cut(RIDAGEYR, c(0, 17, 29, 49, 69, 100), labels = c("0 to 17", "18 to 29", "30 to 49", "50 to 69", "70 or Older"), right = T)]
dt.demo.NHANES.sub[,.N, by = 'age_group'][order(age_group),] %>% kable() %>% kableExtra::kable_styling()
| age_group | N |
|---|---|
| 0 to 17 | 1 |
| 18 to 29 | 6 |
| 30 to 49 | 17 |
| 50 to 69 | 26 |
| 70 or Older | 16 |
| NA | 1 |
One key feature of surveillance data is geographical information. We can use this when mapping the opioid misuse (via the urbnmapr package):
dt.counties = (counties)
dt.counties = as.data.table(dt.counties)
dt.counties = dt.counties[state_name == 'New York']
dt.counties$county_name = gsub('\\s*County', '', dt.counties$county_name)
#dt.fips = fread('fips.txt')
#dt.fips = dt.fips[grepl('^36', `FIPS code`)]
#dt.fips[name == 'New York', name := "Manhattan"]
dt.sparcs.N = dt.sparcs[type_of_admission == 'Emergency',.N, by = "hospital_county"]
setnames(dt.sparcs.N, "hospital_county", "county_name")
dt.sparcs.N[county_name == 'St Lawrence', county_name := 'St. Lawrence']
dt.sparcs.N[county_name == 'Manhattan', county_name := 'New York']
dt.emergency.dat = merge(dt.sparcs.N, dt.counties, by = 'county_name', all.y = T)
dt.emergency.dat %>%
ggplot(aes(long, lat, group = group, fill = N)) +
geom_polygon(color = NA) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
labs(title = "Number of Emergency Visits due to Opioid Use") + scale_fill_viridis_b() + theme(legend.position = 'bottom') + theme_minimal()
While a good portion of the cases are located Downstate, there are two glaring counties in Upstate (Erie and Suffolk) with significant case counts.
It's worth exploring if age is normally distributed among individuals prescribed opioids (my guess is no). But we can test this using a skewdness test (shapiro.test):
shapiro.test(dt.demo.NHANES.sub$RIDAGEYR)
##
## Shapiro-Wilk normality test
##
## data: dt.demo.NHANES.sub$RIDAGEYR
## W = 0.95444, p-value = 0.0154
This confirms that age is skewed (most likely skewed-left).
Many attribute opioid use disorder with youth; we can compare the age distributions between folks with prescriptions and hospitalizations to verify this.
The next step is visualizing the age distributions between hospitalizations and individuals with prescription opioids:
dt.demo.NHANES.sub.N = dt.demo.NHANES.sub[,.N, by = 'age_group']
hist.NHANES = ggplot(data = dt.demo.NHANES.sub.N[!is.na(age_group)], aes(age_group, N, fill = factor(age_group))) + geom_bar(stat = 'identity') + theme_minimal() + labs(x = "Age Group", y = "Count", title = "NHANES Age Groups") + scale_fill_tableau()
dt.sparcs.AGE.N = dt.sparcs[,.N, by = 'age_group']
dt.sparcs.AGE.N[, age_group := factor(age_group, levels = c("0 to 17", "18 to 29", "30 to 49", "50 to 69", "70 or Older"))]
hist.SPARCS = ggplot(data = dt.sparcs.AGE.N, aes(age_group, N, fill = factor(age_group))) + geom_bar(stat = 'identity') + theme_minimal() + labs(x = "Age Group", y = "Count", title = "SPARCS Age Groups") + scale_fill_tableau()
ggarrange(hist.NHANES, hist.SPARCS, ncol = 1, common.legend = T)
There's a KEY takeaway here: just by eyeballing we can tell that folks 50-69 are most likely to be prescribed opioids but individuals in the 30-49 age bucket are most likely to be hospitalized for opioid misuse. We'll use our statistical tests to confirm this.
If we want to compare the age distributions between the two datasets, we can use a simple Chi-Squared test:
dt.cont.tbl = rbind(dt.sparcs[,.(age_group, Data = 'SPARCS')], dt.demo.NHANES.sub[,.(age_group, Data = 'NHANES')])
chisq.test(dt.cont.tbl$age_group, dt.cont.tbl$Data)
## Warning in chisq.test(dt.cont.tbl$age_group, dt.cont.tbl$Data): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: dt.cont.tbl$age_group and dt.cont.tbl$Data
## X-squared = 414.5, df = 4, p-value < 2.2e-16
As the distributions indicated, there's a significant difference in age demographics of those prescribed opioids compared to those who are hospitalized for an overdose.