An Introduction to the exposome: exposome-wide association in body mass index and fasting glucose

Please check Canvas for the due date of this assignment.

Start early. This is a challenging assignment! Each student must hand in their code and answers separately. This assignment is 20% of your grade. We cannot accept late assignments.

Submit your answers as a knitted .html file: select the Knit drop down at the top of the window. Once converted to .html, select the file in the file browser, click the More drop down, and select Export.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dbplyr)
## 
## Attaching package: 'dbplyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
library(knitr)
library(kableExtra)
## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(cowplot)

Background:

Body mass index and glucose are phenotypes that are known to be risk factors for Type 2 Diabetes (T2D). Genetic factors are associated with body mass index and glucose. However, environmental exposure factors associated to fasting glucose and/or body mass index may also play a role in T2D, given the rapid rise of both body mass index and diabetes in the last 20 years (https://www.cdc.gov/diabetes/statistics/slides/long_term_trends.pdf).

For example, it is hypothesized that “poor” diets or sugar consumption or even pollutants may influence body mass index or fasting glucose. However, these factors do not predict the phenotypes of body mass index or fasting glucose completely. In this assignment you will complement these associations in T2D, linking new potential environmental exposures in T2D phenotypes. Specifically, we will execute a data-driven search for environmental exposures correlated with body mass index and fasting glucose, called ‘environment-wide association studies’ or ‘exposome-wide association studies’ (EWASs), similar to how investigators correlate individual SNPs with disease in GWAS.

Description of Data and Readings:

Please skim Patel et al 2010.

load('./assignment1.Rdata')

This .Rdata file contains 5 data.frames and an array named the following:

  • ExposureDescription:the dictionary of exposure variables in the training and testing datasets
  • ExposureList: a list of biomarkers of environmental exposures to test against BMI and glucose.
  • NHData.train: the training dataset from the NHANES 1999-2000 and 2001-2002 surveys
  • NHData.test: the testing dataset from the NHANES 2003-2004 and 2005-2006 surveys
  • demographicVariables: a dictionary of demographic variables in the training and testing datasets.

These data were derived from the National Health and Nutrition Examination Survey (NHANES), a survey on a representative sample of the United States population, undertaken by the US Centers for Disease Control and Prevention (see here: https://www.cdc.gov/nchs/nhanes/. Specifically, we downloaded the data from the NHANES web site and stitched them together to create the data.frame objects above. For more information, please see our paper, Patel CJ, et al 2016, Nature Scientific Data.

A short tutorial on analysis of survey data

The NHANES participants are representative of the United States population. A challenge in making a dataset representative is ensuring all types of individuals are sampled that reflect the diversity of the US at the lowest possible cost. Simply calling a finite number of people up randomly is not going to achieve representativeness. The US CDC and National Centers for Health Statistics (NCHS) are able to achieve representativeness by sampling specific facets of the population at higher probabilities than would be observed if one was to randomly sample the population. Second, NCHS samples people from specific parts of the population (e.g., 15 counties a year) to save time and money (it is easier to survey people in a neighborhood than randomly all over the US).

In fact, most data that you will find from the US CDC or international epidemiological surveys, use a special sampling technique called “survey sampling”. You can learn more about the CDC and the NCHS sampling procedure here: https://wwwn.cdc.gov/Nchs/Nhanes/AnalyticGuidelines.aspx

What does this mean for us? We cannot use the common tools in R to do modeling if we desire to produce “valid” effect sizes, associations/correlations, and pvalues. Why? Because individuals are not sampled randomly, they exhibit some inherent correlation with others (e.g., individuals sampled from the same town, for example). Therefore, this will influence the standard errors and averages of the correlations and therefore also influence the inference (pvalues for correlation).

How do we address this? We simply use a package, aptly called ‘survey’, to take into account the unequal survey-weighted design of the NHANES population. The two main functions you will use in this assignment to achieve this includes ‘svydesign’ and ‘svyglm’ (survey-weighted general linear modeling, which can do linear, logistic, and other types of regressions). Here is how you use them using the NHdata.train data.frame.

First, we need to load the ‘survey’ package:

library(survey)
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart

Next, we need to tell R that we have a survey-weighted data we need to analyze. This is done through the ‘svydesign’ function from the survey library, for example for the training data, NHData.train:

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.train)

This tells R that we have a survey-weighted sample (denoted by weights in the function above). The weights signify how much an individual should be considered in the model and is a function of their prevalence in the population. For example, individuals with higher weights are less prevalent in the population. In contrast, in a simple random sample, the weights would be equal for every individual. Second, the ids and strata parameters tell R where an individual was sampled. ‘Strata’ are units (such as zipcodes) that are sampled within ‘primary sampling units’ (or PSUs, SDMVPSU) such as a county. Individuals within strata are more correlated with one another than individuals that live in different strata. Therefore, these similarities within sampling strata must be taken into account when estimating correlations.

We can run regression models using the ‘svyglm’ function. For example, suppose I wanted to regress Body Mass Index (coded as BMXBMI in the data.frame) on age (coded as RIDAGEYR), we can use the ‘svyglm’ function (INSTEAD of the regular ‘lm’ function):

mod <-svyglm(BMXBMI ~ RIDAGEYR, dsn)
summary(mod)

Observe that the estimates (and pvalues) will be different when using the ‘lm’ function (give it a shot):

mod2 <- lm(BMXBMI ~ RIDAGEYR, data=NHData.train)
summary(mod2)

In this assignment, we will use the survey package and ‘svydesign’ and ‘svyglm’ functions to run your association tests. These are good functions to know when analyzing data from the public domain.

Answer the questions that are numbered in the space below that says “your answer here”. Supply R code when required by using the back ticks. For example:

Sample Questions (and Answers):

  1. What is the meaning of life?

The meaning of life is 42.

  1. What is e to the power of 3?
exp(3)

A. Background Questions

  1. How are the phenotypes (glucose and BMI) measured in participants of the NHANES? (1)

Pehnotype 1: Body Mass Index (kg/m**2)- Calculated according to this equation: \[\ Weight(kg)/Height^2(m^2)\]

A detailed protocol that standardizes the measurements of weight and height is available per year of survey, different guidelines are available per age group (for example attached the Anthropometry And Physical Activity Monitor Procedures Manual 2005-2006 https://wwwn.cdc.gov/nchs/data/nhanes/2005-2006/manuals/BM.pdf).

“For children and adolescents, cutoff criteria are based on the gender-specific BMI-for-age growth charts: underweight (BMI values < 5th percentile); healthy weight (BMI values 5th-84th percentiles); overweight (BMI values 85th-94th percentiles); and obesity (BMI values ≥ 95th percentile) (Barlow, 2007).

For adults, cutoff criteria are fixed: underweight (BMI values < 18.5); normal weight (BMI values 18.5-24.9); overweight (BMI values 25.0-29.9); obese—Class I (BMI values 30.0-34.9); obese—Class II (BMI values 35.0-39.9); and extremely obese—Class III (BMI values > 40.0) (National Institutes of Health, 1998).”

Phenotype 2: Fasting blood glucose-plasma (mg/dL) The fasting glucose levels are measured in the blood. blood samples are drawn by venipuncture from fasting individuals.

  1. What does representative mean with respect to NHANES? How is it different than the Framingham Heart Study? (2)

The study population of the Framingham heart study represents a certain subset of the population, which is the residents of Framingham Massachusetts. Thus the results of studies based on this cohort are less generalizable. The NHANES dataset focuses on collecting information from the civilian population living in the 50 states of the US including the district of Columbia(Excluding those who live in institutionalized settings), the sampling aims to have a fair representation of the entire population of the US, which means that the results are more generalizable to the US population comparing to the Framingham study.

  1. These phenotypes (glucose and BMI) are related to type 2 diabetes. (a) Draw the E, G, P, and D diagram and annotate what you are investigating in your anticipated NHANES EWASs. (b) What does body mass index measure? What does fasting glucose measures? Why are they important in type 2 diabetes? (2)

a:

b:

  1. Fasting blood glucose is set to detect individuals with impaired glucose tolerance. Fasting plasma glucose (FPG) >= 126 mg/dL is one of the accepted criteria for the diagnosis of Diabetes Mellitus (DM). FPG is in fact very reliable and convenient test to identify DM in asymptomatic individuals.

  2. Body Mass Index is is used to assess weight status in children and adolescents as well as adults.

Those measures are important in type 2 diabetes. The first is an important diagnostic measure as previously mentioned, also impaired glucose tolerance (< 126) is a risk factor for developing DM later on. BMI and obesity are important risk factors in diabetes and are linked to insulin resistance and disease progression.

  1. How are the environmental exposures measured in the NHANES? Choose a few examples (up to 5) of different types of biomarkers of environmental exposures from the ExposureList array and query the NHANES website for the assay description of the biomarker. (2)

Here I looked into 3 different exposures and their measures: 1) URXUIO –> Iodine, urine (ng/mL), it was measured using Inductively coupled plasma-mass spectrometry (ICP-MS) from urine samples.
2) LBXB12 –> Vitamin B12, serum (pg/mL), used from blood samples. The Elecsys Vitamin B12 assay employs a competitive test principle using intrinsic factor specific for vitamin B12. 3) LBXFOL –> Folate, serum in (ng/mL)- “Serum folate is measured by using the Bio-Rad Laboratories”Quantaphase II Folate” radioassay kit. The assay is performed by combining serum or a whole blood hemolysate sample with 125I-folate and cyanide.”

Let’s explore the phenotypes, body mass index (BMXBMI) and fasting glucose (LBXGLU) with respect to demographic characteristics of your population in the training dataset (NHData.train)

  1. How many individuals are there with BMI values? Draw a histogram of BMI and describe the mean and median. Is the distribution “skewed” or biased toward a certain direction on the x axis? (1)
  • your answer here (with R code). Remember, use the NHData.train cohort.

The number of observations (individuals) in the NHData.train dataset is 21004. 17472 individuals have a record of their BMI, and 3532 individuals have missing values.

length(NHData.train$BMXBMI)
## [1] 21004
sum(is.na(NHData.train$BMXBMI))
## [1] 3532
sum(!is.na(NHData.train$BMXBMI))
## [1] 17472
# Missing values will be dropped
BMI_hist <- NHData.train %>% ggplot(aes(BMXBMI)) +       
  geom_histogram(bins=30,color = "#000000", fill = "#5191A9", alpha=0.6) +labs(x='Body Mass Index')+ geom_vline(aes(xintercept=mean(BMXBMI,na.rm=TRUE)),
            color="blue", linetype="dashed", size=1) + geom_vline(aes(xintercept=median(BMXBMI,na.rm=TRUE)),
            color="red", linetype="dashed", size=1)

print(BMI_hist)

mean(NHData.train$BMXBMI,na.rm=TRUE)
## [1] 24.79468
median(NHData.train$BMXBMI,na.rm=TRUE)
## [1] 24.1

The data has a right skewed distribution (right tail) (median < mean). The mean (blue dashed line) is 24.8 kg/m^2. The median (red dashed line) is 24.1 kg/m^2. (P.S the null values were automatically dropped by the ggplot function).

  1. Plot BMI vs age. (1)
  • your answer here
# Missing values will be dropped

BMI_age <- NHData.train %>% ggplot(aes(RIDAGEYR,BMXBMI)) +       
  geom_point()+labs(y='Body Mass Index', x='Age in years')

print(BMI_age)

  1. Plot a boxplot of BMI versus sex. (1)
  • your answer here

I Will create a sex variable (male=0, female=1) (basically this is the same as the female variable, but created it to reduce my confusion).

NHData.train <- NHData.train %>% mutate(sex = ifelse(male == 1, 0, ifelse(female == 1, 1, NA)))
NHData.train$sex <- as.factor(NHData.train$sex)

# checking the values of the new variable
table(NHData.train$sex)
## 
##     0     1 
## 10214 10790
# For later use: creating same variable in the test dataset
NHData.test <- NHData.test %>% mutate(sex = ifelse(male == 1, 0, ifelse(female == 1, 1, NA)))
# Missing values will be dropped
# Will place the categorical variable on the x axis. 

ggplot(NHData.train, aes(x = sex, y =BMXBMI )) +
  geom_boxplot(alpha=0.5,fill = "#5F9EA0") +theme_classic()+labs(x='Sex', y='Body Mass Index')+ scale_x_discrete(labels = c('0' = 'Male' , '1'='Female'))

  1. Plot a boxplot of BMI versus race/ethnicity. (2)
  • your answer here
# Creating a categorical variable of Race/ethnicity with the following values: White American=1, Black American =2, Mexican American=3, Other Hispanic=4, Other Ethnicity=5

NHData.train <- NHData.train %>% mutate(ethnicity = case_when(white ==1  ~ 1, black==1  ~ 2, mexican==1 ~ 3, other_hispanic==1 ~ 4, other_eth==1 ~ 5))
                             
# Converting to categorical variable 
NHData.train$ethnicity <- as.factor(NHData.train$ethnicity)
table(NHData.train$ethnicity)
# Checking the integrity of the recoded variable: ethnicity.
table(NHData.train$white)
table(NHData.train$black)
table(NHData.train$mexican)
table(NHData.train$other_hispanic)
table(NHData.train$other_eth)
# For later use: creating same variable in the test dataset
NHData.test <- NHData.test %>% mutate(ethnicity = case_when(white ==1  ~ 1, black==1  ~ 2, mexican==1 ~ 3, other_hispanic==1 ~ 4, other_eth==1 ~ 5))
                             
# Converting to categorical variable 
NHData.test$ethnicity <- as.factor(NHData.test$ethnicity)
table(NHData.test$ethnicity)
# Missing values will be dropped

ggplot(NHData.train, aes(x = ethnicity, y =BMXBMI )) +
  geom_boxplot(alpha=0.5,fill = "#5F9EA0") +theme_classic()+labs(x='Race/Ethnicity', y='Body Mass Index')+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+scale_x_discrete(labels = c('1' = 'White American' , '2'='Black American', '3'='Mexican American', '4'='Other Hispanic', '5'= 'Other Ethnicity'))

  1. Plot BMI versus an indicator of socioecononomic status, the income to poverty ratio (INDFMPIR). The income-to-poverty ratio is a the household income divided by the household poverty level for a given survey year. Therefore a Income-to-poverty ratio of 1 means the individual has household income equal to the poverty level. (1)
  • your answer here
# Missing values will be dropped
# We have two continuous values, I will visualize using scatter plot. 
# I added ethnicity as a third variable with the hue. 

BMI_income <- NHData.train %>% ggplot(aes(INDFMPIR,BMXBMI,color= ethnicity)) +       
  geom_point()+labs(y='Body Mass Index', x='Income-to-poverty ratio')

print(BMI_income)

  1. Describe the differences in groups (e.g., what is the difference of BMI averages in males vs. females): summarizing the plots from 6-9. (1)
  • your answer here

From visually inspecting the differences between males and females in regard of their BMI, it seems that the two groups are comparable, with similar median and IQR. Statistical tests can be applied to examine significance of the difference. The different ethnic groups look similar as well, with white and black groups having more outliers. BMI seems to be lower in the extremes of age (childhood and adolescence vs elderly) with the highest values in the middle age group. In regard to income-poverty ratio and BMI, I don’t detect a specific association with certain income level.

  1. What statistical model would you use to test the association (and significance) of your findings in 10? Execute the model using svyglm below, using multivariate regression, modeling BMI as a function of age, ethnicity, sex, and INDFMPIR. (3)
  • your answer here.

We can use Ttest to compare the difference in BMI between males and females, and ANOVA test to compare BMI differences between different ethnic groups. Also, since our outcome (dependent variable) is continuous, we can use a Linear regression model to investigate the difference in BMI between different groups. The beta coefficient of the independent variable will inform us about the difference (compared to a reference group) between the groups when adjusting to different variables.

#Data preparation to use with svyglm (adopted from earlier in the file)
dsn <- svydesign(ids=~ SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.train)
# Missing values will be dropped. 
model2 <- svyglm(BMXBMI~ RIDAGEYR+sex+INDFMPIR+ethnicity,dsn)

summary(model2)
## 
## Call:
## svyglm(formula = BMXBMI ~ RIDAGEYR + sex + INDFMPIR + ethnicity, 
##     design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.217088   0.245206  82.449  < 2e-16 ***
## RIDAGEYR     0.152421   0.003439  44.325  < 2e-16 ***
## sex1         0.209020   0.132947   1.572    0.130    
## INDFMPIR    -0.061980   0.056904  -1.089    0.288    
## ethnicity2   1.560491   0.180950   8.624 1.66e-08 ***
## ethnicity3   0.992781   0.190223   5.219 3.10e-05 ***
## ethnicity4   0.553229   0.281339   1.966    0.062 .  
## ethnicity5  -0.012941   0.612393  -0.021    0.983    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 44.90787)
## 
## Number of Fisher Scoring iterations: 2

Using the model you described in 11, report the following:

  1. Change in BMI for a 1 year increase/change in age and significance of this association. (1)
  • your answer here

Holding all other variables constant, a 1 unit increase in age, will lead to an increase of 0.15 in the outcome variable (BMI), this change is statistically significant (p<0.001).

  1. Difference of BMI in males versus females and significance. (1)
  • your answer here

Based on our sample, holding all other variables constant, there is difference in the mean BMI between males and females, females have higher mean BMI compared to males (by 0.209 units), but this difference is not statistically significant (pvalue=0.13).

  1. Difference of BMI in Non-Hispanic Black versus Whites, Mexican American versus Whites, and Other Hispanic vs. White, and Other versus White. (1)
  • your answer here.

In my model, the reference group for the ethnicity variable is white. - Black American have higher mean BMI compared to white American (by 1.56 units increase), the difference is statistically significant. - Mexican Americans have higher mean BMI compared to white American (by 0.99 units increase), the difference is statistically significant. - Other hispanics have higher mean BMI compared to white American (by 0.55 units increase), the difference is NOT statistically significant. - Other ethnicities have lower mean BMI compared to white American (by -0.0129 units decrease), the difference is NOT statistically significant.

  1. Repeat Q11 for fasting glucose (LBXGLU), developing a multivariate linear regression model to associate fasting glucose with age, sex, race, and income-poverty ratio. (3)
  • your answer here
model3 <- svyglm(LBXGLU~ RIDAGEYR+sex+INDFMPIR+ethnicity,dsn)

summary(model3)
## 
## Call:
## svyglm(formula = LBXGLU ~ RIDAGEYR + sex + INDFMPIR + ethnicity, 
##     design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  87.8945     1.0508  83.648  < 2e-16 ***
## RIDAGEYR      0.4186     0.0291  14.388 1.13e-12 ***
## sex1         -5.5616     0.8217  -6.768 8.41e-07 ***
## INDFMPIR     -0.9295     0.2844  -3.268  0.00352 ** 
## ethnicity2    0.1312     1.4520   0.090  0.92884    
## ethnicity3    3.4764     0.9499   3.660  0.00138 ** 
## ethnicity4    3.9009     2.8895   1.350  0.19073    
## ethnicity5    3.0543     3.2390   0.943  0.35594    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1024.717)
## 
## Number of Fisher Scoring iterations: 2

We will explore one biomarker of exposure: serum Lead (LBXBPB).

  1. Plot a histogram of serum lead (LBXBPB) and qualitatively describe the “shape” of the distribution - what does it look like? Is centered or skewed in a direction? (1)
  • your answer here
# Missing values will be dropped

NHData.train %>% ggplot(aes(LBXBPB)) +       
  geom_histogram(bins=100,color = "#000000", fill = "#5191A9", alpha=0.6) +labs(x='Serum Lead')

The variable has a right skewed distribution. We can normalize the distribution using different approaches.

Investigators attempt to make this distribution more “normal” or symmetrically centered around a value by applying a “transformation” to make the linear associations easier to interpret. Since the lead levels seem to have an exponential-like decay a log-based transformation is used. Lets take a look at the log base e of lead levels:

# Missing values will be dropped 

NHData.train <- NHData.train %>% mutate(lead_logged = log(LBXBPB+ .001)) # add 0.001 for values that are 0
p <- ggplot(NHData.train, aes(lead_logged))
p <- p + geom_histogram(bins=30,color = "#000000", fill = "#5191A9", alpha=0.6)+labs(x='Log transformed Serum Lead')

p

Using a multivariate linear regression model, estimate the following:

  1. Execute a multivariate linear regression model to associate log(blood lead levels [plus .001, a small constant]) with age, sex, race, and income-poverty ratio, again in the training cohort (NHData.train) (3)
  • your answer here
#Updating dsn after creating new variable 
dsn <- svydesign(ids=~ SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.train)

# Running the model 
model4 <- svyglm(lead_logged~ RIDAGEYR+sex+INDFMPIR+ethnicity, dsn)

summary(model4)
## 
## Call:
## svyglm(formula = lead_logged ~ RIDAGEYR + sex + INDFMPIR + ethnicity, 
##     design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.4019785  0.0350688  11.463 9.53e-11 ***
## RIDAGEYR     0.0109253  0.0005453  20.035 1.28e-15 ***
## sex1        -0.4316016  0.0127236 -33.921  < 2e-16 ***
## INDFMPIR    -0.0648098  0.0054698 -11.849 5.07e-11 ***
## ethnicity2   0.1718230  0.0224377   7.658 1.21e-07 ***
## ethnicity3   0.1010466  0.0355791   2.840  0.00953 ** 
## ethnicity4   0.0003425  0.0475020   0.007  0.99431    
## ethnicity5  -0.0401062  0.0386275  -1.038  0.31042    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.4047992)
## 
## Number of Fisher Scoring iterations: 2
  1. Could demographic variables be confounders in a test of association between Lead and BMI or glucose? Why or why not? (2)
  • your answer here

Yes, demographic variables can be confounders. For example, in the US, lead has historically been used as an ingredient in many types of paint, including residential and industrial paints. Older houses might still have old paint that exposes the residents to the toxic outcomes of lead (today and given the awareness of the toxic effects of lead we see this less), but people with lower socioeconomic status are differentialy more exposed to lead (residing in older houses) Other exposures might be related to certain type of works, that are more prevalent in a certain gender (due to cultural norms).

B. Executing environment-wide associations

Logging and scaling exposure variables

Logging variables

As above, to handle the right skew of exposure biomarkers, the measurements are log transformed. We add a constant (0.001). If we have a model such as the following:

model_log <- svyglm(LBXGLU ~ log(LBXBPB + .001), dsn)
summary(model_log)
## 
## Call:
## svyglm(formula = LBXGLU ~ log(LBXBPB + 0.001), design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          98.2936     0.6279 156.535  < 2e-16 ***
## log(LBXBPB + 0.001)   4.3218     0.6638   6.511 4.68e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1107.804)
## 
## Number of Fisher Scoring iterations: 2

How would you interpret the coefficient? Answer: for a 1 unit change in the log10(exposure), you have a 10 unit increase (9.95) in fasting glucose mg/dL.

Another way to interpret this is in terms of (percentages)[https://stats.oarc.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-terms-of-percent-change-in-linear-regression/]: a 1% change in the lead levels is associated with a 0.04 mg/dL (=4.32/100) higher serum glucose level. However this is association is not independent of potential confounding variables.

Scaling variables

Different variables have different units (e.g., mg/dL for lead levels). When we execute EWAS, we wish for every association to be comparable, or on the same units. Therefore, we need to “scale” the exposure variables to be on the same unit for each variable. One intuitive way to scale is to make all variables on the unit of standard deviation of that variable (e.g., sd(exposure)), dividing each measurement for each individual by the standard deviation of the population after subtracting the mean (centering the distribution at “0”). In code, to standardize or scale lead levels (after logging):

NHData.train <- NHData.train %>% mutate(lead_standardized = scale(log(LBXBPB+.001))) # Ways to standardize or "scale" LBXBPB

NHData.train %>% summarize(m1= mean(lead_standardized, na.rm=T), m2 = mean(log(LBXBPB + .001), na.rm=T), s1 = sd(lead_standardized, na.rm = T), s2=sd(log(LBXBPB + .001), na.rm=T)) ## m1 should be 0, s1 should be equal to 1 
##             m1        m2 s1        s2
## 1 2.872113e-17 0.4598617  1 0.7132666

Doing EWAS

Now we will test each of the exposures in ExposureList for linear association with a quantitative or continuous trait in the training and testing datasets separately.

The script should input a the training or testing datasets and output a tibble that contains the exposure ID (e.g., LBXBPB),exposure name, phenotype name,estimate,standard error, pvalue, an false discovery rate (FDR).

Your code should execute associations that associate between each exposure in ExposureList and the phenotype (e.g., BMI). The association should quantify how much the phenotype changes for a 1 standard deviation change in the logarithm base 10 of the exposure value (e.g., scale(log(exposure + 0.001))).

Because some of the exposure variables contain 0 values we need to add a constant before doing a log transformation. Choosing a constant is a matter of hot debate. For the sake of the exercise, use a small constant equal to 0.001. The dependent variable is the phenotype, and the independent variable is an exposure. Include age, sex, race/ethnicity, and income/poverty ratio (INDFMPIR) as covariates.

Execute the following:

  1. EWAS on 1 standard deviation change of Body Mass Index (BMXBMI) in the training and testing datasets separately. Use ‘svyglm’ as the regression approach of choice (see function below). (5)
# here is an (optional) function you can use to take in phenotype and exposure variables as "string" input to a linear regression:

### function to do a pairwise association between a phenotype and exposure
association <- function(phenotype="BMXBMI", exposure="LBXBCD", covariates=c("RIDAGEYR","sex","INDFMPIR"), dsn) {
  # phenotype: a string that is a phenotype name (e.g., "BMXBMI")
  # exposure: a string this is a exposure name (e.g., "LBXBCD")
  # covariates: an array of covariate variables
  # dsn: the survey design object
  covariate_string <- paste(covariates, collapse="+")
  mod_string <- sprintf('scale(%s) ~ scale(log(%s + .001)) + %s', phenotype, exposure, covariate_string)
  # the above formats formula as a string 
  mod <- svyglm(as.formula(mod_string), dsn)
  return(mod)
}

# run an association between BMI and Blood cadmium, adjusting for male sex and age in years
assoc_test_model <- association("BMXBMI", "LBXBCD", dsn=dsn)
model_summary_example <- summary(assoc_test_model)
# your code here:
# run the association function for each exposure in ExposureList

# I will use the association function that you provided, will use a loop to iterate over the exposure list (ExposureList) and report the results.

# I had to recode sex into double, for some reason I kept getting an error when it was factor (even though it was 2 level factor)- this will not affect the results since it is a binary variable. 
NHData.train$sex <- as.double(NHData.train$sex)
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.train)


# Training Data set- BMI as a phenotype 

# Create an empty dataframe to store the results
train_model_Summary_BMI = data.frame(Exposure=character(), term = character(), beta_coefficient = numeric(), sd = numeric(), p_value = numeric())
covariates=c("RIDAGEYR","sex","ethnicity","INDFMPIR")

for (exposure in ExposureList) {
  mod <- association(phenotype = "BMXBMI", exposure = exposure, covariates=covariates,dsn = dsn)
  model_summary=summary(mod)

train_model_Summary_BMI =rbind(train_model_Summary_BMI, data.frame(Exposure =exposure,   term=row.names(model_summary$coefficients),beta_coefficient=model_summary$coefficients[, "Estimate"], sd=model_summary$coefficients[, "Std. Error"],p_value= model_summary$coefficients[, "Pr(>|t|)"]))
  rownames(train_model_Summary_BMI) <- NULL
}
# Test Dataset- BMI as a phenotype 

# Creating the survey dataset 
dsnt <- svydesign(ids=~ SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.test)

# Create an empty dataframe to store the results
test_model_Summary_BMI = data.frame(Exposure=character(), term = character(), beta_coefficient = numeric(),sd = numeric(),p_value = numeric())
covariates=c("RIDAGEYR","sex","ethnicity","INDFMPIR")

for (exposure in ExposureList) {
  mod <- association(phenotype = "BMXBMI", exposure = exposure,covariates=covariates, dsn = dsnt)
  model_summary=summary(mod)

test_model_Summary_BMI =rbind(test_model_Summary_BMI, data.frame(Exposure =exposure,   term=row.names(model_summary$coefficients),beta_coefficient=model_summary$coefficients[, "Estimate"], sd=model_summary$coefficients[, "Std. Error"],p_value= model_summary$coefficients[, "Pr(>|t|)"]))
  
rownames(test_model_Summary_BMI) <- NULL
}
  1. EWAS on 1 standard deviation of fasting blood glucose (LBXGLU) in the training and testing datasets separately. (1)
# your code here
# run the association function for each exposure in ExposureList


# Training Dataset- Fasting blood glucose as a phenotype 

# Create an empty dataframe to store the results
train_model_Summary_LBXGLU = data.frame(Exposure=character(), term = character(), beta_coefficient = numeric(), sd = numeric(), p_value = numeric())
covariates=c("RIDAGEYR","sex","ethnicity","INDFMPIR")

for (exposure in ExposureList) {
  mod <- association(phenotype = "LBXGLU", exposure = exposure, covariates=covariates,dsn = dsn)
  model_summary=summary(mod)

train_model_Summary_LBXGLU =rbind(train_model_Summary_LBXGLU, data.frame(Exposure =exposure,   term=row.names(model_summary$coefficients),beta_coefficient=model_summary$coefficients[, "Estimate"], sd=model_summary$coefficients[, "Std. Error"],p_value= model_summary$coefficients[, "Pr(>|t|)"]))
  rownames(train_model_Summary_LBXGLU) <- NULL
}
# Test Dataset- Fasting blood glucose as a phenotype 

# Creating the survey dataset 
dsnt <- svydesign(ids=~ SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.test)

# Create an empty dataframe to store the results
test_model_Summary_LBXGLU = data.frame(Exposure=character(), term = character(), beta_coefficient = numeric(),sd = numeric(),p_value = numeric())
covariates=c("RIDAGEYR","sex","ethnicity","INDFMPIR")

for (exposure in ExposureList) {
  mod <- association(phenotype = "LBXGLU", exposure = exposure,covariates=covariates, dsn = dsnt)
  model_summary=summary(mod)

test_model_Summary_LBXGLU =rbind(test_model_Summary_LBXGLU, data.frame(Exposure =exposure,   term=row.names(model_summary$coefficients),beta_coefficient=model_summary$coefficients[, "Estimate"], sd=model_summary$coefficients[, "Std. Error"],p_value= model_summary$coefficients[, "Pr(>|t|)"]))
  
rownames(test_model_Summary_LBXGLU) <- NULL
}
  1. Use the ‘p.adjust’ function in R to estimate the Benjamini-Hochberg False Discovery Rate (the method parameter should be “BH” or “FDR”) for the exposure-phenotype associations in the training data cohort. Specifically, find or filter by all coefficients of the exposures (in ExposureList) and estimate the FDR for the pvalues that correspond to those coefficients. (1)
# Training dataset

# BMI as a phenotype 
# Create a dataframe with all the pvalues for the exposures (estimated 160) (extracted from the model summary)

train_BMI_pvalue_adj <- train_model_Summary_BMI %>%  select(c(Exposure,term, p_value))

# Selecting the rows with the exposures 
train_BMI_pvalue_adj <- train_BMI_pvalue_adj[grepl("^scale", train_BMI_pvalue_adj$term), ]

# Adding a column "FDR" with the adjusted p.value
train_BMI_pvalue_adj <- train_BMI_pvalue_adj %>%  mutate(FDR= p.adjust(p_value))


train_BMI_pvalue_adj %>% select(-c(term)) %>%  as.data.frame() %>% kable() %>% add_header_above(c(" " = 1, "Phenotype: BMI" = 3)) %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")
Phenotype: BMI
Exposure p_value FDR
2 LBXBCD 0.3195187 1.0000000
11 LBXBPB 0.0000000 0.0000000
20 LBXTHG 0.0076044 0.7921536
29 LBXIHG 0.9544587 1.0000000
38 LBXCOT 0.0062644 0.6765554
47 URXUBA 0.0026423 0.3065035
56 URXUBE 0.8341129 1.0000000
65 URXUCD 0.0531901 1.0000000
74 URXUCO 0.8573066 1.0000000
83 URXUCS 0.0359950 1.0000000
92 URXUMO 0.6599336 1.0000000
101 URXUPB 0.0006536 0.0784369
110 URXUPT 0.0100096 1.0000000
119 URXUSB 0.0104021 1.0000000
128 URXUTL 0.0000026 0.0003625
137 URXUTU 0.2662449 1.0000000
146 URXUUR 0.6184358 1.0000000
155 LBXRBF 0.6032394 1.0000000
164 LBXB12 0.0000000 0.0000000
173 LBXFOL 0.0000000 0.0000000
182 URXUHG 0.0206495 1.0000000
190 URXUIO 0.1032021 1.0000000
199 LBXVID 0.0000022 0.0003130
208 LBXPFOA 0.4054300 1.0000000
217 LBXPFOS 0.7653314 1.0000000
226 LBXPFHS 0.0724645 1.0000000
235 LBXEPAH 0.2114645 1.0000000
244 LBXMPAH 0.7368422 1.0000000
253 LBXPFDE 0.0542814 1.0000000
262 LBXPFHP 0.1661517 1.0000000
271 LBXPFNA 0.1784799 1.0000000
280 LBXPFSA 0.0835150 1.0000000
289 LBXPFUA 0.0104107 1.0000000
298 LBXPFDO 0.2205320 1.0000000
307 URXMBP 0.6394247 1.0000000
316 URXMCP 0.7807973 1.0000000
325 URXMEP 0.0000012 0.0001770
334 URXMHP 0.1788226 1.0000000
343 URXMNP 0.1649842 1.0000000
352 URXMOP 0.1692863 1.0000000
361 URXMZP 0.0172847 1.0000000
370 URXMNM 0.7613717 1.0000000
379 URXMC1 0.3450090 1.0000000
388 URXMHH 0.0020281 0.2393138
397 URXMOH 0.0045881 0.5138655
406 URXMIB 0.1808406 1.0000000
415 LBX028 0.0476617 1.0000000
424 LBX066 0.0416295 1.0000000
433 LBX074 0.0075894 0.7921536
442 LBX105 0.0174660 1.0000000
451 LBX118 0.0001078 0.0140165
460 LBX156 0.0000035 0.0004926
469 LBX157 0.6646874 1.0000000
478 LBX167 0.9671198 1.0000000
487 LBX189 0.5571675 1.0000000
496 LBXD01 0.3874104 1.0000000
505 LBXD02 0.1632009 1.0000000
514 LBXD03 0.0065394 0.6997202
523 LBXD04 0.0120424 1.0000000
532 LBXD05 0.0000078 0.0010812
541 LBXD07 0.0002471 0.0313780
550 LBXF01 0.3103989 1.0000000
559 LBXF02 0.5395425 1.0000000
568 LBXF03 0.5441765 1.0000000
577 LBXF04 0.0000004 0.0000539
586 LBXF05 0.0248908 1.0000000
595 LBXF06 0.1278376 1.0000000
604 LBXF07 0.1354041 1.0000000
613 LBXF08 0.0115322 1.0000000
622 LBXF09 0.0403998 1.0000000
631 LBXF10 0.0048935 0.5431783
640 LBXPCB 0.0000012 0.0001782
649 LBXTC2 0.3422942 1.0000000
658 LBXHXC 0.0000013 0.0001815
667 LBXTCD 0.4352769 1.0000000
676 LBX052 0.0005042 0.0620127
685 LBX087 0.6499132 1.0000000
694 LBX099 0.0014810 0.1762418
703 LBX101 0.5486068 1.0000000
712 LBX110 0.4037700 1.0000000
721 LBX128 0.0542540 1.0000000
730 LBX138 0.3540049 1.0000000
739 LBX146 0.0023781 0.2782415
748 LBX149 0.2241002 1.0000000
757 LBX151 0.1416618 1.0000000
766 LBX153 0.0201897 1.0000000
775 LBX170 0.0000103 0.0014024
784 LBX172 0.1325569 1.0000000
793 LBX177 0.5836161 1.0000000
802 LBX178 0.0038537 0.4354637
811 LBX180 0.0000016 0.0002235
820 LBX183 0.3078027 1.0000000
829 LBX187 0.0006355 0.0768899
838 LBX194 0.0001191 0.0153597
847 LBX195 0.5192174 1.0000000
856 LBX196 0.0002691 0.0339125
865 LBD199 0.0002766 0.0345706
874 LBX206 0.0001403 0.0179536
883 LBXHCB 0.8078181 1.0000000
892 LBXBHC 0.0000336 0.0044999
901 LBXGHC 0.2048413 1.0000000
910 LBXPDE 0.0726419 1.0000000
919 LBXPDT 0.0004995 0.0619334
928 LBXODT 0.0094251 0.9613633
937 LBXOXY 0.0637211 1.0000000
946 LBXTNA 0.0459136 1.0000000
955 LBXHPE 0.0000000 0.0000010
964 LBXMIR 0.0450957 1.0000000
973 LBXALD 0.1792932 1.0000000
982 LBXDIE 0.0000647 0.0085417
991 LBXEND 0.1827504 1.0000000
1000 URXP01 0.9399434 1.0000000
1009 URXP02 0.0005182 0.0632239
1018 URXP03 0.0049827 0.5480947
1027 URXP04 0.0000455 0.0060562
1036 URXP05 0.0602737 1.0000000
1045 URXP06 0.0000898 0.0117703
1054 URXP07 0.0000011 0.0001658
1063 URXP08 0.6753242 1.0000000
1072 URXP10 0.0697394 1.0000000
1081 URXP11 0.9291868 1.0000000
1090 URXP12 0.0110013 1.0000000
1099 URXP13 0.6571452 1.0000000
1108 URXP14 0.3413248 1.0000000
1117 URXP15 0.0028134 0.3235455
1126 URXP16 0.9354066 1.0000000
1135 URXP17 0.0052727 0.5747190
1144 URXP19 0.0075443 0.7921536
1153 URXP20 0.0066907 0.7092133
1162 URXP21 0.1529976 1.0000000
1171 URXP22 0.7391207 1.0000000
1180 URXP24 0.9327224 1.0000000
1189 LBXIRN 0.0000001 0.0000209
1198 LBXVIE 0.8894893 1.0000000
1207 LBXALC 0.0000008 0.0001123
1216 LBXBEC 0.0000000 0.0000012
1225 LBXCBC 0.0000001 0.0000199
1234 LBXCRY 0.0000018 0.0002544
1243 LBXGTC 0.0000000 0.0000000
1252 LBXLUZ 0.0000002 0.0000232
1261 LBXLYC 0.0163420 1.0000000
1270 LBXRPL 0.0037080 0.4227151
1279 LBXRST 0.0000096 0.0013156
1288 LBXVIA 0.0000013 0.0001815
1297 URX14D 0.0132166 1.0000000
1306 URX1TB 0.1450205 1.0000000
1315 URX3TB 0.0737432 1.0000000
1324 URXCBF 0.4927722 1.0000000
1333 URXPCP 0.9379043 1.0000000
1342 URXOPP 0.7896600 1.0000000
1351 URXDPY 0.9302191 1.0000000
1360 URXMET 0.9309390 1.0000000
1369 URXTCC 0.0248860 1.0000000
1378 LBXSPH 0.0000003 0.0000506
1387 URXDAZ 0.0447449 1.0000000
1396 URXDMA 0.8935126 1.0000000
1405 URXEQU 0.9083837 1.0000000
1414 URXETD 0.0914072 1.0000000
1423 URXETL 0.0000109 0.0014760
1432 URXGNS 0.2845850 1.0000000
# Fasting blood glucose as a phenotype 

train_LBXGLU_pvalue_adj <- train_model_Summary_LBXGLU %>%  select(c(Exposure,term, p_value))

# Selecting the rows with the exposures 
train_LBXGLU_pvalue_adj <- train_LBXGLU_pvalue_adj[grepl("^scale", train_LBXGLU_pvalue_adj$term), ]

# Adding a column "FDR" with the adjusted p.value
train_LBXGLU_pvalue_adj <- train_LBXGLU_pvalue_adj %>%  mutate(FDR= p.adjust(p_value))

train_LBXGLU_pvalue_adj %>% select(-c(term)) %>%  as.data.frame() %>% kable() %>% add_header_above(c(" " = 1, "Phenotype: Fasting blood glucose" = 3)) %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")
Phenotype: Fasting blood glucose
Exposure p_value FDR
2 LBXBCD 0.2524382 1.0000000
11 LBXBPB 0.0099156 1.0000000
20 LBXTHG 0.8737893 1.0000000
28 LBXIHG 0.6528913 1.0000000
36 LBXCOT 0.5066388 1.0000000
45 URXUBA 0.8637950 1.0000000
54 URXUBE 0.8564783 1.0000000
63 URXUCD 0.0436044 1.0000000
72 URXUCO 0.7561466 1.0000000
81 URXUCS 0.1820190 1.0000000
90 URXUMO 0.3752412 1.0000000
99 URXUPB 0.0955528 1.0000000
108 URXUPT 0.0123237 1.0000000
117 URXUSB 0.9557703 1.0000000
126 URXUTL 0.0579288 1.0000000
135 URXUTU 0.5287842 1.0000000
144 URXUUR 0.7742112 1.0000000
153 LBXRBF 0.0755413 1.0000000
162 LBXB12 0.0541075 1.0000000
171 LBXFOL 0.5739248 1.0000000
180 URXUHG 0.0082958 1.0000000
188 URXUIO 0.3039359 1.0000000
197 LBXVID 0.0045006 0.6750972
206 LBXPFOA 0.2662639 1.0000000
215 LBXPFOS 0.8510614 1.0000000
224 LBXPFHS 0.7351347 1.0000000
233 LBXEPAH 0.3918283 1.0000000
242 LBXMPAH 0.6184530 1.0000000
251 LBXPFDE 0.3184249 1.0000000
260 LBXPFHP 0.1498731 1.0000000
269 LBXPFNA 0.9678252 1.0000000
278 LBXPFSA 0.7087865 1.0000000
287 LBXPFUA 0.0947895 1.0000000
296 LBXPFDO 0.0208490 1.0000000
305 URXMBP 0.0787413 1.0000000
314 URXMCP 0.8602032 1.0000000
323 URXMEP 0.6673060 1.0000000
332 URXMHP 0.0156763 1.0000000
341 URXMNP 0.3080442 1.0000000
350 URXMOP 0.1003312 1.0000000
359 URXMZP 0.4724653 1.0000000
368 URXMNM 0.7466823 1.0000000
377 URXMC1 0.3502295 1.0000000
386 URXMHH 0.1850945 1.0000000
395 URXMOH 0.1802506 1.0000000
404 URXMIB 0.6273750 1.0000000
413 LBX028 0.9652125 1.0000000
422 LBX066 0.1034568 1.0000000
431 LBX074 0.0381726 1.0000000
440 LBX105 0.0874380 1.0000000
449 LBX118 0.0004591 0.0730005
458 LBX156 0.2574891 1.0000000
467 LBX157 0.9227788 1.0000000
476 LBX167 0.3573418 1.0000000
485 LBX189 0.5066463 1.0000000
494 LBXD01 0.5290067 1.0000000
503 LBXD02 0.1144996 1.0000000
512 LBXD03 0.0594733 1.0000000
521 LBXD04 0.0702706 1.0000000
530 LBXD05 0.0124010 1.0000000
539 LBXD07 0.1290237 1.0000000
548 LBXF01 0.3270192 1.0000000
557 LBXF02 0.8788898 1.0000000
566 LBXF03 0.0143752 1.0000000
575 LBXF04 0.0207212 1.0000000
584 LBXF05 0.0281421 1.0000000
593 LBXF06 0.5258286 1.0000000
602 LBXF07 0.2007177 1.0000000
611 LBXF08 0.0525535 1.0000000
620 LBXF09 0.1840493 1.0000000
629 LBXF10 0.1728195 1.0000000
638 LBXPCB 0.0017955 0.2782985
647 LBXTC2 0.0968072 1.0000000
656 LBXHXC 0.0310668 1.0000000
665 LBXTCD 0.4790167 1.0000000
674 LBX052 0.2139956 1.0000000
683 LBX087 0.5294871 1.0000000
692 LBX099 0.0378590 1.0000000
701 LBX101 0.5871605 1.0000000
710 LBX110 0.6072532 1.0000000
719 LBX128 0.4314588 1.0000000
728 LBX138 0.0272366 1.0000000
737 LBX146 0.0939591 1.0000000
746 LBX149 0.4869590 1.0000000
755 LBX151 0.4851366 1.0000000
764 LBX153 0.0336878 1.0000000
773 LBX170 0.0301907 1.0000000
782 LBX172 0.2610474 1.0000000
791 LBX177 0.1915411 1.0000000
800 LBX178 0.6080314 1.0000000
809 LBX180 0.0929561 1.0000000
818 LBX183 0.1034680 1.0000000
827 LBX187 0.0194162 1.0000000
836 LBX194 0.6321377 1.0000000
845 LBX195 0.4980020 1.0000000
854 LBX196 0.1590014 1.0000000
863 LBD199 0.2854395 1.0000000
872 LBX206 0.6053112 1.0000000
881 LBXHCB 0.1359256 1.0000000
890 LBXBHC 0.0013993 0.2182849
899 LBXGHC 0.7282429 1.0000000
908 LBXPDE 0.0018151 0.2795214
917 LBXPDT 0.0006299 0.0995209
926 LBXODT 0.9597318 1.0000000
935 LBXOXY 0.0330669 1.0000000
944 LBXTNA 0.0338404 1.0000000
953 LBXHPE 0.0043534 0.6573670
962 LBXMIR 0.3653136 1.0000000
971 LBXALD 0.4833529 1.0000000
980 LBXDIE 0.0283430 1.0000000
989 LBXEND 0.5020156 1.0000000
998 URXP01 0.7068498 1.0000000
1007 URXP02 0.9118684 1.0000000
1016 URXP03 0.6687521 1.0000000
1025 URXP04 0.1121305 1.0000000
1034 URXP05 0.2378190 1.0000000
1043 URXP06 0.1683708 1.0000000
1052 URXP07 0.0158414 1.0000000
1061 URXP08 0.6678138 1.0000000
1070 URXP10 0.1213364 1.0000000
1079 URXP11 0.2429619 1.0000000
1088 URXP12 0.2239252 1.0000000
1097 URXP13 0.3462016 1.0000000
1106 URXP14 0.1889930 1.0000000
1115 URXP15 0.9654468 1.0000000
1124 URXP16 0.8236533 1.0000000
1133 URXP17 0.1697639 1.0000000
1142 URXP19 0.7135169 1.0000000
1151 URXP20 0.3700847 1.0000000
1160 URXP21 0.6963767 1.0000000
1169 URXP22 0.8653249 1.0000000
1178 URXP24 0.3617239 1.0000000
1187 LBXIRN 0.0081613 1.0000000
1196 LBXVIE 0.4844879 1.0000000
1205 LBXALC 0.0187917 1.0000000
1214 LBXBEC 0.0021148 0.3214506
1223 LBXCBC 0.0009141 0.1435199
1232 LBXCRY 0.0020868 0.3192807
1241 LBXGTC 0.0000000 0.0000048
1250 LBXLUZ 0.0117679 1.0000000
1259 LBXLYC 0.0442429 1.0000000
1268 LBXRPL 0.7692859 1.0000000
1277 LBXRST 0.0866710 1.0000000
1286 LBXVIA 0.3413536 1.0000000
1295 URX14D 0.7273087 1.0000000
1304 URX1TB 0.0199153 1.0000000
1313 URX3TB 0.8332302 1.0000000
1322 URXCBF 0.7636764 1.0000000
1331 URXPCP 0.7997655 1.0000000
1340 URXOPP 0.2847122 1.0000000
1349 URXDPY 0.7779530 1.0000000
1358 URXMET 0.2600918 1.0000000
1367 URXTCC 0.6232722 1.0000000
1376 LBXSPH 0.1312346 1.0000000
1385 URXDAZ 0.2820743 1.0000000
1394 URXDMA 0.2668222 1.0000000
1403 URXEQU 0.3679812 1.0000000
1412 URXETD 0.7907561 1.0000000
1421 URXETL 0.8667247 1.0000000
1430 URXGNS 0.9148936 1.0000000

Combining all together into a final dataset per Phenotype for the training dataset.

Phenotype: BMI

train_summary_final_BMI- is a nested dataframe, it is nested by the exposure and the exposure description. For every exposure under the “data” column, you can see the full adjusted linear model of association. For each coefficient, the following information is displayed: Data: estimate,standard error, pvalue, and false discovery rate (FDR) for the exposure variable.

# Merging the model summary with FDR.
train_BMI_pvalue_adj_reduced <- train_BMI_pvalue_adj %>% select(c(term, FDR))
train_BMI_model_Summary_FDR <- merge(train_model_Summary_BMI,train_BMI_pvalue_adj_reduced  , by = "term", all.x = TRUE)
# Nesting the dataframe by Exposure 
train_model_Summary_nested_BMI <-train_BMI_model_Summary_FDR %>% nest_by(Exposure)
# Adding the description of the exposure ID.
exposure_def <- ExposureDescription %>% select(c(var,var_desc))
names(exposure_def)[1] <- "Exposure"
names(exposure_def)[2] <- "Exposure Description"

# Deleting duplicate rows, using the exposure column since the exposure description column has subtle changes in the description of element (like capital vs small letters), that won't be detected as duplicates. 
duplicates <- duplicated(exposure_def$Exposure)
exposure_def <- exposure_def[!duplicates, ]
# Merging in the description of each exposure (inner join)
train_summary_final_BMI <- merge(train_model_Summary_nested_BMI,exposure_def, by = "Exposure", all = FALSE)
# Customizing the output dataframe 
new_order <- c("Exposure", "Exposure Description", "data")
train_summary_final_BMI <- train_summary_final_BMI[new_order]

Displaying a DataFrame with only the Exposures and their relevant measures:

train_BMI_model_exposures_only <- train_BMI_model_Summary_FDR
train_BMI_model_exposures_only <- train_BMI_model_exposures_only[grepl("^scale", train_BMI_model_exposures_only$term), ]
train_BMI_model_exposures_only <- train_BMI_model_exposures_only %>% select(-term)
train_BMI_model_exposures_only <- merge(train_BMI_model_exposures_only,exposure_def, by = "Exposure", all = FALSE)
new_order <- c("Exposure", "Exposure Description", "beta_coefficient", "sd","p_value","FDR")
train_BMI_model_exposures_only <- train_BMI_model_exposures_only[new_order]



train_BMI_model_exposures_only %>% as.data.frame() %>%  setNames(c("Exposure", "Exposure Description", "Beta Coefficient", "Standard Error","pValue", "FDR"))%>% kable() %>%  add_header_above(c(" " = 1, "Phenotype: BMI" = 5))%>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")
Phenotype: BMI
Exposure Exposure Description Beta Coefficient Standard Error pValue FDR
LBD199 PCB199 (ng/g) -0.3408196 0.0508426 0.0002766 0.0345706
LBX028 PCB28 (ng/g) 0.0915053 0.0368643 0.0476617 1.0000000
LBX052 PCB52 (ng/g) 0.0795955 0.0193847 0.0005042 0.0620127
LBX066 PCB66 (ng/g) 0.0420634 0.0193845 0.0416295 1.0000000
LBX074 PCB74 (ng/g) 0.0807489 0.0273424 0.0075894 0.7921536
LBX087 PCB87 (ng/g) 0.0074455 0.0157070 0.6499132 1.0000000
LBX099 PCB99 (ng/g) 0.0762283 0.0208611 0.0014810 0.1762418
LBX101 PCB101 (ng/g) 0.0169128 0.0277400 0.5486068 1.0000000
LBX105 PCB105 (ng/g) 0.0567227 0.0219861 0.0174660 1.0000000
LBX110 PCB110 (ng/g) 0.0162001 0.0182331 0.4037700 1.0000000
LBX118 PCB118 (ng/g) 0.1117064 0.0235063 0.0001078 0.0140165
LBX128 PCB128 (ng/g) 0.0577789 0.0283394 0.0542540 1.0000000
LBX138 PCB138 & 158 (ng/g) 0.0241376 0.0254667 0.3540049 1.0000000
LBX146 PCB146 (ng/g) -0.0830151 0.0240366 0.0023781 0.2782415
LBX149 PCB149 (ng/g) 0.0222698 0.0166993 0.2241002 1.0000000
LBX151 PCB151 (ng/g) 0.0280933 0.0169630 0.1416618 1.0000000
LBX153 PCB153 (ng/g) -0.0778833 0.0309886 0.0201897 1.0000000
LBX156 PCB156 (ng/g) -0.1236637 0.0198585 0.0000035 0.0004926
LBX157 PCB157 (ng/g) 0.0115978 0.0263800 0.6646874 1.0000000
LBX167 PCB167 (ng/g) -0.0010927 0.0261936 0.9671198 1.0000000
LBX170 PCB170 (ng/g) -0.1616512 0.0280853 0.0000103 0.0014024
LBX172 PCB172 (ng/g) -0.0307905 0.0196760 0.1325569 1.0000000
LBX177 PCB177 (ng/g) -0.0132860 0.0238655 0.5836161 1.0000000
LBX178 PCB178 (ng/g) -0.0764934 0.0235545 0.0038537 0.4354637
LBX180 PCB180 (ng/g) -0.2086052 0.0316211 0.0000016 0.0002235
LBX183 PCB183 (ng/g) -0.0220815 0.0211261 0.3078027 1.0000000
LBX187 PCB187 (ng/g) -0.1127399 0.0281197 0.0006355 0.0768899
LBX189 PCB189 (ng/g) 0.0108575 0.0176159 0.5571675 1.0000000
LBX194 PCB194 (ng/g) -0.3778197 0.0492534 0.0001191 0.0153597
LBX195 PCB195 (ng/g) 0.0114952 0.0169407 0.5192174 1.0000000
LBX196 PCB196 & 203 (ng/g) -0.2656293 0.0394514 0.0002691 0.0339125
LBX206 PCB206 (ng/g) -0.1628048 0.0217808 0.0001403 0.0179536
LBXALC a-Carotene(ug/dL) -0.1916125 0.0116597 0.0000008 0.0001123
LBXALD Aldrin (ng/g) 0.0213970 0.0143398 0.1792932 1.0000000
LBXB12 Vitamin B12, serum (pg/mL) -0.2047473 0.0103057 0.0000000 0.0000000
LBXBCD Cadmium (ug/L) -0.0106824 0.0104770 0.3195187 1.0000000
LBXBEC trans-b-carotene(ug/dL) -0.2583417 0.0081134 0.0000000 0.0000012
LBXBHC Beta-hexachlorocyclohexane (ng/g) 0.1435895 0.0273696 0.0000336 0.0044999
LBXBPB Lead (ug/dL) -0.2035729 0.0159326 0.0000000 0.0000000
LBXCBC cis-b-carotene(ug/dL) -0.2409739 0.0113488 0.0000001 0.0000199
LBXCOT Cotinine (ng/mL) 0.0358362 0.0117989 0.0062644 0.6765554
LBXCRY b-cryptoxanthin(ug/dL) -0.2368754 0.0163658 0.0000018 0.0002544
LBXD01 1,2,3,7,8-pncdd (fg/g) -0.0194447 0.0220296 0.3874104 1.0000000
LBXD02 1,2,3,4,7,8-hxcdd (fg/g) 0.0696474 0.0447041 0.1632009 1.0000000
LBXD03 1,2,3,6,7,8-hxcdd (fg/g) 0.0579234 0.0191896 0.0065394 0.6997202
LBXD04 1,2,3,7,8,9-hxcdd (fg/g) 0.0536833 0.0195327 0.0120424 1.0000000
LBXD05 1,2,3,4,6,7,8-hpcdd (fg/g) 0.1598730 0.0272081 0.0000078 0.0010812
LBXD07 1,2,3,4,6,7,8,9-ocdd (fg/g) 0.1318667 0.0299399 0.0002471 0.0313780
LBXDIE Dieldrin (ng/g) 0.3250939 0.0385232 0.0000647 0.0085417
LBXEND Endrin (ng/g) 0.0213904 0.0144656 0.1827504 1.0000000
LBXEPAH 2-(N-ethyl-PFOSA) acetate 0.0384728 0.0275097 0.2114645 1.0000000
LBXF01 2,3,7,8-tcdf (fg/g) 0.0262069 0.0252113 0.3103989 1.0000000
LBXF02 1,2,3,7,8-pncdf (fg/g) 0.0135876 0.0217855 0.5395425 1.0000000
LBXF03 2,3,4,7,8-pncdf (fg/g) -0.0128551 0.0208510 0.5441765 1.0000000
LBXF04 1,2,3,4,7,8-hxcdf (fg/g) 0.1348533 0.0185184 0.0000004 0.0000539
LBXF05 1,2,3,6,7,8-hxcdf (fg/g) 0.0455173 0.0188408 0.0248908 1.0000000
LBXF06 1,2,3,7,8,9-hxcdf (fg/g) 0.0308585 0.0194651 0.1278376 1.0000000
LBXF07 2,3,4,6,7,8-hxcdf (fg/g) 0.0251465 0.0161939 0.1354041 1.0000000
LBXF08 1,2,3,4,6,7,8-hpcdf (fg/g) 0.0438385 0.0158389 0.0115322 1.0000000
LBXF09 1,2,3,4,7,8,9-hpcdf (fg/g) 0.0697482 0.0277887 0.0403998 1.0000000
LBXF10 1,2,3,4,6,7,8,9-ocdf (fg/g) 0.0622894 0.0198089 0.0048935 0.5431783
LBXFOL Folate, serum (ng/mL) -0.1905314 0.0096388 0.0000000 0.0000000
LBXGHC Gamma-hexachlorocyclohexane (ng/g) 0.0403834 0.0308621 0.2048413 1.0000000
LBXGTC g-tocopherol(ug/dL) 0.2285937 0.0100124 0.0000000 0.0000000
LBXHCB Hexachlorobenzene (ng/g) 0.0073522 0.0298469 0.8078181 1.0000000
LBXHPE Heptachlor Epoxide (ng/g) 0.2698109 0.0288549 0.0000000 0.0000010
LBXHXC 3,3,4,4,5,5-hxcb (fg/g) -0.2141849 0.0319744 0.0000013 0.0001815
LBXIHG Mercury, inorganic (ug/L) -0.0012979 0.0224575 0.9544587 1.0000000
LBXIRN Iron, Frozen Serum (ug/dL) -0.0928059 0.0119747 0.0000001 0.0000209
LBXLUZ Combined Lutein/zeaxanthin (ug/dL) -0.2284943 0.0110242 0.0000002 0.0000232
LBXLYC trans-lycopene(ug/dL) 0.0391903 0.0124745 0.0163420 1.0000000
LBXMIR Mirex (ng/g) -0.0431996 0.0202747 0.0450957 1.0000000
LBXMPAH 2-(N-methyl-PFOSA) acetate -0.0104817 0.0297741 0.7368422 1.0000000
LBXODT o,p-DDT (ng/g) 0.0795222 0.0278279 0.0094251 0.9613633
LBXOXY Oxychlordane (ng/g) 0.0608732 0.0310987 0.0637211 1.0000000
LBXPCB 3,3,4,4,5-pncb (fg/g) 0.1607471 0.0239557 0.0000012 0.0001782
LBXPDE p,p-DDE (ng/g) 0.0555545 0.0293940 0.0726419 1.0000000
LBXPDT p,p-DDT (ng/g) 0.0874431 0.0212755 0.0004995 0.0619334
LBXPFDE Perfluorodecanoic acid -0.0559531 0.0234452 0.0542814 1.0000000
LBXPFDO Perfluorododecanoic acid -0.0590740 0.0432043 0.2205320 1.0000000
LBXPFHP Perfluoroheptanoic acid 0.0453131 0.0287564 0.1661517 1.0000000
LBXPFHS Perfluorohexane sulfonic acid -0.0906463 0.0416582 0.0724645 1.0000000
LBXPFNA Perfluorononanoic acid -0.0667847 0.0438379 0.1784799 1.0000000
LBXPFOA Perfluorooctanoic acid 0.0236730 0.0264606 0.4054300 1.0000000
LBXPFOS Perfluorooctane sulfonic acid -0.0102091 0.0326829 0.7653314 1.0000000
LBXPFSA Perfluorooctane sulfonamide -0.0985670 0.0475431 0.0835150 1.0000000
LBXPFUA Perfluoroundecanoic acid -0.0685577 0.0186627 0.0104107 1.0000000
LBXRBF Folate, RBC (ng/mL RBC) -0.0065411 0.0123954 0.6032394 1.0000000
LBXRPL Retinyl palmitate(ug/dL) -0.0547151 0.0167629 0.0037080 0.4227151
LBXRST Retinyl stearate(ug/dL) -0.0655155 0.0113216 0.0000096 0.0013156
LBXSPH Phosphorus (mg/dL) -0.1023527 0.0139922 0.0000003 0.0000506
LBXTC2 3,4,4,5-tcb (fg/g) -0.0192188 0.0197799 0.3422942 1.0000000
LBXTCD 2,3,7,8-tcdd (fg/g) 0.0198184 0.0249161 0.4352769 1.0000000
LBXTHG Mercury, total (ug/L) -0.0657011 0.0222536 0.0076044 0.7921536
LBXTNA Trans-nonachlor (ng/g) 0.0594385 0.0280126 0.0459136 1.0000000
LBXVIA Retinol(ug/dL) 0.0964585 0.0144043 0.0000013 0.0001815
LBXVID Vitamin D (ng/mL) -0.2598976 0.0185320 0.0000022 0.0003130
LBXVIE a-Tocopherol(ug/dL) -0.0018945 0.0134701 0.8894893 1.0000000
URX14D 2,5-dichlorophenol (ug/L) result 0.0684557 0.0252926 0.0132166 1.0000000
URX1TB 2,4,5-trichlorophenol (ug/L) result 0.0251401 0.0166089 0.1450205 1.0000000
URX3TB 2,4,6-trichlorophenol (ug/L) result -0.0315782 0.0167775 0.0737432 1.0000000
URXCBF Carbofuranphenol (ug/L) result -0.0219768 0.0314808 0.4927722 1.0000000
URXDAZ Daidzein (ng/mL) 0.0310222 0.0145333 0.0447449 1.0000000
URXDMA o-Desmethylangolensin (O-DMA) (ng/mL) -0.0027413 0.0202323 0.8935126 1.0000000
URXDPY diethylaminomethylpyrimidinol/one (ug/L) 0.0020866 0.0229881 0.9302191 1.0000000
URXEQU Equol (ng/mL) 0.0025834 0.0221803 0.9083837 1.0000000
URXETD Enterodiol (ng/mL) -0.0407436 0.0230313 0.0914072 1.0000000
URXETL Enterolactone (ng/mL) -0.1152220 0.0201079 0.0000109 0.0014760
URXGNS Genistein (ng/mL) 0.0186561 0.0169892 0.2845850 1.0000000
URXMBP Mono-n-butyl phthalate 0.0076812 0.0161581 0.6394247 1.0000000
URXMC1 Mono-(3-carboxypropyl) phthalate 0.0257322 0.0254145 0.3450090 1.0000000
URXMCP Mono-cyclohexyl phthalate -0.0055374 0.0196446 0.7807973 1.0000000
URXMEP Mono-ethyl phthalate 0.1090921 0.0162427 0.0000012 0.0001770
URXMET Metolachlor mercapturate (ug/L) result 0.0017177 0.0191221 0.9309390 1.0000000
URXMHH Mono-(2-ethyl-5-hydroxyhexyl) phthalate 0.0790602 0.0165632 0.0020281 0.2393138
URXMHP Mono-(2-ethyl)-hexyl phthalate 0.0218056 0.0156774 0.1788226 1.0000000
URXMIB Mono-isobutyl phthalate 0.0276898 0.0186325 0.1808406 1.0000000
URXMNM Mono-n-methyl phthalate 0.0074960 0.0237375 0.7613717 1.0000000
URXMNP Mono-isononyl phthalate -0.0128979 0.0089652 0.1649842 1.0000000
URXMOH Mono-(2-ethyl-5-oxohexl) phthalate 0.0776022 0.0189394 0.0045881 0.5138655
URXMOP Mono-n-octyl phthalate -0.0281631 0.0197847 0.1692863 1.0000000
URXMZP Mono-benzyl phthalate 0.0602601 0.0233141 0.0172847 1.0000000
URXOPP O-Phenyl phenol (ug/L) result 0.0065515 0.0242488 0.7896600 1.0000000
URXP01 1-napthol (ng/L) 0.0020432 0.0261658 0.9399434 1.0000000
URXP02 2-napthol (ng/L) 0.0949955 0.0157132 0.0005182 0.0632239
URXP03 3-fluorene (ng/L) 0.0546421 0.0174202 0.0049827 0.5480947
URXP04 2-fluorene (ng/L) 0.0858877 0.0167861 0.0000455 0.0060562
URXP05 3-phenanthrene (ng/L) 0.0392506 0.0197663 0.0602737 1.0000000
URXP06 1-phenanthrene (ng/L) 0.1089104 0.0225535 0.0000898 0.0117703
URXP07 2-phenanthrene (ng/L) 0.1404064 0.0208021 0.0000011 0.0001658
URXP08 1-benzo[c] phenanthrene (ng/L) 0.0080217 0.0188848 0.6753242 1.0000000
URXP10 1-pyrene (ng/L) 0.0320947 0.0167939 0.0697394 1.0000000
URXP11 2-benzo[c] phenanthrene (ng/L) -0.0017306 0.0192421 0.9291868 1.0000000
URXP12 1-benzo[a] anthracene (ng/L) 0.0338052 0.0121216 0.0110013 1.0000000
URXP13 6-chrysene (ng/L) 0.0102043 0.0226637 0.6571452 1.0000000
URXP14 3-benzo[c] phenanthrene (ng/L) 0.0155483 0.0159695 0.3413248 1.0000000
URXP15 3-chrysene (ng/L) -0.0537573 0.0158944 0.0028134 0.3235455
URXP16 3-benz[a] anthracene (ng/L) 0.0022213 0.0270823 0.9354066 1.0000000
URXP17 9-fluorene (ng/L) 0.1118388 0.0280468 0.0052727 0.5747190
URXP19 4-phenanthrene (ng/L) 0.0652742 0.0175891 0.0075443 0.7921536
URXP20 1-chrysene (ng/L) 0.0976988 0.0256916 0.0066907 0.7092133
URXP21 2-chrysene (ng/L) 0.0358762 0.0223825 0.1529976 1.0000000
URXP22 4-chrysene (ng/L) 0.0117159 0.0338084 0.7391207 1.0000000
URXP24 3-benzo(a) pyrene (ng/L) -0.0023909 0.0273231 0.9327224 1.0000000
URXPCP Pentachlorophenol (ug/L) result 0.0019523 0.0247614 0.9379043 1.0000000
URXTCC dichlorovnl-dimeth prop carboacid (ug/L) -0.0391348 0.0161983 0.0248860 1.0000000
URXUBA Barium, urine (ng/mL) 0.0510851 0.0149858 0.0026423 0.3065035
URXUBE Beryllium, urine (ng/mL) -0.0031338 0.0147788 0.8341129 1.0000000
URXUCD Cadmium, urine (ng/mL) 0.0411535 0.0200872 0.0531901 1.0000000
URXUCO Cobalt, urine (ng/mL) -0.0029838 0.0163922 0.8573066 1.0000000
URXUCS Cesium, urine (ng/mL) 0.0254334 0.0113513 0.0359950 1.0000000
URXUHG Mercury, urine (ng/mL) -0.0550791 0.0220874 0.0206495 1.0000000
URXUIO Iodine, urine (ng/mL) -0.0358517 0.0191400 0.1032021 1.0000000
URXUMO Molybdenum, urine (ng/mL) 0.0075207 0.0168505 0.6599336 1.0000000
URXUPB Lead, urine (ng/mL) -0.0530362 0.0132674 0.0006536 0.0784369
URXUPT Platinum, urine (ng/mL) -0.0445046 0.0157208 0.0100096 1.0000000
URXUSB Antimony, urine (ng/mL) 0.0584387 0.0207685 0.0104021 1.0000000
URXUTL Thallium, urine (ng/mL) 0.0743989 0.0116830 0.0000026 0.0003625
URXUTU Tungsten, urine (ng/mL) 0.0195960 0.0171570 0.2662449 1.0000000
URXUUR Uranium, urine (ng/mL) 0.0118812 0.0228046 0.6184358 1.0000000

Phenotype: Fasting Blood Glucose

train_summary_final_LBXGLU- is a nested dataframe, it is nested by the exposure and the exposure description. For every exposure under the “data” column, you can see the full adjusted linear model of association. For each coefficient, the following information is displayed: Data: estimate,standard error, pvalue, and false discovery rate (FDR) for the exposure variable.

# Merging the model summary with FDR.
train_LBXGLU_pvalue_adj_reduced <- train_LBXGLU_pvalue_adj %>% select(c(term, FDR))
train_LBXGLU_model_Summary_FDR <- merge(train_model_Summary_LBXGLU,train_LBXGLU_pvalue_adj_reduced  , by = "term", all.x = TRUE)
# Nesting the dataframe by Exposure 
train_model_Summary_nested_LBXGLU <-train_LBXGLU_model_Summary_FDR %>% nest_by(Exposure)
# Merging in the description of each exposure (inner join)
train_summary_final_LBXGLU <- merge(train_model_Summary_nested_LBXGLU,exposure_def, by = "Exposure", all = FALSE)
# Customizing the output dataframe 
new_order <- c("Exposure", "Exposure Description", "data")
train_summary_final_LBXGLU <- train_summary_final_LBXGLU[new_order]

Displaying a DataFrame with only the Exposures and their relevant measures:

train_LBXGLU_model_exposures_only <- train_LBXGLU_model_Summary_FDR
train_LBXGLU_model_exposures_only <- train_LBXGLU_model_exposures_only[grepl("^scale", train_LBXGLU_model_exposures_only$term), ]
train_LBXGLU_model_exposures_only <- train_LBXGLU_model_exposures_only %>% select(-term)
train_LBXGLU_model_exposures_only <- merge(train_LBXGLU_model_exposures_only,exposure_def, by = "Exposure", all = FALSE)
new_order <- c("Exposure", "Exposure Description", "beta_coefficient", "sd","p_value","FDR")
train_LBXGLU_model_exposures_only <- train_LBXGLU_model_exposures_only[new_order]



train_LBXGLU_model_exposures_only %>% as.data.frame() %>%  setNames(c("Exposure", "Exposure Description", "Beta Coefficient", "Standard Error","pValue", "FDR"))%>% kable() %>%  add_header_above(c(" " = 1, "Phenotype: Fasting Blood Glucose" = 5))%>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")
Phenotype: Fasting Blood Glucose
Exposure Exposure Description Beta Coefficient Standard Error pValue FDR
LBD199 PCB199 (ng/g) -0.0515945 0.0446150 0.2854395 1.0000000
LBX028 PCB28 (ng/g) 0.0011809 0.0259753 0.9652125 1.0000000
LBX052 PCB52 (ng/g) 0.0250654 0.0195598 0.2139956 1.0000000
LBX066 PCB66 (ng/g) 0.1074855 0.0631412 0.1034568 1.0000000
LBX074 PCB74 (ng/g) 0.1269555 0.0573901 0.0381726 1.0000000
LBX087 PCB87 (ng/g) 0.0276298 0.0417716 0.5294871 1.0000000
LBX099 PCB99 (ng/g) 0.1043700 0.0470952 0.0378590 1.0000000
LBX101 PCB101 (ng/g) -0.0142698 0.0258778 0.5871605 1.0000000
LBX105 PCB105 (ng/g) 0.0527263 0.0294115 0.0874380 1.0000000
LBX110 PCB110 (ng/g) 0.0197141 0.0366436 0.6072532 1.0000000
LBX118 PCB118 (ng/g) 0.1747719 0.0421620 0.0004591 0.0730005
LBX128 PCB128 (ng/g) 0.0194525 0.0242509 0.4314588 1.0000000
LBX138 PCB138 & 158 (ng/g) 0.1031927 0.0434761 0.0272366 1.0000000
LBX146 PCB146 (ng/g) 0.0764909 0.0436016 0.0939591 1.0000000
LBX149 PCB149 (ng/g) 0.0266535 0.0363263 0.4869590 1.0000000
LBX151 PCB151 (ng/g) 0.0206117 0.0279704 0.4851366 1.0000000
LBX153 PCB153 (ng/g) 0.0981909 0.0432096 0.0336878 1.0000000
LBX156 PCB156 (ng/g) 0.0542031 0.0465673 0.2574891 1.0000000
LBX157 PCB157 (ng/g) 0.0023948 0.0244106 0.9227788 1.0000000
LBX167 PCB167 (ng/g) 0.0295029 0.0313485 0.3573418 1.0000000
LBX170 PCB170 (ng/g) 0.1433432 0.0616588 0.0301907 1.0000000
LBX172 PCB172 (ng/g) 0.0418000 0.0361891 0.2610474 1.0000000
LBX177 PCB177 (ng/g) 0.0558405 0.0413782 0.1915411 1.0000000
LBX178 PCB178 (ng/g) 0.0152238 0.0292379 0.6080314 1.0000000
LBX180 PCB180 (ng/g) 0.0776180 0.0440995 0.0929561 1.0000000
LBX183 PCB183 (ng/g) 0.0587003 0.0344840 0.1034680 1.0000000
LBX187 PCB187 (ng/g) 0.1060423 0.0418922 0.0194162 1.0000000
LBX189 PCB189 (ng/g) 0.0367138 0.0524647 0.5066463 1.0000000
LBX194 PCB194 (ng/g) -0.0334983 0.0669428 0.6321377 1.0000000
LBX195 PCB195 (ng/g) 0.0447708 0.0626536 0.4980020 1.0000000
LBX196 PCB196 & 203 (ng/g) -0.0792445 0.0502781 0.1590014 1.0000000
LBX206 PCB206 (ng/g) 0.0229918 0.0425016 0.6053112 1.0000000
LBXALC a-Carotene(ug/dL) -0.0838811 0.0275736 0.0187917 1.0000000
LBXALD Aldrin (ng/g) 0.0275750 0.0372617 0.4833529 1.0000000
LBXB12 Vitamin B12, serum (pg/mL) 0.0305578 0.0149780 0.0541075 1.0000000
LBXBCD Cadmium (ug/L) -0.0155094 0.0131792 0.2524382 1.0000000
LBXBEC trans-b-carotene(ug/dL) -0.1053922 0.0222479 0.0021148 0.3214506
LBXBHC Beta-hexachlorocyclohexane (ng/g) 0.1655359 0.0450068 0.0013993 0.2182849
LBXBPB Lead (ug/dL) -0.0502433 0.0177217 0.0099156 1.0000000
LBXCBC cis-b-carotene(ug/dL) -0.0888239 0.0161730 0.0009141 0.1435199
LBXCOT Cotinine (ng/mL) 0.0098208 0.0145354 0.5066388 1.0000000
LBXCRY b-cryptoxanthin(ug/dL) -0.0849335 0.0178858 0.0020868 0.3192807
LBXD01 1,2,3,7,8-pncdd (fg/g) 0.0323786 0.0505807 0.5290067 1.0000000
LBXD02 1,2,3,4,7,8-hxcdd (fg/g) 0.0929563 0.0515767 0.1144996 1.0000000
LBXD03 1,2,3,6,7,8-hxcdd (fg/g) 0.0846654 0.0424918 0.0594733 1.0000000
LBXD04 1,2,3,7,8,9-hxcdd (fg/g) 0.0889778 0.0466544 0.0702706 1.0000000
LBXD05 1,2,3,4,6,7,8-hpcdd (fg/g) 0.0908108 0.0332008 0.0124010 1.0000000
LBXD07 1,2,3,4,6,7,8,9-ocdd (fg/g) 0.0740640 0.0468723 0.1290237 1.0000000
LBXDIE Dieldrin (ng/g) 0.2105569 0.0764569 0.0283430 1.0000000
LBXEND Endrin (ng/g) 0.0219057 0.0309539 0.5020156 1.0000000
LBXEPAH 2-(N-ethyl-PFOSA) acetate 0.0451150 0.0489024 0.3918283 1.0000000
LBXF01 2,3,7,8-tcdf (fg/g) -0.0264002 0.0263068 0.3270192 1.0000000
LBXF02 1,2,3,7,8-pncdf (fg/g) -0.0053544 0.0347137 0.8788898 1.0000000
LBXF03 2,3,4,7,8-pncdf (fg/g) 0.0824060 0.0308800 0.0143752 1.0000000
LBXF04 1,2,3,4,7,8-hxcdf (fg/g) 0.0950920 0.0380174 0.0207212 1.0000000
LBXF05 1,2,3,6,7,8-hxcdf (fg/g) 0.0951395 0.0403458 0.0281421 1.0000000
LBXF06 1,2,3,7,8,9-hxcdf (fg/g) 0.0222468 0.0344842 0.5258286 1.0000000
LBXF07 2,3,4,6,7,8-hxcdf (fg/g) 0.0499906 0.0378432 0.2007177 1.0000000
LBXF08 1,2,3,4,6,7,8-hpcdf (fg/g) 0.0602082 0.0293017 0.0525535 1.0000000
LBXF09 1,2,3,4,7,8,9-hpcdf (fg/g) 0.1178753 0.0799849 0.1840493 1.0000000
LBXF10 1,2,3,4,6,7,8,9-ocdf (fg/g) 0.0494291 0.0350254 0.1728195 1.0000000
LBXFOL Folate, serum (ng/mL) -0.0087240 0.0152730 0.5739248 1.0000000
LBXGHC Gamma-hexachlorocyclohexane (ng/g) 0.0124146 0.0352549 0.7282429 1.0000000
LBXGTC g-tocopherol(ug/dL) 0.1422715 0.0167066 0.0000000 0.0000048
LBXHCB Hexachlorobenzene (ng/g) -0.0344947 0.0222453 0.1359256 1.0000000
LBXHPE Heptachlor Epoxide (ng/g) 0.1490605 0.0466539 0.0043534 0.6573670
LBXHXC 3,3,4,4,5,5-hxcb (fg/g) 0.1053043 0.0455634 0.0310668 1.0000000
LBXIHG Mercury, inorganic (ug/L) -0.0040755 0.0089385 0.6528913 1.0000000
LBXIRN Iron, Frozen Serum (ug/dL) -0.0429903 0.0147162 0.0081613 1.0000000
LBXLUZ Combined Lutein/zeaxanthin (ug/dL) -0.0812267 0.0240357 0.0117679 1.0000000
LBXLYC trans-lycopene(ug/dL) -0.0411998 0.0168306 0.0442429 1.0000000
LBXMIR Mirex (ng/g) 0.0351970 0.0380378 0.3653136 1.0000000
LBXMPAH 2-(N-methyl-PFOSA) acetate 0.0468449 0.0892399 0.6184530 1.0000000
LBXODT o,p-DDT (ng/g) -0.0015792 0.0309068 0.9597318 1.0000000
LBXOXY Oxychlordane (ng/g) 0.1509769 0.0661787 0.0330669 1.0000000
LBXPCB 3,3,4,4,5-pncb (fg/g) 0.1195139 0.0334510 0.0017955 0.2782985
LBXPDE p,p-DDE (ng/g) 0.1337392 0.0374807 0.0018151 0.2795214
LBXPDT p,p-DDT (ng/g) 0.1412222 0.0351914 0.0006299 0.0995209
LBXPFDE Perfluorodecanoic acid -0.0376989 0.0346551 0.3184249 1.0000000
LBXPFDO Perfluorododecanoic acid -0.1204707 0.0387366 0.0208490 1.0000000
LBXPFHP Perfluoroheptanoic acid -0.0294164 0.0178196 0.1498731 1.0000000
LBXPFHS Perfluorohexane sulfonic acid -0.0263705 0.0744008 0.7351347 1.0000000
LBXPFNA Perfluorononanoic acid -0.0021235 0.0505025 0.9678252 1.0000000
LBXPFOA Perfluorooctanoic acid -0.0846044 0.0690300 0.2662639 1.0000000
LBXPFOS Perfluorooctane sulfonic acid -0.0128117 0.0653578 0.8510614 1.0000000
LBXPFSA Perfluorooctane sulfonamide 0.0124482 0.0317767 0.7087865 1.0000000
LBXPFUA Perfluoroundecanoic acid -0.0441076 0.0222565 0.0947895 1.0000000
LBXRBF Folate, RBC (ng/mL RBC) 0.0369604 0.0197687 0.0755413 1.0000000
LBXRPL Retinyl palmitate(ug/dL) 0.0060217 0.0202660 0.7692859 1.0000000
LBXRST Retinyl stearate(ug/dL) 0.0398926 0.0221948 0.0866710 1.0000000
LBXSPH Phosphorus (mg/dL) 0.0338711 0.0215665 0.1312346 1.0000000
LBXTC2 3,4,4,5-tcb (fg/g) -0.0454994 0.0261752 0.0968072 1.0000000
LBXTCD 2,3,7,8-tcdd (fg/g) 0.0326694 0.0453276 0.4790167 1.0000000
LBXTHG Mercury, total (ug/L) -0.0027128 0.0168803 0.8737893 1.0000000
LBXTNA Trans-nonachlor (ng/g) 0.1373569 0.0605026 0.0338404 1.0000000
LBXVIA Retinol(ug/dL) -0.0160785 0.0165150 0.3413536 1.0000000
LBXVID Vitamin D (ng/mL) -0.1333235 0.0324176 0.0045006 0.6750972
LBXVIE a-Tocopherol(ug/dL) 0.0131746 0.0185116 0.4844879 1.0000000
URX14D 2,5-dichlorophenol (ug/L) result -0.0151568 0.0428880 0.7273087 1.0000000
URX1TB 2,4,5-trichlorophenol (ug/L) result -0.0321642 0.0127656 0.0199153 1.0000000
URX3TB 2,4,6-trichlorophenol (ug/L) result -0.0061597 0.0288923 0.8332302 1.0000000
URXCBF Carbofuranphenol (ug/L) result -0.0078998 0.0259360 0.7636764 1.0000000
URXDAZ Daidzein (ng/mL) 0.0242993 0.0220098 0.2820743 1.0000000
URXDMA o-Desmethylangolensin (O-DMA) (ng/mL) 0.0243009 0.0213028 0.2668222 1.0000000
URXDPY diethylaminomethylpyrimidinol/one (ug/L) 0.0046509 0.0158687 0.7779530 1.0000000
URXEQU Equol (ng/mL) 0.0149104 0.0162056 0.3679812 1.0000000
URXETD Enterodiol (ng/mL) 0.0059101 0.0219925 0.7907561 1.0000000
URXETL Enterolactone (ng/mL) 0.0032140 0.0189182 0.8667247 1.0000000
URXGNS Genistein (ng/mL) 0.0014812 0.0136941 0.9148936 1.0000000
URXMBP Mono-n-butyl phthalate 0.0386255 0.0209016 0.0787413 1.0000000
URXMC1 Mono-(3-carboxypropyl) phthalate 0.0235570 0.0235368 0.3502295 1.0000000
URXMCP Mono-cyclohexyl phthalate 0.0042171 0.0236532 0.8602032 1.0000000
URXMEP Mono-ethyl phthalate 0.0162812 0.0373446 0.6673060 1.0000000
URXMET Metolachlor mercapturate (ug/L) result -0.0165152 0.0134787 0.2600918 1.0000000
URXMHH Mono-(2-ethyl-5-hydroxyhexyl) phthalate 0.0391208 0.0266177 0.1850945 1.0000000
URXMHP Mono-(2-ethyl)-hexyl phthalate 0.0608473 0.0231423 0.0156763 1.0000000
URXMIB Mono-isobutyl phthalate -0.0162421 0.0320020 0.6273750 1.0000000
URXMNM Mono-n-methyl phthalate -0.0097385 0.0289795 0.7466823 1.0000000
URXMNP Mono-isononyl phthalate -0.0161354 0.0154452 0.3080442 1.0000000
URXMOH Mono-(2-ethyl-5-oxohexl) phthalate 0.0414723 0.0278637 0.1802506 1.0000000
URXMOP Mono-n-octyl phthalate -0.0201920 0.0117467 0.1003312 1.0000000
URXMZP Mono-benzyl phthalate 0.0175628 0.0240041 0.4724653 1.0000000
URXOPP O-Phenyl phenol (ug/L) result -0.0300582 0.0273799 0.2847122 1.0000000
URXP01 1-napthol (ng/L) -0.0115938 0.0295899 0.7068498 1.0000000
URXP02 2-napthol (ng/L) 0.0033295 0.0290158 0.9118684 1.0000000
URXP03 3-fluorene (ng/L) 0.0078669 0.0181287 0.6687521 1.0000000
URXP04 2-fluorene (ng/L) 0.0286705 0.0172896 0.1121305 1.0000000
URXP05 3-phenanthrene (ng/L) 0.0221311 0.0182136 0.2378190 1.0000000
URXP06 1-phenanthrene (ng/L) 0.0269843 0.0189140 0.1683708 1.0000000
URXP07 2-phenanthrene (ng/L) 0.0474879 0.0180941 0.0158414 1.0000000
URXP08 1-benzo[c] phenanthrene (ng/L) -0.0067204 0.0154399 0.6678138 1.0000000
URXP10 1-pyrene (ng/L) 0.0332506 0.0205945 0.1213364 1.0000000
URXP11 2-benzo[c] phenanthrene (ng/L) -0.0170052 0.0141542 0.2429619 1.0000000
URXP12 1-benzo[a] anthracene (ng/L) 0.0379480 0.0302821 0.2239252 1.0000000
URXP13 6-chrysene (ng/L) 0.0220321 0.0228638 0.3462016 1.0000000
URXP14 3-benzo[c] phenanthrene (ng/L) -0.0253617 0.0186809 0.1889930 1.0000000
URXP15 3-chrysene (ng/L) -0.0007344 0.0167520 0.9654468 1.0000000
URXP16 3-benz[a] anthracene (ng/L) 0.0058835 0.0260734 0.8236533 1.0000000
URXP17 9-fluorene (ng/L) 0.0535452 0.0349870 0.1697639 1.0000000
URXP19 4-phenanthrene (ng/L) -0.0130919 0.0342371 0.7135169 1.0000000
URXP20 1-chrysene (ng/L) 0.0397310 0.0414832 0.3700847 1.0000000
URXP21 2-chrysene (ng/L) -0.0092898 0.0228420 0.6963767 1.0000000
URXP22 4-chrysene (ng/L) -0.0057447 0.0326519 0.8653249 1.0000000
URXP24 3-benzo(a) pyrene (ng/L) 0.0210543 0.0215790 0.3617239 1.0000000
URXPCP Pentachlorophenol (ug/L) result 0.0071180 0.0277087 0.7997655 1.0000000
URXTCC dichlorovnl-dimeth prop carboacid (ug/L) -0.0099347 0.0199268 0.6232722 1.0000000
URXUBA Barium, urine (ng/mL) 0.0073754 0.0424701 0.8637950 1.0000000
URXUBE Beryllium, urine (ng/mL) -0.0021292 0.0116287 0.8564783 1.0000000
URXUCD Cadmium, urine (ng/mL) -0.0894074 0.0416382 0.0436044 1.0000000
URXUCO Cobalt, urine (ng/mL) 0.0081500 0.0259035 0.7561466 1.0000000
URXUCS Cesium, urine (ng/mL) -0.0731324 0.0529838 0.1820190 1.0000000
URXUHG Mercury, urine (ng/mL) -0.0829618 0.0286026 0.0082958 1.0000000
URXUIO Iodine, urine (ng/mL) 0.0460943 0.0415512 0.3039359 1.0000000
URXUMO Molybdenum, urine (ng/mL) -0.0430284 0.0474954 0.3752412 1.0000000
URXUPB Lead, urine (ng/mL) -0.0824724 0.0472545 0.0955528 1.0000000
URXUPT Platinum, urine (ng/mL) -0.0605199 0.0221036 0.0123237 1.0000000
URXUSB Antimony, urine (ng/mL) 0.0023215 0.0413607 0.9557703 1.0000000
URXUTL Thallium, urine (ng/mL) -0.1107449 0.0552117 0.0579288 1.0000000
URXUTU Tungsten, urine (ng/mL) -0.0187654 0.0292987 0.5287842 1.0000000
URXUUR Uranium, urine (ng/mL) -0.0140040 0.0469627 0.7742112 1.0000000

C. Analysis of EWAS results

    1. What were the top 3 exposures associated with fasting glucose ranked by FDR (lowest to highest)? (b) How about for BMI? (c) How many were found in each phenotype in the training cohort with an FDR corrected pvalue of less than 0.05? (2)
# (a) top 3 exposures associated with fasting blood glucose 

train_LBXGLU_model_exposures_only %>% arrange(FDR) %>% head(n=3) %>% select(-c( sd, p_value)) %>% as.data.frame() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px") %>% add_header_above(c(" " = 1, "Phenotype: Fasting Blood Glucose" = 3))
Phenotype: Fasting Blood Glucose
Exposure Exposure Description beta_coefficient FDR
LBXGTC g-tocopherol(ug/dL) 0.1422715 0.0000048
LBX118 PCB118 (ng/g) 0.1747719 0.0730005
LBXPDT p,p-DDT (ng/g) 0.1412222 0.0995209
# (b) top 3 exposures associated with BMI
train_BMI_model_exposures_only %>% arrange(FDR) %>% head(n=3) %>% select(-c(sd, p_value)) %>% as.data.frame() %>% kable()  %>% column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px") %>% add_header_above(c(" " = 1, "Phenotype: BMI" = 3))
Phenotype: BMI
Exposure Exposure Description beta_coefficient FDR
LBXGTC g-tocopherol(ug/dL) 0.2285937 0
LBXB12 Vitamin B12, serum (pg/mL) -0.2047473 0
LBXFOL Folate, serum (ng/mL) -0.1905314 0
# (c) How many were found in each phenotype in the training cohort with an FDR corrected pvalue of less than 0.05
# Phenotype= BMI
train_BMI_model_exposures_only %>% filter(FDR <0.05) %>% as.data.frame() %>% kable() %>% add_header_above(c(" " = 2, "Phenotype: BMI" = 4)) %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")
Phenotype: BMI
Exposure Exposure Description beta_coefficient sd p_value FDR
LBD199 PCB199 (ng/g) -0.3408196 0.0508426 0.0002766 0.0345706
LBX118 PCB118 (ng/g) 0.1117064 0.0235063 0.0001078 0.0140165
LBX156 PCB156 (ng/g) -0.1236637 0.0198585 0.0000035 0.0004926
LBX170 PCB170 (ng/g) -0.1616512 0.0280853 0.0000103 0.0014024
LBX180 PCB180 (ng/g) -0.2086052 0.0316211 0.0000016 0.0002235
LBX194 PCB194 (ng/g) -0.3778197 0.0492534 0.0001191 0.0153597
LBX196 PCB196 & 203 (ng/g) -0.2656293 0.0394514 0.0002691 0.0339125
LBX206 PCB206 (ng/g) -0.1628048 0.0217808 0.0001403 0.0179536
LBXALC a-Carotene(ug/dL) -0.1916125 0.0116597 0.0000008 0.0001123
LBXB12 Vitamin B12, serum (pg/mL) -0.2047473 0.0103057 0.0000000 0.0000000
LBXBEC trans-b-carotene(ug/dL) -0.2583417 0.0081134 0.0000000 0.0000012
LBXBHC Beta-hexachlorocyclohexane (ng/g) 0.1435895 0.0273696 0.0000336 0.0044999
LBXBPB Lead (ug/dL) -0.2035729 0.0159326 0.0000000 0.0000000
LBXCBC cis-b-carotene(ug/dL) -0.2409739 0.0113488 0.0000001 0.0000199
LBXCRY b-cryptoxanthin(ug/dL) -0.2368754 0.0163658 0.0000018 0.0002544
LBXD05 1,2,3,4,6,7,8-hpcdd (fg/g) 0.1598730 0.0272081 0.0000078 0.0010812
LBXD07 1,2,3,4,6,7,8,9-ocdd (fg/g) 0.1318667 0.0299399 0.0002471 0.0313780
LBXDIE Dieldrin (ng/g) 0.3250939 0.0385232 0.0000647 0.0085417
LBXF04 1,2,3,4,7,8-hxcdf (fg/g) 0.1348533 0.0185184 0.0000004 0.0000539
LBXFOL Folate, serum (ng/mL) -0.1905314 0.0096388 0.0000000 0.0000000
LBXGTC g-tocopherol(ug/dL) 0.2285937 0.0100124 0.0000000 0.0000000
LBXHPE Heptachlor Epoxide (ng/g) 0.2698109 0.0288549 0.0000000 0.0000010
LBXHXC 3,3,4,4,5,5-hxcb (fg/g) -0.2141849 0.0319744 0.0000013 0.0001815
LBXIRN Iron, Frozen Serum (ug/dL) -0.0928059 0.0119747 0.0000001 0.0000209
LBXLUZ Combined Lutein/zeaxanthin (ug/dL) -0.2284943 0.0110242 0.0000002 0.0000232
LBXPCB 3,3,4,4,5-pncb (fg/g) 0.1607471 0.0239557 0.0000012 0.0001782
LBXRST Retinyl stearate(ug/dL) -0.0655155 0.0113216 0.0000096 0.0013156
LBXSPH Phosphorus (mg/dL) -0.1023527 0.0139922 0.0000003 0.0000506
LBXVIA Retinol(ug/dL) 0.0964585 0.0144043 0.0000013 0.0001815
LBXVID Vitamin D (ng/mL) -0.2598976 0.0185320 0.0000022 0.0003130
URXETL Enterolactone (ng/mL) -0.1152220 0.0201079 0.0000109 0.0014760
URXMEP Mono-ethyl phthalate 0.1090921 0.0162427 0.0000012 0.0001770
URXP04 2-fluorene (ng/L) 0.0858877 0.0167861 0.0000455 0.0060562
URXP06 1-phenanthrene (ng/L) 0.1089104 0.0225535 0.0000898 0.0117703
URXP07 2-phenanthrene (ng/L) 0.1404064 0.0208021 0.0000011 0.0001658
URXUTL Thallium, urine (ng/mL) 0.0743989 0.0116830 0.0000026 0.0003625
# Phenotype= Fasting blood glucose
train_LBXGLU_model_exposures_only %>% filter(FDR <0.05) %>% as.data.frame() %>% kable() %>% add_header_above(c(" " = 2, "Phenotype: Fasting Blood Glucose" = 4)) %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")
Phenotype: Fasting Blood Glucose
Exposure Exposure Description beta_coefficient sd p_value FDR
LBXGTC g-tocopherol(ug/dL) 0.1422715 0.0167066 0 4.8e-06

When examining the association with BMI, 36 exposures had a FDR < 0.05. When examining the association with Fasting blood glucose, only 1 exposure had a FDR < 0.05.

  1. Now interpret the coefficients of the top 3 most findings, ranked by low to high FDR in body mass index and fasting glucose. Specifically, how much does (a) fasting glucose and (b) body mass index change with respect to change in the exposure? Write down your answer in the units of estimate. (2)
  • a: your answer here
  1. A 1% change in g-tocopherol is associated with a 0.0014 units higher of serum glucose level, holding age, sex, ethnicity, and family poverty level constant.
  2. A 1% change in PCB118 is associated with a 0.0017 units higher of serum glucose level, holding age, sex, ethnicity, and family poverty level constant.
  3. A 1% change in p,p-DDT is associated with a 0.0014 units higher of serum glucose level, holding age, sex, ethnicity, and family poverty level constant.
  • b: your answer here
  1. A 1% change in g-tocopherol is associated with a 0.0022 units higher of BMI, holding age, sex, ethnicity, and family poverty level constant.
  2. A 1% change in Vitamin B12 serum is associated with a 0.002 units less of BMI, holding age, sex, ethnicity, and family poverty level constant.
  3. A 1% change in Folate serum is associated with a 0.0019 units less of BMI, holding age, sex, ethnicity, and family poverty level constant.
    1. Produce a “volcano plot” of the results for glucose and body mass index (two separate plots) by plotting the association size, or estimate on the x-axis and the -log10(FDR) on the yaxis for the training cohort data for the coefficient on the exposure (each point should denote the association size and FDR between an exposure and a phenotype). (b) Why is it useful to scale the dependent and independent variables before running the model? (2)
  • your answer here
# color coding the points with FDR < 0.05, if true, red. 
train_BMI_model_exposures_only1 <- train_BMI_model_exposures_only %>% mutate(significant=ifelse(FDR < 0.05, 1,0))
train_LBXGLU_model_exposures_only1 <- train_LBXGLU_model_exposures_only %>% mutate(significant=ifelse(FDR < 0.05, 1,0))

# (a) Volcano Plot for BMI
a <- ggplot(train_BMI_model_exposures_only1 , aes(x = beta_coefficient, y =-log10(FDR), color=as.factor(significant))) +
  geom_point() +labs(x='Association Size', y='-log10(FDR)') +  ggtitle("BMI") +scale_color_manual(values = c("0" = "black", "1" = "red"))+  labs(color='Significance')+ theme_linedraw()+geom_hline(yintercept = 1.3, linetype = "dashed", color = "#491C1C")+  guides(color = 'none')
 

# (a) Volcano Plot for Fasting Blood Glucose
b <- ggplot(train_LBXGLU_model_exposures_only1 , aes(x = beta_coefficient, y =-log10(FDR), color=as.factor(significant))) + geom_point() +labs(x='Association Size', y='-log10(FDR)') +  ggtitle("Fasting blood glucose") +scale_color_manual(values = c("0" = "black", "1" = "red"))+  labs(color='Significance')+geom_hline(yintercept = 1.3, linetype = "dashed", color = "#491C1C")+ theme_linedraw()+  guides(color = 'none') 

plot_grid(a,b, ncol = 2)

  1. It is useful to scale the dependent and independent variables before running the analysis because it will improve the interpretation, having the dependents and independents variables on the same scale make it easy to understand their relationship, since the coefficients represent the change in the dependent variable for a one-unit change in the independent variable.
  1. Filter “replicated” findings using the following heuristic: FDR significance of 0.05 in the training dataset, p-value < 0.05 in the testing dataset, and concordant directionality of associations (e.g., both associations are >0 OR both are <0). Provide your answer in the form of a table with the estimate of the “replicated” findings in the train, test, the FDR from the training data and pvalue from the test cohort data. (3)
test_LBXGLU_model_exposures_only <- test_model_Summary_LBXGLU
test_LBXGLU_model_exposures_only <-test_LBXGLU_model_exposures_only[grepl("^scale", test_LBXGLU_model_exposures_only$term), ]
test_LBXGLU_model_exposures_only <- test_LBXGLU_model_exposures_only%>% select(-term)
test_LBXGLU_model_exposures_only <- merge(test_LBXGLU_model_exposures_only,exposure_def, by = "Exposure", all = FALSE)
new_order <- c("Exposure", "Exposure Description", "beta_coefficient", "sd","p_value")
test_LBXGLU_model_exposures_only <- test_LBXGLU_model_exposures_only[new_order]

# Filtering for significant values in the training set 
LBXGLU_train_sig <- train_LBXGLU_model_exposures_only %>%  filter(FDR<0.05) %>% select(-c(sd,p_value))

# Filtering for significant values in the testing set 
LBXGLU_test_sig <- test_LBXGLU_model_exposures_only %>%  filter(p_value <0.05)%>% select(-sd)


# Renaming column names before merging 
colnames(LBXGLU_train_sig ) <- c("Exposure", "Exposure Description", "beta_coefficient_train","FDR_train")
colnames(LBXGLU_test_sig ) <- c("Exposure", "Exposure Description", "beta_coefficient_test","p_value_test")

# Merging the training and test set results
LBXGLU_test_train <- merge(LBXGLU_train_sig,LBXGLU_test_sig, by = c("Exposure","Exposure Description"), all = FALSE)

LBXGLU_test_train %>% as.data.frame() %>% kable() %>% add_header_above(c(" " = 1, "Replicated findings for Fasting blood glucose"=5))%>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")
Replicated findings for Fasting blood glucose
Exposure Exposure Description beta_coefficient_train FDR_train beta_coefficient_test p_value_test
LBXGTC g-tocopherol(ug/dL) 0.1422715 4.8e-06 0.1250581 2e-07

Displaying a DataFrame with only the Exposures and their relevant measures for testing set (BMI):

test_BMI_model_exposures_only <- test_model_Summary_BMI
test_BMI_model_exposures_only <-test_BMI_model_exposures_only[grepl("^scale", test_BMI_model_exposures_only$term), ]
test_BMI_model_exposures_only <- test_BMI_model_exposures_only%>% select(-term)
test_BMI_model_exposures_only <- merge(test_BMI_model_exposures_only,exposure_def, by = "Exposure", all = FALSE)
new_order <- c("Exposure", "Exposure Description", "beta_coefficient", "sd","p_value")
test_BMI_model_exposures_only <- test_BMI_model_exposures_only[new_order]

# Filtering for significant values in the training set 
BMI_train_sig <- train_BMI_model_exposures_only %>%  filter(FDR<0.05) %>% select(-c(sd,p_value))

# Filtering for significant values in the testing set 
BMI_test_sig <- test_BMI_model_exposures_only %>%  filter(p_value <0.05)%>% select(-sd)


# Renaming column names before merging 
colnames(BMI_train_sig ) <- c("Exposure", "Exposure Description", "beta_coefficient_train","FDR_train")
colnames(BMI_test_sig ) <- c("Exposure", "Exposure Description", "beta_coefficient_test","p_value_test")

# Merging the training and test set results
BMI_test_train <- merge(BMI_train_sig,BMI_test_sig, by = c("Exposure","Exposure Description"), all = FALSE)

BMI_test_train %>% as.data.frame() %>% kable() %>% add_header_above(c(" " = 1, "Replicated findings for BMI" = 5))%>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")
Replicated findings for BMI
Exposure Exposure Description beta_coefficient_train FDR_train beta_coefficient_test p_value_test
LBD199 PCB199 (ng/g) -0.3408196 0.0345706 -0.3340961 0.0002995
LBX118 PCB118 (ng/g) 0.1117064 0.0140165 0.1489610 0.0177188
LBX156 PCB156 (ng/g) -0.1236637 0.0004926 -0.1544014 0.0047365
LBX170 PCB170 (ng/g) -0.1616512 0.0014024 -0.2247118 0.0014327
LBX180 PCB180 (ng/g) -0.2086052 0.0002235 -0.3387496 0.0001508
LBX194 PCB194 (ng/g) -0.3778197 0.0153597 -0.3248369 0.0000919
LBX196 PCB196 & 203 (ng/g) -0.2656293 0.0339125 -0.2436532 0.0011456
LBX206 PCB206 (ng/g) -0.1628048 0.0179536 -0.3159491 0.0005139
LBXALC a-Carotene(ug/dL) -0.1916125 0.0001123 -0.1946426 0.0000000
LBXB12 Vitamin B12, serum (pg/mL) -0.2047473 0.0000000 -0.1897773 0.0000000
LBXBEC trans-b-carotene(ug/dL) -0.2583417 0.0000012 -0.2593669 0.0000000
LBXBHC Beta-hexachlorocyclohexane (ng/g) 0.1435895 0.0044999 0.1427461 0.0360052
LBXBPB Lead (ug/dL) -0.2035729 0.0000000 -0.2032720 0.0000000
LBXCBC cis-b-carotene(ug/dL) -0.2409739 0.0000199 -0.2556237 0.0000000
LBXCRY b-cryptoxanthin(ug/dL) -0.2368754 0.0002544 -0.2125888 0.0000000
LBXD05 1,2,3,4,6,7,8-hpcdd (fg/g) 0.1598730 0.0010812 0.2249577 0.0001554
LBXD07 1,2,3,4,6,7,8,9-ocdd (fg/g) 0.1318667 0.0313780 0.2172344 0.0001519
LBXDIE Dieldrin (ng/g) 0.3250939 0.0085417 0.3351152 0.0000956
LBXF04 1,2,3,4,7,8-hxcdf (fg/g) 0.1348533 0.0000539 0.1429995 0.0012816
LBXFOL Folate, serum (ng/mL) -0.1905314 0.0000000 -0.1824641 0.0000000
LBXGTC g-tocopherol(ug/dL) 0.2285937 0.0000000 0.2356250 0.0000000
LBXHPE Heptachlor Epoxide (ng/g) 0.2698109 0.0000010 0.2886614 0.0000460
LBXHXC 3,3,4,4,5,5-hxcb (fg/g) -0.2141849 0.0001815 -0.2348097 0.0009598
LBXIRN Iron, Frozen Serum (ug/dL) -0.0928059 0.0000209 -0.1901752 0.0000000
LBXLUZ Combined Lutein/zeaxanthin (ug/dL) -0.2284943 0.0000232 -0.1914445 0.0000000
LBXPCB 3,3,4,4,5-pncb (fg/g) 0.1607471 0.0001782 0.1780708 0.0020522
LBXRST Retinyl stearate(ug/dL) -0.0655155 0.0013156 -0.0611861 0.0004880
LBXSPH Phosphorus (mg/dL) -0.1023527 0.0000506 -0.1153578 0.0000000
LBXVIA Retinol(ug/dL) 0.0964585 0.0001815 0.0757284 0.0000793
LBXVID Vitamin D (ng/mL) -0.2598976 0.0003130 -0.2546304 0.0000000
URXETL Enterolactone (ng/mL) -0.1152220 0.0014760 -0.0936139 0.0000101
URXMEP Mono-ethyl phthalate 0.1090921 0.0001770 0.1097786 0.0000123
URXP04 2-fluorene (ng/L) 0.0858877 0.0060562 0.0967908 0.0078634
URXP06 1-phenanthrene (ng/L) 0.1089104 0.0117703 0.1174154 0.0050251
URXP07 2-phenanthrene (ng/L) 0.1404064 0.0001658 0.1774947 0.0002225
URXUTL Thallium, urine (ng/mL) 0.0743989 0.0003625 0.1008309 0.0000389
  1. Environmental exposures are highly correlated with one another and they can change over time. Therefore, exposures could be subject to many biases. Describe a type of bias (e.g., selection, sampling, or other) that can influence your findings and give an example.(2)
  • your answer here

Confounding is one of the biases that we can face when studying environmental exposures.It is when a third variable is associated with the exposure and the outcome but not on the causal pathway, it can lead to wrong associations. In our study we found that both b12 and folate are associated with BMI. But we also know that B12 and folate are correlated and interdependent. B12 is needed for the conversion of folate to its active form, and a deficiency in B12 can lead to a buildup of inactive folate in the body. Deficiency of one can mask deficiency in other. This correlation between the two can be confoundig in terms of the association with BMI.

  1. BMI is a risk factor for diabetes: individuals with higher BMI are at risk for future type 2 diabetes. In this study, we found factors associated with both. For exposures that are found in both BMI and type 2 diabetes - do you think exposure is influencing both traits independently, or via BMI, or neither? Explain why. Optional: Write down a possible linear regression model to test your hypothesis and interpret the finding. (3)
  • your answer here

The exposure LBXGTC (g-tocopherol(ug/dL)) was found to be associated with both BMI and Fasting Plasma Glucose. g-tocopherol is a form of vitamin E that is found in many foods, including nuts, seeds, and vegetable oils. The literature about the association between vitamin E and BMI or diabetes is not conclusive. Giving the association between BMI and diabetes, it is reasonable to think that an exposure affecting both, does so dependently. To further explore that, we can conduct mediation analysis. What we know: 1) Exposure is significantly associated with T2D. Beta Coefficient 0.1422715.

Packages within the Tidyverse

  1. Exposure is significantly associated with BMI.Beta coefficient 0.2285937.

Packages within the Tidyverse

What I need to further do: 3) I will build a linear model with fasting blood glucose as the outcome, and with the following independent variables: BMI, exposure (g-tocopherol) and confounders (sex, age, ethnicity, poverty to income ratio)

Packages within the Tidyverse

model_mediation <- association("LBXGLU", "LBXGTC", covariates=c("BMXBMI","RIDAGEYR","sex","ethnicity","INDFMPIR"), dsn=dsn)

summary(model_mediation)
## 
## Call:
## svyglm(formula = as.formula(mod_string), design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -0.6597769  0.0592085 -11.143 4.97e-10 ***
## scale(log(LBXGTC + 0.001))  0.1169084  0.0173931   6.722 1.54e-06 ***
## BMXBMI                      0.0149378  0.0014666  10.186 2.32e-09 ***
## RIDAGEYR                    0.0129586  0.0009889  13.105 2.82e-11 ***
## sex                        -0.1672797  0.0239012  -6.999 8.62e-07 ***
## ethnicity2                 -0.0277477  0.0452493  -0.613  0.54664    
## ethnicity3                  0.1192694  0.0335673   3.553  0.00199 ** 
## ethnicity4                  0.0300114  0.0456817   0.657  0.51869    
## ethnicity5                  0.1228672  0.1087525   1.130  0.27193    
## INDFMPIR                   -0.0177648  0.0082172  -2.162  0.04292 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.8908525)
## 
## Number of Fisher Scoring iterations: 2

The exposure is still significantly associated with T2D as we can see. The effect still exist but it is smaller in magnitude (Beta coefficient 0.1169084). This suggests that BMI partially mediates between exposure and T2D.

Reference: https://data.library.virginia.edu/introduction-to-mediation-analysis/