An Introduction to the exposome: exposome-wide association in body mass index and fasting glucose

Please check Canvas for the due date of this assignment.

Start early. This is a challenging assignment! Each student must hand in their code and answers separately. This assignment is 20% of your grade. We cannot accept late assignments.

Submit your answers as a knitted .html file: select the Knit drop down at the top of the window. Once converted to .html, select the file in the file browser, click the More drop down, and select Export.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(dbplyr)

## 
## Attaching package: 'dbplyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql

library(knitr)
library(kableExtra)

## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(cowplot)

Background:

Body mass index and glucose are phenotypes that are known to be risk factors for Type 2 Diabetes (T2D). Genetic factors are associated with body mass index and glucose. However, environmental exposure factors associated to fasting glucose and/or body mass index may also play a role in T2D, given the rapid rise of both body mass index and diabetes in the last 20 years (https://www.cdc.gov/diabetes/statistics/slides/long_term_trends.pdf).

For example, it is hypothesized that “poor” diets or sugar consumption or even pollutants may influence body mass index or fasting glucose. However, these factors do not predict the phenotypes of body mass index or fasting glucose completely. In this assignment you will complement these associations in T2D, linking new potential environmental exposures in T2D phenotypes. Specifically, we will execute a data-driven search for environmental exposures correlated with body mass index and fasting glucose, called ‘environment-wide association studies’ or ‘exposome-wide association studies’ (EWASs), similar to how investigators correlate individual SNPs with disease in GWAS.

Description of Data and Readings:

Please skim Patel et al 2010.

load('./assignment1.Rdata')

This .Rdata file contains 5 data.frames and an array named the following:

ExposureDescription:the dictionary of exposure variables in the training and testing datasets
ExposureList: a list of biomarkers of environmental exposures to test against BMI and glucose.
NHData.train: the training dataset from the NHANES 1999-2000 and 2001-2002 surveys
NHData.test: the testing dataset from the NHANES 2003-2004 and 2005-2006 surveys
demographicVariables: a dictionary of demographic variables in the training and testing datasets.

These data were derived from the National Health and Nutrition Examination Survey (NHANES), a survey on a representative sample of the United States population, undertaken by the US Centers for Disease Control and Prevention (see here: https://www.cdc.gov/nchs/nhanes/. Specifically, we downloaded the data from the NHANES web site and stitched them together to create the data.frame objects above. For more information, please see our paper, Patel CJ, et al 2016, Nature Scientific Data.

A short tutorial on analysis of survey data

The NHANES participants are representative of the United States population. A challenge in making a dataset representative is ensuring all types of individuals are sampled that reflect the diversity of the US at the lowest possible cost. Simply calling a finite number of people up randomly is not going to achieve representativeness. The US CDC and National Centers for Health Statistics (NCHS) are able to achieve representativeness by sampling specific facets of the population at higher probabilities than would be observed if one was to randomly sample the population. Second, NCHS samples people from specific parts of the population (e.g., 15 counties a year) to save time and money (it is easier to survey people in a neighborhood than randomly all over the US).

In fact, most data that you will find from the US CDC or international epidemiological surveys, use a special sampling technique called “survey sampling”. You can learn more about the CDC and the NCHS sampling procedure here: https://wwwn.cdc.gov/Nchs/Nhanes/AnalyticGuidelines.aspx

What does this mean for us? We cannot use the common tools in R to do modeling if we desire to produce “valid” effect sizes, associations/correlations, and pvalues. Why? Because individuals are not sampled randomly, they exhibit some inherent correlation with others (e.g., individuals sampled from the same town, for example). Therefore, this will influence the standard errors and averages of the correlations and therefore also influence the inference (pvalues for correlation).

How do we address this? We simply use a package, aptly called ‘survey’, to take into account the unequal survey-weighted design of the NHANES population. The two main functions you will use in this assignment to achieve this includes ‘svydesign’ and ‘svyglm’ (survey-weighted general linear modeling, which can do linear, logistic, and other types of regressions). Here is how you use them using the NHdata.train data.frame.

First, we need to load the ‘survey’ package:

library(survey)

## Loading required package: grid

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

Next, we need to tell R that we have a survey-weighted data we need to analyze. This is done through the ‘svydesign’ function from the survey library, for example for the training data, NHData.train:

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.train)

This tells R that we have a survey-weighted sample (denoted by weights in the function above). The weights signify how much an individual should be considered in the model and is a function of their prevalence in the population. For example, individuals with higher weights are less prevalent in the population. In contrast, in a simple random sample, the weights would be equal for every individual. Second, the ids and strata parameters tell R where an individual was sampled. ‘Strata’ are units (such as zipcodes) that are sampled within ‘primary sampling units’ (or PSUs, SDMVPSU) such as a county. Individuals within strata are more correlated with one another than individuals that live in different strata. Therefore, these similarities within sampling strata must be taken into account when estimating correlations.

We can run regression models using the ‘svyglm’ function. For example, suppose I wanted to regress Body Mass Index (coded as BMXBMI in the data.frame) on age (coded as RIDAGEYR), we can use the ‘svyglm’ function (INSTEAD of the regular ‘lm’ function):

mod <-svyglm(BMXBMI ~ RIDAGEYR, dsn)
summary(mod)

Observe that the estimates (and pvalues) will be different when using the ‘lm’ function (give it a shot):

mod2 <- lm(BMXBMI ~ RIDAGEYR, data=NHData.train)
summary(mod2)

In this assignment, we will use the survey package and ‘svydesign’ and ‘svyglm’ functions to run your association tests. These are good functions to know when analyzing data from the public domain.

Answer the questions that are numbered in the space below that says “your answer here”. Supply R code when required by using the back ticks. For example:

Sample Questions (and Answers):

What is the meaning of life?

The meaning of life is 42.

What is e to the power of 3?

exp(3)

A. Background Questions

How are the phenotypes (glucose and BMI) measured in participants of the NHANES? (1)

your answer here

Pehnotype 1: Body Mass Index (kg/m**2)- Calculated according to this equation: \[\ Weight(kg)/Height^2(m^2)\]

A detailed protocol that standardizes the measurements of weight and height is available per year of survey, different guidelines are available per age group (for example attached the Anthropometry And Physical Activity Monitor Procedures Manual 2005-2006 https://wwwn.cdc.gov/nchs/data/nhanes/2005-2006/manuals/BM.pdf).

“For children and adolescents, cutoff criteria are based on the gender-specific BMI-for-age growth charts: underweight (BMI values < 5th percentile); healthy weight (BMI values 5th-84th percentiles); overweight (BMI values 85th-94th percentiles); and obesity (BMI values ≥ 95th percentile) (Barlow, 2007).

For adults, cutoff criteria are fixed: underweight (BMI values < 18.5); normal weight (BMI values 18.5-24.9); overweight (BMI values 25.0-29.9); obese—Class I (BMI values 30.0-34.9); obese—Class II (BMI values 35.0-39.9); and extremely obese—Class III (BMI values > 40.0) (National Institutes of Health, 1998).”

Phenotype 2: Fasting blood glucose-plasma (mg/dL) The fasting glucose levels are measured in the blood. blood samples are drawn by venipuncture from fasting individuals.

What does representative mean with respect to NHANES? How is it different than the Framingham Heart Study? (2)

Your answer here

The study population of the Framingham heart study represents a certain subset of the population, which is the residents of Framingham Massachusetts. Thus the results of studies based on this cohort are less generalizable. The NHANES dataset focuses on collecting information from the civilian population living in the 50 states of the US including the district of Columbia(Excluding those who live in institutionalized settings), the sampling aims to have a fair representation of the entire population of the US, which means that the results are more generalizable to the US population comparing to the Framingham study.

These phenotypes (glucose and BMI) are related to type 2 diabetes. (a) Draw the E, G, P, and D diagram and annotate what you are investigating in your anticipated NHANES EWASs. (b) What does body mass index measure? What does fasting glucose measures? Why are they important in type 2 diabetes? (2)

Your answers here (in free text, some hints below)

Phenotype(s): body mass index, Fasting plasma glucose
Exposure(s): Environmental exposures (stated in the ExposureList)
Disease(s): Type 2 Diabetes

Your answer here

Fasting blood glucose is set to detect individuals with impaired glucose tolerance. Fasting plasma glucose (FPG) >= 126 mg/dL is one of the accepted criteria for the diagnosis of Diabetes Mellitus (DM). FPG is in fact very reliable and convenient test to identify DM in asymptomatic individuals.
Body Mass Index is is used to assess weight status in children and adolescents as well as adults.

Those measures are important in type 2 diabetes. The first is an important diagnostic measure as previously mentioned, also impaired glucose tolerance (< 126) is a risk factor for developing DM later on. BMI and obesity are important risk factors in diabetes and are linked to insulin resistance and disease progression.

How are the environmental exposures measured in the NHANES? Choose a few examples (up to 5) of different types of biomarkers of environmental exposures from the ExposureList array and query the NHANES website for the assay description of the biomarker. (2)

your answers here

Here I looked into 3 different exposures and their measures: 1) URXUIO –> Iodine, urine (ng/mL), it was measured using Inductively coupled plasma-mass spectrometry (ICP-MS) from urine samples.
2) LBXB12 –> Vitamin B12, serum (pg/mL), used from blood samples. The Elecsys Vitamin B12 assay employs a competitive test principle using intrinsic factor specific for vitamin B12. 3) LBXFOL –> Folate, serum in (ng/mL)- “Serum folate is measured by using the Bio-Rad Laboratories”Quantaphase II Folate” radioassay kit. The assay is performed by combining serum or a whole blood hemolysate sample with 125I-folate and cyanide.”

Let’s explore the phenotypes, body mass index (BMXBMI) and fasting glucose (LBXGLU) with respect to demographic characteristics of your population in the training dataset (NHData.train)

How many individuals are there with BMI values? Draw a histogram of BMI and describe the mean and median. Is the distribution “skewed” or biased toward a certain direction on the x axis? (1)

your answer here (with R code). Remember, use the NHData.train cohort.

The number of observations (individuals) in the NHData.train dataset is 21004. 17472 individuals have a record of their BMI, and 3532 individuals have missing values.

length(NHData.train$BMXBMI)

## [1] 21004

sum(is.na(NHData.train$BMXBMI))

## [1] 3532

sum(!is.na(NHData.train$BMXBMI))

## [1] 17472

# Missing values will be dropped
BMI_hist <- NHData.train %>% ggplot(aes(BMXBMI)) +       
  geom_histogram(bins=30,color = "#000000", fill = "#5191A9", alpha=0.6) +labs(x='Body Mass Index')+ geom_vline(aes(xintercept=mean(BMXBMI,na.rm=TRUE)),
            color="blue", linetype="dashed", size=1) + geom_vline(aes(xintercept=median(BMXBMI,na.rm=TRUE)),
            color="red", linetype="dashed", size=1)

print(BMI_hist)

mean(NHData.train$BMXBMI,na.rm=TRUE)

## [1] 24.79468

median(NHData.train$BMXBMI,na.rm=TRUE)

## [1] 24.1

The data has a right skewed distribution (right tail) (median < mean). The mean (blue dashed line) is 24.8 kg/m^2. The median (red dashed line) is 24.1 kg/m^2. (P.S the null values were automatically dropped by the ggplot function).

Plot BMI vs age. (1)

your answer here

# Missing values will be dropped

BMI_age <- NHData.train %>% ggplot(aes(RIDAGEYR,BMXBMI)) +       
  geom_point()+labs(y='Body Mass Index', x='Age in years')

print(BMI_age)

Plot a boxplot of BMI versus sex. (1)

your answer here

I Will create a sex variable (male=0, female=1) (basically this is the same as the female variable, but created it to reduce my confusion).

NHData.train <- NHData.train %>% mutate(sex = ifelse(male == 1, 0, ifelse(female == 1, 1, NA)))
NHData.train$sex <- as.factor(NHData.train$sex)

# checking the values of the new variable
table(NHData.train$sex)

## 
##     0     1 
## 10214 10790

# For later use: creating same variable in the test dataset
NHData.test <- NHData.test %>% mutate(sex = ifelse(male == 1, 0, ifelse(female == 1, 1, NA)))

# Missing values will be dropped
# Will place the categorical variable on the x axis. 

ggplot(NHData.train, aes(x = sex, y =BMXBMI )) +
  geom_boxplot(alpha=0.5,fill = "#5F9EA0") +theme_classic()+labs(x='Sex', y='Body Mass Index')+ scale_x_discrete(labels = c('0' = 'Male' , '1'='Female'))

Plot a boxplot of BMI versus race/ethnicity. (2)

your answer here

# Creating a categorical variable of Race/ethnicity with the following values: White American=1, Black American =2, Mexican American=3, Other Hispanic=4, Other Ethnicity=5

NHData.train <- NHData.train %>% mutate(ethnicity = case_when(white ==1  ~ 1, black==1  ~ 2, mexican==1 ~ 3, other_hispanic==1 ~ 4, other_eth==1 ~ 5))
                             
# Converting to categorical variable 
NHData.train$ethnicity <- as.factor(NHData.train$ethnicity)
table(NHData.train$ethnicity)

# Checking the integrity of the recoded variable: ethnicity.
table(NHData.train$white)
table(NHData.train$black)
table(NHData.train$mexican)
table(NHData.train$other_hispanic)
table(NHData.train$other_eth)

# For later use: creating same variable in the test dataset
NHData.test <- NHData.test %>% mutate(ethnicity = case_when(white ==1  ~ 1, black==1  ~ 2, mexican==1 ~ 3, other_hispanic==1 ~ 4, other_eth==1 ~ 5))
                             
# Converting to categorical variable 
NHData.test$ethnicity <- as.factor(NHData.test$ethnicity)
table(NHData.test$ethnicity)

# Missing values will be dropped

ggplot(NHData.train, aes(x = ethnicity, y =BMXBMI )) +
  geom_boxplot(alpha=0.5,fill = "#5F9EA0") +theme_classic()+labs(x='Race/Ethnicity', y='Body Mass Index')+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+scale_x_discrete(labels = c('1' = 'White American' , '2'='Black American', '3'='Mexican American', '4'='Other Hispanic', '5'= 'Other Ethnicity'))

Plot BMI versus an indicator of socioecononomic status, the income to poverty ratio (INDFMPIR). The income-to-poverty ratio is a the household income divided by the household poverty level for a given survey year. Therefore a Income-to-poverty ratio of 1 means the individual has household income equal to the poverty level. (1)

your answer here

# Missing values will be dropped
# We have two continuous values, I will visualize using scatter plot. 
# I added ethnicity as a third variable with the hue. 

BMI_income <- NHData.train %>% ggplot(aes(INDFMPIR,BMXBMI,color= ethnicity)) +       
  geom_point()+labs(y='Body Mass Index', x='Income-to-poverty ratio')

print(BMI_income)

Describe the differences in groups (e.g., what is the difference of BMI averages in males vs. females): summarizing the plots from 6-9. (1)

your answer here

From visually inspecting the differences between males and females in regard of their BMI, it seems that the two groups are comparable, with similar median and IQR. Statistical tests can be applied to examine significance of the difference. The different ethnic groups look similar as well, with white and black groups having more outliers. BMI seems to be lower in the extremes of age (childhood and adolescence vs elderly) with the highest values in the middle age group. In regard to income-poverty ratio and BMI, I don’t detect a specific association with certain income level.

What statistical model would you use to test the association (and significance) of your findings in 10? Execute the model using svyglm below, using multivariate regression, modeling BMI as a function of age, ethnicity, sex, and INDFMPIR. (3)

your answer here.

We can use Ttest to compare the difference in BMI between males and females, and ANOVA test to compare BMI differences between different ethnic groups. Also, since our outcome (dependent variable) is continuous, we can use a Linear regression model to investigate the difference in BMI between different groups. The beta coefficient of the independent variable will inform us about the difference (compared to a reference group) between the groups when adjusting to different variables.

#Data preparation to use with svyglm (adopted from earlier in the file)
dsn <- svydesign(ids=~ SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.train)

# Missing values will be dropped. 
model2 <- svyglm(BMXBMI~ RIDAGEYR+sex+INDFMPIR+ethnicity,dsn)

summary(model2)

## 
## Call:
## svyglm(formula = BMXBMI ~ RIDAGEYR + sex + INDFMPIR + ethnicity, 
##     design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.217088   0.245206  82.449  < 2e-16 ***
## RIDAGEYR     0.152421   0.003439  44.325  < 2e-16 ***
## sex1         0.209020   0.132947   1.572    0.130    
## INDFMPIR    -0.061980   0.056904  -1.089    0.288    
## ethnicity2   1.560491   0.180950   8.624 1.66e-08 ***
## ethnicity3   0.992781   0.190223   5.219 3.10e-05 ***
## ethnicity4   0.553229   0.281339   1.966    0.062 .  
## ethnicity5  -0.012941   0.612393  -0.021    0.983    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 44.90787)
## 
## Number of Fisher Scoring iterations: 2

Using the model you described in 11, report the following:

Change in BMI for a 1 year increase/change in age and significance of this association. (1)

your answer here

Holding all other variables constant, a 1 unit increase in age, will lead to an increase of 0.15 in the outcome variable (BMI), this change is statistically significant (p<0.001).

Difference of BMI in males versus females and significance. (1)

your answer here

Based on our sample, holding all other variables constant, there is difference in the mean BMI between males and females, females have higher mean BMI compared to males (by 0.209 units), but this difference is not statistically significant (pvalue=0.13).

Difference of BMI in Non-Hispanic Black versus Whites, Mexican American versus Whites, and Other Hispanic vs. White, and Other versus White. (1)

your answer here.

In my model, the reference group for the ethnicity variable is white. - Black American have higher mean BMI compared to white American (by 1.56 units increase), the difference is statistically significant. - Mexican Americans have higher mean BMI compared to white American (by 0.99 units increase), the difference is statistically significant. - Other hispanics have higher mean BMI compared to white American (by 0.55 units increase), the difference is NOT statistically significant. - Other ethnicities have lower mean BMI compared to white American (by -0.0129 units decrease), the difference is NOT statistically significant.

Repeat Q11 for fasting glucose (LBXGLU), developing a multivariate linear regression model to associate fasting glucose with age, sex, race, and income-poverty ratio. (3)

your answer here

model3 <- svyglm(LBXGLU~ RIDAGEYR+sex+INDFMPIR+ethnicity,dsn)

summary(model3)

## 
## Call:
## svyglm(formula = LBXGLU ~ RIDAGEYR + sex + INDFMPIR + ethnicity, 
##     design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  87.8945     1.0508  83.648  < 2e-16 ***
## RIDAGEYR      0.4186     0.0291  14.388 1.13e-12 ***
## sex1         -5.5616     0.8217  -6.768 8.41e-07 ***
## INDFMPIR     -0.9295     0.2844  -3.268  0.00352 ** 
## ethnicity2    0.1312     1.4520   0.090  0.92884    
## ethnicity3    3.4764     0.9499   3.660  0.00138 ** 
## ethnicity4    3.9009     2.8895   1.350  0.19073    
## ethnicity5    3.0543     3.2390   0.943  0.35594    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1024.717)
## 
## Number of Fisher Scoring iterations: 2

We will explore one biomarker of exposure: serum Lead (LBXBPB).

Plot a histogram of serum lead (LBXBPB) and qualitatively describe the “shape” of the distribution - what does it look like? Is centered or skewed in a direction? (1)

your answer here

# Missing values will be dropped

NHData.train %>% ggplot(aes(LBXBPB)) +       
  geom_histogram(bins=100,color = "#000000", fill = "#5191A9", alpha=0.6) +labs(x='Serum Lead')

The variable has a right skewed distribution. We can normalize the distribution using different approaches.

Investigators attempt to make this distribution more “normal” or symmetrically centered around a value by applying a “transformation” to make the linear associations easier to interpret. Since the lead levels seem to have an exponential-like decay a log-based transformation is used. Lets take a look at the log base e of lead levels:

# Missing values will be dropped 

NHData.train <- NHData.train %>% mutate(lead_logged = log(LBXBPB+ .001)) # add 0.001 for values that are 0
p <- ggplot(NHData.train, aes(lead_logged))
p <- p + geom_histogram(bins=30,color = "#000000", fill = "#5191A9", alpha=0.6)+labs(x='Log transformed Serum Lead')

p

Using a multivariate linear regression model, estimate the following:

Execute a multivariate linear regression model to associate log(blood lead levels [plus .001, a small constant]) with age, sex, race, and income-poverty ratio, again in the training cohort (NHData.train) (3)

your answer here

#Updating dsn after creating new variable 
dsn <- svydesign(ids=~ SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.train)

# Running the model 
model4 <- svyglm(lead_logged~ RIDAGEYR+sex+INDFMPIR+ethnicity, dsn)

summary(model4)

## 
## Call:
## svyglm(formula = lead_logged ~ RIDAGEYR + sex + INDFMPIR + ethnicity, 
##     design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.4019785  0.0350688  11.463 9.53e-11 ***
## RIDAGEYR     0.0109253  0.0005453  20.035 1.28e-15 ***
## sex1        -0.4316016  0.0127236 -33.921  < 2e-16 ***
## INDFMPIR    -0.0648098  0.0054698 -11.849 5.07e-11 ***
## ethnicity2   0.1718230  0.0224377   7.658 1.21e-07 ***
## ethnicity3   0.1010466  0.0355791   2.840  0.00953 ** 
## ethnicity4   0.0003425  0.0475020   0.007  0.99431    
## ethnicity5  -0.0401062  0.0386275  -1.038  0.31042    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.4047992)
## 
## Number of Fisher Scoring iterations: 2

Could demographic variables be confounders in a test of association between Lead and BMI or glucose? Why or why not? (2)

your answer here

Yes, demographic variables can be confounders. For example, in the US, lead has historically been used as an ingredient in many types of paint, including residential and industrial paints. Older houses might still have old paint that exposes the residents to the toxic outcomes of lead (today and given the awareness of the toxic effects of lead we see this less), but people with lower socioeconomic status are differentialy more exposed to lead (residing in older houses) Other exposures might be related to certain type of works, that are more prevalent in a certain gender (due to cultural norms).

B. Executing environment-wide associations

Logging and scaling exposure variables

Logging variables

As above, to handle the right skew of exposure biomarkers, the measurements are log transformed. We add a constant (0.001). If we have a model such as the following:

model_log <- svyglm(LBXGLU ~ log(LBXBPB + .001), dsn)
summary(model_log)

## 
## Call:
## svyglm(formula = LBXGLU ~ log(LBXBPB + 0.001), design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          98.2936     0.6279 156.535  < 2e-16 ***
## log(LBXBPB + 0.001)   4.3218     0.6638   6.511 4.68e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1107.804)
## 
## Number of Fisher Scoring iterations: 2

How would you interpret the coefficient? Answer: for a 1 unit change in the log10(exposure), you have a 10 unit increase (9.95) in fasting glucose mg/dL.

Another way to interpret this is in terms of (percentages)[https://stats.oarc.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-terms-of-percent-change-in-linear-regression/]: a 1% change in the lead levels is associated with a 0.04 mg/dL (=4.32/100) higher serum glucose level. However this is association is not independent of potential confounding variables.

Scaling variables

Different variables have different units (e.g., mg/dL for lead levels). When we execute EWAS, we wish for every association to be comparable, or on the same units. Therefore, we need to “scale” the exposure variables to be on the same unit for each variable. One intuitive way to scale is to make all variables on the unit of standard deviation of that variable (e.g., sd(exposure)), dividing each measurement for each individual by the standard deviation of the population after subtracting the mean (centering the distribution at “0”). In code, to standardize or scale lead levels (after logging):

NHData.train <- NHData.train %>% mutate(lead_standardized = scale(log(LBXBPB+.001))) # Ways to standardize or "scale" LBXBPB

NHData.train %>% summarize(m1= mean(lead_standardized, na.rm=T), m2 = mean(log(LBXBPB + .001), na.rm=T), s1 = sd(lead_standardized, na.rm = T), s2=sd(log(LBXBPB + .001), na.rm=T)) ## m1 should be 0, s1 should be equal to 1

##             m1        m2 s1        s2
## 1 2.872113e-17 0.4598617  1 0.7132666

Doing EWAS

Now we will test each of the exposures in ExposureList for linear association with a quantitative or continuous trait in the training and testing datasets separately.

The script should input a the training or testing datasets and output a tibble that contains the exposure ID (e.g., LBXBPB),exposure name, phenotype name,estimate,standard error, pvalue, an false discovery rate (FDR).

Your code should execute associations that associate between each exposure in ExposureList and the phenotype (e.g., BMI). The association should quantify how much the phenotype changes for a 1 standard deviation change in the logarithm base 10 of the exposure value (e.g., scale(log(exposure + 0.001))).

Because some of the exposure variables contain 0 values we need to add a constant before doing a log transformation. Choosing a constant is a matter of hot debate. For the sake of the exercise, use a small constant equal to 0.001. The dependent variable is the phenotype, and the independent variable is an exposure. Include age, sex, race/ethnicity, and income/poverty ratio (INDFMPIR) as covariates.

Execute the following:

EWAS on 1 standard deviation change of Body Mass Index (BMXBMI) in the training and testing datasets separately. Use ‘svyglm’ as the regression approach of choice (see function below). (5)

# here is an (optional) function you can use to take in phenotype and exposure variables as "string" input to a linear regression:

### function to do a pairwise association between a phenotype and exposure
association <- function(phenotype="BMXBMI", exposure="LBXBCD", covariates=c("RIDAGEYR","sex","INDFMPIR"), dsn) {
  # phenotype: a string that is a phenotype name (e.g., "BMXBMI")
  # exposure: a string this is a exposure name (e.g., "LBXBCD")
  # covariates: an array of covariate variables
  # dsn: the survey design object
  covariate_string <- paste(covariates, collapse="+")
  mod_string <- sprintf('scale(%s) ~ scale(log(%s + .001)) + %s', phenotype, exposure, covariate_string)
  # the above formats formula as a string 
  mod <- svyglm(as.formula(mod_string), dsn)
  return(mod)
}

# run an association between BMI and Blood cadmium, adjusting for male sex and age in years
assoc_test_model <- association("BMXBMI", "LBXBCD", dsn=dsn)
model_summary_example <- summary(assoc_test_model)

# your code here:
# run the association function for each exposure in ExposureList

# I will use the association function that you provided, will use a loop to iterate over the exposure list (ExposureList) and report the results.

# I had to recode sex into double, for some reason I kept getting an error when it was factor (even though it was 2 level factor)- this will not affect the results since it is a binary variable. 
NHData.train$sex <- as.double(NHData.train$sex)
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.train)


# Training Data set- BMI as a phenotype 

# Create an empty dataframe to store the results
train_model_Summary_BMI = data.frame(Exposure=character(), term = character(), beta_coefficient = numeric(), sd = numeric(), p_value = numeric())
covariates=c("RIDAGEYR","sex","ethnicity","INDFMPIR")

for (exposure in ExposureList) {
  mod <- association(phenotype = "BMXBMI", exposure = exposure, covariates=covariates,dsn = dsn)
  model_summary=summary(mod)

train_model_Summary_BMI =rbind(train_model_Summary_BMI, data.frame(Exposure =exposure,   term=row.names(model_summary$coefficients),beta_coefficient=model_summary$coefficients[, "Estimate"], sd=model_summary$coefficients[, "Std. Error"],p_value= model_summary$coefficients[, "Pr(>|t|)"]))
  rownames(train_model_Summary_BMI) <- NULL
}

# Test Dataset- BMI as a phenotype 

# Creating the survey dataset 
dsnt <- svydesign(ids=~ SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.test)

# Create an empty dataframe to store the results
test_model_Summary_BMI = data.frame(Exposure=character(), term = character(), beta_coefficient = numeric(),sd = numeric(),p_value = numeric())
covariates=c("RIDAGEYR","sex","ethnicity","INDFMPIR")

for (exposure in ExposureList) {
  mod <- association(phenotype = "BMXBMI", exposure = exposure,covariates=covariates, dsn = dsnt)
  model_summary=summary(mod)

test_model_Summary_BMI =rbind(test_model_Summary_BMI, data.frame(Exposure =exposure,   term=row.names(model_summary$coefficients),beta_coefficient=model_summary$coefficients[, "Estimate"], sd=model_summary$coefficients[, "Std. Error"],p_value= model_summary$coefficients[, "Pr(>|t|)"]))
  
rownames(test_model_Summary_BMI) <- NULL
}

EWAS on 1 standard deviation of fasting blood glucose (LBXGLU) in the training and testing datasets separately. (1)

# your code here
# run the association function for each exposure in ExposureList


# Training Dataset- Fasting blood glucose as a phenotype 

# Create an empty dataframe to store the results
train_model_Summary_LBXGLU = data.frame(Exposure=character(), term = character(), beta_coefficient = numeric(), sd = numeric(), p_value = numeric())
covariates=c("RIDAGEYR","sex","ethnicity","INDFMPIR")

for (exposure in ExposureList) {
  mod <- association(phenotype = "LBXGLU", exposure = exposure, covariates=covariates,dsn = dsn)
  model_summary=summary(mod)

train_model_Summary_LBXGLU =rbind(train_model_Summary_LBXGLU, data.frame(Exposure =exposure,   term=row.names(model_summary$coefficients),beta_coefficient=model_summary$coefficients[, "Estimate"], sd=model_summary$coefficients[, "Std. Error"],p_value= model_summary$coefficients[, "Pr(>|t|)"]))
  rownames(train_model_Summary_LBXGLU) <- NULL
}

# Test Dataset- Fasting blood glucose as a phenotype 

# Creating the survey dataset 
dsnt <- svydesign(ids=~ SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=NHData.test)

# Create an empty dataframe to store the results
test_model_Summary_LBXGLU = data.frame(Exposure=character(), term = character(), beta_coefficient = numeric(),sd = numeric(),p_value = numeric())
covariates=c("RIDAGEYR","sex","ethnicity","INDFMPIR")

for (exposure in ExposureList) {
  mod <- association(phenotype = "LBXGLU", exposure = exposure,covariates=covariates, dsn = dsnt)
  model_summary=summary(mod)

test_model_Summary_LBXGLU =rbind(test_model_Summary_LBXGLU, data.frame(Exposure =exposure,   term=row.names(model_summary$coefficients),beta_coefficient=model_summary$coefficients[, "Estimate"], sd=model_summary$coefficients[, "Std. Error"],p_value= model_summary$coefficients[, "Pr(>|t|)"]))
  
rownames(test_model_Summary_LBXGLU) <- NULL
}

Use the ‘p.adjust’ function in R to estimate the Benjamini-Hochberg False Discovery Rate (the method parameter should be “BH” or “FDR”) for the exposure-phenotype associations in the training data cohort. Specifically, find or filter by all coefficients of the exposures (in ExposureList) and estimate the FDR for the pvalues that correspond to those coefficients. (1)

# Training dataset

# BMI as a phenotype 
# Create a dataframe with all the pvalues for the exposures (estimated 160) (extracted from the model summary)

train_BMI_pvalue_adj <- train_model_Summary_BMI %>%  select(c(Exposure,term, p_value))

# Selecting the rows with the exposures 
train_BMI_pvalue_adj <- train_BMI_pvalue_adj[grepl("^scale", train_BMI_pvalue_adj$term), ]

# Adding a column "FDR" with the adjusted p.value
train_BMI_pvalue_adj <- train_BMI_pvalue_adj %>%  mutate(FDR= p.adjust(p_value))


train_BMI_pvalue_adj %>% select(-c(term)) %>%  as.data.frame() %>% kable() %>% add_header_above(c(" " = 1, "Phenotype: BMI" = 3)) %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")

	Phenotype: BMI
	Exposure	p_value	FDR
2	LBXBCD	0.3195187	1.0000000
11	LBXBPB	0.0000000	0.0000000
20	LBXTHG	0.0076044	0.7921536
29	LBXIHG	0.9544587	1.0000000
38	LBXCOT	0.0062644	0.6765554
47	URXUBA	0.0026423	0.3065035
56	URXUBE	0.8341129	1.0000000
65	URXUCD	0.0531901	1.0000000
74	URXUCO	0.8573066	1.0000000
83	URXUCS	0.0359950	1.0000000
92	URXUMO	0.6599336	1.0000000
101	URXUPB	0.0006536	0.0784369
110	URXUPT	0.0100096	1.0000000
119	URXUSB	0.0104021	1.0000000
128	URXUTL	0.0000026	0.0003625
137	URXUTU	0.2662449	1.0000000
146	URXUUR	0.6184358	1.0000000
155	LBXRBF	0.6032394	1.0000000
164	LBXB12	0.0000000	0.0000000
173	LBXFOL	0.0000000	0.0000000
182	URXUHG	0.0206495	1.0000000
190	URXUIO	0.1032021	1.0000000
199	LBXVID	0.0000022	0.0003130
208	LBXPFOA	0.4054300	1.0000000
217	LBXPFOS	0.7653314	1.0000000
226	LBXPFHS	0.0724645	1.0000000
235	LBXEPAH	0.2114645	1.0000000
244	LBXMPAH	0.7368422	1.0000000
253	LBXPFDE	0.0542814	1.0000000
262	LBXPFHP	0.1661517	1.0000000
271	LBXPFNA	0.1784799	1.0000000
280	LBXPFSA	0.0835150	1.0000000
289	LBXPFUA	0.0104107	1.0000000
298	LBXPFDO	0.2205320	1.0000000
307	URXMBP	0.6394247	1.0000000
316	URXMCP	0.7807973	1.0000000
325	URXMEP	0.0000012	0.0001770
334	URXMHP	0.1788226	1.0000000
343	URXMNP	0.1649842	1.0000000
352	URXMOP	0.1692863	1.0000000
361	URXMZP	0.0172847	1.0000000
370	URXMNM	0.7613717	1.0000000
379	URXMC1	0.3450090	1.0000000
388	URXMHH	0.0020281	0.2393138
397	URXMOH	0.0045881	0.5138655
406	URXMIB	0.1808406	1.0000000
415	LBX028	0.0476617	1.0000000
424	LBX066	0.0416295	1.0000000
433	LBX074	0.0075894	0.7921536
442	LBX105	0.0174660	1.0000000
451	LBX118	0.0001078	0.0140165
460	LBX156	0.0000035	0.0004926
469	LBX157	0.6646874	1.0000000
478	LBX167	0.9671198	1.0000000
487	LBX189	0.5571675	1.0000000
496	LBXD01	0.3874104	1.0000000
505	LBXD02	0.1632009	1.0000000
514	LBXD03	0.0065394	0.6997202
523	LBXD04	0.0120424	1.0000000
532	LBXD05	0.0000078	0.0010812
541	LBXD07	0.0002471	0.0313780
550	LBXF01	0.3103989	1.0000000
559	LBXF02	0.5395425	1.0000000
568	LBXF03	0.5441765	1.0000000
577	LBXF04	0.0000004	0.0000539
586	LBXF05	0.0248908	1.0000000
595	LBXF06	0.1278376	1.0000000
604	LBXF07	0.1354041	1.0000000
613	LBXF08	0.0115322	1.0000000
622	LBXF09	0.0403998	1.0000000
631	LBXF10	0.0048935	0.5431783
640	LBXPCB	0.0000012	0.0001782
649	LBXTC2	0.3422942	1.0000000
658	LBXHXC	0.0000013	0.0001815
667	LBXTCD	0.4352769	1.0000000
676	LBX052	0.0005042	0.0620127
685	LBX087	0.6499132	1.0000000
694	LBX099	0.0014810	0.1762418
703	LBX101	0.5486068	1.0000000
712	LBX110	0.4037700	1.0000000
721	LBX128	0.0542540	1.0000000
730	LBX138	0.3540049	1.0000000
739	LBX146	0.0023781	0.2782415
748	LBX149	0.2241002	1.0000000
757	LBX151	0.1416618	1.0000000
766	LBX153	0.0201897	1.0000000
775	LBX170	0.0000103	0.0014024
784	LBX172	0.1325569	1.0000000
793	LBX177	0.5836161	1.0000000
802	LBX178	0.0038537	0.4354637
811	LBX180	0.0000016	0.0002235
820	LBX183	0.3078027	1.0000000
829	LBX187	0.0006355	0.0768899
838	LBX194	0.0001191	0.0153597
847	LBX195	0.5192174	1.0000000
856	LBX196	0.0002691	0.0339125
865	LBD199	0.0002766	0.0345706
874	LBX206	0.0001403	0.0179536
883	LBXHCB	0.8078181	1.0000000
892	LBXBHC	0.0000336	0.0044999
901	LBXGHC	0.2048413	1.0000000
910	LBXPDE	0.0726419	1.0000000
919	LBXPDT	0.0004995	0.0619334
928	LBXODT	0.0094251	0.9613633
937	LBXOXY	0.0637211	1.0000000
946	LBXTNA	0.0459136	1.0000000
955	LBXHPE	0.0000000	0.0000010
964	LBXMIR	0.0450957	1.0000000
973	LBXALD	0.1792932	1.0000000
982	LBXDIE	0.0000647	0.0085417
991	LBXEND	0.1827504	1.0000000
1000	URXP01	0.9399434	1.0000000
1009	URXP02	0.0005182	0.0632239
1018	URXP03	0.0049827	0.5480947
1027	URXP04	0.0000455	0.0060562
1036	URXP05	0.0602737	1.0000000
1045	URXP06	0.0000898	0.0117703
1054	URXP07	0.0000011	0.0001658
1063	URXP08	0.6753242	1.0000000
1072	URXP10	0.0697394	1.0000000
1081	URXP11	0.9291868	1.0000000
1090	URXP12	0.0110013	1.0000000
1099	URXP13	0.6571452	1.0000000
1108	URXP14	0.3413248	1.0000000
1117	URXP15	0.0028134	0.3235455
1126	URXP16	0.9354066	1.0000000
1135	URXP17	0.0052727	0.5747190
1144	URXP19	0.0075443	0.7921536
1153	URXP20	0.0066907	0.7092133
1162	URXP21	0.1529976	1.0000000
1171	URXP22	0.7391207	1.0000000
1180	URXP24	0.9327224	1.0000000
1189	LBXIRN	0.0000001	0.0000209
1198	LBXVIE	0.8894893	1.0000000
1207	LBXALC	0.0000008	0.0001123
1216	LBXBEC	0.0000000	0.0000012
1225	LBXCBC	0.0000001	0.0000199
1234	LBXCRY	0.0000018	0.0002544
1243	LBXGTC	0.0000000	0.0000000
1252	LBXLUZ	0.0000002	0.0000232
1261	LBXLYC	0.0163420	1.0000000
1270	LBXRPL	0.0037080	0.4227151
1279	LBXRST	0.0000096	0.0013156
1288	LBXVIA	0.0000013	0.0001815
1297	URX14D	0.0132166	1.0000000
1306	URX1TB	0.1450205	1.0000000
1315	URX3TB	0.0737432	1.0000000
1324	URXCBF	0.4927722	1.0000000
1333	URXPCP	0.9379043	1.0000000
1342	URXOPP	0.7896600	1.0000000
1351	URXDPY	0.9302191	1.0000000
1360	URXMET	0.9309390	1.0000000
1369	URXTCC	0.0248860	1.0000000
1378	LBXSPH	0.0000003	0.0000506
1387	URXDAZ	0.0447449	1.0000000
1396	URXDMA	0.8935126	1.0000000
1405	URXEQU	0.9083837	1.0000000
1414	URXETD	0.0914072	1.0000000
1423	URXETL	0.0000109	0.0014760
1432	URXGNS	0.2845850	1.0000000

# Fasting blood glucose as a phenotype 

train_LBXGLU_pvalue_adj <- train_model_Summary_LBXGLU %>%  select(c(Exposure,term, p_value))

# Selecting the rows with the exposures 
train_LBXGLU_pvalue_adj <- train_LBXGLU_pvalue_adj[grepl("^scale", train_LBXGLU_pvalue_adj$term), ]

# Adding a column "FDR" with the adjusted p.value
train_LBXGLU_pvalue_adj <- train_LBXGLU_pvalue_adj %>%  mutate(FDR= p.adjust(p_value))

train_LBXGLU_pvalue_adj %>% select(-c(term)) %>%  as.data.frame() %>% kable() %>% add_header_above(c(" " = 1, "Phenotype: Fasting blood glucose" = 3)) %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")

	Phenotype: Fasting blood glucose
	Exposure	p_value	FDR
2	LBXBCD	0.2524382	1.0000000
11	LBXBPB	0.0099156	1.0000000
20	LBXTHG	0.8737893	1.0000000
28	LBXIHG	0.6528913	1.0000000
36	LBXCOT	0.5066388	1.0000000
45	URXUBA	0.8637950	1.0000000
54	URXUBE	0.8564783	1.0000000
63	URXUCD	0.0436044	1.0000000
72	URXUCO	0.7561466	1.0000000
81	URXUCS	0.1820190	1.0000000
90	URXUMO	0.3752412	1.0000000
99	URXUPB	0.0955528	1.0000000
108	URXUPT	0.0123237	1.0000000
117	URXUSB	0.9557703	1.0000000
126	URXUTL	0.0579288	1.0000000
135	URXUTU	0.5287842	1.0000000
144	URXUUR	0.7742112	1.0000000
153	LBXRBF	0.0755413	1.0000000
162	LBXB12	0.0541075	1.0000000
171	LBXFOL	0.5739248	1.0000000
180	URXUHG	0.0082958	1.0000000
188	URXUIO	0.3039359	1.0000000
197	LBXVID	0.0045006	0.6750972
206	LBXPFOA	0.2662639	1.0000000
215	LBXPFOS	0.8510614	1.0000000
224	LBXPFHS	0.7351347	1.0000000
233	LBXEPAH	0.3918283	1.0000000
242	LBXMPAH	0.6184530	1.0000000
251	LBXPFDE	0.3184249	1.0000000
260	LBXPFHP	0.1498731	1.0000000
269	LBXPFNA	0.9678252	1.0000000
278	LBXPFSA	0.7087865	1.0000000
287	LBXPFUA	0.0947895	1.0000000
296	LBXPFDO	0.0208490	1.0000000
305	URXMBP	0.0787413	1.0000000
314	URXMCP	0.8602032	1.0000000
323	URXMEP	0.6673060	1.0000000
332	URXMHP	0.0156763	1.0000000
341	URXMNP	0.3080442	1.0000000
350	URXMOP	0.1003312	1.0000000
359	URXMZP	0.4724653	1.0000000
368	URXMNM	0.7466823	1.0000000
377	URXMC1	0.3502295	1.0000000
386	URXMHH	0.1850945	1.0000000
395	URXMOH	0.1802506	1.0000000
404	URXMIB	0.6273750	1.0000000
413	LBX028	0.9652125	1.0000000
422	LBX066	0.1034568	1.0000000
431	LBX074	0.0381726	1.0000000
440	LBX105	0.0874380	1.0000000
449	LBX118	0.0004591	0.0730005
458	LBX156	0.2574891	1.0000000
467	LBX157	0.9227788	1.0000000
476	LBX167	0.3573418	1.0000000
485	LBX189	0.5066463	1.0000000
494	LBXD01	0.5290067	1.0000000
503	LBXD02	0.1144996	1.0000000
512	LBXD03	0.0594733	1.0000000
521	LBXD04	0.0702706	1.0000000
530	LBXD05	0.0124010	1.0000000
539	LBXD07	0.1290237	1.0000000
548	LBXF01	0.3270192	1.0000000
557	LBXF02	0.8788898	1.0000000
566	LBXF03	0.0143752	1.0000000
575	LBXF04	0.0207212	1.0000000
584	LBXF05	0.0281421	1.0000000
593	LBXF06	0.5258286	1.0000000
602	LBXF07	0.2007177	1.0000000
611	LBXF08	0.0525535	1.0000000
620	LBXF09	0.1840493	1.0000000
629	LBXF10	0.1728195	1.0000000
638	LBXPCB	0.0017955	0.2782985
647	LBXTC2	0.0968072	1.0000000
656	LBXHXC	0.0310668	1.0000000
665	LBXTCD	0.4790167	1.0000000
674	LBX052	0.2139956	1.0000000
683	LBX087	0.5294871	1.0000000
692	LBX099	0.0378590	1.0000000
701	LBX101	0.5871605	1.0000000
710	LBX110	0.6072532	1.0000000
719	LBX128	0.4314588	1.0000000
728	LBX138	0.0272366	1.0000000
737	LBX146	0.0939591	1.0000000
746	LBX149	0.4869590	1.0000000
755	LBX151	0.4851366	1.0000000
764	LBX153	0.0336878	1.0000000
773	LBX170	0.0301907	1.0000000
782	LBX172	0.2610474	1.0000000
791	LBX177	0.1915411	1.0000000
800	LBX178	0.6080314	1.0000000
809	LBX180	0.0929561	1.0000000
818	LBX183	0.1034680	1.0000000
827	LBX187	0.0194162	1.0000000
836	LBX194	0.6321377	1.0000000
845	LBX195	0.4980020	1.0000000
854	LBX196	0.1590014	1.0000000
863	LBD199	0.2854395	1.0000000
872	LBX206	0.6053112	1.0000000
881	LBXHCB	0.1359256	1.0000000
890	LBXBHC	0.0013993	0.2182849
899	LBXGHC	0.7282429	1.0000000
908	LBXPDE	0.0018151	0.2795214
917	LBXPDT	0.0006299	0.0995209
926	LBXODT	0.9597318	1.0000000
935	LBXOXY	0.0330669	1.0000000
944	LBXTNA	0.0338404	1.0000000
953	LBXHPE	0.0043534	0.6573670
962	LBXMIR	0.3653136	1.0000000
971	LBXALD	0.4833529	1.0000000
980	LBXDIE	0.0283430	1.0000000
989	LBXEND	0.5020156	1.0000000
998	URXP01	0.7068498	1.0000000
1007	URXP02	0.9118684	1.0000000
1016	URXP03	0.6687521	1.0000000
1025	URXP04	0.1121305	1.0000000
1034	URXP05	0.2378190	1.0000000
1043	URXP06	0.1683708	1.0000000
1052	URXP07	0.0158414	1.0000000
1061	URXP08	0.6678138	1.0000000
1070	URXP10	0.1213364	1.0000000
1079	URXP11	0.2429619	1.0000000
1088	URXP12	0.2239252	1.0000000
1097	URXP13	0.3462016	1.0000000
1106	URXP14	0.1889930	1.0000000
1115	URXP15	0.9654468	1.0000000
1124	URXP16	0.8236533	1.0000000
1133	URXP17	0.1697639	1.0000000
1142	URXP19	0.7135169	1.0000000
1151	URXP20	0.3700847	1.0000000
1160	URXP21	0.6963767	1.0000000
1169	URXP22	0.8653249	1.0000000
1178	URXP24	0.3617239	1.0000000
1187	LBXIRN	0.0081613	1.0000000
1196	LBXVIE	0.4844879	1.0000000
1205	LBXALC	0.0187917	1.0000000
1214	LBXBEC	0.0021148	0.3214506
1223	LBXCBC	0.0009141	0.1435199
1232	LBXCRY	0.0020868	0.3192807
1241	LBXGTC	0.0000000	0.0000048
1250	LBXLUZ	0.0117679	1.0000000
1259	LBXLYC	0.0442429	1.0000000
1268	LBXRPL	0.7692859	1.0000000
1277	LBXRST	0.0866710	1.0000000
1286	LBXVIA	0.3413536	1.0000000
1295	URX14D	0.7273087	1.0000000
1304	URX1TB	0.0199153	1.0000000
1313	URX3TB	0.8332302	1.0000000
1322	URXCBF	0.7636764	1.0000000
1331	URXPCP	0.7997655	1.0000000
1340	URXOPP	0.2847122	1.0000000
1349	URXDPY	0.7779530	1.0000000
1358	URXMET	0.2600918	1.0000000
1367	URXTCC	0.6232722	1.0000000
1376	LBXSPH	0.1312346	1.0000000
1385	URXDAZ	0.2820743	1.0000000
1394	URXDMA	0.2668222	1.0000000
1403	URXEQU	0.3679812	1.0000000
1412	URXETD	0.7907561	1.0000000
1421	URXETL	0.8667247	1.0000000
1430	URXGNS	0.9148936	1.0000000

Combining all together into a final dataset per Phenotype for the training dataset.

Phenotype: BMI

train_summary_final_BMI- is a nested dataframe, it is nested by the exposure and the exposure description. For every exposure under the “data” column, you can see the full adjusted linear model of association. For each coefficient, the following information is displayed: Data: estimate,standard error, pvalue, and false discovery rate (FDR) for the exposure variable.

# Merging the model summary with FDR.
train_BMI_pvalue_adj_reduced <- train_BMI_pvalue_adj %>% select(c(term, FDR))
train_BMI_model_Summary_FDR <- merge(train_model_Summary_BMI,train_BMI_pvalue_adj_reduced  , by = "term", all.x = TRUE)

# Nesting the dataframe by Exposure 
train_model_Summary_nested_BMI <-train_BMI_model_Summary_FDR %>% nest_by(Exposure)

# Adding the description of the exposure ID.
exposure_def <- ExposureDescription %>% select(c(var,var_desc))
names(exposure_def)[1] <- "Exposure"
names(exposure_def)[2] <- "Exposure Description"

# Deleting duplicate rows, using the exposure column since the exposure description column has subtle changes in the description of element (like capital vs small letters), that won't be detected as duplicates. 
duplicates <- duplicated(exposure_def$Exposure)
exposure_def <- exposure_def[!duplicates, ]

# Merging in the description of each exposure (inner join)
train_summary_final_BMI <- merge(train_model_Summary_nested_BMI,exposure_def, by = "Exposure", all = FALSE)

# Customizing the output dataframe 
new_order <- c("Exposure", "Exposure Description", "data")
train_summary_final_BMI <- train_summary_final_BMI[new_order]

Displaying a DataFrame with only the Exposures and their relevant measures:

train_BMI_model_exposures_only <- train_BMI_model_Summary_FDR
train_BMI_model_exposures_only <- train_BMI_model_exposures_only[grepl("^scale", train_BMI_model_exposures_only$term), ]
train_BMI_model_exposures_only <- train_BMI_model_exposures_only %>% select(-term)
train_BMI_model_exposures_only <- merge(train_BMI_model_exposures_only,exposure_def, by = "Exposure", all = FALSE)
new_order <- c("Exposure", "Exposure Description", "beta_coefficient", "sd","p_value","FDR")
train_BMI_model_exposures_only <- train_BMI_model_exposures_only[new_order]



train_BMI_model_exposures_only %>% as.data.frame() %>%  setNames(c("Exposure", "Exposure Description", "Beta Coefficient", "Standard Error","pValue", "FDR"))%>% kable() %>%  add_header_above(c(" " = 1, "Phenotype: BMI" = 5))%>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")

	Phenotype: BMI
Exposure	Exposure Description	Beta Coefficient	Standard Error	pValue	FDR
LBD199	PCB199 (ng/g)	-0.3408196	0.0508426	0.0002766	0.0345706
LBX028	PCB28 (ng/g)	0.0915053	0.0368643	0.0476617	1.0000000
LBX052	PCB52 (ng/g)	0.0795955	0.0193847	0.0005042	0.0620127
LBX066	PCB66 (ng/g)	0.0420634	0.0193845	0.0416295	1.0000000
LBX074	PCB74 (ng/g)	0.0807489	0.0273424	0.0075894	0.7921536
LBX087	PCB87 (ng/g)	0.0074455	0.0157070	0.6499132	1.0000000
LBX099	PCB99 (ng/g)	0.0762283	0.0208611	0.0014810	0.1762418
LBX101	PCB101 (ng/g)	0.0169128	0.0277400	0.5486068	1.0000000
LBX105	PCB105 (ng/g)	0.0567227	0.0219861	0.0174660	1.0000000
LBX110	PCB110 (ng/g)	0.0162001	0.0182331	0.4037700	1.0000000
LBX118	PCB118 (ng/g)	0.1117064	0.0235063	0.0001078	0.0140165
LBX128	PCB128 (ng/g)	0.0577789	0.0283394	0.0542540	1.0000000
LBX138	PCB138 & 158 (ng/g)	0.0241376	0.0254667	0.3540049	1.0000000
LBX146	PCB146 (ng/g)	-0.0830151	0.0240366	0.0023781	0.2782415
LBX149	PCB149 (ng/g)	0.0222698	0.0166993	0.2241002	1.0000000
LBX151	PCB151 (ng/g)	0.0280933	0.0169630	0.1416618	1.0000000
LBX153	PCB153 (ng/g)	-0.0778833	0.0309886	0.0201897	1.0000000
LBX156	PCB156 (ng/g)	-0.1236637	0.0198585	0.0000035	0.0004926
LBX157	PCB157 (ng/g)	0.0115978	0.0263800	0.6646874	1.0000000
LBX167	PCB167 (ng/g)	-0.0010927	0.0261936	0.9671198	1.0000000
LBX170	PCB170 (ng/g)	-0.1616512	0.0280853	0.0000103	0.0014024
LBX172	PCB172 (ng/g)	-0.0307905	0.0196760	0.1325569	1.0000000
LBX177	PCB177 (ng/g)	-0.0132860	0.0238655	0.5836161	1.0000000
LBX178	PCB178 (ng/g)	-0.0764934	0.0235545	0.0038537	0.4354637
LBX180	PCB180 (ng/g)	-0.2086052	0.0316211	0.0000016	0.0002235
LBX183	PCB183 (ng/g)	-0.0220815	0.0211261	0.3078027	1.0000000
LBX187	PCB187 (ng/g)	-0.1127399	0.0281197	0.0006355	0.0768899
LBX189	PCB189 (ng/g)	0.0108575	0.0176159	0.5571675	1.0000000
LBX194	PCB194 (ng/g)	-0.3778197	0.0492534	0.0001191	0.0153597
LBX195	PCB195 (ng/g)	0.0114952	0.0169407	0.5192174	1.0000000
LBX196	PCB196 & 203 (ng/g)	-0.2656293	0.0394514	0.0002691	0.0339125
LBX206	PCB206 (ng/g)	-0.1628048	0.0217808	0.0001403	0.0179536
LBXALC	a-Carotene(ug/dL)	-0.1916125	0.0116597	0.0000008	0.0001123
LBXALD	Aldrin (ng/g)	0.0213970	0.0143398	0.1792932	1.0000000
LBXB12	Vitamin B12, serum (pg/mL)	-0.2047473	0.0103057	0.0000000	0.0000000
LBXBCD	Cadmium (ug/L)	-0.0106824	0.0104770	0.3195187	1.0000000
LBXBEC	trans-b-carotene(ug/dL)	-0.2583417	0.0081134	0.0000000	0.0000012
LBXBHC	Beta-hexachlorocyclohexane (ng/g)	0.1435895	0.0273696	0.0000336	0.0044999
LBXBPB	Lead (ug/dL)	-0.2035729	0.0159326	0.0000000	0.0000000
LBXCBC	cis-b-carotene(ug/dL)	-0.2409739	0.0113488	0.0000001	0.0000199
LBXCOT	Cotinine (ng/mL)	0.0358362	0.0117989	0.0062644	0.6765554
LBXCRY	b-cryptoxanthin(ug/dL)	-0.2368754	0.0163658	0.0000018	0.0002544
LBXD01	1,2,3,7,8-pncdd (fg/g)	-0.0194447	0.0220296	0.3874104	1.0000000
LBXD02	1,2,3,4,7,8-hxcdd (fg/g)	0.0696474	0.0447041	0.1632009	1.0000000
LBXD03	1,2,3,6,7,8-hxcdd (fg/g)	0.0579234	0.0191896	0.0065394	0.6997202
LBXD04	1,2,3,7,8,9-hxcdd (fg/g)	0.0536833	0.0195327	0.0120424	1.0000000
LBXD05	1,2,3,4,6,7,8-hpcdd (fg/g)	0.1598730	0.0272081	0.0000078	0.0010812
LBXD07	1,2,3,4,6,7,8,9-ocdd (fg/g)	0.1318667	0.0299399	0.0002471	0.0313780
LBXDIE	Dieldrin (ng/g)	0.3250939	0.0385232	0.0000647	0.0085417
LBXEND	Endrin (ng/g)	0.0213904	0.0144656	0.1827504	1.0000000
LBXEPAH	2-(N-ethyl-PFOSA) acetate	0.0384728	0.0275097	0.2114645	1.0000000
LBXF01	2,3,7,8-tcdf (fg/g)	0.0262069	0.0252113	0.3103989	1.0000000
LBXF02	1,2,3,7,8-pncdf (fg/g)	0.0135876	0.0217855	0.5395425	1.0000000
LBXF03	2,3,4,7,8-pncdf (fg/g)	-0.0128551	0.0208510	0.5441765	1.0000000
LBXF04	1,2,3,4,7,8-hxcdf (fg/g)	0.1348533	0.0185184	0.0000004	0.0000539
LBXF05	1,2,3,6,7,8-hxcdf (fg/g)	0.0455173	0.0188408	0.0248908	1.0000000
LBXF06	1,2,3,7,8,9-hxcdf (fg/g)	0.0308585	0.0194651	0.1278376	1.0000000
LBXF07	2,3,4,6,7,8-hxcdf (fg/g)	0.0251465	0.0161939	0.1354041	1.0000000
LBXF08	1,2,3,4,6,7,8-hpcdf (fg/g)	0.0438385	0.0158389	0.0115322	1.0000000
LBXF09	1,2,3,4,7,8,9-hpcdf (fg/g)	0.0697482	0.0277887	0.0403998	1.0000000
LBXF10	1,2,3,4,6,7,8,9-ocdf (fg/g)	0.0622894	0.0198089	0.0048935	0.5431783
LBXFOL	Folate, serum (ng/mL)	-0.1905314	0.0096388	0.0000000	0.0000000
LBXGHC	Gamma-hexachlorocyclohexane (ng/g)	0.0403834	0.0308621	0.2048413	1.0000000
LBXGTC	g-tocopherol(ug/dL)	0.2285937	0.0100124	0.0000000	0.0000000
LBXHCB	Hexachlorobenzene (ng/g)	0.0073522	0.0298469	0.8078181	1.0000000
LBXHPE	Heptachlor Epoxide (ng/g)	0.2698109	0.0288549	0.0000000	0.0000010
LBXHXC	3,3,4,4,5,5-hxcb (fg/g)	-0.2141849	0.0319744	0.0000013	0.0001815
LBXIHG	Mercury, inorganic (ug/L)	-0.0012979	0.0224575	0.9544587	1.0000000
LBXIRN	Iron, Frozen Serum (ug/dL)	-0.0928059	0.0119747	0.0000001	0.0000209
LBXLUZ	Combined Lutein/zeaxanthin (ug/dL)	-0.2284943	0.0110242	0.0000002	0.0000232
LBXLYC	trans-lycopene(ug/dL)	0.0391903	0.0124745	0.0163420	1.0000000
LBXMIR	Mirex (ng/g)	-0.0431996	0.0202747	0.0450957	1.0000000
LBXMPAH	2-(N-methyl-PFOSA) acetate	-0.0104817	0.0297741	0.7368422	1.0000000
LBXODT	o,p-DDT (ng/g)	0.0795222	0.0278279	0.0094251	0.9613633
LBXOXY	Oxychlordane (ng/g)	0.0608732	0.0310987	0.0637211	1.0000000
LBXPCB	3,3,4,4,5-pncb (fg/g)	0.1607471	0.0239557	0.0000012	0.0001782
LBXPDE	p,p-DDE (ng/g)	0.0555545	0.0293940	0.0726419	1.0000000
LBXPDT	p,p-DDT (ng/g)	0.0874431	0.0212755	0.0004995	0.0619334
LBXPFDE	Perfluorodecanoic acid	-0.0559531	0.0234452	0.0542814	1.0000000
LBXPFDO	Perfluorododecanoic acid	-0.0590740	0.0432043	0.2205320	1.0000000
LBXPFHP	Perfluoroheptanoic acid	0.0453131	0.0287564	0.1661517	1.0000000
LBXPFHS	Perfluorohexane sulfonic acid	-0.0906463	0.0416582	0.0724645	1.0000000
LBXPFNA	Perfluorononanoic acid	-0.0667847	0.0438379	0.1784799	1.0000000
LBXPFOA	Perfluorooctanoic acid	0.0236730	0.0264606	0.4054300	1.0000000
LBXPFOS	Perfluorooctane sulfonic acid	-0.0102091	0.0326829	0.7653314	1.0000000
LBXPFSA	Perfluorooctane sulfonamide	-0.0985670	0.0475431	0.0835150	1.0000000
LBXPFUA	Perfluoroundecanoic acid	-0.0685577	0.0186627	0.0104107	1.0000000
LBXRBF	Folate, RBC (ng/mL RBC)	-0.0065411	0.0123954	0.6032394	1.0000000
LBXRPL	Retinyl palmitate(ug/dL)	-0.0547151	0.0167629	0.0037080	0.4227151
LBXRST	Retinyl stearate(ug/dL)	-0.0655155	0.0113216	0.0000096	0.0013156
LBXSPH	Phosphorus (mg/dL)	-0.1023527	0.0139922	0.0000003	0.0000506
LBXTC2	3,4,4,5-tcb (fg/g)	-0.0192188	0.0197799	0.3422942	1.0000000
LBXTCD	2,3,7,8-tcdd (fg/g)	0.0198184	0.0249161	0.4352769	1.0000000
LBXTHG	Mercury, total (ug/L)	-0.0657011	0.0222536	0.0076044	0.7921536
LBXTNA	Trans-nonachlor (ng/g)	0.0594385	0.0280126	0.0459136	1.0000000
LBXVIA	Retinol(ug/dL)	0.0964585	0.0144043	0.0000013	0.0001815
LBXVID	Vitamin D (ng/mL)	-0.2598976	0.0185320	0.0000022	0.0003130
LBXVIE	a-Tocopherol(ug/dL)	-0.0018945	0.0134701	0.8894893	1.0000000
URX14D	2,5-dichlorophenol (ug/L) result	0.0684557	0.0252926	0.0132166	1.0000000
URX1TB	2,4,5-trichlorophenol (ug/L) result	0.0251401	0.0166089	0.1450205	1.0000000
URX3TB	2,4,6-trichlorophenol (ug/L) result	-0.0315782	0.0167775	0.0737432	1.0000000
URXCBF	Carbofuranphenol (ug/L) result	-0.0219768	0.0314808	0.4927722	1.0000000
URXDAZ	Daidzein (ng/mL)	0.0310222	0.0145333	0.0447449	1.0000000
URXDMA	o-Desmethylangolensin (O-DMA) (ng/mL)	-0.0027413	0.0202323	0.8935126	1.0000000
URXDPY	diethylaminomethylpyrimidinol/one (ug/L)	0.0020866	0.0229881	0.9302191	1.0000000
URXEQU	Equol (ng/mL)	0.0025834	0.0221803	0.9083837	1.0000000
URXETD	Enterodiol (ng/mL)	-0.0407436	0.0230313	0.0914072	1.0000000
URXETL	Enterolactone (ng/mL)	-0.1152220	0.0201079	0.0000109	0.0014760
URXGNS	Genistein (ng/mL)	0.0186561	0.0169892	0.2845850	1.0000000
URXMBP	Mono-n-butyl phthalate	0.0076812	0.0161581	0.6394247	1.0000000
URXMC1	Mono-(3-carboxypropyl) phthalate	0.0257322	0.0254145	0.3450090	1.0000000
URXMCP	Mono-cyclohexyl phthalate	-0.0055374	0.0196446	0.7807973	1.0000000
URXMEP	Mono-ethyl phthalate	0.1090921	0.0162427	0.0000012	0.0001770
URXMET	Metolachlor mercapturate (ug/L) result	0.0017177	0.0191221	0.9309390	1.0000000
URXMHH	Mono-(2-ethyl-5-hydroxyhexyl) phthalate	0.0790602	0.0165632	0.0020281	0.2393138
URXMHP	Mono-(2-ethyl)-hexyl phthalate	0.0218056	0.0156774	0.1788226	1.0000000
URXMIB	Mono-isobutyl phthalate	0.0276898	0.0186325	0.1808406	1.0000000
URXMNM	Mono-n-methyl phthalate	0.0074960	0.0237375	0.7613717	1.0000000
URXMNP	Mono-isononyl phthalate	-0.0128979	0.0089652	0.1649842	1.0000000
URXMOH	Mono-(2-ethyl-5-oxohexl) phthalate	0.0776022	0.0189394	0.0045881	0.5138655
URXMOP	Mono-n-octyl phthalate	-0.0281631	0.0197847	0.1692863	1.0000000
URXMZP	Mono-benzyl phthalate	0.0602601	0.0233141	0.0172847	1.0000000
URXOPP	O-Phenyl phenol (ug/L) result	0.0065515	0.0242488	0.7896600	1.0000000
URXP01	1-napthol (ng/L)	0.0020432	0.0261658	0.9399434	1.0000000
URXP02	2-napthol (ng/L)	0.0949955	0.0157132	0.0005182	0.0632239
URXP03	3-fluorene (ng/L)	0.0546421	0.0174202	0.0049827	0.5480947
URXP04	2-fluorene (ng/L)	0.0858877	0.0167861	0.0000455	0.0060562
URXP05	3-phenanthrene (ng/L)	0.0392506	0.0197663	0.0602737	1.0000000
URXP06	1-phenanthrene (ng/L)	0.1089104	0.0225535	0.0000898	0.0117703
URXP07	2-phenanthrene (ng/L)	0.1404064	0.0208021	0.0000011	0.0001658
URXP08	1-benzo[c] phenanthrene (ng/L)	0.0080217	0.0188848	0.6753242	1.0000000
URXP10	1-pyrene (ng/L)	0.0320947	0.0167939	0.0697394	1.0000000
URXP11	2-benzo[c] phenanthrene (ng/L)	-0.0017306	0.0192421	0.9291868	1.0000000
URXP12	1-benzo[a] anthracene (ng/L)	0.0338052	0.0121216	0.0110013	1.0000000
URXP13	6-chrysene (ng/L)	0.0102043	0.0226637	0.6571452	1.0000000
URXP14	3-benzo[c] phenanthrene (ng/L)	0.0155483	0.0159695	0.3413248	1.0000000
URXP15	3-chrysene (ng/L)	-0.0537573	0.0158944	0.0028134	0.3235455
URXP16	3-benz[a] anthracene (ng/L)	0.0022213	0.0270823	0.9354066	1.0000000
URXP17	9-fluorene (ng/L)	0.1118388	0.0280468	0.0052727	0.5747190
URXP19	4-phenanthrene (ng/L)	0.0652742	0.0175891	0.0075443	0.7921536
URXP20	1-chrysene (ng/L)	0.0976988	0.0256916	0.0066907	0.7092133
URXP21	2-chrysene (ng/L)	0.0358762	0.0223825	0.1529976	1.0000000
URXP22	4-chrysene (ng/L)	0.0117159	0.0338084	0.7391207	1.0000000
URXP24	3-benzo(a) pyrene (ng/L)	-0.0023909	0.0273231	0.9327224	1.0000000
URXPCP	Pentachlorophenol (ug/L) result	0.0019523	0.0247614	0.9379043	1.0000000
URXTCC	dichlorovnl-dimeth prop carboacid (ug/L)	-0.0391348	0.0161983	0.0248860	1.0000000
URXUBA	Barium, urine (ng/mL)	0.0510851	0.0149858	0.0026423	0.3065035
URXUBE	Beryllium, urine (ng/mL)	-0.0031338	0.0147788	0.8341129	1.0000000
URXUCD	Cadmium, urine (ng/mL)	0.0411535	0.0200872	0.0531901	1.0000000
URXUCO	Cobalt, urine (ng/mL)	-0.0029838	0.0163922	0.8573066	1.0000000
URXUCS	Cesium, urine (ng/mL)	0.0254334	0.0113513	0.0359950	1.0000000
URXUHG	Mercury, urine (ng/mL)	-0.0550791	0.0220874	0.0206495	1.0000000
URXUIO	Iodine, urine (ng/mL)	-0.0358517	0.0191400	0.1032021	1.0000000
URXUMO	Molybdenum, urine (ng/mL)	0.0075207	0.0168505	0.6599336	1.0000000
URXUPB	Lead, urine (ng/mL)	-0.0530362	0.0132674	0.0006536	0.0784369
URXUPT	Platinum, urine (ng/mL)	-0.0445046	0.0157208	0.0100096	1.0000000
URXUSB	Antimony, urine (ng/mL)	0.0584387	0.0207685	0.0104021	1.0000000
URXUTL	Thallium, urine (ng/mL)	0.0743989	0.0116830	0.0000026	0.0003625
URXUTU	Tungsten, urine (ng/mL)	0.0195960	0.0171570	0.2662449	1.0000000
URXUUR	Uranium, urine (ng/mL)	0.0118812	0.0228046	0.6184358	1.0000000

Phenotype: Fasting Blood Glucose

train_summary_final_LBXGLU- is a nested dataframe, it is nested by the exposure and the exposure description. For every exposure under the “data” column, you can see the full adjusted linear model of association. For each coefficient, the following information is displayed: Data: estimate,standard error, pvalue, and false discovery rate (FDR) for the exposure variable.

# Merging the model summary with FDR.
train_LBXGLU_pvalue_adj_reduced <- train_LBXGLU_pvalue_adj %>% select(c(term, FDR))
train_LBXGLU_model_Summary_FDR <- merge(train_model_Summary_LBXGLU,train_LBXGLU_pvalue_adj_reduced  , by = "term", all.x = TRUE)

# Nesting the dataframe by Exposure 
train_model_Summary_nested_LBXGLU <-train_LBXGLU_model_Summary_FDR %>% nest_by(Exposure)

# Merging in the description of each exposure (inner join)
train_summary_final_LBXGLU <- merge(train_model_Summary_nested_LBXGLU,exposure_def, by = "Exposure", all = FALSE)

# Customizing the output dataframe 
new_order <- c("Exposure", "Exposure Description", "data")
train_summary_final_LBXGLU <- train_summary_final_LBXGLU[new_order]

Displaying a DataFrame with only the Exposures and their relevant measures:

train_LBXGLU_model_exposures_only <- train_LBXGLU_model_Summary_FDR
train_LBXGLU_model_exposures_only <- train_LBXGLU_model_exposures_only[grepl("^scale", train_LBXGLU_model_exposures_only$term), ]
train_LBXGLU_model_exposures_only <- train_LBXGLU_model_exposures_only %>% select(-term)
train_LBXGLU_model_exposures_only <- merge(train_LBXGLU_model_exposures_only,exposure_def, by = "Exposure", all = FALSE)
new_order <- c("Exposure", "Exposure Description", "beta_coefficient", "sd","p_value","FDR")
train_LBXGLU_model_exposures_only <- train_LBXGLU_model_exposures_only[new_order]



train_LBXGLU_model_exposures_only %>% as.data.frame() %>%  setNames(c("Exposure", "Exposure Description", "Beta Coefficient", "Standard Error","pValue", "FDR"))%>% kable() %>%  add_header_above(c(" " = 1, "Phenotype: Fasting Blood Glucose" = 5))%>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")

	Phenotype: Fasting Blood Glucose
Exposure	Exposure Description	Beta Coefficient	Standard Error	pValue	FDR
LBD199	PCB199 (ng/g)	-0.0515945	0.0446150	0.2854395	1.0000000
LBX028	PCB28 (ng/g)	0.0011809	0.0259753	0.9652125	1.0000000
LBX052	PCB52 (ng/g)	0.0250654	0.0195598	0.2139956	1.0000000
LBX066	PCB66 (ng/g)	0.1074855	0.0631412	0.1034568	1.0000000
LBX074	PCB74 (ng/g)	0.1269555	0.0573901	0.0381726	1.0000000
LBX087	PCB87 (ng/g)	0.0276298	0.0417716	0.5294871	1.0000000
LBX099	PCB99 (ng/g)	0.1043700	0.0470952	0.0378590	1.0000000
LBX101	PCB101 (ng/g)	-0.0142698	0.0258778	0.5871605	1.0000000
LBX105	PCB105 (ng/g)	0.0527263	0.0294115	0.0874380	1.0000000
LBX110	PCB110 (ng/g)	0.0197141	0.0366436	0.6072532	1.0000000
LBX118	PCB118 (ng/g)	0.1747719	0.0421620	0.0004591	0.0730005
LBX128	PCB128 (ng/g)	0.0194525	0.0242509	0.4314588	1.0000000
LBX138	PCB138 & 158 (ng/g)	0.1031927	0.0434761	0.0272366	1.0000000
LBX146	PCB146 (ng/g)	0.0764909	0.0436016	0.0939591	1.0000000
LBX149	PCB149 (ng/g)	0.0266535	0.0363263	0.4869590	1.0000000
LBX151	PCB151 (ng/g)	0.0206117	0.0279704	0.4851366	1.0000000
LBX153	PCB153 (ng/g)	0.0981909	0.0432096	0.0336878	1.0000000
LBX156	PCB156 (ng/g)	0.0542031	0.0465673	0.2574891	1.0000000
LBX157	PCB157 (ng/g)	0.0023948	0.0244106	0.9227788	1.0000000
LBX167	PCB167 (ng/g)	0.0295029	0.0313485	0.3573418	1.0000000
LBX170	PCB170 (ng/g)	0.1433432	0.0616588	0.0301907	1.0000000
LBX172	PCB172 (ng/g)	0.0418000	0.0361891	0.2610474	1.0000000
LBX177	PCB177 (ng/g)	0.0558405	0.0413782	0.1915411	1.0000000
LBX178	PCB178 (ng/g)	0.0152238	0.0292379	0.6080314	1.0000000
LBX180	PCB180 (ng/g)	0.0776180	0.0440995	0.0929561	1.0000000
LBX183	PCB183 (ng/g)	0.0587003	0.0344840	0.1034680	1.0000000
LBX187	PCB187 (ng/g)	0.1060423	0.0418922	0.0194162	1.0000000
LBX189	PCB189 (ng/g)	0.0367138	0.0524647	0.5066463	1.0000000
LBX194	PCB194 (ng/g)	-0.0334983	0.0669428	0.6321377	1.0000000
LBX195	PCB195 (ng/g)	0.0447708	0.0626536	0.4980020	1.0000000
LBX196	PCB196 & 203 (ng/g)	-0.0792445	0.0502781	0.1590014	1.0000000
LBX206	PCB206 (ng/g)	0.0229918	0.0425016	0.6053112	1.0000000
LBXALC	a-Carotene(ug/dL)	-0.0838811	0.0275736	0.0187917	1.0000000
LBXALD	Aldrin (ng/g)	0.0275750	0.0372617	0.4833529	1.0000000
LBXB12	Vitamin B12, serum (pg/mL)	0.0305578	0.0149780	0.0541075	1.0000000
LBXBCD	Cadmium (ug/L)	-0.0155094	0.0131792	0.2524382	1.0000000
LBXBEC	trans-b-carotene(ug/dL)	-0.1053922	0.0222479	0.0021148	0.3214506
LBXBHC	Beta-hexachlorocyclohexane (ng/g)	0.1655359	0.0450068	0.0013993	0.2182849
LBXBPB	Lead (ug/dL)	-0.0502433	0.0177217	0.0099156	1.0000000
LBXCBC	cis-b-carotene(ug/dL)	-0.0888239	0.0161730	0.0009141	0.1435199
LBXCOT	Cotinine (ng/mL)	0.0098208	0.0145354	0.5066388	1.0000000
LBXCRY	b-cryptoxanthin(ug/dL)	-0.0849335	0.0178858	0.0020868	0.3192807
LBXD01	1,2,3,7,8-pncdd (fg/g)	0.0323786	0.0505807	0.5290067	1.0000000
LBXD02	1,2,3,4,7,8-hxcdd (fg/g)	0.0929563	0.0515767	0.1144996	1.0000000
LBXD03	1,2,3,6,7,8-hxcdd (fg/g)	0.0846654	0.0424918	0.0594733	1.0000000
LBXD04	1,2,3,7,8,9-hxcdd (fg/g)	0.0889778	0.0466544	0.0702706	1.0000000
LBXD05	1,2,3,4,6,7,8-hpcdd (fg/g)	0.0908108	0.0332008	0.0124010	1.0000000
LBXD07	1,2,3,4,6,7,8,9-ocdd (fg/g)	0.0740640	0.0468723	0.1290237	1.0000000
LBXDIE	Dieldrin (ng/g)	0.2105569	0.0764569	0.0283430	1.0000000
LBXEND	Endrin (ng/g)	0.0219057	0.0309539	0.5020156	1.0000000
LBXEPAH	2-(N-ethyl-PFOSA) acetate	0.0451150	0.0489024	0.3918283	1.0000000
LBXF01	2,3,7,8-tcdf (fg/g)	-0.0264002	0.0263068	0.3270192	1.0000000
LBXF02	1,2,3,7,8-pncdf (fg/g)	-0.0053544	0.0347137	0.8788898	1.0000000
LBXF03	2,3,4,7,8-pncdf (fg/g)	0.0824060	0.0308800	0.0143752	1.0000000
LBXF04	1,2,3,4,7,8-hxcdf (fg/g)	0.0950920	0.0380174	0.0207212	1.0000000
LBXF05	1,2,3,6,7,8-hxcdf (fg/g)	0.0951395	0.0403458	0.0281421	1.0000000
LBXF06	1,2,3,7,8,9-hxcdf (fg/g)	0.0222468	0.0344842	0.5258286	1.0000000
LBXF07	2,3,4,6,7,8-hxcdf (fg/g)	0.0499906	0.0378432	0.2007177	1.0000000
LBXF08	1,2,3,4,6,7,8-hpcdf (fg/g)	0.0602082	0.0293017	0.0525535	1.0000000
LBXF09	1,2,3,4,7,8,9-hpcdf (fg/g)	0.1178753	0.0799849	0.1840493	1.0000000
LBXF10	1,2,3,4,6,7,8,9-ocdf (fg/g)	0.0494291	0.0350254	0.1728195	1.0000000
LBXFOL	Folate, serum (ng/mL)	-0.0087240	0.0152730	0.5739248	1.0000000
LBXGHC	Gamma-hexachlorocyclohexane (ng/g)	0.0124146	0.0352549	0.7282429	1.0000000
LBXGTC	g-tocopherol(ug/dL)	0.1422715	0.0167066	0.0000000	0.0000048
LBXHCB	Hexachlorobenzene (ng/g)	-0.0344947	0.0222453	0.1359256	1.0000000
LBXHPE	Heptachlor Epoxide (ng/g)	0.1490605	0.0466539	0.0043534	0.6573670
LBXHXC	3,3,4,4,5,5-hxcb (fg/g)	0.1053043	0.0455634	0.0310668	1.0000000
LBXIHG	Mercury, inorganic (ug/L)	-0.0040755	0.0089385	0.6528913	1.0000000
LBXIRN	Iron, Frozen Serum (ug/dL)	-0.0429903	0.0147162	0.0081613	1.0000000
LBXLUZ	Combined Lutein/zeaxanthin (ug/dL)	-0.0812267	0.0240357	0.0117679	1.0000000
LBXLYC	trans-lycopene(ug/dL)	-0.0411998	0.0168306	0.0442429	1.0000000
LBXMIR	Mirex (ng/g)	0.0351970	0.0380378	0.3653136	1.0000000
LBXMPAH	2-(N-methyl-PFOSA) acetate	0.0468449	0.0892399	0.6184530	1.0000000
LBXODT	o,p-DDT (ng/g)	-0.0015792	0.0309068	0.9597318	1.0000000
LBXOXY	Oxychlordane (ng/g)	0.1509769	0.0661787	0.0330669	1.0000000
LBXPCB	3,3,4,4,5-pncb (fg/g)	0.1195139	0.0334510	0.0017955	0.2782985
LBXPDE	p,p-DDE (ng/g)	0.1337392	0.0374807	0.0018151	0.2795214
LBXPDT	p,p-DDT (ng/g)	0.1412222	0.0351914	0.0006299	0.0995209
LBXPFDE	Perfluorodecanoic acid	-0.0376989	0.0346551	0.3184249	1.0000000
LBXPFDO	Perfluorododecanoic acid	-0.1204707	0.0387366	0.0208490	1.0000000
LBXPFHP	Perfluoroheptanoic acid	-0.0294164	0.0178196	0.1498731	1.0000000
LBXPFHS	Perfluorohexane sulfonic acid	-0.0263705	0.0744008	0.7351347	1.0000000
LBXPFNA	Perfluorononanoic acid	-0.0021235	0.0505025	0.9678252	1.0000000
LBXPFOA	Perfluorooctanoic acid	-0.0846044	0.0690300	0.2662639	1.0000000
LBXPFOS	Perfluorooctane sulfonic acid	-0.0128117	0.0653578	0.8510614	1.0000000
LBXPFSA	Perfluorooctane sulfonamide	0.0124482	0.0317767	0.7087865	1.0000000
LBXPFUA	Perfluoroundecanoic acid	-0.0441076	0.0222565	0.0947895	1.0000000
LBXRBF	Folate, RBC (ng/mL RBC)	0.0369604	0.0197687	0.0755413	1.0000000
LBXRPL	Retinyl palmitate(ug/dL)	0.0060217	0.0202660	0.7692859	1.0000000
LBXRST	Retinyl stearate(ug/dL)	0.0398926	0.0221948	0.0866710	1.0000000
LBXSPH	Phosphorus (mg/dL)	0.0338711	0.0215665	0.1312346	1.0000000
LBXTC2	3,4,4,5-tcb (fg/g)	-0.0454994	0.0261752	0.0968072	1.0000000
LBXTCD	2,3,7,8-tcdd (fg/g)	0.0326694	0.0453276	0.4790167	1.0000000
LBXTHG	Mercury, total (ug/L)	-0.0027128	0.0168803	0.8737893	1.0000000
LBXTNA	Trans-nonachlor (ng/g)	0.1373569	0.0605026	0.0338404	1.0000000
LBXVIA	Retinol(ug/dL)	-0.0160785	0.0165150	0.3413536	1.0000000
LBXVID	Vitamin D (ng/mL)	-0.1333235	0.0324176	0.0045006	0.6750972
LBXVIE	a-Tocopherol(ug/dL)	0.0131746	0.0185116	0.4844879	1.0000000
URX14D	2,5-dichlorophenol (ug/L) result	-0.0151568	0.0428880	0.7273087	1.0000000
URX1TB	2,4,5-trichlorophenol (ug/L) result	-0.0321642	0.0127656	0.0199153	1.0000000
URX3TB	2,4,6-trichlorophenol (ug/L) result	-0.0061597	0.0288923	0.8332302	1.0000000
URXCBF	Carbofuranphenol (ug/L) result	-0.0078998	0.0259360	0.7636764	1.0000000
URXDAZ	Daidzein (ng/mL)	0.0242993	0.0220098	0.2820743	1.0000000
URXDMA	o-Desmethylangolensin (O-DMA) (ng/mL)	0.0243009	0.0213028	0.2668222	1.0000000
URXDPY	diethylaminomethylpyrimidinol/one (ug/L)	0.0046509	0.0158687	0.7779530	1.0000000
URXEQU	Equol (ng/mL)	0.0149104	0.0162056	0.3679812	1.0000000
URXETD	Enterodiol (ng/mL)	0.0059101	0.0219925	0.7907561	1.0000000
URXETL	Enterolactone (ng/mL)	0.0032140	0.0189182	0.8667247	1.0000000
URXGNS	Genistein (ng/mL)	0.0014812	0.0136941	0.9148936	1.0000000
URXMBP	Mono-n-butyl phthalate	0.0386255	0.0209016	0.0787413	1.0000000
URXMC1	Mono-(3-carboxypropyl) phthalate	0.0235570	0.0235368	0.3502295	1.0000000
URXMCP	Mono-cyclohexyl phthalate	0.0042171	0.0236532	0.8602032	1.0000000
URXMEP	Mono-ethyl phthalate	0.0162812	0.0373446	0.6673060	1.0000000
URXMET	Metolachlor mercapturate (ug/L) result	-0.0165152	0.0134787	0.2600918	1.0000000
URXMHH	Mono-(2-ethyl-5-hydroxyhexyl) phthalate	0.0391208	0.0266177	0.1850945	1.0000000
URXMHP	Mono-(2-ethyl)-hexyl phthalate	0.0608473	0.0231423	0.0156763	1.0000000
URXMIB	Mono-isobutyl phthalate	-0.0162421	0.0320020	0.6273750	1.0000000
URXMNM	Mono-n-methyl phthalate	-0.0097385	0.0289795	0.7466823	1.0000000
URXMNP	Mono-isononyl phthalate	-0.0161354	0.0154452	0.3080442	1.0000000
URXMOH	Mono-(2-ethyl-5-oxohexl) phthalate	0.0414723	0.0278637	0.1802506	1.0000000
URXMOP	Mono-n-octyl phthalate	-0.0201920	0.0117467	0.1003312	1.0000000
URXMZP	Mono-benzyl phthalate	0.0175628	0.0240041	0.4724653	1.0000000
URXOPP	O-Phenyl phenol (ug/L) result	-0.0300582	0.0273799	0.2847122	1.0000000
URXP01	1-napthol (ng/L)	-0.0115938	0.0295899	0.7068498	1.0000000
URXP02	2-napthol (ng/L)	0.0033295	0.0290158	0.9118684	1.0000000
URXP03	3-fluorene (ng/L)	0.0078669	0.0181287	0.6687521	1.0000000
URXP04	2-fluorene (ng/L)	0.0286705	0.0172896	0.1121305	1.0000000
URXP05	3-phenanthrene (ng/L)	0.0221311	0.0182136	0.2378190	1.0000000
URXP06	1-phenanthrene (ng/L)	0.0269843	0.0189140	0.1683708	1.0000000
URXP07	2-phenanthrene (ng/L)	0.0474879	0.0180941	0.0158414	1.0000000
URXP08	1-benzo[c] phenanthrene (ng/L)	-0.0067204	0.0154399	0.6678138	1.0000000
URXP10	1-pyrene (ng/L)	0.0332506	0.0205945	0.1213364	1.0000000
URXP11	2-benzo[c] phenanthrene (ng/L)	-0.0170052	0.0141542	0.2429619	1.0000000
URXP12	1-benzo[a] anthracene (ng/L)	0.0379480	0.0302821	0.2239252	1.0000000
URXP13	6-chrysene (ng/L)	0.0220321	0.0228638	0.3462016	1.0000000
URXP14	3-benzo[c] phenanthrene (ng/L)	-0.0253617	0.0186809	0.1889930	1.0000000
URXP15	3-chrysene (ng/L)	-0.0007344	0.0167520	0.9654468	1.0000000
URXP16	3-benz[a] anthracene (ng/L)	0.0058835	0.0260734	0.8236533	1.0000000
URXP17	9-fluorene (ng/L)	0.0535452	0.0349870	0.1697639	1.0000000
URXP19	4-phenanthrene (ng/L)	-0.0130919	0.0342371	0.7135169	1.0000000
URXP20	1-chrysene (ng/L)	0.0397310	0.0414832	0.3700847	1.0000000
URXP21	2-chrysene (ng/L)	-0.0092898	0.0228420	0.6963767	1.0000000
URXP22	4-chrysene (ng/L)	-0.0057447	0.0326519	0.8653249	1.0000000
URXP24	3-benzo(a) pyrene (ng/L)	0.0210543	0.0215790	0.3617239	1.0000000
URXPCP	Pentachlorophenol (ug/L) result	0.0071180	0.0277087	0.7997655	1.0000000
URXTCC	dichlorovnl-dimeth prop carboacid (ug/L)	-0.0099347	0.0199268	0.6232722	1.0000000
URXUBA	Barium, urine (ng/mL)	0.0073754	0.0424701	0.8637950	1.0000000
URXUBE	Beryllium, urine (ng/mL)	-0.0021292	0.0116287	0.8564783	1.0000000
URXUCD	Cadmium, urine (ng/mL)	-0.0894074	0.0416382	0.0436044	1.0000000
URXUCO	Cobalt, urine (ng/mL)	0.0081500	0.0259035	0.7561466	1.0000000
URXUCS	Cesium, urine (ng/mL)	-0.0731324	0.0529838	0.1820190	1.0000000
URXUHG	Mercury, urine (ng/mL)	-0.0829618	0.0286026	0.0082958	1.0000000
URXUIO	Iodine, urine (ng/mL)	0.0460943	0.0415512	0.3039359	1.0000000
URXUMO	Molybdenum, urine (ng/mL)	-0.0430284	0.0474954	0.3752412	1.0000000
URXUPB	Lead, urine (ng/mL)	-0.0824724	0.0472545	0.0955528	1.0000000
URXUPT	Platinum, urine (ng/mL)	-0.0605199	0.0221036	0.0123237	1.0000000
URXUSB	Antimony, urine (ng/mL)	0.0023215	0.0413607	0.9557703	1.0000000
URXUTL	Thallium, urine (ng/mL)	-0.1107449	0.0552117	0.0579288	1.0000000
URXUTU	Tungsten, urine (ng/mL)	-0.0187654	0.0292987	0.5287842	1.0000000
URXUUR	Uranium, urine (ng/mL)	-0.0140040	0.0469627	0.7742112	1.0000000

C. Analysis of EWAS results

1. What were the top 3 exposures associated with fasting glucose ranked by FDR (lowest to highest)? (b) How about for BMI? (c) How many were found in each phenotype in the training cohort with an FDR corrected pvalue of less than 0.05? (2)

# (a) top 3 exposures associated with fasting blood glucose 

train_LBXGLU_model_exposures_only %>% arrange(FDR) %>% head(n=3) %>% select(-c( sd, p_value)) %>% as.data.frame() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px") %>% add_header_above(c(" " = 1, "Phenotype: Fasting Blood Glucose" = 3))

	Phenotype: Fasting Blood Glucose
Exposure	Exposure Description	beta_coefficient	FDR
LBXGTC	g-tocopherol(ug/dL)	0.1422715	0.0000048
LBX118	PCB118 (ng/g)	0.1747719	0.0730005
LBXPDT	p,p-DDT (ng/g)	0.1412222	0.0995209

# (b) top 3 exposures associated with BMI
train_BMI_model_exposures_only %>% arrange(FDR) %>% head(n=3) %>% select(-c(sd, p_value)) %>% as.data.frame() %>% kable()  %>% column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px") %>% add_header_above(c(" " = 1, "Phenotype: BMI" = 3))

	Phenotype: BMI
Exposure	Exposure Description	beta_coefficient	FDR
LBXGTC	g-tocopherol(ug/dL)	0.2285937	0
LBXB12	Vitamin B12, serum (pg/mL)	-0.2047473	0
LBXFOL	Folate, serum (ng/mL)	-0.1905314	0

# (c) How many were found in each phenotype in the training cohort with an FDR corrected pvalue of less than 0.05
# Phenotype= BMI
train_BMI_model_exposures_only %>% filter(FDR <0.05) %>% as.data.frame() %>% kable() %>% add_header_above(c(" " = 2, "Phenotype: BMI" = 4)) %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")

		Phenotype: BMI
Exposure	Exposure Description	beta_coefficient	sd	p_value	FDR
LBD199	PCB199 (ng/g)	-0.3408196	0.0508426	0.0002766	0.0345706
LBX118	PCB118 (ng/g)	0.1117064	0.0235063	0.0001078	0.0140165
LBX156	PCB156 (ng/g)	-0.1236637	0.0198585	0.0000035	0.0004926
LBX170	PCB170 (ng/g)	-0.1616512	0.0280853	0.0000103	0.0014024
LBX180	PCB180 (ng/g)	-0.2086052	0.0316211	0.0000016	0.0002235
LBX194	PCB194 (ng/g)	-0.3778197	0.0492534	0.0001191	0.0153597
LBX196	PCB196 & 203 (ng/g)	-0.2656293	0.0394514	0.0002691	0.0339125
LBX206	PCB206 (ng/g)	-0.1628048	0.0217808	0.0001403	0.0179536
LBXALC	a-Carotene(ug/dL)	-0.1916125	0.0116597	0.0000008	0.0001123
LBXB12	Vitamin B12, serum (pg/mL)	-0.2047473	0.0103057	0.0000000	0.0000000
LBXBEC	trans-b-carotene(ug/dL)	-0.2583417	0.0081134	0.0000000	0.0000012
LBXBHC	Beta-hexachlorocyclohexane (ng/g)	0.1435895	0.0273696	0.0000336	0.0044999
LBXBPB	Lead (ug/dL)	-0.2035729	0.0159326	0.0000000	0.0000000
LBXCBC	cis-b-carotene(ug/dL)	-0.2409739	0.0113488	0.0000001	0.0000199
LBXCRY	b-cryptoxanthin(ug/dL)	-0.2368754	0.0163658	0.0000018	0.0002544
LBXD05	1,2,3,4,6,7,8-hpcdd (fg/g)	0.1598730	0.0272081	0.0000078	0.0010812
LBXD07	1,2,3,4,6,7,8,9-ocdd (fg/g)	0.1318667	0.0299399	0.0002471	0.0313780
LBXDIE	Dieldrin (ng/g)	0.3250939	0.0385232	0.0000647	0.0085417
LBXF04	1,2,3,4,7,8-hxcdf (fg/g)	0.1348533	0.0185184	0.0000004	0.0000539
LBXFOL	Folate, serum (ng/mL)	-0.1905314	0.0096388	0.0000000	0.0000000
LBXGTC	g-tocopherol(ug/dL)	0.2285937	0.0100124	0.0000000	0.0000000
LBXHPE	Heptachlor Epoxide (ng/g)	0.2698109	0.0288549	0.0000000	0.0000010
LBXHXC	3,3,4,4,5,5-hxcb (fg/g)	-0.2141849	0.0319744	0.0000013	0.0001815
LBXIRN	Iron, Frozen Serum (ug/dL)	-0.0928059	0.0119747	0.0000001	0.0000209
LBXLUZ	Combined Lutein/zeaxanthin (ug/dL)	-0.2284943	0.0110242	0.0000002	0.0000232
LBXPCB	3,3,4,4,5-pncb (fg/g)	0.1607471	0.0239557	0.0000012	0.0001782
LBXRST	Retinyl stearate(ug/dL)	-0.0655155	0.0113216	0.0000096	0.0013156
LBXSPH	Phosphorus (mg/dL)	-0.1023527	0.0139922	0.0000003	0.0000506
LBXVIA	Retinol(ug/dL)	0.0964585	0.0144043	0.0000013	0.0001815
LBXVID	Vitamin D (ng/mL)	-0.2598976	0.0185320	0.0000022	0.0003130
URXETL	Enterolactone (ng/mL)	-0.1152220	0.0201079	0.0000109	0.0014760
URXMEP	Mono-ethyl phthalate	0.1090921	0.0162427	0.0000012	0.0001770
URXP04	2-fluorene (ng/L)	0.0858877	0.0167861	0.0000455	0.0060562
URXP06	1-phenanthrene (ng/L)	0.1089104	0.0225535	0.0000898	0.0117703
URXP07	2-phenanthrene (ng/L)	0.1404064	0.0208021	0.0000011	0.0001658
URXUTL	Thallium, urine (ng/mL)	0.0743989	0.0116830	0.0000026	0.0003625

# Phenotype= Fasting blood glucose
train_LBXGLU_model_exposures_only %>% filter(FDR <0.05) %>% as.data.frame() %>% kable() %>% add_header_above(c(" " = 2, "Phenotype: Fasting Blood Glucose" = 4)) %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")

		Phenotype: Fasting Blood Glucose
Exposure	Exposure Description	beta_coefficient	sd	p_value	FDR
LBXGTC	g-tocopherol(ug/dL)	0.1422715	0.0167066	0	4.8e-06

When examining the association with BMI, 36 exposures had a FDR < 0.05. When examining the association with Fasting blood glucose, only 1 exposure had a FDR < 0.05.

Now interpret the coefficients of the top 3 most findings, ranked by low to high FDR in body mass index and fasting glucose. Specifically, how much does (a) fasting glucose and (b) body mass index change with respect to change in the exposure? Write down your answer in the units of estimate. (2)

a: your answer here

A 1% change in g-tocopherol is associated with a 0.0014 units higher of serum glucose level, holding age, sex, ethnicity, and family poverty level constant.
A 1% change in PCB118 is associated with a 0.0017 units higher of serum glucose level, holding age, sex, ethnicity, and family poverty level constant.
A 1% change in p,p-DDT is associated with a 0.0014 units higher of serum glucose level, holding age, sex, ethnicity, and family poverty level constant.

b: your answer here

A 1% change in g-tocopherol is associated with a 0.0022 units higher of BMI, holding age, sex, ethnicity, and family poverty level constant.
A 1% change in Vitamin B12 serum is associated with a 0.002 units less of BMI, holding age, sex, ethnicity, and family poverty level constant.
A 1% change in Folate serum is associated with a 0.0019 units less of BMI, holding age, sex, ethnicity, and family poverty level constant.

1. Produce a “volcano plot” of the results for glucose and body mass index (two separate plots) by plotting the association size, or estimate on the x-axis and the -log10(FDR) on the yaxis for the training cohort data for the coefficient on the exposure (each point should denote the association size and FDR between an exposure and a phenotype). (b) Why is it useful to scale the dependent and independent variables before running the model? (2)

your answer here

# color coding the points with FDR < 0.05, if true, red. 
train_BMI_model_exposures_only1 <- train_BMI_model_exposures_only %>% mutate(significant=ifelse(FDR < 0.05, 1,0))
train_LBXGLU_model_exposures_only1 <- train_LBXGLU_model_exposures_only %>% mutate(significant=ifelse(FDR < 0.05, 1,0))

# (a) Volcano Plot for BMI
a <- ggplot(train_BMI_model_exposures_only1 , aes(x = beta_coefficient, y =-log10(FDR), color=as.factor(significant))) +
  geom_point() +labs(x='Association Size', y='-log10(FDR)') +  ggtitle("BMI") +scale_color_manual(values = c("0" = "black", "1" = "red"))+  labs(color='Significance')+ theme_linedraw()+geom_hline(yintercept = 1.3, linetype = "dashed", color = "#491C1C")+  guides(color = 'none')
 

# (a) Volcano Plot for Fasting Blood Glucose
b <- ggplot(train_LBXGLU_model_exposures_only1 , aes(x = beta_coefficient, y =-log10(FDR), color=as.factor(significant))) + geom_point() +labs(x='Association Size', y='-log10(FDR)') +  ggtitle("Fasting blood glucose") +scale_color_manual(values = c("0" = "black", "1" = "red"))+  labs(color='Significance')+geom_hline(yintercept = 1.3, linetype = "dashed", color = "#491C1C")+ theme_linedraw()+  guides(color = 'none') 

plot_grid(a,b, ncol = 2)

It is useful to scale the dependent and independent variables before running the analysis because it will improve the interpretation, having the dependents and independents variables on the same scale make it easy to understand their relationship, since the coefficients represent the change in the dependent variable for a one-unit change in the independent variable.

Filter “replicated” findings using the following heuristic: FDR significance of 0.05 in the training dataset, p-value < 0.05 in the testing dataset, and concordant directionality of associations (e.g., both associations are >0 OR both are <0). Provide your answer in the form of a table with the estimate of the “replicated” findings in the train, test, the FDR from the training data and pvalue from the test cohort data. (3)

test_LBXGLU_model_exposures_only <- test_model_Summary_LBXGLU
test_LBXGLU_model_exposures_only <-test_LBXGLU_model_exposures_only[grepl("^scale", test_LBXGLU_model_exposures_only$term), ]
test_LBXGLU_model_exposures_only <- test_LBXGLU_model_exposures_only%>% select(-term)
test_LBXGLU_model_exposures_only <- merge(test_LBXGLU_model_exposures_only,exposure_def, by = "Exposure", all = FALSE)
new_order <- c("Exposure", "Exposure Description", "beta_coefficient", "sd","p_value")
test_LBXGLU_model_exposures_only <- test_LBXGLU_model_exposures_only[new_order]

# Filtering for significant values in the training set 
LBXGLU_train_sig <- train_LBXGLU_model_exposures_only %>%  filter(FDR<0.05) %>% select(-c(sd,p_value))

# Filtering for significant values in the testing set 
LBXGLU_test_sig <- test_LBXGLU_model_exposures_only %>%  filter(p_value <0.05)%>% select(-sd)


# Renaming column names before merging 
colnames(LBXGLU_train_sig ) <- c("Exposure", "Exposure Description", "beta_coefficient_train","FDR_train")
colnames(LBXGLU_test_sig ) <- c("Exposure", "Exposure Description", "beta_coefficient_test","p_value_test")

# Merging the training and test set results
LBXGLU_test_train <- merge(LBXGLU_train_sig,LBXGLU_test_sig, by = c("Exposure","Exposure Description"), all = FALSE)

LBXGLU_test_train %>% as.data.frame() %>% kable() %>% add_header_above(c(" " = 1, "Replicated findings for Fasting blood glucose"=5))%>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")

	Replicated findings for Fasting blood glucose
Exposure	Exposure Description	beta_coefficient_train	FDR_train	beta_coefficient_test	p_value_test
LBXGTC	g-tocopherol(ug/dL)	0.1422715	4.8e-06	0.1250581	2e-07

Displaying a DataFrame with only the Exposures and their relevant measures for testing set (BMI):

test_BMI_model_exposures_only <- test_model_Summary_BMI
test_BMI_model_exposures_only <-test_BMI_model_exposures_only[grepl("^scale", test_BMI_model_exposures_only$term), ]
test_BMI_model_exposures_only <- test_BMI_model_exposures_only%>% select(-term)
test_BMI_model_exposures_only <- merge(test_BMI_model_exposures_only,exposure_def, by = "Exposure", all = FALSE)
new_order <- c("Exposure", "Exposure Description", "beta_coefficient", "sd","p_value")
test_BMI_model_exposures_only <- test_BMI_model_exposures_only[new_order]

# Filtering for significant values in the training set 
BMI_train_sig <- train_BMI_model_exposures_only %>%  filter(FDR<0.05) %>% select(-c(sd,p_value))

# Filtering for significant values in the testing set 
BMI_test_sig <- test_BMI_model_exposures_only %>%  filter(p_value <0.05)%>% select(-sd)


# Renaming column names before merging 
colnames(BMI_train_sig ) <- c("Exposure", "Exposure Description", "beta_coefficient_train","FDR_train")
colnames(BMI_test_sig ) <- c("Exposure", "Exposure Description", "beta_coefficient_test","p_value_test")

# Merging the training and test set results
BMI_test_train <- merge(BMI_train_sig,BMI_test_sig, by = c("Exposure","Exposure Description"), all = FALSE)

BMI_test_train %>% as.data.frame() %>% kable() %>% add_header_above(c(" " = 1, "Replicated findings for BMI" = 5))%>% kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  column_spec(1:ncol(train_BMI_pvalue_adj), width = "100px")

	Replicated findings for BMI
Exposure	Exposure Description	beta_coefficient_train	FDR_train	beta_coefficient_test	p_value_test
LBD199	PCB199 (ng/g)	-0.3408196	0.0345706	-0.3340961	0.0002995
LBX118	PCB118 (ng/g)	0.1117064	0.0140165	0.1489610	0.0177188
LBX156	PCB156 (ng/g)	-0.1236637	0.0004926	-0.1544014	0.0047365
LBX170	PCB170 (ng/g)	-0.1616512	0.0014024	-0.2247118	0.0014327
LBX180	PCB180 (ng/g)	-0.2086052	0.0002235	-0.3387496	0.0001508
LBX194	PCB194 (ng/g)	-0.3778197	0.0153597	-0.3248369	0.0000919
LBX196	PCB196 & 203 (ng/g)	-0.2656293	0.0339125	-0.2436532	0.0011456
LBX206	PCB206 (ng/g)	-0.1628048	0.0179536	-0.3159491	0.0005139
LBXALC	a-Carotene(ug/dL)	-0.1916125	0.0001123	-0.1946426	0.0000000
LBXB12	Vitamin B12, serum (pg/mL)	-0.2047473	0.0000000	-0.1897773	0.0000000
LBXBEC	trans-b-carotene(ug/dL)	-0.2583417	0.0000012	-0.2593669	0.0000000
LBXBHC	Beta-hexachlorocyclohexane (ng/g)	0.1435895	0.0044999	0.1427461	0.0360052
LBXBPB	Lead (ug/dL)	-0.2035729	0.0000000	-0.2032720	0.0000000
LBXCBC	cis-b-carotene(ug/dL)	-0.2409739	0.0000199	-0.2556237	0.0000000
LBXCRY	b-cryptoxanthin(ug/dL)	-0.2368754	0.0002544	-0.2125888	0.0000000
LBXD05	1,2,3,4,6,7,8-hpcdd (fg/g)	0.1598730	0.0010812	0.2249577	0.0001554
LBXD07	1,2,3,4,6,7,8,9-ocdd (fg/g)	0.1318667	0.0313780	0.2172344	0.0001519
LBXDIE	Dieldrin (ng/g)	0.3250939	0.0085417	0.3351152	0.0000956
LBXF04	1,2,3,4,7,8-hxcdf (fg/g)	0.1348533	0.0000539	0.1429995	0.0012816
LBXFOL	Folate, serum (ng/mL)	-0.1905314	0.0000000	-0.1824641	0.0000000
LBXGTC	g-tocopherol(ug/dL)	0.2285937	0.0000000	0.2356250	0.0000000
LBXHPE	Heptachlor Epoxide (ng/g)	0.2698109	0.0000010	0.2886614	0.0000460
LBXHXC	3,3,4,4,5,5-hxcb (fg/g)	-0.2141849	0.0001815	-0.2348097	0.0009598
LBXIRN	Iron, Frozen Serum (ug/dL)	-0.0928059	0.0000209	-0.1901752	0.0000000
LBXLUZ	Combined Lutein/zeaxanthin (ug/dL)	-0.2284943	0.0000232	-0.1914445	0.0000000
LBXPCB	3,3,4,4,5-pncb (fg/g)	0.1607471	0.0001782	0.1780708	0.0020522
LBXRST	Retinyl stearate(ug/dL)	-0.0655155	0.0013156	-0.0611861	0.0004880
LBXSPH	Phosphorus (mg/dL)	-0.1023527	0.0000506	-0.1153578	0.0000000
LBXVIA	Retinol(ug/dL)	0.0964585	0.0001815	0.0757284	0.0000793
LBXVID	Vitamin D (ng/mL)	-0.2598976	0.0003130	-0.2546304	0.0000000
URXETL	Enterolactone (ng/mL)	-0.1152220	0.0014760	-0.0936139	0.0000101
URXMEP	Mono-ethyl phthalate	0.1090921	0.0001770	0.1097786	0.0000123
URXP04	2-fluorene (ng/L)	0.0858877	0.0060562	0.0967908	0.0078634
URXP06	1-phenanthrene (ng/L)	0.1089104	0.0117703	0.1174154	0.0050251
URXP07	2-phenanthrene (ng/L)	0.1404064	0.0001658	0.1774947	0.0002225
URXUTL	Thallium, urine (ng/mL)	0.0743989	0.0003625	0.1008309	0.0000389

Environmental exposures are highly correlated with one another and they can change over time. Therefore, exposures could be subject to many biases. Describe a type of bias (e.g., selection, sampling, or other) that can influence your findings and give an example.(2)

your answer here

Confounding is one of the biases that we can face when studying environmental exposures.It is when a third variable is associated with the exposure and the outcome but not on the causal pathway, it can lead to wrong associations. In our study we found that both b12 and folate are associated with BMI. But we also know that B12 and folate are correlated and interdependent. B12 is needed for the conversion of folate to its active form, and a deficiency in B12 can lead to a buildup of inactive folate in the body. Deficiency of one can mask deficiency in other. This correlation between the two can be confoundig in terms of the association with BMI.

BMI is a risk factor for diabetes: individuals with higher BMI are at risk for future type 2 diabetes. In this study, we found factors associated with both. For exposures that are found in both BMI and type 2 diabetes - do you think exposure is influencing both traits independently, or via BMI, or neither? Explain why. Optional: Write down a possible linear regression model to test your hypothesis and interpret the finding. (3)

your answer here

The exposure LBXGTC (g-tocopherol(ug/dL)) was found to be associated with both BMI and Fasting Plasma Glucose. g-tocopherol is a form of vitamin E that is found in many foods, including nuts, seeds, and vegetable oils. The literature about the association between vitamin E and BMI or diabetes is not conclusive. Giving the association between BMI and diabetes, it is reasonable to think that an exposure affecting both, does so dependently. To further explore that, we can conduct mediation analysis. What we know: 1) Exposure is significantly associated with T2D. Beta Coefficient 0.1422715.

Packages within the Tidyverse

Exposure is significantly associated with BMI.Beta coefficient 0.2285937.

Packages within the Tidyverse

What I need to further do: 3) I will build a linear model with fasting blood glucose as the outcome, and with the following independent variables: BMI, exposure (g-tocopherol) and confounders (sex, age, ethnicity, poverty to income ratio)

Packages within the Tidyverse

model_mediation <- association("LBXGLU", "LBXGTC", covariates=c("BMXBMI","RIDAGEYR","sex","ethnicity","INDFMPIR"), dsn=dsn)

summary(model_mediation)

## 
## Call:
## svyglm(formula = as.formula(mod_string), design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -0.6597769  0.0592085 -11.143 4.97e-10 ***
## scale(log(LBXGTC + 0.001))  0.1169084  0.0173931   6.722 1.54e-06 ***
## BMXBMI                      0.0149378  0.0014666  10.186 2.32e-09 ***
## RIDAGEYR                    0.0129586  0.0009889  13.105 2.82e-11 ***
## sex                        -0.1672797  0.0239012  -6.999 8.62e-07 ***
## ethnicity2                 -0.0277477  0.0452493  -0.613  0.54664    
## ethnicity3                  0.1192694  0.0335673   3.553  0.00199 ** 
## ethnicity4                  0.0300114  0.0456817   0.657  0.51869    
## ethnicity5                  0.1228672  0.1087525   1.130  0.27193    
## INDFMPIR                   -0.0177648  0.0082172  -2.162  0.04292 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.8908525)
## 
## Number of Fisher Scoring iterations: 2

The exposure is still significantly associated with T2D as we can see. The effect still exist but it is smaller in magnitude (Beta coefficient 0.1169084). This suggests that BMI partially mediates between exposure and T2D.

Reference: https://data.library.virginia.edu/introduction-to-mediation-analysis/

Assignment 1

Aya Awwad

An Introduction to the exposome: exposome-wide association in body mass index and fasting glucose

Background:

Description of Data and Readings:

A short tutorial on analysis of survey data

Sample Questions (and Answers):

A. Background Questions

Let’s explore the phenotypes, body mass index (BMXBMI) and fasting glucose (LBXGLU) with respect to demographic characteristics of your population in the training dataset (NHData.train)

Using the model you described in 11, report the following:

We will explore one biomarker of exposure: serum Lead (LBXBPB).

Using a multivariate linear regression model, estimate the following:

B. Executing environment-wide associations

Logging and scaling exposure variables

Logging variables

Scaling variables

Doing EWAS

Combining all together into a final dataset per Phenotype for the training dataset.

Phenotype: BMI

Phenotype: Fasting Blood Glucose

C. Analysis of EWAS results