Introduction to GWAS and the “genomewide polygenic score (GPS)”

Please check Canvas for the due date of this assignment. Submit your answers as a knitted .html file: select the Knit drop down at the top of the window. Once converted to .html, select the file in the file browser, click the More drop down, and select Export.

In Session 2 and 3, we had an introduction to the “genome-wide association studies” (GWASs), which are used to examine the association between individual (up to a few million) genetic variants along the genome and a phenotype (e.g., BMI or diabetes). The phenotype can be disease and non-disease (or “cases” and “controls”), where one tests the variant frequency in cases (the variant prevalence in cases) versus controls (the variant prevalence in the controls).

Case-control GWASes are analytically “easy” to do: specifically, we execute a contingency table test of association (e.g., Chi-square test) or a logistic regression. Suppose we have a variant “locus” (location on the genome) called X_1. Lets also suppose that X_1 can take on either Thymine (“T”) or Cytosine (“C”), two nucleotide bases. Then, we choose the “effect” allele (lets say “T”), and for each individual, count how many Ts they have at X_1 (0, 1, or 2). Lets call the count at X_1, C_1. Then, we count the number of total Ts in the cases vs. controls and do a contingency table test (2x2 table) or a logistic regression. Lets write down the logistic model: if Y is 1 for individuals with a disease, and 0 without, then we have a model for X_1 and Y:

\[Y = \alpha + \beta_1 C_1\]

Where \(C_1\) is 0, 1, or 2 for each individual in the dataset, and Y is 0 for controls, 1 for cases. Then, we execute this model for every variant on the GWAS array. In a logistic regression model, the \(exp(\beta_1)\) is the “odds ratio” or, odds of the disease for a 1 unit change in the effect allele vs. no change.

We can do the same if Y is continuous, such as body mass index, using linear regression (lm in R). As a thought question, what would be the interpretation of \(\beta_1\) if using a “lm”? For more about GWASs, see this tutorial.

One important detail is p-values of significance are used to determine what loci are associated with the outcome (if \(\beta_1\) is nonzero). We see those signals in the “Manhattan Plot” - the higher the -log10(pvalue), the higher the point on the y-axis, and hence, more of a signal than those with higher p-values. As a result, millions of \(\beta_m\) are estimated. The most significant loci (and the \(\beta_m\)) for studies are reported in the GWAS Catalog.

How can we use millions of risk estimates to determine risk for a single person?: the polygenic risk score. Essentially, the polygenic risk score (PRS) or equivalently, the genome-wide polygenic score (GPS, for real-valued traits such as body mass index) is simply a sum of the number of effect alleles for an individual (summing up across M number of loci) multiplied by the \(\beta\) from the regression. For example, for one individual (indexed by i), the GPS for i would be:

\[PRS^i = \sum^M_{j=1} \beta_j C^i_j\]

Therefore, the higher the PRS, the higher the outcome risk, or the higher the PGS, the higher the observed trait (e.g., height, weight, body mass index).

What \(\beta_j\) do we choose to put in the GPS? The easiest is to apply some pvalue threshold (e.g., Bonferroni) and then “prune” those that are independent of others (remember, variant frequency are correlated between closely neighboring loci due to linkage disequilibrium). Another way is to use a regularized regression similar to what we saw in Session 3. When combining effect sizes, the \(\beta_j\) used in the GPS will be smaller than that in GWAS. Check out this optional tutorial for more detail

Body Mass Index or Quatelet’s Index: Digging into the BMI GPS

Body mass index (BMI), or squared weight of an individual divided by their height, is a major risk factor for many clinical outcomes, such as type 2 diabetes and heart disease. BMI is used to characterize clinical obesity, or obesity categories, such as “severe obesity”, “obesity”, or “underweight”: see the cutpoints

BMI is known as a “complex” phenotype, with many genetic (G), environmental (E), and behavioral factors that contribute to it, such as physical activity, or genetic variants in the FTO gene.

Khera and colleagues have developed a GPS using data from prior GWAS on BMI and using the UK Biobank cohort data.

Read Khera et al 2019 and answer the following questions:

  1. What is the inclusion criteria of the main GPS-deriving part of the study (UK Biobank), namely what ethncity and age group?

They restricted the analysis to individuals of white British ancestry, aged between 40-69 yo.

  1. How do Khera et al define high GPS “carriers”? Approximately what is the sample size and prevalence of the high GPS carriers?

They labeled the top decile of the GPS distribution as “carriers” and the remainder of the distribution as non-carriers. The testing dataset contained 288,016, 10% of which were labeleld as carriers, meaning the sample size was 28,801 (they mention the top decile is 28,784, possibly some individuals were dropped due to missing data).

  1. Why did Khera et al use data from (a) Partners, (b) Framingham Offspring/CARDIA, and (c) Avon Longitudinal Study of Parents and Children?
  1. They used it to replicate the results. Among individuals who underwent bariatric surgeries, in UK Biobank, 38.9% had high GPS. This result was replicated in the Partners dataset (33% of the individuals underwent bariatric surgery had high GPS).
  2. To examine young adult’s risk of developing severe obesity according to their GPS. They identified individuals who were not obese at baseline, and whom calculation of GPS was possible, and determined their obesity incidence.
  3. To examine the age at which the weight gradient and severe obesity observed in adulthood starts.
  1. From Figure 2, what is the mean (a) body mass index, (b) weight, and (c) “rate of severe obesity” of participants in the UK Biobank for high GPS carriers?
  1. Mean body mass index is 30 (kg/m^2)
  2. Mean weight in kg is 85.3
  3. Rate of severe obesity is 5.6%
  1. Approximately what percentage of individuals who are (a) severely obese, (b) obese, (c) overweight, and (d) normal are high GPS carriers?
  1. 5.6%
  2. 37.6%
  3. 39.6%
  4. 17&
  1. Interpret the odds ratios in Figure 3B for (a) “Extreme Obesity” and (b) “Bariatric Surgery”.
  1. individuals that are extremely obese have 4.22 times the odds for being high GPS carrier, this result is statistically significant
  2. Individuals who had a bariatric surgery have 4.96 times the odds of being high GPS carriers, this result is statistically significant.

Exploring the polygenic score

library(tidyverse)
library(survey)
library(broom)
library(knitr)
library(gtsummary)

We can find the GPS model statistics for Khera here, at the PGS catalog: https://www.pgscatalog.org/score/PGS000027/

We have provided the GPS in your project directory, called: PGS000027.txt.gz –>

pgs_bmi <- read_delim('PGS000027.txt.gz', delim = "\t", comment="#")
  1. Describe what each column and row of the PGS contains.
  • Write your answer here. Columns:
  • chr_name: The name (number) of the chromosome
  • chr_position: physical location of a gene or genetic variant on a specific chromosome.
  • effect_allele: The effect allele is the variant of the genetic variant that is found to be associated with the trait or disease being studied (BMI in this case).
  • other_allele: the second allele of the gene that is not the effect allele.
  • effect_weight: weight or strength of the association between a genetic variant and a specific trait or disease.

Rows: each row is the features of a different gene (its position, alleles and effect weight of the effect allele)

One of the “strongest” or lowest pvalue variants associated with BMI happens to be in the FTO gene. For example, lets look at a GWAS that cited by Khera in the GWAS Catalog: https://www.ebi.ac.uk/gwas/publications/25673413

And sort by pvalue, lowest to highest.

  1. Create new variables for the chromosome number and location for the top SNP (by pvalue) found above.
chromosome_number <- 16 # your answer here
chromosome_position <-  53769662 # your answer here
pgs_bmi %>% filter(chr_position==53769662)
## # A tibble: 0 × 5
## # … with 5 variables: chr_name <dbl>, chr_position <dbl>, effect_allele <chr>,
## #   other_allele <chr>, effect_weight <dbl>

According to the UCSC browser, FTO spans approximately 415 kilobases, from base 53701692 to base 54158512

fto_start <- 53701692
fto_end <- 54158512
  1. Plot the weights over the span of the FTO gene, by filtering the GPS file by chromosome number and the start and ending place of the gene using the variables you created in 8. Annotate your plot with a vertical line that denotes the location of the top GWAS SNP (answer to 8).
# plotting code here

# Filtering for the variants witin the fto_start and fto_end
pgs_bmi_FTO_subset <- pgs_bmi %>% filter(chr_position >= fto_start & chr_position<= fto_end) %>% ggplot(aes(x=chr_position, y=effect_weight)) + geom_point(size = 1, color = "#0099f9") +labs(x='Chromosome Position', y='Effect Weight', title = "Effect weight over the span of the FTO gene") + theme_classic() + geom_vline(xintercept = chromosome_position, color = "red", linetype = "dashed") + annotate("text", x = 53740000, y = 0.004, label = "Top SNP",  color = "red")

pgs_bmi_FTO_subset

  1. How “large” are the weights in the FTO gene relative to the rest of the 200k weights along the genome? Use the quantile or ecdf function to find out the rank of the max weight found in the FTO gene.
pgs_bmi_FTO_subset1 <- pgs_bmi %>% filter(chr_position >= fto_start & chr_position<= fto_end) 
# hint use ecdf: ?ecdf
# or, alternatively, try quantile: ?quantile 

# Finding the maximal weight in the FTO gene
max_fto_weight <- max(pgs_bmi_FTO_subset1$effect_weight)

# Will create an ECDF for the weights in the dataset
ecdf_distribustion <- ecdf(pgs_bmi$effect_weight)

# finding the rank
fto_rank <- ecdf_distribustion(max_fto_weight)

# ECDF plot
plot(ecdf_distribustion, xlab = "Effect Weight", ylab = "Proportion", main = "ECDF of Effect Weights", xlim = c(0, 0.006))

# annotation
abline(v = max_fto_weight, lty = 2, col = "red")
text(max_fto_weight, fto_rank, "Max weight in FTO", pos = 4, color='red')

99% of the observation have an effect weight lower than the maximal effect weight in the FTO gene.

Recall session 3 on clinical machine learning, specifically on approaches to build large regression models, such as LASSO for question 11.

  1. What is the coefficient size for the association between BMI and the top variant (ordered by pvalue) from GWAS (https://www.ebi.ac.uk/gwas/publications/25673413) and why is it different than the weights from the PGS?
  • Your answer here (two parts) In a GWAS, we basically build a logistic regression model to examine the association between the variants and phenotype of interest. The beta coefficient of the model is the effect size of that variant. the \(exp(\beta_1)\) is the “odds ratio” or, odds of the disease for a 1 unit change in the effect allele vs. no change.The top values are the one that reach genome wide significance (pvalue <0.05*10^-8). PGS is different, the polygenic risk score of an individual is calculated as the weighted sum of their risk alleles, where the weights correspond to the effect sizes of the risk alleles as estimated from a genome-wide association study (GWAS) on the phenotype of interest. To achieve higher statistical significance, We include not only the SNPs that reached genome wide significance. In choosing the SNPs to include in the model, we can use shrinkage techniques (among other approaches) such as LASSO to include the most informative variants. Even thoug the PGS is in a way relying on GWAS, but the effect size will be different between GWAS and PGS, in the latter the effect size estimates may be scaled or shrunk.
  1. The R2 is an estimate of the error between the predicted quantity (genetically predicted BMI) and actual quantity (observed BMI), ranging from 0 to 1. 1 denotes perfect prediction and 0 is the opposite (noise). What is the (approximate) R^2 of the genetic predictor of BMI (hint: see Table S1). How do you interpret this quantity?
  • Your answer here (two parts) in the setting of a prediction model, R^2 (Coefficient of determination) denotes the proportion of the variance of y that is explained by the variables in the model. The value of R^2 ranges from 0-1 (R^2=0, the variables give 0 information about the variance of y)- R^2= pearson correlation coefficient ^2 = 0.292^2 = 0.085264. 8.5% of the variance of the observed BMI is explained by the genetically predicted BMI.

To note: R^2 in the setting of GWAS is a measure of Linkage Disequilibrium (LD), it denotes how linked two variants are (measure from 0-1, the smaller the less correlated).in table S1, R^2=0.2, indicated that the variants that reached genome wide significance in the GWAS are not in linkage disequilibrium to one another, meaning they are not correlated.

  1. If an individual has 2 copies of the effect allele (highest risk) for every loci, what would be their GPS?
## need to sum over all the effect weights, multiplying each by 2
unique_individual <- sum(pgs_bmi$effect_weight)*2

Examining the potential co-morbidities of individuals with “observed” BMI

A medical colleague comments that the predictive ability of carrier risk for future health care utilization (Figure 3) and disease outcomes (Figure 4) seem large. They claim that perhaps genetic screening for GPS carrier status should be used to screen asymptomatic individuals for BMI-related outcomes in the clinic.

  1. Interpret Figure 4 for (a) Coronary Disease, (b) Diabetes Mellitus (DM), and (c) Hypertension (HTN).
  1. A carrier for GPS (high GPS) has 1.27 times the odds for coronary artery disease compared to a non carrier. This result is significant at 5% level. (1.27, [1.19-1.36]).( OR: Carriers have 28% higher odds of having coronary artery disease)
  2. A carrier for GPS (high GPS) has 1.7 times the odds (alternatively, 70% higher odds) for DM compared to a non carrier. This result is significant at 5% level. c)A carrier for GPS (high GPS) has 1.35 times the odds (alternatively, 35% higher odds) for HTN compared to a non carrier. This result is significant at 5% level.
  1. Key to implementing your colleagues idea is to understand how risks for genetic carriers compare to risk for observed/measured BMI (actually measuring BMI of a person in the clinic). We will now examine the association between clinical definitions of BMI in categories and disease outcomes. What model would you use to associate observed/measured BMI with disease (a binary variable)?

Reproducing Khera et al and Figure 3

Lets verify the connection between the predicted BMI and measured BMI and your answers to Q6 on a subset of the UK Biobank (n=14K individuals, randomly sampled).

Load in the UK Biobank data:

ukb <- readRDS('./ukb704.rds')
ukb %>% head()
## # A tibble: 6 × 4
##   birth_year sex      bmi   bmi_pgs
##        <int> <chr>  <dbl>     <dbl>
## 1       1957 Male    29.7 -0.162   
## 2       1944 Female  22.3  0.347   
## 3       1967 Male    26.4  0.200   
## 4       1970 Female  33.8 -0.651   
## 5       1961 Female  28.4 -0.000529
## 6       1968 Male    24.9 -0.710

Each row contains the birth_year, sex, and Observed BMI (bmi) and a “normalized” PGS of BMI that ranges from roughly -4 to positive 4 (bmi_pgs). In short, these GPSs were trained on a cohort independent of UKB and the weights for prediction were then used to predict the GPS for each individual as a function of their genetic profile.

  1. Using Figure 3 as a guide, use ifelse to create a new variable called “high_pgs” for High GPS carriers and use a logistic regression to associate the carriers with Extreme Obesity (BMI greater than 40) (see Figure 3 for thresholds).

Hints:

  • use ifelse to encode a binary variable.
  • use a logistic regression (glm with a family=binomial())
  • use the tidy function to attain the coefficients. Exponentiation the non-intercept coefficients (high_gps) to estimate the “odds ratio”.
## create a new variable called high_gps into obesity categories using ifelse

## carrier state= high GPS, is the top 10% of the distribution 
quantile(ukb$bmi_pgs, probs = 0.9)
##      90% 
## 1.079284
plot_dens <- density(ukb$bmi_pgs)
plot(plot_dens, main = "Density plot of BMI PGS scores")

highest_10_percent <- quantile(ukb$bmi_pgs, 0.9)
abline(v = highest_10_percent, col = "red", lty = 2)
axis(side = 1, at = highest_10_percent, 
     labels = paste(round(highest_10_percent, 2)))

# Creating a new variable high_pgs(=1 if carrier, 0 if not)
carrier_threshold <- 1.079284  # enter a threshold here
ukb <- ukb %>% mutate(high_pgs = ifelse(bmi_pgs > carrier_threshold, 1, 0))
ukb <- ukb %>% mutate(extreme_obesity_gt40 = ifelse(bmi>40, 1,0)) 

## logistic regression:
gps_mod <- glm(extreme_obesity_gt40 ~ high_pgs, data=ukb, family=binomial())

# Extracting the coefficient of the model.
tidy(gps_mod)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    -4.22    0.0739     -57.1 0       
## 2 high_pgs        1.41    0.136       10.3 4.19e-25
exp(1.409444)
## [1] 4.093679

16 (continued). What is the OR? How far is it from the figure 3B estimate?

  • your answer here

Odds Ratio= 4.093679, it is very close to the odds ratio reported in figure 3B (4.22).

  1. Now calculate the R2 for the GPS on observed BMI. How far off is it from your answer to Question 12?
  • your answer here The calculated R2 –> Answer form question 12: 0.085264
## R2 of Observed BMI & GPS here
r2_calc <- cor(ukb$extreme_obesity_gt40,ukb$high_pgs)
r2_calc^2
## [1] 0.00877234

Bonus: Attempt to recreate Figures 2 and 3B (top panel) here:

#Figure 2 (Can't recreate Figure 2 B- no data about weight)

# Create a variable of deciles 1-10 
ukb$decile_var <- as.factor(cut(ukb$bmi_pgs, breaks = quantile(ukb$bmi_pgs, probs = seq(0, 1, by = 0.1)), include.lowest = TRUE, labels = FALSE))

# Creating clinical weight category variable 
ukb$obesity_category <- as.factor(cut(ukb$bmi, 
                            breaks=c(-Inf, 18.5, 25, 30, 40, Inf), 
                            labels=c("Underweight", "Normal", "Overweight", "Obesity", "Severe obesity")))
# Figure 2, A: polygenic score decile vs mean body mass index

figure2_A <- ukb %>% group_by(decile_var) %>% summarize(mean_bmi=mean(bmi)) %>% ggplot(aes(x=decile_var, y=mean_bmi)) + geom_point(size=2.5)+labs(x='Polygenic score decile', y= 'Mean Body Mass index')+ theme_classic()
# Figure 2, C: polygenic score decile vs percent of severe obesity
figure2_C <- ukb %>% group_by(decile_var, obesity_category) %>% summarize(n=n()) %>%
  mutate(severe_obesity_perc = (n[obesity_category == "Severe obesity"]/sum(n)*100)) %>% ggplot(aes(x=decile_var, y=severe_obesity_perc))+ geom_point(size=2.5) +labs(x='Polygenic score decile', y= 'Severe obesity, %')+ theme_classic()
library(cowplot)
plot_grid(figure2_A,figure2_C, ncol = 2)

# Figure 2, D: stratification into 3 decile categories. 
ukb <- ukb %>%  mutate(decile_3cat=case_when(decile_var==1~'Bottom Decile',decile_var %in% 2:9 ~ 'Deciles 2-9',decile_var==10 ~'Top Decile'))

ukb_summary <- ukb %>% 
  group_by(decile_3cat, obesity_category) %>% 
  summarize(n = n()) %>% # the count of individuals within each obesity group within each strata
  group_by(decile_3cat) %>%  # for the next step, the sum(n) will be the sum of each strata
  mutate(percentage = n/sum(n)*100) # Calculate within each decile group the percentage of obesity category
  
# Plot Figure 2,D
figure2_D <- ukb_summary %>% ggplot(aes(x = obesity_category, y = percentage, fill = decile_3cat)) +
  geom_bar(stat = "identity", position = "dodge",color='black') +#for me: stat identity: to plot data as is without transformation, position=dodge, to plot bars side by side.
  labs(x = "Clinical Weight Category", y = "Percent", fill='Polygenic score') +
  scale_fill_manual(values = c("white", "#59595A", "#010101")) + geom_text(aes(label = paste0(round(percentage,1))), position = position_dodge(width = 0.9), vjust = -0.5, size=3.5) +
  theme_minimal()

figure2_D

# Figure 3B
# Creating different extreme obesity categories
ukb <- ukb %>% mutate(extreme_obesity_gt50 = ifelse(bmi>50, 1,0)) 
ukb <- ukb %>% mutate(extreme_obesity_gt60 = ifelse(bmi>60, 1,0)) 

## now run a logistic regression:
gps_mod <- glm(extreme_obesity_gt40 ~ high_pgs, data=ukb, family=binomial())
gps_mod5 <- glm(extreme_obesity_gt50 ~ high_pgs, data=ukb, family=binomial())
gps_mod6 <- glm(extreme_obesity_gt60 ~ high_pgs, data=ukb, family=binomial())
# Building a dataframe with the following columns: Extreme Obesity, odds ratio, lower CI, upper CI, pvalue

OR_table <- data.frame(
  OR = c(exp(1.40944), exp( 1.5075),exp(-15.2062)),
  lci = c(exp(1.137902), exp(0.5358421), exp(10)),
  uci = c(exp(1.672466), exp(2.386119), exp(299.778873)),
  study = c("BMI > 40 kg/m2", "BMI > 50 kg/m2","BMI > 60 kg/m2")
)

# # Create the plot using ggplot2
# library(ggplot2)
# ggplot(OR_table, aes(x = OR, y = study)) +
#   geom_point(shape=15, size=3) +
#   geom_errorbarh(aes(xmin = lci, xmax = uci)) +
#   labs(x = "Odds Ratio", y = "Extreme Obesity") +
#   theme_classic()
load('nhanes.Rdata')
  1. Code obesity categories based on BMI in the NHANES (‘nhanes’ data tibble). Specifically, create categorical variable that encodes “underweight”, “normal”, “overweight”, “obese”, and “severe obesity”. See this: https://www.cdc.gov/obesity/adult/defining.html (and the Khera paper). Hint: use case_when to encode a categorical variable.
## transform into obesity categories using case_when
nhanes$Obesity_cat <- as.factor(cut(nhanes$BMXBMI, 
                            breaks=c(-Inf, 18.5, 25, 30, 40, Inf), 
                            labels=c("Underweight", "Normal", "Overweight", "Obesity", "Severe obesity")))


# Changing the reference group to Normal.
nhanes$Obesity_cat <- relevel(nhanes$Obesity_cat, ref = "Normal")
  1. Now code the outcome variables for diabetes as a function of fasting glucose (https://www.diabetes.org/a1c/diagnosis) and stage 1 hypertension (https://www.heart.org/en/health-topics/high-blood-pressure), where 1 are individuals with the disease and 0 are individuals without.
## code binary variables for diabetes: https://www.diabetes.org/a1c/diagnosis
## According to the guidlines, diabetic state is defined as fasting blood glucose >= 126 mg/dl
nhanes <- nhanes %>% mutate(Diabetes = ifelse(LBXGLU>=126, 1,0)) 

## code binary variables for hypertension: https://www.heart.org/en/health-topics/high-blood-pressure
## hypertension_1 ==1 are individuals with hypertension stage 1.
nhanes <- nhanes %>% mutate(hypertension_1 = ifelse((MSYS>130 & MSYS <139)|(MDIS>80 &MDIS<89), 1,0))
  1. Build a survey-weighted logistic regression that associates diabetes with clinical categories of obesity, adjusting for age and sex in the >= 40 year old white population (RIDAGEYR >= 40 and white == 1). Output the coefficients into a table. Interpret the clinical categories of obesity using the “normal” category as the reference group.

Hint Steps: a. create a svydesign data frame, e.g.:

nhanes <- nhanes %>% filter(!is.na(WTMEC2YR), WTMEC2YR > 0) # select only those that have non-zero and available weights
  1. use the subset function to subset a survey design data.frame to your inclusion criteria (RIDAGEYR >= 40, white == 1)
  2. use the svyglm function with the family parameter set to “quasibinomial()” to build a logistic regression model
  3. Exponentiate the non-intercept coefficients to estimate the odds ratio.
## svydesign, subset for White race over 40, and model here
target_pop <- nhanes %>% filter(RIDAGEYR>=40 & white==1)
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights = ~WTMEC2YR, nest = T, data=target_pop)

model4 <- svyglm(Diabetes ~ Obesity_cat+RIDAGEYR+female, dsn, family = quasibinomial())
# Output table of the coefficients: Obesity and diabetes
coef_diab_obes <- tidy(model4)
coef_diab_obes <- coef_diab_obes %>% mutate(odds_ratio=exp(estimate))

colnames(coef_diab_obes) <- c("Variable", "Estimate", "Standard Error", "T statistic","P value", "Odds Ratio")

library(kableExtra)
coef_diab_obes %>% as.data.frame %>%  kable() %>% kable_styling() 
Variable Estimate Standard Error T statistic P value Odds Ratio
(Intercept) -5.8165258 0.3768442 -15.4348291 0.0000000 0.0029779
Obesity_catUnderweight -0.3597052 1.0689721 -0.3364963 0.7378508 0.6978820
Obesity_catOverweight 0.6211567 0.1910879 3.2506326 0.0020228 1.8610794
Obesity_catObesity 1.4172440 0.1802689 7.8618334 0.0000000 4.1257344
Obesity_catSevere obesity 2.1770880 0.2408182 9.0403785 0.0000000 8.8205831
RIDAGEYR 0.0479821 0.0052821 9.0839461 0.0000000 1.0491518
female -0.4327819 0.1333383 -3.2457428 0.0020518 0.6487020
  1. Build a survey-weighted logistic regression model that associates hypertension with clinical categories of obesity, adjusting for age and sex in the >=40 year old, White population.
model1 <- svyglm(hypertension_1 ~ Obesity_cat+RIDAGEYR+female, dsn, family = quasibinomial())
# Output table of the coefficients: Obesity and Hypertension
coef_HTN_obes <- tidy(model1)
coef_HTN_obes <- coef_HTN_obes %>% mutate(odds_ratio=exp(estimate))

colnames(coef_HTN_obes) <- c("Variable", "Estimate", "Standard Error", "T statistic","P value", "Odds Ratio")

coef_HTN_obes %>% as.data.frame %>%  kable() %>% kable_styling() 
Variable Estimate Standard Error T statistic P value Odds Ratio
(Intercept) -1.5843767 0.2347156 -6.750197 0.0000000 0.2050756
Obesity_catUnderweight -0.4958091 0.3288886 -1.507529 0.1376132 0.6090779
Obesity_catOverweight 0.1251079 0.1119829 1.117205 0.2689456 1.1332707
Obesity_catObesity 0.4598710 0.0856401 5.369809 0.0000018 1.5838696
Obesity_catSevere obesity 0.5309901 0.1821038 2.915865 0.0051891 1.7006153
RIDAGEYR 0.0115151 0.0035069 3.283561 0.0018192 1.0115816
female -0.2893693 0.0868355 -3.332385 0.0015749 0.7487357
  1. Develop a survey-weighted logistic regression that associates Coronary Artery Disease (any_cad) with clinical categories of obesity, adjusting for age and sex in the >=40 year old, White population.
  • Note: any_cad is a variable created by the instructors that consists of self-report coronary disease or heart attack.
model2 <- svyglm(any_cad ~ Obesity_cat+RIDAGEYR+female, dsn, family = quasibinomial())
# Output table of the coefficients: Obesity and Coronary artery disease
coef_CAD_obes <- tidy(model2)
coef_CAD_obes <- coef_CAD_obes %>% mutate(odds_ratio=exp(estimate))

colnames(coef_CAD_obes) <- c("Variable", "Estimate", "Standard Error", "T statistic","P value", "Odds Ratio")

coef_CAD_obes %>% as.data.frame %>%  kable() %>% kable_styling()
Variable Estimate Standard Error T statistic P value Odds Ratio
(Intercept) -6.4655933 0.2268696 -28.4991630 0.0000000 0.0015561
Obesity_catUnderweight -0.0740032 0.5387198 -0.1373686 0.8912600 0.9286687
Obesity_catOverweight 0.4020027 0.0851874 4.7190424 0.0000177 1.4948154
Obesity_catObesity 0.6877345 0.0995761 6.9066242 0.0000000 1.9892038
Obesity_catSevere obesity 0.6076405 0.2182914 2.7836214 0.0074348 1.8360940
RIDAGEYR 0.0705823 0.0029568 23.8708025 0.0000000 1.0731328
female -0.5956460 0.0957323 -6.2219931 0.0000001 0.5512064
  1. Interpret the coefficients that emerge from 20-22. Specifically, how does the obese and extreme obese odds ratios compare to the genetically predicted GPS carriers? Why are the observed BMI ORs higher (or lower) than the GPS carrier odds ratios? What figures, if any, from the Khera paper gave hints of your result?
  • Answer here (3 parts):
  • Interpretation of OR for obese and extreme obese categories *** Diabetes Mellitus (DM):
  • Severly Obese individual, has 8.80 times the odds of having diabetes compared to an individual with a normal weight. (statistically significant)
  • Obese individual, has 4.13 times the odds of having diabetes compared to an individual with a normal weight. (statistically significant)

*** Hypertension (HTN): - Severly Obese individual, has 1.7 times the odds of having hypertension compared to an individual with a normal weight. (statistically significant) - Obese individual, has 1.6 times the odds of having hypertension compared to an individual with a normal weight. (statistically significant)

*** Coronary Artery Disease (CAD): - Severly Obese individual, has 1.8 times the odds of having CAD compared to an individual with a normal weight. (statistically significant) - Obese individual, has twice the odds of having CAD compared to an individual with a normal weight. (statistically significant)

Are observed BMI higher or lower? What figures give clues on your finding? The observed BMI ORs are higher than the GPS carrier odds ratios. The carrier state was defined as the top 10%, that does not translate exactly to the obese and severe obese categories (no total overlap). In figure 2 we can see clues of that: figure A: The mean BMI of the carriers (decile 10) is ~30 kg/m^2, the mean BMI of severely obese individuals would be by definition higher than 40. figure C: the severe obese individuals can be found across all deciles (in higher percentage in the top decile).

The observed BMI are individuals with the phenotype and not just with a genetic risk to it, thus the association with the outcome was stronger/higher. Also, BMI categories capture not only genetic risk but also other factors (such as lifestyle and environmental factors) that may increase the risk of developing the outcomes of interest.

  1. What are the advantages and disadvantages of using GPS for screening asymptomatic individuals? your answers here (write down 2 advantages and disadvantages) Advantages:
  • Early detection and early preventive measures.
  • Non-invasive: and potentially cost-effective, with the technology being more affordable.

Disadvantages: - False positive: false positives may beget unnecessary measures also emotional toll on the individual. - Not yet feasible in low income areas. which will create disparities based on socioeconomic and technical ground. Also, the genetic data in hand does not represent all ancestry equally, meaning we are unable to generialize to the overall population.