Please check Canvas for the due date of this assignment. Submit your answers as a knitted .html file: select the Knit drop down at the top of the window. Once converted to .html, select the file in the file browser, click the More drop down, and select Export.
In Session 2 and 3, we had an introduction to the “genome-wide association studies” (GWASs), which are used to examine the association between individual (up to a few million) genetic variants along the genome and a phenotype (e.g., BMI or diabetes). The phenotype can be disease and non-disease (or “cases” and “controls”), where one tests the variant frequency in cases (the variant prevalence in cases) versus controls (the variant prevalence in the controls).
Case-control GWASes are analytically “easy” to do: specifically, we execute a contingency table test of association (e.g., Chi-square test) or a logistic regression. Suppose we have a variant “locus” (location on the genome) called X_1. Lets also suppose that X_1 can take on either Thymine (“T”) or Cytosine (“C”), two nucleotide bases. Then, we choose the “effect” allele (lets say “T”), and for each individual, count how many Ts they have at X_1 (0, 1, or 2). Lets call the count at X_1, C_1. Then, we count the number of total Ts in the cases vs. controls and do a contingency table test (2x2 table) or a logistic regression. Lets write down the logistic model: if Y is 1 for individuals with a disease, and 0 without, then we have a model for X_1 and Y:
\[Y = \alpha + \beta_1 C_1\]
Where \(C_1\) is 0, 1, or 2 for each individual in the dataset, and Y is 0 for controls, 1 for cases. Then, we execute this model for every variant on the GWAS array. In a logistic regression model, the \(exp(\beta_1)\) is the “odds ratio” or, odds of the disease for a 1 unit change in the effect allele vs. no change.
We can do the same if Y is continuous, such as body mass index, using linear regression (lm in R). As a thought question, what would be the interpretation of \(\beta_1\) if using a “lm”? For more about GWASs, see this tutorial.
One important detail is p-values of significance are used to determine what loci are associated with the outcome (if \(\beta_1\) is nonzero). We see those signals in the “Manhattan Plot” - the higher the -log10(pvalue), the higher the point on the y-axis, and hence, more of a signal than those with higher p-values. As a result, millions of \(\beta_m\) are estimated. The most significant loci (and the \(\beta_m\)) for studies are reported in the GWAS Catalog.
How can we use millions of risk estimates to determine risk for a single person?: the polygenic risk score. Essentially, the polygenic risk score (PRS) or equivalently, the genome-wide polygenic score (GPS, for real-valued traits such as body mass index) is simply a sum of the number of effect alleles for an individual (summing up across M number of loci) multiplied by the \(\beta\) from the regression. For example, for one individual (indexed by i), the GPS for i would be:
\[PRS^i = \sum^M_{j=1} \beta_j C^i_j\]
Therefore, the higher the PRS, the higher the outcome risk, or the higher the PGS, the higher the observed trait (e.g., height, weight, body mass index).
What \(\beta_j\) do we choose to put in the GPS? The easiest is to apply some pvalue threshold (e.g., Bonferroni) and then “prune” those that are independent of others (remember, variant frequency are correlated between closely neighboring loci due to linkage disequilibrium). Another way is to use a regularized regression similar to what we saw in Session 3. When combining effect sizes, the \(\beta_j\) used in the GPS will be smaller than that in GWAS. Check out this optional tutorial for more detail
Body mass index (BMI), or squared weight of an individual divided by their height, is a major risk factor for many clinical outcomes, such as type 2 diabetes and heart disease. BMI is used to characterize clinical obesity, or obesity categories, such as “severe obesity”, “obesity”, or “underweight”: see the cutpoints
BMI is known as a “complex” phenotype, with many genetic (G), environmental (E), and behavioral factors that contribute to it, such as physical activity, or genetic variants in the FTO gene.
Khera and colleagues have developed a GPS using data from prior GWAS on BMI and using the UK Biobank cohort data.
Read Khera et al 2019 and answer the following questions:
They restricted the analysis to individuals of white British ancestry, aged between 40-69 yo.
They labeled the top decile of the GPS distribution as “carriers” and the remainder of the distribution as non-carriers. The testing dataset contained 288,016, 10% of which were labeleld as carriers, meaning the sample size was 28,801 (they mention the top decile is 28,784, possibly some individuals were dropped due to missing data).
library(tidyverse)
library(survey)
library(broom)
library(knitr)
library(gtsummary)
We can find the GPS model statistics for Khera here, at the PGS catalog: https://www.pgscatalog.org/score/PGS000027/
We have provided the GPS in your project directory, called: PGS000027.txt.gz –>
pgs_bmi <- read_delim('PGS000027.txt.gz', delim = "\t", comment="#")
Rows: each row is the features of a different gene (its position, alleles and effect weight of the effect allele)
One of the “strongest” or lowest pvalue variants associated with BMI happens to be in the FTO gene. For example, lets look at a GWAS that cited by Khera in the GWAS Catalog: https://www.ebi.ac.uk/gwas/publications/25673413
And sort by pvalue, lowest to highest.
chromosome_number <- 16 # your answer here
chromosome_position <- 53769662 # your answer here
pgs_bmi %>% filter(chr_position==53769662)
## # A tibble: 0 × 5
## # … with 5 variables: chr_name <dbl>, chr_position <dbl>, effect_allele <chr>,
## # other_allele <chr>, effect_weight <dbl>
According to the UCSC browser, FTO spans approximately 415 kilobases, from base 53701692 to base 54158512
fto_start <- 53701692
fto_end <- 54158512
# plotting code here
# Filtering for the variants witin the fto_start and fto_end
pgs_bmi_FTO_subset <- pgs_bmi %>% filter(chr_position >= fto_start & chr_position<= fto_end) %>% ggplot(aes(x=chr_position, y=effect_weight)) + geom_point(size = 1, color = "#0099f9") +labs(x='Chromosome Position', y='Effect Weight', title = "Effect weight over the span of the FTO gene") + theme_classic() + geom_vline(xintercept = chromosome_position, color = "red", linetype = "dashed") + annotate("text", x = 53740000, y = 0.004, label = "Top SNP", color = "red")
pgs_bmi_FTO_subset
pgs_bmi_FTO_subset1 <- pgs_bmi %>% filter(chr_position >= fto_start & chr_position<= fto_end)
# hint use ecdf: ?ecdf
# or, alternatively, try quantile: ?quantile
# Finding the maximal weight in the FTO gene
max_fto_weight <- max(pgs_bmi_FTO_subset1$effect_weight)
# Will create an ECDF for the weights in the dataset
ecdf_distribustion <- ecdf(pgs_bmi$effect_weight)
# finding the rank
fto_rank <- ecdf_distribustion(max_fto_weight)
# ECDF plot
plot(ecdf_distribustion, xlab = "Effect Weight", ylab = "Proportion", main = "ECDF of Effect Weights", xlim = c(0, 0.006))
# annotation
abline(v = max_fto_weight, lty = 2, col = "red")
text(max_fto_weight, fto_rank, "Max weight in FTO", pos = 4, color='red')
99%
of the observation have an effect weight lower than the maximal effect
weight in the FTO gene.
Recall session 3 on clinical machine learning, specifically on approaches to build large regression models, such as LASSO for question 11.
To note: R^2 in the setting of GWAS is a measure of Linkage Disequilibrium (LD), it denotes how linked two variants are (measure from 0-1, the smaller the less correlated).in table S1, R^2=0.2, indicated that the variants that reached genome wide significance in the GWAS are not in linkage disequilibrium to one another, meaning they are not correlated.
## need to sum over all the effect weights, multiplying each by 2
unique_individual <- sum(pgs_bmi$effect_weight)*2
A medical colleague comments that the predictive ability of carrier risk for future health care utilization (Figure 3) and disease outcomes (Figure 4) seem large. They claim that perhaps genetic screening for GPS carrier status should be used to screen asymptomatic individuals for BMI-related outcomes in the clinic.
Lets verify the connection between the predicted BMI and measured BMI and your answers to Q6 on a subset of the UK Biobank (n=14K individuals, randomly sampled).
Load in the UK Biobank data:
ukb <- readRDS('./ukb704.rds')
ukb %>% head()
## # A tibble: 6 × 4
## birth_year sex bmi bmi_pgs
## <int> <chr> <dbl> <dbl>
## 1 1957 Male 29.7 -0.162
## 2 1944 Female 22.3 0.347
## 3 1967 Male 26.4 0.200
## 4 1970 Female 33.8 -0.651
## 5 1961 Female 28.4 -0.000529
## 6 1968 Male 24.9 -0.710
Each row contains the birth_year, sex, and Observed BMI (bmi) and a “normalized” PGS of BMI that ranges from roughly -4 to positive 4 (bmi_pgs). In short, these GPSs were trained on a cohort independent of UKB and the weights for prediction were then used to predict the GPS for each individual as a function of their genetic profile.
Hints:
## create a new variable called high_gps into obesity categories using ifelse
## carrier state= high GPS, is the top 10% of the distribution
quantile(ukb$bmi_pgs, probs = 0.9)
## 90%
## 1.079284
plot_dens <- density(ukb$bmi_pgs)
plot(plot_dens, main = "Density plot of BMI PGS scores")
highest_10_percent <- quantile(ukb$bmi_pgs, 0.9)
abline(v = highest_10_percent, col = "red", lty = 2)
axis(side = 1, at = highest_10_percent,
labels = paste(round(highest_10_percent, 2)))
# Creating a new variable high_pgs(=1 if carrier, 0 if not)
carrier_threshold <- 1.079284 # enter a threshold here
ukb <- ukb %>% mutate(high_pgs = ifelse(bmi_pgs > carrier_threshold, 1, 0))
ukb <- ukb %>% mutate(extreme_obesity_gt40 = ifelse(bmi>40, 1,0))
## logistic regression:
gps_mod <- glm(extreme_obesity_gt40 ~ high_pgs, data=ukb, family=binomial())
# Extracting the coefficient of the model.
tidy(gps_mod)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.22 0.0739 -57.1 0
## 2 high_pgs 1.41 0.136 10.3 4.19e-25
exp(1.409444)
## [1] 4.093679
16 (continued). What is the OR? How far is it from the figure 3B estimate?
Odds Ratio= 4.093679, it is very close to the odds ratio reported in figure 3B (4.22).
## R2 of Observed BMI & GPS here
r2_calc <- cor(ukb$extreme_obesity_gt40,ukb$high_pgs)
r2_calc^2
## [1] 0.00877234
Bonus: Attempt to recreate Figures 2 and 3B (top panel) here:
#Figure 2 (Can't recreate Figure 2 B- no data about weight)
# Create a variable of deciles 1-10
ukb$decile_var <- as.factor(cut(ukb$bmi_pgs, breaks = quantile(ukb$bmi_pgs, probs = seq(0, 1, by = 0.1)), include.lowest = TRUE, labels = FALSE))
# Creating clinical weight category variable
ukb$obesity_category <- as.factor(cut(ukb$bmi,
breaks=c(-Inf, 18.5, 25, 30, 40, Inf),
labels=c("Underweight", "Normal", "Overweight", "Obesity", "Severe obesity")))
# Figure 2, A: polygenic score decile vs mean body mass index
figure2_A <- ukb %>% group_by(decile_var) %>% summarize(mean_bmi=mean(bmi)) %>% ggplot(aes(x=decile_var, y=mean_bmi)) + geom_point(size=2.5)+labs(x='Polygenic score decile', y= 'Mean Body Mass index')+ theme_classic()
# Figure 2, C: polygenic score decile vs percent of severe obesity
figure2_C <- ukb %>% group_by(decile_var, obesity_category) %>% summarize(n=n()) %>%
mutate(severe_obesity_perc = (n[obesity_category == "Severe obesity"]/sum(n)*100)) %>% ggplot(aes(x=decile_var, y=severe_obesity_perc))+ geom_point(size=2.5) +labs(x='Polygenic score decile', y= 'Severe obesity, %')+ theme_classic()
library(cowplot)
plot_grid(figure2_A,figure2_C, ncol = 2)
# Figure 2, D: stratification into 3 decile categories.
ukb <- ukb %>% mutate(decile_3cat=case_when(decile_var==1~'Bottom Decile',decile_var %in% 2:9 ~ 'Deciles 2-9',decile_var==10 ~'Top Decile'))
ukb_summary <- ukb %>%
group_by(decile_3cat, obesity_category) %>%
summarize(n = n()) %>% # the count of individuals within each obesity group within each strata
group_by(decile_3cat) %>% # for the next step, the sum(n) will be the sum of each strata
mutate(percentage = n/sum(n)*100) # Calculate within each decile group the percentage of obesity category
# Plot Figure 2,D
figure2_D <- ukb_summary %>% ggplot(aes(x = obesity_category, y = percentage, fill = decile_3cat)) +
geom_bar(stat = "identity", position = "dodge",color='black') +#for me: stat identity: to plot data as is without transformation, position=dodge, to plot bars side by side.
labs(x = "Clinical Weight Category", y = "Percent", fill='Polygenic score') +
scale_fill_manual(values = c("white", "#59595A", "#010101")) + geom_text(aes(label = paste0(round(percentage,1))), position = position_dodge(width = 0.9), vjust = -0.5, size=3.5) +
theme_minimal()
figure2_D
# Figure 3B
# Creating different extreme obesity categories
ukb <- ukb %>% mutate(extreme_obesity_gt50 = ifelse(bmi>50, 1,0))
ukb <- ukb %>% mutate(extreme_obesity_gt60 = ifelse(bmi>60, 1,0))
## now run a logistic regression:
gps_mod <- glm(extreme_obesity_gt40 ~ high_pgs, data=ukb, family=binomial())
gps_mod5 <- glm(extreme_obesity_gt50 ~ high_pgs, data=ukb, family=binomial())
gps_mod6 <- glm(extreme_obesity_gt60 ~ high_pgs, data=ukb, family=binomial())
# Building a dataframe with the following columns: Extreme Obesity, odds ratio, lower CI, upper CI, pvalue
OR_table <- data.frame(
OR = c(exp(1.40944), exp( 1.5075),exp(-15.2062)),
lci = c(exp(1.137902), exp(0.5358421), exp(10)),
uci = c(exp(1.672466), exp(2.386119), exp(299.778873)),
study = c("BMI > 40 kg/m2", "BMI > 50 kg/m2","BMI > 60 kg/m2")
)
# # Create the plot using ggplot2
# library(ggplot2)
# ggplot(OR_table, aes(x = OR, y = study)) +
# geom_point(shape=15, size=3) +
# geom_errorbarh(aes(xmin = lci, xmax = uci)) +
# labs(x = "Odds Ratio", y = "Extreme Obesity") +
# theme_classic()
load('nhanes.Rdata')
## transform into obesity categories using case_when
nhanes$Obesity_cat <- as.factor(cut(nhanes$BMXBMI,
breaks=c(-Inf, 18.5, 25, 30, 40, Inf),
labels=c("Underweight", "Normal", "Overweight", "Obesity", "Severe obesity")))
# Changing the reference group to Normal.
nhanes$Obesity_cat <- relevel(nhanes$Obesity_cat, ref = "Normal")
## code binary variables for diabetes: https://www.diabetes.org/a1c/diagnosis
## According to the guidlines, diabetic state is defined as fasting blood glucose >= 126 mg/dl
nhanes <- nhanes %>% mutate(Diabetes = ifelse(LBXGLU>=126, 1,0))
## code binary variables for hypertension: https://www.heart.org/en/health-topics/high-blood-pressure
## hypertension_1 ==1 are individuals with hypertension stage 1.
nhanes <- nhanes %>% mutate(hypertension_1 = ifelse((MSYS>130 & MSYS <139)|(MDIS>80 &MDIS<89), 1,0))
Hint Steps: a. create a svydesign data frame, e.g.:
nhanes <- nhanes %>% filter(!is.na(WTMEC2YR), WTMEC2YR > 0) # select only those that have non-zero and available weights
## svydesign, subset for White race over 40, and model here
target_pop <- nhanes %>% filter(RIDAGEYR>=40 & white==1)
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights = ~WTMEC2YR, nest = T, data=target_pop)
model4 <- svyglm(Diabetes ~ Obesity_cat+RIDAGEYR+female, dsn, family = quasibinomial())
# Output table of the coefficients: Obesity and diabetes
coef_diab_obes <- tidy(model4)
coef_diab_obes <- coef_diab_obes %>% mutate(odds_ratio=exp(estimate))
colnames(coef_diab_obes) <- c("Variable", "Estimate", "Standard Error", "T statistic","P value", "Odds Ratio")
library(kableExtra)
coef_diab_obes %>% as.data.frame %>% kable() %>% kable_styling()
| Variable | Estimate | Standard Error | T statistic | P value | Odds Ratio |
|---|---|---|---|---|---|
| (Intercept) | -5.8165258 | 0.3768442 | -15.4348291 | 0.0000000 | 0.0029779 |
| Obesity_catUnderweight | -0.3597052 | 1.0689721 | -0.3364963 | 0.7378508 | 0.6978820 |
| Obesity_catOverweight | 0.6211567 | 0.1910879 | 3.2506326 | 0.0020228 | 1.8610794 |
| Obesity_catObesity | 1.4172440 | 0.1802689 | 7.8618334 | 0.0000000 | 4.1257344 |
| Obesity_catSevere obesity | 2.1770880 | 0.2408182 | 9.0403785 | 0.0000000 | 8.8205831 |
| RIDAGEYR | 0.0479821 | 0.0052821 | 9.0839461 | 0.0000000 | 1.0491518 |
| female | -0.4327819 | 0.1333383 | -3.2457428 | 0.0020518 | 0.6487020 |
model1 <- svyglm(hypertension_1 ~ Obesity_cat+RIDAGEYR+female, dsn, family = quasibinomial())
# Output table of the coefficients: Obesity and Hypertension
coef_HTN_obes <- tidy(model1)
coef_HTN_obes <- coef_HTN_obes %>% mutate(odds_ratio=exp(estimate))
colnames(coef_HTN_obes) <- c("Variable", "Estimate", "Standard Error", "T statistic","P value", "Odds Ratio")
coef_HTN_obes %>% as.data.frame %>% kable() %>% kable_styling()
| Variable | Estimate | Standard Error | T statistic | P value | Odds Ratio |
|---|---|---|---|---|---|
| (Intercept) | -1.5843767 | 0.2347156 | -6.750197 | 0.0000000 | 0.2050756 |
| Obesity_catUnderweight | -0.4958091 | 0.3288886 | -1.507529 | 0.1376132 | 0.6090779 |
| Obesity_catOverweight | 0.1251079 | 0.1119829 | 1.117205 | 0.2689456 | 1.1332707 |
| Obesity_catObesity | 0.4598710 | 0.0856401 | 5.369809 | 0.0000018 | 1.5838696 |
| Obesity_catSevere obesity | 0.5309901 | 0.1821038 | 2.915865 | 0.0051891 | 1.7006153 |
| RIDAGEYR | 0.0115151 | 0.0035069 | 3.283561 | 0.0018192 | 1.0115816 |
| female | -0.2893693 | 0.0868355 | -3.332385 | 0.0015749 | 0.7487357 |
model2 <- svyglm(any_cad ~ Obesity_cat+RIDAGEYR+female, dsn, family = quasibinomial())
# Output table of the coefficients: Obesity and Coronary artery disease
coef_CAD_obes <- tidy(model2)
coef_CAD_obes <- coef_CAD_obes %>% mutate(odds_ratio=exp(estimate))
colnames(coef_CAD_obes) <- c("Variable", "Estimate", "Standard Error", "T statistic","P value", "Odds Ratio")
coef_CAD_obes %>% as.data.frame %>% kable() %>% kable_styling()
| Variable | Estimate | Standard Error | T statistic | P value | Odds Ratio |
|---|---|---|---|---|---|
| (Intercept) | -6.4655933 | 0.2268696 | -28.4991630 | 0.0000000 | 0.0015561 |
| Obesity_catUnderweight | -0.0740032 | 0.5387198 | -0.1373686 | 0.8912600 | 0.9286687 |
| Obesity_catOverweight | 0.4020027 | 0.0851874 | 4.7190424 | 0.0000177 | 1.4948154 |
| Obesity_catObesity | 0.6877345 | 0.0995761 | 6.9066242 | 0.0000000 | 1.9892038 |
| Obesity_catSevere obesity | 0.6076405 | 0.2182914 | 2.7836214 | 0.0074348 | 1.8360940 |
| RIDAGEYR | 0.0705823 | 0.0029568 | 23.8708025 | 0.0000000 | 1.0731328 |
| female | -0.5956460 | 0.0957323 | -6.2219931 | 0.0000001 | 0.5512064 |
*** Hypertension (HTN): - Severly Obese individual, has 1.7 times the odds of having hypertension compared to an individual with a normal weight. (statistically significant) - Obese individual, has 1.6 times the odds of having hypertension compared to an individual with a normal weight. (statistically significant)
*** Coronary Artery Disease (CAD): - Severly Obese individual, has 1.8 times the odds of having CAD compared to an individual with a normal weight. (statistically significant) - Obese individual, has twice the odds of having CAD compared to an individual with a normal weight. (statistically significant)
Are observed BMI higher or lower? What figures give clues on your finding? The observed BMI ORs are higher than the GPS carrier odds ratios. The carrier state was defined as the top 10%, that does not translate exactly to the obese and severe obese categories (no total overlap). In figure 2 we can see clues of that: figure A: The mean BMI of the carriers (decile 10) is ~30 kg/m^2, the mean BMI of severely obese individuals would be by definition higher than 40. figure C: the severe obese individuals can be found across all deciles (in higher percentage in the top decile).
The observed BMI are individuals with the phenotype and not just with a genetic risk to it, thus the association with the outcome was stronger/higher. Also, BMI categories capture not only genetic risk but also other factors (such as lifestyle and environmental factors) that may increase the risk of developing the outcomes of interest.
Disadvantages: - False positive: false positives may beget unnecessary measures also emotional toll on the individual. - Not yet feasible in low income areas. which will create disparities based on socioeconomic and technical ground. Also, the genetic data in hand does not represent all ancestry equally, meaning we are unable to generialize to the overall population.