Human Genetics - Homework 2

(5 pts)

All required data files are located in the /homes/data/HW2_data/ directory on the bioi.cs.luc.edu server.

You will perform a GWAS for height using a simulated, i.e., made-up, height phenotype in your next homework. Before you can perform the GWAS, you will perform principal component analysis (PCA) in this homework to understand the population structure of your GWAS population.

One set of LD-pruned PLINK binary format files (end in .bed, .bim, and .fam, genome build 37), prefixed height_simGWAS_1000g_LDpruned, includes GWAS samples merged with several 1000 Genomes Project populations for anchoring. The other set, prefixed height_simGWAS_LDpruned, contains only the GWAS samples. Population information is in the height_simGWAS_1000g_PCA_pop_info.txt file.

Make a copy of this HW2.Rmd file from the /homes/data/HW2_data/ directory to your home directory. Perform the requested analyses and embed relevant code, plots, and explanations into your copy of the HW2.Rmd file. For some tips on using R Markdown files, see here. When finished, knit your .Rmd file as an HTML file and submit the HTML file to Sakai. Click Knit -> Knit to HTML. RStudio will prompt you to install any missing packages the first time you Click Knit or open an .Rmd file.

1. (0.5 pts) First perform PCA using PLINK with the merged GWAS and 1000 Genomes data. Generate 10 principal components. Embed your plink command below.

perform genetic principal component analysis

library(data.table)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)


system("plink --bfile /homes/data/HW2_data/height_simGWAS_1000g_LDpruned --pca 10 --out 1000g_PCA")

 evec = fread("1000g_PCA.eigenvec")
 evec

##      V1      V2         V3         V4          V5          V6          V7
##   1:  0 GWAS001  0.0300163 -0.0134008 -0.01746790 -0.04831100  0.00946752
##   2:  0 GWAS002  0.0309294 -0.0141377 -0.01683950 -0.04081600  0.00851499
##   3:  0 GWAS003  0.0309340 -0.0146392 -0.02777610 -0.05049600  0.01115270
##   4:  0 GWAS004  0.0314644 -0.0144932 -0.02445130 -0.05686460  0.01591310
##   5:  0 GWAS005  0.0313949 -0.0154698 -0.02365930 -0.04635000  0.00528847
##  ---                                                                     
## 558:  0 NA19239 -0.0774484 -0.0397551  0.00233799  0.01240630 -0.04393990
## 559:  0 NA19247 -0.0769389 -0.0389941  0.00131112 -0.01567630  0.03601710
## 560:  0 NA19248 -0.0767681 -0.0383840  0.00285064 -0.00390116  0.02947510
## 561:  0 NA19256 -0.0767317 -0.0396923 -0.00472775 -0.01242450  0.03101170
## 562:  0 NA19257 -0.0774294 -0.0398226 -0.00334081 -0.01288690  0.03448140
##               V8          V9         V10         V11         V12
##   1:  0.03249540 -0.02342930 -0.05516860  0.06379910 -0.04456750
##   2:  0.01888260 -0.06027280 -0.03716520  0.09417590 -0.09254040
##   3: -0.02965290  0.01883210  0.02004550 -0.05317540  0.01396120
##   4: -0.02132880  0.00574815  0.03112000 -0.04487670 -0.00512081
##   5:  0.02997330  0.03773810 -0.09167370  0.03162080  0.01817940
##  ---                                                            
## 558: -0.00353169  0.06561640 -0.03678540 -0.01550610  0.06890530
## 559:  0.08814500 -0.03083440 -0.00158430 -0.02450810  0.07754360
## 560:  0.07728780 -0.03633940 -0.02905750 -0.02878790  0.01906200
## 561: -0.07427510 -0.05295520  0.03823760 -0.00427902  0.06623060
## 562: -0.07286520 -0.05374330 -0.00422912  0.02470730  0.08973120

2. (0.5 pts) Load your merged PCA results in R and make a scatter plot of PC1 (x-axis) vs PC2 (y-axis) and color the points by Population. Embed your code and plot below. Write a figure legend, like you would see in a scientific journal, that explains what is represented in your plot. Your legend should explain:

what does each point represent?
what do the colors represent?
what is the continental ancestry of the GWAS population?

#add your R code in code chunks like this one


# Figure 1. 

# Each point represents a person and their first 2 components from their genotype data. The colors represent the different populations. The continental ancestry of the GWAS population is east asian.  

library(data.table)
library(dplyr) 
library (ggplot2)

evec = fread("1000g_PCA.eigenvec")
ggplot(evec,aes(x=V3,y=V4)) + geom_point() + xlab("PC1") + ylab("PC2")

pops = fread("/homes/data/HW2_data/height_simGWAS_1000g_PCA_pop_info.txt")

head(pops)

##       #IID PAT MAT SEX SuperPop Population
## 1: NA06984   0   0   1      EUR        CEU
## 2: NA06985   0   0   2      EUR        CEU
## 3: NA06986   0   0   1      EUR        CEU
## 4: NA06989   0   0   2      EUR        CEU
## 5: NA06994   0   0   1      EUR        CEU
## 6: NA07000   0   0   2      EUR        CEU

pcs = left_join(evec,pops,by=c("V2"="#IID")) #by= tells the left_join function what columns to join the data.frames by

ggplot(pcs,aes(x=V3,y=V4,color=SuperPop)) + geom_point() + xlab("PC1") + ylab("PC2")

Figure 1.

Each point represents a person and their first 2 components from their genotype data. The colors represent the different populations. The continental ancestry of the GWAS population is east asian.

3. (0.5 pts) From your merged PCA results, what proportion of the variance is explained by the first PC? What proportion of the variance is explained by the first 5 PCs? Embed the code you used to calculate this below.

59.2% of the variance is explained by the first PC. 93.3% of the variance is explained by the first 5 PCs.

#add your R code in code chunks like this one


eval = fread("1000g_PCA.eigenval")

eval = mutate(eval,PC=as.factor(c(1:10)))
eval = mutate(eval, pve=V1/sum(V1))
eval

##           V1 PC        pve
##  1: 50.22480  1 0.59207858
##  2: 22.08250  2 0.26032110
##  3:  3.73398  3 0.04401829
##  4:  1.77246  4 0.02089477
##  5:  1.51117  5 0.01781453
##  6:  1.26358  6 0.01489580
##  7:  1.09282  7 0.01288279
##  8:  1.07215  8 0.01263912
##  9:  1.04370  9 0.01230373
## 10:  1.03077 10 0.01215130

ggplot(eval, aes(x=PC, y=pve, group=1)) + geom_point() + geom_line() + ylab("variance explained")

59.2% of the variance is explained by the first PC. 93.3% of the variance is explained by the first 5 PCs.

4. (0.5 pts) Make a scree plot of your merged PCA results, i.e., proportion variance explained (y-axis) vs. PC (x-axis). Embed your code and plot below. Write a figure legend, like you would see in a scientific journal, that explains what is represented in your plot.

#add your R code in code chunks like this one


ggplot(eval, aes(x=PC, y=pve, group=1)) + geom_point() + geom_line() + ylab("variance explained")

Figure 2. Plot represents PC value v. variance explained. The plot above shows that around 59% of the variance is explained by PC1, 26.0% by PC2, etc.

5. (0.5 pts) Now perform PCA using PLINK with just the GWAS data. Generate 10 principal components. Embed your plink command and R code below.

system("plink --bfile /homes/data/HW2_data/height_simGWAS_LDpruned --pca 10 --out 1000g_PCAtype2")


evec1 = fread("1000g_PCAtype2.eigenvec")
evec1

##      V1      V2           V3          V4          V5           V6         V7
##   1:  0 GWAS001  0.000121037 -0.07842320  0.02383830 -0.119890000 -0.0406738
##   2:  0 GWAS002  0.000118713 -0.05789990  0.02805280 -0.141622000 -0.0118148
##   3:  0 GWAS003  0.015356000 -0.07633990  0.00433162  0.043727300  0.0181991
##   4:  0 GWAS004  0.008599260 -0.08125220 -0.00699374  0.057515400  0.0431358
##   5:  0 GWAS005  0.008593770 -0.07045780 -0.00772354 -0.066427000 -0.0538334
##  ---                                                                        
## 246:  0 GWAS246 -0.106178000  0.03824840 -0.06871180 -0.000165117 -0.0632509
## 247:  0 GWAS247 -0.090080800  0.03464050 -0.04073720 -0.008700280 -0.0525406
## 248:  0 GWAS248 -0.105635000  0.06735110 -0.02762610  0.036274100  0.0111138
## 249:  0 GWAS249 -0.096938100  0.04301970 -0.03590870  0.050071600  0.0592868
## 250:  0 GWAS250 -0.076182600  0.00201139 -0.08042170  0.093814200 -0.0438510
##              V8          V9         V10         V11         V12
##   1:  0.0514348 -0.04970010  0.00524256  0.04291860 -0.03527040
##   2: -0.0468522 -0.04189440 -0.02451430  0.04369060 -0.00639382
##   3: -0.0125338  0.03197040 -0.01213790  0.05635140 -0.05542660
##   4: -0.0588823  0.05183330  0.04806030 -0.00807608 -0.05143510
##   5:  0.1037530  0.00202053  0.00319513  0.03146300  0.14433400
##  ---                                                           
## 246: -0.0244932 -0.04420150  0.03800010 -0.03652480  0.08705950
## 247: -0.0105253 -0.02382970  0.00355451  0.05201170  0.06576610
## 248: -0.0450552  0.01511180 -0.03197180 -0.03815740  0.11942300
## 249: -0.0212291 -0.00383297  0.00641667 -0.02311150  0.03102250
## 250: -0.0236285 -0.08284370  0.02097270 -0.03784390 -0.09059720

6. (0.5 pts) Load your GWAS-only PCA results in R and make a scatter plot of PC1 (x-axis) vs PC2 (y-axis). Embed your code and plot below. Write a figure legend, like you would see in a scientific journal, that explains what is represented in your plot.

#add your R code in code chunks like this one

evec1 = fread("1000g_PCAtype2.eigenvec")
evec1

##      V1      V2           V3          V4          V5           V6         V7
##   1:  0 GWAS001  0.000121037 -0.07842320  0.02383830 -0.119890000 -0.0406738
##   2:  0 GWAS002  0.000118713 -0.05789990  0.02805280 -0.141622000 -0.0118148
##   3:  0 GWAS003  0.015356000 -0.07633990  0.00433162  0.043727300  0.0181991
##   4:  0 GWAS004  0.008599260 -0.08125220 -0.00699374  0.057515400  0.0431358
##   5:  0 GWAS005  0.008593770 -0.07045780 -0.00772354 -0.066427000 -0.0538334
##  ---                                                                        
## 246:  0 GWAS246 -0.106178000  0.03824840 -0.06871180 -0.000165117 -0.0632509
## 247:  0 GWAS247 -0.090080800  0.03464050 -0.04073720 -0.008700280 -0.0525406
## 248:  0 GWAS248 -0.105635000  0.06735110 -0.02762610  0.036274100  0.0111138
## 249:  0 GWAS249 -0.096938100  0.04301970 -0.03590870  0.050071600  0.0592868
## 250:  0 GWAS250 -0.076182600  0.00201139 -0.08042170  0.093814200 -0.0438510
##              V8          V9         V10         V11         V12
##   1:  0.0514348 -0.04970010  0.00524256  0.04291860 -0.03527040
##   2: -0.0468522 -0.04189440 -0.02451430  0.04369060 -0.00639382
##   3: -0.0125338  0.03197040 -0.01213790  0.05635140 -0.05542660
##   4: -0.0588823  0.05183330  0.04806030 -0.00807608 -0.05143510
##   5:  0.1037530  0.00202053  0.00319513  0.03146300  0.14433400
##  ---                                                           
## 246: -0.0244932 -0.04420150  0.03800010 -0.03652480  0.08705950
## 247: -0.0105253 -0.02382970  0.00355451  0.05201170  0.06576610
## 248: -0.0450552  0.01511180 -0.03197180 -0.03815740  0.11942300
## 249: -0.0212291 -0.00383297  0.00641667 -0.02311150  0.03102250
## 250: -0.0236285 -0.08284370  0.02097270 -0.03784390 -0.09059720

ggplot(evec1,aes(x=V3,y=V4)) + geom_point() + xlab("PC1") + ylab("PC2")

Figure 3. Each point represents a person and their first 2 components from their genotype data. This is comparing PC1 v PC2.

7. (0.5 pts) From your GWAS-only PCA results, what proportion of the variance is explained by the first PC? What proportion of the variance is explained by the first 5 PCs? Embed the code you used to calculate this below.

#add your R code in code chunks like this one

eval = fread("1000g_PCAtype2.eigenval")

eval = mutate(eval,PC=as.factor(c(1:10)))
eval = mutate(eval, pve=V1/sum(V1))
eval

##          V1 PC        pve
##  1: 2.94900  1 0.22238746
##  2: 1.55791  2 0.11748377
##  3: 1.32654  3 0.10003590
##  4: 1.11296  4 0.08392958
##  5: 1.08684  5 0.08195985
##  6: 1.06342  6 0.08019372
##  7: 1.05677  7 0.07969223
##  8: 1.04368  8 0.07870510
##  9: 1.03567  9 0.07810106
## 10: 1.02785 10 0.07751134

22.2% of the variation explained by PC1. 60.28% is explained by first 5 PC.

8. (0.5 pts) Make a scree plot of your GWAS-only PCA results, i.e., proportion variance explained (y-axis) vs. PC (x-axis). Embed your code and plot below. Write a figure legend, like you would see in a scientific journal, that explains what is represented in your plot.

#add your R code in code chunks like this one


ggplot(eval, aes(x=PC, y=pve, group=1)) + geom_point() + geom_line() + ylab("variance explained")

Figure 4. Plot represents PC value v. variance explained. The plot above shows what proportion of variance is explained by each PC.

9. (1 pt) Given all your PCA results, which PCs should you use as covariates to adjust for residual population stratification in your GWAS? Should you use your merged PCs or GWAS-only PCs? How many PCs should you include as covariates? Explain your reasoning below and save these PCs for use in Homework 3.

I would use the merged PCs because they have more variance explained. First 5 PCs of merged PC explained 93.3% of the variance. Whereas the GWAS-only first 5 PC explained only around 60%.