All required data files are located in the /homes/data/HW2_data/ directory on the bioi.cs.luc.edu server.
You will perform a GWAS for height using a simulated, i.e., made-up, height phenotype in your next homework. Before you can perform the GWAS, you will perform principal component analysis (PCA) in this homework to understand the population structure of your GWAS population.
One set of LD-pruned PLINK binary format files (end in .bed, .bim, and .fam, genome build 37), prefixed height_simGWAS_1000g_LDpruned, includes GWAS samples merged with several 1000 Genomes Project populations for anchoring. The other set, prefixed height_simGWAS_LDpruned, contains only the GWAS samples. Population information is in the height_simGWAS_1000g_PCA_pop_info.txt file.
Make a copy of this HW2.Rmd file from the /homes/data/HW2_data/ directory to your home directory. Perform the requested analyses and embed relevant code, plots, and explanations into your copy of the HW2.Rmd file. For some tips on using R Markdown files, see here. When finished, knit your .Rmd file as an HTML file and submit the HTML file to Sakai. Click Knit -> Knit to HTML. RStudio will prompt you to install any missing packages the first time you Click Knit or open an .Rmd file.
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
system("plink --bfile /homes/data/HW2_data/height_simGWAS_1000g_LDpruned --pca 10 --out 1000g_PCA")
evec = fread("1000g_PCA.eigenvec")
evec
## V1 V2 V3 V4 V5 V6 V7
## 1: 0 GWAS001 0.0300163 -0.0134008 -0.01746790 -0.04831100 0.00946752
## 2: 0 GWAS002 0.0309294 -0.0141377 -0.01683950 -0.04081600 0.00851499
## 3: 0 GWAS003 0.0309340 -0.0146392 -0.02777610 -0.05049600 0.01115270
## 4: 0 GWAS004 0.0314644 -0.0144932 -0.02445130 -0.05686460 0.01591310
## 5: 0 GWAS005 0.0313949 -0.0154698 -0.02365930 -0.04635000 0.00528847
## ---
## 558: 0 NA19239 -0.0774484 -0.0397551 0.00233799 0.01240630 -0.04393990
## 559: 0 NA19247 -0.0769389 -0.0389941 0.00131112 -0.01567630 0.03601710
## 560: 0 NA19248 -0.0767681 -0.0383840 0.00285064 -0.00390116 0.02947510
## 561: 0 NA19256 -0.0767317 -0.0396923 -0.00472775 -0.01242450 0.03101170
## 562: 0 NA19257 -0.0774294 -0.0398226 -0.00334081 -0.01288690 0.03448140
## V8 V9 V10 V11 V12
## 1: 0.03249540 -0.02342930 -0.05516860 0.06379910 -0.04456750
## 2: 0.01888260 -0.06027280 -0.03716520 0.09417590 -0.09254040
## 3: -0.02965290 0.01883210 0.02004550 -0.05317540 0.01396120
## 4: -0.02132880 0.00574815 0.03112000 -0.04487670 -0.00512081
## 5: 0.02997330 0.03773810 -0.09167370 0.03162080 0.01817940
## ---
## 558: -0.00353169 0.06561640 -0.03678540 -0.01550610 0.06890530
## 559: 0.08814500 -0.03083440 -0.00158430 -0.02450810 0.07754360
## 560: 0.07728780 -0.03633940 -0.02905750 -0.02878790 0.01906200
## 561: -0.07427510 -0.05295520 0.03823760 -0.00427902 0.06623060
## 562: -0.07286520 -0.05374330 -0.00422912 0.02470730 0.08973120
#add your R code in code chunks like this one
# Figure 1.
# Each point represents a person and their first 2 components from their genotype data. The colors represent the different populations. The continental ancestry of the GWAS population is east asian.
library(data.table)
library(dplyr)
library (ggplot2)
evec = fread("1000g_PCA.eigenvec")
ggplot(evec,aes(x=V3,y=V4)) + geom_point() + xlab("PC1") + ylab("PC2")
pops = fread("/homes/data/HW2_data/height_simGWAS_1000g_PCA_pop_info.txt")
head(pops)
## #IID PAT MAT SEX SuperPop Population
## 1: NA06984 0 0 1 EUR CEU
## 2: NA06985 0 0 2 EUR CEU
## 3: NA06986 0 0 1 EUR CEU
## 4: NA06989 0 0 2 EUR CEU
## 5: NA06994 0 0 1 EUR CEU
## 6: NA07000 0 0 2 EUR CEU
pcs = left_join(evec,pops,by=c("V2"="#IID")) #by= tells the left_join function what columns to join the data.frames by
ggplot(pcs,aes(x=V3,y=V4,color=SuperPop)) + geom_point() + xlab("PC1") + ylab("PC2")
Figure 1.
Each point represents a person and their first 2 components from their genotype data. The colors represent the different populations. The continental ancestry of the GWAS population is east asian.
59.2% of the variance is explained by the first PC. 93.3% of the variance is explained by the first 5 PCs.
#add your R code in code chunks like this one
eval = fread("1000g_PCA.eigenval")
eval = mutate(eval,PC=as.factor(c(1:10)))
eval = mutate(eval, pve=V1/sum(V1))
eval
## V1 PC pve
## 1: 50.22480 1 0.59207858
## 2: 22.08250 2 0.26032110
## 3: 3.73398 3 0.04401829
## 4: 1.77246 4 0.02089477
## 5: 1.51117 5 0.01781453
## 6: 1.26358 6 0.01489580
## 7: 1.09282 7 0.01288279
## 8: 1.07215 8 0.01263912
## 9: 1.04370 9 0.01230373
## 10: 1.03077 10 0.01215130
ggplot(eval, aes(x=PC, y=pve, group=1)) + geom_point() + geom_line() + ylab("variance explained")
59.2% of the variance is explained by the first PC. 93.3% of the variance is explained by the first 5 PCs.
#add your R code in code chunks like this one
ggplot(eval, aes(x=PC, y=pve, group=1)) + geom_point() + geom_line() + ylab("variance explained")
Figure 2. Plot represents PC value v. variance explained. The plot above shows that around 59% of the variance is explained by PC1, 26.0% by PC2, etc.
system("plink --bfile /homes/data/HW2_data/height_simGWAS_LDpruned --pca 10 --out 1000g_PCAtype2")
evec1 = fread("1000g_PCAtype2.eigenvec")
evec1
## V1 V2 V3 V4 V5 V6 V7
## 1: 0 GWAS001 0.000121037 -0.07842320 0.02383830 -0.119890000 -0.0406738
## 2: 0 GWAS002 0.000118713 -0.05789990 0.02805280 -0.141622000 -0.0118148
## 3: 0 GWAS003 0.015356000 -0.07633990 0.00433162 0.043727300 0.0181991
## 4: 0 GWAS004 0.008599260 -0.08125220 -0.00699374 0.057515400 0.0431358
## 5: 0 GWAS005 0.008593770 -0.07045780 -0.00772354 -0.066427000 -0.0538334
## ---
## 246: 0 GWAS246 -0.106178000 0.03824840 -0.06871180 -0.000165117 -0.0632509
## 247: 0 GWAS247 -0.090080800 0.03464050 -0.04073720 -0.008700280 -0.0525406
## 248: 0 GWAS248 -0.105635000 0.06735110 -0.02762610 0.036274100 0.0111138
## 249: 0 GWAS249 -0.096938100 0.04301970 -0.03590870 0.050071600 0.0592868
## 250: 0 GWAS250 -0.076182600 0.00201139 -0.08042170 0.093814200 -0.0438510
## V8 V9 V10 V11 V12
## 1: 0.0514348 -0.04970010 0.00524256 0.04291860 -0.03527040
## 2: -0.0468522 -0.04189440 -0.02451430 0.04369060 -0.00639382
## 3: -0.0125338 0.03197040 -0.01213790 0.05635140 -0.05542660
## 4: -0.0588823 0.05183330 0.04806030 -0.00807608 -0.05143510
## 5: 0.1037530 0.00202053 0.00319513 0.03146300 0.14433400
## ---
## 246: -0.0244932 -0.04420150 0.03800010 -0.03652480 0.08705950
## 247: -0.0105253 -0.02382970 0.00355451 0.05201170 0.06576610
## 248: -0.0450552 0.01511180 -0.03197180 -0.03815740 0.11942300
## 249: -0.0212291 -0.00383297 0.00641667 -0.02311150 0.03102250
## 250: -0.0236285 -0.08284370 0.02097270 -0.03784390 -0.09059720
#add your R code in code chunks like this one
evec1 = fread("1000g_PCAtype2.eigenvec")
evec1
## V1 V2 V3 V4 V5 V6 V7
## 1: 0 GWAS001 0.000121037 -0.07842320 0.02383830 -0.119890000 -0.0406738
## 2: 0 GWAS002 0.000118713 -0.05789990 0.02805280 -0.141622000 -0.0118148
## 3: 0 GWAS003 0.015356000 -0.07633990 0.00433162 0.043727300 0.0181991
## 4: 0 GWAS004 0.008599260 -0.08125220 -0.00699374 0.057515400 0.0431358
## 5: 0 GWAS005 0.008593770 -0.07045780 -0.00772354 -0.066427000 -0.0538334
## ---
## 246: 0 GWAS246 -0.106178000 0.03824840 -0.06871180 -0.000165117 -0.0632509
## 247: 0 GWAS247 -0.090080800 0.03464050 -0.04073720 -0.008700280 -0.0525406
## 248: 0 GWAS248 -0.105635000 0.06735110 -0.02762610 0.036274100 0.0111138
## 249: 0 GWAS249 -0.096938100 0.04301970 -0.03590870 0.050071600 0.0592868
## 250: 0 GWAS250 -0.076182600 0.00201139 -0.08042170 0.093814200 -0.0438510
## V8 V9 V10 V11 V12
## 1: 0.0514348 -0.04970010 0.00524256 0.04291860 -0.03527040
## 2: -0.0468522 -0.04189440 -0.02451430 0.04369060 -0.00639382
## 3: -0.0125338 0.03197040 -0.01213790 0.05635140 -0.05542660
## 4: -0.0588823 0.05183330 0.04806030 -0.00807608 -0.05143510
## 5: 0.1037530 0.00202053 0.00319513 0.03146300 0.14433400
## ---
## 246: -0.0244932 -0.04420150 0.03800010 -0.03652480 0.08705950
## 247: -0.0105253 -0.02382970 0.00355451 0.05201170 0.06576610
## 248: -0.0450552 0.01511180 -0.03197180 -0.03815740 0.11942300
## 249: -0.0212291 -0.00383297 0.00641667 -0.02311150 0.03102250
## 250: -0.0236285 -0.08284370 0.02097270 -0.03784390 -0.09059720
ggplot(evec1,aes(x=V3,y=V4)) + geom_point() + xlab("PC1") + ylab("PC2")
Figure 3. Each point represents a person and their first 2 components from their genotype data. This is comparing PC1 v PC2.
#add your R code in code chunks like this one
eval = fread("1000g_PCAtype2.eigenval")
eval = mutate(eval,PC=as.factor(c(1:10)))
eval = mutate(eval, pve=V1/sum(V1))
eval
## V1 PC pve
## 1: 2.94900 1 0.22238746
## 2: 1.55791 2 0.11748377
## 3: 1.32654 3 0.10003590
## 4: 1.11296 4 0.08392958
## 5: 1.08684 5 0.08195985
## 6: 1.06342 6 0.08019372
## 7: 1.05677 7 0.07969223
## 8: 1.04368 8 0.07870510
## 9: 1.03567 9 0.07810106
## 10: 1.02785 10 0.07751134
22.2% of the variation explained by PC1. 60.28% is explained by first 5 PC.
#add your R code in code chunks like this one
ggplot(eval, aes(x=PC, y=pve, group=1)) + geom_point() + geom_line() + ylab("variance explained")
Figure 4. Plot represents PC value v. variance explained. The plot above shows what proportion of variance is explained by each PC.
I would use the merged PCs because they have more variance explained. First 5 PCs of merged PC explained 93.3% of the variance. Whereas the GWAS-only first 5 PC explained only around 60%.