I will calculate group-level Polygenic Scores (PGS) for 52 populations and 7 super-populations for SNPs that were found to explain variation in educational attainment by the largest GWAS (N= 1.1 million) to date (Lee et al., in press) and in height (Wood et al., 2014).
(Skip this part you’re not running the code but just reading this) ** Download files from: https://osf.io/ewxqj/ HGDP frequencies file: “Lee_results_2.csv”. Frequencies of GWAS hits from 52 populations pre-computed using HGDP-CEPH browser (http://spsmart.cesga.es/ceph.php?dataSet=ceph_stanford). GWAS summary file: “Lee_etalGWAS2018.csv”. HGDP frequencies for EDU by continent file: “Lee_results_continent.csv”" HGDP frequencies for height: “Woodetal_results.csv”" Height GWAS summary file: “ng.3097-S2.csv”" HGDP frequencies for height by continent: “Woodetal_results_continent.csv”"
You will be requested to upload them (in the same order) as the code runs. Note that GWAS files were not LD clumped because this would result in too small number of SNPs being found in the HGDP dataset. However, LD clumping should be performed on the output HGDP files. For simplicity, this step is omitted but will be done in a follow-up version. **
HGDP_CEPH=read.csv(file.choose(), header=TRUE, sep = ";")#open HGDP-CEPH browser output with freqs from GWAS hits
gwas=read.csv(file.choose(), header=TRUE)#open GWAS summary file
gwas=gwas[which(gwas$Pval<5*10^-8),]#select only significant SNPs
Merge two datasets
HGDP_CEPH_merged=merge(HGDP_CEPH,gwas, by.x = "SNP", by.y = "MarkerName")
Split into 8 groups according to levels of vector “var”: “AC” ,“AG”, “CA”, “CT”, “GA”, “GT” ,“TC”, “TG”
Note that some SNPs are on reverse strand so AC is TG on GWAS and viceversa, so this will cause a nonmatch. Select reverse-stranded SNPs and translate into same strand as HGDP-CEPH reference panel. Then, flip frequency of non-matching alleles (when A2 (A1) is (not) HGDP reference allele).
Create objects based on match or nonmatch between A1 and reference allele and flip frequency of nonmatch alleles
Calculate weighted frequency by SNP by group.
Re-merge groups
compute PGS by population
PGS<-with(df_remerged, tapply(GVS, population, mean))
PGS
## Algeria (Mzab) - Mozabite Bougainville - NAN Melanesian
## -2.059055e-05 -3.567165e-04
## Brazil - Karitiana Brazil - Surui
## -3.317165e-04 -5.549449e-04
## C. African Republic - Biaka Pygmy Cambodia - Cambodian
## -1.962441e-04 -4.213228e-04
## China - Dai China - Daur
## 1.578740e-04 -1.492126e-05
## China - Han China - Hezhen
## 9.718110e-05 -2.491339e-05
## China - Lahu China - Miaozu
## -2.247165e-04 6.992913e-05
## China - Mongola China - Naxi
## 2.216535e-04 1.038504e-04
## China - Oroqen China - She
## 2.168268e-04 -6.139370e-05
## China - Tu China - Tujia
## -1.900315e-04 1.401575e-04
## China - Uygur China - Xibo
## -1.086614e-04 1.097165e-04
## China - Yizu Colombia - Piapoco and Curripaco
## 3.661417e-05 -1.979449e-04
## D. R. of Congo - Mbuti Pygmy France - Basque
## 4.722047e-05 1.481732e-04
## France - French Israel (Carmel) - Druze
## 6.676378e-05 -2.445276e-04
## Israel (Central) - Palestinian Israel (Negev) - Bedouin
## -1.733386e-04 -2.821890e-04
## Italy - from Bergamo Italy - Sardinian
## -8.703937e-05 -2.899606e-04
## Italy - Tuscan Japan - Japanese
## -4.414173e-05 -7.813386e-05
## Kenya - Bantu Mexico - Maya
## -2.382992e-04 -3.859055e-04
## Mexico - Pima Namibia - San
## -3.126850e-04 -4.937008e-04
## New Guinea - Papuan Nigeria - Yoruba
## -7.946378e-04 -3.792441e-04
## Orkney Islands - Orcadian Pakistan - Balochi
## 9.040945e-05 -5.339370e-05
## Pakistan - Brahui Pakistan - Burusho
## -1.697874e-04 -3.267717e-06
## Pakistan - Hazara Pakistan - Kalash
## 1.440945e-05 -3.177953e-05
## Pakistan - Makrani Pakistan - Pathan
## -1.910236e-04 -2.441417e-04
## Pakistan - Sindhi Population Set 1
## -1.127795e-04 -1.281969e-04
## Russia - Russian Russia (Caucasus) - Adygei
## -5.775591e-05 -1.708819e-04
## Senegal - Mandenka Siberia - Yakut
## -2.953858e-04 2.992126e-05
## South Africa - Bantu
## -2.875354e-04
Sort in PGS in descending order
sort(PGS,decreasing = TRUE)
## China - Mongola China - Oroqen
## 2.216535e-04 2.168268e-04
## China - Dai France - Basque
## 1.578740e-04 1.481732e-04
## China - Tujia China - Xibo
## 1.401575e-04 1.097165e-04
## China - Naxi China - Han
## 1.038504e-04 9.718110e-05
## Orkney Islands - Orcadian China - Miaozu
## 9.040945e-05 6.992913e-05
## France - French D. R. of Congo - Mbuti Pygmy
## 6.676378e-05 4.722047e-05
## China - Yizu Siberia - Yakut
## 3.661417e-05 2.992126e-05
## Pakistan - Hazara Pakistan - Burusho
## 1.440945e-05 -3.267717e-06
## China - Daur Algeria (Mzab) - Mozabite
## -1.492126e-05 -2.059055e-05
## China - Hezhen Pakistan - Kalash
## -2.491339e-05 -3.177953e-05
## Italy - Tuscan Pakistan - Balochi
## -4.414173e-05 -5.339370e-05
## Russia - Russian China - She
## -5.775591e-05 -6.139370e-05
## Japan - Japanese Italy - from Bergamo
## -7.813386e-05 -8.703937e-05
## China - Uygur Pakistan - Sindhi
## -1.086614e-04 -1.127795e-04
## Population Set 1 Pakistan - Brahui
## -1.281969e-04 -1.697874e-04
## Russia (Caucasus) - Adygei Israel (Central) - Palestinian
## -1.708819e-04 -1.733386e-04
## China - Tu Pakistan - Makrani
## -1.900315e-04 -1.910236e-04
## C. African Republic - Biaka Pygmy Colombia - Piapoco and Curripaco
## -1.962441e-04 -1.979449e-04
## China - Lahu Kenya - Bantu
## -2.247165e-04 -2.382992e-04
## Pakistan - Pathan Israel (Carmel) - Druze
## -2.441417e-04 -2.445276e-04
## Israel (Negev) - Bedouin South Africa - Bantu
## -2.821890e-04 -2.875354e-04
## Italy - Sardinian Senegal - Mandenka
## -2.899606e-04 -2.953858e-04
## Mexico - Pima Brazil - Karitiana
## -3.126850e-04 -3.317165e-04
## Bougainville - NAN Melanesian Nigeria - Yoruba
## -3.567165e-04 -3.792441e-04
## Mexico - Maya Cambodia - Cambodian
## -3.859055e-04 -4.213228e-04
## Namibia - San Brazil - Surui
## -4.937008e-04 -5.549449e-04
## New Guinea - Papuan
## -7.946378e-04
Calculate Z-score
scale(PGS, center = TRUE, scale = TRUE)
## [,1]
## Algeria (Mzab) - Mozabite 0.511838662
## Bougainville - NAN Melanesian -1.112980227
## Brazil - Karitiana -0.992131278
## Brazil - Surui -2.071207728
## C. African Republic - Biaka Pygmy -0.337263188
## Cambodia - Cambodian -1.425284364
## China - Dai 1.374528882
## China - Daur 0.539243778
## China - Han 1.081141888
## China - Hezhen 0.490942261
## China - Lahu -0.474897772
## China - Miaozu 0.949407017
## China - Mongola 1.682836440
## China - Naxi 1.113380963
## China - Oroqen 1.659504028
## China - She 0.314597951
## China - Tu -0.307231748
## China - Tujia 1.288887894
## China - Uygur 0.086107794
## China - Xibo 1.141737645
## China - Yizu 0.788363897
## Colombia - Piapoco and Curripaco -0.345484723
## D. R. of Congo - Mbuti Pygmy 0.839634302
## France - Basque 1.327635684
## France - French 0.934105828
## Israel (Carmel) - Druze -0.570663428
## Israel (Central) - Palestinian -0.226538906
## Israel (Negev) - Bedouin -0.752717137
## Italy - from Bergamo 0.190627863
## Italy - Sardinian -0.790284984
## Italy - Tuscan 0.397993242
## Japan - Japanese 0.233676733
## Kenya - Bantu -0.540555863
## Mexico - Maya -1.254078513
## Mexico - Pima -0.900133825
## Namibia - San -1.775156348
## New Guinea - Papuan -3.229873203
## Nigeria - Yoruba -1.221877502
## Orkney Islands - Orcadian 1.048408000
## Pakistan - Balochi 0.353269615
## Pakistan - Brahui -0.209372646
## Pakistan - Burusho 0.595576517
## Pakistan - Hazara 0.681027192
## Pakistan - Kalash 0.457751620
## Pakistan - Makrani -0.312027644
## Pakistan - Pathan -0.568798358
## Pakistan - Sindhi 0.066201022
## Population Set 1 -0.008325669
## Russia - Russian 0.332182900
## Russia (Caucasus) - Adygei -0.214663356
## Senegal - Mandenka -0.816510158
## Siberia - Yakut 0.756010635
## South Africa - Bantu -0.778561684
## attr(,"scaled:center")
## [1] -0.0001264745
## attr(,"scaled:scale")
## [1] 0.0002068698
Carry out the same job at the sub-continent level. This should make results more stable because populations have very low sample sizes but sub-continental clusters rely on larger samples. Upload “Lee_results_continent.csv”
compute PGS by population
PGS_continent<-with(df_remerged, tapply(GVS, population, mean))
PGS_continent
## AFRICA AMERICA CENTRAL-SOUTH ASIA
## -2.510866e-04 -3.590079e-04 -1.000551e-04
## EAST ASIA EUROPE MIDDLE EAST
## 1.702362e-05 -4.521260e-05 -1.950079e-04
## OCEANIA Population Set 1
## -6.217480e-04 -1.281969e-04
Sort in PGS in descending order
sort(PGS_continent,decreasing = TRUE)
## EAST ASIA EUROPE CENTRAL-SOUTH ASIA
## 1.702362e-05 -4.521260e-05 -1.000551e-04
## Population Set 1 MIDDLE EAST AFRICA
## -1.281969e-04 -1.950079e-04 -2.510866e-04
## AMERICA OCEANIA
## -3.590079e-04 -6.217480e-04
Calculate Z-score
scale(PGS_continent, center = TRUE, scale = TRUE)
## [,1]
## AFRICA -0.19945206
## AMERICA -0.72864722
## CENTRAL-SOUTH ASIA 0.54113545
## EAST ASIA 1.11523459
## EUROPE 0.81005740
## MIDDLE EAST 0.07553174
## OCEANIA -2.01700152
## Population Set 1 0.40314161
## attr(,"scaled:center")
## [1] -0.0002104114
## attr(,"scaled:scale")
## [1] 0.0002039347
Let’s compare the results using the largest height GWAS (Wood et al., 2014).
HGDP_CEPH=read.csv(file.choose(), header=TRUE, sep = ";")#open HGDP-CEPH browser output with freqs from GWAS hits
gwas=read.csv(file.choose(), header=TRUE, sep = ";")#open GWAS summary file
HGDP_CEPH_merged=merge(HGDP_CEPH,gwas, by.x = "SNP", by.y = "SNP")
df_AC=subset(HGDP_CEPH_merged, var=="AC")
df_AG=subset(HGDP_CEPH_merged, var=="AG")
df_CA=subset(HGDP_CEPH_merged, var=="CA")
df_CT=subset(HGDP_CEPH_merged, var=="CT")
df_GA=subset(HGDP_CEPH_merged, var=="GA")
df_GT=subset(HGDP_CEPH_merged, var=="GT")
df_TC=subset(HGDP_CEPH_merged, var=="TC")
df_TG=subset(HGDP_CEPH_merged, var=="TG")
#AC
df_AC_rev_strand_matching=subset(df_AC,A1=="T")
df_AC_rev_strand_nonmatching=subset(df_AC,A1=="G")
#replace "T" with "A" and "C" with "G"
df_AC_rev_strand_matching$A1[df_AC_rev_strand_matching$A1=="T"] <- "A"
df_AC_rev_strand_matching$A2[df_AC_rev_strand_matching$A2=="G"] <- "C"
df_AC_rev_strand_nonmatching$A1[df_AC_rev_strand_nonmatching$A1=="G"] <- "C"
df_AC_rev_strand_nonmatching$A2[df_AC_rev_strand_nonmatching$A2=="T"] <- "A"
df_AC_rev_strand_nonmatching$freq_A=df_AC_rev_strand_nonmatching$freq_C#flip freq
#AG
df_AG_rev_strand_matching=subset(df_AG,A1=="T")
df_AG_rev_strand_nonmatching=subset(df_AG,A1=="C")
#replace "T" with "A" and "C" with "G"
df_AG_rev_strand_matching$A1[df_AG_rev_strand_matching$A1=="T"] <- "A"
df_AG_rev_strand_matching$A2[df_AG_rev_strand_matching$A2=="C"] <- "G"
df_AG_rev_strand_nonmatching$A1[df_AG_rev_strand_nonmatching$A1=="C"] <- "G"
df_AG_rev_strand_nonmatching$A2[df_AG_rev_strand_nonmatching$A2=="T"] <- "A"
df_AG_rev_strand_nonmatching$freq_A=df_AG_rev_strand_nonmatching$freq_G#flip freq
#CA
df_CA_rev_strand_matching=subset(df_CA,A1=="G")
df_CA_rev_strand_nonmatching=subset(df_CA,A1=="T")
#replace "T" with "A" and "C" with "G"
df_CA_rev_strand_matching$A1[df_CA_rev_strand_matching$A1=="G"] <- "C"
df_CA_rev_strand_matching$A2[df_CA_rev_strand_matching$A2=="T"] <- "A"
df_CA_rev_strand_nonmatching$A1[df_CA_rev_strand_nonmatching$A1=="T"] <- "A"
df_CA_rev_strand_nonmatching$A2[df_CA_rev_strand_nonmatching$A2=="G"] <- "C"
df_CA_rev_strand_nonmatching$freq_C=df_CA_rev_strand_nonmatching$freq_A#flip freq
#CT
df_CT_rev_strand_matching=subset(df_CT,A1=="G")
df_CT_rev_strand_nonmatching=subset(df_CT,A1=="A")
#replace "T" with "A" and "C" with "G"
df_CT_rev_strand_matching$A1[df_CT_rev_strand_matching$A1=="G"] <- "C"
df_CT_rev_strand_matching$A2[df_CT_rev_strand_matching$A2=="A"] <- "T"
df_CT_rev_strand_nonmatching$A1[df_CT_rev_strand_nonmatching$A1=="A"] <- "T"
df_CT_rev_strand_nonmatching$A2[df_CT_rev_strand_nonmatching$A2=="G"] <- "C"
df_CT_rev_strand_nonmatching$freq_C=df_CT_rev_strand_nonmatching$freq_T#flip freq
#GA
df_GA_rev_strand_matching=subset(df_GA,A1=="C")
df_GA_rev_strand_nonmatching=subset(df_GA,A1=="T")
#replace "T" with "A" and "C" with "G"
df_GA_rev_strand_matching$A1[df_GA_rev_strand_matching$A1=="C"] <- "G"
df_GA_rev_strand_matching$A2[df_GA_rev_strand_matching$A2=="T"] <- "A"
df_GA_rev_strand_nonmatching$A1[df_GA_rev_strand_nonmatching$A1=="T"] <- "A"
df_GA_rev_strand_nonmatching$A2[df_GA_rev_strand_nonmatching$A2=="C"] <- "G"
df_GA_rev_strand_nonmatching$freq_G=df_GA_rev_strand_nonmatching$freq_A#flip freq
#GT
df_GT_rev_strand_matching=subset(df_GT,A1=="C")
df_GT_rev_strand_nonmatching=subset(df_GT,A1=="A")
#replace "T" with "A" and "C" with "G"and viceversa
df_GT_rev_strand_matching$A1[df_GT_rev_strand_matching$A1=="C"] <- "G"
df_GT_rev_strand_matching$A2[df_GT_rev_strand_matching$A2=="A"] <- "T"
df_GT_rev_strand_nonmatching$A1[df_GT_rev_strand_nonmatching$A1=="A"] <- "T"
df_GT_rev_strand_nonmatching$A2[df_GT_rev_strand_nonmatching$A2=="C"] <- "G"
df_GT_rev_strand_nonmatching$freq_G=df_GT_rev_strand_nonmatching$freq_T#flip freq
#TC
df_TC_rev_strand_matching=subset(df_TC,A1=="A")
df_TC_rev_strand_nonmatching=subset(df_TC,A1=="G")
#replace "T" with "A" and "C" with "G"and viceversa
df_TC_rev_strand_matching$A1[df_TC_rev_strand_matching$A1=="A"] <- "T"
df_TC_rev_strand_matching$A2[df_TC_rev_strand_matching$A2=="G"] <- "C"
df_TC_rev_strand_nonmatching$A1[df_TC_rev_strand_nonmatching$A1=="G"] <- "C"
df_TC_rev_strand_nonmatching$A2[df_TC_rev_strand_nonmatching$A2=="A"] <- "T"
df_TC_rev_strand_nonmatching$freq_T=df_TC_rev_strand_nonmatching$freq_C#flip freq
#TG
df_TG_rev_strand_matching=subset(df_TG,A1=="A")
df_TG_rev_strand_nonmatching=subset(df_TG,A1=="C")
#replace "T" with "A" and "C" with "G"and viceversa
df_TG_rev_strand_matching$A1[df_TG_rev_strand_matching$A1=="A"] <- "T"
df_TG_rev_strand_matching$A2[df_TG_rev_strand_matching$A2=="C"] <- "G"
df_TG_rev_strand_nonmatching$A1[df_TG_rev_strand_nonmatching$A1=="C"] <- "G"
df_TG_rev_strand_nonmatching$A2[df_TG_rev_strand_nonmatching$A2=="A"] <- "T"
df_TG_rev_strand_nonmatching$freq_T=df_TG_rev_strand_nonmatching$freq_G#flip freq
#AC
df_AC_match=df_AC[which(as.character(df_AC$reference)==as.character(df_AC$A1)),]
df_AC_nonmatch=df_AC[which(as.character(df_AC$reference)==as.character(df_AC$A2)),]
df_AC_nonmatch$freq_A=df_AC_nonmatch$freq_C
#re-merge objects
df_AC=rbind(df_AC_nonmatch,df_AC_match, df_AC_rev_strand_matching,df_AC_rev_strand_nonmatching)
#AG
df_AG_match=df_AG[which(as.character(df_AG$reference)==as.character(df_AG$A1)),]
df_AG_nonmatch=df_AG[which(as.character(df_AG$reference)==as.character(df_AG$A2)),]
df_AG_nonmatch$freq_A=df_AG_nonmatch$freq_G
#re-merge objects
df_AG=rbind(df_AG_nonmatch,df_AG_match,df_AG_rev_strand_matching,df_AG_rev_strand_nonmatching )
#CA
df_CA_match=df_CA[which(as.character(df_CA$reference)==as.character(df_CA$A1)),]
df_CA_nonmatch=df_CA[which(as.character(df_CA$reference)==as.character(df_CA$A2)),]
df_CA_nonmatch$freq_C=df_CA_nonmatch$freq_A
#re-merge objects
df_CA=rbind(df_CA_nonmatch,df_CA_match,df_CA_rev_strand_matching,df_CA_rev_strand_nonmatching)
#CT
df_CT_match=df_CT[which(as.character(df_CT$reference)==as.character(df_CT$A1)),]
df_CT_nonmatch=df_CT[which(as.character(df_CT$reference)==as.character(df_CT$A2)),]
df_CT_nonmatch$freq_C=df_CT_nonmatch$freq_T
#re-merge objects
df_CT=rbind(df_CT_nonmatch,df_CT_match,df_CT_rev_strand_matching,df_CT_rev_strand_nonmatching)
#GA
df_GA_match=df_GA[which(as.character(df_GA$reference)==as.character(df_GA$A1)),]
df_GA_nonmatch=df_GA[which(as.character(df_GA$reference)==as.character(df_GA$A2)),]
df_GA_nonmatch$freq_G=df_GA_nonmatch$freq_A
#re-merge objects
df_GA=rbind(df_GA_nonmatch,df_GA_match,df_GA_rev_strand_matching,df_GA_rev_strand_nonmatching)
df_GT_match=df_GT[which(as.character(df_GT$reference)==as.character(df_GT$A1)),]
df_GT_nonmatch=df_GT[which(as.character(df_GT$reference)==as.character(df_GT$A2)),]
df_GT_nonmatch$freq_G=df_GT_nonmatch$freq_T
#re-merge objects
df_GT=rbind(df_GT_nonmatch,df_GT_match,df_GT_rev_strand_matching,df_GT_rev_strand_nonmatching)
#TC
df_TC_match=df_TC[which(as.character(df_TC$reference)==as.character(df_TC$A1)),]
df_TC_nonmatch=df_TC[which(as.character(df_TC$reference)==as.character(df_TC$A2)),]
df_TC_nonmatch$freq_T=df_TC_nonmatch$freq_C
#re-merge objects
df_TC=rbind(df_TC_nonmatch,df_TC_match,df_TC_rev_strand_matching,df_TC_rev_strand_nonmatching)
#TG
df_TG_match=df_TG[which(as.character(df_TG$reference)==as.character(df_TG$A1)),]
df_TG_nonmatch=df_TG[which(as.character(df_TG$reference)==as.character(df_TG$A2)),]
df_TG_nonmatch$freq_T=df_TG_nonmatch$freq_G
#re-merge objects
df_TG=rbind(df_TG_nonmatch,df_TG_match,df_TG_rev_strand_matching,df_TG_rev_strand_nonmatching)
df_AC$GVS=df_AC$Beta*df_AC$freq_A
df_AG$GVS=df_AG$Beta*df_AG$freq_A
df_CA$GVS=df_CA$Beta*df_CA$freq_C
df_CT$GVS=df_CT$Beta*df_CT$freq_C
df_GA$GVS=df_GA$Beta*df_GA$freq_G
df_GT$GVS=df_GT$Beta*df_GT$freq_G
df_TC$GVS=df_TC$Beta*df_TC$freq_T
df_TG$GVS=df_TG$Beta*df_TG$freq_T
df_remerged=rbind(df_AC,df_AG,df_CA,df_CT,df_GA,df_GT,df_TC,df_TG)
compute PGS by population
PGS_height<-with(df_remerged, tapply(GVS, population, mean))
PGS_height
## Algeria (Mzab) - Mozabite Bougainville - NAN Melanesian
## 5.658260e-04 8.865022e-05
## Brazil - Karitiana Brazil - Surui
## 8.304815e-04 7.197035e-04
## C. African Republic - Biaka Pygmy Cambodia - Cambodian
## -2.471546e-04 1.691982e-04
## China - Dai China - Daur
## 1.870784e-04 -2.276167e-04
## China - Han China - Hezhen
## -2.139295e-04 -1.527322e-04
## China - Lahu China - Miaozu
## -9.264581e-05 -3.045727e-05
## China - Mongola China - Naxi
## 2.100238e-04 4.921185e-04
## China - Oroqen China - She
## 2.882057e-04 2.034705e-04
## China - Tu China - Tujia
## 5.267974e-05 2.288458e-05
## China - Uygur China - Xibo
## 4.501542e-04 1.105154e-05
## China - Yizu Colombia - Piapoco and Curripaco
## 1.066300e-04 5.386335e-04
## D. R. of Congo - Mbuti Pygmy France - Basque
## -2.331304e-04 1.158374e-03
## France - French Israel (Carmel) - Druze
## 1.221944e-03 6.357106e-04
## Israel (Central) - Palestinian Israel (Negev) - Bedouin
## 7.785256e-04 3.776189e-04
## Italy - from Bergamo Italy - Sardinian
## 9.144101e-04 6.108048e-04
## Italy - Tuscan Japan - Japanese
## 7.082286e-04 -1.293392e-04
## Kenya - Bantu Mexico - Maya
## -1.779181e-04 7.738093e-04
## Mexico - Pima Namibia - San
## 6.677194e-04 -5.796916e-04
## New Guinea - Papuan Nigeria - Yoruba
## 2.257982e-04 -3.220859e-04
## Orkney Islands - Orcadian Pakistan - Balochi
## 1.388032e-03 6.966396e-04
## Pakistan - Brahui Pakistan - Burusho
## 6.089916e-04 5.401361e-04
## Pakistan - Hazara Pakistan - Kalash
## 5.301066e-04 3.747515e-04
## Pakistan - Makrani Pakistan - Pathan
## 4.680559e-04 5.521789e-04
## Pakistan - Sindhi Population Set 1
## 4.809907e-04 4.253793e-04
## Russia - Russian Russia (Caucasus) - Adygei
## 1.219391e-03 1.099690e-03
## Senegal - Mandenka Siberia - Yakut
## -1.233216e-05 1.275590e-04
## South Africa - Bantu
## -2.494581e-05
Sort in PGS in descending order
sort(PGS_height,decreasing = TRUE)
## Orkney Islands - Orcadian France - French
## 1.388032e-03 1.221944e-03
## Russia - Russian France - Basque
## 1.219391e-03 1.158374e-03
## Russia (Caucasus) - Adygei Italy - from Bergamo
## 1.099690e-03 9.144101e-04
## Brazil - Karitiana Israel (Central) - Palestinian
## 8.304815e-04 7.785256e-04
## Mexico - Maya Brazil - Surui
## 7.738093e-04 7.197035e-04
## Italy - Tuscan Pakistan - Balochi
## 7.082286e-04 6.966396e-04
## Mexico - Pima Israel (Carmel) - Druze
## 6.677194e-04 6.357106e-04
## Italy - Sardinian Pakistan - Brahui
## 6.108048e-04 6.089916e-04
## Algeria (Mzab) - Mozabite Pakistan - Pathan
## 5.658260e-04 5.521789e-04
## Pakistan - Burusho Colombia - Piapoco and Curripaco
## 5.401361e-04 5.386335e-04
## Pakistan - Hazara China - Naxi
## 5.301066e-04 4.921185e-04
## Pakistan - Sindhi Pakistan - Makrani
## 4.809907e-04 4.680559e-04
## China - Uygur Population Set 1
## 4.501542e-04 4.253793e-04
## Israel (Negev) - Bedouin Pakistan - Kalash
## 3.776189e-04 3.747515e-04
## China - Oroqen New Guinea - Papuan
## 2.882057e-04 2.257982e-04
## China - Mongola China - She
## 2.100238e-04 2.034705e-04
## China - Dai Cambodia - Cambodian
## 1.870784e-04 1.691982e-04
## Siberia - Yakut China - Yizu
## 1.275590e-04 1.066300e-04
## Bougainville - NAN Melanesian China - Tu
## 8.865022e-05 5.267974e-05
## China - Tujia China - Xibo
## 2.288458e-05 1.105154e-05
## Senegal - Mandenka South Africa - Bantu
## -1.233216e-05 -2.494581e-05
## China - Miaozu China - Lahu
## -3.045727e-05 -9.264581e-05
## Japan - Japanese China - Hezhen
## -1.293392e-04 -1.527322e-04
## Kenya - Bantu China - Han
## -1.779181e-04 -2.139295e-04
## China - Daur D. R. of Congo - Mbuti Pygmy
## -2.276167e-04 -2.331304e-04
## C. African Republic - Biaka Pygmy Nigeria - Yoruba
## -2.471546e-04 -3.220859e-04
## Namibia - San
## -5.796916e-04
Calculate Z-score
scale(PGS_height, center = TRUE, scale = TRUE)
## [,1]
## Algeria (Mzab) - Mozabite 0.46202378
## Bougainville - NAN Melanesian -0.60887692
## Brazil - Karitiana 1.05597633
## Brazil - Surui 0.80736310
## C. African Republic - Biaka Pygmy -1.36250623
## Cambodia - Cambodian -0.42810721
## China - Dai -0.38797966
## China - Daur -1.31865837
## China - Han -1.28794085
## China - Hezhen -1.15059881
## China - Lahu -1.01575016
## China - Miaozu -0.87618365
## China - Mongola -0.33648455
## China - Naxi 0.29660590
## China - Oroqen -0.16102490
## China - She -0.35119179
## China - Tu -0.68960361
## China - Tujia -0.75647132
## China - Uygur 0.20242758
## China - Xibo -0.78302759
## China - Yizu -0.56852594
## Colombia - Piapoco and Curripaco 0.40099704
## D. R. of Congo - Mbuti Pygmy -1.33103238
## France - Basque 1.79184741
## France - French 1.93451533
## Israel (Carmel) - Druze 0.61886210
## Israel (Central) - Palestinian 0.93937431
## Israel (Negev) - Bedouin 0.03964051
## Italy - from Bergamo 1.24433299
## Italy - Sardinian 0.56296748
## Italy - Tuscan 0.78161060
## Japan - Japanese -1.09809923
## Kenya - Bantu -1.20712222
## Mexico - Maya 0.92878976
## Mexico - Pima 0.69069780
## Namibia - San -2.10880169
## New Guinea - Papuan -0.30108278
## Nigeria - Yoruba -1.53067059
## Orkney Islands - Orcadian 2.30725722
## Pakistan - Balochi 0.75560205
## Pakistan - Brahui 0.55889817
## Pakistan - Burusho 0.40436935
## Pakistan - Hazara 0.38186063
## Pakistan - Kalash 0.03320535
## Pakistan - Makrani 0.24260357
## Pakistan - Pathan 0.43139622
## Pakistan - Sindhi 0.27163247
## Population Set 1 0.14682658
## Russia - Russian 1.92878607
## Russia (Caucasus) - Adygei 1.66014763
## Senegal - Mandenka -0.83550641
## Siberia - Yakut -0.52155591
## South Africa - Bantu -0.86381458
## attr(,"scaled:center")
## [1] 0.0003599558
## attr(,"scaled:scale")
## [1] 0.0004455836
Work on files for height PGS by continent.
Calculate PGS by continent
PGS_height_continent<-with(df_remerged, tapply(GVS, population, mean))
PGS_height_continent
## AFRICA AMERICA CENTRAL-SOUTH ASIA
## -2.019476e-04 7.300819e-04 5.284300e-04
## EAST ASIA EUROPE MIDDLE EAST
## 7.809251e-06 1.055534e-03 5.900705e-04
## OCEANIA Population Set 1
## 1.699612e-04 4.253793e-04
sort(PGS_height_continent, decreasing = TRUE)
## EUROPE AMERICA MIDDLE EAST
## 1.055534e-03 7.300819e-04 5.900705e-04
## CENTRAL-SOUTH ASIA Population Set 1 OCEANIA
## 5.284300e-04 4.253793e-04 1.699612e-04
## EAST ASIA AFRICA
## 7.809251e-06 -2.019476e-04
#transform into Z scores
scale(PGS_height_continent, center = TRUE, scale = TRUE)
## [,1]
## AFRICA -1.51127588
## AMERICA 0.77863670
## CENTRAL-SOUTH ASIA 0.28319594
## EAST ASIA -0.99592224
## EUROPE 1.57824399
## MIDDLE EAST 0.43464117
## OCEANIA -0.59752943
## Population Set 1 0.03000975
## attr(,"scaled:center")
## [1] 0.0004131649
## attr(,"scaled:scale")
## [1] 0.0004070153
Calculate correlation between EDU and Height PGS:
cor(PGS_height, PGS)
## [1] 0.08979982
It is difficult to interpret single population scores because they rely on small (sometimes very small, eg. N<10 samples). The top scores for EDU are achieved by East Asians and the lowest scores by some African populations, Melanesians and Papuans (remarkably, populations that are both genetically and geographically very far apart). Conversely, Europeans have top scores for the Height PGS. East Asians and Africans have the lowest scores. The genotypic scores of Europeans (in particular Northern Europeans, such as Orcadians, French and Russians) and East Asians correspond to the well-known physical differences between these populations. The low score of Africans is difficult to explain, because Africans are not particularly short. However, the lowest scores are achieved by the Pygmies and San, which is in line with predictions. The null correlation between height and EDU suggests that selective pressures on these two traits are unrelated and that the particular rankings are not due to statistical artifacts or unknown biases that might produce higher scores for some populations (e.g. LD breakdown, sample size, etc. ).
The two major limitations of this approach are: 1) Very low coverage of the HGDP-CEPH database, which comprises only about 10% of all the SNPs present in 1000 Genomes; 2) Very small sample sizes,with some populations numbering less than 10 individuals (e.g. Italy - Tuscan= 8; San=5; Bantu=8; Colombia - Piapoco and Curripaco= 7; China - Lahu=8).
References:
Wood AR, Esko T, Yang J, et al.: Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet. 2014; 46(11): 1173–86.