3D PCA Plot of 1000 Genomes Exomes (EIGENSOFT)

I have downloaded and analysed ~300 exomes from different ethnic groups of the 1000 Genomes Project.
I used the genotypes in a .vcf file to run Eigensoft, which produces a .evec file (here called vcf2eigen.pca.evec) containing the first 10 principle components. This document shows how to plot the first 3 prinicple components in a 3D plot using the R package scatterplot3d.

library(scatterplot3d)

# read in the vcf2eigen.pca.evec file
df <- read.table("vcf2eigen.pca.evec")

# give column names
names(df) <- c("SampleID", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10", "POPULATION")

# plot
with(df, scatterplot3d(PC1, PC2, PC3, color = as.numeric(POPULATION), pch=19, main="PCA 1000 Genomes Exome Data (3D)")) 

# add legend
legend("topleft", pch=19, col=c("black", "green", "red", "blue"), legend=c("AFR", "EUR", "SAS", "EAS"))

plot of chunk unnamed-chunk-1

The figure shows principle component analysis of the different super populations of the 1000 Genomes Project data. PC1 is the African eigenvector, which separates the Africans (AFR) from the Europeans (EUR), South Asians (SAS) and East Asians (EAS). PC2 is the Asian eigenvector, which separates the Europeans, East Asians and South Asians. PC3 separates the East and West Africans (Nigerian YRI and Kenyan LWK, respectively).

The first few lines of the vcf2eigen.pca.evec file looks like this:

head(df)

##    SampleID    PC1    PC2     PC3     PC4     PC5     PC6     PC7     PC8
## 1 SRR098359 0.0336 0.0497 -0.0003 -0.0004  0.0029  0.0001 -0.0013  0.0023
## 2 SRR098372 0.0317 0.0498  0.0013 -0.0003 -0.0005  0.0003  0.0002 -0.0007
## 3 SRR098401 0.0332 0.0488  0.0004 -0.0014  0.0008 -0.0015 -0.0010  0.0025
## 4 SRR098417 0.0333 0.0474  0.0025  0.0006  0.0016 -0.0002  0.0002 -0.0022
## 5 SRR098434 0.0321 0.0468 -0.0011  0.0018  0.0002  0.0003  0.0011  0.0003
## 6 SRR098436 0.0324 0.0438 -0.0007 -0.0015  0.0013 -0.0005  0.0004 -0.0008
##       PC9    PC10 POPULATION
## 1 -0.0009  0.0014        EUR
## 2 -0.0002  0.0020        EUR
## 3  0.0011  0.0009        EUR
## 4  0.0007 -0.0012        EUR
## 5  0.0019 -0.0003        EUR
## 6  0.0000 -0.0006        EUR

To make the plot with different symbols and colours using the sub-population level codes (e.g. replace EUR with CEU, GBR, TSI, etc.), first I need to add a column to the dataframe with the sub-population-level codes.

# read in subpopulation data (this has each sampleID with subpopulation code)
subpop <- read.table("subpop.txt", header=TRUE)

# have a look at the first few lines
head(subpop)

##    SampleID SUBPOP
## 1 SRR701477    LWK
## 2 SRR701484    YRI
## 3 SRR701485    LWK
## 4 SRR701486    LWK
## 5 SRR701487    LWK
## 6 SRR701488    LWK

# merge with the vcf2eigen.pca.evec data frame, using the common SampleID column
df2 <- merge(df, subpop)
head(df2)

##    SampleID    PC1    PC2     PC3     PC4     PC5     PC6     PC7     PC8
## 1 SRR098359 0.0336 0.0497 -0.0003 -0.0004  0.0029  0.0001 -0.0013  0.0023
## 2 SRR098372 0.0317 0.0498  0.0013 -0.0003 -0.0005  0.0003  0.0002 -0.0007
## 3 SRR098401 0.0332 0.0488  0.0004 -0.0014  0.0008 -0.0015 -0.0010  0.0025
## 4 SRR098417 0.0333 0.0474  0.0025  0.0006  0.0016 -0.0002  0.0002 -0.0022
## 5 SRR098434 0.0321 0.0468 -0.0011  0.0018  0.0002  0.0003  0.0011  0.0003
## 6 SRR098436 0.0324 0.0438 -0.0007 -0.0015  0.0013 -0.0005  0.0004 -0.0008
##       PC9    PC10 POPULATION SUBPOP
## 1 -0.0009  0.0014        EUR    CEU
## 2 -0.0002  0.0020        EUR    CEU
## 3  0.0011  0.0009        EUR    CEU
## 4  0.0007 -0.0012        EUR    CEU
## 5  0.0019 -0.0003        EUR    GBR
## 6  0.0000 -0.0006        EUR    GBR

Now make the 3D plot with the 14 different sub-populations. Firstly, make a vector of the 14 subpopulation names to save typing them out. Add a bit of space to the y axis to accommodate the legend in the scatterplot3d call with the y.margin.add option.

# names of sub-populations
popIDs <- unique(df2$SUBPOP)

# make plot
with(df2, scatterplot3d(PC1, PC2, PC3, color = as.numeric(SUBPOP), pch=as.numeric(SUBPOP), main="PCA 1000 Genomes Exome Data (3D)", y.margin.add=1)) 

# add legend using x, y coordinates to position legend
legend(3.5, 4, pch=as.numeric(popIDs), col=as.numeric(popIDs), legend=popIDs)

plot of chunk unnamed-chunk-4

The figure shows principle component analysis of the different super populations of the 1000 Genomes Project exome data, this time split up into different sub populations. PC1 is the African eigenvector, which separates the Africans (LWK, YRI, ACB, ASW) from the Europeans (CEU, GBR, TSI, IBS), South Asians (GIH) and East Asians (JPT, KHV, CHB, CHS). PC2 is the Asian eigenvector, which separates the Europeans, East Asians and South Asians. PC3 separates the East Africans (YRI, ACB, ASW) and West Africans (LWK).