library(scatterplot3d)
# read in the vcf2eigen.pca.evec file
df <- read.table("vcf2eigen.pca.evec")
# give column names
names(df) <- c("SampleID", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10", "POPULATION")
# plot
with(df, scatterplot3d(PC1, PC2, PC3, color = as.numeric(POPULATION), pch=19, main="PCA 1000 Genomes Exome Data (3D)"))
# add legend
legend("topleft", pch=19, col=c("black", "green", "red", "blue"), legend=c("AFR", "EUR", "SAS", "EAS"))
The figure shows principle component analysis of the different super populations of the 1000 Genomes Project data. PC1 is the African eigenvector, which separates the Africans (AFR) from the Europeans (EUR), South Asians (SAS) and East Asians (EAS). PC2 is the Asian eigenvector, which separates the Europeans, East Asians and South Asians. PC3 separates the East and West Africans (Nigerian YRI and Kenyan LWK, respectively).
The first few lines of the vcf2eigen.pca.evec file looks like this:
head(df)
## SampleID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 1 SRR098359 0.0336 0.0497 -0.0003 -0.0004 0.0029 0.0001 -0.0013 0.0023
## 2 SRR098372 0.0317 0.0498 0.0013 -0.0003 -0.0005 0.0003 0.0002 -0.0007
## 3 SRR098401 0.0332 0.0488 0.0004 -0.0014 0.0008 -0.0015 -0.0010 0.0025
## 4 SRR098417 0.0333 0.0474 0.0025 0.0006 0.0016 -0.0002 0.0002 -0.0022
## 5 SRR098434 0.0321 0.0468 -0.0011 0.0018 0.0002 0.0003 0.0011 0.0003
## 6 SRR098436 0.0324 0.0438 -0.0007 -0.0015 0.0013 -0.0005 0.0004 -0.0008
## PC9 PC10 POPULATION
## 1 -0.0009 0.0014 EUR
## 2 -0.0002 0.0020 EUR
## 3 0.0011 0.0009 EUR
## 4 0.0007 -0.0012 EUR
## 5 0.0019 -0.0003 EUR
## 6 0.0000 -0.0006 EUR
To make the plot with different symbols and colours using the sub-population level codes (e.g. replace EUR with CEU, GBR, TSI, etc.), first I need to add a column to the dataframe with the sub-population-level codes.
# read in subpopulation data (this has each sampleID with subpopulation code)
subpop <- read.table("subpop.txt", header=TRUE)
# have a look at the first few lines
head(subpop)
## SampleID SUBPOP
## 1 SRR701477 LWK
## 2 SRR701484 YRI
## 3 SRR701485 LWK
## 4 SRR701486 LWK
## 5 SRR701487 LWK
## 6 SRR701488 LWK
# merge with the vcf2eigen.pca.evec data frame, using the common SampleID column
df2 <- merge(df, subpop)
head(df2)
## SampleID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 1 SRR098359 0.0336 0.0497 -0.0003 -0.0004 0.0029 0.0001 -0.0013 0.0023
## 2 SRR098372 0.0317 0.0498 0.0013 -0.0003 -0.0005 0.0003 0.0002 -0.0007
## 3 SRR098401 0.0332 0.0488 0.0004 -0.0014 0.0008 -0.0015 -0.0010 0.0025
## 4 SRR098417 0.0333 0.0474 0.0025 0.0006 0.0016 -0.0002 0.0002 -0.0022
## 5 SRR098434 0.0321 0.0468 -0.0011 0.0018 0.0002 0.0003 0.0011 0.0003
## 6 SRR098436 0.0324 0.0438 -0.0007 -0.0015 0.0013 -0.0005 0.0004 -0.0008
## PC9 PC10 POPULATION SUBPOP
## 1 -0.0009 0.0014 EUR CEU
## 2 -0.0002 0.0020 EUR CEU
## 3 0.0011 0.0009 EUR CEU
## 4 0.0007 -0.0012 EUR CEU
## 5 0.0019 -0.0003 EUR GBR
## 6 0.0000 -0.0006 EUR GBR
Now make the 3D plot with the 14 different sub-populations. Firstly, make a vector of the 14 subpopulation names to save typing them out. Add a bit of space to the y axis to accommodate the legend in the scatterplot3d call with the y.margin.add option.
# names of sub-populations
popIDs <- unique(df2$SUBPOP)
# make plot
with(df2, scatterplot3d(PC1, PC2, PC3, color = as.numeric(SUBPOP), pch=as.numeric(SUBPOP), main="PCA 1000 Genomes Exome Data (3D)", y.margin.add=1))
# add legend using x, y coordinates to position legend
legend(3.5, 4, pch=as.numeric(popIDs), col=as.numeric(popIDs), legend=popIDs)
The figure shows principle component analysis of the different super populations of the 1000 Genomes Project exome data, this time split up into different sub populations. PC1 is the African eigenvector, which separates the Africans (LWK, YRI, ACB, ASW) from the Europeans (CEU, GBR, TSI, IBS), South Asians (GIH) and East Asians (JPT, KHV, CHB, CHS). PC2 is the Asian eigenvector, which separates the Europeans, East Asians and South Asians. PC3 separates the East Africans (YRI, ACB, ASW) and West Africans (LWK).