Principal Component Analysis (PCA) is a multivariate dimensionality reduction technique that we will use for exploratory analysis and data visualization.

PCA applies a linear transformation that projects the data points onto a new set of orthogonal axes. The first of these axes is the best-fitting line through the data: it minimizes the mean squared (perpendicular) distance from the points to the line and, equivalently, maximizes the variance of the projected data. This allows us to reduce our data from a 34,364x127,445 matrix (127,445 variants for 34,364 individuals) to a 34,364x20 matrix (20 PCs) while preserving as much of the variance in the data as possible. See the image below, where the pink line is our new primary axis: projections onto it are as spread out as possible, while the perpendicular distances to it (along our "new y-axis") are as small as possible:

[Image: data points projected onto the pink best-fitting line, the new primary axis (Parth Dholakiya, 2023)]

The key pieces of information for us are the eigenvectors and eigenvalues of the resulting covariance matrix. The eigenvalues are scalar coefficients that represent the magnitude of variance captured by the corresponding eigenvectors. Each eigenvector represents a direction of variance in the original data and corresponds to a PC. The eigenvectors are orthogonal (and thus the PCs are linearly uncorrelated) and are ranked by the magnitude of their eigenvalues. This means that each consecutive PC explains the largest share of the variance left over from the preceding PCs (PC1 explains the most, PC2 the second most, etc.), and because the PCs are uncorrelated, the variance explained by each PC does not overlap with that of the others.

For us, larger eigenvalues = more variance explained by that PC.
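As a minimal, hedged illustration of this relationship (simulated data only, not part of our pipeline), the eigenvalues of the covariance matrix can be converted directly into proportions of variance explained:

# Toy illustration on simulated data (not part of our pipeline):
# eigenvalues of the covariance matrix = variance captured by each PC
set.seed(1)
toy <- matrix(rnorm(200 * 5), nrow = 200) %*% matrix(rnorm(25), nrow = 5)  # 200 x 5 correlated data
eig <- eigen(cov(toy))                    # eigenvectors + eigenvalues, sorted by eigenvalue
round(eig$values / sum(eig$values), 3)    # proportion of variance explained per PC

# The same quantities via prcomp(): sdev^2 are the eigenvalues
toy_pca <- prcomp(toy, center = TRUE)
round(toy_pca$sdev^2 / sum(toy_pca$sdev^2), 3)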

In reducing the dimensions of our data we inherently lose some information by representing the data with fewer features; however, because we retain the directions with the most variance, PCA acts more like compressing the data than deleting features. We want to reduce the dimensionality of our data so that we can interpret it, while preserving as much information as possible.

When it comes to GWAS, we can utilize the first 10 PCs as covariates in our regression models to capture population stratification. In doing this, we can be more confident that significant associations reflect a true SNP-phenotype relationship rather than spurious correlations arising from underlying population structure.
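As a hedged sketch of what this looks like (our actual association testing is run in PLINK rather than with glm(), and the object and column names below are hypothetical), the PCs simply enter the model as additional covariates:

# Hypothetical sketch: first 10 PCs as covariates in a per-SNP logistic regression.
# "PHENO", "SNP", "SEX", "AGE", and "merged_pheno_pcs" are placeholder names.
pc_terms <- paste(paste0("PC", 1:10), collapse = " + ")
model_formula <- as.formula(paste("PHENO ~ SNP + SEX + AGE +", pc_terms))
# fit <- glm(model_formula, data = merged_pheno_pcs, family = binomial())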

We performed PCA on pruned genotype data for 12,000+ self-reported white individuals from the Canadian Partnership for Tomorrow’s Health (CanPath) using PLINKv1.9b (https://www.cog-genomics.org/plink/1.9/). This generates tables of eigenvalues and eigenvectors, the former useful for describing the proportion of variance explained by each PC. Ancestry data from the 1000 Genomes Project (1KG) phase 3 was appended to the CanPath phenotype file and included in the PCA plots to identify whether the self-reported white individuals from the CanPath data would cluster with European individuals from 1KG. Individuals beyond a certain Euclidean distance from the centroid were considered outliers and removed before conducting another PCA on the pruned dataset. The first 10 PCs from the second PCA were then incorporated into the phenotype file to covary for population substructure in our genome-wide association.
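The exact invocation depends on our file prefixes, but the PLINK call was of roughly the following form (shown via system() for consistency with the R code in this report; the --bfile prefix below is a placeholder, and the command can equally be run in the shell):

# Illustrative only: PLINK v1.9 PCA on the merged, pruned fileset.
# "1KG_CanPath_merged_pruned" is a placeholder prefix; --pca 20 writes the
# .eigenvec and .eigenval files read in below.
system(paste("plink --bfile 1KG_CanPath_merged_pruned",
             "--pca 20",
             "--out 1KG_CanPath_merged_PCA_PCs"))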


Data Manipulation

Here we import and merge our PCA results and phenotype file for plotting:

# Packages used in this section (load if not already attached)
library(dplyr)    # mutate(), relocate(), %>%
library(ggplot2)  # PCA and Scree plots

# Import PCA output
PCA_RESULTS <- read.table("1KG_CanPath_merged_PCA_PCs.eigenvec", header = FALSE)
colnames(PCA_RESULTS) <- c("FID", "ID", paste("PC", 1:20, sep=""))
PCA_RESULTS$FID <- NULL

# Create new phenotype dataframe
PHENOTYPE <- read.csv("PCA_4MODELS.csv", header = TRUE)

# Merge PCA results with phenotype file
PCA_PHENOTYPE_MERGED <- merge(PHENOTYPE, PCA_RESULTS, by = "ID")
head(PCA_PHENOTYPE_MERGED[1:12])
##         ID ADM_STUDY_DATASET DATASET ANCESTRY_POP ANCESTRY_SUPERPOP M1 M2 M3 M4
## 1 11100004              CAG3 CANPATH                            EUR  2  2 NA NA
## 2 11100015              CAG3 CANPATH                            EUR  2  2 NA NA
## 3 11100016              CAG2 CANPATH                            EUR  1 NA  1 NA
## 4 11100021              CAG3 CANPATH                            EUR  1  1  1  1
## 5 11100022              CAG2 CANPATH                            EUR  1 NA  1 NA
## 6 11100030              CAG3 CANPATH                            EUR NA NA  2  2
##          PC1          PC2          PC3
## 1 0.00122914 -8.42838e-05 -0.000639720
## 2 0.00182406 -1.27101e-03 -0.005869590
## 3 0.00163399 -8.67315e-04 -0.000635513
## 4 0.00132193 -7.48584e-04  0.000951973
## 5 0.00152363 -1.00798e-03 -0.001161220
## 6 0.00159927 -1.32319e-03  0.000165992

Here we read in our table of eigenvalues and create data for our Scree plots:

# Reading in eigenvalues
eigenvalues <- read.table("1KG_CanPath_merged_PCA_PCs.eigenval")
colnames(eigenvalues) <- "evals"

# Calculate variance explained for all 20 PCs
eval_df <- data.frame(pc = sub("^", "PC", 1:nrow(eigenvalues)),
  evals = eigenvalues,
  var_exp = (eigenvalues$evals/sum(eigenvalues$evals))*100)
eval_df <- eval_df %>% mutate(cum_var = cumsum(var_exp), 
                   var_exp_rounded = round(var_exp, 2),
                   cum_var_rounded = round(cum_var, 2),
                   var_exp_percent = sub("$", "%", var_exp_rounded),
                   cum_var_percent = sub("$", "%", cum_var_rounded))

head(eval_df)
##    pc    evals   var_exp  cum_var var_exp_rounded cum_var_rounded
## 1 PC1 477.5590 47.326586 47.32659           47.33           47.33
## 2 PC2 228.3020 22.624962 69.95155           22.62           69.95
## 3 PC3  42.7572  4.237282 74.18883            4.24           74.19
## 4 PC4  38.6608  3.831325 78.02016            3.83           78.02
## 5 PC5  34.5447  3.423415 81.44357            3.42           81.44
## 6 PC6  27.8508  2.760043 84.20361            2.76           84.20
##   var_exp_percent cum_var_percent
## 1          47.33%          47.33%
## 2          22.62%          69.95%
## 3           4.24%          74.19%
## 4           3.83%          78.02%
## 5           3.42%          81.44%
## 6           2.76%           84.2%

We also need to calculate the mean and SD of the 1KG EUR individuals only, as well as create a subset of our data containing only the PC coordinates. This is important for removing outliers, as explained below.

# Means and SDs of each PC for the 1KG EUR individuals, plus a dataframe of only the PC coordinates (ID moved to the end)
PCA_MEANS <- sapply(PCA_PHENOTYPE_MERGED[PCA_PHENOTYPE_MERGED$ANCESTRY_SUPERPOP=="EUR" & PCA_PHENOTYPE_MERGED$DATASET=="KG",-(1:9)], mean)
PCA_SD <- sapply(PCA_PHENOTYPE_MERGED[PCA_PHENOTYPE_MERGED$ANCESTRY_SUPERPOP=="EUR" & PCA_PHENOTYPE_MERGED$DATASET=="KG",-(1:9)], sd)
PCA_POSITIONS <- PCA_PHENOTYPE_MERGED[,-(2:9)]
PCA_POSITIONS <- PCA_POSITIONS %>% relocate(ID, .after = PC20)

PCA Results

See below our PCA and Scree plots:

#PC2-PC1
ggplot(PCA_PHENOTYPE_MERGED) +
  geom_point(aes(x=PC1, y=PC2, color=ANCESTRY_SUPERPOP, shape=DATASET), alpha=0.5) +
  labs(title = "PC2 vs PC1 for CanPath+1KG", x=sub("@", eval_df$var_exp_percent[1], "PC1 (@)"), y=sub("@", eval_df$var_exp_percent[2], "PC2 (@)"), color = "ANCESTRY") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

Note that the white individuals from CanPath (blue circles) cluster with European individuals from 1KG (blue triangles), as expected.

#PC3-PC2
ggplot(PCA_PHENOTYPE_MERGED) +
  geom_point(aes(x=PC2, y=PC3, color=ANCESTRY_SUPERPOP, shape=DATASET), alpha=0.5) +
  labs(title = "PC3 vs PC2 for CanPath+1KG", x=sub("@", eval_df$var_exp_percent[2], "PC2 (@)"), y=sub("@", eval_df$var_exp_percent[3], "PC3 (@)"), color = "ANCESTRY") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

#PC4-PC3
ggplot(PCA_PHENOTYPE_MERGED) +
  geom_point(aes(x=PC3, y=PC4, color=ANCESTRY_SUPERPOP, shape=DATASET), alpha=0.5) +
  labs(title = "PC4 vs PC3 for CanPath+1KG", x=sub("@", eval_df$var_exp_percent[3], "PC3 (@)"), y=sub("@", eval_df$var_exp_percent[4], "PC4 (@)"), color = "ANCESTRY") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Scree Plot
ggplot(eval_df[1:10,]) +
  geom_col(aes(x = reorder(pc, -var_exp, sum), y = var_exp),
           colour = 'steelblue4', fill = 'steelblue4') +
  geom_line(aes(x = reorder(pc, -cum_var, sum), y = cum_var), group = 1) +
  geom_point(aes(x = pc, y = cum_var), size = 2) +
  labs(title = "Scree Plot", x = "Principal Components", y = "% Variance Explained") +
  geom_text(aes(x = pc, y = cum_var, label = cum_var_rounded, group = pc),
            size = 2, position = position_stack(vjust = 1.1), angle = 40) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

This plot shows the proportion of variance explained by each PC as well as the cumulative variance explained by the first 10 PCs.

# Variance Explained Plot
ggplot(eval_df[1:10,]) + 
  geom_point(data = eval_df[1:10,], aes(x = reorder(pc, -var_exp, sum), y = var_exp, group = pc), color='steelblue4') +
  geom_line(data = eval_df[1:10,], aes(x = reorder(pc, -var_exp, sum), y = var_exp, group = 1), color='steelblue4') +
  labs(title = "Proportion of Variance Explained by Each PC", x = 'Principal Components', y = 'Proportion of Variance Explained') +
  geom_text(aes(x = pc, y = var_exp, label = var_exp_rounded, group = pc), size = 2, position = position_nudge(x=0.25, y=1)) + 
  theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5))

Like the Scree plot above, this shows the proportion of variance explained by each PC. Using the elbow method, we see that the variance levels off at PC3. Since we are using this PCA for data visualization, retaining 3 PCs is more than appropriate.

This is analogous to the elbow method in k-means clustering, which would suggest k=3 as optimal if we were choosing the number of clusters rather than the number of PCs.

Below we plot the first 3 PCs in an interactive 3D scatterplot to better see the population substructure captured by the 3 retained PCs (note that populations are coded the same as in the PC plots above):
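The interactive widget itself is not reproduced here; a hedged sketch of one way to build such a plot (assuming the plotly package) is:

# Sketch of the interactive 3D PC plot (assumes the plotly package is installed)
library(plotly)
plot_ly(PCA_PHENOTYPE_MERGED,
        x = ~PC1, y = ~PC2, z = ~PC3,
        color = ~ANCESTRY_SUPERPOP, symbol = ~DATASET,
        type = "scatter3d", mode = "markers",
        marker = list(size = 2, opacity = 0.5))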


Addressing Outliers

We plot PC2 vs PC1 again, with circles at 3 SD and 5 SD from the centroid drawn in black and blue respectively. The centroid, calculated from the 1KG EUR individuals, is the open point drawn in black; individuals beyond a chosen Euclidean distance from it will be omitted as outliers.

# Define the radius of each circle
liberal_radius_factor <- 5
liberal_radius <- liberal_radius_factor * max(PCA_SD[1:2])
conservative_radius_factor <- 3
conservative_radius <- conservative_radius_factor*max(PCA_SD[1:2])

# Plot again PC2vsPC1 with circle around centroid
ggplot(PCA_PHENOTYPE_MERGED) +
  geom_point(aes(x=PC1, y=PC2, color=ANCESTRY_SUPERPOP, shape=DATASET), alpha=0.5) +
  geom_point(aes(x=PCA_MEANS[1], y=PCA_MEANS[2]), color = "black", shape=1, alpha=0.5) +
  labs(title = "PC2 vs PC1 for CanPath+1KG", x=sub("@", eval_df$var_exp_percent[1], "PC1 (@)"), y=sub("@", eval_df$var_exp_percent[2], "PC2 (@)"), color = "ANCESTRY") +
  # 5 SD
  annotate("path", x = PCA_MEANS["PC1"] + liberal_radius * cos(seq(0, 2*pi, length.out = 100)), 
           y = PCA_MEANS["PC2"] + liberal_radius * sin(seq(0, 2*pi, length.out = 100)), 
           colour = "blue", linetype = "dashed") + 
  # 3 SD
  annotate("path", x = PCA_MEANS["PC1"] + conservative_radius * cos(seq(0, 2*pi, length.out = 100)), 
           y = PCA_MEANS["PC2"] + conservative_radius * sin(seq(0, 2*pi, length.out = 100)), 
           colour = "black", linetype = "dashed") + 
  annotate("text", x=0.002, y=0.0045, label="3 SD", colour="black", size=3) +
  annotate("text", x=0.002, y=0.007, label="5 SD", colour="blue", size=3) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

Here we remove outliers beyond 3 SD from the centroid. To briefly describe the goal: our CanPath data (all self-reported white individuals) clusters with the European individuals, as expected. To best control for population substructure, we now want to remove any individuals in the CanPath data who sit away from the large European cluster (see the EUR individuals in PC2 vs PC1 who fall nearer to the AFR or EAS/SAS clusters). We remove any data points whose Euclidean distance from the centroid exceeds the conservative radius defined above (3 times the larger of the PC1 and PC2 standard deviations).

The Euclidean distance between two points can be written as

  D = sqrt((x2 - x1)^2 + (y2 - y1)^2)

or, for two points p and q with any number of coordinates (the form used in the R function below),

  D = sqrt(sum((p - q)^2))

# Exclude all samples (both CanPath and 1KG) that fall outside of the defined radius = 3SD
is_within_radius <- function(point, centroid, radius) {
  dist <- sqrt(sum((point - centroid) ^ 2))
  return(dist <= radius)
}

PC1PC2_centroid <- c(PCA_MEANS[1], PCA_MEANS[2])

PCA_PHENOTYPE_MERGED$within_radius <- apply(PCA_PHENOTYPE_MERGED[, c("PC1", "PC2")], 1, is_within_radius, centroid = PC1PC2_centroid, radius = conservative_radius)
PCA_within_radius <- PCA_PHENOTYPE_MERGED[PCA_PHENOTYPE_MERGED$within_radius,]
PCA_new_subset <- data.frame(FID = PCA_within_radius$ID, IID = PCA_within_radius$ID, DATASET=PCA_within_radius$DATASET)
PCA_new_subset <- subset(PCA_new_subset, DATASET=="CANPATH")
PCA_new_subset$DATASET <- NULL
PCA_final_subset <- PCA_new_subset
head(PCA_final_subset)
##        FID      IID
## 1 11100004 11100004
## 2 11100015 11100015
## 3 11100016 11100016
## 4 11100021 11100021
## 5 11100022 11100022
## 6 11100030 11100030

Note that we are only retaining the CanPath data. The 1KG data is used to visualize population stratification but is not part of our GWAS. After LD pruning the CanPath-only data, we must run a second PCA to capture the population substructure within this subset (this second PCA also provides the PCs we will use as covariates).

Here we export PCA_final_subset as a .txt file to create subsetted genotype files for our second PCA.

# Create new .txt file of individuals to be kept, n=12,222
write.table(PCA_final_subset, file='PCA_final_subset.txt', row.names = FALSE, col.names = FALSE, quote = F)
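
The retained individuals can then be subsetted into a new binary fileset with PLINK's --keep (a hedged sketch; the genotype prefixes are placeholders, not our exact file names):

# Illustrative only: keep the individuals listed (FID, IID) in PCA_final_subset.txt
# and write a new binary fileset. Prefixes are placeholders.
system(paste("plink --bfile CanPath_genotypes",
             "--keep PCA_final_subset.txt",
             "--make-bed",
             "--out CanPath_genotypes_subset"))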

Second PCA

Before our second PCA, we performed LD pruning on the binary files (a sketch of the commands is shown below). You will note that this PCA looks different from the first; this is because we are now working with only the white individuals from CanPath and plotting by province. (Ignore the "pca3" naming: a few earlier errors led to three PCA runs, but this is the correct one.)
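
For reference, the pruning step was of roughly this form (a hedged sketch; the window/step/r2 thresholds and file prefixes are placeholders, not necessarily our exact settings):

# Illustrative only: LD pruning with PLINK v1.9, then PCA on the pruned variants.
# Thresholds and prefixes are placeholders.
system(paste("plink --bfile CanPath_genotypes_subset",
             "--indep-pairwise 50 5 0.2",
             "--out CanPath_prune"))                        # writes CanPath_prune.prune.in
system(paste("plink --bfile CanPath_genotypes_subset",
             "--extract CanPath_prune.prune.in",
             "--pca 20",
             "--out allModelsExtract-pruned-chr1-22-PCs"))  # .eigenvec/.eigenval read below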

At this point, the second run of PCA is completed on only CanPath individuals. Here we import our new PCA output and merge our new PCA results with the phenotype file.

# Import PCA output
pca3Eigenvecs <- read.table("allModelsExtract-pruned-chr1-22-PCs.eigenvec", header = FALSE)
colnames(pca3Eigenvecs) <- c("FID", "ID", paste("PC", 1:20, sep=""))
pca3Eigenvecs$FID <- NULL

# Merge PCA results with phenotype file
pca3PhenotypeMerged <- merge(PHENOTYPE, pca3Eigenvecs, by = "ID")
pca3PhenotypeMerged$ANCESTRY_POP <- NULL
head(pca3PhenotypeMerged[1:12])
##         ID ADM_STUDY_DATASET DATASET ANCESTRY_SUPERPOP M1 M2 M3 M4         PC1
## 1 11100004              CAG3 CANPATH               EUR  2  2 NA NA -0.00401334
## 2 11100015              CAG3 CANPATH               EUR  2  2 NA NA  0.03355340
## 3 11100016              CAG2 CANPATH               EUR  1 NA  1 NA -0.00257105
## 4 11100021              CAG3 CANPATH               EUR  1  1  1  1 -0.00210680
## 5 11100022              CAG2 CANPATH               EUR  1 NA  1 NA -0.00212562
## 6 11100030              CAG3 CANPATH               EUR NA NA  2  2 -0.00182651
##            PC2          PC3          PC4
## 1  0.004311280 -0.004320780 -0.001336690
## 2 -0.007870400 -0.001594680 -0.001741420
## 3  0.009481430 -0.000537064 -0.002503150
## 4  0.007020910 -0.003712600  0.000382915
## 5 -0.000239869  0.003415170  0.003969740
## 6  0.010043700 -0.008075920 -0.002031530

# Reading in eigenvalues
pca3Eigenvals <- read.table("allModelsExtract-pruned-chr1-22-PCs.eigenval")
colnames(pca3Eigenvals) <- "evals"

# Calculate variance explained for all 20 PCs
pca3Eval_df <- data.frame(pc = sub("^", "PC", 1:nrow(pca3Eigenvals)),
  evals = pca3Eigenvals,
  var_exp = (pca3Eigenvals$evals/sum(pca3Eigenvals$evals))*100)

pca3Eval_df <- pca3Eval_df %>% mutate(cum_var = cumsum(var_exp), 
                   var_exp_rounded = round(var_exp, 2),
                   cum_var_rounded = round(cum_var, 2),
                   var_exp_percent = sub("$", "%", var_exp_rounded),
                   cum_var_percent = sub("$", "%", cum_var_rounded))

head(pca3Eval_df)
##    pc    evals   var_exp  cum_var var_exp_rounded cum_var_rounded
## 1 PC1 12.28010 17.743032 17.74303           17.74           17.74
## 2 PC2  6.61514  9.557955 27.30099            9.56           27.30
## 3 PC3  4.52345  6.535755 33.83674            6.54           33.84
## 4 PC4  4.22243  6.100823 39.93756            6.10           39.94
## 5 PC5  3.89713  5.630810 45.56837            5.63           45.57
## 6 PC6  3.27153  4.726905 50.29528            4.73           50.30
##   var_exp_percent cum_var_percent
## 1          17.74%          17.74%
## 2           9.56%           27.3%
## 3           6.54%          33.84%
## 4            6.1%          39.94%
## 5           5.63%          45.57%
## 6           4.73%           50.3%

#PC2-PC1
ggplot(pca3PhenotypeMerged) +
  geom_point(aes(x=PC1, y=PC2, color=ADM_STUDY_DATASET), alpha=0.5) +
  labs(title = "PC2 vs PC1 for pca3CanPath", x=sub("@", pca3Eval_df$var_exp_percent[1], "PC1 (@)"), y=sub("@", pca3Eval_df$var_exp_percent[2], "PC2 (@)"), color = "DATASET") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

#PC3-PC2
ggplot(pca3PhenotypeMerged) +
  geom_point(aes(x=PC2, y=PC3, color=ADM_STUDY_DATASET), alpha=0.5) +
  labs(title = "PC3 vs PC2 for pca3CanPath", x=sub("@", pca3Eval_df$var_exp_percent[2], "PC2 (@)"), y=sub("@", pca3Eval_df$var_exp_percent[3], "PC3 (@)"), color = "DATASET") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

#PC4-PC3
ggplot(pca3PhenotypeMerged) +
  geom_point(aes(x=PC3, y=PC4, color=ADM_STUDY_DATASET), alpha=0.5) +
  labs(title = "PC4 vs PC3 for pca3CanPath", x=sub("@", pca3Eval_df$var_exp_percent[3], "PC3 (@)"), y=sub("@", pca3Eval_df$var_exp_percent[4], "PC4 (@)"), color = "DATASET") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Scree Plot
ggplot(pca3Eval_df[1:10,]) +
  geom_col(aes(x = reorder(pc, -var_exp, sum), y = var_exp),
           colour = 'steelblue4', fill = 'steelblue4') +
  geom_line(aes(x = reorder(pc, -cum_var, sum), y = cum_var), group = 1) +
  geom_point(aes(x = pc, y = cum_var), size = 2) +
  labs(title = "Scree Plot", x = "Principal Components", y = "% Variance Explained") +
  geom_text(aes(x = pc, y = cum_var, label = cum_var_rounded, group = pc),
            size = 2, position = position_stack(vjust = 1.1), angle = 40) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Variance Explained Plot
ggplot(pca3Eval_df[1:10,]) + 
  geom_point(data = pca3Eval_df[1:10,], aes(x = reorder(pc, -var_exp, sum), y = var_exp, group = pc), color='steelblue4') +
  geom_line(data = pca3Eval_df[1:10,], aes(x = reorder(pc, -var_exp, sum), y = var_exp, group = 1), color='steelblue4') +
  labs(title = "Proportion of Variance Explained by Each PC", x = 'Principal Components', y = 'Proportion of Variance Explained') +
  geom_text(aes(x = pc, y = var_exp, label = var_exp_rounded, group = pc), size = 2, position = position_nudge(x=0.25, y=1)) + 
  theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5))

With the data pruned and restricted to CanPath individuals, the eigenvectors are now quite different from those of the first PCA. We will use the first 10 PCs from this second run as covariates in our GWAS.

# Import
pca3Eigenvecs <- read.table("allModelsExtract-pruned-chr1-22-PCs.eigenvec", header = FALSE)
colnames(pca3Eigenvecs) <- c("FID", "IID", paste("PC", 1:20, sep=""))

# read_excel() requires the readxl package
library(readxl)
phenotypeForGWAS <- read_excel("/Users/brendannewton/Desktop/WENDT_LAB/FSC485/GWAS/PHENOTYPEFORGWAS.xlsx")

# Merge new PCA results with phenotype file
pca3PhenotypeGWAS <- merge(phenotypeForGWAS, pca3Eigenvecs[1:12], by = "IID")
pca3CovariateGWAS <- pca3PhenotypeGWAS[,c(1:4,9:18)]
pca3PhenotypeGWAS[,c(3:4,9:19)] <- NULL

# Export
write.table(pca3PhenotypeGWAS, file='pca3PhenotypeGWAS.txt', row.names = FALSE, col.names = FALSE, quote = F)
write.table(pca3CovariateGWAS, file='pca3CovariateGWAS.txt', row.names = FALSE, col.names = FALSE, quote = F)

With these files exported, we are ready to run our GWAS!
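
For completeness, a hedged sketch of how these exports might enter the association run (the flags shown are standard PLINK v1.9 options, but the exact command depends on the genotype prefix and on the column layout of the phenotype and covariate files):

# Illustrative only: association testing with the 10 PCs (plus other covariates)
# supplied via --covar. The genotype prefix and output name are placeholders.
system(paste("plink --bfile allModelsExtract-pruned-chr1-22",
             "--pheno pca3PhenotypeGWAS.txt",
             "--covar pca3CovariateGWAS.txt",
             "--logistic hide-covar",
             "--out GWAS_results"))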