Introduction

This R Markdown tutorial provides a step-by-step guide for filtering genotype data and performing Principal Component Analysis (PCA) using tidyverse, FactoMineR, and plotly. The workflow includes data preprocessing, quality filtering, and visualization of population structure using 2D and interactive 3D PCA plots.

To ensure compatibility, input data must follow a predefined format. The demonstration uses the following example datasets:

Genotype Data: PineDB_genotype_data_GDA.csv
Target Sample List: Ooc_metadata_3unknown.csv

By following this guide, users will learn how to prepare SNP genotype data, compute principal components, and generate informative visualizations for genetic analysis.

Genotype Data (Input Data 1)

The genotype dataset can represent a full SNP database (e.g., SNP chip data from multiple species) or a filtered subset with:

Columns: Represent individual samples, with column names labeled as CEL file ID (or any unique identifier).
Rows: Represent SNP markers, with row names corresponding to Probeset ID or Marker ID.
SNP Data Format: Genotypes are recorded in the forward strand base format (e.g., T/T, A/G).
Missing Data: Missing values should be denoted as NA.

For this tutorial, we use PineDB_genotype_data_GDA.csv, which contains tropical pine SNP profiles for P. patula, P. tecunumanii and P. oocarpa samples from the Genome Diverstiy Atlas (GDA) - see publication A genome-wide SNP genotyping resource for tropical pine tree species. Jackson & Christie et al. (2021).

Demo genotype data dimensions: 49567 SNPs (rows) × 286 samples (columns).

Target Sample List (Input Data 2)

The Target Sample List contains metadata for the samples selected for analysis. It ensures that only relevant samples are included in the final PCA visualization (should be a subset of the samples in your genotype dataset).

Required columns:

CEL file ID (this must be the first column): The unique identifier of the sample, matching the column names in the Genotype Data input file. In this demo, we use the CEL file name from Thermo Fisher.
Structure_plot_label: An informative label used to identify the sample in the structure/PCA plot.

Additional metadata columns may be included but are optional.

For this tutorial, we use Ooc_metadata_3unknown.txt, which contains metadata for P. oocarpa GDA samples, including three unidentified P. oocarpa individuals.

Demo target sample list dimensions: 64 samples (rows) × 7 metadata columns (columns).

Structure of this Document

This tutorial is organized into three key sections:

Part 1: Data Preparation & Filtering
- Importing and cleaning genotype and sample metadata
- Filtering SNP markers based on quality thresholds
Part 2: Principal Component Analysis (PCA) & 2D Visualization
- Converting genotype data into a numeric format
- Performing PCA to explore population structure
Part 3: Interactive 3D PCA plot
- Enhancing PCA visualization with an interactive 3D plot
- Using Plotly to explore clustering patterns dynamically

Set Working Directory

Before starting the analysis, it’s essential to set the working directory to the folder where your input files are located. This ensures that R can find and load the data correctly. Additionally, any output files or plots generated during the analysis will be saved to this directory. It is generally recommended to create a dedicated directory for each analysis project in R to keep files organized.

# Set the working directory. Update the path below to reflect the location of your input files.
# Replace the example path with the actual path on your system.
setwd("/Users/nanette/Documents/SNP_data_analysis_pipeline/Tutorials/")

Load Libraries

Load the required R libraries. These libraries provide the functions needed for data manipulation (tidyverse) in Part 1, statistical analysis (FactoMineR) in Part 2, and visualization (plotly) in Part 3 of the tutorial.

If you haven’t already installed these libraries, you’ll need to install them first. You can install them using the install.packages() function in R.

# Install required packages (run this code only if you don't have the packages installed)
install.packages("tidyverse")
install.packages("FactoMineR")
install.packages("plotly")

# Load tidyverse for Part 1: A collection of R packages for data manipulation and visualization
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load FactoMineR for Part 2: an R package dedicated to multivariate exploratory data analysis
library(FactoMineR)
# Load plotly for Part 3: A library for crearing interactive plots
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Part 1: Data Preparation & Filtering

This section focuses on preparing the genotype and sample metadata for downstream analysis. It involves loading the data, transposing the genotype data to match the sample list format, merging the genotype data with sample metadata, and filtering SNP markers based on quality control metrics. Proper data preparation is essential for accurate and reliable results in subsequent steps.

Read Genotype Data

Load the Genotype Data. The data should be in a CSV format, where rows represent SNP markers and columns represent individual samples.

# Load the genotype data from a CSV file (ensure the file path is correct)
all_genotypes <- read_csv("PineDB_genotype_data_GDA.csv")

## Rows: 49567 Columns: 286
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (286): probeset_id, a551255-4438657-052823-641_A13.CEL, a551255-4438657-...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check the dimensions of the loaded data: 
# The first value is the number of SNP markers (rows), 
# and the second value is the number of samples (columns)
dim(all_genotypes)

## [1] 49567   286

# Preview the first few rows and columns of the genotype data 
# to ensure it has loaded correctly (displays first 5 rows and first 3 columns)
all_genotypes[1:5,1:3]

## # A tibble: 5 × 3
##   probeset_id  `a551255-4438657-052823-641_A13.CEL` a551255-4438657-052823-641…¹
##   <chr>        <chr>                                <chr>                       
## 1 AX-386620069 T/T                                  T/T                         
## 2 AX-387102206 <NA>                                 <NA>                        
## 3 AX-386695154 T/T                                  C/C                         
## 4 AX-386020185 G/G                                  G/G                         
## 5 AX-385706488 A/A                                  A/A                         
## # ℹ abbreviated name: ¹`a551255-4438657-052823-641_A15.CEL`

Transpose Genotype Data

Transpose the genotype data so that samples are in rows (similar to the Target Sample List) and SNPs are in columns. This is necessary because many R functions expect samples to be organized in rows.

# Extract the first column (assumed to be the sample names) to use as new column names
new_colnames <- all_genotypes[[1]]

# Transpose the genotype data (excluding the first column) to switch rows and columns
transposed_data <- t(all_genotypes[,-1])

# Assign the extracted sample names (new_colnames) as the column names of the transposed data
colnames(transposed_data) <- new_colnames

# Convert the transposed matrix to a data frame and avoid converting strings to factors
result_df <- as.data.frame(transposed_data, stringsAsFactors = FALSE)

# Convert the data frame to a tibble and add a new column for row names (RowNames)
all_genotypes_t <- result_df %>%
  rownames_to_column(var = "RowNames") %>%   # Add row names as a new column
  as_tibble() %>%                           # Convert to tibble for easier manipulation
  rename("CEL file ID" = 1)                 # Rename the first column to "CEL file ID"

# Preview the first few rows and columns of the transposed data
all_genotypes_t[1:5,1:4]

## # A tibble: 5 × 4
##   `CEL file ID`                     `AX-386620069` `AX-387102206` `AX-386695154`
##   <chr>                             <chr>          <chr>          <chr>         
## 1 a551255-4438657-052823-641_A13.C… T/T            <NA>           T/T           
## 2 a551255-4438657-052823-641_A15.C… T/T            <NA>           C/C           
## 3 a551255-4438657-052823-641_A17.C… T/T            <NA>           T/T           
## 4 a551255-4438657-052823-641_A19.C… T/T            <NA>           T/C           
## 5 a551255-4438657-052823-641_A21.C… T/T            <NA>           T/C

Read Target Sample List

Load the Target Sample List, which includes metadata per sample that you would like to include in the structure-like plot. The first column must be the CEL file ID.

# Load the target sample metadata from a tab-separated file (ensure the file path is correct)
target_samples <- read_csv("Ooc_metadata_3unknown.csv")

## Rows: 64 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): CEL file ID, Project, Species2, PCA-based species, FMG_Sample_ID, S...
## dbl (2): Order1, Elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check the dimensions of the target sample data: 
# The first value is the number of rows (samples), 
# and the second value is the number of columns (metadata variables)
dim(target_samples)

## [1] 64 11

# View the first few rows of the target sample metadata to ensure it loaded correctly
head(target_samples)

## # A tibble: 6 × 11
##   `CEL file ID`        Project Species2 `PCA-based species` FMG_Sample_ID Order1
##   <chr>                <chr>   <chr>    <chr>               <chr>          <dbl>
## 1 a551255-4441886-052… GDA     Ooc.GDA  Ooc.GDA             GF339_1            1
## 2 a551255-4441886-052… GDA     Ooc.GDA  Ooc.GDA             GF340_1            2
## 3 a551255-4441886-052… GDA     Ooc.GDA  Ooc.GDA             GF342_1            3
## 4 a551255-4441886-052… GDA     Ooc.GDA  Ooc.GDA             GF343_1            4
## 5 a551255-4441886-052… GDA     Ooc.GDA  Ooc.GDA             GF287_1            5
## 6 a551255-4441886-052… GDA     Ooc.GDA  Ooc.GDA             GF288_1            6
## # ℹ 5 more variables: Structure_plot_label <chr>, SubSpName <chr>,
## #   Country <chr>, Provenance <chr>, Elevation <dbl>

Subset Genotype Data Using Target Sample List

This step effectively subsets the genotype data, retaining only the samples listed in the Target Sample List. By merging the genotype data with the metadata using “CEL file ID” as the key, we ensure that we proceed with only the samples of interest, which is crucial for focused and relevant downstream analysis.

# Merge the genotype data with the target sample metadata using "CEL file ID" as the common key
filtered_genotype_data <- all_genotypes_t %>%
  inner_join(target_samples, by = "CEL file ID")

# Check the dimensions of the filtered genotype data: 
# The first value is the number of rows (samples), 
# and the second value is the number of columns (genotype + metadata)
dim(filtered_genotype_data)

## [1]    64 49578

Note that the number of samples (rows) decreased from 286 to 64

Calculate Marker Statistics

Compute marker statistics (for all 49,567 SNP markers) including call rate, number of alleles, and minor allele frequency (MAF). These statistics are used to identify and remove low-quality or uninformative markers.

# Remove the "CEL file ID" column to focus on SNP markers
marker_data <- filtered_genotype_data %>% select(-`CEL file ID`)

# Calculate allele statistics for each SNP marker
allele_info <- apply(marker_data, 2, function(col) {
  alleles <- unlist(strsplit(na.omit(col), "/"))  # Split genotypes into alleles
  if (length(alleles) == 0) return(NULL)  # Skip empty columns
  allele_freqs <- table(alleles) / length(alleles)  # Calculate allele frequencies
  list(
    NumAlleles = length(unique(alleles)),  # Count unique alleles
    MAF = if (length(allele_freqs) > 1) min(allele_freqs) else 0  # Calculate minor allele frequency (MAF)
  )
})

# Create a data frame with marker statistics (call rate, number of alleles, MAF)
marker_stats <- data.frame(
  Marker = colnames(marker_data),  # SNP marker names
  CallRate = colMeans(!is.na(marker_data)),  # Call rate
  NumAlleles = sapply(allele_info, function(info) if (!is.null(info)) info$NumAlleles else NA),  # Number of alleles
  MAF = sapply(allele_info, function(info) if (!is.null(info)) info$MAF else NA)  # MAF
)

# Preview the marker statistics
head(marker_stats)

##                    Marker CallRate NumAlleles       MAF
## AX-386620069 AX-386620069 1.000000          1 0.0000000
## AX-387102206 AX-387102206 0.000000         NA        NA
## AX-386695154 AX-386695154 0.984375          2 0.2619048
## AX-386020185 AX-386020185 1.000000          2 0.1484375
## AX-385706488 AX-385706488 1.000000          1 0.0000000
## AX-386895519 AX-386895519 1.000000          2 0.4531250

Filter Markers

Filter markers based on call rate, number of alleles, and MAF threshold. Keep only informative markers.

# Filter valid markers based on criteria: 
# - At least 85% call rate, bi-allelic, MAF >= 0.05
valid_markers <- marker_stats %>%
  filter(CallRate >= 0.85,  # Call rate >= 85%
         NumAlleles <= 2,   # Bi-allelic markers
         MAF >= 0.05) %>%   # MAF >= 5%
  pull(Marker)  # Extract marker names

# Filter genotype data to include only valid markers and "CEL file ID"
filtered_genotype_data <- filtered_genotype_data %>%
  select(`CEL file ID`, all_of(valid_markers))  # Keep valid markers

# Check dimensions of filtered genotype data
dim(filtered_genotype_data)

## [1]    64 13328

For this P. oocarpa GDA dataset, we will be continuing with 13,328 informative SNPs in downstream analysis (and 64 samples).

Part 2: Principal Component Analysis (PCA) & 2D Visualization

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of data by identifying principal components (PCs), which are orthogonal linear combinations of the original variables that capture the most variance in the data. In the context of genotype data, PCA helps to uncover population structure by identifying clusters of individuals based on their genetic similarity. This is useful to identify major genetic patterns, population groupings, and potential outliers within a dataset. By visualizing the first few principal components, we can gain insights into the genetic relationships among samples without having to interpret a large number of SNPs.

In this section, we will convert the filtered genotype data into a numeric format suitable for PCA, perform PCA to identify principal components, calculate variance explained by each PC, and plot the first two PCs to visualize population structure.

Convert SNP genotypes to numeric format for PCA

Converting SNP genotypes to numeric format is essential, as PCA requires numerical data for computing variance, covariances, and eigenvectors. SNP genotypes are often represented categorically (e.g., T/T, A/T), which PCA cannot process directly. In the additive model, numeric encoding represents homozygous (baseline allele) as 0, heterozygous as 1, and homozygous (other allele) as 2. This allows PCA to reduce dimensionality and analyze genetic variation across individuals effectively.

# Define a function to convert SNP genotypes to numeric format using an additive model
convert_to_numeric <- function(geno_column) {
  geno_column <- as.character(geno_column)  # Ensure character format to handle factor variables correctly
  alleles <- unique(unlist(strsplit(na.omit(geno_column), "/")))  # Extract unique alleles by splitting the genotype string
  if (length(alleles) != 2) return(rep(NA, length(geno_column)))  # Skip non-biallelic SNPs by returning NA for all individuals
  ref_allele <- alleles[1]  # Define the baseline allele, e.g. reference allele
  alt_allele <- alleles[2]  # Define the other allele, e.g. alternate allele
  # Define a function to recode each genotype based on the defined alleles
  recode <- function(gt) {
    if (is.na(gt)) return(NA) # Handle missing values
    if (gt == paste0(ref_allele, "/", ref_allele)) return(0) # Homozygous (allele 1) = 0
    if (gt == paste0(ref_allele, "/", alt_allele) | gt == paste0(alt_allele, "/", ref_allele)) return(1) # Heterozygous = 1
    if (gt == paste0(alt_allele, "/", alt_allele)) return(2) # Homozygous (allele 2) = 2
    return(NA) # Handle unexpected genotypes
  }
  sapply(geno_column, recode) # Apply the recode function to each genotype in the column
}

# Apply the conversion function to all SNP columns in the filtered genotype data
snp_data <- filtered_genotype_data[, -1]  # Remove the "CEL file ID" column to focus on SNP markers
numeric_genotype_matrix <- apply(snp_data, 2, convert_to_numeric)  # Apply the conversion function to each SNP column
numeric_genotype_matrix <- apply(numeric_genotype_matrix, 2, function(x) { # Impute missing values with the mean of each column
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  return(x)
})

Perform PCA

Performing PCA on the numeric genotype matrix is essential for reducing the dimensionality of the data and identifying principal components that capture the most variance. By performing PCA, we can transform the high-dimensional genotype data into a smaller set of uncorrelated variables (principal components) that explain most of the variance in the data.

# Perform PCA using the FactoMineR package
pca_result <- PCA(numeric_genotype_matrix, graph = FALSE, scale.unit = TRUE)  
# Perform PCA with scaling to unit variance for each SNP marker
# graph = FALSE suppresses the default graphical output from FactoMineR
# scale.unit = TRUE scales the data to unit variance before PCA, which is recommended when variables are measured in different units or have different variances

# Extract PC scores (coordinates) for each individual from the PCA result
dim(pca_result$ind$coord)  # Check the dimensions of the PC scores

## [1] 64  5

# Prepare PCA results for visualization
pc_data <- data.frame(
  `CEL file ID` = filtered_genotype_data$`CEL file ID`,  # Retrieve the CEL file IDs from the original data
  PC1 = pca_result$ind$coord[, 1],  # Extract PC1 scores
  PC2 = pca_result$ind$coord[, 2],  # Extract PC2 scores
  PC3 = pca_result$ind$coord[, 3],  # Extract PC3 scores
  check.names = FALSE  # Prevents R from converting spaces to dots in column names
)

# Merge PC scores with target sample metadata using "CEL file ID" as the key
pc_data_meta <- pc_data %>% inner_join(target_samples, by = "CEL file ID")

Calculate variance explained

Calculating the variance explained by each principal component helps us understand the amount of variation captured by each component. This information is crucial for determining which PCs are most informative and relevant for downstream analysis and visualization.

# Calculate the percentage of variance explained by each principal component
variance_explained <- pca_result$eig[,2]  # Second column contains % variance explained by each PC

# Create labels for the PCA plot axes, including the percentage of variance explained
pc1_label <- paste0("PC1 (", round(variance_explained[1], 2), "%)")  # Label for PC1
pc2_label <- paste0("PC2 (", round(variance_explained[2], 2), "%)")  # Label for PC2
pc3_label <- paste0("PC3 (", round(variance_explained[3], 2), "%)")  # Label for PC3

Create 2D PCA plot

Creating a 2D PCA plot allows us to visualize the population structure of the samples based on the first two principal components. This visualization helps identify clusters of individuals with similar genetic profiles, revealing underlying patterns and relationships in the data.

# Define colors for each subspecies
categories <- unique(pc_data_meta$SubSpName) # Unique values in "SubSpName" column
length(categories) # Check the number of categories

## [1] 4

my_col <- c("#9ecae1", "#3ecfff", "#000000", "#0000FF")
my_colors <- setNames(my_col, categories)
my_colors # Check colors assigned for each category

##       Ooc.S       Ooc.C Ooc.unknown       Ooc.N 
##   "#9ecae1"   "#3ecfff"   "#000000"   "#0000FF"

# Use the ggplot2 R package to plot the first 2 PCs
ggplot(pc_data_meta, aes(x = PC1, y = PC2, color = SubSpName)) +  # Specify the data and aesthetics for the plot
  geom_point(alpha = 0.7) +  # Add points to the plot with transparency (alpha)
  scale_color_manual(values = my_colors) +  # Set custom colors for each species or population
  labs(title = "PCA Plot",  # Add a title to the plot
       x = pc1_label,  # Label the x-axis with PC1 and variance explained
       y = pc2_label) +  # Label the y-axis with PC2 and variance explained
  theme_minimal()  # Use a minimal theme for a clean look

Part 3: Interactive 3D PCA Plot

This section focuses on creating an interactive 3D visualization of the PCA results using the plotly library. A 3D plot allows for exploration of the data from multiple angles, potentially revealing patterns and groupings that might not be apparent in a 2D representation. The interactivity allows users to zoom, rotate, and hover over data points to gain deeper insights into the genetic relationships among samples.

Create 3D PCA plot

This step generates an interactive 3D scatter plot using the PCA results, allowing users to explore the data in three dimensions and identify potential clusters and relationships among samples.

# Load the plotly library, which is used to create interactive plots
# Create an interactive 3D PCA plot using the plotly library
p <- plot_ly(pc_data_meta, # Use the merged PCA data with metadata
             x = ~PC1,  # Map PC1 scores to the x-axis
             y = ~PC2,  # Map PC2 scores to the y-axis
             z = ~PC3,  # Map PC3 scores to the z-axis
             hoverinfo = "text",  # Set hover information to display custom text
             text= paste("Sample: ", pc_data_meta$FMG_Sample_ID, # Create hover text with sample ID
                         "<br>", "Info: ", pc_data_meta$Structure_plot_label),  # Add additional info to hover text
             color = ~ SubSpName,  # Color points by subspecies
             opacity = 1,  # Set point opacity
             size = 2,  # Set point size
             colors = my_colors) %>% # Use the previously defined color palette for subspecies
  add_markers() %>%  # Add markers (points) to the plot
  layout(title = "3D PCA Plot",  # Set the plot title
         scene = list(xaxis = list(title =pc1_label),  # Customize x-axis label
                      yaxis = list(title =pc2_label),  # Customize y-axis label
                      zaxis = list(title =pc3_label)))  # Customize z-axis label

p  # Display the interactive 3D PCA plot

Here is a link to the interactive 3D PCA plot hosted on Plotly’s Chart Studio, enabling users to explore the data in a web browser and share the visualization with others: 3D plot in the browser

Genotype Data Filtering and PCA Visualization in R: A Step-by-Step Guide

Dr N Christie

2025-04-02