Workflow: Analysis of 1000 Genomes Data with PCA

This particular R markdown file documents the workflow which produces the csv file used in the final report. It is primarily a demonstration of the preparation and data cleaning process of the raw vcf file received from Ensembl Data Slicer

Preliminaries

Load the necessary libraries that is used.

library(vcfR)
## Warning: package 'vcfR' was built under R version 4.2.2
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-2
library(ggplot2)
library(ggpubr)

Setting and checking the working directory

getwd() #This gets the current directory, if it is not the right project folder, then we need to set it. For the sake of outlining the entire preparation process, a setwd will be called
## [1] "C:/BIOSC 1540 Proj"
setwd("C:/BIOSC 1540 Proj")
getwd() #Just to check
## [1] "C:/BIOSC 1540 Proj"

Making sure that the vcf file is located within the working directory

list.files(pattern = "vcf") #looks for files containing the string vcf in their name
## [1] "4.48252897-48492897.ALL.chr4_GRCh38.genotypes.20170504.vcf.gz"
## [2] "vcf_num.csv"                                                  
## [3] "vcf_num_df.csv"                                               
## [4] "vcf_scaled.csv"

Data Preparation

Loading the SNPs from the VCF file

First, we need to extract the data from the compressed format .gz

my_vcf <- "4.48252897-48492897.ALL.chr4_GRCh38.genotypes.20170504.vcf.gz"

Loading the raw data from the vcf file. This is the basis of analysis. We need to use the read.vcfR function to read data from the .vcf file, which we will later use for analysis.

vcf <- vcfR::read.vcfR(my_vcf, convertNA = T)
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 130
##   header_line: 131
##   variant count: 6378
##   column count: 2513
## 
Meta line 130 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 6378
##   Character matrix gt cols: 2513
##   skip: 0
##   nrows: 6378
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant: 6378
## All variants processed

Extract the Genotype Scores

Even though the raw data has been read into R, they are still not usable for our functions such as PCA. Think of them as untranslated text, we need to translate them into R-compatible data to being our data processing.

vcf_num <- vcfR::extract.gt(vcf, 
           element = "GT",
           IDtoRowNames  = F,
           as.numeric = T,
           convertNA = T,
           return.alleles = F)

Save to csv, just as a backup of processed raw data, which we can now use.

write.csv(vcf_num, file = "vcf_num.csv", row.names = F)

Confirm the presence of the file.

list.files(pattern = "csv")
## [1] "1000genomes_people_info2-1.csv" "vcf_num.csv"                   
## [3] "vcf_num_df.csv"                 "vcf_scaled.csv"

Rotating the data

Even though we have extracted the data and converted it into R compatible form to work with, we still need to transpose them for our analysis. This rotates the columns and the rows, which makes it easier for our population genetic analysis.

vcf_num_t <- t(vcf_num)

transpose function gives its output in matrix, however we need data frames, not only for some of our functions, but to also modify the values within the data structure.

vcf_num_df <- data.frame(vcf_num_t)

Retrieve the sample name before we proceed.

sample <- row.names(vcf_num_df)

Add sample info to dataframe

vcf_num_df <- data.frame(sample, vcf_num_df)

Save the processed dataframe into a csv file, as backup.

write.csv(vcf_num_df,
          file = "vcf_num_df.csv",
          row.names = F)

Check that the file has been created.

list.files(pattern = "csv")
## [1] "1000genomes_people_info2-1.csv" "vcf_num.csv"                   
## [3] "vcf_num_df.csv"                 "vcf_scaled.csv"

Merge Data with Population Metadata

Data was mostly obtained from these sources:

Data obtained from https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel

The reason why this is done is to identify the population from which our samples are from. This will be helpful for our analysis later down the line.

pop_meta <- read.csv(file = "1000genomes_people_info2-1.csv")

Checking whether the column “sample” appears in both snp and metadata

names(pop_meta)
## [1] "pop"       "super_pop" "sample"    "sex"       "lat"       "lng"
names(vcf_num_df[1:10])
##  [1] "sample" "X1"     "X2"     "X3"     "X4"     "X5"     "X6"     "X7"    
##  [9] "X8"     "X9"

Merging the two sets of data.

vcf_num_df2 <- merge(pop_meta,
                     vcf_num_df,
                     by= "sample")

Checking whether the merge was successful.

nrow(vcf_num_df) == nrow(vcf_num_df2)
## [1] TRUE
names(vcf_num_df2[1:13])
##  [1] "sample"    "pop"       "super_pop" "sex"       "lat"       "lng"      
##  [7] "X1"        "X2"        "X3"        "X4"        "X5"        "X6"       
## [13] "X7"

Omit Invariant Features

If a SNP is fixed, all individuals within a population have the same variant at that specific locus. This is not useful in our analysis, so we will create a function to remove them.

invar_omit <- function(x){
  cat("Dataframe of dim",dim(x), "processed...\n")
  sds <- apply(x, 2, sd, na.rm = TRUE)
  i_var0 <- which(sds == 0)
 
  
  cat(length(i_var0),"columns removed\n")
  
  if(length(i_var0) > 0){
     x <- x[, -i_var0]
  }
  
  ## add return()  with x in it
  return(x)                      
}

Check which columns/features have character data (The columns we do not want to use our function on)

names(vcf_num_df2[1:10])
##  [1] "sample"    "pop"       "super_pop" "sex"       "lat"       "lng"      
##  [7] "X1"        "X2"        "X3"        "X4"

We will then skip them using negative indexing, and perform invar_omit on the rest of the columns/features

vcf_noinvar <- vcf_num_df2
vcf_noinvar[, -c(1:6)] <- invar_omit(vcf_noinvar[, -c(1:6)])
## Dataframe of dim 2504 6378 processed...
## 1648 columns removed

Creating an object to store the number of columns removed, for later reference if needed.

my_meta_N_invar_cols <- 1648

Removing NAs

The default method for removing invalid values is na.omit(), which works by removing the rows/columns that contains them. An easy but disruptive method, which may sometimes not suit the data. This is why we need our own ways of dealing with invalid values. The following function finds the NAs present in a data structure, and I have modified it so that it would only print a message if there are 1 or more NAs present in the row.

find_NAs <- function(x){
  NAs_TF <- is.na(x)
  i_NA <- which(NAs_TF == TRUE)
  N_NA <- length(i_NA)
  
  if(N_NA > 0){ #Made a little modification, so that messages only show up if more than 1 (inclusive) NAs have been found in the given input
    cat("Results:",N_NA, "NAs present\n.")
  }
  return(i_NA)
}

We then use the function created above to iterate through our transposed data structure to identify the locations of NA values.

# N_rows
# number of rows (individuals)
N_rows <- nrow(vcf_noinvar)

# N_NA
# vector to hold output (number of NAs)
N_NA   <- rep(x = 0, times = N_rows)

# N_SNPs
# total number of columns (SNPs)
N_SNPs <- ncol(vcf_noinvar)

# the for() loop
for(i in 1:N_rows){
  
  # for each row, find the location of
  ## NAs with snps_num_t()
  i_NA <- find_NAs(vcf_noinvar[i,]) 
  
  # then determine how many NAs
  ## with length()
  N_NA_i <- length(i_NA)
  
  # then save the output to 
  ## our storage vector
  N_NA[i] <- N_NA_i
}

Given the lack of output from the function find_NAs, we have no NAs in our vcf_noinvar dataframe.

Furthermore, if there are no NAs, there is no point doing imputation.

Thus, we can now conclude, the data has been processed and is now ready for use.

Preparing for PCA

Data Scaling

Many studies use centering around 0 and scaling by the standard deviation.

There are other forms of scaling that are more appropriate for SNPs, but for the purpose of this project, we will use standard scaling.

Note that scaling is only done on our numeric columns so we skip the first few containing the names and which population they are from.

vcf_scaled <- vcf_noinvar #new copy just in case

The R function scale will do the trick here.

vcf_scaled[,-c(1:6)] <- scale(vcf_scaled[,-c(1:6)])

This is the last step of our data preparation and we will now write this into a csv for our final report.

write.csv(vcf_scaled, 
          file = "vcf_scaled.csv", 
          row.names = F)

Check that the file exists and all other csv files are written.

list.files(pattern = "csv")
## [1] "1000genomes_people_info2-1.csv" "vcf_num.csv"                   
## [3] "vcf_num_df.csv"                 "vcf_scaled.csv"