Preparing the R Environment

Load necessary R packages

library(vcfR)
## Warning: package 'vcfR' was built under R version 4.2.2
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
library(vegan)
## Warning: package 'vegan' was built under R version 4.2.2
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.2.2
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)

Confirm the working directory

getwd()
## [1] "C:/Users/Grief Mage/Documents"

Locate the files

list.files(pattern = "vcf")
##  [1] "22.26006210-26246210.ALL.chr22_GRCh38.genotypes.20170504.vcf"   
##  [2] "22.26006210-26246210.ALL.chr22_GRCh38.genotypes.20170504.vcf.gz"
##  [3] "ALL.chr22_GRCh38.genotypes.20170504 (1).vcf.gz"                 
##  [4] "all_loci.vcf"                                                   
##  [5] "vcf_num.csv"                                                    
##  [6] "vcf_num_df.csv"                                                 
##  [7] "vcf_num_df2.csv"                                                
##  [8] "vcf_scaled.csv"                                                 
##  [9] "vcfR_test.vcf"                                                  
## [10] "vcfR_test.vcf.gz"

Set up the SNP data for R

Store the name of the VCF file

my_vcf <- "22.26006210-26246210.ALL.chr22_GRCh38.genotypes.20170504.vcf.gz"

Read the VCF file into R

vcf <- vcfR::read.vcfR(my_vcf, convertNA = T)
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 130
##   header_line: 131
##   variant count: 7212
##   column count: 2513
## Meta line 130 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 7212
##   Character matrix gt cols: 2513
##   skip: 0
##   nrows: 7212
##   row_num: 0
## Processed variant 1000
## Processed variant 2000
## Processed variant 3000
## Processed variant 4000
## Processed variant 5000
## Processed variant 6000
## Processed variant 7000
## Processed variant: 7212
## All variants processed

Convert the raw VCF genotypes to numeric genotype scores

vcf_num <- extract.gt(vcf, 
                      element = "GT",
                      IDtoRowNames = F,
                      as.numeric = T,
                      convertNA = T)
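
To build intuition for what these scores represent, here is a tiny hand-rolled illustration (the gt_example vector is made up, and this is a sketch of the idea rather than what extract.gt() does internally): each phased GT string is scored by counting copies of the ALT allele.

# hypothetical GT strings of the kind found in 1000 Genomes data
gt_example <- c("0|0", "0|1", "1|1")

# count the "1" (ALT) alleles in each genotype
sapply(strsplit(gt_example, "|", fixed = TRUE),
       function(alleles) sum(alleles == "1"))
# expect: 0 1 2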

Save the csv

write.csv(vcf_num, file = "vcf_num.csv", row.names = F)

Confirm presence of the file

list.files()
##  [1] "07-mean_imputation---Gabriel-Medeiros.html"                     
##  [2] "07-mean_imputation--1-.docx"                                    
##  [3] "07-mean_imputation--1-.html"                                    
##  [4] "09-PCA_worked_example-SNPs-part1.Rmd"                           
##  [5] "10-PCA_worked_example-SNPs-part2.html"                          
##  [6] "10-PCA_worked_example-SNPs-part2.Rmd"                           
##  [7] "1000-genomes"                                                   
##  [8] "1000-genomes.gz"                                                
##  [9] "1000genomes_people_info2-1.csv"                                 
## [10] "22.26006210-26246210.ALL.chr22_GRCh38.genotypes.20170504.vcf"   
## [11] "22.26006210-26246210.ALL.chr22_GRCh38.genotypes.20170504.vcf.gz"
## [12] "ALL.chr22_GRCh38.genotypes.20170504 (1).vcf.gz"                 
## [13] "all_loci.vcf"                                                   
## [14] "bird_snps_remove_NAs.Rmd"                                       
## [15] "cover letter.pdf"                                               
## [16] "desktop.ini"                                                    
## [17] "Dolphin Emulator"                                               
## [18] "Final-Project.Rmd"                                              
## [19] "Final Project.Rmd"                                              
## [20] "final_report_template.Rmd"                                      
## [21] "FPSMonitor.txt"                                                 
## [22] "GitHub"                                                         
## [23] "gpm23_Lab9"                                                     
## [24] "gwas_pheno_env.csv"                                             
## [25] "IEF essay 2.txt"                                                
## [26] "League of Legends"                                              
## [27] "My Games"                                                       
## [28] "My Music"                                                       
## [29] "My Pictures"                                                    
## [30] "My Videos"                                                      
## [31] "my_snps"                                                        
## [32] "NetBeansProjects"                                               
## [33] "pheno.csv"                                                      
## [34] "Quiz8_Spring2014_solutions.doc"                                 
## [35] "R files"                                                        
## [36] "removing_fixed_alleles.html"                                    
## [37] "removing_fixed_alleles.Rmd"                                     
## [38] "rsconnect"                                                      
## [39] "SNPs_cleaned.csv"                                               
## [40] "transpose_VCF_data.html"                                        
## [41] "transpose_VCF_data.Rmd"                                         
## [42] "untitled.Rmd"                                                   
## [43] "vcf_num.csv"                                                    
## [44] "vcf_num_df.csv"                                                 
## [45] "vcf_num_df2.csv"                                                
## [46] "vcf_scaled.csv"                                                 
## [47] "vcfR_test.vcf"                                                  
## [48] "vcfR_test.vcf.gz"                                               
## [49] "walsh2017morphology.csv"                                        
## [50] "working_directory_practice.html"                                
## [51] "working_directory_practice.Rmd"                                 
## [52] "Zoom"

Transpose the data from VCF orientation (SNPs in rows) to R dataframe orientation (samples in rows)

vcf_num_t <- t(vcf_num)
vcf_num_df <- data.frame(vcf_num_t)
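
A quick dimension check confirms the flip: read.vcfR() reported 7212 variants, and this panel contains 2504 samples, so the row and column counts should simply swap.

dim(vcf_num)    # expect: 7212 (SNPs) x 2504 (samples)
dim(vcf_num_df) # expect: 2504 (samples) x 7212 (SNPs)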

Get person (sample) names

sample <- row.names(vcf_num_df)

Add the sample names to the dataframe

vcf_num_df <- data.frame(sample, vcf_num_df)

Check working directory

getwd()
## [1] "C:/Users/Grief Mage/Documents"

Save the csv

write.csv(vcf_num_df,
          file = "vcf_num_df.csv",
          row.names = F)

Confirm the file

list.files(pattern = "csv")
## [1] "1000genomes_people_info2-1.csv" "gwas_pheno_env.csv"            
## [3] "pheno.csv"                      "SNPs_cleaned.csv"              
## [5] "vcf_num.csv"                    "vcf_num_df.csv"                
## [7] "vcf_num_df2.csv"                "vcf_scaled.csv"                
## [9] "walsh2017morphology.csv"

Clean data

Merge data with population data

Load the population metadata

pop_meta <- read.csv(file = "1000genomes_people_info2-1.csv")

Make sure the column “sample” appears in both the metadata and the SNP data

names(pop_meta)
## [1] "pop"       "super_pop" "sample"    "sex"       "lat"       "lng"
names(vcf_num_df)[1:10]
##  [1] "sample" "X1"     "X2"     "X3"     "X4"     "X5"     "X6"     "X7"    
##  [9] "X8"     "X9"

Merge the two sets of data

vcf_num_df2 <- merge(pop_meta,
                     vcf_num_df,
                     by = "sample")

Check dimensions before and after the merge; merge() keeps only samples present in both dataframes, so an unchanged row count confirms that no samples were dropped

nrow(vcf_num_df) == nrow(vcf_num_df2)
## [1] TRUE

Check the names of the new dataframe

names(vcf_num_df2)[1:15]
##  [1] "sample"    "pop"       "super_pop" "sex"       "lat"       "lng"      
##  [7] "X1"        "X2"        "X3"        "X4"        "X5"        "X6"       
## [13] "X7"        "X8"        "X9"

Check working directory

getwd()
## [1] "C:/Users/Grief Mage/Documents"

Save the csv

write.csv(vcf_num_df2, file = "vcf_num_df2.csv", row.names = F)

Confirm presence of file

list.files(pattern = "csv")
## [1] "1000genomes_people_info2-1.csv" "gwas_pheno_env.csv"            
## [3] "pheno.csv"                      "SNPs_cleaned.csv"              
## [5] "vcf_num.csv"                    "vcf_num_df.csv"                
## [7] "vcf_num_df2.csv"                "vcf_scaled.csv"                
## [9] "walsh2017morphology.csv"

Omit invariant features

invar_omit <- function(x){
  cat("Dataframe of dim", dim(x), "processed...\n")
  
  # standard deviation of each column, ignoring NAs
  sds <- apply(x, 2, sd, na.rm = TRUE)
  
  # indices of the invariant columns (sd of 0)
  i_var0 <- which(sds == 0)
  
  cat(length(i_var0), "columns removed\n")
  
  # drop the invariant columns, if any
  if(length(i_var0) > 0){
    x <- x[, -i_var0]
  }
  return(x)
}
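
Before running the function on the real data, a toy dataframe with one constant column confirms it behaves as intended.

# column "b" is invariant (sd == 0) and should be dropped
toy <- data.frame(a = c(0, 1, 2), b = c(1, 1, 1), c = c(5, 4, 3))
invar_omit(toy)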

Check which columns hold character (metadata) rather than SNP data

names(vcf_num_df2)[1:10]
##  [1] "sample"    "pop"       "super_pop" "sex"       "lat"       "lng"      
##  [7] "X1"        "X2"        "X3"        "X4"

New dataframe to store output

vcf_noinvar <- vcf_num_df2

Run invar_omit() on numeric data

vcf_noinvar[, -c(1:6)] <- invar_omit(vcf_noinvar[, -c(1:6)])
## Dataframe of dim 2504 7212 processed...
## 1921 columns removed

Create an object to store the number of invariant columns removed

my_meta_N_invar_cols <- 1921  # from the invar_omit() output above

Remove low-quality data

find_NAs <- function(x){
  # flag the NAs with TRUE/FALSE
  NAs_TF <- is.na(x)
  
  # return the positions (indices) of the NAs
  i_NA <- which(NAs_TF == TRUE)
  
  return(i_NA)
}
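
A quick test on a toy vector shows what the function returns: the positions of the NAs, not their count.

find_NAs(c(1, NA, 3, NA))
# expect: 2 4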
# number of rows (individuals)
N_rows <- nrow(vcf_noinvar)

# vector to hold output (number of NAs)
N_NA <- rep(x = 0, times = N_rows)

# total number of columns (SNPs)
N_SNPs <- ncol(vcf_noinvar)

cat("This may take a minute...")
## This may take a minute...
for(i in 1:N_rows){
  i_NA <- find_NAs(vcf_noinvar[i,])
  N_NA_i <- length(i_NA)
  N_NA[i] <- N_NA_i
}
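
For reference, base R can compute the same per-row NA counts in a single vectorized call, which is much faster on a dataframe of this size (an equivalent alternative to the loop above, not what the code above does):

# count the NAs in every row at once
N_NA_fast <- unname(rowSums(is.na(vcf_noinvar)))
# all.equal(N_NA, N_NA_fast) should be TRUE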

Check if any row has >50% NAs

# the 50% cutoff, expressed as a number of SNPs
cutoff50 <- N_SNPs*0.5

# percent of SNPs that are NA for each individual
percent_NA <- N_NA/N_SNPs*100
any(percent_NA > 50)
## [1] FALSE
my_meta_N_meanNA_rows <- mean(percent_NA)
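
No individual exceeds the 50% cutoff here, so nothing needs to be removed; if some did, the removal step could look like this sketch (hypothetical, since i_bad is empty for this data):

i_bad <- which(percent_NA > 50)
if(length(i_bad) > 0){
  vcf_noinvar <- vcf_noinvar[-i_bad, ]
}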

Mean Imputation

mean_imputation <- function(df){
  cat("This may take some time...")
  n_cols <- ncol(df)
  
  for (i in 1:n_cols) {
    # get the current column
    column_i <- df[, i]
    
    # get the mean of the current column
    mean_i <- mean(column_i, na.rm = TRUE)
    
    # get the NAs in the current column
    NAs_i <- which(is.na(column_i))
    
    # count the NAs in the current column
    N_NAs <- length(NAs_i)
    
    # replace the NAs in the current column with the column mean
    column_i[NAs_i] <- mean_i
    
    # replace the original column with the updated column
    df[, i] <- column_i
    
  }
  return(df)
}
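
Again, a toy dataframe makes the behavior easy to confirm before touching the real data: the NA in column a should be replaced by mean(c(0, 2)) = 1.

toy_na <- data.frame(a = c(0, NA, 2), b = c(1, 1, 0))
mean_imputation(toy_na)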
names(vcf_noinvar)[1:10]
##  [1] "sample"    "pop"       "super_pop" "sex"       "lat"       "lng"      
##  [7] "X1"        "X2"        "X3"        "X4"
# new copy of the data
vcf_noNA <- vcf_noinvar
vcf_noNA[, -c(1:6)] <- mean_imputation(vcf_noinvar[, -c(1:6)])
## This may take some time...
# new copy of the data
vcf_scaled <- vcf_noNA
vcf_scaled[, -c(1:6)] <- scale(vcf_noNA[, -c(1:6)])
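
A quick sanity check on a single SNP column (column 7 is the first SNP column, after the six metadata columns) should show a mean of approximately 0 and a standard deviation of 1 after scale().

round(mean(vcf_scaled[, 7]), 10) # expect: ~0
sd(vcf_scaled[, 7])              # expect: 1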

write.csv(vcf_scaled, file = "vcf_scaled.csv", row.names = F)

Run the PCA

#vcf_pca <- prcomp(vcf_scaled[, -c(1:6)])

PCA Diagnostics

Default screeplot

#screeplot(vcf_pca)

Calculate explained variation

# PCA_variation <- function(pca_summary, PCs = 2){
#   var_explained <- pca_summary$importance[2,1:PCs]*100
#   var_explained <- round(var_explained, 3)
#   return(var_explained)
# }
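
This function relies on the layout of summary(prcomp(...))$importance, whose second row is “Proportion of Variance”. A small self-contained example on made-up data (independent of the commented-out chunk above) shows that structure:

# toy PCA on random data, just to display the importance matrix
toy_pca <- prcomp(matrix(rnorm(40), ncol = 4))
summary(toy_pca)$importance
# rows: Standard deviation, Proportion of Variance, Cumulative Proportion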

Get summary information

#vcf_pca_summary <- summary(vcf_pca)
#var_out <- PCA_variation(vcf_pca_summary, PCs = 500)

Calculate the cutoff for the rule of thumb (1/N × 100: the percentage each PC would explain if all contributed equally)

# number of dimensions in the data
#N_columns <- ncol(vcf_scaled)

# The value of the cutoff
#cut_off <- 1/N_columns*100
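
For this dataset the arithmetic works out as follows: 7212 SNP columns minus the 1921 invariant columns removed, plus the 6 metadata columns, gives ncol(vcf_scaled) = 5297, so the cutoff would be 100/5297 ≈ 0.019% (note that including the 6 non-SNP columns in the count slightly lowers the cutoff).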

Find the first PC whose explained variation falls below the cutoff

#i_cut_off <- which(var_out < cut_off)
#i_cut_off <- min(i_cut_off)

Save the first value below the cutoff

#my_meta_N_meanNA_rowsPCs <- i_cut_off

Extract the amount of variation explained by the first 3 PCs

#my_meta_N_var_PC123 <- var_out[c(1,2,3)]

Plot percentage variation

# Make the scree plot
# barplot(var_out,
#         main = "Percent variation (%) Scree plot",
#         ylab = "Percent variation (%) explained",
#         names.arg = 1:length(var_out))
# abline(h = cut_off, col = 2, lwd = 2)
# abline(v = i_cut_off)
# legend("topright",
#        col = c(2,1),
#        lty = c(1,1),
#        legend = c("Horizontal line: cutoff",
#                   "Vertical line: 1st PC below the cutoff"))

Plot cumulative percentage variation

#cumulative_variation <- cumsum(var_out)
#plot(cumulative_variation, type = "l")

Plot PCA results

Calculate scores

Get the scores

# call vegan::scores()
#vcf_pca_scores <- vegan::scores(vcf_pca)

# Combine scores with population information into a dataframe
# vcf_pca_scores2 <- data.frame(super_pop = vcf_noNA$super_pop,
#                               vcf_pca_scores)

Plot PC1 versus PC2

# ggpubr::ggscatter(data = vcf_pca_scores2,
#                   y = "PC2",
#                   x = "PC1",
#                   color = "super_pop",
#                   shape = "super_pop",
#                   main = "PCA Scatterplot",
#                   xlab = "PC1 (1.9% of variation)",
#                   ylab = "PC2 (1.1% of variation)")

Plot PC2 versus PC3

# ggpubr::ggscatter(data = vcf_pca_scores2,
#                   y = "PC3",
#                   x = "PC2",
#                   color = "super_pop",
#                   shape = "super_pop",
#                   main = "PCA Scatterplot",
#                   xlab = "PC2 (1.1% of variation)",
#                   ylab = "PC3 (1.0% of variation)")

Plot PC1 versus PC3

# ggpubr::ggscatter(data = vcf_pca_scores2,
#                   y = "PC3",
#                   x = "PC1",
#                   color = "super_pop",
#                   shape = "super_pop",
#                   main = "PCA Scatterplot",
#                   xlab = "PC1 (1.9% of variation)",
#                   ylab = "PC3 (1.0% of variation)")

Follow-up / Alternative analyses

K-means cluster analysis

Keep the sample information (super_pop, column 1) and the first PCs (columns 1-6)

#vcf_pca_scores3 <- vcf_pca_scores2[, c(1:6)]

Subset the data

#vcf_pca_scores_best <- vcf_pca_scores3[, c(1:my_meta_N_meanNA_rowsPCs)]
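
Once the PCA chunk above has been run and vcf_pca_scores_best exists, a minimal k-means sketch could look like the following (kept commented out like the rest of this section; choosing five centers is an assumption meant to mirror the five 1000 Genomes super-populations, and super_pop is assumed to sit in column 1):

# set.seed(1)  # k-means is stochastic; fix the seed for reproducibility
# km <- kmeans(vcf_pca_scores_best[, -1], centers = 5, nstart = 25)

# cross-tabulate the clusters against the known super-populations
# table(km$cluster, vcf_pca_scores_best$super_pop)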