Removing Fixed Allels from SNP data

Learning objectives

This lesson introduces the concept of invariant columns and why they should be removed. It also provides a function to remove them.

All of this material will appear on the exam. Take notes on the workflow, functions, and concepts.

Main objectives

By the end of this lesson you will

Understand what can lead to a column of SNP data being invariant.
Know why invariant features can cause problem when preparing data for analysis or running machine learning algorithms on it
Know how to use a function I provide for removing invariant columns

Review

Setting your working directory
Loading .vcf files
Preparing .vcf files for analysis

Introduction

SNPs are Single Nucleotide Polymorphisms. This means that they are by definition polymorphic - more than one nucleotide occurs in that position (locus) in different individual in a species. For example, most individuals in a population may be G|G, but some are G|A or even A|A.

The SNP concept applies at the level of a species. Within separate populations of a species (or individuals genotyped for a study) only one allele of the SNP may be present. This means that though there is genetic variation at that locus in the level of species, for the the population or sample for a study there is no variation.

If all individuals in a population are homozygous for either the major allele (e.g. all are MM), or all are homozygous for the minor allele (e.g. all are mm) we say that the allele is fixed in that population. They other allele is present in other populations, but not all of them. When an allele is fixed in a population it is actually very useful for distinguishing that population genetically from others.

Similarly, for a sample of individuals in a single study, all individuals genotyped may turn out to be the same type of homozygote for a SNP - all are mm or MM. A different sample may have yield heterozygotes or a mix of both types of homozygotes (mm and MM), but for the data at hand, only one type of homozygote is present. While the allele is not necessarily fixed in any population, it is invariant in the individuals that happened to be gentoyped.

Implications of invariant SNPs

SNP data is converted to numeric genotype scores for analysis with machine learning methods such as PCA. If a SNP is invariant in all of the individuals genotyped in a sample for a study, the entire column (aka feature) in the dataframe will be either 0 (all individuals have the major allele) or 2 (all individuals have the minor allele). Similarly, if all individuals are heterozygotes, the entire column will be 1.

Completely invariant alleles in a dataset present 2 issues:

Information content: They provide no information about differences between individuals and groups in the data. You therefore expend computational effort on worthless features.
Mathematical problems: They may cause errors in some mathematical operations and R functions.

Invariant features, information content and computational effort

When a column in a dataset is invariant, whether a SNP or anything else, it doesn’t provide you any useful information. For example, if I want to predict whether taking Advanced Placement (AP) Biology predicts success in my intro biology course, and everyone in the class has taken AP Bio, then I can’t learn or conclude anything.

Invariant features, however, still have to be handled by the computer. They therefore eat up computational resources. Since SNP data can have tens or ever hundreds of thousands of features, this can be a major waste of computational time. It is therefore advisable to remove invariant features to speed up the your analyses.

Invariant features and mathematical operations

A feature that is invariant will have a standard deviation of 0. For example:

x <- c(2,2,2,2,2,2)
mean(x)

## [1] 2

sd(x)

## [1] 0

While we can calculate the mean and standard deviation of an invariant feature, we can’t scale it because this requires division by 0, which is not defined: there is no answer to, for example, x/0.

More generally, machine learning algorithms can run into problems with invariant features and fail to work. It is therefore advisable to remove invariant features before an analysis.

Removing invariant features

There are several ways to identify if a column in a dataframe is invariant, but probably the easiest is to calculate the standard deviation. The standard deviation is a measure of variation in the data, and if everything in a column is the same, there is no variation and the sd = 0. You could similarly calculate the range; if all the data in a column is 2, then the minimum is 2, the max is 2, and the range is 2 - 2 = 0.

Here’s some example data that shows that this works for all 3 possible forms of invariant SNP genotypic scores data:

First, some fake data. Add c() to make the vectors.

# add the c() function to make all of the vectors

# All samples are homozygous for major allele
SNP_fixed_0    <- c(0, 0, 0, 0, 0, 0)        # TODO

# All samples are homozygous for minor allele
SNP_fixed_2    <- c(2, 2, 2, 2, 2, 2)        # TODO

# All samples are heterozygous, e.g "A|G"
SNP_all_hetero <- c(1, 1, 1, 1, 1, 1)        # TODO

Now the standard deviations with sd():

# add sd() to calculate the SDs
sd(SNP_fixed_0)     # TODO

## [1] 0

sd(SNP_fixed_2)

## [1] 0

sd(SNP_all_hetero)

## [1] 0

Now let’s make a larger dataframe with some features that aren’t invariant.

First, some invariant features:

# Mixture of the 2 homozygotes
SNP_mix_vs1    <- c(0, 0, 0, 2, 2, 2)

# Mixture of the 2 homozygotes 
## and heterozygotes
SNP_mix_vs2    <- c(0, 0, 1, 1, 2, 2)

We put these into a dataframe:

# add data.frame() and assign to an object
## called df to look at the data
df <- data.frame(SNP_fixed_0,   # TODO
                 SNP_fixed_2,
                 SNP_mix_vs1,
                 SNP_mix_vs2)

df

##   SNP_fixed_0 SNP_fixed_2 SNP_mix_vs1 SNP_mix_vs2
## 1           0           2           0           0
## 2           0           2           0           0
## 3           0           2           0           1
## 4           0           2           2           1
## 5           0           2           2           2
## 6           0           2           2           2

To identify which columns are invariant we first calculate all the SDs.

# add sd() to calculate the SDs
# add [, 1] etc to each line
## lines to calculate the sd on each column
## The first one is shown as an exame

#add [, 1]
sd_col01 <- sd(df[,1]) # EXAMPLE

#add [, 2]
sd_col02 <- sd(df[,2]) # TODO

#add [, 3]
sd_col03 <- sd(df[,3]) # TODO

#add [, 4]
sd_col04 <- sd(df[,4]) # TODO

We can put this all into a vector that contains the SDs:

# add c() to make into a vector
## assign it to a vector called sd_vector
sd_vector <- c(sd_col01,  # TODO
               sd_col02,
               sd_col03,
               sd_col04)

Now use which() to see which parts of the vector are equal to 0.

# add which() and == 0
which(sd_vector == 0 ) # TODO

## [1] 1 2

We’ll save this to a vector to hold the indices of the invariant columns.

# assign to a vector called i_invariant

i_invariant <- which(sd_vector == 0)  # TODO

This vector of indices can be used to look at which columns are invariant in our dataframe:

# add [,i_invariant] to see
## the invariant columns

df[,i_invariant] # TODO

##   SNP_fixed_0 SNP_fixed_2
## 1           0           2
## 2           0           2
## 3           0           2
## 4           0           2
## 5           0           2
## 6           0           2

We can then use this to show use which columns DO vary by using negative indexing to remove the invariant columns.

# add  -i_invariant to REMOVE the invariant columns
df[,-i_invariant] # TODO

##   SNP_mix_vs1 SNP_mix_vs2
## 1           0           0
## 2           0           0
## 3           0           1
## 4           2           1
## 5           2           2
## 6           2           2

We can then make a new dataframe of just the variant columns

# assign to a dataframe called df03
df03 <- df[,-i_invariant] #TODO

Look at the results:

df03

##   SNP_mix_vs1 SNP_mix_vs2
## 1           0           0
## 2           0           0
## 3           0           1
## 4           2           1
## 5           2           2
## 6           2           2

A function to remove invariant columns

The function below will remove invariant columns from a dataframe, as long as you add return(x) to the end.

# run this function
## add return(x_no_invar) 
invar_omit <- function(x){
  cat("Dataframe of dim",dim(x), "processed...\n")
  sds <- apply(x, 2, sd, na.rm = TRUE)
  i_var0 <- which(sds == 0)
 
  
  cat(length(i_var0),"columns removed\n")
  
  if(length(i_var0) > 0){
     x <- x[, -i_var0]
  }
  
  ## add return()  with x in it
  return(x)                      #TODO
}

warning("Reminder: Did you add return() to the function?")

## Warning: Reminder: Did you add return() to the function?

We can test it on our data above.

# run invar_omit() on df
invar_omit(df)          # TODO

## Dataframe of dim 6 4 processed...
## 2 columns removed

##   SNP_mix_vs1 SNP_mix_vs2
## 1           0           0
## 2           0           0
## 3           0           1
## 4           2           1
## 5           2           2
## 6           2           2

It should only change data dataframe when columns are invariant. Its always good to test a function to make sure it does what you want. If we run it once, save the output, and run it again on the output we just made, it shouldn’t remove anything the second time.

# run invar_omit() on df
df_no_invar <- invar_omit(df) # TODO

## Dataframe of dim 6 4 processed...
## 2 columns removed

# look out output
df_no_invar

##   SNP_mix_vs1 SNP_mix_vs2
## 1           0           0
## 2           0           0
## 3           0           1
## 4           2           1
## 5           2           2
## 6           2           2

# run invar_omit() on df_no_invar
invar_omit(df_no_invar) # TODO

## Dataframe of dim 6 2 processed...
## 0 columns removed

##   SNP_mix_vs1 SNP_mix_vs2
## 1           0           0
## 2           0           0
## 3           0           1
## 4           2           1
## 5           2           2
## 6           2           2

Worked example

Let’s work through an example with a real VCF file. The steps we need to follow are

Load the vcfR package.
Set our working directory via Session->Set Working Directory -> To source file location
Make sure the .vcffile is present with list.files()
Load the data with vcfR::read.vcfR()
Convert the data to genotype scores with vcfR::extract.gt()
Transpose the data with t()
Remove any invariant features (columns) invar_omit()

Load vcfR

First, load up the vcfR package with library() and check the working directory with getwd().

# call library() on vcfR
library(vcfR)           #TODO

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

# check your working directory with getwd()
getwd()                 #TODO

## [1] "C:/Users/Sonia/OneDrive/Documents/0collegeFall2022/comp bio BIOSC1540/projects/removing invariant col"

Set working directory

Second, set the working directory to the location of this file if it isn’t already. Do this via

Session->Set Working Directory -> To source file location

Make sure the vcf file is presentt

Third, make sure the file all_loci.vcf is present with list.files() and/or list.files(pattern = "vcf").

# call list.files()
list.files()                #TODO

## [1] "all_loci-1.vcf"             "removing_fixed_alleles.Rmd"

# call list.files(pattern = "vcf") 
 list.files(pattern = "vcf")               #TODO

## [1] "all_loci-1.vcf"

warning("Friendly remindeR: Make sure you have set your working directory")

## Warning: Friendly remindeR: Make sure you have set your working directory

Load the vcf file

Fourth, load the .vcf file.

# call vcfR::read.vcfR()
bird_snps_again <- vcfR::read.vcfR("all_loci-1.vcf",
                             convertNA = T)      #TODO

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 8
##   header_line: 9
##   variant count: 1929
##   column count: 81
## 
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 1929
##   Character matrix gt cols: 81
##   skip: 0
##   nrows: 1929
##   row_num: 0
## 
Processed variant 1000
Processed variant: 1929
## All variants processed

warning("RemindeR: If this didn't work, you may not have set your working directory to the location of the vcf file")

## Warning: RemindeR: If this didn't work, you may not have set your working
## directory to the location of the vcf file

Convert the vcf data to numeric data

Fifth, convert to numeric genotypic score (counts of the number of the minor allele) with vcfR::extract.gt().

# call vcfR::extract.gt()
bird_snps_num_again <- vcfR::extract.gt(bird_snps_again, # TODO
           element = "GT",
           IDtoRowNames  = F,
           as.numeric = T,
           convertNA = T)

Transpose the data

Sixth, transpose the data into the format that works with R with t().

# call t() on bird_snps_num_again

bird_snps_num_t_again <- t(bird_snps_num_again) # TODO

Remove invariant columns

Seventh, remove the invariant columns with invar_omit().

# call invar_omit() on bird_snps_num_t_again
bird_snps_no_invar <- invar_omit(bird_snps_num_t_again) #TODO

## Dataframe of dim 72 1929 processed...
## 590 columns removed

Compare the original and the new dfs

# call dim() on bird_snps_num_t_again
dim(bird_snps_num_t_again) #TODO

## [1]   72 1929

# call dim() on bird_snps_no_invar
dim(bird_snps_no_invar)    #TODO

## [1]   72 1339

Preview - dealing with NAs

These data have a lot of NAs. If we just call na.omit() on them, what happens? Call na.omit() on the bird_snps_num_t_again object, then check the dimensions.

# Call na.omit() on bird_snps_num_t
## and assign the output to no_NAs
no_NAs <- na.omit(bird_snps_num_t_again) # TODO

# what is the remaining size of the data?
# why
dim(no_NAs)

## [1]    0 1929