Learning objectives
This lesson introduces the concept of invariant
columns and why they should be removed. It also provides a
function to remove them.
All of this material will appear on the exam. Take notes on the
workflow, functions, and concepts.
Main objectives
By the end of this lesson you will
- Understand what can lead to a column of SNP data being
invariant.
- Know why invariant features can cause problem when preparing data
for analysis or running machine learning algorithms on it
- Know how to use a function I provide for removing invariant
columns
Review
- Setting your working directory
- Loading .vcf files
- Preparing .vcf files for analysis
Introduction
SNPs are Single Nucleotide Polymorphisms. This means that they are by
definition polymorphic - more than one nucleotide
occurs in that position (locus) in different individual
in a species. For example, most individuals in a population may be G|G,
but some are G|A or even A|A.
The SNP concept applies at the level of a species. Within separate
populations of a species (or individuals genotyped for a study) only one
allele of the SNP may be present. This means that though there is
genetic variation at that locus in the level of
species, for the the population or sample for a study there is no
variation.
If all individuals in a population are homozygous for either the
major allele (e.g. all are MM), or all are homozygous for the minor
allele (e.g. all are mm) we say that the allele is
fixed in that population. They other allele is present
in other populations, but not all of them. When an allele is fixed in a
population it is actually very useful for distinguishing that population
genetically from others.
Similarly, for a sample of individuals in a single study, all
individuals genotyped may turn out to be the same type of homozygote for
a SNP - all are mm or MM. A different sample may have yield
heterozygotes or a mix of both types of homozygotes (mm and MM), but for
the data at hand, only one type of homozygote is present. While the
allele is not necessarily fixed in any population, it is
invariant in the individuals that happened to be
gentoyped.
Implications of invariant SNPs
SNP data is converted to numeric genotype scores for
analysis with machine learning methods such as PCA. If a SNP is
invariant in all of the individuals genotyped in a sample for a study,
the entire column (aka feature) in the dataframe will be either 0 (all
individuals have the major allele) or 2 (all individuals have the minor
allele). Similarly, if all individuals are heterozygotes, the entire
column will be 1.
Completely invariant alleles in a dataset present 2 issues:
- Information content: They provide no information
about differences between individuals and groups in the data. You
therefore expend computational effort on worthless features.
- Mathematical problems: They may cause errors in
some mathematical operations and R functions.
Invariant features, information content and computational
effort
When a column in a dataset is invariant, whether a SNP or anything
else, it doesn’t provide you any useful information. For example, if I
want to predict whether taking Advanced Placement (AP) Biology predicts
success in my intro biology course, and everyone in the class has taken
AP Bio, then I can’t learn or conclude anything.
Invariant features, however, still have to be handled by the
computer. They therefore eat up computational
resources. Since SNP data can have tens or ever hundreds of
thousands of features, this can be a major waste of computational time.
It is therefore advisable to remove invariant features to speed up the
your analyses.
Invariant features and mathematical operations
A feature that is invariant will have a standard deviation of 0. For
example:
x <- c(2,2,2,2,2,2)
mean(x)
## [1] 2
sd(x)
## [1] 0
While we can calculate the mean and standard deviation of an
invariant feature, we can’t scale it because this requires division by
0, which is not defined: there is no answer to, for example, x/0.
More generally, machine learning algorithms can run into problems
with invariant features and fail to work. It is therefore advisable to
remove invariant features before an analysis.
Removing invariant features
There are several ways to identify if a column in a dataframe is
invariant, but probably the easiest is to calculate the standard
deviation. The standard deviation is a measure of variation in the data,
and if everything in a column is the same, there is no variation and the
sd = 0. You could similarly calculate the range; if all the data in a
column is 2, then the minimum is 2, the max is 2, and the range is 2 - 2
= 0.
Here’s some example data that shows that this works for all 3
possible forms of invariant SNP genotypic scores data:
First, some fake data. Add c()
to make the vectors.
# add the c() function to make all of the vectors
# All samples are homozygous for major allele
SNP_fixed_0 <- c(0, 0, 0, 0, 0, 0) # TODO
# All samples are homozygous for minor allele
SNP_fixed_2 <- c(2, 2, 2, 2, 2, 2) # TODO
# All samples are heterozygous, e.g "A|G"
SNP_all_hetero <- c(1, 1, 1, 1, 1, 1) # TODO
Now the standard deviations with sd()
:
# add sd() to calculate the SDs
sd(SNP_fixed_0) # TODO
## [1] 0
sd(SNP_fixed_2)
## [1] 0
sd(SNP_all_hetero)
## [1] 0
Now let’s make a larger dataframe with some features that aren’t
invariant.
First, some invariant features:
# Mixture of the 2 homozygotes
SNP_mix_vs1 <- c(0, 0, 0, 2, 2, 2)
# Mixture of the 2 homozygotes
## and heterozygotes
SNP_mix_vs2 <- c(0, 0, 1, 1, 2, 2)
We put these into a dataframe:
# add data.frame() and assign to an object
## called df to look at the data
df <- data.frame(SNP_fixed_0, # TODO
SNP_fixed_2,
SNP_mix_vs1,
SNP_mix_vs2)
df
## SNP_fixed_0 SNP_fixed_2 SNP_mix_vs1 SNP_mix_vs2
## 1 0 2 0 0
## 2 0 2 0 0
## 3 0 2 0 1
## 4 0 2 2 1
## 5 0 2 2 2
## 6 0 2 2 2
To identify which columns are invariant we first calculate all the
SDs.
# add sd() to calculate the SDs
# add [, 1] etc to each line
## lines to calculate the sd on each column
## The first one is shown as an exame
#add [, 1]
sd_col01 <- sd(df[,1]) # EXAMPLE
#add [, 2]
sd_col02 <- sd(df[,2]) # TODO
#add [, 3]
sd_col03 <- sd(df[,3]) # TODO
#add [, 4]
sd_col04 <- sd(df[,4]) # TODO
We can put this all into a vector that contains the SDs:
# add c() to make into a vector
## assign it to a vector called sd_vector
sd_vector <- c(sd_col01, # TODO
sd_col02,
sd_col03,
sd_col04)
Now use which()
to see which parts of the vector are
equal to 0.
# add which() and == 0
which(sd_vector == 0 ) # TODO
## [1] 1 2
We’ll save this to a vector to hold the indices of the invariant
columns.
# assign to a vector called i_invariant
i_invariant <- which(sd_vector == 0) # TODO
This vector of indices can be used to look at which columns are
invariant in our dataframe:
# add [,i_invariant] to see
## the invariant columns
df[,i_invariant] # TODO
## SNP_fixed_0 SNP_fixed_2
## 1 0 2
## 2 0 2
## 3 0 2
## 4 0 2
## 5 0 2
## 6 0 2
We can then use this to show use which columns DO vary by using
negative indexing to remove the invariant columns.
# add -i_invariant to REMOVE the invariant columns
df[,-i_invariant] # TODO
## SNP_mix_vs1 SNP_mix_vs2
## 1 0 0
## 2 0 0
## 3 0 1
## 4 2 1
## 5 2 2
## 6 2 2
We can then make a new dataframe of just the variant columns
# assign to a dataframe called df03
df03 <- df[,-i_invariant] #TODO
Look at the results:
df03
## SNP_mix_vs1 SNP_mix_vs2
## 1 0 0
## 2 0 0
## 3 0 1
## 4 2 1
## 5 2 2
## 6 2 2
A function to remove invariant columns
The function below will remove invariant columns from a dataframe, as
long as you add return(x)
to the end.
# run this function
## add return(x_no_invar)
invar_omit <- function(x){
cat("Dataframe of dim",dim(x), "processed...\n")
sds <- apply(x, 2, sd, na.rm = TRUE)
i_var0 <- which(sds == 0)
cat(length(i_var0),"columns removed\n")
if(length(i_var0) > 0){
x <- x[, -i_var0]
}
## add return() with x in it
return(x) #TODO
}
warning("Reminder: Did you add return() to the function?")
## Warning: Reminder: Did you add return() to the function?
We can test it on our data above.
# run invar_omit() on df
invar_omit(df) # TODO
## Dataframe of dim 6 4 processed...
## 2 columns removed
## SNP_mix_vs1 SNP_mix_vs2
## 1 0 0
## 2 0 0
## 3 0 1
## 4 2 1
## 5 2 2
## 6 2 2
It should only change data dataframe when columns are invariant. Its
always good to test a function to make sure it does what you want. If we
run it once, save the output, and run it again on the output we just
made, it shouldn’t remove anything the second time.
# run invar_omit() on df
df_no_invar <- invar_omit(df) # TODO
## Dataframe of dim 6 4 processed...
## 2 columns removed
# look out output
df_no_invar
## SNP_mix_vs1 SNP_mix_vs2
## 1 0 0
## 2 0 0
## 3 0 1
## 4 2 1
## 5 2 2
## 6 2 2
# run invar_omit() on df_no_invar
invar_omit(df_no_invar) # TODO
## Dataframe of dim 6 2 processed...
## 0 columns removed
## SNP_mix_vs1 SNP_mix_vs2
## 1 0 0
## 2 0 0
## 3 0 1
## 4 2 1
## 5 2 2
## 6 2 2
Worked example
Let’s work through an example with a real VCF file. The steps we need
to follow are
- Load the
vcfR
package.
- Set our working directory via Session->Set Working Directory
-> To source file location
- Make sure the
.vcf
file is present with
list.files()
- Load the data with
vcfR::read.vcfR()
- Convert the data to genotype scores with
vcfR::extract.gt()
- Transpose the data with
t()
- Remove any invariant features (columns)
invar_omit()
Load vcfR
First, load up the vcfR package with library()
and check
the working directory with getwd()
.
# call library() on vcfR
library(vcfR) #TODO
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
# check your working directory with getwd()
getwd() #TODO
## [1] "C:/Users/Sonia/OneDrive/Documents/0collegeFall2022/comp bio BIOSC1540/projects/removing invariant col"
Set working directory
Second, set the working directory to the location of this file if it
isn’t already. Do this via
Session->Set Working Directory -> To source file location
Make sure the vcf file is presentt
Third, make sure the file all_loci.vcf
is present with
list.files()
and/or
list.files(pattern = "vcf")
.
# call list.files()
list.files() #TODO
## [1] "all_loci-1.vcf" "removing_fixed_alleles.Rmd"
# call list.files(pattern = "vcf")
list.files(pattern = "vcf") #TODO
## [1] "all_loci-1.vcf"
warning("Friendly remindeR: Make sure you have set your working directory")
## Warning: Friendly remindeR: Make sure you have set your working directory
Load the vcf file
Fourth, load the .vcf
file.
# call vcfR::read.vcfR()
bird_snps_again <- vcfR::read.vcfR("all_loci-1.vcf",
convertNA = T) #TODO
## Scanning file to determine attributes.
## File attributes:
## meta lines: 8
## header_line: 9
## variant count: 1929
## column count: 81
##
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 1929
## Character matrix gt cols: 81
## skip: 0
## nrows: 1929
## row_num: 0
##
Processed variant 1000
Processed variant: 1929
## All variants processed
warning("RemindeR: If this didn't work, you may not have set your working directory to the location of the vcf file")
## Warning: RemindeR: If this didn't work, you may not have set your working
## directory to the location of the vcf file
Convert the vcf data to numeric data
Fifth, convert to numeric genotypic score (counts of the number of
the minor allele) with vcfR::extract.gt()
.
# call vcfR::extract.gt()
bird_snps_num_again <- vcfR::extract.gt(bird_snps_again, # TODO
element = "GT",
IDtoRowNames = F,
as.numeric = T,
convertNA = T)
Transpose the data
Sixth, transpose the data into the format that works with R with
t()
.
# call t() on bird_snps_num_again
bird_snps_num_t_again <- t(bird_snps_num_again) # TODO
Remove invariant columns
Seventh, remove the invariant columns with
invar_omit()
.
# call invar_omit() on bird_snps_num_t_again
bird_snps_no_invar <- invar_omit(bird_snps_num_t_again) #TODO
## Dataframe of dim 72 1929 processed...
## 590 columns removed
Compare the original and the new dfs
# call dim() on bird_snps_num_t_again
dim(bird_snps_num_t_again) #TODO
## [1] 72 1929
# call dim() on bird_snps_no_invar
dim(bird_snps_no_invar) #TODO
## [1] 72 1339
Preview - dealing with NAs
These data have a lot of NAs. If we just call na.omit()
on them, what happens? Call na.omit()
on the
bird_snps_num_t_again
object, then check the
dimensions.
# Call na.omit() on bird_snps_num_t
## and assign the output to no_NAs
no_NAs <- na.omit(bird_snps_num_t_again) # TODO
# what is the remaining size of the data?
# why
dim(no_NAs)
## [1] 0 1929