What is a Random Forest?

A Random Forest is a versatile classification method known as an “ensemble” method because it builds multiple models and combines their results. In this case, the “Random Forest” is the ensemble, and the individual models are “classification trees.” (A minimal code illustration follows the list of advantages below.)

Advantages of this method:

  1. Random forests can be used for simple or complex classification problems.

  2. Random forests handle nearly any kind of data (non-linear, non-normal, zero-truncated).

  3. Random forests can handle thousands of variables and do not require variable selection, though they can be optimized through variable selection.

  4. Random forests directly identify the variables that are most useful for classification and rank them by classification “importance.”
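
As a minimal illustration of the ensemble idea, the sketch below grows a Random Forest on R’s built-in iris dataset. This is an illustrative aside; the Cedrela data are analyzed later in this module.

library(randomForest)
data(iris)
rf <- randomForest(Species ~ ., data = iris, ntree = 500) #grow an ensemble of 500 classification trees
print(rf) #prints the out-of-bag error estimate and a confusion matrix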

Classification Problem and Research Question

Can genetic information from the nuclear and/or chloroplast genomes (specifically, SNPs) be used to classify specimens from the genus Cedrela to species?

The Data

The nuclear and chloroplast genomes of 189 trees were sampled via target capture and short-read sequencing. These data were aligned to nuclear and chloroplast references, and SNPs were called with a protocol that utilized the Burrows-Wheeler Aligner (bwa), the Genome Analysis Toolkit (GATK), and Samtools. Samtools mpileup generated two Variant Call Format (vcf) files: one for the nuclear genome and one for the chloroplast genome. Vcftools was used to filter the variants and output 0,1,2 genotype matrices.

The nuclear matrix contains information from 189 Cedrela specimens (samples) and 10,560 SNPs. The chloroplast matrix contains information from 189 samples and 652 SNPs.

Exercise 1: Data Manipulation

The purpose of this exercise is to combine the vcftools 0,1,2 genotype matrix output files into one dataframe. The final matrix will contain sample names, species identifications, and the SNP genotypes (one locus per column). Random Forest will use the SNP genotypes as predictor variables to classify the samples to species. We refer to the species variable as the grouping variable and each species as a class within the grouping variable.

Grouping Variable   Number of Classes   Classes
Species             6                   CEAN, CEFI, CEMO, CENE, CEOD, CESA

First, refer to “Learning Module Two: R Basics” to set your working directory to the directory containing the data for this demonstration.

Now load the required packages by highlighting the code in the “Exercise 1” block (starting at line 6 of the R script) and clicking the “Run” button.

library(dplyr) #package used to manipulate data (see links below)
library(tidyr) #package used to manipulate data (see links below)
library(stringr) #package used to manipulate strings (see links below)
library(randomForest) #package required to run the Random Forest analysis (see documentation link below)
library(rpart) #package used to build a single decision tree
library(party) #package used to plot the decision tree
library(partykit) #second package used to plot the decision tree

Package information:

dplyr cheatsheet

dplyr and tidyr pdf

regular expressions cheatsheet

stringr examples

randomForest documentation

Decision Tree Methods in R

Next we will load the data outputs from vcftools and examine them.

Genotype Matrix

The genotype matrix will contain a column for each locus and a row for each sample. For now the variables are called V1:Vn. We will change the column names to match the SNP IDs.

geno<-read.table("RF_nuclear_data_10560.012",header = FALSE,row.names = 1) 
dim(geno) #print the dimensions of the matrix
## [1]   192 10560
head(geno[1:10]) #print the first few rows of the first ten columns
##   V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 0  0  0  0  0  0  1  0  0   0   0
## 1  0  0  0  0  0  1  0  0   0   0
## 2  0  0  0  0  0  1  0  0   0   0
## 3  0  0  0  0  0  1  0  0   0   0
## 4  0  0  0  0  0  1  0  0   0   0
## 5  0  0  0  1  0  1  0  0   0   0
Sample ID list

The sample ID list contains one row per sample. We will attach this column to the front of the genotype matrix.

samples<-read.table("RF_nuclear_data_10560.012.indv",header = FALSE)
dim(samples)
## [1] 192   1
head(samples)
##                V1
## 1   CEOD_4_ced101
## 2  CEFI_61_ced102
## 3  CEOD_95_ced103
## 4 CEOD_142_ced104
## 5 CEOD_157_ced105
## 6 CEFI_201_ced106
SNP ID list

The SNP ID list contains one row per SNP. Right now the SNPs are identified by the gene model name (V1) and the SNP position on the gene model (V2).

snps<-read.table("RF_nuclear_data_10560.012.pos",header = FALSE)
dim(snps)
## [1] 10560     2
head(snps)
##                  V1   V2
## 1 ced.132.ref000004  497
## 2 ced.132.ref000004  503
## 3 ced.132.ref000004  506
## 4 ced.132.ref000007 1186
## 5 ced.132.ref000008 2737
## 6 ced.132.ref000008 2740

We will combine these two columns into one with the unite() function and use the result as headers for the genotype matrix.

snps<-unite(snps,V1,V2,sep="_",col="snp")
Code Breakdown

snps <-       overwrite the snps dataset, effectively “save as”
unite(        use the unite function to combine columns
snps,         from the snps dataframe
V1, V2,       unite columns V1 and V2
sep="_",      separate V1 and V2 with an underscore
col="snp")    call the new column “snp”
head(snps)
##                      snp
## 1  ced.132.ref000004_497
## 2  ced.132.ref000004_503
## 3  ced.132.ref000004_506
## 4 ced.132.ref000007_1186
## 5 ced.132.ref000008_2737
## 6 ced.132.ref000008_2740

Eventually this dataframe will become the column headers for our final dataframe. However, we will also have a column for “sample” ID and a column for “species,” the grouping variable. We can create these rows in the SNP ID list, and they will also become headers for the final dataframe.

temp<-data.frame(c("sample","species")) #create a temporary dataframe with two rows (observations)
head(temp)
##   c..sample....species..
## 1                 sample
## 2                species

To append the SNP ID list to the dataframe “temp,” the column names must match. We can change the column name with the function names() to match the column name from the snps dataframe.

names(temp)[1]<-"snp" 
head(temp)
##       snp
## 1  sample
## 2 species

Append the temp dataframe containing the rows “sample” and “species” to the snps dataframe, and save the new dataframe as “headers.” This dataframe will become the headers for the final dataframe containing SNP genotypes.

headers<-rbind.data.frame(temp,snps)
Code Breakdown

headers <-           new dataframe called headers, effectively “save as”
rbind.data.frame(    function to row-bind and convert the result to type dataframe
temp, snps)          append dataframes temp and snps
head(headers)
##                      snp
## 1                 sample
## 2                species
## 3  ced.132.ref000004_497
## 4  ced.132.ref000004_503
## 5  ced.132.ref000004_506
## 6 ced.132.ref000007_1186

Now we need to transpose the headers dataframe so that it can be used as column headers. The function t() performs this transposition.

t_headers<-t(headers)
dim(t_headers) #results in one row with 10,562 columns
## [1]     1 10562
head(t_headers[1:10])
## [1] "sample"                 "species"               
## [3] "ced.132.ref000004_497"  "ced.132.ref000004_503" 
## [5] "ced.132.ref000004_506"  "ced.132.ref000007_1186"
Grouping Variable

We also need to create a column for the grouping variable, species in this case. We can use the sample ID list to create the grouping variable because the first four letters of each sample ID are also the species code.

species<-data.frame(str_extract(samples$V1,"[CE]..."))
Code Breakdown

species <-      new dataframe called species, effectively “save as”
data.frame(     function to convert data to type dataframe
str_extract(    function to extract a regular expression match
samples$V1,     string input for the function
"[CE]...")      the regular expression pattern
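
Since the species code is simply the first four characters of each sample ID, an equivalent base-R alternative (an aside, not part of the demonstration script) is:

species<-data.frame(substr(as.character(samples$V1),1,4)) #take characters 1-4 of each sample ID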
Final Dataframe

Now we will join the dataframe containing the sample IDs (“samples”), the dataframe containing the grouping variable (“species”), and the dataframe containing the 0,1,2 genotypes (“geno”), and save the joined dataframe as “df.”

df<-cbind.data.frame(samples,species,geno)
Code Breakdown

df <-                     new dataframe called df, effectively “save as”
cbind.data.frame(         function to column-bind and convert the result to type dataframe
samples, species, geno)   join dataframes samples, species, and geno; the dataframes must have the same length (number of rows)

Note: any columns in common will be repeated, but we do not have any repeated columns in this case.

head(df[1:9])
##                V1 str_extract.samples.V1....CE...... V2 V3 V4 V5 V6 V7 V8
## 0   CEOD_4_ced101                               CEOD  0  0  0  0  0  1  0
## 1  CEFI_61_ced102                               CEFI  0  0  0  0  0  1  0
## 2  CEOD_95_ced103                               CEOD  0  0  0  0  0  1  0
## 3 CEOD_142_ced104                               CEOD  0  0  0  0  0  1  0
## 4 CEOD_157_ced105                               CEOD  0  0  0  0  0  1  0
## 5 CEFI_201_ced106                               CEFI  0  0  0  1  0  1  0

Notice that R has arbitrarily assigned column headers as V1-n or another name based on limited information. We can provide R with informative column names. We already prepared our desired column headers in the dataframe “t_headers.” Now we can tell R to use these column names for our dataframe.

colnames(df)<-t_headers
head(df[1:9])
##            sample species ced.132.ref000004_497 ced.132.ref000004_503
## 0   CEOD_4_ced101    CEOD                     0                     0
## 1  CEFI_61_ced102    CEFI                     0                     0
## 2  CEOD_95_ced103    CEOD                     0                     0
## 3 CEOD_142_ced104    CEOD                     0                     0
## 4 CEOD_157_ced105    CEOD                     0                     0
## 5 CEFI_201_ced106    CEFI                     0                     0
##   ced.132.ref000004_506 ced.132.ref000007_1186 ced.132.ref000008_2737
## 0                     0                      0                      0
## 1                     0                      0                      0
## 2                     0                      0                      0
## 3                     0                      0                      0
## 4                     0                      0                      0
## 5                     0                      1                      0
##   ced.132.ref000008_2740 ced.132.ref000009_2070
## 0                      1                      0
## 1                      1                      0
## 2                      1                      0
## 3                      1                      0
## 4                      1                      0
## 5                      1                      0

If desired, we can save this dataframe in this format for future use.

write.csv(df,"randomForest_df.csv",row.names = FALSE,quote = FALSE)

Next time we run this analysis, we can skip the above steps and read in the prepared data with read.csv():

df<-read.csv("randomForest_df.csv",header=TRUE)

For this and all following exercises, we will use the variable “species” as the grouping variable and the SNP variables (columns 3-10,562) as the predictor variables. The variable “sample” is no longer required for classification, but it may be helpful later on.
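
If you would like to keep the sample IDs for later use (for example, to label predictions), one option (a hypothetical helper, not part of the original script) is to stash them in a vector before deleting the column:

sample_ids<-df$sample #hypothetical helper; preserves the IDs after the column is removed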

Remove the “sample” variable from df.

df[1]<-NULL
head(df[1:4])
##   species ced.132.ref000004_497 ced.132.ref000004_503
## 0    CEOD                     0                     0
## 1    CEFI                     0                     0
## 2    CEOD                     0                     0
## 3    CEOD                     0                     0
## 4    CEOD                     0                     0
## 5    CEFI                     0                     0
##   ced.132.ref000004_506
## 0                     0
## 1                     0
## 2                     0
## 3                     0
## 4                     0
## 5                     0

Exercise 2: Classification Trees

When building a classification model, it is good practice to first partition the data into a training dataframe and a validation/test dataframe to avoid overfitting the model to potential outlier samples. The training dataframe is used to build the model, and the validation dataframe is used to estimate the model’s error. There are many different methods for creating training and validation sets. With a large sample size, 80% of the data may be used for model training and 20% reserved for validation. With smaller sample sizes (n), 90% of the data can be used for training and 10% for validation. We will use the 80:20 method; note that a simple random draw does not guarantee that every species appears in the validation set (a stratified split, sketched below, can enforce that). In cases of extremely small sample size, a Leave-One-Out Cross-Validation method may be selected: n-1 samples are used to train the model and the remaining sample is used to test it.

Learn more: Lever et al. 2016
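
For reference, here is a sketch of a stratified 80:20 split that samples within each species so that every class lands in both sets. It is an aside with arbitrary object names; the demonstration below instead draws rows uniformly at random.

set.seed(42) #arbitrary seed so the split is repeatable
by_class<-split(seq_len(nrow(df)),df$species) #row indices grouped by species
strat<-unlist(lapply(by_class,function(rows) sample(rows,floor(0.8*length(rows))))) #draw 80% within each species
df.train_strat<-df[strat,] #80% of each species
df.validate_strat<-df[-strat,] #remaining ~20% of each species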

First let’s examine the data we are using for this analysis.

dim(df) #show the dimensions of df.
## [1]   192 10561

The function summary() will display a summary of the data contained within a dataframe, or within a single column of a dataframe with the dataframe$column_name syntax.

summary(df$species) #displays the number of samples per species group.
## CEAN CEFI CEMO CENE CEOD CESA 
##   18   39   22    8   97    8

We will create our training and validation dataframes by randomly drawing 80% of samples from the dataframe (“df”) to become the training dataframe (“df.train”), and saving the remaining samples in the validation dataframe called “df.validate.”

Randomly draw 80% of samples.

train<-sample(nrow(df), 0.8*nrow(df)) 

Use the list “train” to filter the full dataframe (“df”) and save rows matching “train” to a new dataframe called “df.train.”

Code Breakdown

train <-        new list called train, effectively “save as”
sample(         function to randomly select rows
nrow(df),       select from all rows in df
0.8*nrow(df))   select 80% of the rows in df
df.train<-df[train,] #use this list to filter the full dataframe ("df") and save this dataframe as "df.train."
dim(df.train)
## [1]   153 10561
summary(df.train$species)
## CEAN CEFI CEMO CENE CEOD CESA 
##   14   29   20    8   76    6

Use the list “train” to filter df by selecting the rows not matching the 80% training dataframe.

Note: here the “-” symbol excludes the rows listed in “train.”

df.validate<-df[-train,] 
dim(df.validate)
## [1]    39 10561
summary(df.validate$species)
## CEAN CEFI CEMO CENE CEOD CESA 
##    4   10    2    0   21    2
Grow a Tree

We are now going to build a classification tree that classifies all samples to species using all 10,560 nuclear SNPs.

First we want to make sure that the 0,1,2 genotypes are treated as discrete variables (factors) instead of continuous variables (integers), which is the default.

df.train[2:10561] <- lapply(df.train[2:10561], as.factor)
Code Breakdown

df.train[2:10561] <-   overwrite columns 2:10561, effectively “save as”
lapply(                function to apply another function to multiple objects, in this case columns 2:10561
df.train[2:10561],     df.train columns 2:10561
as.factor)             convert them to data type factor

Grow the tree.

tree<-rpart(species~., data=df.train)
Code Breakdown

tree <-          new object called tree, effectively “save as”
rpart(           function that builds the classification tree
species ~ .,     grouping variable ~ predictor variables; “.” indicates “all”
data=df.train)   from dataframe df.train
plot(as.party(tree))

Unfortunately, building a classification tree with all branches tends to result in a tree that is too large and might be overfitted to outlier or otherwise “noisy” samples. One way to alleviate this is to prune the classification tree based on the lowest 10-fold cross-validation error. The 10-fold cross-validation error, also known as “xerror,” is essentially a measure of error calculated by treating “knowns” as “unknowns” and classifying them with a tree with n branches. The rpart object we created, “tree,” includes a complexity parameter table (cptable) in which trees with different numbers of splits (nodes) are generated and xerror is calculated for each.

tree$cptable #use this syntax to print the cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.22727273      0 1.0000000 1.0000000 0.08031852
## 2 0.15584416      2 0.5454545 0.5714286 0.07271151
## 3 0.05194805      3 0.3896104 0.4285714 0.06607104
## 4 0.01298701      4 0.3376623 0.5194805 0.07058831
## 5 0.01000000      5 0.3246753 0.5454545 0.07168853
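
rpart also provides plotcp(), which plots the cross-validation error against the complexity parameter and marks the conventional one-standard-error cutoff; it is a quick visual companion to the table above:

plotcp(tree) #x-axis: cp and tree size; y-axis: cross-validation relative error (xerror)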

We want to prune the tree so that we have the fewest number of splits and the lowest xerror. In theory, this will give us the highest accuracy for classifying unknowns.

pruned_tree<- prune(tree, cp=tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
Code Breakdown

pruned_tree <-                        new tree named pruned_tree, effectively “save as”
prune(                                function that performs the pruning
tree,                                 the tree to prune
cp=tree$cptable[                      the complexity parameter (CP) to prune with comes from the cptable of the object tree
which.min(tree$cptable[,"xerror"]),   select the row with the minimum xerror
"CP"])                                take the value in the column called “CP”

Plot the pruned tree.

plot(as.party(pruned_tree))

Exercise 3: Random Forest Analysis with Observed Data

Now we use the function randomForest() to grow a forest of classification trees based on df.train. Each tree is constructed by selecting a predictor variable at random and classifying all samples based on that variable, then selecting another variable at random to further classify the samples, and so on until all samples have been classified into one of the six species groups. The number of predictor variables tried at each split is set with the parameter “mtry,” and the number of trees in the forest is set with the parameter “ntree.” For now we will keep the defaults: ntree=500 and mtry=sqrt(10560), which rounds down to 102.
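
For clarity, the same model can be written with the defaults spelled out; the call below is equivalent to the one we run next (mtry defaults to the square root of the number of predictors, rounded down), so there is no need to run both:

fit.forest_explicit<-randomForest(species~., data=df.train, ntree=500, mtry=floor(sqrt(10560)), importance=TRUE) #same model with parameters made explicit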

fit.forest<-randomForest(species~., data=df.train, importance=TRUE) #if this takes longer than 2 minutes something has gone wrong. 
print(fit.forest)
## 
## Call:
##  randomForest(formula = species ~ ., data = df.train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 102
## 
##         OOB estimate of  error rate: 20.26%
## Confusion matrix:
##      CEAN CEFI CEMO CENE CEOD CESA class.error
## CEAN    8    0    2    0    1    0   0.2727273
## CEFI    0   21    0    0    9    2   0.3437500
## CEMO    0    0   14    0    3    0   0.1764706
## CENE    0    0    0    3    4    0   0.5714286
## CEOD    1    0    3    2   74    0   0.0750000
## CESA    0    4    0    0    0    2   0.6666667

Your fit.forest may not match the fit.forest shown above. Instability is a feature of random forest analysis, and complete reproducibility across replicate analyses cannot be assured; this is due to the random variable selection used to build the ensemble of classification trees.
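
If you need replicate runs to match (for teaching or debugging), fix R’s random number seed before the train/validate split and before fitting; the value 123 is arbitrary:

set.seed(123) #arbitrary seed; makes subsequent sample() and randomForest() draws repeatable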

A key feature of Random Forests is the Gini node impurity (Gini index), a direct measure of the “importance,” or utility, of a predictor variable for classifying samples with low error. Node impurity decreases each time a variable is used during the Random Forest analysis; more specifically, node impurity decreases when samples are classified. After each classification event at a node, the samples remaining to be classified are more alike (i.e., belong to the same class/species), and descendant nodes have lower node impurity. Variables that classify a greater number of samples across the Random Forest have a higher decrease in node impurity, which is estimated as a mean over all 500 classification trees in the Random Forest. The scale of the Gini index is based on the number of samples remaining to be classified after each classification event, such that the Mean Decrease in Node Impurity will be larger for a larger sample size.

We can examine the importance of each SNP for classifying our samples.

gini<-data.frame(importance(fit.forest, type=2))
gini$snp<-row.names(gini)
sorted_gini<-arrange(gini,desc(MeanDecreaseGini))
head(sorted_gini)
##   MeanDecreaseGini                    snp
## 1        0.8963676  ced.132.ref002071_496
## 2        0.5987892  ced.132.ref002071_438
## 3        0.5843808 ced.132.ref000096_1829
## 4        0.5663042  ced.132.ref000830_487
## 5        0.5374718  ced.132.ref003714_499
## 6        0.3898398  ced.132.ref001654_681
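
randomForest also includes a built-in importance plot, varImpPlot(); with type=2 it ranks variables by Mean Decrease in Gini (the n.var and main settings below are illustrative choices):

varImpPlot(fit.forest, type=2, n.var=10, main="Top 10 SNPs by Mean Decrease in Gini")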

Exercise 4: Random Forest Analysis with Randomized Data

Because Random Forest selects variables and samples at random and classifies them into a grouping variable, the Random Forest analysis may classify a sample correctly simply by chance. For example, about half of the samples in our dataframe are C. odorata, so by chance the model can guess that a sample is C. odorata (“CEOD”) and be correct. We call this the baseline classification error. In this case, we expect the Random Forest to classify a sample accurately as C. odorata about 50% of the time. By randomizing the classes in the grouping variable (species), we can better estimate the baseline classification error we expect by chance and compare it to our observed error.
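
We can compute this majority-class baseline directly from the grouping variable; always guessing the most common class (CEOD) is correct this fraction of the time (about 0.51 with the counts shown earlier):

max(table(df$species))/nrow(df) #proportion of samples in the most common class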

First we will make a copy of df. We do this to avoid erroneously changing the observed data.

df_copy<-data.frame(df)

Next we will save the grouping variable “species” as a dataframe. We will randomize this column and join it to the data.

rand.col<-data.frame(df_copy$species)
head(rand.col)
##   df_copy.species
## 1            CEOD
## 2            CEFI
## 3            CEOD
## 4            CEOD
## 5            CEOD
## 6            CEFI

We must delete the grouping variable “species” in the main dataframe. We will replace this column with the new, randomized grouping variable.

df_copy$species<-NULL

Randomize the values in the grouping variable “species” with the function sample().

rand.col<-data.frame(rand.col[sample(nrow(rand.col)),])
head(rand.col)
##   rand.col.sample.nrow.rand.col.....
## 1                               CEOD
## 2                               CEFI
## 3                               CEOD
## 4                               CEFI
## 5                               CEOD
## 6                               CEOD

Re-join the randomized column containing the grouping variable with the main dataframe.

Note: We have not made changes to the genotypes, just the grouping variable. The resulting model should therefore classify samples no better than chance.

df_copy<-cbind(rand.col,df_copy)
names(df_copy)[1]<-"species" #name the randomized column "species" 

Use the same method as above to make the training and validation dataframes.

train_copy<-sample(nrow(df_copy), 0.8*nrow(df_copy))
df.train_copy<-df_copy[train_copy,]
df.validate_copy<-df_copy[-train_copy,]

Build the random forest with the same data, but a randomized grouping variable.

fit.forest_randomized<-randomForest(species~., data=df.train_copy, importance=TRUE)
print(fit.forest_randomized)
## 
## Call:
##  randomForest(formula = species ~ ., data = df.train_copy, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 102
## 
##         OOB estimate of  error rate: 49.67%
## Confusion matrix:
##      CEAN CEFI CEMO CENE CEOD CESA class.error
## CEAN    0    1    0    0   14    0  1.00000000
## CEFI    0    1    0    0   27    0  0.96428571
## CEMO    0    0    0    0   18    0  1.00000000
## CENE    0    0    0    0    6    0  1.00000000
## CEOD    0    3    0    0   76    0  0.03797468
## CESA    0    1    0    0    6    0  1.00000000

As expected, the baseline classification error is about 50%, and most samples are classified as C. odorata, which is correct about half the time just by chance.

Exercise 5: Interpreting Results

Compare to the observed classification error confusion matrix.

print(fit.forest)
## 
## Call:
##  randomForest(formula = species ~ ., data = df.train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 102
## 
##         OOB estimate of  error rate: 20.26%
## Confusion matrix:
##      CEAN CEFI CEMO CENE CEOD CESA class.error
## CEAN    8    0    2    0    1    0   0.2727273
## CEFI    0   21    0    0    9    2   0.3437500
## CEMO    0    0   14    0    3    0   0.1764706
## CENE    0    0    0    3    4    0   0.5714286
## CEOD    1    0    3    2   74    0   0.0750000
## CESA    0    4    0    0    0    2   0.6666667

The “OOB estimate of error rate” is the overall error rate, and this measure for observed data is much lower than that for the data with a randomized grouping variable.

Let’s examine the fit.forest object with the function str() (str stands for “structure”).

str(fit.forest)

Just as the $ syntax can be used to examine a variable of a dataframe, it can also be used to examine part of an object. Specifically, we can look at the classification error rate. In fit.forest$err.rate we see one row of error rates recorded per classification tree in the random forest, 500 in total, tracking the error as trees are added. This error rate is a mean error across all samples in df.train. Let’s make two histograms to compare the error rates of the observed and randomized data.

Note: a mean error rate of 1.0 is equivalent to 100% mean error, or incorrect classification of all samples in df.train.
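
For reference, err.rate is a matrix: the first column, "OOB", holds the overall out-of-bag error, and the remaining columns hold per-species error. To print just the final overall error:

tail(fit.forest$err.rate[,"OOB"],1) #overall OOB error after all 500 trees; matches print(fit.forest)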

Print a frequency histogram of error rates for the observed and the randomized data. We refer to this type of graph in R as “quick and dirty.” The “Advanced Species Classification with Random Forests” packet will provide instructions for nicer graphs.

hist(fit.forest$err.rate)

hist(fit.forest_randomized$err.rate)

As the confusion matrices show, the classification error rates for the observed data are much lower than those for the randomized data.

Conclusions

The observed data performs much better than the randomized data for classifying species. This tells us that Random Forest, though not perfect, is able to classify species based on our data, even for species with low sample size. Additionally, randomization is useful for establishing the baseline error rate for our genotype data.

Exercise 6: Random Forests to Classify “Unknowns”

The Random Forest trained with observed data can now be tested with df.validate. The samples in df.validate are treated as unknowns, and each is classified with the Random Forest we built from df.train and the observed data (fit.forest).

forest.pred<-predict(fit.forest, df.validate) #classifies new cases
forest.perf<-table(df.validate$species, forest.pred, dnn=c("Actual", "Predicted"))
print(forest.perf)
##       Predicted
## Actual CEAN CEFI CEMO CENE CEOD CESA
##   CEAN    5    0    1    0    1    0
##   CEFI    0    6    0    0    1    0
##   CEMO    0    0    5    0    0    0
##   CENE    0    0    0    1    0    0
##   CEOD    0    2    0    0   15    0
##   CESA    0    0    0    0    1    1
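
As a quick summary (an aside, not in the original script), the overall validation accuracy is the proportion of samples on the diagonal of this table, about 0.85 here:

sum(diag(forest.perf))/sum(forest.perf) #proportion of validation samples classified correctly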

With the exception of species classes with low sample sizes and potential outliers, the Random Forest classifies “unknowns” fairly accurately from nuclear genotypes.