Proportion Tests

Details

This is an R Markdown document. These data pretain to Finch et al. 2017 Applications in Plant Sciences.

Structure

Douglas-fir two or three annuals rings per increment core 2-6 trees per population 2 regions: Coast Range and Cascades Range random forest classification

Sample totals for this analysis

Cascades: 308

Coast: 252

Total: 560

Questions

What is the effect of unbalanced class size on classification accuracy? What other changes might be associated with this difference in sampling?

Purpose

One of the main results of this analysis was based on the observation that samples from the Cascades were easier to classify to region, but this observation was based on one iteration of the analysis and the classes were not balanced. If the analysis is sensitive to balanced samples, then it is possible that classification assymetry is actually due to unbalanced samples. And iff the analysis is sensitive to balanced samples, all analyses using these techniques should have balanced classes.

Background

The question arose because one iteraction of the model with unbalanced design suggested that the Cascades samples were more frequently accurately classified to region than the Coast sample. But when I investigated why that might be the case (variability among samples from the same core) I saw that really the Coast samples might be easier to classify based on the numbers I was seeing, but that Cascades had about 20 more samples (see http://rpubs.com/finchnSNPs/250368). I wanted to make sure that the result was not because of unbalanced classes.

library(randomForest)
library(ggplot2)
library(gridExtra)
library(ROCR)
library(dplyr)

Below is sample code that I ran to generate these data. This runs for 500 iterations, so the data used for t tests and graphs has been saved independently to be loaded.

#load matrix containing sample metadata and abundance data from the mass spectrometer for 947 molecules in wood. 
df_250_1<-read.csv("Cascade vs Coast 250 mmu and 1.csv")
#subset matrix to only include the grouping variable and the molecular abundances.
df<-df_250_1[c(4,7:952)]

#generate empty list
fit.list <- list()

for (i in 1:length(c(1:500))){ #for all items (i) in the list of chosen length--change to number of iterations you want to perform. 
  cat("Repetition number ") #Repetition number X will let us know that our loop is working. 
  cat(i)
  cat("\n")
  
  #Subset data so that classes are in equal proportion. 
  cas<-df[sample((1:nrow(df))[df$Region == "Cascades"], 94, replace=FALSE), ]
  coa<-df[sample((1:nrow(df))[df$Region == "Coast"], 94, replace=FALSE), ]
  
  #recombine those new dataframes of equal proportion.
  df<-rbind(cas,coa)
  
  #this will run randomForest for each (i) in fit.list
  fit.list[[i]] <- (fit.forest<-randomForest(Region~., data=df, importance=TRUE)) #Grows the forest
} #end of for loop

The output of fit.forest looks like this:

Call:
 randomForest(formula = Region ~ ., data = df, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 30

        OOB estimate of  error rate: 19.15%
Confusion matrix:
         Cascades Coast class.error
Cascades      125     9  0.06716418
Coast          27    27  0.50000000

We are interested in fit.forest$err.rate which contains the error rates out-of-bag error rate, which is an averaged error rate for both classes, and the error rate for each class in the grouping variable. Out grouping variable is region and out classes are Coast and Cascades.

head(fit.forest$err.rate)
outputs this: 
           OOB  Cascades     Coast
[1,] 0.3424658 0.1764706 0.7272727
[2,] 0.3693694 0.2560976 0.6896552
[3,] 0.3283582 0.2346939 0.5833333
[4,] 0.3116883 0.2105263 0.6000000
[5,] 0.3253012 0.2439024 0.5581395
[6,] 0.2768362 0.2109375 0.4489796

Below is the code to obtain this data from the fit.forest object. You can also pull out other things. Check out str(fit.forest).

#make a list for each class in the grouping variable.
fit.median.cas <- list()
for (i in 1:length(fit.list)){#for each iteration of fit.forest in fit.list (500 iterations in this example) this will take the median($err.rate) for the Cascades group. fit.median.cas will contain 500 median class error rates for the Cascades class. 
  fit.median.cas[[i]] <- median(fit.list[[i]]$err.rate[,'Cascades'])
}
#Do the same for the Coast group
fit.median.coa <- list()
for (i in 1:length(fit.list)){
  fit.median.coa[[i]] <- median(fit.list[[i]]$err.rate[,'Coast'])
}
# Unlisting the values and make dataframes.
fit.median.coa <- unlist(fit.median.coa)
err.obs.coa<-data.frame(fit.median.coa)
fit.median.cas <- unlist(fit.median.cas)
err.obs.cas<-data.frame(fit.median.cas)

#make a new variable called group for both dataframes
err.obs.cas$group<-rep("Cascades", length(err.obs.cas))
names(err.obs.cas)[1]<-"class.err"

err.obs.coa$group<-rep("Coast", length(err.obs.coa))
names(err.obs.coa)[1]<-"class.err"

#combine the dataframes. Now we have 1000 observations, 500 medians from the Coast class and 500 medians from the Cascades class and each median resulted from the class error rates for 500 trees. 500 trees/random forest = 500 error rates per class within 500 random forests/class error medians per class.
err.rates.class_equal<-rbind(err.obs.cas,err.obs.coa)
err.rates.class_equal$class.err<-(err.rates.class_equal$class.err)*100
err.rates.class_equal$class.err<-100-(err.rates.class_equal$class.err)

#save that dataframe to avoid overwriting.
write.csv(err.rates.class_equal, "err.rates.class_equal.csv")

Now for this R markdown I will just read in the data for each test and show the results.

Test 1: Coast and Cascades at equal proportions (1:1 or 94:94)

err.rates.class_equal<-read.csv("err.rates.class_equal.csv")
err.rates.class_equal_plot<-ggplot(err.rates.class_equal, aes(x = group, y = class.err))+
  geom_boxplot(size=.3,outlier.colour = "black", outlier.shape = 1)+
  #scale_fill_manual(values=c("red","blue"))+
  theme_bw()+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0),
        legend.position="none",
        axis.title=element_text(size=8),
        axis.text=element_text(size=6))+
  xlab("Class")+ylab("Classification Accuracy (%)")+
  ggtitle("Coast and Cascades at Equal Proportions")

t.test(class.err ~ group, data=err.rates.class_equal, paired=TRUE)

## 
##  Paired t-test
## 
## data:  class.err by group
## t = -161.58, df = 499, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.32631 -14.95806
## sample estimates:
## mean of the differences 
##               -15.14219

Test 2: Cascades at a 40 sample deficit (74:114)

err.rates.class_cas20def<-read.csv("err.rates.class_cas20def.csv")
err.rates.class_cas20def_plot<-ggplot(err.rates.class_cas20def, aes(x = group, y = class.err))+
  geom_boxplot(size=.3,outlier.colour = "black", outlier.shape = 1)+
  #scale_fill_manual(values=c("red","blue"))+
  theme_bw()+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0),
        legend.position="none",
        axis.title=element_text(size=8),
        axis.text=element_text(size=6))+
  xlab("Class")+ylab("Classification Accuracy (%)")+
  ggtitle("Cascades at 40 Sample Deficit")

t.test(class.err ~ group, data=err.rates.class_cas20def, paired=TRUE)

## 
##  Paired t-test
## 
## data:  class.err by group
## t = -281.95, df = 499, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -28.91730 -28.51707
## sample estimates:
## mean of the differences 
##               -28.71719

Test 3: Cascades at a 80 sample deficit (54:134)

err.rates.class_cas40def<-read.csv("err.rates.class_cas40def.csv")
err.rates.class_cas40def_plot<-ggplot(err.rates.class_cas40def, aes(x = group, y = class.err))+
  geom_boxplot(size=.3,outlier.colour = "black", outlier.shape = 1)+
  #scale_fill_manual(values=c("red","blue"))+
  theme_bw()+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0),
        legend.position="none",
        axis.title=element_text(size=8),
        axis.text=element_text(size=6))+
  xlab("Class")+ylab("Classification Accuracy (%)")+
  ggtitle("Cascades at 80 Sample Deficit")

t.test(class.err ~ group, data=err.rates.class_cas40def, paired=TRUE)

## 
##  Paired t-test
## 
## data:  class.err by group
## t = -443.06, df = 499, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -45.94996 -45.54423
## sample estimates:
## mean of the differences 
##               -45.74709

Test 4: Coast at a 40 sample deficit (114:74)

err.rates.class_coa20def<-read.csv("err.rates.class_coa20def.csv")
err.rates.class_coa20def_plot<-ggplot(err.rates.class_coa20def, aes(x = group, y = class.err))+
  geom_boxplot(size=.3,outlier.colour = "black", outlier.shape = 1)+
  #scale_fill_manual(values=c("red","blue"))+
  theme_bw()+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0),
        legend.position="none",
        axis.title=element_text(size=8),
        axis.text=element_text(size=6))+
  xlab("Class")+ylab("Classification Accuracy (%)")+
  ggtitle("Coast at 40 Sample Deficit")

t.test(class.err ~ group, data=err.rates.class_coa20def, paired=TRUE)

## 
##  Paired t-test
## 
## data:  class.err by group
## t = 173.99, df = 499, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  17.65978 18.06317
## sample estimates:
## mean of the differences 
##                17.86147

Test 5: Coast at a 80 sample deficit (134:54)

err.rates.class_coa40def<-read.csv("err.rates.class_coa40def.csv")
err.rates.class_coa40def_plot<-ggplot(err.rates.class_coa40def, aes(x = group, y = class.err))+
  geom_boxplot(size=.3,outlier.colour = "black", outlier.shape = 1)+
  #scale_fill_manual(values=c("red","blue"))+
  theme_bw()+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0),
        legend.position="none",
        axis.title=element_text(size=8),
        axis.text=element_text(size=6))+
  xlab("Class")+ylab("Classification Accuracy (%)")+
  ggtitle("Coast at 80 Sample Deficit")

t.test(class.err ~ group, data=err.rates.class_coa40def, paired=TRUE)

## 
##  Paired t-test
## 
## data:  class.err by group
## t = 397.72, df = 499, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  42.45385 42.87537
## sample estimates:
## mean of the differences 
##                42.66461

Blank plot Placeholder

blankPlot <- ggplot()+geom_blank(aes(1,1))+
  theme(
    plot.background = element_blank(), 
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_blank(), 
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.line = element_blank()
  )

Results presented as a single plot

proportion_test<-grid.arrange(err.rates.class_equal_plot,blankPlot,err.rates.class_cas20def_plot,err.rates.class_cas40def_plot,err.rates.class_coa20def_plot,err.rates.class_coa40def_plot, ncol=2)

#ggsave("proportion_test.jpg",plot=proportion_test, width=7.2, height=9)

Concerning our second question: What other changes might be associated with this difference in sampling?

I noticed that there is a difference in model variability, or certainly in the relationship between True Positive Rate and False Positive Rate AKA model performance. This study uses a portion of the samples as a training set to build the model and a second portion to validate the model.

I compared the validation step using a model built with balanced class size and a model build with unbalanced class size. Model performance was measured by describing the relationship between the true positive rate and the false positive rate, or generating an ROC curve.

Code with unbalanced data. Here I take an 80% subsample, but there are more Cascades samples than Coast samples. In almost all cases Coast is at a deficit.

df_250_1<-read.csv("~/Documents/DART_DouglasFir/DART_datafiles/Work since 20160602/Cascade vs Coast 250 mmu and 1.csv")
df_250_1<-df_250_1[c(4:952)]
avg <- df_250_1%>%group_by(Region, Pop, Ind)
indavg<- avg%>%summarise_each(funs(mean))

df<-indavg[c(1, 4:949)]
#You need a dataset that contains your predictor variables (ions) and one column
#that contains the class information.

### For Loop
# To do this X times its easier to do a for loop:

# We create an empty list, to fill using the values of the fit element

fit.list <- list()
roc.list <- list()
pred.list <- list()
for (i in 1:length(c(1:500))){ #for all items (i) in the list of chosen length--change to number of iterations you want to perform. 
  cat("Repetition number ") #Repetition number X will let us know that our loop is working. 
  cat(i)
  cat("\n")
  
  #Subset your data into a training and validation set. 
  train<-sample(nrow(df), 0.8*nrow(df))
  #Choose the ratio of training to validation that works for your data. For
  #small datasets you could do a 90:10 or 95:5, and for larger datasets 80:20 is
  #recommended.
  df.train<-df[train,]
  df.validate<-df[-train,]
  table(df.train$Region)
  table(df.validate$Region)
  
  #this will run randomForest for each (i) in fit.list
  fit.list[[i]] <- (fit.forest<-randomForest(Region~., data=df.train, importance=TRUE)) #Grows the forest
  
  #this next part uses the model we built to classify the validation set. 
  forest.pr<-predict(fit.forest, type="prob",df.validate)[,2]
  forest.pred<-prediction(forest.pr, df.validate$Region)
  #We want to save the True Positive rate and the False positive rate. These
  #will then become the x and y values for the ROC curve.
  forest.pref<- performance(forest.pred,"tpr","fpr")
  pred.list[[i]] <- forest.pred
  # Saving x  and y values in a dataframe
  df.forest <- data.frame(unlist(forest.pref@x.values), unlist(forest.pref@y.values))
  df.forest$replicate <- rep(i, nrow(df.forest))
  colnames(df.forest) <- c("x_axis","y_axis","replicate")
  roc.list[[i]] <- df.forest #save the data frame as a list.
} #end of for loop

#To see the distributions of the median of OOB error, you can use the list you
#just created and do a simple for loop again: calculate the median OOB
#classification error for each interation of the randomForest.
fit.median <- list()
for (i in 1:length(fit.list)){
  fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
# Unlisting the values
#err.obs.v.r.2
fit.median <- unlist(fit.median)
err.obs.v.r.2<-data.frame(fit.median)
# use these values to make a density curve of the median OOB classification
# errors for 500 iterations.

# ROC: Binding and creating the curve
roc.total.obs.v.r.2_unbal <- as.data.frame(do.call(rbind, roc.list))
roc.total.obs.v.r.2_unbal$replicate<-as.factor(roc.total.obs.v.r.2_unbal$replicate)
write.csv(roc.total.obs.v.r.2_unbal,"roc.total.obs.v.r.2_unbal.csv")

Code for balanced classes.

Subset your data into a training and validation set. BUT MAKE SURE THAT THE CLASSES ARE BALANCED! For my data this required some calculations. The total number of Coast samples was my upper limit. For this averaged dataset I have 84 Coast samples. 85*2=170. 80% of 170 is 136 samples and 20% of 170 is 17 samples. So for the 80% subsample there will be 68 samples from the Coast and 68 samples from the Cascades. For the 20% sample there will be 17 samples from the Coast and 17 from the Cascades. This will result in balanced samples is the training set and balanced samples in the validation set.

df_250_1<-read.csv("Cascade vs Coast 250 mmu and 1.csv")
#subset to keep variables for averaging
df_250_1<-df_250_1[c(4:952)]
#dplyr functions
avg <- df_250_1%>%group_by(Region, Pop, Ind)
indavg<- avg%>%summarise_each(funs(mean))
write.csv(indavg, "source_mean_data.csv")

#now we just want the grouping variable we are interested in an molecular
#abundance variables.
df<-indavg[c(1, 4:949)]

#separate the Cascades dataset from the Coast dataset
cas<-subset(df, Region == "Cascades")
coa<-subset(df, Region == "Coast")
#You need a dataset that contains your predictor variables (ions) and one column
#that contains the class information.

#For Loop to run random forest X times. We create an empty list, to fill
#using the values of the fit and other elements

fit.list <- list()
roc.list <- list()
pred.list <- list()
for (i in 1:length(c(1:500))){ 
  #for all items (i) in the list of chosen length--change to number of
  #iterations you want to perform.
  cat("Repetition number ") #Repetition number X will let us know that our loop is working. 
  cat(i)
  cat("\n")

  #randomly select 68 from each; saved as a list
  cas.80<-sample(nrow(cas), 68)
  coa.80<-sample(nrow(coa), 68)
  #make new dataframe that matches the samples in the list cas.80. This will 
  #become the Cascades portion of the dataset to train the Random Forest
  #analysis.
  df.train.cas<-cas[cas.80,]
  df.validate.cas<-cas[-cas.80,]
  #make new dataframe that DOES NOT match the samples in the list cas.80, and
  #randomly select 17 of those samples and reserve for the validation set. 
  df.validate.cas<-df.validate.cas[sample((1:nrow(df.validate.cas)), 17,replace=TRUE), ]
  #Repeat the same protocol to make dataframes for the Coast portions of the
  #training and validation tests.
  df.train.coa<-coa[coa.80,]
  df.validate.coa<-coa[-coa.80,]
  df.validate.coa<-df.validate.coa[sample((1:nrow(df.validate.coa)), 17,replace=TRUE), ]
  #Combine the Cascades and Coast portions of the training and validation sets.
  df.train<-rbind(df.train.cas,df.train.coa)
  df.validate<-rbind(df.validate.cas,df.validate.coa)
  
  #this will run randomForest for each (i) in fit.list 500 times.
  fit.list[[i]] <- (fit.forest<-randomForest(Region~., data=df.train, importance=TRUE)) #Grows the forest
  
  #this next part uses the model we built to classify the validation set. 
  forest.pr<-predict(fit.forest, type="prob",df.validate)[,2]
  forest.pred<-prediction(forest.pr, df.validate$Region)
  #We want to save the True Positive rate and the False positive rate. These
  #will then become the x and y values for the ROC curve.
  forest.pref<- performance(forest.pred,"tpr","fpr")
  pred.list[[i]] <- forest.pred
  # Saving x  and y values in a dataframe
  df.forest <- data.frame(unlist(forest.pref@x.values), unlist(forest.pref@y.values))
  df.forest$replicate <- rep(i, nrow(df.forest))
  colnames(df.forest) <- c("x_axis","y_axis","replicate")
  roc.list[[i]] <- df.forest #save the data frame as a list.
} #end of for loop

#To see the distributions of the median of OOB error, you can use the list you
#just created and do a simple for loop again: calculate the median OOB
#classification error for each interation of the randomForest.
fit.median <- list()
for (i in 1:length(fit.list)){
  fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
# Unlisting the values
#err.obs.v.r.2
fit.median <- unlist(fit.median)
err.obs.v.r.2<-data.frame(fit.median)
# use these values to make a density curve of the median OOB classification
# errors for 500 iterations.

# ROC: Binding and creating the curve
roc.total.obs.v.r.2_bal <- as.data.frame(do.call(rbind, roc.list))
roc.total.obs.v.r.2_bal$replicate <-as.factor(roc.total.obs.v.r.2_bal$replicate)
write.csv(roc.total.obs.v.r.2_bal,"roc.total.obs.v.r.2_bal.csv")

Code for plots

roc.total.obs.v.r.2_unbal<-read.csv("roc.total.obs.v.r.2_unbal.csv")
roc.total.obs.v.r.2_unbal[1]<-NULL
roc.total.obs.v.r.2_unbal$replicate<-factor(roc.total.obs.v.r.2_unbal$replicate)
roc.obs.v.r.2_unbal_plot<-ggplot(roc.total.obs.v.r.2_unbal, 
                                aes(x=x_axis, y=y_axis)) +         
  geom_line(aes(color=replicate), alpha=0.2)+ 
  geom_smooth(colour="darkturquoise", method="loess") + 
  xlab("False Positive Rate")+
  ylab("True Positive Rate")+
  scale_color_grey()+ 
  ggtitle("Unbalanced")+
  theme_bw()+
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title=element_text(size=8),
    axis.text.x=element_text(size=8),
    axis.text.y=element_text(size=8))

roc.total.obs.v.r.2_bal<-read.csv("roc.total.obs.v.r.2_bal.csv")
roc.total.obs.v.r.2_bal[1]<-NULL
roc.total.obs.v.r.2_bal$replicate<-factor(roc.total.obs.v.r.2_bal$replicate)
roc.obs.v.r.2_bal_plot<-ggplot(roc.total.obs.v.r.2_bal, 
                                aes(x=x_axis, y=y_axis)) +         
  geom_line(aes(color=replicate), alpha=0.2)+ 
  geom_smooth(colour="darkgoldenrod1", method="loess") + 
  xlab("False Positive Rate")+
  ylab("True Positive Rate")+
  scale_color_grey()+ 
  ggtitle("Balanced")+
  theme_bw()+
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title=element_text(size=8),
    axis.text.x=element_text(size=8),
    axis.text.y=element_text(size=8))

unbal<-roc.total.obs.v.r.2_unbal
un.group<-data.frame(rep("unbal",18614))
names(un.group)[1]<-"group"
unbal<-cbind(unbal,un.group)
bal<-roc.total.obs.v.r.2_bal
bal.group<-data.frame(rep("bal",12364))
names(bal.group)[1]<-"group"
bal<-cbind(bal,bal.group)
combined.roc<-rbind(bal,unbal)

combined.roc.plot<-ggplot(combined.roc, aes(x=x_axis, y=y_axis, color=factor(group))) + 
  geom_blank() + 
  stat_smooth(aes(colour = factor(group)),se=FALSE, method="loess") + 
  theme_bw()+theme(legend.position = "none")+
  xlab("False Positive Rate")+
  ylab("True Positive Rate")+scale_color_manual(values=c("darkturquoise", "darkgoldenrod1"))+
  ggtitle("Comparison")+
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0),
    axis.title=element_text(size=8),
    axis.text.x=element_text(size=8),
    axis.text.y=element_text(size=8))

grid.arrange(roc.obs.v.r.2_bal_plot,roc.obs.v.r.2_unbal_plot,combined.roc.plot, ncol=1)

Conclusions

This analysis is very sensitive to unbalanced classes.
From 500 iterations of random forest with equal proportions, it appears that Coast samples are asymmetrically easier to accurately classify to region.
Model variability improves when class size is balanced, but this does not improve mean model performance.
There seems to be no effect of balanced class size on validation performance. All samples are treated as unknowns. (Data not shown)