This is an R Markdown document. These data pretain to Finch et al. 2017 Applications in Plant Sciences.
This sampling design allows us to address two specific questions: (i) what is the magnitude of inter-annual variation in wood metabolic molecules relative to that of geographic variation, and (ii) can wood from Douglas-fir trees originating in the Oregon Coast and Cascade Ranges be accurately classified to geographic source using their DART-TOFMS metabolite profiles, and if so, which molecules allow for the discrimination of regional sources of wood?
Random forest classification of observed and randomized molecular abundance data taking care to ensure balanced class size (sample size for each class in the grouping variable). Four models were tested: the Year model, Year*Source, SourceINDIV, and SourceMEAN (Table 1).
This analysis follows round one of reviewer comments. This comes after two analyses that lead to the requirement for balanced class size. (see http://rpubs.com/finchnSNPs/250368 and http://rpubs.com/finchnSNPs/250743)
Douglas-fir Two or three annuals rings per increment core, and increment core averages over 3 growth years. 2-6 trees per population 2 regions: Coast Range and Cascades Range Random forest classification
| Model Identifier | Grouping Variable | Classes | Sample size (n) |
|---|---|---|---|
| Year | Growth Year | 1986, 1987, 1988 | 560 |
| Year*Source | Growth Year and Region of Origin | Cascades 1986, Cascades 1987, Cascades 1988, Coast 1986, Coast 1987, Coast 1988 | 560 |
| SourceINDIV | Region of Origin | Cascades, Coast | 560 |
| SourceMEAN | Region of Origin | Cascades, Coast | 188 |
| Model | Class | Sample size (n) | Class Size for Analysis |
|---|---|---|---|
| Year | 1986 | 185 | 185 |
| Year | 1987 | 187 | 185 |
| Year | 1988 | 188 | 185 |
| Year | Total | 560 | 555 |
| Year*Source | Cascades 1986 | 102 | 83 |
| Year*Source | Cascades 1987 | 103 | 83 |
| Year*Source | Cascades 1988 | 103 | 83 |
| Year*Source | Coast 1986 | 83 | 83 |
| Year*Source | Coast 1987 | 84 | 83 |
| Year*Source | Coast 1988 | 85 | 83 |
| Year*Source | Total | 560 | 498 |
| SourceINDIV | Cascades | 308 | 224 training; 84 validation |
| SourceINDIV | Coast | 252 | 224 training; 28 validation |
| SourceINDIV | Total | 560 | 560; 448 training; 112 validation |
| SourceMEAN | Cascades | 103 | 75 training; 28 validation |
| SourceMEAN | Coast | 85 | 75 training; 10 validation |
| SourceMEAN | Total | 188 | 188; 150 training; 38 validation |
library(randomForest)
library(ggplot2)
library(dplyr)
library(ROCR)
library(gridExtra)
Below is sample code that I ran to generate these data. This runs for 500 iterations, so the data used for graphed results have been saved independently and loaded subsequently.
| Abbreviation | Model | Grouping Variable | Number of Classes |
|---|---|---|---|
| v.r.2 | SourceMEAN | region | 2 |
| a.r.2 | SourceINDIV | region | 2 |
| a.s.3 | Year | growth year | 3 |
| a.r.s.6 | Year*Source | growth year and region | 6 |
Subset your data into a training and validation set. BUT MAKE SURE THAT THE CLASSES ARE BALANCED! I have 188 samples for the SourceMEAN model. 80% of 188 is ~150 and 150/2 is 75. There are 85 Coast samples and Coast is at a deficit, so this will work. 75 Coast, 75 Cascades, and the remaining samples will be reserved for the validation set (38).
#v.r.2 observed----
df_250_1<-read.csv("Cascade vs Coast 250 mmu and 1.csv")
#subset to keep variables for averaging
df_250_1<-df_250_1[c(4:952)]
#dplyr functions
avg <- df_250_1%>%group_by(Region, Pop, Ind)
indavg<- avg%>%summarise_each(funs(mean))
write.csv(indavg, "source_mean_data.csv")
#now we just want the grouping variable we are interested in an molecular abundance variables.
df<-indavg[c(1, 4:949)]
#You need a dataset that contains your predictor variables (ions) and one column that contains the class information.
#For Loop to run random forest X times. We create an empty list, to fill using the values of the fit and other elements
fit.list <- list()
roc.list <- list()
pred.list <- list()
for (i in 1:length(c(1:500))){
#for all items (i) in the list of chosen length--change to number of iterations you want to perform.
cat("Repetition number ") #Repetition number X will let us know that our loop is working.
cat(i)
cat("\n")
#randomly select 75 from each; saved as a list
cas.80<-sample(nrow(cas), 75)
coa.80<-sample(nrow(coa), 75)
#make new dataframe that matches the samples in the list cas.80. This will become the Cascades portion of the dataset to train the Random Forest analysis.
df.train.cas<-cas[cas.80,]
df.validate.cas<-cas[-cas.80,]
#make new dataframe that DOES NOT match the samples in the list cas.80, and repeat the same protocol to make dataframes for the Coast portions of the training and validation tests.
df.train.coa<-coa[coa.80,]
df.validate.coa<-coa[-coa.80,]
#Combine the Cascades and Coast portions of the training and validation sets.
df.train<-rbind(df.train.cas,df.train.coa)
df.validate<-rbind(df.validate.cas,df.validate.coa)
#this will run randomForest for each (i) in fit.list 500 times.
fit.list[[i]] <- (fit.forest<-randomForest(Region~., data=df.train, importance=TRUE)) #Grows the forest
#this next part uses the model we built to classify the validation set.
forest.pr<-predict(fit.forest, type="prob",df.validate)[,2]
forest.pred<-prediction(forest.pr, df.validate$Region)
#We want to save the True Positive rate and the False positive rate. These will then become the x and y values for the ROC curve.
forest.pref<- performance(forest.pred,"tpr","fpr")
pred.list[[i]] <- forest.pred
# Saving x and y values in a dataframe
df.forest <- data.frame(unlist(forest.pref@x.values), unlist(forest.pref@y.values))
df.forest$replicate <- rep(i, nrow(df.forest))
colnames(df.forest) <- c("x_axis","y_axis","replicate")
roc.list[[i]] <- df.forest #save the data frame as a list.
} #end of for loop
#To see the distributions of the median of OOB error, you can use the list you just created and do a simple for loop again: calculate the median OOB classification error for each interation of the randomForest.
fit.median <- list()
for (i in 1:length(fit.list)){
fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
# Unlisting the values
#err.obs.v.r.2
fit.median <- unlist(fit.median)
err.obs.v.r.2<-data.frame(fit.median)#write this out >>>>
# use these values to make a density curve of the median OOB classification
# errors for 500 iterations.
# ROC: Binding and creating the curve
roc.total.obs.v.r.2 <- as.data.frame(do.call(rbind, roc.list))
#write this out >>>>
roc.total.obs.v.r.2$replicate <-as.factor(roc.total.obs.v.r.2$replicate)
# Calculate area under the curve for each curve
auc.list <- list()
for (i in 1:length(pred.list)){
auc <- performance(pred.list[[i]],"auc")
auc.list[[i]] <- unlist(slot(auc, "y.values"))
}
#stats to report.
auc.v.r.2 <- unlist(auc.list)
meanauc.obs.v.r.2 <- mean(round(auc, digits = 2))
auc.error.obs.v.r.2<-qt(0.975,df=length(auc.v.r.2)-1)*sd(auc.v.r.2)/sqrt(length(auc.v.r.2))
auc.left.obs.v.r.2<-meanauc.obs.v.r.2-auc.error.obs.v.r.2
auc.right.obs.v.r.2<-meanauc.obs.v.r.2+aucerror.obs.v.r.2
auc.v.r.2.stats<-data.frame(cbind(meanauc.obs.v.r.2,auc.error.obs.v.r.2,auc.left.obs.v.r.2,auc.right.obs.v.r.2))
names(auc.v.r.2.stats)<-c("mean", "sd", "lower", "upper")
auc.v.r.2.stats$model<-c("v.r.2") #write this out >>>>
#graph the ROC curve
roc.auc.obs.v.r.2<-ggplot(roc.total.obs.v.r.2, aes(x=x_axis, y=y_axis)) +
geom_line(aes(color=replicate), alpha=0.2) +
stat_smooth(colour="darkred", method="loess") + #this adds the stat line of the mean and the ribbon
theme_bw()+theme(legend.position = "none")+
xlab("True Positive Rate")+
ylab("False Positive Rate")+scale_color_grey()+
annotate("text", label = "mean(AUC)= 0.79", x = 0.75, y = .3, size = 3, colour = "black")+
#annotation_custom(tableGrob(mytable.obs.v.r.2), xmin=0.75, xmax=1.00, ymin=0, ymax=0.3)+
ggtitle("B")+
xlab("False Positive Rate")+
ylab("True Positive Rate")+
theme(plot.title = element_text(hjust = 0))+
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title=element_text(size=8),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8))
Now do the same with randomized data, but this time we are not interested in the ROC curve.
#v.r.2 randomized----
#make your empty lists
fit.list <- list()
#start your for loop in the same way
for (i in 1:length(c(1:500))){ #for all items (i) in the list of chosen length--change to number of iterations you want to perform.
cat("Repetition number ") #Repetition number X will let us know that our loop is working.
cat(i)
cat("\n")
#reload the dataset.
df<-indavg[c(1, 4:949)]
#make a new dataframe containing the grouping variable.
rand.col.1<-data.frame(df$Region)
#delete the grouping variable in the main dataframe.
df$Region<-NULL
#shuffle the values in the grouping variable
rand.col.1<-data.frame(rand.col.1[sample(nrow(rand.col.1)),])
#rename the grouping variable
names(rand.col.1)[1]<-"Region"
#recombine the main dataframe with the randomized values in the grouping variable. Now the data have been maintained, but the classes of the grouping variable have been randomized.
df<-cbind(rand.col.1,df)
#Now we have to subset the dataset to obtain the 80% training set and the 20% validation set, taking care to maintain balanced class sample sizes.
cas<-subset(df, Region == "Cascades")
coa<-subset(df, Region == "Coast")
cas.80<-sample(nrow(cas), 75)
coa.80<-sample(nrow(coa), 75)
df.train.cas<-cas[cas.80,]
df.validate.cas<-cas[-cas.80,]
df.train.coa<-coa[coa.80,]
df.validate.coa<-coa[-coa.80,]
df.train<-rbind(df.train.cas,df.train.coa)
df.validate<-rbind(df.validate.cas,df.validate.coa)
#randomForest for 500 iterations
fit.list[[i]] <- (fit.forest<-randomForest(Region~., data=df.train, importance=TRUE)) #Grows the forest
} #end of for loop
#Obtain OOB error rates.
fit.median <- list()
for (i in 1:length(fit.list)){
fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
# Unlisting the values to make a dataframe of OOB values for Randomized data.
#err.rand.v.r.2
fit.median <- unlist(fit.median)
err.rand.v.r.2<-data.frame(fit.median) #write this out >>>>
Subset your data into a training and validation set. BUT MAKE SURE THAT THE CLASSES ARE BALANCED! For my data this required some calculations. I have 560 samples for the SourceINDIV model. 80% of 560 is 448 and 448/2 is 224. There are 252 samples for the Coast class which is at a deficit, and 224 will work. 224 Coast samples, 224 Cascades samples, and the remaining samples will be used for the validation set (112).
#a.r.2 observed----
df_250_1<-read.csv("Cascade vs Coast 250 mmu and 1.csv")
df<-df_250_1[c(4,7:952)]
fit.list <- list()
roc.list <- list()
pred.list <- list()
for (i in 1:length(c(1:500))){ #for all items (i) in the list of chosen length--change to number of iterations you want to perform.
cat("Repetition number ") #Repetition number X will let us know that our loop is working.
cat(i)
cat("\n")
cas<-subset(df, Region == "Cascades")
coa<-subset(df, Region == "Coast")
cas.80<-sample(nrow(cas), 224)
coa.80<-sample(nrow(coa), 224)
df.train.cas<-cas[cas.80,]
df.validate.cas<-cas[-cas.80,]
df.train.coa<-coa[coa.80,]
df.validate.coa<-coa[-coa.80,]
df.train<-rbind(df.train.cas,df.train.coa)
df.validate<-rbind(df.validate.cas,df.validate.coa)
#randomForest
fit.list[[i]] <- (fit.forest<-randomForest(Region~., data=df.train, importance=TRUE)) #Grows the forest
#this next part uses the model we built to classify the validation set.
forest.pr<-predict(fit.forest, type="prob",df.validate)[,2]
forest.pred<-prediction(forest.pr, df.validate$Region)
forest.pref<- performance(forest.pred,"tpr","fpr")
pred.list[[i]] <- forest.pred
df.forest <- data.frame(unlist(forest.pref@x.values), unlist(forest.pref@y.values))
df.forest$replicate <- rep(i, nrow(df.forest))
colnames(df.forest) <- c("x_axis","y_axis","replicate")
roc.list[[i]] <- df.forest #save the data frame as a list.
} #end of for loop
#Build dataframe of median of OOB error
fit.median <- list()
for (i in 1:length(fit.list)){
fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
# Unlisting values and save
#err.obs.a.r.2
fit.median <- unlist(fit.median)
err.obs.a.r.2<-data.frame(fit.median) #write this out >>>>
# ROC: Binding and creating the curve
roc.total.obs.a.r.2 <- as.data.frame(do.call(rbind, roc.list))
roc.total.obs.a.r.2$replicate <-as.factor(roc.total.obs.a.r.2$replicate) #write this out >>>>
# Calculate area under the curve for each curve
auc.list <- list()
for (i in 1:length(pred.list)){
auc <- performance(pred.list[[i]],"auc")
auc.list[[i]] <- unlist(slot(auc, "y.values"))
}
#stats to report.
auc.a.r.2 <- unlist(auc.list)
meanauc.obs.a.r.2 <- mean(round(auc.a.r.2, digits = 2))
auc.error.obs<-qt(0.975,df=length(auc.a.r.2)-1)*sd(auc.a.r.2)/sqrt(length(auc.a.r.2))
auc.left.obs<-meanauc.obs.a.r.2-auc.error.obs.a.r.2
auc.right.obs<-meanauc.obs.a.r.2+aucerror.obs.a.r.2
auc.a.r.2.stats<-data.frame(cbind(meanauc.obs.a.r.2,auc.error.obs.a.r.2,auc.left.obs.a.r.2,auc.right.obs.a.r.2))
names(auc.a.r.2.stats)<-c("mean", "sd", "lower", "upper")
auc.a.r.2.stats$model<-c("a.r.2") #write this out >>>>
#graph the ROC curves
roc.auc.obs.a.r.2<-ggplot(roc.total.obs.a.r.2, aes(x=x_axis, y=y_axis)) +
geom_line(aes(color=replicate), alpha=0.2) +
stat_smooth(colour="midnightblue", method="loess") + #this adds the stat line of the mean and the ribbon
theme_bw()+theme(legend.position = "none")+
xlab("True Positive Rate")+
ylab("False Positive Rate")+scale_color_grey()+
annotate("text", label = "mean(AUC)= 0.85", x = 0.75, y = .3, size = 3, colour = "black")+
#annotation_custom(tableGrob(mytable.obs.a.r.2), xmin=0.75, xmax=1.00, ymin=0, ymax=0.3)+
ggtitle("A")+
xlab("False Positive Rate")+
ylab("True Positive Rate")+
theme(plot.title = element_text(hjust = 0))+
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title=element_text(size=8),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8))
#a.r.2 randomized----
fit.list <- list()
for (i in 1:length(c(1:500))){ #for all items (i) in the list of chosen length--change to number of iterations you want to perform.
cat("Repetition number ") #Repetition number X will let us know that our loop is working.
cat(i)
cat("\n")
#reload dataset
df_250_1<-read.csv("Cascade vs Coast 250 mmu and 1.csv")
df<-df_250_1[c(4,7:952)]
#make a new dataset with the grouping variable
rand.col.1<-data.frame(df$Region)
#remove the grouping variable from the main dataset
df$Region<-NULL
#randomize the values in the grouping vairiable
rand.col.1<-data.frame(rand.col.1[sample(nrow(rand.col.1)),])
names(rand.col.1)[1]<-"Region"
#recombine the grouping variable and the rest of the molecular abundance data.
df<-cbind(rand.col.1,df)
#now subset the data into a training and validation set.
cas<-subset(df, Region == "Cascades")
coa<-subset(df, Region == "Coast")
cas.80<-sample(nrow(cas), 224)
coa.80<-sample(nrow(coa), 224)
df.train.cas<-cas[cas.80,]
df.validate.cas<-cas[-cas.80,]
df.train.coa<-coa[coa.80,]
df.validate.coa<-coa[-coa.80,]
df.train<-rbind(df.train.cas,df.train.coa)
df.validate<-rbind(df.validate.cas,df.validate.coa)
#Run randomForest
fit.list[[i]] <- (fit.forest<-randomForest(Region~., data=df.train, importance=TRUE)) #Grows the forest
} #end of for loop
#obtain OOB's and calculate medians.
fit.median <- list()
for (i in 1:length(fit.list)){
fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
# Unlisting the values
#err.rand.a.r.2
fit.median <- unlist(fit.median)
err.rand.a.r.2<-data.frame(fit.median) #write this out >>>>
Subset your data into a training and validation set. BUT MAKE SURE THAT THE CLASSES ARE BALANCED! For my data this required some calculations. The smallest class (1986) in the grouping variable had 185 samples. 185 was set as the upper limit for each class. For this analysis, we are not interested in validating unknowns.
#a.s.3 observed----
fit.list <- list()
for (i in 1:length(c(1:500))){ #for all items (i) in the list of chosen length--change to number of iterations you want to perform.
cat("Repetition number ") #Repetition number X will let us know that our loop is working.
cat(i)
cat("\n")
df_250_1<-read.csv("Cascade vs Coast 250 mmu and 1.csv")
df_250_1$Year<-factor(df_250_1$Year)
df<-df_250_1[c(2, 7:952)]
is.factor(df$Year) #this must be TRUE.
#Divide the dataset into balanced classes of 185 samples each.
year.86<-subset(df, Year == "86")
year.87<-subset(df, Year == "87")
year.88<-subset(df, Year == "88")
year.86<-year.86[sample((1:nrow(year.86)), 185, replace=FALSE),]
year.87<-year.87[sample((1:nrow(year.87)), 185, replace=FALSE),]
year.88<-year.88[sample((1:nrow(year.88)), 185, replace=FALSE),]
df.train<-rbind(year.86,year.87,year.88)
#run randomForest
fit.list[[i]] <- (fit.forest<-randomForest(Year~., data=df.train, importance=TRUE)) #Grows the forest
} #end of for loop
#Get median OOB values
fit.median <- list()
for (i in 1:length(fit.list)){
fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
# Unlist the values and save
#err.obs.a.s.3
fit.median <- unlist(fit.median)
err.obs.a.s.3<-data.frame(fit.median)
#a.s.3 randomized----
fit.list <- list()
for (i in 1:length(c(1:500))){ #for all items (i) in the list of chosen length--change to number of iterations you want to perform.
cat("Repetition number ") #Repetition number X will let us know that our loop is working.
cat(i)
cat("\n")
#reload dataset
df_250_1<-read.csv("Cascade vs Coast 250 mmu and 1.csv")
df_250_1<-df_250_1[c(2, 7:952)]
#make a new dataframe with the grouping variable.
rand.col.1<-data.frame(df_250_1$Year)
names(rand.col.1)[1]<-"Year"
rand.col.1$Year<-factor(rand.col.1$Year)
is.factor(rand.col.1$Year)
#must be true
df_250_1$Year<-NULL
#randomize the values in the grouping variable.
rand.col.1<-data.frame(rand.col.1[sample(nrow(rand.col.1)),])
#Recombine the randomized grouping variable with the molecular abundance data.
df<-cbind(rand.col.1,df_250_1)
names(df)[1]<-"Year"
df$Year<-factor(df$Year)
is.factor(df$Year)#this must be TRUE.
#subset the data to obtain balanced class size.
year.86<-subset(df, Year == "86")
year.87<-subset(df, Year == "87")
year.88<-subset(df, Year == "88")
year.86<-year.86[sample((1:nrow(year.86)), 185, replace=FALSE),]
year.87<-year.87[sample((1:nrow(year.87)), 185, replace=FALSE),]
year.88<-year.88[sample((1:nrow(year.88)), 185, replace=FALSE),]
df.train<-rbind(year.86,year.87,year.88)
#run randomForest
fit.list[[i]] <- (fit.forest<-randomForest(Year~., data=df.train, importance=TRUE)) #Grows the forest
} #end of for loop
fit.median <- list()
for (i in 1:length(fit.list)){
fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
fit.median <- unlist(fit.median)
err.rand.a.s.3<-data.frame(fit.median)
Subset your data into a training and validation set. BUT MAKE SURE THAT THE CLASSES ARE BALANCED! For my data this required some calculations. The smallest class (Coast.86) in the grouping variable had 83 samples. 83 was set as the upper limit for each class. For this analysis, we are not interested in validating unknowns.
#a.r.s.6 observed----
fit.list <- list()
for (i in 1:length(c(1:500))){ #for all items (i) in the list of chosen length--change to number of iterations you want to perform.
cat("Repetition number ") #Repetition number X will let us know that our loop is working.
cat(i)
cat("\n")
#reload dataset
df_250_1<-read.csv("Cascade vs Coast 250 mmu and 1.csv")
df<-df_250_1[c(3, 7:952)]
#subset data for balanced classes.
cas.86<-subset(df, Reg.Sea == "Cascades.86")
cas.87<-subset(df, Reg.Sea == "Cascades.87")
cas.88<-subset(df, Reg.Sea == "Cascades.88")
coa.86<-subset(df, Reg.Sea == "Coast.86")
coa.87<-subset(df, Reg.Sea == "Coast.87")
coa.88<-subset(df, Reg.Sea == "Coast.88")
cas.86<-cas.86[sample((1:nrow(cas.86)), 83, replace=FALSE),]
cas.87<-cas.87[sample((1:nrow(cas.87)), 83, replace=FALSE),]
cas.88<-cas.88[sample((1:nrow(cas.88)), 83, replace=FALSE),]
coa.86<-coa.86[sample((1:nrow(coa.86)), 83, replace=FALSE),]
coa.87<-coa.87[sample((1:nrow(coa.87)), 83, replace=FALSE),]
coa.88<-coa.88[sample((1:nrow(coa.88)), 83, replace=FALSE),]
df.train<-rbind(cas.86,cas.87,cas.88,coa.86,coa.87,coa.88)
#run randomForest
fit.list[[i]] <- (fit.forest<-randomForest(Reg.Sea~., data=df.train, importance=TRUE)) #Grows the forest
} #end of for loop
fit.median <- list()
for (i in 1:length(fit.list)){
fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
fit.median <- unlist(fit.median)
err.obs.a.r.s.6<-data.frame(fit.median)
#a.r.s.6 randomized----
fit.list <- list()
for (i in 1:length(c(1:500))){ #for all items (i) in the list of chosen length--change to number of iterations you want to perform.
cat("Repetition number ") #Repetition number X will let us know that our loop is working.
cat(i)
cat("\n")
#reload the dataset
df_250_1<-read.csv("Cascade vs Coast 250 mmu and 1.csv")
df_250_1<-df_250_1[c(3, 7:952)]
#randomize values in the grouping variable
rand.col.1<-data.frame(df_250_1$Reg.Sea)
df_250_1$Reg.Sea<-NULL
rand.col.1<-data.frame(rand.col.1[sample(nrow(rand.col.1)),])
names(rand.col.1)[1]<-"Reg.Sea"
df<-cbind(rand.col.1,df_250_1)
#subset data for balanced classes.
cas.86<-subset(df, Reg.Sea == "Cascades.86")
cas.87<-subset(df, Reg.Sea == "Cascades.87")
cas.88<-subset(df, Reg.Sea == "Cascades.88")
coa.86<-subset(df, Reg.Sea == "Coast.86")
coa.87<-subset(df, Reg.Sea == "Coast.87")
coa.88<-subset(df, Reg.Sea == "Coast.88")
cas.86<-cas.86[sample((1:nrow(cas.86)), 83, replace=FALSE),]
cas.87<-cas.87[sample((1:nrow(cas.87)), 83, replace=FALSE),]
cas.88<-cas.88[sample((1:nrow(cas.88)), 83, replace=FALSE),]
coa.86<-coa.86[sample((1:nrow(coa.86)), 83, replace=FALSE),]
coa.87<-coa.87[sample((1:nrow(coa.87)), 83, replace=FALSE),]
coa.88<-coa.88[sample((1:nrow(coa.88)), 83, replace=FALSE),]
df.train<-rbind(cas.86,cas.87,cas.88,coa.86,coa.87,coa.88)
#run randomForest
fit.list[[i]] <- (fit.forest<-randomForest(Reg.Sea~., data=df.train, importance=TRUE)) #Grows the forest
} #end of for loop
fit.median <- list()
for (i in 1:length(fit.list)){
fit.median[[i]] <- median(fit.list[[i]]$err.rate[,'OOB'])
}
fit.median <- unlist(fit.median)
err.rand.a.r.s.6<-data.frame(fit.median)
Data saved as .csv ~/Documents/DART_DouglasFir/DART_datafiles/Work since 20160602/Revision_1_datasets
Data preparation. Code shown, but not evaluated to avoid over-writing and because all data has been saved.
#density plots data prep----
#bring in v.r.2 data saved after random forest
v.r.2.errs<-read.csv("Revision_1_datasets/err.v.r.2_20170218.csv")
v.r.2.errs[1]<-NULL
v.r.2.dens<-read.csv("Revision_1_datasets/dens.v.r.2_20170218.csv")
v.r.2.dens[1]<-NULL
#bring in a.r.2 data saved after random forest
a.r.2.errs<-read.csv("Revision_1_datasets/err.a.r.2_20170218.csv")
a.r.2.errs[1]<-NULL
a.r.2.dens<-read.csv("Revision_1_datasets/dens.a.r.2_20170218.csv")
a.r.2.dens[1]<-NULL
#bring in a.s.3 data saved after random forest
a.s.3.errs<-read.csv("Revision_1_datasets/err.a.s.3_rev1.csv")
a.s.3.errs[1]<-NULL
a.s.3.dens<-read.csv("Revision_1_datasets/dens.a.s.3_rev1.csv")
a.s.3.dens[1]<-NULL
#bring in a.r.s.6 data saved after random forest
a.r.s.6.errs<-read.csv("Revision_1_datasets/err.a.r.s.6_rev1.csv")
a.r.s.6.errs[1]<-NULL
a.r.s.6.dens<-read.csv("Revision_1_datasets/dens.a.r.s.6_rev1.csv")
a.r.s.6.dens[1]<-NULL
#Calculate the compliment of OOB classification error or classification accuracy.
#Step 1: Convert to percentage
a.r.2.errs$Observed<-(a.r.2.errs$Observed*100)
a.r.2.errs$Randomized<-(a.r.2.errs$Randomized*100)
v.r.2.errs$Observed<-(v.r.2.errs$Observed*100)
v.r.2.errs$Randomized<-(v.r.2.errs$Randomized*100)
a.s.3.errs$Observed<-(a.s.3.errs$Observed*100)
a.s.3.errs$Randomized<-(a.s.3.errs$Randomized*100)
a.r.s.6.errs$Observed<-(a.r.s.6.errs$Observed*100)
a.r.s.6.errs$Randomized<-(a.r.s.6.errs$Randomized*100)
a.r.2.dens$OOB.med.err<-(a.r.2.dens$OOB.med.err*100)
v.r.2.dens$OOB.med.err<-(v.r.2.dens$OOB.med.err*100)
a.s.3.dens$OOB.med.err<-(a.s.3.dens$OOB.med.err*100)
a.r.s.6.dens$OOB.med.err<-(a.r.s.6.dens$OOB.med.err*100)
#Step 2: Calculate compliment/accuracy
a.r.2.errs$Observed<-(100-a.r.2.errs$Observed)
a.r.2.errs$Randomized<-(100-a.r.2.errs$Randomized)
v.r.2.errs$Observed<-(100-v.r.2.errs$Observed)
v.r.2.errs$Randomized<-(100-v.r.2.errs$Randomized)
a.s.3.errs$Observed<-(100-a.s.3.errs$Observed)
a.s.3.errs$Randomized<-(100-a.s.3.errs$Randomized)
a.r.s.6.errs$Observed<-(100-a.r.s.6.errs$Observed)
a.r.s.6.errs$Randomized<-(100-a.r.s.6.errs$Randomized)
a.r.2.dens$OOB.med.err<-(100-a.r.2.dens$OOB.med.err)
v.r.2.dens$OOB.med.err<-(100-v.r.2.dens$OOB.med.err)
a.s.3.dens$OOB.med.err<-(100-a.s.3.dens$OOB.med.err)
a.r.s.6.dens$OOB.med.err<-(100-a.r.s.6.dens$OOB.med.err)
#Write out.
write.csv(a.r.2.errs,"Revision_1_datasets/a.r.2.acc_20170219.csv",row.names=FALSE)
write.csv(v.r.2.errs,"Revision_1_datasets/v.r.2.acc_20170219.csv",row.names=FALSE)
write.csv(a.s.3.errs,"Revision_1_datasets/a.s.3.acc_20170219.csv",row.names=FALSE)
write.csv(a.r.s.6.errs,"Revision_1_datasets/a.r.s.6.acc_20170219.csv",row.names=FALSE)
write.csv(a.r.2.dens,"Revision_1_datasets/a.r.2.dens.acc_20170219.csv",row.names=FALSE)
write.csv(v.r.2.dens,"Revision_1_datasets/v.r.2.dens.acc_20170219.csv",row.names=FALSE)
write.csv(a.s.3.dens,"Revision_1_datasets/a.s.3.dens.acc_20170219.csv",row.names=FALSE)
write.csv(a.r.s.6.dens,"Revision_1_datasets/a.r.s.6.dens.acc_20170219.csv",row.names=FALSE)
Calculate statistics and plot density graphs.
#density plots----
#Bring in those files.
a.r.2.accs<-read.csv("Revision_1_datasets/a.r.2.acc_20170219.csv")
v.r.2.accs<-read.csv("Revision_1_datasets/v.r.2.acc_20170219.csv")
a.s.3.accs<-read.csv("Revision_1_datasets/a.s.3.acc_20170219.csv")
a.r.s.6.accs<-read.csv("Revision_1_datasets/a.r.s.6.acc_20170219.csv")
a.r.2.acc.dens<-read.csv("Revision_1_datasets/a.r.2.dens.acc_20170219.csv")
v.r.2.acc.dens<-read.csv("Revision_1_datasets/v.r.2.dens.acc_20170219.csv")
a.s.3.acc.dens<-read.csv("Revision_1_datasets/a.s.3.dens.acc_20170219.csv")
a.r.s.6.acc.dens<-read.csv("Revision_1_datasets/a.r.s.6.dens.acc_20170219.csv")
#a.r.2 density plot----
#observed stats
mean.a.r.2.obs<-mean(a.r.2.accs$Observed)
error.a.r.2.obs<-qt(0.975,df=length(a.r.2.accs$Observed)-1)*sd(a.r.2.accs$Observed)/sqrt(length(a.r.2.accs$Observed))
a.r.2.obs.left<-mean(a.r.2.accs$Observed)-error.a.r.2.obs
a.r.2.obs.right<-mean(a.r.2.accs$Observed)+error.a.r.2.obs
a.r.2.obs.random.forest.stats<-data.frame(cbind(mean.a.r.2.obs,error.a.r.2.obs,a.r.2.obs.left,a.r.2.obs.right))
names(a.r.2.obs.random.forest.stats)<-c("mean","se","lower","upper")
a.r.2.obs.random.forest.stats$format<-c("observed")
#randomized stats
mean.a.r.2.rand<-mean(a.r.2.accs$Randomized)
error.a.r.2.rand<-qt(0.975,df=length(a.r.2.accs$Randomized)-1)*sd(a.r.2.accs$Randomized)/sqrt(length(a.r.2.accs$Randomized))
a.r.2.rand.left<-mean(a.r.2.accs$Randomized)-error.a.r.2.rand
a.r.2.rand.right<-mean(a.r.2.accs$Randomized)+error.a.r.2.rand
a.r.2.rand.random.forest.stats<-data.frame(cbind(mean.a.r.2.rand,error.a.r.2.rand,a.r.2.rand.left,a.r.2.rand.right))
names(a.r.2.rand.random.forest.stats)<-c("mean","se","lower","upper")
a.r.2.rand.random.forest.stats$format<-c("randomized")
a.r.2.random.forest.stats<-rbind(a.r.2.obs.random.forest.stats,a.r.2.rand.random.forest.stats)
a.r.2.random.forest.stats$model<-c("source.indiv","source.indiv")
RFInd <- ggplot(a.r.2.acc.dens, aes(x=OOB.med.err, fill=group))+
geom_density(alpha=0.7)+
scale_fill_manual(values=c("grey80", "grey50"),labels=c("Observed Data", "Randomized Data")) +
geom_vline(xintercept = mean.a.r.2.rand,colour="black")+
geom_vline(xintercept = mean.a.r.2.obs,colour="blue")+
theme_classic()+ theme_bw()+
scale_x_continuous(limits = c(0, 100))+
xlab("Classification Accuracy (%)")+
ylab("Kernal Density Estimate")+
ggtitle("SourceINDIV")+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0),
legend.justification=c(1,1),
legend.position=c(.5,1),
legend.title=element_blank(),
legend.background = element_blank(),
legend.key.size = unit(.4, "cm"),
legend.text = element_text(size = 8),
axis.title=element_text(size=8),
legend.key=element_rect(color="white"))
#v.r.2 density plot----
#observed stats
mean.v.r.2.obs<-mean(v.r.2.accs$Observed)
error.v.r.2.obs<-qt(0.975,df=length(v.r.2.accs$Observed)-1)*sd(v.r.2.accs$Observed)/sqrt(length(v.r.2.accs$Observed))
v.r.2.obs.left<-mean(v.r.2.accs$Observed)-error.a.r.2.obs
v.r.2.obs.right<-mean(v.r.2.accs$Observed)+error.a.r.2.obs
v.r.2.obs.random.forest.stats<-data.frame(cbind(mean.v.r.2.obs,error.v.r.2.obs,v.r.2.obs.left,v.r.2.obs.right))
names(v.r.2.obs.random.forest.stats)<-c("mean","se","lower","upper")
v.r.2.obs.random.forest.stats$format<-c("observed")
#randomized stats
mean.v.r.2.rand<-mean(v.r.2.accs$Randomized)
error.v.r.2.rand<-qt(0.975,df=length(v.r.2.accs$Randomized)-1)*sd(v.r.2.accs$Randomized)/sqrt(length(v.r.2.accs$Randomized))
v.r.2.rand.left<-mean(v.r.2.accs$Randomized)-error.v.r.2.rand
v.r.2.rand.right<-mean(v.r.2.accs$Randomized)+error.v.r.2.rand
v.r.2.rand.random.forest.stats<-data.frame(cbind(mean.v.r.2.rand,error.v.r.2.rand,v.r.2.rand.left,v.r.2.rand.right))
names(v.r.2.rand.random.forest.stats)<-c("mean","se","lower","upper")
v.r.2.rand.random.forest.stats$format<-c("randomized")
v.r.2.random.forest.stats<-rbind(v.r.2.obs.random.forest.stats,v.r.2.rand.random.forest.stats)
v.r.2.random.forest.stats$model<-c("source.mean","source.mean")
RFMean <- ggplot(v.r.2.acc.dens, aes(x=OOB.med.err, fill=group))+
geom_density(alpha=0.7)+
scale_fill_manual(values=c("grey80", "grey50"),labels=c("Observed Data", "Randomized Data")) +
geom_vline(xintercept = mean.v.r.2.rand,colour="black")+
geom_vline(xintercept = mean.v.r.2.obs,colour="blue")+
theme_classic()+ theme_bw()+
scale_x_continuous(limits = c(0, 100))+
xlab("Classification Accuracy (%)")+
ylab("Kernal Density Estimate")+
ggtitle("SourceMEAN")+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0),
legend.justification=c(1,1),
legend.position=c(.5,1),
legend.title=element_blank(),
legend.background = element_blank(),
legend.key.size = unit(.4, "cm"),
legend.text = element_text(size = 8),
axis.title=element_text(size=8),
legend.key=element_rect(color="white"))
#a.s.3 density plot----
#observed stats
mean.a.s.3.obs<-mean(a.s.3.accs$Observed)
error.a.s.3.obs<-qt(0.975,df=length(a.s.3.accs$Observed)-1)*sd(a.s.3.accs$Observed)/sqrt(length(a.s.3.accs$Observed))
a.s.3.obs.left<-mean(a.s.3.accs$Observed)-error.a.s.3.obs
a.s.3.obs.right<-mean(a.s.3.accs$Observed)+error.a.s.3.obs
a.s.3.obs.random.forest.stats<-data.frame(cbind(mean.a.s.3.obs,error.a.s.3.obs,a.s.3.obs.left,a.s.3.obs.right))
names(a.s.3.obs.random.forest.stats)<-c("mean","se","lower","upper")
a.s.3.obs.random.forest.stats$format<-c("observed")
#randomized stats
mean.a.s.3.rand<-mean(a.s.3.accs$Randomized)
error.a.s.3.rand<-qt(0.975,df=length(a.s.3.accs$Randomized)-1)*sd(a.s.3.accs$Randomized)/sqrt(length(a.s.3.accs$Randomized))
a.s.3.rand.left<-mean(a.s.3.accs$Randomized)-error.a.s.3.rand
a.s.3.rand.right<-mean(a.s.3.accs$Randomized)+error.a.s.3.rand
a.s.3.rand.random.forest.stats<-data.frame(cbind(mean.a.s.3.rand,error.a.s.3.rand,a.s.3.rand.left,a.s.3.rand.right))
names(a.s.3.rand.random.forest.stats)<-c("mean","se","lower","upper")
a.s.3.rand.random.forest.stats$format<-c("randomized")
a.s.3.random.forest.stats<-rbind(a.s.3.obs.random.forest.stats,a.s.3.rand.random.forest.stats)
a.s.3.random.forest.stats$model<-c("year","year")
RFY <- ggplot(a.s.3.acc.dens, aes(x=OOB.med.err, fill=group))+
geom_density(alpha=0.7)+
scale_fill_manual(values=c("grey80", "grey50"),labels=c("Observed Data", "Randomized Data")) +
geom_vline(xintercept = mean.a.s.3.rand,colour="black")+
geom_vline(xintercept = mean.a.s.3.obs,colour="blue")+
theme_classic()+ theme_bw()+
scale_x_continuous(limits = c(0, 100))+
xlab("Classification Accuracy (%)")+
ylab("Kernal Density Estimate")+
ggtitle("Year")+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0),
legend.justification=c(1,1),
legend.position=c(1,1),
legend.title=element_blank(),
legend.background = element_blank(),
legend.key.size = unit(.4, "cm"),
legend.text = element_text(size = 8),
axis.title=element_text(size=8),
legend.key=element_rect(color="white"))
#a.r.s.6 density plot----
#observed stats
mean.a.r.s.6.obs<-mean(a.r.s.6.accs$Observed)
error.a.r.s.6.obs<-qt(0.975,df=length(a.r.s.6.accs$Observed)-1)*sd(a.r.s.6.accs$Observed)/sqrt(length(a.r.s.6.accs$Observed))
a.r.s.6.obs.left<-mean(a.r.s.6.accs$Observed)-error.a.r.s.6.obs
a.r.s.6.obs.right<-mean(a.r.s.6.accs$Observed)+error.a.r.s.6.obs
a.r.s.6.obs.random.forest.stats<-data.frame(cbind(mean.a.r.s.6.obs,error.a.r.s.6.obs,a.r.s.6.obs.left,a.r.s.6.obs.right))
names(a.r.s.6.obs.random.forest.stats)<-c("mean","se","lower","upper")
a.r.s.6.obs.random.forest.stats$format<-c("observed")
#randomized stats
mean.a.r.s.6.rand<-mean(a.r.s.6.accs$Randomized)
error.a.r.s.6.rand<-qt(0.975,df=length(a.r.s.6.accs$Randomized)-1)*sd(a.r.s.6.accs$Randomized)/sqrt(length(a.r.s.6.accs$Randomized))
a.r.s.6.rand.left<-mean(a.r.s.6.accs$Randomized)-error.a.r.s.6.rand
a.r.s.6.rand.right<-mean(a.r.s.6.accs$Randomized)+error.a.r.s.6.rand
a.r.s.6.rand.random.forest.stats<-data.frame(cbind(mean.a.r.s.6.rand,error.a.r.s.6.rand,a.r.s.6.rand.left,a.r.s.6.rand.right))
names(a.r.s.6.rand.random.forest.stats)<-c("mean","se","lower","upper")
a.r.s.6.rand.random.forest.stats$format<-c("randomized")
a.r.s.6.random.forest.stats<-rbind(a.r.s.6.obs.random.forest.stats,a.r.s.6.rand.random.forest.stats)
a.r.s.6.random.forest.stats$model<-c("year.source","year.source")
RFYR <- ggplot(a.r.s.6.acc.dens, aes(x=OOB.med.err, fill=group))+
geom_density(alpha=0.7)+
scale_fill_manual(values=c("grey80", "grey50"),labels=c("Observed Data", "Randomized Data")) +
geom_vline(xintercept = mean.a.r.s.6.rand,colour="black")+
geom_vline(xintercept = mean.a.r.s.6.obs,colour="blue")+
theme_classic()+ theme_bw()+
scale_x_continuous(limits = c(0, 100))+
xlab("Classification Accuracy (%)")+
ylab("Kernal Density Estimate")+
ggtitle("Year*Source")+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0),
legend.justification=c(1,1),
legend.position=c(1,1),
legend.title=element_blank(),
legend.background = element_blank(),
legend.key.size = unit(.4, "cm"),
legend.text = element_text(size = 8),
axis.title=element_text(size=8),
legend.key=element_rect(color="white"))
#bring together stats from all models
random.forest.stats<-data.frame(rbind(a.s.3.random.forest.stats,a.r.s.6.random.forest.stats,v.r.2.random.forest.stats,a.r.2.random.forest.stats))
#write.csv(random.forest.stats,"Revision_1_datasets/random.forest.stats.csv", row.names = FALSE)
random.forest.stats
## mean se lower upper format model
## 1 24.50172 0.09435472 24.40737 24.59608 observed year
## 2 32.88643 0.18097771 32.70546 33.06741 randomized year
## 3 16.01527 0.10486770 15.91040 16.12014 observed year.source
## 4 16.19120 0.15733344 16.03386 16.34853 randomized year.source
## 5 70.09133 0.18578071 69.99959 70.18308 observed source.mean
## 6 48.91364 0.43004359 48.48359 49.34368 randomized source.mean
## 7 75.72991 0.09174699 75.63816 75.82166 observed source.indiv
## 8 49.77670 0.24042774 49.53627 50.01713 randomized source.indiv
grid.arrange(RFY,RFYR,RFInd,RFMean,ncol=2)
ROC Curves showing model performance (SourceINDIV and SourceMEAN). These analyses can only be performed for binary classifications only.
#ROCs----
#v.r.2 for ROC
roc.v.r.2_20170218<-read.csv("Revision_1_datasets/roc.total.obs.v.r.2_20170218.csv")
roc.v.r.2_20170218$replicate<-factor(roc.v.r.2_20170218$replicate)
#a.r.2 for ROC
roc.a.r.2_20170218<-read.csv("Revision_1_datasets/roc.total.obs.a.r.2_20170218.csv")
roc.a.r.2_20170218$replicate<-factor(roc.a.r.2_20170218$replicate)
#ROC Graphs
roc.v.r.2_20170218_plot<-ggplot(roc.v.r.2_20170218,
aes(x=x_axis, y=y_axis)) +
geom_line(aes(color=replicate), alpha=0.2)+
geom_smooth(colour="darkred", method="auto") +
xlab("False Positive Rate")+
ylab("True Positive Rate")+
scale_color_grey()+
ggtitle("SourceMEAN")+
theme_bw()+
theme(
legend.position = "none",
plot.title = element_text(hjust = 0),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title=element_text(size=8),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8))
roc.a.r.2_20170218_plot<-ggplot(roc.a.r.2_20170218,
aes(x=x_axis, y=y_axis)) +
geom_line(aes(color=replicate), alpha=0.2)+
geom_smooth(colour="midnightblue", method="auto") +
xlab("False Positive Rate")+
ylab("True Positive Rate")+
scale_color_grey()+
ggtitle("SourceINDIV")+
theme_bw()+
theme(
legend.position = "none",
plot.title = element_text(hjust = 0),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title=element_text(size=8),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8))
vr2<-roc.v.r.2_20170218
vr2.group<-data.frame(rep("vr2",18646))
names(vr2.group)[1]<-"group"
vr2<-cbind(vr2,vr2.group)
ar2<-roc.a.r.2_20170218
ar2.group<-data.frame(rep("ar2",49833))
names(ar2.group)[1]<-"group"
ar2<-cbind(ar2,ar2.group)
combined.roc_20170218<-rbind(ar2,vr2)
write.csv(combined.roc_20170218, "Revision_1_datasets/combined.roc_20170218.csv", row.names = FALSE)
combined.roc.plot_20170218<-ggplot(combined.roc_20170218, aes(x=x_axis, y=y_axis, color=factor(group))) +
geom_blank() +
stat_smooth(aes(colour = factor(group)),se=FALSE, method="auto") +
theme_bw()+theme(legend.position = "none")+
xlab("False Positive Rate")+
ylab("True Positive Rate")+scale_color_manual(values=c("midnightblue", "darkred"))+
ggtitle("Comparison")+
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0),
axis.title=element_text(size=8),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8))
Combined ROC Figure
grid.arrange(roc.v.r.2_20170218_plot, roc.a.r.2_20170218_plot, combined.roc.plot_20170218, ncol=1)
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
Append the following code after the random forest analysis contained in fit.list to obtain the OOB error rates for each class in the grouping variable. This code will not be evaluated here.
fit.median.cas <- list()
for (i in 1:length(fit.list)){
fit.median.cas[[i]] <- median(fit.list[[i]]$err.rate[,'Cascades'])
}
fit.median.coa <- list()
for (i in 1:length(fit.list)){
fit.median.coa[[i]] <- median(fit.list[[i]]$err.rate[,'Coast'])
}
# Unlisting the values
fit.median.coa <- unlist(fit.median.coa)
a.r.2.err.coa<-data.frame(fit.median.coa)
fit.median.cas <- unlist(fit.median.cas)
a.r.2.err.cas<-data.frame(fit.median.cas)
a.r.2.err.cas$group<-rep("Cascades", length(a.r.2.err.cas))
names(a.r.2.err.cas)[1]<-"class.err"
a.r.2.err.coa$group<-rep("Coast", length(a.r.2.err.coa))
names(a.r.2.err.coa)[1]<-"class.err"
a.r.2.err.rates.by.class<-rbind(a.r.2.err.cas,a.r.2.err.coa)
#convert to accuracy
a.r.2.err.rates.by.class$class.err<-(a.r.2.err.rates.by.class$class.err)*100
a.r.2.err.rates.by.class$class.err<-100-(a.r.2.err.rates.by.class$class.err) #write this out >>>>
Graph and paired t-test.
#asymmetry analysis v.r.2----
#SourceMEAN model
v.r.2.err.rates.by.class<-read.csv("Revision_1_datasets/v.r.2.err.rates.by.class.csv")
v.r.2.err.rates.by.class_plot<-ggplot(v.r.2.err.rates.by.class, aes(x = group, y = class.err, fill=group, alpha=0.5))+
geom_boxplot(size=.3,outlier.colour = "black", outlier.shape = 1)+
scale_fill_manual(values=c("red","blue"))+
theme_bw()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0),
legend.position="none",
axis.title=element_text(size=8),
axis.text=element_text(size=8))+
xlab("Class")+ylab("Classification Accuracy (%)")+
ggtitle("SourceMEAN")
v.r.2.err.rates.by.class_test<-t.test(class.err ~ group, data=v.r.2.err.rates.by.class, paired=TRUE)
#asymmetry analysis a.r.2----
#SourceINDIV model
a.r.2.err.rates.by.class<-read.csv("Revision_1_datasets/a.r.2.err.rates.by.class_85.csv")
a.r.2.err.rates.by.class_plot<-ggplot(a.r.2.err.rates.by.class, aes(x = group, y = class.err, fill=group, alpha=0.5))+
geom_boxplot(size=.3,outlier.colour = "black", outlier.shape = 1)+
scale_fill_manual(values=c("red","blue"))+
theme_bw()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0),
legend.position="none",
axis.title=element_text(size=8),
axis.text=element_text(size=8))+
xlab("Class")+ylab("Classification Accuracy (%)")+
ggtitle("SourceINDIV")
a.r.2.err.rates.by.class_test<-t.test(class.err ~ group, data=a.r.2.err.rates.by.class, paired=TRUE)
#Show results of t.test
a.r.2.err.rates.by.class_test
##
## Paired t-test
##
## data: class.err by group
## t = -59.915, df = 499, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.967759 -5.588795
## sample estimates:
## mean of the differences
## -5.778277
v.r.2.err.rates.by.class_test
##
## Paired t-test
##
## data: class.err by group
## t = -48.632, df = 499, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.468792 -8.733423
## sample estimates:
## mean of the differences
## -9.101107
grid.arrange(a.r.2.err.rates.by.class_plot,v.r.2.err.rates.by.class_plot,ncol=1)