In this homework, we first apply the MRPP (multi-response permutation procedure) method to validate the groups obtained from hierarchical clustering. Next, we use discriminant analysis and CART to identify variables that have a significant impact on the groups. Finally, we assess the predictive accuracy of these models and explore other classification models in the appendix.
library(knitr)
library(fpc)
library(vegan)
library(cluster)
library(factoextra)
library(rattle)
library(rpart.plot)
library(MASS)
library(ggplot2)
library(caret)
library(purrr)
library(corrplot)
library(reshape2)
library(randomForest)
library(nnet)
library(e1071)
The clustering result from hierarchical clustering with Jaccard distance and Ward's linkage will be used throughout this homework.
#import copepod data
copdata = read.table("copepod_composition.txt", header = T)
#import species name
species_name <- read.table("copepodSPlist.txt", sep = "\t")
#assign copepod species name
rownames(copdata) <- as.character(unlist(species_name))
#convert percentage composition into proportions
copdata = copdata / 100
#find dominant species
dom = apply(copdata >= 0.02, 2, which)
dom = sort(unique(unlist(dom)))
dom = t(copdata[dom, ])
#calculate distance matrix
dist_dom = vegdist(dom, method = "jaccard")
#perform hierarchical clustering
hc_dom = agnes(dist_dom, method = "ward")
group = cutree(as.hclust(hc_dom), 3)
group = factor(group, labels = c("Spring", "Summer", "Winter"))
#visualize the result of hierarchical clustering using dendrogram
fviz_dend(hc_dom, cex = 0.5, k = 3, lwd = 0.8, main = "Ward's linkage Dendrogram", horiz = T, ylab = paste("Height", "\n", "\n", "Agglomerative coefficient=", round(hc_dom$ac, 2), "\n", "Cophenetic correlation=", round(cor(dist_dom, cophenetic(hc_dom)), 2)))
Define the function MRPP to conduct multi-response permutation procedures on clusters. This function takes the original dataset and a grouping index as arguments; the user can also change the number of permutations or the distance method used to compute the distance matrix.
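For reference, the statistic implemented below is the weighted mean of within-group distances, compared against its permutation expectation to obtain the chance-corrected effect size A:

$$\delta = \sum_{i=1}^{k}\frac{n_i}{N}\,\bar{d}_i, \qquad A = 1-\frac{\delta_{\mathrm{obs}}}{E[\delta_{\mathrm{perm}}]}$$

where $\bar{d}_i$ is the mean pairwise distance within group $i$, $n_i$ is its size and $N$ is the total number of observations; the p-value is the proportion of permuted deltas no larger than the observed delta.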
MRPP <- function(data, group, n = 999, ...){
  group = factor(group)
  k = nlevels(group)
  d = numeric(k)
  #compute observed delta: weighted mean of within-group mean distances
  for(i in 1:k){
    d[i] = mean(vegdist(data[group == levels(group)[i], , drop = FALSE], ...))
  }
  delta = sum(table(group) / length(group) * d)
  #conduct permutation: recompute delta after shuffling the group labels
  delta_perm = numeric(n)
  for(j in 1:n){
    group_perm = sample(group)
    d_perm = numeric(k)
    for(i in 1:k){
      d_perm[i] = mean(vegdist(data[group_perm == levels(group_perm)[i], , drop = FALSE], ...))
    }
    delta_perm[j] = sum(table(group_perm) / length(group_perm) * d_perm)
  }
  #calculate chance-corrected within-group agreement and significance
  A = 1 - delta / mean(delta_perm)
  n_extreme = sum(delta_perm <= delta)
  sig = if(n_extreme == 0) paste("<", format(1 / (n + 1), scientific = FALSE)) else n_extreme / n
  return(list(class_delta = d, observ_delta = delta, e_delta = mean(delta_perm),
              permuted_delta = delta_perm, A = A, significance = sig))
}
Compare the results of the self-defined MRPP function and the built-in mrpp function in R:
self_cs = MRPP(dom, group, method = "jaccard")
builtin_cs = mrpp(dist_dom, group)
#extract results
self_cs_result = cbind(paste(round(self_cs$class_delta, 3), collapse = " "), round(cbind(self_cs$observ_delta, self_cs$e_delta, self_cs$A), 3), self_cs$significance)
builtin_cs_result = cbind(paste(round(builtin_cs$classdelta, 3), collapse = " "), round(cbind(builtin_cs$delta, builtin_cs$E.delta, builtin_cs$A), 3), builtin_cs$Pvalue)
mrpp_result = data.frame(t(matrix(cbind(self_cs_result, builtin_cs_result), 5, 2)))
rownames(mrpp_result) <- c("Self defined function", "Built-in function")
#display and compare results
kable(mrpp_result, digits = 3, col.names = c("Group deltas", "Observed delta", "Expected delta", "A", "Significance level (p-value)"), align = "r")
| | Group deltas | Observed delta | Expected delta | A | Significance level (p-value) |
|---|---|---|---|---|---|
|Self defined function | 0.593 0.679 0.661 | 0.649 | 0.779 | 0.167 | < 0.001 |
|Built-in function | 0.593 0.679 0.661 | 0.649 | 0.778 | 0.166 | 0.001 |
We can see that the three clusters are well separated (A ≈ 0.17, p ≤ 0.001). Additionally, the self-defined function MRPP gives almost identical results to the built-in function mrpp.
The equal-variance assumption is tested using Levene's test:
#run Levene's test for each species
var_t = apply(dom, 2, car::leveneTest, g = group)
#extract the 43 p-values (dropping the NA second row of each test) and keep species with p <= 0.05
var_unequal = c(1:43)[unlist(transpose(var_t)$`Pr(>F)`)[-2 * c(1:43)] <= 0.05]
According to Levene’s equal variance test, copepods no. 3, 4, 6, 10, 12, 13, 15, 20, 22, 23, 24, 25, 26, 27, 31, 32, 33, 35, 36, 37, 39, 40, 43 may have unequal variance across groups.
The normality assumption is tested using the Shapiro-Wilk test:
#run the Shapiro-Wilk test for each species and keep those with p <= 0.05
norm_t = apply(dom, 2, shapiro.test)
non_norm = c(1:43)[unlist(transpose(norm_t)$p.value) <= 0.05]
According to the Shapiro-Wilk test, all 43 dominant copepod species (no. 1-43) may not be normally distributed.
It seems that a large proportion of our data violates either the equal-variance or the normality assumption, likely due to the small sample size (only 34 stations). Discriminant analysis will still be conducted, but these violations could weaken the results.
Variables with low multicollinearity and a significant effect on group will be selected. Highly correlated copepods are first grouped together using hierarchical clustering:
dom_cor = dom
num = ncol(dom_cor)
colnames(dom_cor) <- c(1:43)
corrplot(abs(cor(dom_cor)), order = "hclust", addrect = 11, cl.pos = "b", tl.cex = 0.7)
From the correlation matrix plot, we can see which copepods are highly correlated with each other.
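As a cross-check (a sketch added here, not part of the original analysis), the membership of the 11 rectangles drawn by corrplot can be reproduced approximately, assuming corrplot's default complete-linkage clustering on the dissimilarity 1 - |cor|:
#cluster species on correlation-based dissimilarity and list the 11 blocks
cor_hc = hclust(as.dist(1 - abs(cor(dom_cor))))
split(colnames(dom_cor), cutree(cor_hc, 11))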
#test the effect of group on each species with one-way ANOVA
sig_aov = numeric()
for(i in 1:43){
  sig_aov[i] = summary(aov(dom[, i] ~ group))[[1]]$'Pr(>F)'[1]
}
#species on which group has no significant effect
non_sig = c(1:43)[sig_aov > 0.05]
Group does not have a significant effect on copepods no. 1, 2, 5, 7, 8, 11, 16, 17, 18, 19, 28, 29, 30, 34, 35, 38, 41, 42.
Based on the results above, copepods no. 3, 9, 10, 15, 22, 23, 31, 32, 33, 36, 37 and 39 will be used for discriminant analysis.
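For reference, the candidate species can be listed programmatically; the final set of 12 (roughly one representative per correlation block) was then picked by hand. A minimal sketch, assuming non_sig as computed above:
#species on which group has a significant effect
sig_sp = setdiff(1:43, non_sig)
sig_sp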
dom_lda = dom[, c(3, 9, 10, 15, 22, 23, 31, 32, 33, 36, 37, 39)]
lda_names = colnames(dom_lda)
lda_model <- lda(dom_lda, grouping = group)
#Assessing the importance of the canonical functions
mag_eigen = lda_model$svd^2/sum(lda_model$svd^2) #relative magnitude of eigenvalue
#classification of data
lda_pred = predict(lda_model)
#R^2 gives squared canonical correlation coef
scores = lda_pred$x
canon_cor = summary(lm(scores ~ group))
ld1_r2 = canon_cor$`Response LD1`$adj.r.squared
ld2_r2 = canon_cor$`Response LD2`$adj.r.squared
The DA model has a squared canonical correlation of 0.91 for LD1 and 0.81 for LD2, indicating a good fit. The relative magnitudes of the two canonical functions are 69.97% and 30.03%, respectively.
dataset = data.frame(Group = group, lda = lda_pred$x)
ggplot(dataset, aes(x = lda.LD1, y = lda.LD2, col = Group)) +
geom_point(size = 2.5, alpha = 0.5) +
labs(title = "Discriminant function analysis result", x = paste("LD1 (", 100 * round(mag_eigen[1], 3), "%)", sep = ""), y = paste("LD2 (", 100 * round(mag_eigen[2], 3), "%)", sep = ""))
From the graph, we can see that LD1 separates summer from the other seasons and LD2 separates spring from winter.
kable(lda_model$scaling, caption = "Coefficients of linear discriminants:")
| | LD1 | LD2 |
|---|---|---|
|Calanus sinicus | 5.584891 | -3.2561071 |
|Subeucalanus pileatus | 2.175027 | -47.5902974 |
|Subeucalanus subcrassus | 51.623539 | -171.8487385 |
|Acrocalanus gracilis | -114.653752 | -26.1499884 |
|Paracalanus serrulus | -33.623515 | 7.3459676 |
|Parvocalanus crassirostris | -11.262712 | -2.6483692 |
|Oithona similis | -4.474415 | 8.5272036 |
|Euterpina acutifrons | -15.922788 | -3.0803143 |
|Corycaeus (Ditrichocorycaeus) affinis | -1.141550 | 4.6115308 |
|Corycaeus (Ditrichocorycaeus) subtilis | -204.711462 | 0.0916365 |
|Corycaeus (Onychocorycaeus) giesbrechti | -78.211709 | -58.5536704 |
|Corycaeidae copepodid | 6.934872 | 26.8959069 |
Judging from the coefficient magnitudes, the copepods Subeucalanus subcrassus, Acrocalanus gracilis, Paracalanus serrulus, Corycaeus (Ditrichocorycaeus) subtilis and Corycaeus (Onychocorycaeus) giesbrechti have the strongest influence on LD1, while Subeucalanus pileatus, Subeucalanus subcrassus, Acrocalanus gracilis, Corycaeus (Onychocorycaeus) giesbrechti and Corycaeidae copepodid have the strongest influence on LD2.
As we have seen from the graph, LD1 determines whether a station's data were recorded in summer and LD2 separates spring from winter data. We can verify this effect by looking at the mean proportion of each copepod in the different seasons:
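The table below can be reproduced with a short computation (a sketch, assuming dom_lda and group as defined above):
#mean proportion of each selected copepod in each season
season_mean = apply(dom_lda, 2, function(x) tapply(x, group, mean))
kable(t(season_mean), digits = 4)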
| | Spring | Summer | Winter |
|---|---|---|---|
|Calanus sinicus | 0.0703 | 0.0008 | 0.0964 |
|Subeucalanus pileatus | 0.0000 | 0.0071 | 0.0146 |
|Subeucalanus subcrassus | 0.0000 | 0.0014 | 0.0155 |
|Acrocalanus gracilis | 0.0003 | 0.0069 | 0.0022 |
|Paracalanus serrulus | 0.0000 | 0.0248 | 0.0007 |
|Parvocalanus crassirostris | 0.0088 | 0.1511 | 0.0009 |
|Oithona similis | 0.0912 | 0.0104 | 0.0000 |
|Euterpina acutifrons | 0.0065 | 0.0537 | 0.0000 |
|Corycaeus (Ditrichocorycaeus) affinis | 0.1233 | 0.0019 | 0.0000 |
|Corycaeus (Ditrichocorycaeus) subtilis | 0.0007 | 0.0053 | 0.0006 |
|Corycaeus (Onychocorycaeus) giesbrechti | 0.0000 | 0.0101 | 0.0024 |
|Corycaeidae copepodid | 0.0160 | 0.0000 | 0.0000 |
#classification table
conf = table(group,lda_pred$class)
kable(conf, caption = "Classification table of discriminant analysis")
| | Spring | Summer | Winter |
|---|---|---|---|
|Spring | 10 | 0 | 0 |
|Summer | 0 | 15 | 0 |
|Winter | 0 | 0 | 9 |
The model has a 100% correct classification rate.
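This rate can be confirmed directly from the confusion table:
#overall correct classification rate
sum(diag(conf)) / sum(conf)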
#import environmental data and use station ID as row names
envdata = read.table("enviANDdensity.txt", header = T)
rownames(envdata) <- envdata[, 1]
envdata = envdata[, -1]
#fit a classification tree on copepod and environmental variables
tree <- rpart(group ~ ., data = data.frame(dom, envdata), control = rpart.control(minsplit = 5))
fancyRpartPlot(tree, sub = "")
It seems that dissolved oxygen gives a perfect separation between summer stations and those of the other seasons. Afterwards, the percentage of the copepod Subeucalanus pileatus separates spring and winter data perfectly. The classification tree model chooses dissolved oxygen and the copepod Subeucalanus pileatus as the most important variables in determining which group a station belongs to.
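The split structure can also be inspected as text to confirm this reading:
#print the fitted splits; the first is on dissolved oxygen,
#the second on Subeucalanus pileatus
print(tree)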
This appendix is not a part of the homework questions.
As mentioned above, both models built in this homework have a perfect classification rate. This might indicate that the groups can be perfectly predicted by our models, or that we have an overfitting problem. In this first part of the appendix, we apply 10-fold cross-validation to assess the predictive accuracy of these models.
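Each cross-validation function below pools the held-out predictions from all folds into a single confusion matrix $C$ and reports the overall accuracy:

$$\text{accuracy} = \frac{\sum_{i} C_{ii}}{\sum_{i,j} C_{ij}}$$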
Define the function k_cv_lda, which conducts a k-fold cross-validation on a discriminant analysis model and returns the overall predictive accuracy:
k_cv_lda <- function(data, group, k = 10, ...){
  cv_index = createFolds(group, k = k)
  pred = numeric()
  for(i in 1:k){
    #fit on all folds but the i-th, then predict the held-out fold
    model = lda(data[-cv_index[[i]], ], group[-cv_index[[i]]], ...)
    pred = append(pred, predict(model, data[cv_index[[i]], ])$class)
  }
  conf = table(group[unlist(cv_index)], pred)
  accu = sum(diag(conf)) / sum(conf)
  return(accu)
}
The 10-fold cross-validation is repeated 100 times to assess the performance of the discriminant analysis model:
lda_perf = numeric()
for(j in 1:100){
  lda_perf[j] = k_cv_lda(dom_lda, group)
}
lda_perf_sum = summary(lda_perf)
kable(t(cbind(lda_perf_sum)), row.names = F, caption = "Predictive accuracy of discriminant function model")
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
0.9118 | 0.9706 | 0.9706 | 0.965 | 0.9706 | 1 |
The discriminant analysis model has an average accuracy of 96.50% with a worst accuracy of 91.18% and best accuracy of 100.00%.
Define the function k_cv_tree, which conducts a k-fold cross-validation on a classification tree model:
k_cv_tree <- function(data, group, k = 10, ...){
  cv_index = createFolds(group, k = k)
  pred = numeric()
  for(i in 1:k){
    model = rpart(group[-cv_index[[i]]] ~ ., data[-cv_index[[i]], ], ...)
    pred = append(pred, predict(model, data[cv_index[[i]], ], type = "class"))
  }
  conf = table(group[unlist(cv_index)], pred)
  accu = sum(diag(conf)) / sum(conf)
  return(accu)
}
The 10-fold cross-validation is repeated 100 times to assess the performance of the tree model:
tree_perf = numeric()
for(j in 1:100){
  tree_perf[j] = k_cv_tree(data.frame(dom, envdata), group)
}
tree_perf_sum = summary(tree_perf)
kable(t(cbind(tree_perf_sum)), row.names = F, caption = "Predictive accuracy of classification tree model")
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
0.6471 | 0.7059 | 0.7059 | 0.7091 | 0.7353 | 0.7353 |
The classification tree model has an average accuracy of 70.91% with a worst accuracy of 64.71% and best accuracy of 73.53%.
#estimate prediction accuracy with only the knowledge of group proportions (guessing)
naive_accu = numeric()
for(i in 1:100){
  con = table(group, sample(group, 34, replace = T))
  naive_accu[i] = sum(diag(con)) / sum(con)
}
#show and compare the summary of accuracy for different methods
cv_accu = t(cbind(summary(naive_accu), lda_perf_sum, tree_perf_sum))
rownames(cv_accu) = c("Guess", "Discriminant analysis", "Classification tree")
kable(cv_accu)
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
|Guess | 0.1471 | 0.2941 | 0.3529 | 0.3594 | 0.4118 | 0.5294 |
|Discriminant analysis | 0.9118 | 0.9706 | 0.9706 | 0.9650 | 0.9706 | 1.0000 |
|Classification tree | 0.6471 | 0.7059 | 0.7059 | 0.7091 | 0.7353 | 0.7353 |
#visualize the distribution of accuracy for different methods
accu_all = data.frame(naive_accu, lda_perf, tree_perf)
colnames(accu_all) <- c("Guess", "DA", "Tree")
accu_all = melt(accu_all)
ggplot(accu_all, aes(value, col = variable, fill = variable)) +
geom_density(alpha = 0.6) +
labs(title = "Predictive accuracy for different methods", x = "Accuracy") +
scale_fill_discrete(name = "Method") +
scale_color_discrete(name = "Method")
As we can see, the discriminant analysis model gives substantially better accuracy than the classification tree model and guessing. Even in the worst-case scenario, the discriminant analysis model has an accuracy of 91.18%, outperforming the other methods' best performances. It is also worth noting that both models perform better than guessing, which has an average accuracy of only 35.94%.
Do these results prove that discriminant analysis is a much better model? Not necessarily! In this study, the sample size is relatively small (34), so cross-validation could yield skewed estimates. Additionally, the classification tree model uses only 2 variables, compared with the 12 used in discriminant analysis. Furthermore, discriminant analysis requires much stricter assumptions and is thus more likely to fail if new data violate them. Nevertheless, discriminant analysis remains the preferred model until new evidence suggests otherwise.
We will explore other possible models in part 2 of the appendix, including random forest, a more robust version of CART.
Besides discriminant analysis and CART, there are many other models for classification. In this section, we build and assess three of them: random forest, multinomial logistic regression and support vector machine. These models are assessed using cross-validation only.
Random forest is a more robust version of CART. The algorithm builds many CART models, each using a random subset of the original variables and a bootstrap sample of the data, with every tree fully grown (without pruning). The prediction of a random forest is the majority vote of its individual trees.
forest = randomForest(data.frame(dom, envdata), group, importance = TRUE, proximity = TRUE, keep.forest = TRUE)
#note: predict() without new data returns out-of-bag predictions
conf_forest = table(group, predict(forest))
accu_forest = sum(diag(conf_forest)) / sum(conf_forest)
kable(conf_forest)
| | Spring | Summer | Winter |
|---|---|---|---|
|Spring | 10 | 0 | 0 |
|Summer | 0 | 15 | 0 |
|Winter | 0 | 0 | 9 |
The correct classification rate of the random forest model is 100% (based on out-of-bag predictions). Cross-validation is conducted next to assess predictive accuracy.
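Before that, since importance = TRUE was set when fitting, we can also peek at which variables the forest relies on (an aside, not part of the original homework):
#plot the ten most important variables by mean decrease in accuracy and Gini
varImpPlot(forest, n.var = 10, main = "Random forest variable importance")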
Define the function k_cv_forest, which conducts a k-fold cross-validation on a random forest model:
k_cv_forest <- function(data, group, k = 10, ...){
  cv_index = createFolds(group, k = k)
  pred = numeric()
  for(i in 1:k){
    model = randomForest(data[-cv_index[[i]], ], group[-cv_index[[i]]], ...)
    pred = append(pred, predict(model, data[cv_index[[i]], ]))
  }
  conf = table(group[unlist(cv_index)], pred)
  accu = sum(diag(conf)) / sum(conf)
  return(accu)
}
A 10-fold cross-validation is repeated 100 times to assess the performance of the random forest model:
forest_perf = numeric()
for(j in 1:100){
  forest_perf[j] = k_cv_forest(data.frame(dom, envdata), group)
}
forest_perf_sum = summary(forest_perf)
kable(t(cbind(forest_perf_sum)), row.names = F, caption = "Predictive accuracy of random forest model")
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 |
Although it is hard to believe, the model always achieves 100% accuracy! If no calculation or coding error exists (please inform me if you find one!), the random forest model predicts the groups perfectly, at least for the data currently on hand.
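As a quick sanity check (an aside, not part of the original homework), the out-of-bag confusion matrix of the forest fitted earlier on the full data tells the same story:
#out-of-bag confusion matrix with per-class error rates
forest$confusion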
Similar to discriminant analysis, multinomial logistic regression also assumes relatively low collinearity. As a result, the same variables used in DA (those with low correlation and a significant effect on group) are used to build this regression model.
multi_logit = multinom(group ~ ., data = data.frame(dom_lda), trace = F)
conf_logit = table(group, predict(multi_logit))
accu_logit = sum(diag(conf_logit)) / sum(conf_logit)
kable(conf_logit)
| | Spring | Summer | Winter |
|---|---|---|---|
|Spring | 10 | 0 | 0 |
|Summer | 0 | 15 | 0 |
|Winter | 0 | 0 | 9 |
The correct classification rate of the multinomial logistic regression model is 100%. Cross-validation is conducted next to assess predictive accuracy.
Define the function k_cv_multilogit, which conducts a k-fold cross-validation on a multinomial logistic regression model:
k_cv_multilogit <- function(data, group, k = 10, ...){
  cv_index = createFolds(group, k = k)
  pred = numeric()
  for(i in 1:k){
    model = multinom(group[-cv_index[[i]]] ~ ., data[-cv_index[[i]], ], trace = F, ...)
    pred = append(pred, predict(model, data[cv_index[[i]], ]))
  }
  conf = table(group[unlist(cv_index)], pred)
  accu = sum(diag(conf)) / sum(conf)
  return(accu)
}
The 10-fold cross-validation is repeated 100 times to assess the performance of the multinomial logistic regression model:
multilogit_perf = numeric()
for(j in 1:100){
  multilogit_perf[j] = k_cv_multilogit(data.frame(dom_lda), group)
}
multilogit_perf_sum = summary(multilogit_perf)
kable(t(cbind(multilogit_perf_sum)), row.names = F, caption = "Predictive accuracy of multinomial logistic regression model")
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
0.8824 | 0.9118 | 0.9412 | 0.9332 | 0.9412 | 1 |
Finally, a support vector machine (SVM) model is built using all copepod and environmental variables:
svm_model = svm(group ~ ., data = data.frame(dom, envdata), type = "C-classification")
conf_svm = table(group, predict(svm_model))
accu_svm = sum(diag(conf_svm)) / sum(conf_svm)
kable(conf_svm)
| | Spring | Summer | Winter |
|---|---|---|---|
|Spring | 10 | 0 | 0 |
|Summer | 0 | 15 | 0 |
|Winter | 0 | 0 | 9 |
The correct classification rate of the support vector machine model is 100%. Cross-validation is conducted next to assess predictive accuracy.
Define the function k_cv_svm, which conducts a k-fold cross-validation on an SVM model:
k_cv_svm <- function(data, group, k = 10, ...){
  cv_index = createFolds(group, k = k)
  pred = numeric()
  for(i in 1:k){
    model = svm(group[-cv_index[[i]]] ~ ., data[-cv_index[[i]], ], ...)
    pred = append(pred, predict(model, data[cv_index[[i]], ]))
  }
  conf = table(group[unlist(cv_index)], pred)
  accu = sum(diag(conf)) / sum(conf)
  return(accu)
}
The 10-fold cross-validation is repeated 100 times to assess the performance of the SVM model:
svm_perf = numeric()
for(j in 1:100){
  svm_perf[j] = k_cv_svm(data.frame(dom, envdata), group, type = "C-classification", kernel = "linear")
}
svm_perf_sum = summary(svm_perf)
kable(t(cbind(svm_perf_sum)), row.names = F, caption = "Predictive accuracy of support vector machine model")
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
0.9412 | 0.9706 | 1 | 0.9853 | 1 | 1 |
#show and compare the summary of accuracy for different methods
cv_accu = t(cbind(summary(naive_accu), lda_perf_sum, tree_perf_sum, forest_perf_sum, multilogit_perf_sum, svm_perf_sum))
rownames(cv_accu) = c("Guess", "Discriminant analysis", "Classification tree", "Random forest", "Multinomial logistic regression", "Support vector machine")
kable(cv_accu)
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
|Guess | 0.1471 | 0.2941 | 0.3529 | 0.3594 | 0.4118 | 0.5294 |
|Discriminant analysis | 0.9118 | 0.9706 | 0.9706 | 0.9650 | 0.9706 | 1.0000 |
|Classification tree | 0.6471 | 0.7059 | 0.7059 | 0.7091 | 0.7353 | 0.7353 |
|Random forest | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
|Multinomial logistic regression | 0.8824 | 0.9118 | 0.9412 | 0.9332 | 0.9412 | 1.0000 |
|Support vector machine | 0.9412 | 0.9706 | 1.0000 | 0.9853 | 1.0000 | 1.0000 |
Based on predictive accuracy, the random forest model is undoubtedly the best, always making 100% accurate predictions. Discriminant analysis, the support vector machine and multinomial logistic regression have similar performance. The classification tree model performs unsatisfactorily, likely due to overfitting (which random forest avoids). Finally, all methods outperform guessing.