
In this homework, we first conduct multi-response permutation procedures (MRPP) to validate the groups obtained from hierarchical clustering. Next, we use discriminant analysis and CART to identify variables that have a significant impact on the groups. Finally, we assess the predictive accuracy of these models and explore other classification models in the appendix.

Load packages

library(knitr)
library(fpc)
library(vegan)
library(cluster)
library(factoextra)
library(rattle)
library(rpart.plot)
library(MASS)
library(ggplot2)
library(caret)
library(purrr)
library(corrplot)
library(reshape2)
library(randomForest)
library(nnet)
library(e1071)

Import data and conduct cluster analysis as described in the previous homework.

The cluster result of hierarchical clustering using Jaccard distance and Ward’s linkage will be used for this homework.

#import copepod data
copdata = read.table("copepod_composition.txt", header = T)

#import species name
species_name <- read.table("copepodSPlist.txt", sep = "\t")

#assign copepod species name
rownames(copdata) <- as.character(unlist(species_name))

#convert species composition from percentage to proportion
copdata = copdata / 100

#find dominant species (>= 2% of the composition at any station)
dom = apply(copdata >= 0.02, 2, which)
dom = sort(unique(unlist(dom)))
dom = t(copdata[dom, ])

#calculate distance matrix
dist_dom = vegdist(dom, method = "jaccard")

#perform hierarchical clustering
hc_dom = agnes(dist_dom, method = "ward")
group = cutree(hc_dom, 3)
group = factor(group, labels = c("Spring", "Summer", "Winter"))

#visualize the result of hierarchical clustering using dendrogram
fviz_dend(hc_dom, cex = 0.5, k = 3, lwd = 0.8, main = "Ward's linkage Dendrogram", horiz = T, ylab = paste("Height", "\n", "\n", "Agglomerative coefficient=", round(hc_dom$ac, 2), "\n", "Cophenetic correlation=", round(cor(dist_dom, cophenetic(hc_dom)), 2)))

Q1. Choose one of the results from the cluster analysis from your previous homework. Evaluate the cluster structure using one of the three nonparametric tests. Apply your MRPP code to perform the nonparametric tests, then use built-in functions in Matlab or R to check your results.

Self-defined MRPP function

Define function MRPP to conduct multi-response permutation procedures on clusters. The function takes the original dataset and a grouping index as arguments; the user can also change the number of permutations or the distance method used to compute the distance matrix. The observed statistic is the weighted mean within-group distance, delta = sum_i (n_i / N) * d_i, where d_i is the mean distance within group i, and significance is the proportion of permuted deltas no larger than the observed delta.

MRPP <- function(data, group, n = 999, ...){
    col_n = ncol(data)
    group = factor(group)
    k = nlevels(group)
    #split() treats the matrix as a column-major vector and recycles group,
    #so each element holds one group's rows across all columns
    group_split = split(data, group)
    d = numeric()
    
    #compute observed delta
    for(i in 1:k){
        d[i] = mean(vegdist(matrix(group_split[[i]], length(group_split[[i]]) / col_n, col_n), ...))
    }
    delta = sum(table(group) / length(group) * d)
    
    #conduct permutation
    delta_perm = numeric()
    for(j in 1:n){
        group_perm = sample(group)
        group_perm_split = split(data, group_perm)
        d_perm = numeric()
        for(i in 1:k){
            d_perm[i] = mean(vegdist(matrix(group_perm_split[[i]],  length(group_perm_split[[i]]) / col_n, col_n), ...))
        }
        delta_perm[j] = sum(table(group_perm) / length(group_perm) * d_perm)
    }
    #calculate various outputs of mrpp
    A = 1 - delta / mean(delta_perm)    #chance-corrected within-group agreement
    sig = sum(delta_perm <= delta) / n    #permutation p-value
    if(sig == 0){sig = paste("<", format(1 / (n+1), scientific = F))}
    return(list(class_delta = d, observ_delta = delta, e_delta = mean(delta_perm), permuted_delta = delta_perm, A = A, significance = sig))
}

Compare results from the self-defined MRPP function and the built-in mrpp function in R

self_cs = MRPP(dom, group, method = "jaccard")
builtin_cs = mrpp(dist_dom, group)

#extract results
self_cs_result = cbind(paste(round(self_cs$class_delta, 3), collapse = " "), round(cbind(self_cs$observ_delta, self_cs$e_delta, self_cs$A), 3), self_cs$significance)
builtin_cs_result = cbind(paste(round(builtin_cs$classdelta, 3), collapse = " "), round(cbind(builtin_cs$delta, builtin_cs$E.delta, builtin_cs$A), 3), builtin_cs$Pvalue)
mrpp_result = data.frame(t(matrix(cbind(self_cs_result, builtin_cs_result), 5, 2)))
rownames(mrpp_result) <- c("Self defined function", "Built-in function")

#display and compare results
kable(mrpp_result, digits = 3, col.names = c("Group deltas", "Observed delta", "Expected delta", "A", "Significance level (p-value)"), align = "r")
                       Group deltas        Observed delta  Expected delta      A  Significance level (p-value)
Self defined function  0.593 0.679 0.661            0.649           0.779  0.167                       < 0.001
Built-in function      0.593 0.679 0.661            0.649           0.778  0.166                         0.001

With A = 0.167 and p < 0.001, the within-group distances are smaller than expected by chance, so the three clusters are well separated. Additionally, the self-defined function MRPP gives almost identical results to the built-in function mrpp.
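As an extra illustration (not part of the homework output), the permuted deltas returned by MRPP can be plotted against the observed delta using base R graphics:

#visualize the null distribution of delta from the permutations (illustrative sketch)
hist(self_cs$permuted_delta, breaks = 30, main = "Permutation distribution of delta", xlab = "Permuted delta")
abline(v = self_cs$observ_delta, col = "red", lwd = 2)    #observed delta falls far in the left tail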

Q2. Use one of the results from the cluster analysis of your previous homework to perform discriminant analysis (DA). Describe 1) whether your data meet the requirement for doing DA, and, 2) if so, which species are most distinct among the clusters.

Check assumptions of discriminant analysis

Equal variance assumption

The equal variance assumption is tested using Levene's test for homogeneity of variance

var_t = apply(dom, 2, car::leveneTest, g = group)
#leveneTest returns two Pr(>F) entries per species (value and NA); keep the first of each pair
var_unequal = c(1:43)[unlist(transpose(var_t)$`Pr(>F)`)[-2 * c(1:43)] <= 0.05]

According to Levene’s equal variance test, copepods no. 3, 4, 6, 10, 12, 13, 15, 20, 22, 23, 24, 25, 26, 27, 31, 32, 33, 35, 36, 37, 39, 40, 43 may have unequal variance across groups.

Normality assumption

The normality assumption is tested using the Shapiro-Wilk normality test

norm_t = apply(dom, 2, shapiro.test)
non_norm = c(1:43)[unlist(transpose(norm_t)$p.value) <= 0.05]

According to Shapiro-Wilk normality test, copepods no. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43 may not be normally distributed.

It seems that a large proportion of our data violates either the equal variance or the normality assumption, which is unsurprising for compositional data with a small sample size (only 34 stations). Discriminant analysis will still be conducted, but these violations could undermine the results.
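One common mitigation, sketched below as my own assumption rather than part of the assignment, is to re-check normality after an empirical logit transformation of the proportions:

#illustrative sketch: re-test normality after an empirical logit transformation
#(the small constant 0.001 avoids log(0); the transformation choice is an assumption)
dom_logit = log((dom + 0.001) / (1 - dom + 0.001))
norm_t2 = apply(dom_logit, 2, shapiro.test)
sum(unlist(transpose(norm_t2)$p.value) <= 0.05)    #species still flagged as non-normal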

Variable selection

Variables with low multicollinearity and a significant group effect will be selected

Multicollinearity

Copepods with high correlation would be grouped together using hierarchical clustering.

dom_cor = dom
num = ncol(dom_cor)
colnames(dom_cor) <- c(1:43)
corrplot(abs(cor(dom_cor)), order = "hclust", addrect = 11, cl.pos = "b", tl.cex = 0.7)

From the correlation matrix plot, we can see which copepods are highly correlated with each other.
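The 11 rectangles drawn by corrplot can also be extracted programmatically; this sketch assumes corrplot's default hclust ordering (complete linkage on 1 minus the supplied matrix):

#sketch: reproduce the 11 correlation groups shown by corrplot (ordering assumption noted above)
cor_hc = hclust(as.dist(1 - abs(cor(dom_cor))), method = "complete")
split(colnames(dom_cor), cutree(cor_hc, k = 11))    #members of each correlation group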

Group effect on copepod

sig_aov = numeric()
for(i in 1:43){
    sig_aov[i] = summary(aov(dom[, i] ~ group))[[1]]$'Pr(>F)'[1]
}
non_sig = c(1:43)[sig_aov > 0.05]

Group does not have a significant effect on copepods no. 1, 2, 5, 7, 8, 11, 16, 17, 18, 19, 28, 29, 30, 34, 35, 38, 41, 42.

Based on the results above, copepod no. 3, 9, 10, 15, 22, 23, 31, 32, 33, 36, 37 and 39 would be used to conduct discriminant analysis.

Conduct discriminant analysis on the 3 groups of stations

dom_lda = dom[, c(3, 9, 10, 15, 22, 23, 31, 32, 33, 36, 37, 39)]
lda_names = colnames(dom_lda)
lda_model <- lda(dom_lda, grouping = group)

Analyze results of discriminant analysis

#Assessing the importance of the canonical functions
mag_eigen = lda_model$svd^2/sum(lda_model$svd^2)    #relative magnitude of eigenvalue

#classification of data
lda_pred = predict(lda_model)

#adjusted R^2 from regressing each LD score on group gives the squared canonical correlation
scores = lda_pred$x
canon_cor = summary(lm(scores ~ group))
ld1_r2 = canon_cor$`Response LD1`$adj.r.squared
ld2_r2 = canon_cor$`Response LD2`$adj.r.squared

The DA model has a squared canonical correlation of 0.91 for LD1 and 0.81 for LD2, indicating a good fit. The relative magnitudes of the two canonical functions (their shares of the eigenvalue total) are 69.97% and 30.03%, respectively.

Visualize discriminant analysis result

dataset = data.frame(Group = group, lda = lda_pred$x)
ggplot(dataset, aes(x = lda.LD1, y = lda.LD2, col = Group)) +
    geom_point(size = 2.5, alpha = 0.5) +
    labs(title = "Discriminant function analysis result", x = paste("LD1 (", 100 * round(mag_eigen[1], 3), "%)", sep = ""), y = paste("LD2 (", 100 * round(mag_eigen[2], 3), "%)", sep = ""))

From the graph, we can see that LD1 separates summer from the other seasons and LD2 separates spring from winter.

Inspect impact of each variable (copepod) on the canonical functions

kable(lda_model$scaling, caption = "Coefficients of linear discriminants:")
Coefficients of linear discriminants:
                                                 LD1           LD2
Calanus sinicus                             5.584891    -3.2561071
Subeucalanus pileatus                       2.175027   -47.5902974
Subeucalanus subcrassus                    51.623539  -171.8487385
Acrocalanus gracilis                     -114.653752   -26.1499884
Paracalanus serrulus                      -33.623515     7.3459676
Parvocalanus crassirostris                -11.262712    -2.6483692
Oithona similis                            -4.474415     8.5272036
Euterpina acutifrons                      -15.922788    -3.0803143
Corycaeus (Ditrichocorycaeus) affinis      -1.141550     4.6115308
Corycaeus (Ditrichocorycaeus) subtilis   -204.711462     0.0916365
Corycaeus (Onychocorycaeus) giesbrechti   -78.211709   -58.5536704
Corycaeidae copepodid                       6.934872    26.8959069

Copepods Subeucalanus subcrassus, Acrocalanus gracilis, Paracalanus serrulus, Corycaeus (Ditrichocorycaeus) subtilis and Corycaeus (Onychocorycaeus) giesbrechti carry the largest coefficients on LD1, while Subeucalanus pileatus, Subeucalanus subcrassus, Acrocalanus gracilis, Corycaeus (Onychocorycaeus) giesbrechti and Corycaeidae copepodid weigh most heavily on LD2.

As we’ve seen from the graph, LD1 determines whether a station was sampled in summer and LD2 separates the spring and winter data. We can verify this effect by looking at the mean proportion of each copepod in the different seasons:

                                         Spring  Summer  Winter
Calanus sinicus                          0.0703  0.0008  0.0964
Subeucalanus pileatus                    0.0000  0.0071  0.0146
Subeucalanus subcrassus                  0.0000  0.0014  0.0155
Acrocalanus gracilis                     0.0003  0.0069  0.0022
Paracalanus serrulus                     0.0000  0.0248  0.0007
Parvocalanus crassirostris               0.0088  0.1511  0.0009
Oithona similis                          0.0912  0.0104  0.0000
Euterpina acutifrons                     0.0065  0.0537  0.0000
Corycaeus (Ditrichocorycaeus) affinis    0.1233  0.0019  0.0000
Corycaeus (Ditrichocorycaeus) subtilis   0.0007  0.0053  0.0006
Corycaeus (Onychocorycaeus) giesbrechti  0.0000  0.0101  0.0024
Corycaeidae copepodid                    0.0160  0.0000  0.0000
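The code that produced this table is not shown above; a minimal sketch, assuming it is simply the group means of the DA variables, would be:

#sketch: mean proportion of each DA copepod per season
season_mean = apply(dom_lda, 2, function(x) tapply(x, group, mean))
kable(t(round(season_mean, 4)))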

Correct classification rate of DA model

#classification table
conf = table(group, lda_pred$class)
kable(conf, caption = "Classification table of discriminant analysis")
Classification table of discriminant analysis
         Spring  Summer  Winter
Spring       10       0       0
Summer        0      15       0
Winter        0       0       9

The model has a 100% correct classification rate.

Q3. Use the “dominant” copepod species data (from HW1). Perform CART of stations based on percent composition data of the dominant species, and tell your story about these copepod data. You can also use the environmental data if you wish.

Load and extract environmental data

envdata = read.table("enviANDdensity.txt", header = T)
rownames(envdata) <- envdata[, 1]
envdata = envdata[, -1]

Build CART model for stations using dominant copepod species and environmental data

tree <- rpart(group ~ ., data = data.frame(dom, envdata), control = rpart.control(minsplit = 5))
fancyRpartPlot(tree, sub = "")

It seems that dissolved oxygen gives a perfect separation between the summer stations and the other seasons; afterwards, the proportion of the copepod Subeucalanus pileatus separates the spring and winter data perfectly. The classification tree thus selects dissolved oxygen and Subeucalanus pileatus as the most important variables in determining which group a station belongs to.
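As a quick check of this reading, rpart stores importance scores for the variables it considered; printing them is a one-line extra (illustrative, not part of the assignment):

#importance scores accumulated by rpart over primary and surrogate splits
round(tree$variable.importance, 2)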

Appendix

This appendix is not a part of the homework questions.

Part 1

As mentioned above, both models built in this homework have a perfect classification rate. This might indicate that the groups can be perfectly predicted by our models, or that we have an overfitting problem. In this first part of the appendix, we apply 10-fold cross-validation to assess the predictive accuracy of these models.

Function for cross-validation of discriminant analysis

Define function k_cv_lda that conducts a k-fold cross-validation on a discriminant analysis model and returns the overall predictive accuracy

k_cv_lda <- function(data, group, k = 10, ...){
    cv_index = createFolds(group, k = k)    #stratified folds; use the k argument rather than a hardcoded 10
    pred = numeric()
    for(i in 1:k){
        model = lda(data[-cv_index[[i]], ], group[-cv_index[[i]]], ...)
        pred = append(pred, predict(model, data[cv_index[[i]], ])$class)
    }
    conf = table(group[unlist(cv_index)], pred)
    accu = sum(diag(conf)) / sum(conf)
    return(accu)
}

Assess predictive accuracy of the discriminant analysis model

10-fold cross-validation is conducted 100 times to assess the performance of the discriminant analysis model

lda_perf = numeric()
for(j in 1:100){
    lda_perf[j] = mean(k_cv_lda(dom_lda, group))
}
lda_perf_sum = summary(lda_perf)
kable(t(cbind(lda_perf_sum)), row.names = F, caption = "Predictive accuracy of discriminant function model")
Predictive accuracy of discriminant function model
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9118 0.9706 0.9706 0.965 0.9706 1

The discriminant analysis model has an average accuracy of 96.50% with a worst accuracy of 91.18% and best accuracy of 100.00%.

Function for cross-validation of classification tree model

Define function k_cv_tree that conducts a k-fold cross-validation on a classification tree model

k_cv_tree <- function(data, group, k = 10, ...){
    cv_index = createFolds(group, k = k)
    pred = numeric()
    for(i in 1:k){
        model = rpart(group[-cv_index[[i]]] ~., data[-cv_index[[i]], ], ...)
        pred = append(pred, predict(model, data[cv_index[[i]], ], type = "class"))
    }
    conf = table(group[unlist(cv_index)], pred)
    accu = sum(diag(conf)) / sum(conf)
    return(accu)
}

Assess predictive accuracy of the classification tree model

10-fold cross-validation is conducted 100 times to assess the performance of the tree model

tree_perf = numeric()
for(j in 1:100){
    tree_perf[j] = mean(k_cv_tree(data.frame(dom, envdata), group))
}
tree_perf_sum = summary(tree_perf)
kable(t(cbind(tree_perf_sum)), row.names = F, caption = "Predictive accuracy of classification tree model")
Predictive accuracy of classification tree model
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.6471 0.7059 0.7059 0.7091 0.7353 0.7353

The classification tree model has an average accuracy of 70.91% with a worst accuracy of 64.71% and best accuracy of 73.53%.

Estimate the predictive accuracy when guessing

#estimate prediction accuracy with only the knowledge of proportion (guessing)
naive_accu = numeric()
for(i in 1:100){
    #guess by drawing labels with replacement in proportion to group sizes
    con = table(group, sample(group, length(group), replace = T))
    naive_accu[i] = sum(diag(con)) / sum(con)
}
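As a sanity check on the simulation, the expected accuracy of proportional guessing is the sum of squared class proportions, which can be computed directly:

#expected accuracy of proportional random guessing: sum over groups of p_k^2
sum((table(group) / length(group))^2)    #about 0.351, close to the simulated mean of 0.3594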

Compare predictive accuracy of guessing, discriminant analysis and classification tree model

#show and compare the summary of accuracy for different methods
cv_accu = t(cbind(summary(naive_accu), lda_perf_sum, tree_perf_sum))
rownames(cv_accu) = c("Guess", "Discriminant analysis", "Classification tree")
kable(cv_accu)
                          Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
Guess                   0.1471   0.2941  0.3529  0.3594   0.4118  0.5294
Discriminant analysis   0.9118   0.9706  0.9706  0.9650   0.9706  1.0000
Classification tree     0.6471   0.7059  0.7059  0.7091   0.7353  0.7353
#visualize the distribution of accuracy for different methods
accu_all = data.frame(naive_accu, lda_perf, tree_perf)
colnames(accu_all) <- c("Guess", "DA", "Tree")
accu_all = melt(accu_all)
ggplot(accu_all, aes(value, col = variable, fill = variable)) +
    geom_density(alpha = 0.6) +
    labs(title = "Predictive accuracy for different methods", x = "Accuracy") +
    scale_fill_discrete(name = "Method") +
    scale_color_discrete(name = "Method")

As we can see, the discriminant analysis model is markedly more accurate than the classification tree model and guessing. Even in the worst case, it reaches 91.18%, outperforming the other methods’ best runs. It is also worth noting that both models perform better than guessing, which has an average accuracy of 35.94%.

Do these results prove that discriminant analysis is a much better model? Not necessarily! In this study, the sample size is small (34), so cross-validation can yield skewed estimates. Additionally, the classification tree uses only 2 variables, compared with the 12 used in discriminant analysis. Furthermore, discriminant analysis rests on much stricter assumptions and is thus more likely to fail if new data violate them. Nevertheless, discriminant analysis remains the preferred model until new evidence suggests otherwise.

We will explore other possible models in part 2 of the appendix, including random forest, a more robust extension of CART.

Part 2

Besides discriminant analysis and CART, many other models exist for classification. In this section, we build and assess three of them: random forest, multinomial logistic regression and support vector machine, relying mainly on cross-validation.

Random forest model

Random forest is a more robust extension of CART. The algorithm builds many CART models, each on a bootstrap sample of the data and a random subset of the variables, with every tree fully grown (no pruning). A random forest prediction is the majority vote of its individual trees.

forest = randomForest(data.frame(dom, envdata), group, importance = TRUE, proximity = TRUE, keep.forest = TRUE)
conf_forest = table(group, predict(forest))    #predict() without newdata returns out-of-bag predictions
accu_forest = sum(diag(conf_forest)) / sum(conf_forest)
kable(conf_forest)
         Spring  Summer  Winter
Spring       10       0       0
Summer        0      15       0
Winter        0       0       9

The correct classification rate of the random forest model is 100%, even though these are out-of-bag predictions. Cross-validation is conducted next to assess predictive accuracy.
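Since importance = TRUE was set when fitting, we can also peek at which variables the forest leans on (an illustrative extra, not required by the homework):

#plot the most influential variables by mean decrease in accuracy and in Gini impurity
varImpPlot(forest, n.var = 10, main = "Random forest variable importance")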

Function for cross-validation of random forest model

Define function k_cv_forest that conducts a k-fold cross-validation on a random forest model

k_cv_forest <- function(data, group, k = 10, ...){
    cv_index = createFolds(group, k = k)
    pred = numeric()
    for(i in 1:k){
        model = randomForest(data[-cv_index[[i]], ], group[-cv_index[[i]]])
        pred = append(pred, predict(model, data[cv_index[[i]], ]))
    }
    conf = table(group[unlist(cv_index)], pred)
    accu = sum(diag(conf)) / sum(conf)
    return(accu)
}

Assess predictive accuracy of random forest model

A 10-fold cross-validation is conducted 100 times to assess the performance of the random forest model

forest_perf = numeric()
for(j in 1:100){
    forest_perf[j] = k_cv_forest(data.frame(dom, envdata), group)
}
forest_perf_sum = summary(forest_perf)
kable(t(cbind(forest_perf_sum)), row.names = F, caption = "Predictive accuracy of random forest model")
Predictive accuracy of random forest model
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1

Although it is hard to believe, the model achieves 100% accuracy every time! If no calculation or coding error exists (please inform me if you find one!), the random forest model predicts the groups perfectly, at least for the data currently on hand.
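As one more guard against a fold-construction artifact, a manual leave-one-out check can be run; this is a sketch of my own, not part of the original analysis:

#manual leave-one-out check: hold out one station at a time and predict it
loo_pred = character(nrow(dom))
for(i in 1:nrow(dom)){
    fit = randomForest(data.frame(dom, envdata)[-i, ], group[-i])
    loo_pred[i] = as.character(predict(fit, data.frame(dom, envdata)[i, ]))
}
mean(loo_pred == as.character(group))    #proportion of stations classified correctly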

Multinomial logistic regression model

Similar to discriminant analysis, multinomial logistic regression assumes relatively low collinearity. As a result, the same variables used in DA (those with low correlation and a significant group effect) are used to build this regression model.

multi_logit = multinom(group ~ ., data = data.frame(dom_lda), trace = F)
conf_logit = table(group, predict(multi_logit))
accu_logit = sum(diag(conf_logit)) / sum(conf_logit)
kable(conf_logit)
         Spring  Summer  Winter
Spring       10       0       0
Summer        0      15       0
Winter        0       0       9

The correct classification rate of the multinomial logistic regression model is 100%. Cross-validation is conducted next to assess predictive accuracy.

Function for cross-validation of the multinomial logistic regression model

Define function k_cv_multilogit that conducts a k-fold cross-validation on a multinomial logistic regression model

k_cv_multilogit <- function(data, group, k = 10, ...){
    cv_index = createFolds(group, k = k)
    pred = numeric()
    for(i in 1:k){
        model = multinom(group[-cv_index[[i]]] ~ ., data[-cv_index[[i]], ], trace = F,  ...)
        pred = append(pred, predict(model, data[cv_index[[i]], ]))
    }
    conf = table(group[unlist(cv_index)], pred)
    accu = sum(diag(conf)) / sum(conf)
    return(accu)
}

Assess predictive accuracy of the multinomial logistic regression model

10-fold cross-validation is conducted 100 times to assess the performance of the multinomial logistic regression model

multilogit_perf = numeric()
for(j in 1:100){
    multilogit_perf[j] = k_cv_multilogit(data.frame(dom_lda), group)
}
multilogit_perf_sum = summary(multilogit_perf)
kable(t(cbind(multilogit_perf_sum)), row.names = F, caption = "Predictive accuracy of multinomial logistic regression model")
Predictive accuracy of multinomial logistic regression model
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.8824 0.9118 0.9412 0.9332 0.9412 1

Support vector machine model

svm_model = svm(group ~ ., data = data.frame(dom, envdata), type = "C-classification")
conf_svm = table(group, predict(svm_model))
accu_svm = sum(diag(conf_svm)) / sum(conf_svm)
kable(conf_svm)
         Spring  Summer  Winter
Spring       10       0       0
Summer        0      15       0
Winter        0       0       9

The correct classification rate of the support vector machine model is 100%. Cross-validation is conducted next to assess predictive accuracy.

Function for cross-validation of SVM model

Define function k_cv_svm that conducts a k-fold cross-validation on a SVM model

k_cv_svm <- function(data, group, k = 10, ...){
    cv_index = createFolds(group, k = k)
    pred = numeric()
    for(i in 1:k){
        model = svm(group[-cv_index[[i]]] ~., data[-cv_index[[i]], ], ...)
        pred = append(pred, predict(model, data[cv_index[[i]], ]))
    }
    conf = table(group[unlist(cv_index)], pred)
    accu = sum(diag(conf)) / sum(conf)
    return(accu)
}

Assess predictive accuracy of SVM model

10-fold cross-validation is conducted 100 times to assess the performance of the SVM model

svm_perf = numeric()
for(j in 1:100){
    svm_perf[j] = k_cv_svm(data.frame(dom, envdata), group, type = "C-classification", kernel = "linear")
}
svm_perf_sum = summary(svm_perf)
kable(t(cbind(svm_perf_sum)), row.names = F, caption = "Predictive accuracy of support vector machine model")
Predictive accuracy of support vector machine model
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9412 0.9706 1 0.9853 1 1

Compare predictive accuracy for all methods mentioned in this homework

#show and compare the summary of accuracy for different methods
cv_accu = t(cbind(summary(naive_accu), lda_perf_sum, tree_perf_sum, forest_perf_sum, multilogit_perf_sum, svm_perf_sum))
rownames(cv_accu) = c("Guess", "Discriminant analysis", "Classification tree", "Random forest", "Multinomial logistic regression", "Support vector machine")
kable(cv_accu)
                                   Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
Guess                            0.1471   0.2941  0.3529  0.3594   0.4118  0.5294
Discriminant analysis            0.9118   0.9706  0.9706  0.9650   0.9706  1.0000
Classification tree              0.6471   0.7059  0.7059  0.7091   0.7353  0.7353
Random forest                    1.0000   1.0000  1.0000  1.0000   1.0000  1.0000
Multinomial logistic regression  0.8824   0.9118  0.9412  0.9332   0.9412  1.0000
Support vector machine           0.9412   0.9706  1.0000  0.9853   1.0000  1.0000

Based on predictive accuracy, the random forest model is clearly the best, making 100% accurate predictions every time. Discriminant analysis, support vector machine and multinomial logistic regression perform similarly well. The classification tree performs worst, likely due to overfitting, which random forest avoids by averaging many trees. Finally, all methods beat guessing.