The PCA we did on the mammal sleep data didn’t indicate any interesting groups. Instead, let’s work with a famous dataset with known strong groups.
Read about these data here https://en.wikipedia.org/wiki/Iris_flower_data_set
You should definitely read this as further background on PCA and cluster analysis.
For more on PCA see https://rpubs.com/brouwern/veganpca
Notes on drawing dendrograms: http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning#plot.dendrogram-function
Only do this once, then comment it out of the script. You probably already did this in the previous Code Checkpoint.
# install.packages("ggplot2")
# install.packages("vegan")
library(ggplot2)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
data(iris)
Scatterplot matrix
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
When looking at a plot like this you should understand how to read it, what the correlations mean, what the red lines mean, etc.
plot(iris,upper.panel = panel.cor,
panel = panel.smooth)
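The numbers in the upper panels are absolute Pearson correlations, so it can help to print them directly. The sketch below does that for the four numeric columns only (the plot above also includes Species, converted to numbers); the name cor_mat is just for illustration.

```r
# Sketch: correlation matrix for the four numeric iris columns.
# cor() uses Pearson correlation by default.
cor_mat <- cor(datasets::iris[, -5])

# panel.cor() shows absolute values, so take abs() before rounding.
round(abs(cor_mat), 2)
```

Values near 1 mean two measurements rise and fall together almost perfectly; values near 0 mean there is little linear relationship.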
Run the PCA using rda()
rda.out <- vegan::rda(iris[,-5], scale = TRUE)
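A quick way to judge a PCA is to ask how much of the total variance each axis captures. The sketch below uses base R's prcomp() instead of rda(); with scaling turned on it should give the same proportions of variance. The names pca_check and prop_var are made up for this example.

```r
# Sketch: PCA on the four numeric iris columns using base R.
# scale. = TRUE standardizes each variable, mirroring scale = TRUE in rda().
pca_check <- prcomp(datasets::iris[, -5], scale. = TRUE)

# The variance of each principal component is its squared standard deviation.
prop_var <- pca_check$sdev^2 / sum(pca_check$sdev^2)

# Proportion of total variance captured by each axis, rounded for display.
round(prop_var, 3)
```

If the first two proportions add up to most of the variance, a 2D plot is a reasonable summary of the data.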
This displays the 2D PCA plot without the arrows. For more info on what this code does, see the RPubs document linked above.
biplot(rda.out, display = "sites")
vegan has some nice tools for grouping things.
With the mammal sleep data I didn’t expect there to be any interesting groups, and grouping by diet (“vore”) didn’t really show any. I will supply that code if needed.
PCA is an exploratory method. Here I’ll see whether the iris points group by species by drawing a hull around each species with ordihull().
biplot(rda.out, display = "sites")
vegan::ordihull(rda.out,
groups = iris$Species,
col = 1:3)
Cluster analysis in biology often involves building tree diagrams (dendrograms), which use hierarchical clustering algorithms.
Hierarchical clustering involves first calculating a distance matrix. There are MANY ways to calculate a distance matrix depending on the data and application. The easiest one to think about is Euclidean distance.
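To make Euclidean distance concrete, the sketch below calculates it by hand for the first two flowers and compares the result with dist(). The names a, b, d_manual and d_builtin are just for illustration.

```r
# Sketch: Euclidean distance between the first two flowers, two ways.
a <- unlist(datasets::iris[1, -5])   # four measurements for flower 1
b <- unlist(datasets::iris[2, -5])   # four measurements for flower 2

# By hand: square root of the summed squared differences.
d_manual <- sqrt(sum((a - b)^2))

# With dist(), which defaults to method = "euclidean".
d_builtin <- as.numeric(dist(datasets::iris[1:2, -5]))

# TRUE if the two calculations agree.
all.equal(d_manual, d_builtin)
```

A distance matrix is this same calculation repeated for every pair of rows.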
Add row names so each leaf of the tree is labeled with its species. Don’t worry about what this is doing.
row.names(iris) <- paste(iris$Species,1:nrow(iris),sep = ".")
There’s a TON of data here, so let’s randomly cut it in half. You should understand this code.
i <- 1:nrow(iris)
i <- sample(i, length(i)/2, replace = FALSE)
iris_sub <- iris[i,]
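Because sample() is random, the rows kept will change every time the script is run. If you want the same half each time, one option is to set a seed first. This is a sketch, not part of the original workflow: the seed value 42 is arbitrary, and the names iris_demo, rows and iris_half are made up so they don't overwrite the objects above.

```r
# Sketch: the same random halving, made reproducible with set.seed().
set.seed(42)                       # arbitrary seed; any integer works

iris_demo <- datasets::iris        # untouched copy of the data
rows <- sample(nrow(iris_demo), nrow(iris_demo) / 2, replace = FALSE)
iris_half <- iris_demo[rows, ]     # keep only the sampled rows

nrow(iris_half)                    # number of flowers kept
table(iris_half$Species)           # how many of each species were sampled
```

Sampling without replacement (replace = FALSE) guarantees that no flower is picked twice.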
dist_euc <- dist(iris_sub[,-5],
method = "euclidean")
We can then carry out a cluster analysis using the hclust() function. The default agglomeration (linkage) method, which determines how clusters are joined and therefore the branch lengths, is called “complete.” Other options allow you to implement UPGMA (method = "average"), WPGMA (method = "mcquitty") and other common forms.
clust_euc <- hclust(dist_euc)
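The linkage method is set with the method argument of hclust(). The sketch below clusters one small distance matrix three ways; the object names and the choice of the first 20 rows are just for illustration.

```r
# Sketch: the same distance matrix clustered with three linkage methods.
iris_demo <- datasets::iris
d <- dist(iris_demo[1:20, -5], method = "euclidean")  # small subset for speed

clust_complete <- hclust(d)                        # default: method = "complete"
clust_upgma    <- hclust(d, method = "average")    # UPGMA
clust_wpgma    <- hclust(d, method = "mcquitty")   # WPGMA

# Each hclust object records which method produced it.
c(clust_complete$method, clust_upgma$method, clust_wpgma$method)
```

Plotting each object side by side is a simple way to see how much the choice of method changes the tree.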
Plotting the dendrogram with all of the labels results in a very messy graph.
plot(clust_euc, hang = -1, cex = 0.5)
It can help in R to plot things horizontally. This requires converting to a different object type.
par(mar = c(1,1,1,1))
is(clust_euc)
## [1] "hclust"
dendro_euc <- as.dendrogram(clust_euc)
is(dendro_euc)
## [1] "dendrogram"
plot(dendro_euc, horiz = TRUE, nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 0.5))
Generate the cluster diagram and upload it to this assignment.
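If you would rather save the diagram from code than export it by hand, one possible approach is to open a file-based graphics device, draw the plot, and close the device. This is a sketch under assumptions: the file name iris_dendrogram.pdf, the page size, the margins, and the use of only the first 30 rows are all made up for illustration. png() works the same way if you prefer an image file.

```r
# Sketch: write a horizontal dendrogram to a PDF file.
iris_demo <- datasets::iris
row.names(iris_demo) <- paste(iris_demo$Species, 1:nrow(iris_demo), sep = ".")
dendro_demo <- as.dendrogram(hclust(dist(iris_demo[1:30, -5])))

# Hypothetical output location; change to wherever you want the file.
out_file <- file.path(tempdir(), "iris_dendrogram.pdf")

pdf(out_file, width = 6, height = 9)   # open the file as a graphics device
par(mar = c(2, 1, 1, 6))               # extra right margin for the labels
plot(dendro_demo, horiz = TRUE,
     nodePar = list(pch = c(1, NA), cex = 0.5, lab.cex = 0.8))
dev.off()                              # close the device so the file is written

out_file                               # path of the saved file
```

Making the page taller than it is wide gives the leaf labels more room in a horizontal tree.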