Dendrograms using the ggdendro Package

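A (very) brief tutorial on the “ggdendro” package for creating dendrograms. USArrests is a built-in R dataset giving the number of arrests per 100,000 residents for murder, assault, and rape in each of the 50 US states in 1973, along with the percentage of the population living in urban areas.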

The dist() function calculates the distances between the rows of the dataset. Remember that in an R data frame the rows are observations (data points), while the columns are variables (dependent or independent).
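
As a small illustration of what dist() returns, the sketch below computes the distances between three states and checks one of them by hand. The choice of states and the object names (m, d, manual) are only for illustration.

```r
# Pick three states so the distance matrix is small enough to read.
m <- as.matrix(USArrests[c("Alabama", "Alaska", "Arizona"), ])

# dist() returns the pairwise distances between rows (Euclidean by default).
d <- dist(m)
print(d)

# The same distance for one pair of states, computed by hand:
# square the differences for each variable, add them up, take the square root.
manual <- sqrt(sum((m["Alabama", ] - m["Alaska", ])^2))

# Compare the hand calculation with the value stored by dist().
all.equal(as.matrix(d)["Alabama", "Alaska"], manual)
```

Because the variables are on different scales, the one with the largest numbers contributes most to each distance.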

hclust() takes the dissimilarity structure produced by dist() and builds a hierarchical clustering tree (an object of class "hclust") that can be plotted using ggdendrogram().
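
To get a feel for the object that hclust() returns, it can help to look inside it before plotting. The following is a minimal sketch; cutting the tree into 4 groups is an arbitrary choice made here for illustration, not a recommendation.

```r
# Build the clustering tree from the row distances.
hc <- hclust(dist(USArrests), method = "average")

# The result is an "hclust" object.
class(hc)

# Each row of merge is one joining step; negative numbers are single
# observations, positive numbers refer to clusters formed in earlier steps.
head(hc$merge)

# The height at which each merge happens (the branch positions in the plot).
head(hc$height)

# cutree() cuts the tree into a chosen number of groups (k = 4 is arbitrary).
groups <- cutree(hc, k = 4)
table(groups)
```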

# first take a look at the structure of the data; you can use View() to look at the whole dataset or head() for just the first few rows
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
# ? in front of a function will pull up the manual for the function
?dist

# before using a function for the first time, it is always a good idea to look at the manual page
?hclust

# dist() calculates the distances between rows; several methods are available, the default is "euclidean"
hc <- hclust(dist(USArrests), method = "average")  # hierarchical clustering (average linkage)

# load ggdendro, then plot using ggdendrogram(); hc is the clustering tree produced by hclust()
library(ggdendro)
ggdendrogram(hc, rotate = TRUE, size = 2)
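
If you want more control over the plot than ggdendrogram() offers, ggdendro can also hand you the plotting data so you can draw it with ggplot2 yourself. The sketch below assumes ggplot2 and ggdendro are installed; the object names (dd, p) and the text size and expansion values are illustrative choices.

```r
library(ggplot2)
library(ggdendro)

hc <- hclust(dist(USArrests), method = "average")

# dendro_data() converts the tree into data frames of line segments and labels.
dd <- dendro_data(hc, type = "rectangle")

p <- ggplot(segment(dd)) +
  # draw the branches of the tree
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
  # add the state names at the leaves
  geom_text(data = label(dd), aes(x = x, y = y, label = label),
            hjust = 1, size = 2) +
  # rotate the tree and leave some room for the labels
  coord_flip() +
  scale_y_continuous(expand = c(0.2, 0)) +
  theme_dendro()

print(p)
```

Working from segment(dd) and label(dd) means any ggplot2 layer or theme can be added in the usual way.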

Questions for Friday discussion

How is this plot useful, and what can it tell us?

What are some drawbacks of the current plot (think about the number of variables in the dataset)?

The row distances here are calculated from numeric values, but you are comparing DNA sequence similarities. What needs to be done to transform the multiple sequence alignments into a dataset that can be used with ggdendro?