knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
##Kevin Kuipers (Completed by myself) ##04/16/2019
#1. Question 10.7.8 pg 416
In the chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let \(r_{ij}\) denote the correlation between the ith and jth observations, then the quantity \(1 - r_{ij}\) is proportional to the squared Euclidean distance between the ith and jth observations. On the USArrests data, show that this proportionality holds.
Hint: The Euclidean distance can be calculated using the dist() function, and correlations can be calculated using the cor() function.
library(ISLR)
data("USArrests")
set.seed(702)
usa.scale <- scale(USArrests)
distance <- dist(usa.scale)        # Euclidean distances between observations
distance2 <- distance * distance   # squared Euclidean distances
r_ij <- cor(t(usa.scale))          # correlations between observations (rows)
minus.r_ij <- as.dist(1 - r_ij)    # correlation-based dissimilarity
op <- summary(minus.r_ij / distance2)
op
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000086 0.069135 0.133943 0.234193 0.262589 4.887686
plot(minus.r_ij/distance2)
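The ratio is not constant here because scale() standardizes the columns (variables) rather than the rows (observations). The exercise statement assumes each observation is standardized: if every row has mean zero and standard deviation one, then \(d_{ij}^2 = 2(p-1)(1 - r_{ij})\), so \((1 - r_{ij})/d_{ij}^2\) collapses to the constant \(1/(2(p-1)) = 1/6\) for the \(p = 4\) variables here. A minimal sketch of that exact case:
row.scale <- t(scale(t(as.matrix(USArrests))))  # standardize each observation (row)
d2 <- dist(row.scale)^2                         # squared Euclidean distances
r <- as.dist(cor(t(row.scale)))                 # correlations between observations
summary((1 - r) / d2)                           # constant 1/6 for p = 4 variables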
#2. Question 10.7.9 pg 416
Consider the USArrests data. We will now perform hierarchical clustering on the states.
First I will load the USArrests data set found in the ISLR library
library(ISLR)
data("USArrests")
##Question 9 a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
I will use the hclust() command with the complete linkage method for this clustering. After that I will plot the dendrogram twice, first using plot() and then using ggplot; for the ggplot version I will use ggdendrogram() from the ggdendro library.
set.seed(702)
hclust.mod <- hclust(dist(USArrests), method='complete')
plot(hclust.mod, main='Hierarchical clustering with complete linkage and Euclidean distance')
library(ggdendro)
library(tidyverse)
ggdendrogram(hclust.mod, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical clustering with complete linkage and Euclidean distance')
##Question 9 b)
Now I will cut the dendrogram at a height that results in three distinct clusters. From there I can determine which states belong to each cluster. After the cut I will produce a table showing how many states fall within each cluster, followed by the output showing each state's cluster assignment.
set.seed(702)
cut3 <- cutree(hclust.mod, 3)
table(cut3)
## cut3
## 1 2 3
## 16 14 20
cut3
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
plot(cut3)
It appears that roughly a third of the states fall within each cluster; the third cluster is the largest, with 20 states.
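For a more readable listing, the cluster memberships can also be grouped with split(); this is a small sketch using the cut3 object above:
split(names(cut3), cut3)  # list the states belonging to each of the three clusters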
##Question 9 c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
First I will scale the data, then I will use the hclust() command with the complete method. Then I will plot the dendrogram using both the plot() command and ggplot via ggdendrogram().
hclust.scale <- scale(USArrests)
set.seed(702)
hclust.scale.mod <- hclust(dist(hclust.scale),method='complete')
plot(hclust.scale.mod, main='Hierarchical clustering after scaling variables to SD 1')
ggdendrogram(hclust.scale.mod, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical clustering after scaling variables to SD 1')
##Question 9 d)
What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
It is somewhat difficult to tell where to cut this dendrogram to get three clusters, so I will again use cutree() with k = 3. Then I will output the results in a table containing the number of states in each cluster, followed by each state's cluster assignment.
set.seed(702)
cut.hclust.scale.mod <- cutree(hclust.scale.mod,3)
table(cut.hclust.scale.mod)
## cut.hclust.scale.mod
## 1 2 3
## 8 11 31
cut.hclust.scale.mod
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
The number of states in each cluster differs from the first clustering. Now I will cross-tabulate the two assignments, confusion-matrix style, to see how they compare.
Non.SD <- cut3
SD.1 <- cut.hclust.scale.mod
tab <- table(Non.SD, SD.1)
tab
## SD.1
## Non.SD 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20
same <- (tab[1,1] + tab[2,2] + tab[3,3]) / sum(tab)
cat('It appears that ', same*100,'% of the observations are assigned to the same clusters')
## It appears that 56 % of the observations are assigned to the same clusters
It appears that scaling the variables does change the clusters, although the assignments remain broadly similar (56% agreement). I think the variables should be scaled before the inter-observation dissimilarities are computed because the units of measure are very different: Murder, Assault, and Rape are counts per 100,000 residents while UrbanPop is a percentage, and without scaling the high-variance Assault variable would dominate the Euclidean distance.
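A quick look at the variances on the original scale supports this; this small sketch shows Assault dwarfing the other variables:
apply(USArrests, 2, var)  # Assault's variance is far larger than the others'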
#3. Question 10.7.11 pg 417
On the book website, www.StatLearning.com, there is a gene expression data set (Ch10Ex11.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.
##Question 11 a)
I will load the data set using the data.table library, reading it directly from 'https://faculty.marshall.usc.edu/gareth-james/ISL/Ch10Ex11.csv'.
I will then output the first line of observations to make sure it looks correct.
#install.packages('data.table')
library(data.table)
gene <- fread('https://faculty.marshall.usc.edu/gareth-james/ISL/Ch10Ex11.csv')
head(gene,1)
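As a further sanity check, this small sketch confirms the expected dimensions of 1,000 genes by 40 tissue samples:
dim(gene)  # expect 1000 rows (genes) and 40 columns (samples)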
##Question 11 b) Apply hierarchical clustering to the samples using correlation based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?
##Complete linkage results.
set.seed(702)
hc.complete <- hclust(as.dist(1 - cor(gene)), method='complete')
#Plotting the cluster using the plot() command
plot(hc.complete, main='Hierarchical cluster of Gene Expression Data set (Complete Method)')
##Plotting the cluster using ggdendrogram for ggplot
ggdendrogram(hc.complete, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical cluster of Gene Expression (Complete Method)')
The base plot() command produced a much cleaner, easier-to-read plot for interpreting the results.
##Average method results
set.seed(702)
hc.avg <- hclust(as.dist(1 - cor(gene)), method='average')
#Plotting the cluster using the plot() command
plot(hc.avg, main='Hierarchical cluster of Gene Expression Data set (Average Method)')
##Plotting the cluster using ggdendrogram for ggplot
ggdendrogram(hc.avg, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical cluster of Gene Expression (Average Method)')
##Single method results
set.seed(702)
hc.single <- hclust(as.dist(1 - cor(gene)), method='single')
#Plotting the cluster using the plot() command
plot(hc.single, main='Hierarchical cluster of Gene Expression Data set (Single Method)')
##Plotting the cluster using ggdendrogram for ggplot
ggdendrogram(hc.single, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical cluster of Gene Expression (Single Method)')
##Centroid Method
set.seed(702)
hc.cent <- hclust(as.dist(1 - cor(gene)), method='centroid')
#Plotting the cluster using the plot() command
plot(hc.cent, main='Hierarchical cluster of Gene Expression Data set (Centroid Method)')
##Plotting the cluster using ggdendrogram for ggplot
ggdendrogram(hc.cent, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical cluster of Gene Expression (Centroid Method)')
The plot() command produces easier-to-read dendrograms than ggdendrogram(). I am going to replot these four linkage methods in a 2x2 grid to compare them easily.
par(mfrow=c(2,2))
plot(hc.cent, main='Centroid Method', lwd=1.5, lty=1,hang=-4)
plot(hc.single, main='Single Method', lwd=1.5, lty=1)
plot(hc.avg, main='Average Method', lwd=1.5, lty=1)
plot(hc.complete, main='Complete Method', lwd=1.5, lty=1)
The linkage methods all produce different clusterings. However, for the most part they suggest two main clusters; the average linkage method suggests three.
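To check whether the samples separate into the two known groups, the tree can be cut into two clusters and cross-tabulated against the true labels; this sketch uses the fact that samples 1-20 are healthy and 21-40 are diseased:
true.groups <- rep(c('Healthy', 'Diseased'), each = 20)  # known group labels
table(cutree(hc.complete, 2), true.groups)               # 2-cluster cut vs. truth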
##Question 11 c) Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.
I believe I can use PCA to determine which genes differ the most between the two groups, diseased and non-diseased. This will be done using the prcomp() command. Since prcomp() centers the variables to have a mean of 0, I will use the absolute values of the loadings to rank the weight of each gene. When applying prcomp() I will also use scale=TRUE so the variables have a standard deviation of 1.
set.seed(702)
pr.gene <- prcomp(t(gene), scale=T)
plot(pr.gene)
summary(pr.gene)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 9.00460 5.87302 5.74347 5.61806 5.55344 5.50107
## Proportion of Variance 0.08108 0.03449 0.03299 0.03156 0.03084 0.03026
## Cumulative Proportion 0.08108 0.11558 0.14856 0.18013 0.21097 0.24123
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 5.40069 5.38575 5.3762 5.34146 5.31878 5.25016
## Proportion of Variance 0.02917 0.02901 0.0289 0.02853 0.02829 0.02756
## Cumulative Proportion 0.27040 0.29940 0.3283 0.35684 0.38513 0.41269
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 5.18737 5.1667 5.10384 5.04667 5.03288 4.98926
## Proportion of Variance 0.02691 0.0267 0.02605 0.02547 0.02533 0.02489
## Cumulative Proportion 0.43960 0.4663 0.49234 0.51781 0.54314 0.56803
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 4.92635 4.90996 4.88803 4.85159 4.79974 4.78202
## Proportion of Variance 0.02427 0.02411 0.02389 0.02354 0.02304 0.02287
## Cumulative Proportion 0.59230 0.61641 0.64030 0.66384 0.68688 0.70975
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 4.70171 4.66105 4.64595 4.59194 4.53246 4.47381
## Proportion of Variance 0.02211 0.02173 0.02158 0.02109 0.02054 0.02001
## Cumulative Proportion 0.73185 0.75358 0.77516 0.79625 0.81679 0.83681
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 4.4389 4.41670 4.39404 4.3591 4.23504 4.2184
## Proportion of Variance 0.0197 0.01951 0.01931 0.0190 0.01794 0.0178
## Cumulative Proportion 0.8565 0.87602 0.89533 0.9143 0.93226 0.9501
## PC37 PC38 PC39 PC40
## Standard deviation 4.12936 4.0738 4.03658 5.481e-15
## Proportion of Variance 0.01705 0.0166 0.01629 0.000e+00
## Cumulative Proportion 0.96711 0.9837 1.00000 1.000e+00
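The proportion of variance explained can also be plotted directly; a small sketch following the usual PVE convention:
pve <- 100 * pr.gene$sdev^2 / sum(pr.gene$sdev^2)                # % variance per PC
plot(pve, type='b', xlab='Principal Component', ylab='PVE (%)')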
##Top 15 most different genes across diseased vs non-diseased patients.
set.seed(702)
gl <- apply(pr.gene$rotation, 1, sum)    # total loading of each gene across all PCs
gl.dif <- order(abs(gl), decreasing=T)   # rank genes by absolute total loading
top15 <- gl.dif[1:15]
top15
## [1] 676 889 755 960 907 475 374 716 673 174 878 840 567 327 281
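As a cross-check on the PCA-based ranking, a simpler alternative (a sketch, using the known split of samples 1-20 healthy and 21-40 diseased) is to rank genes by the absolute difference in group means:
gene.mat <- as.matrix(gene)
mean.diff <- rowMeans(gene.mat[, 21:40]) - rowMeans(gene.mat[, 1:20])  # diseased minus healthy
head(order(abs(mean.diff), decreasing = TRUE), 15)                     # top 15 genes by mean difference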