knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
##Kevin Kuipers (Completed by myself) ##04/16/2019
#1. Question 10.7.8 pg 416
In the chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let \(r_{ij}\) denote the correlation between the ith and jth observations, then the quantity \(1 - r_{ij}\) is proportional to the squared Euclidean distance between the ith and jth observations. On the USArrests data, show that this proportionality holds.
Hint: The Euclidean distance can be calculated using the dist() function, and correlations can be calculated using the cor() function.
library(ISLR)
data("USArrests")
set.seed(702)
usa.scale <- scale(USArrests)
distance <- dist(usa.scale)        # Euclidean distances between observations
distance2 <- distance * distance   # squared Euclidean distances
r_ij <- cor(t(usa.scale))          # correlations between observations (rows)
minus.r_ij <- as.dist(1 - r_ij)    # correlation-based dissimilarity
op <- summary(minus.r_ij / distance2)
op
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000086 0.069135 0.133943 0.234193 0.262589 4.887686
plot(minus.r_ij/distance2)
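The ratio is not constant here because scale() standardizes the columns (variables) rather than the rows (observations). The exercise statement assumes each observation is standardized: if every row has mean zero and standard deviation one, then \(d_{ij}^2 = 2(p-1)(1 - r_{ij})\), so \((1 - r_{ij})/d_{ij}^2\) collapses to the constant \(1/(2(p-1)) = 1/6\) for the \(p = 4\) variables here. A minimal sketch of that exact case:
row.scale <- t(scale(t(as.matrix(USArrests))))  # standardize each observation (row)
d2 <- dist(row.scale)^2                         # squared Euclidean distances
r <- as.dist(cor(t(row.scale)))                 # correlations between observations
summary((1 - r) / d2)                           # constant 1/6 for p = 4 variables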
#2. Question 10.7.9 pg 416
Consider the USArrests data. We will now perform hierarchical clustering on the states.
First I will load the USArrests data set found in the ISLR library
library(ISLR)
data("USArrests")
##Question 9 a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
I will use the hclust() command with the complete linkage method for this clustering. After that I will plot the dendrogram twice, first using plot() and then using ggplot; for the ggplot version I will use ggdendrogram() from the ggdendro library.
set.seed(702)
hclust.mod <- hclust(dist(USArrests), method='complete')
plot(hclust.mod, main='Hierarchical clustering with complete linkage and Euclidean distance')
library(ggdendro)
library(tidyverse)
ggdendrogram(hclust.mod, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical clustering with complete linkage and Euclidean distance')
##Question 9 b)
Now I will cut the dendrogram at a height that results in three distinct clusters. From there I can determine which states belong to each cluster. After the cut I will produce a table showing how many states fall within each cluster, followed by the output showing each state's cluster assignment.
set.seed(702)
cut3 <- cutree(hclust.mod, 3)
table(cut3)
## cut3
## 1 2 3
## 16 14 20
cut3
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
plot(cut3)
It appears that roughly a third of the states fall within each cluster; the third cluster is the largest, with 20 states.
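For a more readable listing, the cluster memberships can also be grouped with split(); this is a small sketch using the cut3 object above:
split(names(cut3), cut3)  # list the states belonging to each of the three clusters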
##Question 9 c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
First I will scale the data, then I will use the hclust() command with the complete method. Then I will plot the dendrogram using both the plot() command and ggplot via ggdendrogram().
hclust.scale <- scale(USArrests)
set.seed(702)
hclust.scale.mod <- hclust(dist(hclust.scale),method='complete')
plot(hclust.scale.mod, main='Hierarchical clustering after scaling variables to SD 1')
ggdendrogram(hclust.scale.mod, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical clustering after scaling variables to SD 1')
##Question 9 d)
What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
It is somewhat difficult to tell where to cut this dendrogram to get three clusters, so I will again use cutree() with k = 3. Then I will output the results in a table containing the number of states in each cluster, followed by each state's cluster assignment.
set.seed(702)
cut.hclust.scale.mod <- cutree(hclust.scale.mod,3)
table(cut.hclust.scale.mod)
## cut.hclust.scale.mod
## 1 2 3
## 8 11 31
cut.hclust.scale.mod
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
The number of states in each cluster differs from the first clustering. Now I will cross-tabulate the two assignments, confusion-matrix style, to see how they compare.
Non.SD <- cut3
SD.1 <- cut.hclust.scale.mod
tab <- table(Non.SD, SD.1)
tab
## SD.1
## Non.SD 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20
same <- (tab[1,1] + tab[2,2] + tab[3,3]) / sum(tab)
cat('It appears that ', same*100,'% of the observations are assigned to the same clusters')
## It appears that 56 % of the observations are assigned to the same clusters
It appears that scaling the variables does change the clusters, although the assignments remain broadly similar (56% agreement). I think the variables should be scaled before the inter-observation dissimilarities are computed because the units of measure are very different: Murder, Assault, and Rape are counts per 100,000 residents while UrbanPop is a percentage, and without scaling the high-variance Assault variable would dominate the Euclidean distance.
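A quick look at the variances on the original scale supports this; this small sketch shows Assault dwarfing the other variables:
apply(USArrests, 2, var)  # Assault's variance is far larger than the others'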
#3. Question 10.7.11 pg 417
On the book website, www.StatLearning.com, there is a gene expression data set (Ch10Ex11.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.
##Question 11 a)
I will load the data set using the data.table library, reading it directly from 'https://faculty.marshall.usc.edu/gareth-james/ISL/Ch10Ex11.csv'.
I will then output the first line of observations to make sure it looks correct.
#install.packages('data.table')
library(data.table)
gene <- fread('https://faculty.marshall.usc.edu/gareth-james/ISL/Ch10Ex11.csv')
head(gene,1)
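As a further sanity check, this small sketch confirms the expected dimensions of 1,000 genes by 40 tissue samples:
dim(gene)  # expect 1000 rows (genes) and 40 columns (samples)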
##Question 11 b) Apply hierarchical clustering to the samples using correlation based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?
##Complete linkage results.
set.seed(702)
hc.complete <- hclust(as.dist(1 - cor(gene)), method='complete')
#Plotting the cluster using the plot() command
plot(hc.complete, main='Hierarchical cluster of Gene Expression Data set (Complete Method)')
##Plotting the cluster using ggdendrogram for ggplot
ggdendrogram(hc.complete, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical cluster of Gene Expression (Complete Method)')
The base plot() command produced a much cleaner, easier-to-read plot for interpreting the results.
##Average method results
set.seed(702)
hc.avg <- hclust(as.dist(1 - cor(gene)), method='average')
#Plotting the cluster using the plot() command
plot(hc.avg, main='Hierarchical cluster of Gene Expression Data set (Average Method)')
##Plotting the cluster using ggdendrogram for ggplot
ggdendrogram(hc.avg, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical cluster of Gene Expression (Average Method)')
##Single method results
set.seed(702)
hc.single <- hclust(as.dist(1 - cor(gene)), method='single')
#Plotting the cluster using the plot() command
plot(hc.single, main='Hierarchical cluster of Gene Expression Data set (Single Method)')
##Plotting the cluster using ggdendrogram for ggplot
ggdendrogram(hc.single, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical cluster of Gene Expression (Single Method)')
##Centroid Method
set.seed(702)
hc.cent <- hclust(as.dist(1 - cor(gene)), method='centroid')
#Plotting the cluster using the plot() command
plot(hc.cent, main='Hierarchical cluster of Gene Expression Data set (Centroid Method)')
##Plotting the cluster using ggdendrogram for ggplot
ggdendrogram(hc.cent, segments=TRUE, labels=TRUE, leaf_labels = TRUE, rotate=FALSE, theme_dendro = TRUE) +
labs(title='Hierarchical cluster of Gene Expression (Centroid Method)')
The plot() command produces easier-to-read dendrograms than ggdendrogram(). I am going to replot these four linkage methods in a 2x2 grid to compare them easily.
par(mfrow=c(2,2))
plot(hc.cent, main='Centroid Method', lwd=1.5, lty=1,hang=-4)
plot(hc.single, main='Single Method', lwd=1.5, lty=1)
plot(hc.avg, main='Average Method', lwd=1.5, lty=1)
plot(hc.complete, main='Complete Method', lwd=1.5, lty=1)
The linkage methods all produce different clusterings. However, for the most part they suggest two main clusters; the average linkage method suggests three.
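To check whether the samples separate into the two known groups, the tree can be cut into two clusters and cross-tabulated against the true labels; this sketch uses the fact that samples 1-20 are healthy and 21-40 are diseased:
true.groups <- rep(c('Healthy', 'Diseased'), each = 20)  # known group labels
table(cutree(hc.complete, 2), true.groups)               # 2-cluster cut vs. truth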
##Question 11 c) Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.
I believe I can use PCA to determine which genes differ the most between the two groups, diseased and non-diseased. This will be done using the prcomp() command. Since prcomp() centers the variables to have a mean of 0, I will use the absolute values of the loadings to rank the weight of each gene. When applying prcomp() I will also use scale=TRUE so the variables have a standard deviation of 1.
set.seed(702)
pr.gene <- prcomp(t(gene), scale=T)
plot(pr.gene)
summary(pr.gene)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 9.00460 5.87302 5.74347 5.61806 5.55344 5.50107
## Proportion of Variance 0.08108 0.03449 0.03299 0.03156 0.03084 0.03026
## Cumulative Proportion 0.08108 0.11558 0.14856 0.18013 0.21097 0.24123
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 5.40069 5.38575 5.3762 5.34146 5.31878 5.25016
## Proportion of Variance 0.02917 0.02901 0.0289 0.02853 0.02829 0.02756
## Cumulative Proportion 0.27040 0.29940 0.3283 0.35684 0.38513 0.41269
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 5.18737 5.1667 5.10384 5.04667 5.03288 4.98926
## Proportion of Variance 0.02691 0.0267 0.02605 0.02547 0.02533 0.02489
## Cumulative Proportion 0.43960 0.4663 0.49234 0.51781 0.54314 0.56803
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 4.92635 4.90996 4.88803 4.85159 4.79974 4.78202
## Proportion of Variance 0.02427 0.02411 0.02389 0.02354 0.02304 0.02287
## Cumulative Proportion 0.59230 0.61641 0.64030 0.66384 0.68688 0.70975
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 4.70171 4.66105 4.64595 4.59194 4.53246 4.47381
## Proportion of Variance 0.02211 0.02173 0.02158 0.02109 0.02054 0.02001
## Cumulative Proportion 0.73185 0.75358 0.77516 0.79625 0.81679 0.83681
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 4.4389 4.41670 4.39404 4.3591 4.23504 4.2184
## Proportion of Variance 0.0197 0.01951 0.01931 0.0190 0.01794 0.0178
## Cumulative Proportion 0.8565 0.87602 0.89533 0.9143 0.93226 0.9501
## PC37 PC38 PC39 PC40
## Standard deviation 4.12936 4.0738 4.03658 5.481e-15
## Proportion of Variance 0.01705 0.0166 0.01629 0.000e+00
## Cumulative Proportion 0.96711 0.9837 1.00000 1.000e+00
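The proportion of variance explained can also be plotted directly; a small sketch following the usual PVE convention:
pve <- 100 * pr.gene$sdev^2 / sum(pr.gene$sdev^2)                # % variance per PC
plot(pve, type='b', xlab='Principal Component', ylab='PVE (%)')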
##Top 15 most different genes across diseased vs non-diseased patients.
set.seed(702)
gl <- apply(pr.gene$rotation, 1, sum)    # total loading of each gene across all PCs
gl.dif <- order(abs(gl), decreasing=T)   # rank genes by absolute total loading
top15 <- gl.dif[1:15]
top15
## [1] 676 889 755 960 907 475 374 716 673 174 878 840 567 327 281
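As a cross-check on the PCA-based ranking, a simpler alternative (a sketch, using the known split of samples 1-20 healthy and 21-40 diseased) is to rank genes by the absolute difference in group means:
gene.mat <- as.matrix(gene)
mean.diff <- rowMeans(gene.mat[, 21:40]) - rowMeans(gene.mat[, 1:20])  # diseased minus healthy
head(order(abs(mean.diff), decreasing = TRUE), 15)                     # top 15 genes by mean difference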