Integrantes Carné
Keiry Margarita Umanzor Portillo UP17003
Patricia Lilibeth Portillo Hernández PH17001
Iris Leonor Márquez Arévalo MA15003

Análisis de Clúster

Explique en qué consiste el análisis de conglomerados

La agrupación es un método para encontrar subgrupos de observaciones dentro de un conjunto de datos.

Cuando estamos agrupando, necesitamos observaciones en el mismo grupo con patrones similares y observaciones en diferentes grupos para que sean diferentes.

Si no hay una variable de respuesta, entonces es adecuado para un método no supervisado, lo que implica que busca encontrar relaciones entre las n observaciones sin ser entrenado por una variable de respuesta.

La agrupación en clústeres nos permite identificar grupos homogéneos y categorizarlos a partir del conjunto de datos.

Uno de los agrupamientos más simples es K-means, el método de agrupamiento más utilizado para dividir un conjunto de datos en un conjunto de n grupos.

Si los conjuntos de datos no contienen una variable de respuesta y tienen muchas variables, entonces se aplica a un enfoque no supervisado.

El análisis de clústeres es un enfoque no supervisado y se utiliza para segmentar los mercados en grupos de clientes o patrones similares.

Fuente: https://www.r-bloggers.com/2021/04/cluster-analysis-in-r/

Cuadro comparativo

Análisis de Clúster

Técnicas Disponibles

Ventajas

Desventajas

Jerárquico: 

   Es un enfoque alternativo para particionar el agrupamiento en función de su similitud.

 

  • Agrupación aglomerativa:
  • Dividir agrupamiento
  • Es basada en una representación en árboles de un dato
  • Los objetos en el dendrograma están vinculados entre sí en función de su similitud.
  • Se hace  posible cortar el árbol a una altura determinada para dividir los datos en varios grupos.
  • Tiende a ser incompleto los árboles o agrupamientos.

No Jerárquico:

   El método CLARA (Clustering Large Applications) combina la idea de K-medoides con el remuestreo y así poder trabajar con grandes volúmenes de datos de manera más eficiente.

  • Muestreo y Simulación

  •  Pueden tabajar con grandes volúmenes de datos de manera más eficiente.
  • Las bases de datos deben de ser grandes, para la implementación de este método.

Técnicas disponibles para realizar el análisis de Clúster

JERÁRQUICO

Agrupación aglomerativa

K-medias (MacQueen, 1967) Es en el que, cada clúster está representado por el centro o por medio de los puntos de datos pertenecientes al clúster.Este método es sensible a puntos de datos anómalos y valores atípicos.

División por agrupamiento

K-medoids clustering or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1990)en el cual, cada clúster está representado por uno de los objetos en el cluster. PAM es menos sensible a valores atípicos en comparación con k-medias.

NO JERÁRQUICO

CLARA (Clustering Large Applications) que es una extensión de PAM adaptado para grandes conjuntos de datos.

PAQUETERIA

FactoExtra El paquete factoextra en R puede manejar los resultados de PCA, CA, MCA, MFA, FAMD y HMFA de varios paquetes, para extraer y visualizar la información más importante contenida en sus datos.

  Cluster Se utiliza para calcular algoritmos de clúster, como PAM (Partitioning Around Medoids).

  Dendextend Es un paquete R para crear y comparar diagramas de árbol visualmente atractivos. dendextend proporciona funciones de utilidad para manipular objetos dendrograma (su color, forma y contenido), así como varios métodos avanzados para comparar árboles entre sí (tanto estadísticamente como visualmente).

Desarrollo de ejemplos del texto Ksassambara, A. (2017)

Capítulo 4: Clustering por particiones

Clustering por K-medias

library(factoextra)
# Cargando el data set
data("USArrests")
# Normalizando la data
df <- scale(USArrests)
# Visualizando las primeras 3 filas
head(df, n = 3)
##             Murder   Assault   UrbanPop         Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska  0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona 0.07163341 1.4788032  0.9989801  1.042878388
# Estimando el número óptimo de Clusters
fviz_nbclust(df, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)

# Calculando las K-medias con k=4
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
print(km.res)
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4  0.6950701  1.0394414  0.7226370  1.27693964
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              3              3              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              2              4              3              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              2              1              2              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              4              2              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              4              2              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              2              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              4              3              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              2              2              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 11.952463 16.212213 19.922437
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# También se puede calcular la media de cada variable por Clustering usando la data original
aggregate(USArrests, by=list(cluster=km.res$cluster), mean)
##   cluster   Murder   Assault UrbanPop     Rape
## 1       1 13.93750 243.62500 53.75000 21.41250
## 2       2  3.60000  78.53846 52.07692 12.17692
## 3       3  5.65625 138.87500 73.87500 18.78125
## 4       4 10.81538 257.38462 76.00000 33.19231
# Si se desea agregar el punto de clasificacines a la data original se debe usar:
dd <- cbind(USArrests, cluster = km.res$cluster)
head(dd)
##            Murder Assault UrbanPop Rape cluster
## Alabama      13.2     236       58 21.2       1
## Alaska       10.0     263       48 44.5       4
## Arizona       8.1     294       80 31.0       4
## Arkansas      8.8     190       50 19.5       1
## California    9.0     276       91 40.6       4
## Colorado      7.9     204       78 38.7       4
# Función kmeans()
# Número de Cluster para cada observación
km.res$cluster
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              3              3              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              2              4              3              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              2              1              2              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              4              2              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              4              2              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              2              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              4              3              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              2              2              3
head(km.res$cluster, 4)
##  Alabama   Alaska  Arizona Arkansas 
##        1        4        4        1
# Tamaño de Cluster
km.res$size
## [1]  8 13 16 13
# Medias de Cluster
km.res$centers
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4  0.6950701  1.0394414  0.7226370  1.27693964
# Visualizando las k-medias
fviz_cluster(km.res, data = df,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
ellipse.type = "euclid", # Concentration ellipse
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal()
)

Capítulo 5: K-Medoids

Clustering por K-medias

library(cluster)
library(factoextra)
# Calculando PAM 
data("USArrests") # Load the data set
df <- scale(USArrests) # Scale the data
head(df, n = 3) # View the firt 3 rows of the data
##             Murder   Assault   UrbanPop         Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska  0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona 0.07163341 1.4788032  0.9989801  1.042878388
#  Estimating the optimal number of clusters
fviz_nbclust(df, pam, method = "silhouette")+
theme_classic()

# Calculando PAM clustering con k=2
pam.res <- pam(df, 2)
print(pam.res)
## Medoids:
##            ID     Murder    Assault   UrbanPop       Rape
## New Mexico 31  0.8292944  1.3708088  0.3081225  1.1603196
## Nebraska   27 -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              2              2              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              2              1              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              2              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              2              2              2 
## Objective function:
##    build     swap 
## 1.441358 1.368969 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
# If you want to add the point classifications to the original data, use this:
dd <- cbind(USArrests, cluster = pam.res$cluster)
head(dd, n = 3)
##         Murder Assault UrbanPop Rape cluster
## Alabama   13.2     236       58 21.2       1
## Alaska    10.0     263       48 44.5       1
## Arizona    8.1     294       80 31.0       1
# Accessing to the results of the pam() function
# Cluster medoids: New Mexico, Nebraska
pam.res$medoids
##                Murder    Assault   UrbanPop       Rape
## New Mexico  0.8292944  1.3708088  0.3081225  1.1603196
## Nebraska   -0.8008247 -0.8250772 -0.2445636 -0.5052109
# Cluster numbers
head(pam.res$clustering)
##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##          1          1          1          2          1          1
# Visualizing PAM clusters
fviz_cluster(pam.res,
palette = c("#00AFBB", "#FC4E07"), # color palette
ellipse.type = "t", # Concentration ellipse
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic()
)

Capítulo 6: CLARA - Clustering Large Applications

Calculando CLARA

library(cluster)
library(factoextra)
set.seed(1234)
# Generate 500 objects, divided into 2 clusters.
df <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
cbind(rnorm(300,50,8), rnorm(300,50,8)))
# Specify column and row names
colnames(df) <- c("x", "y")
rownames(df) <- paste0("S", 1:nrow(df))
# Previewing the data
head(df, nrow = 6)
##             x        y
## S1  -9.656526 3.881815
## S2   2.219434 5.574150
## S3   8.675529 1.484111
## S4 -18.765582 5.605868
## S5   3.432998 2.493448
## S6   4.048447 6.083699
# Estimating the optimal number of clusters
fviz_nbclust(df, clara, method = "silhouette")+
theme_classic()

# Compute CLARA
clara.res <- clara(df, 2, samples = 50, pamLike = TRUE)
# Print components of clara.res
print(clara.res)
## Call:     clara(x = df, k = 2, samples = 50, pamLike = TRUE) 
## Medoids:
##              x         y
## S121 -1.531137  1.145057
## S455 48.357304 50.233499
## Objective function:   9.87862
## Clustering vector:    Named int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "names")= chr [1:500] "S1" "S2" "S3" "S4" "S5" "S6" "S7" ...
## Cluster sizes:            200 300 
## Best sample:
##  [1] S37  S49  S54  S63  S68  S71  S76  S80  S82  S101 S103 S108 S109 S118 S121
## [16] S128 S132 S138 S144 S162 S203 S210 S216 S231 S234 S249 S260 S261 S286 S299
## [31] S304 S305 S312 S315 S322 S350 S403 S450 S454 S455 S456 S465 S488 S497
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"
# If you want to add the point classifications to the original data, use this:
dd <- cbind(df, cluster = clara.res$cluster)
head(dd, n = 4)
##             x        y cluster
## S1  -9.656526 3.881815       1
## S2   2.219434 5.574150       1
## S3   8.675529 1.484111       1
## S4 -18.765582 5.605868       1
# You can access to the results returned by clara() as follow:
# Medoids
clara.res$medoids
##              x         y
## S121 -1.531137  1.145057
## S455 48.357304 50.233499
# Clustering
head(clara.res$clustering, 10)
##  S1  S2  S3  S4  S5  S6  S7  S8  S9 S10 
##   1   1   1   1   1   1   1   1   1   1
# Visualizing CLARA clusters
fviz_cluster(clara.res,
palette = c("#00AFBB", "#FC4E07"), # color palette
ellipse.type = "t", # Concentration ellipse
geom = "point", pointsize = 1,
ggtheme = theme_classic()
)

Capítulo 7: Clostering Aglomerativo

library("factoextra")
library("cluster")
# Load the data
data("USArrests")
# Standardize the data
df <- scale(USArrests)
# Show the first 6 rows
head(df, nrow = 6)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
# Similarity measures
# Compute the dissimilarity matrix
# df = the standardized data
res.dist <- dist(df, method = "euclidean")
as.matrix(res.dist)[1:6, 1:6]
##             Alabama   Alaska  Arizona Arkansas California Colorado
## Alabama    0.000000 2.703754 2.293520 1.289810   3.263110 2.651067
## Alaska     2.703754 0.000000 2.700643 2.826039   3.012541 2.326519
## Arizona    2.293520 2.700643 0.000000 2.717758   1.310484 1.365031
## Arkansas   1.289810 2.826039 2.717758 0.000000   3.763641 2.831051
## California 3.263110 3.012541 1.310484 3.763641   0.000000 1.287619
## Colorado   2.651067 2.326519 1.365031 2.831051   1.287619 0.000000
# hclust() can be used as follow:
res.hc <- hclust(d = res.dist, method = "ward.D2")
# Dendrogram
# cex: label size
fviz_dend(res.hc, cex = 0.5) 

# Verify the cluster tree
# Compute cophentic distance
res.coph <- cophenetic(res.hc)
# Correlation between cophenetic distance and
# the original distance
cor(res.dist, res.coph)
## [1] 0.6975266
res.hc2 <- hclust(res.dist, method = "average")
cor(res.dist, cophenetic(res.hc2))
## [1] 0.7180382
# Cut the dendrogram into dierent groups
# Cut tree into 4 groups
grp <- cutree(res.hc, k = 4)
head(grp, n = 4)
##  Alabama   Alaska  Arizona Arkansas 
##        1        2        2        3
# Number of members in each cluster
table(grp)
## grp
##  1  2  3  4 
##  7 12 19 12
# Get the names for the members of cluster 1
rownames(df)[grp == 1]
## [1] "Alabama"        "Georgia"        "Louisiana"      "Mississippi"   
## [5] "North Carolina" "South Carolina" "Tennessee"
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)

# Using the function fviz_cluster()
fviz_cluster(list(data = df, cluster = grp),
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
ellipse.type = "convex", # Concentration ellipse
repel = TRUE, # Avoid label overplotting (slow)
show.clust.cent = FALSE, ggtheme = theme_minimal())

# Cluster R package
# Agglomerative Nesting (Hierarchical Clustering)
res.agnes <- agnes(x = USArrests, # data matrix
stand = TRUE, # Standardize the data
metric = "euclidean", # metric for distance matrix
method = "ward" # Linkage method
)
# DIvisive ANAlysis Clustering
res.diana <- diana(x = USArrests, # data matrix
stand = TRUE, # standardize the data
metric = "euclidean" # metric for distance matrix
)
# After running agnes() and diana(), you can use the function fviz_dend()[in factoextra] to visualize the output:
fviz_dend(res.agnes, cex = 0.6, k = 4)

Capítulo 8: Clostering Aglomerativo

library(dendextend)
library(corrplot)
#Data preparation
df <- scale(USArrests)
# Subset containing 10 rows
set.seed(123)
ss <- sample(1:50, 10)
df <- df[ss,]
#Comparing dendrograms
# Compute distance matrix
res.dist <- dist(df, method = "euclidean")
# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")
# Create two dendrograms
dend1 <- as.dendrogram (hc1)
dend2 <- as.dendrogram (hc2)
# Create a list to hold dendrograms
dend_list <- dendlist(dend1, dend2)
#Visual comparison of two dendrograms
#Draw a tanglegram:
tanglegram(dend1, dend2)

#Customized the tanglegram using many other options as follow:
tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = FALSE, # Turn-off line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)

#Correlation matrix between a list of dendrograms
# Cophenetic correlation matrix
cor.dendlist(dend_list, method = "cophenetic")
##           [,1]      [,2]
## [1,] 1.0000000 0.9925544
## [2,] 0.9925544 1.0000000
# Baker correlation matrix
cor.dendlist(dend_list, method = "baker")
##           [,1]      [,2]
## [1,] 1.0000000 0.9895528
## [2,] 0.9895528 1.0000000
# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)
## [1] 0.9925544
# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.9895528
#COMPARING DENDROGRAMS
# Create multiple dendrograms by chaining
dend1 <- df %>% dist %>% hclust("complete") %>% as.dendrogram
dend2 <- df %>% dist %>% hclust("single") %>% as.dendrogram
dend3 <- df %>% dist %>% hclust("average") %>% as.dendrogram
dend4 <- df %>% dist %>% hclust("centroid") %>% as.dendrogram
# Compute correlation matrix
dend_list <- dendlist("Complete" = dend1, "Single" = dend2,
"Average" = dend3, "Centroid" = dend4)
cors <- cor.dendlist(dend_list)
# Print correlation matrix
round(cors, 2)
##          Complete Single Average Centroid
## Complete     1.00   0.46    0.45     0.30
## Single       0.46   1.00    0.23     0.17
## Average      0.45   0.23    1.00     0.31
## Centroid     0.30   0.17    0.31     1.00
# Visualize the correlation matrix using corrplot package
corrplot(cors, "pie", "lower")

Capítulo 9: Visualización de Dendogramas

library(factoextra)
library(dendextend)
library(igraph)
# Load data
data(USArrests)
# Compute distances and hierarchical clustering
dd <- dist(scale(USArrests), method = "euclidean")
hc <- hclust(dd, method = "ward.D2")
#Visualizing dendrograms
fviz_dend(hc, cex = 0.5)

#You can use the arguments main, sub, xlab, ylab to change plot titles as follow:
fviz_dend(hc, cex = 0.5,
main = "Dendrogram - ward.D2",
xlab = "Objects", ylab = "Distance", sub = "")

#To draw a horizontal dendrogram, type this:
fviz_dend(hc, cex = 0.5, horiz = TRUE)

#It’s also possible to cut the tree at a given height for partitioning the data into multiple groups as described in the previous chapter: Hierarchical clustering (Chapter 7). In this case, it’s possible to color branches by groups and to add rectangle around each group. For example:
fviz_dend(hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE, # Add rectangle around groups
rect_border = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
rect_fill = TRUE)

#To change the plot theme
fviz_dend(hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
ggtheme = theme_gray() # Change theme
)

#In the R code below, we’ll change group colors using “jco” (journal of clinical oncology) color palette:
fviz_dend(hc, cex = 0.5, k = 4, # Cut in four groups
k_colors = "jco")

#If you want to draw a horizontal dendrogram with rectangle around clusters, use this:
fviz_dend(hc, k = 4, cex = 0.4, horiz = TRUE, k_colors = "jco",
rect = TRUE, rect_border = "jco", rect_fill = TRUE)

#Additionally, you can plot a circular dendrogram using the option type = “circular”.
fviz_dend(hc, cex = 0.5, k = 4,
k_colors = "jco", type = "circular")

fviz_dend(hc, k = 4, k_colors = "jco",
type = "phylogenic", repel = TRUE)

# Case of dendrogram with large data sets
#Zooming in the dendrogram
fviz_dend(hc, xlim = c(1, 20), ylim = c(1, 8))

#Plotting a sub-tree of dendrograms
# Create a plot of the whole dendrogram,
# and extract the dendrogram data
dend_plot <- fviz_dend(hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = "jco"
)
dend_data <- attr(dend_plot, "dendrogram") # Extract dendrogram data
# Cut the dendrogram at height h = 10
dend_cuts <- cut(dend_data, h = 10)
# Visualize the truncated version containing
# two branches
fviz_dend(dend_cuts$upper)

# Plot the whole dendrogram
print(dend_plot)

# Plot subtree 1
fviz_dend(dend_cuts$lower[[1]], main = "Subtree 1")

# Plot subtree 2
fviz_dend(dend_cuts$lower[[2]], main = "Subtree 2")

#You can also plot circular trees as follow:
fviz_dend(dend_cuts$lower[[2]], type = "circular")

# Saving dendrogram into a large PDF page
pdf("dendrogram.pdf", width=30, height=15) # Open a PDF
p <- fviz_dend(hc, k = 4, cex = 1, k_colors = "jco" ) # Do plotting
print(p)
dev.off()
## png 
##   2
#Manipulating dendrograms using dendextend
#Standard R code for creating a dendrogram:
data <- scale(USArrests)
dist.res <- dist(data)
hc <- hclust(dist.res, method = "ward.D2")
dend <- as.dendrogram(hc)
plot(dend)

#R code for creating a dendrogram using chaining operator
library(dendextend)
dend <- USArrests[1:5,] %>% # data
scale %>% # Scale the data
dist %>% # calculate a distance matrix,
hclust(method = "ward.D2") %>% # Hierarchical clustering
as.dendrogram # Turn the object into a dendrogram.
plot(dend)

#MANIPULATING DENDROGRAMS USING DENDEXTEND
#Example
library(dendextend)
# 1. Create a customized dendrogram
mycols <- c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07")
dend <- as.dendrogram(hc) %>%
set("branches_lwd", 1) %>% # Branches line width
set("branches_k_color", mycols, k = 4) %>% # Color branches by groups
set("labels_colors", mycols, k = 4) %>% # Color labels by groups
set("labels_cex", 0.5) # Change label size
# 2. Create plot
fviz_dend(dend)