1. Classification Analysis Using Clustering

1.1 Creating the object

We prepare the data using the USArrests data set, which contains arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percentage of the population living in urban areas. Rows with missing values are dropped and each variable is standardized, so that no single variable dominates the distance calculations used later.

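# Clear the workspace
rm(list = ls())

# Set the seed so the results are reproducible
set.seed(1)

# Load the packages used below:
# tidyverse for %>%, as_tibble, mutate, ggplot; factoextra for get_dist, fviz_*
library(tidyverse)
library(factoextra)

# Load the data set, drop rows with missing values, and standardize each column
df <- USArrests
df <- na.omit(df)
df <- scale(df)
head(df)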

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
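
As a quick check on the standardization step, the following sketch confirms that every column of the scaled matrix has mean 0 and standard deviation 1. It uses only base R; the helper names `col_means` and `col_sds` are illustrative and do not appear in the original report.

```r
# Standardize USArrests as in the section above
df <- scale(na.omit(USArrests))

# After scale(), every column should have mean ~0 and standard deviation 1
col_means <- colMeans(df)
col_sds   <- apply(df, 2, sd)

print(round(col_means, 10))
print(col_sds)

# scale() keeps the original column means and standard deviations as attributes
print(attr(df, "scaled:center"))
print(attr(df, "scaled:scale"))
```

The stored attributes make it possible to map cluster centers back to the original units if needed.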

1.2 Visualizing the data: distance matrix

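# Pairwise distances between states (get_dist defaults to Euclidean distance)
distance <- get_dist(df)
# Heatmap of the distance matrix: low distances in teal, high distances in orange
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))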
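
To make the content of the distance matrix concrete, here is a minimal sketch using base R's `dist()` rather than `get_dist()`. It assumes the default Euclidean distance and recomputes one entry by hand; the names `d_mat` and `manual` are illustrative.

```r
df <- scale(na.omit(USArrests))

# Euclidean distances between all pairs of states (base R)
d <- dist(df, method = "euclidean")
d_mat <- as.matrix(d)

# Manual computation for one pair, to show what each entry means
manual <- sqrt(sum((df["Alabama", ] - df["Alaska", ])^2))

print(d_mat["Alabama", "Alaska"])
print(manual)

# One row and one column per state
print(dim(d_mat))
```

Each cell of the heatmap corresponds to one such pairwise distance between two standardized rows.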

# k-means clustering with k = 2; nstart = 25 tries 25 random starts and keeps the best
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)

## List of 9
##  $ cluster     : Named int [1:50] 1 1 1 2 1 1 2 2 1 1 ...
##   ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ centers     : num [1:2, 1:4] 1.005 -0.67 1.014 -0.676 0.198 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
##  $ totss       : num 196
##  $ withinss    : num [1:2] 46.7 56.1
##  $ tot.withinss: num 103
##  $ betweenss   : num 93.1
##  $ size        : int [1:2] 20 30
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
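
The fields in this output are related: the total sum of squares splits into a within-cluster part and a between-cluster part. The sketch below, which refits the same model under the same seed, checks that identity and computes the share of variation captured by the grouping. The variable `explained` is an illustrative name, not part of the original report.

```r
set.seed(1)
df <- scale(na.omit(USArrests))
k2 <- kmeans(df, centers = 2, nstart = 25)

# Total sum of squares = within-cluster part + between-cluster part
print(k2$totss)
print(k2$tot.withinss + k2$betweenss)

# On standardized data, totss equals (n - 1) * p, here 49 * 4
n <- nrow(df)
p <- ncol(df)
print((n - 1) * p)

# Share of the total variation explained by the grouping
explained <- k2$betweenss / k2$totss
print(round(explained, 3))
```

With the values reported above (betweenss 93.1 out of totss 196), two clusters account for a little under half of the total variation.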

## Visualization
fviz_cluster(k2, data = df)

df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster,
         state = row.names(USArrests)) %>%
  ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
  geom_text()

2. Generating the clusters

# k-means clustering
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)

## List of 9
##  $ cluster     : Named int [1:50] 2 2 2 1 2 2 1 1 2 2 ...
##   ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ centers     : num [1:2, 1:4] -0.67 1.005 -0.676 1.014 -0.132 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
##  $ totss       : num 196
##  $ withinss    : num [1:2] 56.1 46.7
##  $ tot.withinss: num 103
##  $ betweenss   : num 93.1
##  $ size        : int [1:2] 30 20
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
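
This second run reports sizes 30 and 20 where the earlier run reported 20 and 30: the cluster numbers that kmeans assigns are arbitrary and can swap between runs. The sketch below compares two runs in a label-independent way. The seeds and the names `run_a`, `run_b`, `same_a`, and `same_b` are assumptions chosen for illustration.

```r
df <- scale(na.omit(USArrests))

set.seed(1)
run_a <- kmeans(df, centers = 2, nstart = 25)
set.seed(2)
run_b <- kmeans(df, centers = 2, nstart = 25)

# Cross-tabulate the two labelings: if the partitions agree, each row has
# exactly one non-zero cell, whether or not the label numbers match
agreement <- table(run_a = run_a$cluster, run_b = run_b$cluster)
print(agreement)

# Label-independent comparison: do the same pairs of states share a cluster?
same_a <- outer(run_a$cluster, run_a$cluster, "==")
same_b <- outer(run_b$cluster, run_b$cluster, "==")
print(identical(same_a, same_b))
```

Comparing co-membership rather than raw labels avoids mistaking a label swap for a different clustering.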

## Visualization
fviz_cluster(k2, data = df)

df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster,
         state = row.names(USArrests)) %>%
  ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
  geom_text()

k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)

2.1 Generating the plots for comparison

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
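
Because the plotting code is hidden, the following is only a sketch of how such a comparison could be drawn, not the report's actual chunk. It swaps in base R graphics for `fviz_cluster`: the states are projected onto the first two principal components (the same kind of projection `fviz_cluster` uses when there are more than two variables) and colored by cluster for k = 2 to 5. The names `fits` and `pcs` are illustrative.

```r
set.seed(1)
df <- scale(na.omit(USArrests))

# Fit k-means for k = 2..5, as in the section above
fits <- lapply(2:5, function(k) kmeans(df, centers = k, nstart = 25))
names(fits) <- paste0("k = ", 2:5)

# Project the four variables onto the first two principal components
pcs <- prcomp(df)$x[, 1:2]

# One panel per k, points colored by cluster membership
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (label in names(fits)) {
  plot(pcs, col = fits[[label]]$cluster, pch = 19, main = label,
       xlab = "PC1", ylab = "PC2")
}
par(old_par)
```

Placing the four panels side by side makes it easier to see where an additional cluster splits an existing group.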

2.2 Determining the optimal number of clusters

## Elbow method
set.seed(123)

# Function to compute the total within-cluster sum of squares
wss <- function(k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}

# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15
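
The listing stops before the values are computed and plotted. A minimal sketch of how the step could be completed, assuming base R's `sapply` and `plot` (the original may have used other functions), is shown below; `wss_values` is an illustrative name.

```r
set.seed(123)
df <- scale(na.omit(USArrests))

# Total within-cluster sum of squares for a given k
wss <- function(k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}

# Evaluate wss for k = 1 to k = 15
k.values <- 1:15
wss_values <- sapply(k.values, wss)

# Plot wss against k; the "elbow" is where the curve stops dropping sharply
plot(k.values, wss_values, type = "b", pch = 19, frame.plot = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

The value of k at the bend of the curve is a common heuristic choice, since adding clusters beyond it reduces the within-cluster sum of squares only marginally.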

R Markdown Workshop Case

Luis Gastelu Ortiz

21/8/2021