Application exercise

A research group studied newborns at a particular hospital and recorded the following information: gestational age in weeks (egest), sex (sexo: 1 = female, 2 = male), birth weight in grams (pesonac), mother's age in years (emadre), presence of a central catheter (cateter: 0 = no, 1 = yes), length of hospital stay in days (estadia), presence of fever (fiebre: 0 = no, 1 = yes), and presence of nosocomial infection (infeccion: 0 = no, 1 = yes).

1. Carry out a descriptive analysis of the data

library(readxl)
# Read the data set from the Excel file
datos <- read_excel("infeccion.xlsx")

# Recode the numeric codes as labelled factors
datos$sexo <- factor(datos$sexo, labels = c("F","M"))
datos$fiebre <- factor(datos$fiebre, labels = c("No","Si"))
datos$infeccion <- factor(datos$infeccion, labels = c("No","Si"))
datos$cateter <- factor(datos$cateter, labels = c("No","Si"))
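
A note on the recoding step: when only `labels` is given, `factor()` pairs the labels with the sorted codes that actually occur in the column, so the call depends on every code being present. The sketch below uses a made-up vector (`codigo` is hypothetical, not a column of infeccion.xlsx) to show how passing `levels` explicitly fixes the mapping regardless of which codes appear.

```r
# Hypothetical 0/1 codes in which only the value 1 happens to be observed
codigo <- c(1, 1, 1)

# Explicit levels fix the code-to-label mapping: 0 = "No", 1 = "Si"
explicito <- factor(codigo, levels = c(0, 1), labels = c("No", "Si"))

# The unobserved level "No" is kept with a count of zero
print(table(explicito))
```

With the mapping fixed this way, a column that happens to contain a single code is still labelled correctly.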

# Summarize every column except the first
summary(datos[-1])

##     pesonac     sexo       egest           emadre      fiebre  infeccion
##  Min.   : 808   F:11   Min.   :23.00   Min.   :16.00   No:21   No:21    
##  1st Qu.:1900   M:19   1st Qu.:34.00   1st Qu.:23.00   Si: 9   Si: 9    
##  Median :2226          Median :37.00   Median :27.00                    
##  Mean   :2274          Mean   :35.53   Mean   :27.67                    
##  3rd Qu.:2788          3rd Qu.:38.00   3rd Qu.:31.50                    
##  Max.   :3500          Max.   :40.00   Max.   :39.00                    
##     estadia      cateter
##  Min.   : 1.00   No:26  
##  1st Qu.: 4.00   Si: 4  
##  Median : 9.00          
##  Mean   :13.80          
##  3rd Qu.:19.75          
##  Max.   :42.00

# 3 x 3 grid: histograms for quantitative variables, bar charts for factors
par(mfrow = c(3, 3))
hist(datos$pesonac, main = "Histogram of birth weight (g)")
plot(datos$sexo, main = "Bar chart of sex")
hist(datos$egest, main = "Histogram of gestational age (weeks)")
hist(datos$emadre, main = "Histogram of mother's age (years)")
plot(datos$fiebre, main = "Bar chart of fever")
plot(datos$infeccion, main = "Bar chart of nosocomial infection")
hist(datos$estadia, main = "Histogram of hospital stay (days)")
plot(datos$cateter, main = "Bar chart of central catheter")
par(mfrow = c(1, 1))  # restore the single-panel layout

2. The main objective of the study is to group these newborns. How many groups would you create, what characterizes each group, and what are the main differences between them?

Selecting only the quantitative variables

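# Drop column 1 and the factors (sexo, fiebre, infeccion, cateter),
# keeping pesonac, egest, emadre and estadia
datos2 <- datos[-c(1,3,6,7,9)]
# Standardize each column to mean 0 and standard deviation 1
datos3 <- scale(datos2)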
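
Standardizing matters here because birth weight is recorded in grams and would otherwise dominate any distance computed alongside weeks, years and days. As a minimal sketch of what `scale()` does, the toy data frame below (hypothetical values, not taken from infeccion.xlsx) is standardized both with `scale()` and by hand.

```r
# Toy data (hypothetical values)
x <- data.frame(peso = c(1000, 2000, 3000), semanas = c(30, 35, 40))

z <- scale(x)

# Same result by hand: subtract the column mean, then divide by the column SD
centrado <- sweep(as.matrix(x), 2, colMeans(x))
z_manual <- sweep(centrado, 2, apply(x, 2, sd), "/")

print(z)
print(attr(z, "scaled:center"))  # column means used for centering
print(attr(z, "scaled:scale"))   # column standard deviations used for scaling
```

The two attributes stored by `scale()` are what make it possible to convert standardized values back to the original units later.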

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.3.2

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.3.2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

m.distancia <- get_dist(datos3, method = "euclidean")
fviz_dist(m.distancia, gradient = list(low = "blue", mid = "white", high = "red"))

# m.distancia
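
For reference, the Euclidean distance between two observations is the square root of the sum of squared differences across variables. The sketch below checks this on two hypothetical points using base R's `dist()`; the names `obs`, `a` and `b` are illustrative only.

```r
# Two hypothetical standardized observations with two variables each
obs <- rbind(a = c(0, 0), b = c(3, 4))

# Distance computed by dist()
d <- dist(obs, method = "euclidean")

# The same distance by hand: sqrt((0 - 3)^2 + (0 - 4)^2)
manual <- sqrt(sum((obs["a", ] - obs["b", ])^2))

print(d)
print(manual)
```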

library(NbClust)
# Let the indices vote on the number of clusters, using centroid-linkage hierarchical clustering
resnumclust <- NbClust(datos3, distance = "euclidean", method = "centroid", index = "alllong")

## Warning in pf(beale, pp, df2): NaNs produced

## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 7 proposed 9 as the best number of clusters 
## * 1 proposed 11 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 2 proposed 14 as the best number of clusters 
## * 2 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

# fviz_nbclust(resnumclust)

# Repeat the vote with k-means partitions (this overwrites the previous result)
resnumclust <- NbClust(datos3, distance = "euclidean", method = "kmeans", index = "alllong")

## Warning in pf(beale, pp, df2): NaNs produced

## Warning in pf(beale, pp, df2): NaNs produced

## Warning in pf(beale, pp, df2): NaNs produced

## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 8 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 8 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

# fviz_nbclust(resnumclust)
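
The majority rule deserves a second look, because the top vote count is tied in both summaries above: 2 and 9 clusters receive 7 votes each in the centroid run, and 3 and 15 clusters receive 8 votes each in the k-means run. In each case the reported conclusion is the smaller of the tied values. The sketch below tallies the votes copied from the two summaries; the helper `mas_votados` is a hypothetical name introduced here, not part of NbClust.

```r
# Votes copied from the two NbClust summaries (names = number of clusters)
votos_centroid <- c("2" = 7, "3" = 4, "5" = 1, "6" = 2, "8" = 1, "9" = 7,
                    "11" = 1, "13" = 1, "14" = 2, "15" = 2)
votos_kmeans <- c("2" = 7, "3" = 8, "5" = 1, "7" = 1, "9" = 1, "13" = 1, "15" = 8)

# Hypothetical helper: every candidate that shares the maximum vote count
mas_votados <- function(votos) names(votos)[votos == max(votos)]

print(mas_votados(votos_centroid))
print(mas_votados(votos_kmeans))
```

With only 30 newborns, 9 or 15 clusters would leave about two or three observations per group on average, so preferring the smaller tied value is a reasonable reading, though it is a judgment call rather than something the vote settles.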

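# Hierarchical clustering cut into 3 groups (datos3 is already standardized,
# so stand = TRUE leaves it unchanged)
dendograma <- hcut(datos3, k = 3, stand = TRUE)
fviz_dend(dendograma, rect = TRUE, cex = 0.5,
          k_colors = c("red","green","blue"))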

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
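
As a rough illustration of the hierarchical step, the sketch below builds a tree with base R's `hclust()` and cuts it with `cutree()`. It assumes Ward linkage ("ward.D2"), which, as far as I know, is the default method `hcut()` uses; the simulated points are hypothetical and unrelated to the newborn data.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: two tight groups of five points each, far apart
puntos <- rbind(matrix(rnorm(10, mean = 0, sd = 0.2), ncol = 2),
                matrix(rnorm(10, mean = 5, sd = 0.2), ncol = 2))

# Build the tree from Euclidean distances, then cut it into 2 groups
arbol <- hclust(dist(puntos), method = "ward.D2")
grupos <- cutree(arbol, k = 2)

print(table(grupos))
```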

# k-means with 3 centers; nstart = 50 keeps the best of 50 random starts
k3 <- kmeans(datos3, centers = 3, nstart = 50)
fviz_cluster(k3, data = datos3)

fviz_cluster(k3, data = datos3, ellipse.type = "euclid", repel = TRUE, star.plot = TRUE)

k3

## K-means clustering with 3 clusters of sizes 8, 17, 5
## 
## Cluster means:
##      pesonac      egest      emadre     estadia
## 1 -1.2190156 -1.2766260  0.05287202  1.08365571
## 2  0.2472322  0.4066569 -0.46962791 -0.53387475
## 3  1.1098357  0.6599682  1.51213968  0.08132501
## 
## Clustering vector:
##  [1] 1 2 2 3 1 1 1 1 3 2 1 2 2 2 2 2 2 3 2 2 3 2 1 2 2 2 3 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 23.659922 20.397910  5.351649
##  (between_SS / total_SS =  57.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

str(k3)

## List of 9
##  $ cluster     : int [1:30] 1 2 2 3 1 1 1 1 3 2 ...
##  $ centers     : num [1:3, 1:4] -1.219 0.247 1.11 -1.277 0.407 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:4] "pesonac" "egest" "emadre" "estadia"
##  $ totss       : num 116
##  $ withinss    : num [1:3] 23.66 20.4 5.35
##  $ tot.withinss: num 49.4
##  $ betweenss   : num 66.6
##  $ size        : int [1:3] 8 17 5
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
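
Two points help when reading this output. First, the 57.4 % figure is `betweenss / totss`, here 66.6 / 116, and the total of 116 equals (30 - 1) x 4 because each of the four standardized columns has a sum of squares of n - 1. Second, the cluster means are in standard-deviation units; multiplying by the column SD and adding the column mean returns them to grams, weeks, years and days. The sketch below shows both steps on simulated data (the data frame `crudo` and its values are hypothetical).

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: two well-separated groups of birth weights and stays
crudo <- data.frame(peso = c(rnorm(10, 1200, 100), rnorm(10, 3000, 100)),
                    dias = c(rnorm(10, 30, 3), rnorm(10, 5, 1)))
z <- scale(crudo)
km <- kmeans(z, centers = 2, nstart = 25)

# Share of total variability explained by the partition
ratio <- km$betweenss / km$totss

# Undo the standardization column by column: z * sd + mean
centros <- sweep(km$centers, 2, attr(z, "scaled:scale"), "*")
centros <- sweep(centros, 2, attr(z, "scaled:center"), "+")

# The same numbers obtained directly as raw means by cluster
medias <- aggregate(crudo, by = list(cluster = km$cluster), FUN = mean)

print(ratio)
print(centros)
print(medias)
```

Reporting cluster means in the original units is usually the easier way to describe what characterizes each group.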

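# Attach the k-means labels to the original data
datos$cluster <- as.factor(k3$cluster)
# Drop the categorical columns (datos4 is not used in the code below)
datos4 <- datos[-c(3,6,7,9)]

library(tidyr)
# Long format: one row per newborn and characteristic. The range pesonac:cateter
# mixes numeric and factor columns, which triggers the warning below.
data_long <- gather(datos, caracteristica, valor, pesonac:cateter, factor_key = TRUE)

## Warning: attributes are not identical across measure variables; they will be
## dropped

# data_long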

# Compare the clusters: boxplots for the quantitative variables,
# factor-by-factor plots for the categorical ones
par(mfrow = c(3, 3))
boxplot(pesonac ~ cluster, data = datos)
boxplot(egest ~ cluster, data = datos)
boxplot(emadre ~ cluster, data = datos)
boxplot(estadia ~ cluster, data = datos)
plot(infeccion ~ cluster, data = datos)
plot(cateter ~ cluster, data = datos)
plot(sexo ~ cluster, data = datos)
plot(fiebre ~ cluster, data = datos)
par(mfrow = c(1, 1))
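
The categorical comparisons can also be read as numbers rather than plots. The sketch below cross-tabulates cluster membership against a binary outcome and converts the counts to row proportions; the six observations are made up for illustration and do not come from the study.

```r
# Hypothetical cluster labels and infection status for six newborns
cluster <- factor(c(1, 1, 1, 2, 2, 2))
infeccion <- factor(c("Si", "Si", "No", "No", "No", "No"), levels = c("No", "Si"))

# Counts, then row proportions (distribution of the outcome within each cluster)
tab <- table(cluster, infeccion)
prop <- prop.table(tab, margin = 1)

print(tab)
print(round(prop, 2))
```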

library(ggplot2)
# Mean of each characteristic by cluster, drawn as points joined by lines.
# Using geom = "point" avoids the missing-range warning that "pointrange" gives
# when only fun = mean is supplied, and fun = mean is stated for the line as well.
ggplot(data_long, aes(x = caracteristica, y = valor, group = cluster, colour = cluster)) +
  stat_summary(fun = mean, geom = "point", size = 2) +
  stat_summary(fun = mean, geom = "line")

Cluster analysis

Jonathan Patricio Baldera

8/2/2021
