Application exercise

A research group studied a set of newborns at a particular hospital and recorded the following information: gestational age in weeks (egest), sex (sexo: 1 = female, 2 = male), birth weight in grams (pesonac), mother's age in years (emadre), presence of a central catheter (cateter: 0 = no, 1 = yes), days of hospitalization (estadia), presence of fever (fiebre: 0 = no, 1 = yes), and presence of nosocomial infection (infeccion: 0 = no, 1 = yes).

1. Perform a descriptive analysis of the data.

library(readxl)
datos <- read_excel("infeccion.xlsx")

datos$sexo <- factor(datos$sexo, labels = c("F","M"))
datos$fiebre <- factor(datos$fiebre, labels = c("No","Si"))
datos$infeccion <- factor(datos$infeccion, labels = c("No","Si"))
datos$cateter <- factor(datos$cateter, labels = c("No","Si"))

summary(datos[-1])
##     pesonac     sexo       egest           emadre      fiebre  infeccion
##  Min.   : 808   F:11   Min.   :23.00   Min.   :16.00   No:21   No:21    
##  1st Qu.:1900   M:19   1st Qu.:34.00   1st Qu.:23.00   Si: 9   Si: 9    
##  Median :2226          Median :37.00   Median :27.00                    
##  Mean   :2274          Mean   :35.53   Mean   :27.67                    
##  3rd Qu.:2788          3rd Qu.:38.00   3rd Qu.:31.50                    
##  Max.   :3500          Max.   :40.00   Max.   :39.00                    
##     estadia      cateter
##  Min.   : 1.00   No:26  
##  1st Qu.: 4.00   Si: 4  
##  Median : 9.00          
##  Mean   :13.80          
##  3rd Qu.:19.75          
##  Max.   :42.00
par(mfrow = c(3, 3))
hist(datos$pesonac, main = "Histogram of birth weight (g)")
plot(datos$sexo, main = "Bar chart of sex")
hist(datos$egest, main = "Histogram of gestational age (weeks)")
hist(datos$emadre, main = "Histogram of mother's age (years)")
plot(datos$fiebre, main = "Bar chart of fever")
plot(datos$infeccion, main = "Bar chart of nosocomial infection")
hist(datos$estadia, main = "Histogram of length of stay (days)")
plot(datos$cateter, main = "Bar chart of central catheter")
par(mfrow = c(1, 1))

2. The main objective of the study is to group these newborns. How many groups would you create, what characteristics does each group have, and what are the main differences between them?

Selecting only the quantitative variables:

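# Keep pesonac, egest, emadre and estadia (drop column 1 and the factor columns)
datos2 <- datos[-c(1,3,6,7,9)]
# Standardize each column to mean 0 and standard deviation 1
datos3 <- scale(datos2)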
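
As a hedged illustration of what `scale()` does at this step: it subtracts each column's mean and divides by its sample standard deviation, so that variables measured in grams, weeks, years and days contribute on a comparable footing to the Euclidean distances used later. The data frame `toy` below is made up for the example and is not the study data.

```r
# Hypothetical toy data standing in for the four quantitative columns
toy <- data.frame(
  pesonac = c(808, 1900, 2226, 2788, 3500),
  egest   = c(23, 34, 37, 38, 40),
  emadre  = c(16, 23, 27, 31, 39),
  estadia = c(42, 20, 9, 4, 1)
)

# Standardization as done in the exercise
z_scale <- scale(toy)

# The same z-scores computed by hand: (x - mean) / sd, column by column
z_manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# Both routes agree up to floating-point tolerance
all.equal(z_scale, z_manual, check.attributes = FALSE)

# Each standardized column has mean ~0 and standard deviation 1
round(colMeans(z_scale), 10)
apply(z_scale, 2, sd)
```

Without this step, `pesonac` (hundreds to thousands of grams) would dominate any distance computed on the raw columns.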

library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
m.distancia <- get_dist(datos3, method = "euclidean")
fviz_dist(m.distancia, gradient = list(low = "blue", mid = "white", high = "red"))
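
The heat map above colors every pair of newborns by their Euclidean distance on the standardized variables. As a minimal sketch of that quantity, the example below uses base R's `dist()` on a small simulated matrix (the object `x` and its row names are hypothetical) and checks one entry by hand.

```r
# Hypothetical standardized data: 4 newborns, 3 variables (not the study data)
set.seed(1)
x <- scale(matrix(rnorm(12), nrow = 4,
                  dimnames = list(paste0("rn", 1:4), c("v1", "v2", "v3"))))

# All pairwise Euclidean distances
d <- dist(x, method = "euclidean")

# Distance between newborns 1 and 2 computed by hand:
# square root of the sum of squared coordinate differences
d_12 <- sqrt(sum((x[1, ] - x[2, ])^2))

# The hand computation matches the corresponding entry of the distance matrix
all.equal(as.matrix(d)["rn1", "rn2"], d_12)

round(as.matrix(d), 2)
```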

# m.distancia
library(NbClust)
resnumclust <- NbClust(datos3, distance = "euclidean", method = "centroid", index = "alllong")
## Warning in pf(beale, pp, df2): NaNs produced

## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 7 proposed 9 as the best number of clusters 
## * 1 proposed 11 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 2 proposed 14 as the best number of clusters 
## * 2 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
# fviz_nbclust(resnumclust)

resnumclust <- NbClust(datos3, distance = "euclidean", method = "kmeans", index = "alllong")
## Warning in pf(beale, pp, df2): NaNs produced

## Warning in pf(beale, pp, df2): NaNs produced

## Warning in pf(beale, pp, df2): NaNs produced

## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 8 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 8 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
# fviz_nbclust(resnumclust)
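
A closer look at the two vote tallies printed above shows that the top count is tied in both runs: 7 indices each for 2 and 9 clusters with the centroid method, and 8 each for 3 and 15 clusters with k-means. In both cases the reported conclusion coincides with the smaller of the tied values. The sketch below re-enters the printed counts by hand; `first_max()` and `all_max()` are hypothetical helpers written for this example, not NbClust functions.

```r
# Vote counts copied from the two NbClust summaries above
votes_centroid <- c("2" = 7, "3" = 4, "5" = 1, "6" = 2, "8" = 1, "9" = 7,
                    "11" = 1, "13" = 1, "14" = 2, "15" = 2)
votes_kmeans   <- c("2" = 7, "3" = 8, "5" = 1, "7" = 1, "9" = 1,
                    "13" = 1, "15" = 8)

# Hypothetical helpers: first maximum versus every value sharing the maximum
first_max <- function(v) names(v)[which.max(v)]
all_max   <- function(v) names(v)[v == max(v)]

first_max(votes_centroid)  # matches the reported conclusion for "centroid"
all_max(votes_centroid)    # shows the tie
first_max(votes_kmeans)    # matches the reported conclusion for "kmeans"
all_max(votes_kmeans)      # shows the tie

# Average group size for the tied k-means candidates, with n = 30 newborns
30 / as.numeric(all_max(votes_kmeans))
```

With 30 newborns, 15 clusters would average two babies per group, which is hard to interpret clinically, so preferring 3 over 15 seems reasonable on practical grounds as well.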
dendograma <- hcut(datos3, k = 3, stand = TRUE)
fviz_dend(dendograma, rect = TRUE, cex = 0.5,
          k_colors = c("red","green","blue"))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
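
For readers who want to see the steps behind the dendrogram, the sketch below builds a comparable hierarchical clustering with base R only. It assumes Ward linkage (`"ward.D2"`) on Euclidean distances, which to my understanding is the default used by `hcut()`; treat that as an assumption and check the factoextra documentation. The matrix `sim` is simulated and is not the study data.

```r
# Simulated data: three well-separated groups of 10 observations, 4 variables
set.seed(42)
sim <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 4),
  matrix(rnorm(40, mean = 4), ncol = 4),
  matrix(rnorm(40, mean = 8), ncol = 4)
)
colnames(sim) <- c("pesonac", "egest", "emadre", "estadia")

# Standardize, compute Euclidean distances, then agglomerate with Ward linkage
hc <- hclust(dist(scale(sim), method = "euclidean"), method = "ward.D2")

# Cut the tree into 3 groups and count the members of each
grupos <- cutree(hc, k = 3)
table(grupos)

# Dendrogram with the 3 groups outlined
plot(hc, labels = FALSE, main = "Dendrogram (simulated data)")
rect.hclust(hc, k = 3, border = c("red", "green", "blue"))
```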

k3 <- kmeans(datos3, centers = 3, nstart = 50)
fviz_cluster(k3, data = datos3)

fviz_cluster(k3, data = datos3, ellipse.type = "euclid",repel = TRUE,star.plot = TRUE)

k3
## K-means clustering with 3 clusters of sizes 8, 17, 5
## 
## Cluster means:
##      pesonac      egest      emadre     estadia
## 1 -1.2190156 -1.2766260  0.05287202  1.08365571
## 2  0.2472322  0.4066569 -0.46962791 -0.53387475
## 3  1.1098357  0.6599682  1.51213968  0.08132501
## 
## Clustering vector:
##  [1] 1 2 2 3 1 1 1 1 3 2 1 2 2 2 2 2 2 3 2 2 3 2 1 2 2 2 3 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 23.659922 20.397910  5.351649
##  (between_SS / total_SS =  57.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
str(k3)
## List of 9
##  $ cluster     : int [1:30] 1 2 2 3 1 1 1 1 3 2 ...
##  $ centers     : num [1:3, 1:4] -1.219 0.247 1.11 -1.277 0.407 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:4] "pesonac" "egest" "emadre" "estadia"
##  $ totss       : num 116
##  $ withinss    : num [1:3] 23.66 20.4 5.35
##  $ tot.withinss: num 49.4
##  $ betweenss   : num 66.6
##  $ size        : int [1:3] 8 17 5
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
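
The 57.4 % reported by `kmeans()` can be reproduced from the printed numbers. Because every column of `datos3` was standardized to variance 1, the total sum of squares equals (n - 1) × p = 29 × 4 = 116, which agrees with `totss` above. The following arithmetic sketch re-enters the printed within-cluster sums of squares by hand.

```r
# Values copied from the k3 output above
withinss <- c(23.659922, 20.397910, 5.351649)
n <- 30   # newborns
p <- 4    # standardized variables

# Total sum of squares of standardized data: each column contributes n - 1
totss <- (n - 1) * p

# Within- and between-cluster sums of squares
tot_withinss <- sum(withinss)
betweenss    <- totss - tot_withinss

# Share of total variability explained by the 3-cluster partition, in percent
round(100 * betweenss / totss, 1)
```

Roughly 57 % of the variability in the four standardized variables is therefore between clusters, and the remaining 43 % is within them.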
datos$cluster <- as.factor(k3$cluster)
datos4 <- datos[-c(3,6,7,9)]

library(tidyr)
data_long <- gather(datos, caracteristica, valor, pesonac:cateter, factor_key=TRUE)
## Warning: attributes are not identical across measure variables; they will be
## dropped
#data_long

par(mfrow = c(3, 3))
boxplot(datos$pesonac~datos$cluster)
boxplot(datos$egest~datos$cluster)
boxplot(datos$emadre~datos$cluster)
boxplot(datos$estadia~datos$cluster)
plot(datos$infeccion~datos$cluster)
plot(datos$cateter~datos$cluster)
plot(datos$sexo~datos$cluster)
plot(datos$fiebre~datos$cluster)
par(mfrow = c(1,1))
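
The boxplots compare the clusters visually. To describe each group in the original units (grams, weeks, years, days), one option is to undo the standardization of the k-means centers, or equivalently to average the raw columns by cluster. The sketch below shows both routes on simulated data. `sim`, `km`, `centros_orig` and `perfil` are hypothetical names, and the simulated values are not the study data.

```r
# Simulated stand-in for the four quantitative variables (not the study data)
set.seed(7)
sim <- data.frame(
  pesonac = rnorm(30, mean = 2300, sd = 650),
  egest   = rnorm(30, mean = 35.5, sd = 4),
  emadre  = rnorm(30, mean = 27.5, sd = 6),
  estadia = rexp(30, rate = 1 / 14)
)

z  <- scale(sim)
km <- kmeans(z, centers = 3, nstart = 50)

# Route 1: undo the standardization, center_original = center_z * sd + mean
centros_orig <- sweep(km$centers, 2, attr(z, "scaled:scale"), "*")
centros_orig <- sweep(centros_orig, 2, attr(z, "scaled:center"), "+")

# Route 2: per-cluster means of the raw columns
perfil <- aggregate(sim, by = list(cluster = km$cluster), FUN = mean)

# Both routes give the same cluster profiles
all.equal(as.matrix(perfil[, -1]), centros_orig, check.attributes = FALSE)
round(centros_orig, 1)
```

In the exercise itself, the second route would amount to averaging `pesonac`, `egest`, `emadre` and `estadia` by `datos$cluster`.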

library(ggplot2)
ggplot(data_long, aes(x = caracteristica, y = valor, group = cluster, colour = cluster)) + 
  stat_summary(fun = mean, geom="pointrange", size = 1)+
  stat_summary(geom="line")
## No summary function supplied, defaulting to `mean_se()`
## Warning: Removed 24 rows containing missing values (`geom_segment()`).
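
The profile plot above places every characteristic on one raw axis, so birth weight in grams dwarfs the other variables, and the factor columns appear to enter only as numeric codes (see the `gather()` warning). As a hedged alternative, the sketch below plots the standardized cluster means copied from the `k3` output, using base graphics only. Cluster numbers are arbitrary labels assigned by `kmeans()`.

```r
# Standardized cluster means and sizes copied from the k3 output above
centers <- matrix(
  c(-1.2190156, -1.2766260,  0.05287202,  1.08365571,
     0.2472322,  0.4066569, -0.46962791, -0.53387475,
     1.1098357,  0.6599682,  1.51213968,  0.08132501),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), c("pesonac", "egest", "emadre", "estadia"))
)
sizes <- c("1" = 8, "2" = 17, "3" = 5)
cols  <- c("red", "green", "blue")

# One line per cluster across the four standardized variables
matplot(t(centers), type = "b", pch = 19, lty = 1, col = cols,
        xaxt = "n", xlab = "", ylab = "Cluster mean (z-score)")
axis(1, at = 1:4, labels = colnames(centers))
abline(h = 0, lty = 2)  # overall mean of every standardized variable
legend("topleft", bty = "n", col = cols, lty = 1, pch = 19,
       legend = paste0("Cluster ", names(sizes), " (n = ", sizes, ")"))

# Which cluster is lowest and highest on each variable
extremos <- apply(centers, 2, function(v)
  c(lowest = names(which.min(v)), highest = names(which.max(v))))
extremos
```

One reading of these centers, offered as an interpretation rather than a definitive answer: cluster 1 (8 newborns) has the lowest birth weight and gestational age and the longest stays, cluster 2 (17 newborns) sits near or slightly above average on weight and gestational age with the youngest mothers and the shortest stays, and cluster 3 (5 newborns) has the highest birth weight and the oldest mothers with an average length of stay.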