A research group studied a group of newborns at a given hospital and recorded the following information: gestational age in weeks (egest), sex (sexo; 1 = female, 2 = male), birth weight in grams (pesonac), mother's age in years (emadre), presence of a central catheter (cateter; 0 = no, 1 = yes), days of hospitalization (estadia), presence of fever (fiebre; 0 = no, 1 = yes) and presence of nosocomial infection (infeccion; 0 = no, 1 = yes).
library(readxl)
datos <- read_excel("infeccion.xlsx")
datos$sexo <- factor(datos$sexo, labels = c("F","M"))
datos$fiebre <- factor(datos$fiebre, labels = c("No","Si"))
datos$infeccion <- factor(datos$infeccion, labels = c("No","Si"))
datos$cateter <- factor(datos$cateter, labels = c("No","Si"))
summary(datos[-1])
##     pesonac     sexo       egest           emadre      fiebre  infeccion
##  Min.   : 808   F:11   Min.   :23.00   Min.   :16.00   No:21   No:21
##  1st Qu.:1900   M:19   1st Qu.:34.00   1st Qu.:23.00   Si: 9   Si: 9
##  Median :2226          Median :37.00   Median :27.00
##  Mean   :2274          Mean   :35.53   Mean   :27.67
##  3rd Qu.:2788          3rd Qu.:38.00   3rd Qu.:31.50
##  Max.   :3500          Max.   :40.00   Max.   :39.00
##     estadia      cateter
##  Min.   : 1.00   No:26
##  1st Qu.: 4.00   Si: 4
##  Median : 9.00
##  Mean   :13.80
##  3rd Qu.:19.75
##  Max.   :42.00
par(mfrow = c(3, 3))
hist(datos$pesonac, main = "Histogram of birth weight (g)")
plot(datos$sexo, main = "Bar chart of sex")
hist(datos$egest, main = "Histogram of gestational age (weeks)")
hist(datos$emadre, main = "Histogram of mother's age (years)")
plot(datos$fiebre, main = "Bar chart of fever")
plot(datos$infeccion, main = "Bar chart of nosocomial infection")
hist(datos$estadia, main = "Histogram of days of hospitalization")
plot(datos$cateter, main = "Bar chart of central catheter")
par(mfrow = c(1, 1))
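# Keep only the numeric variables (pesonac, egest, emadre, estadia) for clustering
datos2 <- datos[-c(1, 3, 6, 7, 9)]
# Standardize each variable to mean 0 and standard deviation 1
datos3 <- scale(datos2)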
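The standardization step can be illustrated on a small made-up table. This is a minimal sketch, assuming a hypothetical data frame `toy` (not the study data): `scale()` subtracts each column's mean and divides by its sample standard deviation, so that variables measured in grams and in weeks contribute comparably to Euclidean distances.

```r
# Hypothetical toy data (NOT the study data): two variables on very different scales
toy <- data.frame(weight_g = c(900, 1800, 2500, 3200),
                  weeks    = c(26, 33, 37, 40))

# scale() centers each column on its mean and divides by its sample SD
toy_z <- scale(toy)

# The same transformation computed by hand for the first column
manual_z <- (toy$weight_g - mean(toy$weight_g)) / sd(toy$weight_g)

print(round(colMeans(toy_z), 10))   # column means after scaling
print(apply(toy_z, 2, sd))          # column SDs after scaling
print(all.equal(as.numeric(toy_z[, "weight_g"]), manual_z))
```

Without this step, birth weight (hundreds to thousands of grams) would likely dominate the distance calculations.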
library(factoextra)
m.distancia <- get_dist(datos3, method = "euclidean")
fviz_dist(m.distancia, gradient = list(low = "blue", mid = "white", high = "red"))
# m.distancia
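As a point of reference, the Euclidean distance between two standardized observations is the square root of the sum of their squared coordinate differences. The sketch below checks a hand calculation against base R's `dist()`; the matrix `z` holds illustrative values only, not rows from the study data.

```r
# Hypothetical standardized values for three observations (NOT the study data)
z <- matrix(c(-1.2,  0.3,
               0.4,  0.9,
               1.1, -0.5),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("v1", "v2")))

# Pairwise Euclidean distances with base R
d <- dist(z, method = "euclidean")

# Distance between observations "a" and "b" computed by hand
d_ab <- sqrt(sum((z["a", ] - z["b", ])^2))

print(d)
print(d_ab)
```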
library(NbClust)
resnumclust <- NbClust(datos3, distance = "euclidean", method = "centroid", index = "alllong")
## Warning in pf(beale, pp, df2): NaNs produced
## Warning in pf(beale, pp, df2): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 7 proposed 9 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 2 proposed 14 as the best number of clusters
## * 2 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
# fviz_nbclust(resnumclust)
resnumclust <- NbClust(datos3, distance = "euclidean", method = "kmeans", index = "alllong")
## Warning in pf(beale, pp, df2): NaNs produced
## Warning in pf(beale, pp, df2): NaNs produced
## Warning in pf(beale, pp, df2): NaNs produced
## Warning in pf(beale, pp, df2): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 8 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
# fviz_nbclust(resnumclust)
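The majority-rule summary for the k-means run can be re-tallied by hand. The vote counts below are copied from the output above; the rest is a sketch. Note that 3 and 15 clusters tie with 8 votes each, and the reported conclusion corresponds to the first (smallest) of the tied candidates. That is an observation about this output, not a statement about how NbClust breaks ties internally.

```r
# Votes per candidate number of clusters, copied from the k-means NbClust summary
votes <- c("2" = 7, "3" = 8, "5" = 1, "7" = 1, "9" = 1, "13" = 1, "15" = 8)

# All candidates that share the highest vote count
top <- names(votes)[votes == max(votes)]

# which.max() returns the position of the first maximum
first_top <- names(votes)[which.max(votes)]

print(top)
print(first_top)
print(sum(votes))   # number of indices that cast a vote
```

With a tie this close and only 30 observations, the smaller solution is arguably also the easier one to interpret.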
dendograma <- hcut(datos3, k = 3, stand = TRUE)
fviz_dend(dendograma, rect = TRUE, cex = 0.5,
          k_colors = c("red", "green", "blue"))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
k3 <- kmeans(datos3, centers = 3, nstart = 50)
fviz_cluster(k3, data = datos3)
fviz_cluster(k3, data = datos3, ellipse.type = "euclid", repel = TRUE, star.plot = TRUE)
k3
## K-means clustering with 3 clusters of sizes 8, 17, 5
##
## Cluster means:
##      pesonac      egest      emadre     estadia
## 1 -1.2190156 -1.2766260  0.05287202  1.08365571
## 2  0.2472322  0.4066569 -0.46962791 -0.53387475
## 3  1.1098357  0.6599682  1.51213968  0.08132501
##
## Clustering vector:
## [1] 1 2 2 3 1 1 1 1 3 2 1 2 2 2 2 2 2 3 2 2 3 2 1 2 2 2 3 2 2 1
##
## Within cluster sum of squares by cluster:
## [1] 23.659922 20.397910 5.351649
## (between_SS / total_SS = 57.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
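The reported `between_SS / total_SS` figure can be reproduced from the printed within-cluster sums of squares. This is a worked check, assuming n = 30 observations and p = 4 standardized variables: because `scale()` divides by the sample standard deviation, each column has a sum of squares of n - 1, so the total is (n - 1) * p.

```r
# Within-cluster sums of squares, copied from the k-means output above
withinss <- c(23.659922, 20.397910, 5.351649)

n <- 30   # observations (length of the clustering vector)
p <- 4    # standardized variables used for clustering

# For standardized data, each column contributes (n - 1) to the total sum of squares
totss <- (n - 1) * p

tot_withinss <- sum(withinss)
betweenss    <- totss - tot_withinss
ratio        <- betweenss / totss

print(totss)
print(tot_withinss)
print(round(100 * ratio, 1))   # percentage of total variance between clusters
```

Read this way, the three clusters account for a bit more than half of the total variability in the four standardized variables.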
str(k3)
## List of 9
## $ cluster : int [1:30] 1 2 2 3 1 1 1 1 3 2 ...
## $ centers : num [1:3, 1:4] -1.219 0.247 1.11 -1.277 0.407 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:4] "pesonac" "egest" "emadre" "estadia"
## $ totss : num 116
## $ withinss : num [1:3] 23.66 20.4 5.35
## $ tot.withinss: num 49.4
## $ betweenss : num 66.6
## $ size : int [1:3] 8 17 5
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
datos$cluster <- as.factor(k3$cluster)
datos4 <- datos[-c(3,6,7,9)]
library(tidyr)
data_long <- gather(datos, caracteristica, valor, pesonac:cateter, factor_key=TRUE)
## Warning: attributes are not identical across measure variables; they will be
## dropped
#data_long
par(mfrow = c(3, 3))
boxplot(datos$pesonac~datos$cluster)
boxplot(datos$egest~datos$cluster)
boxplot(datos$emadre~datos$cluster)
boxplot(datos$estadia~datos$cluster)
plot(datos$infeccion~datos$cluster)
plot(datos$cateter~datos$cluster)
plot(datos$sexo~datos$cluster)
plot(datos$fiebre~datos$cluster)
par(mfrow = c(1,1))
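A numeric companion to the boxplots is a table of per-cluster means in the original units. The sketch below uses base R's `aggregate()` on a hypothetical data frame `toy` with made-up values and cluster labels; the column names are illustrative and are not the study's variables.

```r
# Hypothetical data (NOT the study data): two numeric variables and a cluster label
toy <- data.frame(
  weight_g = c(900, 1100, 2400, 2600, 3300, 3400),
  stay_d   = c(35, 30, 8, 10, 4, 5),
  cluster  = factor(c(1, 1, 2, 2, 3, 3))
)

# Mean of each numeric variable within each cluster, in the original units
cluster_means <- aggregate(cbind(weight_g, stay_d) ~ cluster, data = toy, FUN = mean)

print(cluster_means)
```

Applied to the real data, the same call pattern would take the numeric columns on the left-hand side and `cluster` on the right.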
library(ggplot2)
# Plot the mean of each characteristic per cluster; a summary function is given
# explicitly so that no range segments are requested without ymin/ymax
ggplot(data_long, aes(x = caracteristica, y = valor, group = cluster, colour = cluster)) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  stat_summary(fun = mean, geom = "line")