FINAL EXAM - DATA ANALYSIS

url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

nombres_columnas <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
                      "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")

diabetes <- read.csv(url, header = FALSE, col.names = nombres_columnas)

head(diabetes)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

summary(diabetes[, c("Glucose", "BMI")])

##     Glucose           BMI       
##  Min.   :  0.0   Min.   : 0.00  
##  1st Qu.: 99.0   1st Qu.:27.30  
##  Median :117.0   Median :32.00  
##  Mean   :120.9   Mean   :31.99  
##  3rd Qu.:140.2   3rd Qu.:36.60  
##  Max.   :199.0   Max.   :67.10

Glucose: Blood glucose concentration.

BMI: Body mass index, computed from weight and height (kg/m^2).
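The summary above reports a minimum of 0 for both Glucose and BMI. Neither value is physiologically plausible, so in this dataset zeros most likely stand for missing measurements. The sketch below is one possible way to count such zeros and recode them as `NA`. The helper names `count_zeros` and `zeros_to_na` are hypothetical, and the small data frame only reuses columns from the six rows printed by `head(diabetes)`. The analysis in this report keeps the data as loaded.

```r
# Count how many zeros each selected column contains.
count_zeros <- function(df, cols) {
  colSums(df[, cols, drop = FALSE] == 0)
}

# Recode zeros as NA in the selected columns (hypothetical cleaning step,
# assuming that a zero means "not measured").
zeros_to_na <- function(df, cols) {
  df[cols] <- lapply(df[cols], function(x) replace(x, x == 0, NA))
  df
}

# Columns copied from the six rows shown by head(diabetes).
sample_rows <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116),
  Insulin = c(0, 0, 0, 94, 168, 0),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

zero_counts <- count_zeros(sample_rows, c("Glucose", "Insulin", "BMI"))
print(zero_counts)

cleaned <- zeros_to_na(sample_rows, c("Glucose", "Insulin", "BMI"))
print(cleaned)
```

On the full dataset the same check would be `count_zeros(diabetes, c("Glucose", "BMI"))`.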

##Hypothesis statement

NULL HYPOTHESIS: There is no correlation between Glucose and BMI.

ALTERNATIVE HYPOTHESIS: There is a correlation between Glucose and BMI.
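Written formally, the two hypotheses form a two-sided test on the population rank correlation, which matches the alternative reported by `cor.test` ("true rho is not equal to 0"). The symbol $\rho_s$ is notation introduced here for the population Spearman correlation between Glucose and BMI. The report does not state a significance level.

```latex
H_0:\ \rho_s = 0
\qquad \text{versus} \qquad
H_1:\ \rho_s \neq 0
```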

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.3

NORMALITY CHECK

ggplot(diabetes,aes(x=Glucose))+geom_histogram(fill="lightgreen")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diabetes,aes(x=BMI))+geom_histogram(fill="yellow")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Normality tests (Shapiro-Wilk)
shapiro.test(diabetes$Glucose)

## 
##  Shapiro-Wilk normality test
## 
## data:  diabetes$Glucose
## W = 0.9701, p-value = 1.986e-11

shapiro.test(diabetes$BMI)

## 
##  Shapiro-Wilk normality test
## 
## data:  diabetes$BMI
## W = 0.94999, p-value = 1.842e-15

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.4.3

# Scatter plot with regression line
ggscatter(diabetes, x = "Glucose", y = "BMI", 
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "spearman")

# Correlation test (Spearman)
cor.test(diabetes$Glucose, diabetes$BMI, method = "spearman")

## Warning in cor.test.default(diabetes$Glucose, diabetes$BMI, method =
## "spearman"): Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  diabetes$Glucose and diabetes$BMI
## S = 58046798, p-value = 8.984e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2311412

# Box plot of Glucose by Outcome
ggboxplot(diabetes, x = "Outcome", y = "Glucose", color = "Outcome", 
          palette = "jco", ylab = "Glucose", xlab = "Outcome (0 = No diabetes, 1 = Diabetes)")

# Normality by group
by(diabetes$Glucose, diabetes$Outcome, shapiro.test)

## diabetes$Outcome: 0
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.96795, p-value = 5.447e-09
## 
## ------------------------------------------------------------ 
## diabetes$Outcome: 1
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.95882, p-value = 6.587e-07

# Glucose is not normally distributed in either group, so we use the Mann-Whitney test
wilcox.test(Glucose ~ Outcome, data = diabetes)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Glucose by Outcome
## W = 28391, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

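CONCLUSIONS:

-Glucose is not normally distributed (Shapiro-Wilk, p = 1.986e-11), and neither is BMI (p = 1.842e-15), which justifies the use of nonparametric tests.

-There is a statistically significant but weak positive correlation between Glucose and BMI (Spearman's rho = 0.23, p = 8.984e-11), so the null hypothesis is rejected.

-Glucose levels differ significantly between people with and without diabetes (Mann-Whitney, W = 28391, p < 2.2e-16).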
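The rank-sum test shows that the two groups differ but does not say in which direction or by how much. A minimal sketch of one way to summarize that is to compare group medians. The helper name `group_medians` is hypothetical, and the data frame reuses only the six rows printed by `head(diabetes)`, so its output describes those rows and not the full dataset.

```r
# Median Glucose per Outcome group, returned as a small data frame.
group_medians <- function(df) {
  aggregate(Glucose ~ Outcome, data = df, FUN = median)
}

# Glucose and Outcome copied from the six rows shown by head(diabetes).
sample_rows <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116),
  Outcome = c(1, 0, 1, 0, 1, 0)
)

medians <- group_medians(sample_rows)
print(medians)
```

On the full dataset the call would be `group_medians(diabetes)`.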

#Why Wilcoxon (Mann-Whitney U test)?

It is a nonparametric test for comparing two independent groups. It does not assume normality, so it serves as an alternative to the t-test when the data are not normally distributed.
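To make the rank-based idea concrete, the sketch below recomputes the W statistic by hand: pool both groups, rank the values, sum the ranks of the first group, and subtract n1(n1 + 1)/2. The function name `rank_sum_w` is hypothetical, and the values are the Glucose readings from the six rows printed by `head(diabetes)`, so the result does not reproduce the W reported for the full dataset.

```r
# W as reported by wilcox.test():
# (sum of the first group's ranks in the pooled sample) - n1 * (n1 + 1) / 2
rank_sum_w <- function(x, y) {
  pooled_ranks <- rank(c(x, y))  # tied values receive average ranks
  n1 <- length(x)
  sum(pooled_ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2
}

# Glucose values from the six rows shown by head(diabetes), split by Outcome.
glucose_no_diabetes <- c(85, 89, 116)    # Outcome == 0
glucose_diabetes    <- c(148, 183, 137)  # Outcome == 1

w_manual  <- rank_sum_w(glucose_no_diabetes, glucose_diabetes)
w_builtin <- unname(wilcox.test(glucose_no_diabetes, glucose_diabetes)$statistic)

print(c(manual = w_manual, builtin = w_builtin))
```

Because only ranks enter the statistic, extreme values have no more influence than their position in the ordering, which is why the test does not need a normality assumption.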

#Why Spearman?

It is a nonparametric measure of association that does not assume normality. It is based on the ranks of the data, which makes it well suited to variables that are not normally distributed.
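"Based on the ranks" can be checked directly: Spearman's rho equals the Pearson correlation of the ranked values. The sketch below illustrates this with the Glucose and BMI values from the six rows printed by `head(diabetes)`. The function name `spearman_by_ranks` is hypothetical, and the coefficient it prints describes only those six rows.

```r
# Spearman's rho computed as the Pearson correlation of the ranks.
spearman_by_ranks <- function(x, y) {
  cor(rank(x), rank(y), method = "pearson")
}

# Values copied from the six rows shown by head(diabetes).
glucose <- c(148, 85, 183, 89, 137, 116)
bmi     <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)

rho_manual  <- spearman_by_ranks(glucose, bmi)
rho_builtin <- cor(glucose, bmi, method = "spearman")

print(c(manual = rho_manual, builtin = rho_builtin))
```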


Johany Steven Quiroga

2025-05-30
