LOADING THE LIBRARIES

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2

library(gridExtra)

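IMPORTING THE DATASET

# Read the semicolon-separated file into a data frame
diabetes2 <- read.csv("diabetes2.txt", sep = ";")
View(diabetes2)   # opens the data viewer in an interactive session
head(diabetes2)   # first six rows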

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40   Male             yes     no       one hr or more  22      no
258 less than 40   Male             yes     no       one hr or more  22      no
511  60 or older Female             yes     no                 none  26      no
907 less than 40   Male             yes     no more than half an hr  19     yes
582  60 or older   Male              no     no more than half an hr  23      no
726        40-49   Male              no     no less than half an hr  26     yes
    Alcohol Sleep SoundSleep RegularMedicine     JunkFood     Stress BPLevel
20       no     8          6              no        often not at all  normal
258      no     8          6              no        often not at all  normal
511      no     7          7              no occasionally  sometimes  normal
907     yes     7          5              no occasionally not at all  normal
582     yes     7          6             yes occasionally very often    high
726      no     8          7              no occasionally  sometimes  normal
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no

INTRODUCTION

Type 2 diabetes is a global public health challenge with significant consequences for the quality of life of millions of people. Its steadily rising prevalence and its impact on morbidity and mortality make it imperative to develop effective approaches to the early detection and management of this chronic disease. In this context, advanced machine learning classification methods are a promising tool for improving the accuracy of type 2 diabetes prediction.

The choice of specific classification methods, such as Support Vector Machines (SVM), Random Forest, Gradient Boosting, and Neural Networks, rests on their ability to handle complex datasets and capture nonlinear patterns, both of which are essential in the analysis of medical data. Comparing and evaluating these methods will make it possible to identify which one performs best for the purpose of this study.

Data quality is critical to the effectiveness of machine learning models. Special attention will therefore be paid to selecting and preparing representative datasets, considering relevant variables such as clinical history, genetic factors, and lifestyle habits. In addition, cross-validation and other evaluation strategies will be applied to ensure that the resulting models are reliable and generalize well.
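As a rough illustration of the cross-validation mentioned above, the sketch below assigns observations to k folds in base R. The sample size `n`, the number of folds `k`, and the seed are placeholder values chosen for the example, not settings taken from this study.

```r
# Minimal k-fold assignment sketch (n, k and the seed are illustrative)
set.seed(123)
n <- 20   # hypothetical number of observations
k <- 5    # hypothetical number of folds

# Give each row a fold label 1..k, in random order, with balanced fold sizes
folds <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  test_idx  <- which(folds == i)   # rows held out in this round
  train_idx <- which(folds != i)   # rows used to fit the model
  cat("Fold", i, ": train =", length(train_idx), " test =", length(test_idx), "\n")
}
```

Each observation is held out exactly once, so every model is evaluated on data it did not see during fitting.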

DESCRIPTION OF THE VARIABLES

1. Age: Type: Numeric. Description: The individual's age in years. The risk of type 2 diabetes tends to increase with age, which makes this factor essential for assessing risk.
2. Body Mass Index (BMI): Type: Numeric. Description: BMI, calculated from the individual's weight and height, is an indicator of obesity. Obesity is strongly associated with type 2 diabetes.
3. Fasting Glucose Level: Type: Numeric. Description: The blood glucose concentration after a period of fasting. Elevated values may indicate insulin resistance and predispose the individual to type 2 diabetes.
4. Blood Pressure: Type: Numeric. Description: Systolic and diastolic blood pressure, measured in mmHg. Hypertension is an independent risk factor for type 2 diabetes.
5. Family History of Diabetes: Type: Categorical (Yes/No). Description: Indicates whether there is a family history of type 2 diabetes. Genetics plays a crucial role in predisposition to the disease.
6. Physical Activity: Type: Categorical (Low/Moderate/High). Description: The individual's level of regular physical activity. Lack of physical activity is linked to a higher risk of type 2 diabetes.
7. Alcohol Consumption: Type: Categorical (Low/Moderate/High). Description: The amount of alcohol the individual consumes regularly. Excessive consumption can increase the risk of type 2 diabetes.
8. Smoking: Type: Categorical (Yes/No). Description: Indicates whether the individual is a smoker. Smoking has been associated with a higher risk of type 2 diabetes.
9. Cholesterol Levels: Type: Numeric. Description: Concentration of total cholesterol, HDL (high-density lipoprotein) and LDL (low-density lipoprotein). Abnormal levels are related to type 2 diabetes.
10. History of Heart Disease: Type: Categorical (Yes/No). Description: Indicates whether the individual has a history of heart disease. Heart disease and type 2 diabetes share common risk factors.
11. Insulin Levels: Type: Numeric. Description: Concentration of insulin in the blood. Insulin resistance is a key feature in the development of type 2 diabetes.

Conversion of the Age variable

# Convert Age to a factor; the labels follow the default (sorted) level order
diabetes2$Age <- factor(diabetes2$Age)
levels(diabetes2$Age) <- c("40-49", "50-59", "60 or older", "less than 40")
head(diabetes2)

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40   Male             yes     no       one hr or more  22      no
258 less than 40   Male             yes     no       one hr or more  22      no
511  60 or older Female             yes     no                 none  26      no
907 less than 40   Male             yes     no more than half an hr  19     yes
582  60 or older   Male              no     no more than half an hr  23      no
726        40-49   Male              no     no less than half an hr  26     yes
    Alcohol Sleep SoundSleep RegularMedicine     JunkFood     Stress BPLevel
20       no     8          6              no        often not at all  normal
258      no     8          6              no        often not at all  normal
511      no     7          7              no occasionally  sometimes  normal
907     yes     7          5              no occasionally not at all  normal
582     yes     7          6             yes occasionally very often    high
726      no     8          7              no occasionally  sometimes  normal
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no

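Conversion of the Gender variable

# levels<- renames the alphabetically sorted levels ("Female", "Male") by position,
# so this assignment swaps the two labels rather than reordering them.
diabetes2$Gender <- factor(diabetes2$Gender)
levels(diabetes2$Gender) <- c("Male", "Female")
head(diabetes2)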

##              Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
## 20  less than 40 Female             yes     no       one hr or more  22      no
## 258 less than 40 Female             yes     no       one hr or more  22      no
## 511  60 or older   Male             yes     no                 none  26      no
## 907 less than 40 Female             yes     no more than half an hr  19     yes
## 582  60 or older Female              no     no more than half an hr  23      no
## 726        40-49 Female              no     no less than half an hr  26     yes
##     Alcohol Sleep SoundSleep RegularMedicine     JunkFood     Stress BPLevel
## 20       no     8          6              no        often not at all  normal
## 258      no     8          6              no        often not at all  normal
## 511      no     7          7              no occasionally  sometimes  normal
## 907     yes     7          5              no occasionally not at all  normal
## 582     yes     7          6             yes occasionally very often    high
## 726      no     8          7              no occasionally  sometimes  normal
##     Pregancies Pdiabetes UriationFreq Diabetic
## 20           0         0     not much       no
## 258          0         0     not much       no
## 511          3         0     not much       no
## 907          0         0     not much       no
## 582          0         0  quite often      yes
## 726          0         0     not much       no
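The output above shows `Male` and `Female` exchanged after the assignment to `levels()`, because that assignment renames the existing levels by position. If the intent is only to change the order of the levels while keeping each value's label, one option is to pass `levels =` to `factor()`. The vector `gender` below is invented for illustration.

```r
gender <- c("Male", "Male", "Female", "Male")   # invented values

# Renaming: the sorted levels ("Female", "Male") receive new names by position
renamed <- factor(gender)
levels(renamed) <- c("Male", "Female")
as.character(renamed)

# Reordering: same labels, only the level order changes
reordered <- factor(gender, levels = c("Male", "Female"))
as.character(reordered)
levels(reordered)
```

The same distinction applies to every yes/no column converted later in this report.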

Conversion of the Family_Diabetes variable

# levels<- renames the sorted levels ("no", "yes") by position, so the labels are swapped.
diabetes2$Family_Diabetes <- factor(diabetes2$Family_Diabetes)
levels(diabetes2$Family_Diabetes) <- c("yes", "no")
head(diabetes2)

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40 Female              no     no       one hr or more  22      no
258 less than 40 Female              no     no       one hr or more  22      no
511  60 or older   Male              no     no                 none  26      no
907 less than 40 Female              no     no more than half an hr  19     yes
582  60 or older Female             yes     no more than half an hr  23      no
726        40-49 Female             yes     no less than half an hr  26     yes
    Alcohol Sleep SoundSleep RegularMedicine     JunkFood     Stress BPLevel
20       no     8          6              no        often not at all  normal
258      no     8          6              no        often not at all  normal
511      no     7          7              no occasionally  sometimes  normal
907     yes     7          5              no occasionally not at all  normal
582     yes     7          6             yes occasionally very often    high
726      no     8          7              no occasionally  sometimes  normal
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no

Conversion of the highBP variable

# levels<- renames the sorted levels ("no", "yes") by position, so the labels are swapped.
diabetes2$highBP <- factor(diabetes2$highBP)
levels(diabetes2$highBP) <- c("yes", "no")
head(diabetes2)

##              Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
## 20  less than 40 Female              no    yes       one hr or more  22      no
## 258 less than 40 Female              no    yes       one hr or more  22      no
## 511  60 or older   Male              no    yes                 none  26      no
## 907 less than 40 Female              no    yes more than half an hr  19     yes
## 582  60 or older Female             yes    yes more than half an hr  23      no
## 726        40-49 Female             yes    yes less than half an hr  26     yes
##     Alcohol Sleep SoundSleep RegularMedicine     JunkFood     Stress BPLevel
## 20       no     8          6              no        often not at all  normal
## 258      no     8          6              no        often not at all  normal
## 511      no     7          7              no occasionally  sometimes  normal
## 907     yes     7          5              no occasionally not at all  normal
## 582     yes     7          6             yes occasionally very often    high
## 726      no     8          7              no occasionally  sometimes  normal
##     Pregancies Pdiabetes UriationFreq Diabetic
## 20           0         0     not much       no
## 258          0         0     not much       no
## 511          3         0     not much       no
## 907          0         0     not much       no
## 582          0         0  quite often      yes
## 726          0         0     not much       no

Conversion of the PhysicallyActive variable

# The labels follow the default level order; only "more than half an hr" is re-capitalised.
diabetes2$PhysicallyActive <- factor(diabetes2$PhysicallyActive)
levels(diabetes2$PhysicallyActive) <- c("less than half an hr", "More than half an hr", "none", "one hr or more")
head(diabetes2)

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40 Female              no    yes       one hr or more  22      no
258 less than 40 Female              no    yes       one hr or more  22      no
511  60 or older   Male              no    yes                 none  26      no
907 less than 40 Female              no    yes More than half an hr  19     yes
582  60 or older Female             yes    yes More than half an hr  23      no
726        40-49 Female             yes    yes less than half an hr  26     yes
    Alcohol Sleep SoundSleep RegularMedicine     JunkFood     Stress BPLevel
20       no     8          6              no        often not at all  normal
258      no     8          6              no        often not at all  normal
511      no     7          7              no occasionally  sometimes  normal
907     yes     7          5              no occasionally not at all  normal
582     yes     7          6             yes occasionally very often    high
726      no     8          7              no occasionally  sometimes  normal
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no

Conversion of the Smoking variable

# levels<- renames the sorted levels ("no", "yes") by position, so the labels are swapped.
diabetes2$Smoking <- factor(diabetes2$Smoking)
levels(diabetes2$Smoking) <- c("yes", "no")
head(diabetes2)

##              Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
## 20  less than 40 Female              no    yes       one hr or more  22     yes
## 258 less than 40 Female              no    yes       one hr or more  22     yes
## 511  60 or older   Male              no    yes                 none  26     yes
## 907 less than 40 Female              no    yes More than half an hr  19      no
## 582  60 or older Female             yes    yes More than half an hr  23     yes
## 726        40-49 Female             yes    yes less than half an hr  26      no
##     Alcohol Sleep SoundSleep RegularMedicine     JunkFood     Stress BPLevel
## 20       no     8          6              no        often not at all  normal
## 258      no     8          6              no        often not at all  normal
## 511      no     7          7              no occasionally  sometimes  normal
## 907     yes     7          5              no occasionally not at all  normal
## 582     yes     7          6             yes occasionally very often    high
## 726      no     8          7              no occasionally  sometimes  normal
##     Pregancies Pdiabetes UriationFreq Diabetic
## 20           0         0     not much       no
## 258          0         0     not much       no
## 511          3         0     not much       no
## 907          0         0     not much       no
## 582          0         0  quite often      yes
## 726          0         0     not much       no

Conversion of the Alcohol variable

# levels<- renames the sorted levels ("no", "yes") by position, so the labels are swapped.
diabetes2$Alcohol <- factor(diabetes2$Alcohol)
levels(diabetes2$Alcohol) <- c("yes", "no")
head(diabetes2)

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40 Female              no    yes       one hr or more  22     yes
258 less than 40 Female              no    yes       one hr or more  22     yes
511  60 or older   Male              no    yes                 none  26     yes
907 less than 40 Female              no    yes More than half an hr  19      no
582  60 or older Female             yes    yes More than half an hr  23     yes
726        40-49 Female             yes    yes less than half an hr  26      no
    Alcohol Sleep SoundSleep RegularMedicine     JunkFood     Stress BPLevel
20      yes     8          6              no        often not at all  normal
258     yes     8          6              no        often not at all  normal
511     yes     7          7              no occasionally  sometimes  normal
907      no     7          5              no occasionally not at all  normal
582      no     7          6             yes occasionally very often    high
726     yes     8          7              no occasionally  sometimes  normal
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no

Conversion of the RegularMedicine variable

# levels<- renames the sorted levels ("no", "yes") by position, so the labels are swapped.
diabetes2$RegularMedicine <- factor(diabetes2$RegularMedicine)
levels(diabetes2$RegularMedicine) <- c("yes", "no")
head(diabetes2)

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40 Female              no    yes       one hr or more  22     yes
258 less than 40 Female              no    yes       one hr or more  22     yes
511  60 or older   Male              no    yes                 none  26     yes
907 less than 40 Female              no    yes More than half an hr  19      no
582  60 or older Female             yes    yes More than half an hr  23     yes
726        40-49 Female             yes    yes less than half an hr  26      no
    Alcohol Sleep SoundSleep RegularMedicine     JunkFood     Stress BPLevel
20      yes     8          6             yes        often not at all  normal
258     yes     8          6             yes        often not at all  normal
511     yes     7          7             yes occasionally  sometimes  normal
907      no     7          5             yes occasionally not at all  normal
582      no     7          6              no occasionally very often    high
726     yes     8          7             yes occasionally  sometimes  normal
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no

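Conversion of the JunkFood variable

# levels<- renames the sorted levels ("always", "occasionally", "often", "very often")
# by position, so every label except "often" ends up on a different category.
# The second label also contains a tab and a line break, printed as "\t\nalways".
diabetes2$JunkFood <- factor(diabetes2$JunkFood)
levels(diabetes2$JunkFood) <- c("very often", "\t\nalways", "often", "occasionally")
head(diabetes2)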

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40 Female              no    yes       one hr or more  22     yes
258 less than 40 Female              no    yes       one hr or more  22     yes
511  60 or older   Male              no    yes                 none  26     yes
907 less than 40 Female              no    yes More than half an hr  19      no
582  60 or older Female             yes    yes More than half an hr  23     yes
726        40-49 Female             yes    yes less than half an hr  26      no
    Alcohol Sleep SoundSleep RegularMedicine   JunkFood     Stress BPLevel
20      yes     8          6             yes      often not at all  normal
258     yes     8          6             yes      often not at all  normal
511     yes     7          7             yes \t\nalways  sometimes  normal
907      no     7          5             yes \t\nalways not at all  normal
582      no     7          6              no \t\nalways very often    high
726     yes     8          7             yes \t\nalways  sometimes  normal
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no
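The `\t\nalways` entries in the output come from a tab and a line break inside one of the level labels. A minimal sketch of cleaning such labels with `trimws()`, using an invented factor rather than the real column:

```r
junk <- factor(c("often", "\t\nalways", "often", "occasionally"))  # invented values
levels(junk)

# Strip leading and trailing whitespace (spaces, tabs, line breaks) from every label
levels(junk) <- trimws(levels(junk))
levels(junk)
```

Here the renaming by position is what is wanted, because each label is only replaced by a cleaned version of itself.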

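Conversion of the Stress variable

# levels<- renames the sorted levels ("always", "not at all", "sometimes", "very often")
# by position, so each label shifts to a different category.
diabetes2$Stress <- factor(diabetes2$Stress)
levels(diabetes2$Stress) <- c("not at all", "sometimes", "very often", "always")
head(diabetes2)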

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40 Female              no    yes       one hr or more  22     yes
258 less than 40 Female              no    yes       one hr or more  22     yes
511  60 or older   Male              no    yes                 none  26     yes
907 less than 40 Female              no    yes More than half an hr  19      no
582  60 or older Female             yes    yes More than half an hr  23     yes
726        40-49 Female             yes    yes less than half an hr  26      no
    Alcohol Sleep SoundSleep RegularMedicine   JunkFood     Stress BPLevel
20      yes     8          6             yes      often  sometimes  normal
258     yes     8          6             yes      often  sometimes  normal
511     yes     7          7             yes \t\nalways very often  normal
907      no     7          5             yes \t\nalways  sometimes  normal
582      no     7          6              no \t\nalways     always    high
726     yes     8          7             yes \t\nalways very often  normal
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no
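`Stress` is an ordinal variable, so its categories have a natural order. As a sketch under that assumption, the order can be declared with an ordered factor, which keeps each label attached to its original value. The `stress` vector is invented for the example.

```r
stress <- c("not at all", "sometimes", "very often", "always", "sometimes")  # invented

stress_ord <- factor(
  stress,
  levels  = c("not at all", "sometimes", "very often", "always"),
  ordered = TRUE
)

table(stress_ord)                  # counts appear in the declared order
sum(stress_ord >= "very often")    # comparisons respect the declared order
```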

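Conversion of the BPLevel variable

# levels<- renames the sorted levels ("high", "low", "normal") by position,
# so "normal" is relabelled "low" and "high" is relabelled "normal".
diabetes2$BPLevel <- factor(diabetes2$BPLevel)
levels(diabetes2$BPLevel) <- c("normal", "high", "low")
head(diabetes2)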

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40 Female              no    yes       one hr or more  22     yes
258 less than 40 Female              no    yes       one hr or more  22     yes
511  60 or older   Male              no    yes                 none  26     yes
907 less than 40 Female              no    yes More than half an hr  19      no
582  60 or older Female             yes    yes More than half an hr  23     yes
726        40-49 Female             yes    yes less than half an hr  26      no
    Alcohol Sleep SoundSleep RegularMedicine   JunkFood     Stress BPLevel
20      yes     8          6             yes      often  sometimes     low
258     yes     8          6             yes      often  sometimes     low
511     yes     7          7             yes \t\nalways very often     low
907      no     7          5             yes \t\nalways  sometimes     low
582      no     7          6              no \t\nalways     always  normal
726     yes     8          7             yes \t\nalways very often     low
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no

Conversion of the UriationFreq variable

# The labels match the default level order, so the values are unchanged.
diabetes2$UriationFreq <- factor(diabetes2$UriationFreq)
levels(diabetes2$UriationFreq) <- c("not much", "quite often")
head(diabetes2)

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40 Female              no    yes       one hr or more  22     yes
258 less than 40 Female              no    yes       one hr or more  22     yes
511  60 or older   Male              no    yes                 none  26     yes
907 less than 40 Female              no    yes More than half an hr  19      no
582  60 or older Female             yes    yes More than half an hr  23     yes
726        40-49 Female             yes    yes less than half an hr  26      no
    Alcohol Sleep SoundSleep RegularMedicine   JunkFood     Stress BPLevel
20      yes     8          6             yes      often  sometimes     low
258     yes     8          6             yes      often  sometimes     low
511     yes     7          7             yes \t\nalways very often     low
907      no     7          5             yes \t\nalways  sometimes     low
582      no     7          6              no \t\nalways     always  normal
726     yes     8          7             yes \t\nalways very often     low
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much       no
258          0         0     not much       no
511          3         0     not much       no
907          0         0     not much       no
582          0         0  quite often      yes
726          0         0     not much       no

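Conversion of the Diabetic variable

# levels<- renames the sorted levels ("no", "yes") by position, so the labels are swapped.
diabetes2$Diabetic <- factor(diabetes2$Diabetic)
levels(diabetes2$Diabetic) <- c("yes", "no")
head(diabetes2)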

             Age Gender Family_Diabetes highBP     PhysicallyActive BMI Smoking
20  less than 40 Female              no    yes       one hr or more  22     yes
258 less than 40 Female              no    yes       one hr or more  22     yes
511  60 or older   Male              no    yes                 none  26     yes
907 less than 40 Female              no    yes More than half an hr  19      no
582  60 or older Female             yes    yes More than half an hr  23     yes
726        40-49 Female             yes    yes less than half an hr  26      no
    Alcohol Sleep SoundSleep RegularMedicine   JunkFood     Stress BPLevel
20      yes     8          6             yes      often  sometimes     low
258     yes     8          6             yes      often  sometimes     low
511     yes     7          7             yes \t\nalways very often     low
907      no     7          5             yes \t\nalways  sometimes     low
582      no     7          6              no \t\nalways     always  normal
726     yes     8          7             yes \t\nalways very often     low
    Pregancies Pdiabetes UriationFreq Diabetic
20           0         0     not much      yes
258          0         0     not much      yes
511          3         0     not much      yes
907          0         0     not much      yes
582          0         0  quite often       no
726          0         0     not much      yes

CHECKING FOR MISSING VALUES

# Logical matrix: TRUE wherever a value is missing
BASE_na <- is.na(diabetes2)
head(BASE_na)

      Age Gender Family_Diabetes highBP PhysicallyActive   BMI Smoking Alcohol
20  FALSE  FALSE           FALSE  FALSE            FALSE FALSE   FALSE   FALSE
258 FALSE  FALSE           FALSE  FALSE            FALSE FALSE   FALSE   FALSE
511 FALSE  FALSE           FALSE  FALSE            FALSE FALSE   FALSE   FALSE
907 FALSE  FALSE           FALSE  FALSE            FALSE FALSE   FALSE   FALSE
582 FALSE  FALSE           FALSE  FALSE            FALSE FALSE   FALSE   FALSE
726 FALSE  FALSE           FALSE  FALSE            FALSE FALSE   FALSE   FALSE
    Sleep SoundSleep RegularMedicine JunkFood Stress BPLevel Pregancies
20  FALSE      FALSE           FALSE    FALSE  FALSE   FALSE      FALSE
258 FALSE      FALSE           FALSE    FALSE  FALSE   FALSE      FALSE
511 FALSE      FALSE           FALSE    FALSE  FALSE   FALSE      FALSE
907 FALSE      FALSE           FALSE    FALSE  FALSE   FALSE      FALSE
582 FALSE      FALSE           FALSE    FALSE  FALSE   FALSE      FALSE
726 FALSE      FALSE           FALSE    FALSE  FALSE   FALSE      FALSE
    Pdiabetes UriationFreq Diabetic
20      FALSE        FALSE    FALSE
258     FALSE        FALSE    FALSE
511     FALSE        FALSE    FALSE
907     FALSE        FALSE    FALSE
582     FALSE        FALSE    FALSE
726     FALSE        FALSE    FALSE
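Because `head()` displays only the first six rows of the logical matrix, it cannot confirm that the whole table is complete. A minimal sketch of a full check follows. The data frame `toy` is invented and merely stands in for `diabetes2`.

```r
# Toy data frame standing in for diabetes2 (values are invented)
toy <- data.frame(
  BMI      = c(22, NA, 26, 19),
  Sleep    = c(8, 7, NA, NA),
  Diabetic = c("no", "no", "yes", "no")
)

anyNA(toy)                                  # TRUE if at least one value is missing
na_per_column <- colSums(is.na(toy))        # number of missing values per column
na_per_column
complete_rows <- sum(complete.cases(toy))   # rows without any missing value
complete_rows
```

Applied to the real data, `colSums(is.na(diabetes2))` would report one count per variable.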

VARIABLE CLASSES

# Store the class of every column and print one line per variable
classes <- sapply(diabetes2, class)
for (variable in names(classes)) {
  cat("Variable:", variable, " - Clase:", classes[variable], "\n")
}

Variable: Age  - Clase: factor 
Variable: Gender  - Clase: factor 
Variable: Family_Diabetes  - Clase: factor 
Variable: highBP  - Clase: factor 
Variable: PhysicallyActive  - Clase: factor 
Variable: BMI  - Clase: integer 
Variable: Smoking  - Clase: factor 
Variable: Alcohol  - Clase: factor 
Variable: Sleep  - Clase: integer 
Variable: SoundSleep  - Clase: integer 
Variable: RegularMedicine  - Clase: factor 
Variable: JunkFood  - Clase: factor 
Variable: Stress  - Clase: factor 
Variable: BPLevel  - Clase: factor 
Variable: Pregancies  - Clase: integer 
Variable: Pdiabetes  - Clase: character 
Variable: UriationFreq  - Clase: factor 
Variable: Diabetic  - Clase: factor
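The listing shows that `Pdiabetes` is still stored as character. As a hedged sketch, the column-by-column conversions could also be written once for every character column. `toy` is an invented data frame used only to keep the example self-contained.

```r
# Invented stand-in for diabetes2
toy <- data.frame(
  Gender    = c("Male", "Female", "Male"),
  Pdiabetes = c("0", "yes", "0"),   # invented values
  BMI       = c(22L, 26L, 19L),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each of them to a factor
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], factor)

sapply(toy, class)
```

Numeric columns such as `BMI` are left untouched because only the columns flagged in `is_chr` are reassigned.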

DESCRIPTIVE ANALYSIS

MINIMUM

The minimum of a dataset is its smallest value, that is, the lowest observation. It is one of the basic descriptive summaries and helps to understand the range of the data.

1ST QU

The first quartile, also known as the lower quartile or 25th percentile, is the value below which 25% of the observations fall once the data are sorted in ascending order. FORMULA (position of the first quartile in the sorted data) \[\text{position of } Q_1 = \frac{n+1}{4}\]

MEDIAN

The median is a measure of central tendency that represents the central value of a dataset. To calculate it, first sort the data from smallest to largest (or vice versa) and then find the value located exactly at the center of the distribution. FORMULA

ODD NUMBER OF OBSERVATIONS

\[M = x_{\left(\frac{n+1}{2}\right)}\]

EVEN NUMBER OF OBSERVATIONS

\[M = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}\]

Here \(x_{(k)}\) denotes the value at position \(k\) of the sorted data, so for an even number of observations the median is the average of the two central values.

MEAN

The mean represents the typical or average value of a dataset. It is calculated by adding all the values in the set and dividing that sum by the total number of elements. FORMULA \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

3RD QU

The third quartile is the value below which 75% of the observations fall once the data are sorted in ascending order. FORMULA (position of the third quartile in the sorted data) \[\text{position of } Q_3 = \frac{3(n+1)}{4}\] To calculate it, first sort the data from smallest to largest and divide them into four equal parts; the third quartile is the value at the boundary between the third and the fourth part. It is usually denoted \(Q_3\).

MAXIMUM

The maximum of a dataset is its largest value, that is, the highest observation. It is useful for understanding the range of the data and describes the upper end of the distribution. Like the minimum, the maximum is sensitive to outliers, since a single extremely high value determines it.
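These definitions can be checked on a small vector. The sketch below uses invented values. `type = 6` is passed to `quantile()` because that rule places the quartiles at positions \((n+1)/4\) and \(3(n+1)/4\), whereas `summary()` relies on R's default (`type = 7`), which interpolates differently and can return slightly different quartiles.

```r
x  <- c(2, 4, 4, 5, 7, 9, 10)   # invented sample, n = 7
n  <- length(x)
xs <- sort(x)

minimum <- xs[1]
maximum <- xs[n]
q1      <- xs[(n + 1) / 4]       # position 2
med     <- xs[(n + 1) / 2]       # n is odd, position 4
q3      <- xs[3 * (n + 1) / 4]   # position 6
average <- sum(x) / n

c(min = minimum, q1 = q1, median = med, mean = average, q3 = q3, max = maximum)

# Same quartiles from quantile() with the (n + 1) * p positioning rule
quantile(x, probs = c(0.25, 0.75), type = 6)
```

The sample size was chosen so that every position is a whole number; otherwise the quartile is interpolated between two neighbouring values.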

SUMMARY OF THE INTEGER VARIABLES

summary(diabetes2[,classes=="integer"])

      BMI            Sleep          SoundSleep       Pregancies    
 Min.   :15.00   Min.   : 4.000   Min.   : 0.000   Min.   :0.0000  
 1st Qu.:21.00   1st Qu.: 6.000   1st Qu.: 4.000   1st Qu.:0.0000  
 Median :24.00   Median : 7.000   Median : 6.000   Median :0.0000  
 Mean   :25.33   Mean   : 6.976   Mean   : 5.609   Mean   :0.3819  
 3rd Qu.:28.00   3rd Qu.: 8.000   3rd Qu.: 7.000   3rd Qu.:0.0000  
 Max.   :42.00   Max.   :11.000   Max.   :11.000   Max.   :4.0000

apply(diabetes2[,classes=="integer"],2,sd)

       BMI      Sleep SoundSleep Pregancies 
 5.1399922  1.3042497  1.8435140  0.9090479

INTERPRETATION OF THE RESULTS

DESCRIPTION OF THE BMI VARIABLE

BMI (body mass index) ranges from a minimum of 15.00 to a maximum of 42.00, with a mean of 25.33 and a median of 24.00.

DESCRIPTION OF THE SLEEP VARIABLE

Respondents sleep a minimum of 4 hours and a maximum of 11 hours. The mean is almost 7 hours (6.976) and the median is 7 hours, with a first quartile of 6 hours and a third quartile of 8 hours.

DESCRIPTION OF THE SoundSleep VARIABLE

Sound sleep ranges from a minimum of 0 hours to a maximum of 11 hours. The mean is about 5.6 hours and the median is 6 hours, with a first quartile of 4 hours and a third quartile of 7 hours.

DESCRIPTION OF THE Pregancies VARIABLE

The number of pregnancies per respondent ranges from 0 to 4, with a mean of 0.38. The median and the first and third quartiles are all 0, so at least three quarters of the respondents report no pregnancies.

BAR CHARTS (GENDER, AGE, PHYSICAL ACTIVITY, JUNK FOOD) VARIABLE GENDER

ggplot(diabetes2, aes(x = Gender)) +
  geom_bar(fill = "skyblue") +
  labs(x = "Gender", title = "VARIABLE GENERO") +
  theme_minimal()

VARIABLE AGE

ggplot(diabetes2, aes(x = Age)) +
  geom_bar(fill = "skyblue") +
  labs(x = "Age", title = "VARIABLE EDAD") +
  theme_minimal()

SCATTER PLOT (RELATIONSHIP BETWEEN BMI AND HOURS OF SLEEP)

qplot(BMI,Sleep, data = diabetes2, colour = Diabetic)

Warning: `qplot()` was deprecated in ggplot2 3.4.0.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
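Since `qplot()` is deprecated, the same plot can be written with `ggplot()` and `geom_point()`. The sketch below uses a toy stand-in for diabetes2 (hypothetical values) and keeps the same mapping of BMI, Sleep and Diabetic.

```r
library(ggplot2)

# Toy stand-in for diabetes2 (hypothetical values)
toy <- data.frame(
  BMI      = c(22, 26, 19, 23, 30, 28),
  Sleep    = c(8, 7, 7, 6, 5, 9),
  Diabetic = factor(c("no", "no", "no", "yes", "yes", "no"))
)

# Same mapping as qplot(BMI, Sleep, data = ..., colour = Diabetic)
p_scatter <- ggplot(toy, aes(x = BMI, y = Sleep, colour = Diabetic)) +
  geom_point()
p_scatter
```

When both axes are categorical, as in several of the plots that follow, `geom_jitter()` or `geom_count()` may be preferable to `geom_point()` because many observations share the same coordinates.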

TO IDENTIFY WHETHER DIABETES MAY BE PRESENT UNDER THE INFLUENCE OF THESE TWO FACTORS

INTERPRETATION

The plot shows no clear tendency toward diabetes: diabetic cases are present, but they are few and do not separate from the rest along either variable, hours of sleep or BMI. Diabetes may still appear in exceptional individuals.

IDENTIFY WHICH OTHER FACTORS MAY INFLUENCE THE PRESENCE OF DIABETES

##SCATTER PLOT (PhysicallyActive AND Smoking)

qplot(PhysicallyActive, Smoking, data = diabetes2, colour = Diabetic)

INTERPRETATION The plot shows no clear tendency toward diabetes across the combinations of exercise and smoking; diabetic cases are present but scarce. Exercise and smoking are nonetheless considered relevant to type 2 diabetes, since exercise helps improve insulin sensitivity and regulate glucose levels.
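Because both variables are categorical, many points overlap in the plot and the share of diabetic cases is hard to read. A table of proportions per combination can complement it. The sketch below uses a toy data frame (hypothetical values) and a helper column `is_diab` that is not part of diabetes2.

```r
# Toy stand-in for diabetes2 (hypothetical values)
toy <- data.frame(
  PhysicallyActive = c("none", "none", "one hr or more",
                       "one hr or more", "none", "one hr or more"),
  Smoking  = c("no", "yes", "no", "no", "no", "yes"),
  Diabetic = c("yes", "yes", "no", "no", "no", "no")
)

# Helper column (hypothetical name): TRUE when the person is diabetic
toy$is_diab <- toy$Diabetic == "yes"

# Proportion of diabetic cases in each combination of the two factors
res <- aggregate(is_diab ~ PhysicallyActive + Smoking, data = toy, FUN = mean)
res
```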

##SCATTER PLOT (RELATIONSHIP BETWEEN Age AND Alcohol)

qplot(Age,Alcohol, data = diabetes2, colour = Diabetic)

INTERPRETATION No tendency toward diabetes is observed: in the age groups surveyed, alcohol consumption does not appear to make people more prone to diabetes.

##SCATTER PLOT (RELATIONSHIP BETWEEN STRESS AND JUNK FOOD)

qplot(JunkFood,Stress, data = diabetes2, colour = Diabetic)

INTERPRETATION No clear tendency toward diabetes is observed; there is only a slight tendency toward diabetes across the stress and junk food variables.

##SCATTER PLOT (STRESS AND REGULAR MEDICATION INTAKE)

qplot(Stress,RegularMedicine, data = diabetes2, colour = Diabetic)

INTERPRETATION No clear tendency toward diabetes is observed; there is only a slight tendency across stress and medication intake, although a reduction in medication intake may be associated with diabetes.

##SCATTER PLOT (ALCOHOL AND BLOOD PRESSURE LEVEL)

qplot(Alcohol,BPLevel, data = diabetes2, colour = Diabetic)

INTERPRETATION No clear tendency toward diabetes is observed; there is only a slight tendency across alcohol consumption and blood pressure level.

Diabetic <- diabetes2$Diabetic
BMI <- diabetes2$BMI
P1 <- ggplot(diabetes2, aes(x=Diabetic,
                            y=BMI, color=Diabetic))+
  geom_boxplot()
P1

INTERPRETATION The x-axis shows the levels of the variable Diabetic and the y-axis the values of the variable BMI. Boxplots display the distribution of BMI for each level. The box represents the interquartile range (IQR) of BMI in each group, with a line inside the box marking the median. The whiskers extend from the box to show the range of the data, and points beyond the whiskers can be considered outliers. No major variability is seen in the data.

Diabetic <- diabetes2$Diabetic
Sleep <- diabetes2$Sleep
P2 <- ggplot(diabetes2, aes(x=Diabetic,
                            y=Sleep, color=Diabetic))+
  geom_boxplot()
P2

INTERPRETATION In the boxplots, the median and the quartiles (Q1 and Q3) are equal or very close between the two groups. The boxes and whiskers have comparable lengths, which indicates a similar dispersion, and the medians sit at similar positions within the boxes, so the central tendency of the two groups is comparable. Visually, there is no evidence of a difference in Sleep between the groups.
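A boxplot comparison is visual only. One way to back it up is a two-sample test; the sketch below applies a Wilcoxon rank-sum test to simulated data (not diabetes2) in which both groups are drawn from the same distribution. The choice of test is an assumption of this sketch, not part of the original analysis.

```r
set.seed(1)  # arbitrary seed, for reproducibility of the simulation

# Simulated hours of sleep for two groups with the same distribution
toy <- data.frame(
  Sleep    = round(rnorm(100, mean = 7, sd = 1.3)),
  Diabetic = factor(rep(c("no", "yes"), times = c(60, 40)))
)

# Wilcoxon rank-sum test; exact = FALSE because rounded values produce ties
res <- wilcox.test(Sleep ~ Diabetic, data = toy, exact = FALSE)
res$p.value
```

A large p-value would be consistent with the visual impression of similar distributions, while a small one would suggest a shift between the groups.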

Diabetic <- diabetes2$Diabetic
SoundSleep <- diabetes2$SoundSleep
P3 <- ggplot(diabetes2, aes(x=Diabetic,
                            y=SoundSleep, color=Diabetic))+
  geom_boxplot()
P3

Diabetic <- diabetes2$Diabetic
Pregancies <- diabetes2$Pregancies
P4 <- ggplot(diabetes2, aes(x=Diabetic,
                            y=Pregancies, color=Diabetic))+
  geom_boxplot()
P4

INTERPRETATION The medians and the quartiles (Q1 and Q3) differ between the two groups. The boxes and whiskers have markedly different lengths, which indicates a different dispersion. The medians sit at different positions within the boxes, suggesting differences in central tendency. Points beyond the whiskers (outliers) may be more prominent in one of the groups, indicating differences in the tails of the distribution. The overall shape of the distribution, such as its symmetry or skewness, may differ, and the variability in terms of dispersion and interquartile range may be clearly distinct.

RELATIONSHIP BETWEEN THE BINARY VARIABLE AND THE NOMINAL VARIABLES

library(htmltools)

library(ggmosaic)

q1 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic,Gender),fill = Diabetic))+ labs(x="Gender", title = "factores diabetes")
q1

q2 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic,Smoking),fill = Diabetic))+ labs(x="Smoking", title = "factores diabetes")
q2

q3 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic,Alcohol),fill = Diabetic))+ labs(x="Alcohol", title = "factores diabetes")
q3

q4 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic,Family_Diabetes),fill = Diabetic))+ labs(x="Family_Diabetes", title = "factores diabetes")
q4

q5 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic,highBP),fill = Diabetic))+ labs(x="highBP", title = "factores diabetes")
q5

q6 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic,PhysicallyActive),fill = Diabetic))+ labs(x="PhysicallyActive", title = "factores diabetes")
q6

q7 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic,RegularMedicine),fill = Diabetic))+ labs(x="RegularMedicine", title = "factores diabetes")
q7

q8 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic, JunkFood),fill = Diabetic))+ labs(x="junkfood", title = "factores diabetes")
q8

q9 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic, Stress),fill = Diabetic))+ labs(x="stress", title = "factores diabetes")
q9

q10 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic, BPLevel),fill = Diabetic))+ labs(x="BPlevel", title = "factores diabetes")
q10

q11 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic, UriationFreq),fill = Diabetic))+ labs(x="UriationFreq", title = "factores diabetes")
q11

q12 <- ggplot(data = diabetes2)+geom_mosaic(aes(x = product(Diabetic, Diabetic),fill = Diabetic))+ labs(x="diabetic", title = "factores diabetes")
q12

grid.arrange(q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,nrow = 4, ncol =4)

INTERPRETATION Across the mosaic plots of the nominal variables, no factor shows a clear visual association with diabetes, so none of them appears to influence contracting it. Some nominal variables seem especially unrelated; stress, for example, does not appear to influence contracting diabetes.
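The visual reading of a mosaic plot can be complemented with a chi-squared test of independence on the corresponding contingency table. The sketch below uses hypothetical counts, not the counts of diabetes2.

```r
# Hypothetical 2 x 2 contingency table (counts are made up)
tab <- matrix(
  c(30, 10,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(Family_Diabetes = c("no", "yes"),
                  Diabetic        = c("no", "yes"))
)

# Chi-squared test of independence (Yates correction is applied for 2 x 2 tables)
test <- chisq.test(tab)
test$p.value
```

With the real data the table would come from `table(diabetes2$Family_Diabetes, diabetes2$Diabetic)`, and similarly for the other nominal variables.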

MULTIVARIATE PLOT

library(GGally)

ggpairs(diabetes2[,classes=="integer"]) + theme_bw()

INTERPRETATION The other three variables appear to be related to the body mass index, and the panels show the level of correlation for each pair, including hours of sleep and BMI. The relationship between sleep and pregnancies is weak or absent.

p <- ggpairs(diabetes2[,c(which(classes=="integer"),18)], aes(color = diabetes2$Diabetic)) + theme_bw()
for (i in 1:p$nrow) {
  for (j in 1:p$ncol) {
    p[i,j] <- p[i,j] +
      scale_fill_manual(values = c("#000AEB","#E7B800")) +
       scale_color_manual(values = c("#000AEB","#E7B800"))
    
  }
  
}
p

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

INTERPRETATION The plot shows, for the four numeric variables, the boxplots by diabetic status together with the pairwise relationships and correlations. As before, the other three variables appear to be related to the body mass index, including hours of sleep and BMI, while the relationship between sleep and pregnancies is weak or absent.

We will divide the full set of individuals into two parts: one to train the model, containing 70% of the individuals, and another to validate it, containing the rest. This is because, if we assess the model with the same observations used to fit it, the goodness of fit will be overestimated. Before fitting any model, it is advisable to scale the numeric variables and to recode the categorical variables as dummy variables, using the first or the last category as the reference.

A dummy variable takes the value zero or one to represent a category or a condition. It is used in regression analysis to incorporate qualitative information in quantitative terms.
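The sketch below shows how `model.matrix()` turns a factor into dummy variables on a toy data frame (hypothetical values). A factor with three categories yields two dummy columns, because the first level is dropped and acts as the reference.

```r
# Toy data with one three-level factor (hypothetical values)
toy <- data.frame(
  Diabetic = factor(c("no", "yes", "no", "yes")),
  Stress   = factor(c("not at all", "sometimes", "always", "sometimes"),
                    levels = c("not at all", "sometimes", "always"))
)

# Design matrix: an intercept plus one 0/1 column per non-reference level
X <- model.matrix(Diabetic ~ Stress, data = toy)
colnames(X)
```

The reference category can be changed with `relevel()` before the design matrix is built.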

diabetes2[,classes=="integer"]=scale(diabetes2[,classes=="integer"])
head(diabetes2)

             Age Gender Family_Diabetes highBP     PhysicallyActive        BMI
20  less than 40 Female              no    yes       one hr or more -0.6487242
258 less than 40 Female              no    yes       one hr or more -0.6487242
511  60 or older   Male              no    yes                 none  0.1294871
907 less than 40 Female              no    yes More than half an hr -1.2323826
582  60 or older Female             yes    yes More than half an hr -0.4541713
726        40-49 Female             yes    yes less than half an hr  0.1294871
    Smoking Alcohol      Sleep SoundSleep RegularMedicine   JunkFood     Stress
20      yes     yes 0.78534240  0.2119477             yes      often  sometimes
258     yes     yes 0.78534240  0.2119477             yes      often  sometimes
511     yes     yes 0.01861803  0.7543900             yes \t\nalways very often
907      no      no 0.01861803 -0.3304947             yes \t\nalways  sometimes
582     yes      no 0.01861803  0.2119477              no \t\nalways     always
726      no     yes 0.78534240  0.7543900             yes \t\nalways very often
    BPLevel Pregancies Pdiabetes UriationFreq Diabetic
20      low -0.4201082         0     not much      yes
258     low -0.4201082         0     not much      yes
511     low  2.8800479         0     not much      yes
907     low -0.4201082         0     not much      yes
582  normal -0.4201082         0  quite often       no
726     low -0.4201082         0     not much      yes

x=model.matrix(Diabetic~., diabetes2)
head(x)

    (Intercept) Age50-59 Age60 or older Ageless than 40 GenderFemale
20            1        0              0               1            1
258           1        0              0               1            1
511           1        0              1               0            0
907           1        0              0               1            1
582           1        0              1               0            1
726           1        0              0               0            1
    Family_Diabetesno highBPno PhysicallyActiveMore than half an hr
20                  1        0                                    0
258                 1        0                                    0
511                 1        0                                    0
907                 1        0                                    1
582                 0        0                                    1
726                 0        0                                    0
    PhysicallyActivenone PhysicallyActiveone hr or more        BMI Smokingno
20                     0                              1 -0.6487242         0
258                    0                              1 -0.6487242         0
511                    1                              0  0.1294871         0
907                    0                              0 -1.2323826         1
582                    0                              0 -0.4541713         0
726                    0                              0  0.1294871         1
    Alcoholno      Sleep SoundSleep RegularMedicineno JunkFood\t\nalways
20          0 0.78534240  0.2119477                 0                  0
258         0 0.78534240  0.2119477                 0                  0
511         0 0.01861803  0.7543900                 0                  1
907         1 0.01861803 -0.3304947                 0                  1
582         1 0.01861803  0.2119477                 1                  1
726         0 0.78534240  0.7543900                 0                  1
    JunkFoodoften JunkFoodoccasionally Stresssometimes Stressvery often
20              1                    0               1                0
258             1                    0               1                0
511             0                    0               0                1
907             0                    0               1                0
582             0                    0               0                0
726             0                    0               0                1
    Stressalways BPLevelhigh BPLevellow Pregancies Pdiabetesyes
20             0           0          1 -0.4201082            0
258            0           0          1 -0.4201082            0
511            0           0          1  2.8800479            0
907            0           0          1 -0.4201082            0
582            1           0          0 -0.4201082            0
726            0           0          1 -0.4201082            0
    UriationFreqquite often
20                        0
258                       0
511                       0
907                       0
582                       1
726                       0

Note that the rescaled numeric variables enter the design matrix with their values unchanged. Each nominal variable has been converted into binary variables. The number of binary variables is always the number of categories of the nominal variable minus one, because one category is dropped and used as the reference. For example, the variable Gender (with two categories) has been converted into GenderFemale, a numeric variable equal to 1 when the person is female and 0 when the person is male. The remaining variables are converted in the same way, according to the number of categories each one holds. The parameters for the retained categories compare them with the dropped category.

We will divide the full set of individuals into two parts: one to train the model, containing 70% of the individuals, and another to validate it, containing the remaining 30%. This is because, if we assess the model with the same observations used to fit it, the goodness of fit will be overestimated.

tr=round(nrow(diabetes2)*0.7)
set.seed(06071981)
muestra=sample.int(nrow(diabetes2),tr)
Train.diabet=diabetes2[muestra,]
Val.diabet=diabetes2[-muestra,]
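The split above can be checked on a toy data frame (hypothetical values, arbitrary seed): the two parts should have the expected sizes and share no rows.

```r
set.seed(123)  # arbitrary seed

# Toy data frame standing in for diabetes2
toy <- data.frame(id = 1:10, x = rnorm(10))

# 70% of the rows for training, the rest for validation
n_train <- round(nrow(toy) * 0.7)
idx     <- sample.int(nrow(toy), n_train)
train   <- toy[idx, ]
valid   <- toy[-idx, ]

# Sizes of the two parts and number of shared rows
c(train = nrow(train), valid = nrow(valid),
  shared = length(intersect(train$id, valid$id)))
```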

Prediction of diabetes (the binary response and the variables that influence it)

Classification models

Logistic regression

We will try to model the probability of a positive diabetes outcome as a function of the rest of the variables.

The logistic model is

gfit1<- glm(diabetes2$Diabetic~.,diabetes2,family=binomial)
summary(gfit1)


Call:
glm(formula = diabetes2$Diabetic ~ ., family = binomial, data = diabetes2)

Coefficients:
                                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)                           -2.77088    1.08160  -2.562 0.010412 *  
Age50-59                               0.41120    0.36291   1.133 0.257177    
Age60 or older                         1.62408    0.41602   3.904 9.47e-05 ***
Ageless than 40                       -1.68936    0.43244  -3.907 9.36e-05 ***
GenderFemale                           0.48706    0.37089   1.313 0.189105    
Family_Diabetesno                      1.00920    0.25459   3.964 7.37e-05 ***
highBPno                              -0.85083    0.38970  -2.183 0.029013 *  
PhysicallyActiveMore than half an hr   0.55514    0.36247   1.532 0.125639    
PhysicallyActivenone                   0.85532    0.38718   2.209 0.027165 *  
PhysicallyActiveone hr or more         1.68317    0.37751   4.459 8.25e-06 ***
BMI                                    0.17735    0.12090   1.467 0.142372    
Smokingno                              1.15528    0.50424   2.291 0.021955 *  
Alcoholno                              0.09689    0.36237   0.267 0.789187    
Sleep                                  0.05861    0.15561   0.377 0.706443    
SoundSleep                             0.39536    0.18442   2.144 0.032045 *  
RegularMedicineno                      2.97844    0.30926   9.631  < 2e-16 ***
JunkFood\t\nalways                     0.08808    0.84774   0.104 0.917245    
JunkFoodoften                          0.25582    0.83663   0.306 0.759775    
JunkFoodoccasionally                   0.04005    0.97557   0.041 0.967253    
Stresssometimes                       -0.03269    0.51891  -0.063 0.949776    
Stressvery often                      -0.53932    0.39565  -1.363 0.172845    
Stressalways                          -0.32951    0.45855  -0.719 0.472393    
BPLevelhigh                          -15.08888  758.75534  -0.020 0.984134    
BPLevellow                            -1.47618    0.39038  -3.781 0.000156 ***
Pregancies                             0.32210    0.14874   2.165 0.030350 *  
Pdiabetesyes                           4.01652    0.88504   4.538 5.67e-06 ***
UriationFreqquite often                0.43749    0.31153   1.404 0.160219    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1091.56  on 905  degrees of freedom
Residual deviance:  491.38  on 879  degrees of freedom
AIC: 545.38

Number of Fisher Scoring iterations: 16

INTERPRETATION The glm function returns estimates in which the most significant regression coefficients, that is, those that influence the diabetes variable, are Age60 or older, Ageless than 40, Family_Diabetesno, PhysicallyActiveone hr or more, RegularMedicineno, BPLevellow and Pdiabetesyes (all with p < 0.001). The variables that show no significant effect include Alcoholno, Sleep and the JunkFood categories, among others.
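Logistic coefficients are on the log-odds scale, so exponentiating them gives odds ratios for the outcome being modelled. For instance, a coefficient of 1.62408 corresponds to an odds ratio of about exp(1.62408) ≈ 5.07. The sketch below illustrates the computation on simulated data; the variable names are hypothetical and are not columns of diabetes2.

```r
set.seed(42)  # arbitrary seed for the simulation
n <- 200

# Simulated predictors (hypothetical names)
toy <- data.frame(
  age_60_plus = rbinom(n, 1, 0.3),
  bmi_z       = rnorm(n)
)

# Simulated binary outcome generated from a known logistic model
toy$diabetic <- rbinom(n, 1, plogis(-1 + 1.5 * toy$age_60_plus + 0.3 * toy$bmi_z))

fit <- glm(diabetic ~ age_60_plus + bmi_z, data = toy, family = binomial)

# Odds ratios: exponentiated coefficients
odds_ratios <- exp(coef(fit))
odds_ratios
```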

gfit1<- glm(Diabetic~.,Train.diabet,family=binomial)
summary(gfit1)


Call:
glm(formula = Diabetic ~ ., family = binomial, data = Train.diabet)

Coefficients:
                                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)                           -1.74808    1.35203  -1.293 0.196034    
Age50-59                               0.59254    0.44817   1.322 0.186121    
Age60 or older                         1.61215    0.51190   3.149 0.001637 ** 
Ageless than 40                       -1.74882    0.55125  -3.172 0.001511 ** 
GenderFemale                           0.75499    0.45562   1.657 0.097504 .  
Family_Diabetesno                      1.10014    0.31700   3.470 0.000520 ***
highBPno                              -1.34930    0.49038  -2.752 0.005932 ** 
PhysicallyActiveMore than half an hr   0.96937    0.43039   2.252 0.024303 *  
PhysicallyActivenone                   1.24248    0.47769   2.601 0.009295 ** 
PhysicallyActiveone hr or more         1.65601    0.47098   3.516 0.000438 ***
BMI                                    0.14903    0.15708   0.949 0.342734    
Smokingno                              1.46818    0.66524   2.207 0.027315 *  
Alcoholno                             -0.93616    0.47709  -1.962 0.049737 *  
Sleep                                  0.13708    0.19196   0.714 0.475149    
SoundSleep                             0.24377    0.21826   1.117 0.264053    
RegularMedicineno                      3.18194    0.40577   7.842 4.44e-15 ***
JunkFood\t\nalways                    -0.59620    1.07235  -0.556 0.578226    
JunkFoodoften                         -0.68203    1.07406  -0.635 0.525426    
JunkFoodoccasionally                  -0.80432    1.23186  -0.653 0.513799    
Stresssometimes                       -0.09684    0.67129  -0.144 0.885297    
Stressvery often                      -0.66670    0.51982  -1.283 0.199650    
Stressalways                          -0.76141    0.59579  -1.278 0.201255    
BPLevelhigh                          -15.55245  882.82314  -0.018 0.985945    
BPLevellow                            -1.88079    0.48717  -3.861 0.000113 ***
Pregancies                             0.41274    0.18751   2.201 0.027727 *  
Pdiabetesyes                           5.09625    1.16804   4.363 1.28e-05 ***
UriationFreqquite often                0.41012    0.39058   1.050 0.293711    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 784.15  on 633  degrees of freedom
Residual deviance: 333.17  on 607  degrees of freedom
AIC: 387.17

Number of Fisher Scoring iterations: 16

INTERPRETATION Fitting glm on the training set, the most significant regression coefficients (p < 0.001) are Family_Diabetesno, PhysicallyActiveone hr or more, RegularMedicineno, BPLevellow and Pdiabetesyes, followed by Age60 or older, Ageless than 40, highBPno and PhysicallyActivenone (p < 0.01). The terms that influence diabetes the least are the JunkFood and Stress categories, BMI, Sleep, SoundSleep, UriationFreqquite often and BPLevelhigh, whose very large standard error (882.8) makes its estimate unreliable. In addition, Train.diabet holds only 70% of the data, and the intercept is not significant in this fit (p = 0.196).

ANOVA ANALYSIS

gfit0=glm(diabetes2$Diabetic~1, data = diabetes2,family = binomial)
anova(gfit0,gfit1, test = "Chisq")

Warning in anova.glmlist(c(list(object), dotargs), dispersion = dispersion, :
models with response '"Diabetic"' removed because response differs from model 1

Analysis of Deviance Table

Model: binomial, link: logit

Response: diabetes2$Diabetic

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                   905     1091.6

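INTERPRETATION 1 The comparison above was not carried out: as the warning indicates, anova() removed gfit1 because its response (Diabetic, fitted on Train.diabet) differs from that of gfit0 (diabetes2$Diabetic, fitted on the full data), so only the null model is listed. The improvement over the constant-only model can instead be read from the training fit itself, whose deviance falls from 784.15 (null) to 333.17 (residual), a reduction of 450.98 on 26 degrees of freedom. This difference is highly significant, that is, the model with all the variables is significantly better than the model with only the constant. Looking at the Wald test for the significance of each individual parameter, the coefficients for Age and Family_Diabetes, among others, are highly significant.

INTERPRETATION 2 In the sequential analysis of deviance below, Age and RegularMedicine reach p < 2.2e-16, a very high degree of significance. Gender, Family_Diabetes, highBP, PhysicallyActive, Smoking, Stress, BPLevel, Pregancies and Pdiabetes are also significant at the 5% level, whereas BMI, Alcohol, Sleep, SoundSleep, JunkFood and UriationFreq are not.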
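For anova() to compare a null model with a full model, both must be fitted to the same data with the same response. A minimal sketch on simulated data (hypothetical variable names, arbitrary seed):

```r
set.seed(7)  # arbitrary seed for the simulation
n <- 150

# Simulated data: y depends on x1 but not on x2
toy   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1))

# Null and full models fitted to the same data frame and the same response
fit0 <- glm(y ~ 1,       data = toy, family = binomial)
fit1 <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Likelihood ratio (deviance) test between the two nested models
lrt <- anova(fit0, fit1, test = "Chisq")
lrt
```

In this document the analogous null model would use `Diabetic ~ 1` with `data = Train.diabet`, so that it matches the model fitted on the training set.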

anova(gfit1, test = "Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: Diabetic

Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                               633     784.15              
Age               3  218.654       630     565.50 < 2.2e-16 ***
Gender            1   12.859       629     552.64 0.0003358 ***
Family_Diabetes   1   35.069       628     517.57 3.183e-09 ***
highBP            1    9.898       627     507.67 0.0016545 ** 
PhysicallyActive  3   12.344       624     495.33 0.0062944 ** 
BMI               1    0.628       623     494.70 0.4281195    
Smoking           1    9.125       622     485.58 0.0025210 ** 
Alcohol           1    0.015       621     485.56 0.9036516    
Sleep             1    2.187       620     483.37 0.1391359    
SoundSleep        1    0.152       619     483.22 0.6964133    
RegularMedicine   1   80.792       618     402.43 < 2.2e-16 ***
JunkFood          3    1.130       615     401.30 0.7697540    
Stress            3   11.472       612     389.83 0.0094306 ** 
BPLevel           2   16.537       610     373.29 0.0002564 ***
Pregancies        1   12.272       609     361.02 0.0004597 ***
Pdiabetes         1   26.751       608     334.27 2.314e-07 ***
UriationFreq      1    1.102       607     333.17 0.2938943    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

gfit2=glm(Diabetic~., data=Train.diabet, family = binomial)
cbind(gfit1$coefficients, gfit2$coefficients)

##                                              [,1]         [,2]
## (Intercept)                           -1.74808262  -1.74808262
## Age50-59                               0.59253917   0.59253917
## Age60 or older                         1.61214689   1.61214689
## Ageless than 40                       -1.74881721  -1.74881721
## GenderFemale                           0.75499209   0.75499209
## Family_Diabetesno                      1.10013745   1.10013745
## highBPno                              -1.34929812  -1.34929812
## PhysicallyActiveMore than half an hr   0.96937032   0.96937032
## PhysicallyActivenone                   1.24247791   1.24247791
## PhysicallyActiveone hr or more         1.65601454   1.65601454
## BMI                                    0.14903473   0.14903473
## Smokingno                              1.46817871   1.46817871
## Alcoholno                             -0.93616362  -0.93616362
## Sleep                                  0.13708311   0.13708311
## SoundSleep                             0.24376942   0.24376942
## RegularMedicineno                      3.18194151   3.18194151
## JunkFood\t\nalways                    -0.59620345  -0.59620345
## JunkFoodoften                         -0.68203158  -0.68203158
## JunkFoodoccasionally                  -0.80432089  -0.80432089
## Stresssometimes                       -0.09683845  -0.09683845
## Stressvery often                      -0.66670031  -0.66670031
## Stressalways                          -0.76141479  -0.76141479
## BPLevelhigh                          -15.55244810 -15.55244810
## BPLevellow                            -1.88078889  -1.88078889
## Pregancies                             0.41273999   0.41273999
## Pdiabetesyes                           5.09624708   5.09624708
## UriationFreqquite often                0.41011844   0.41011844

p=predict(gfit2, Val.diabet, type="response")
PredDiabet=as.factor(p>0.5)
levels(PredDiabet)=c("no","yes")
library(lattice)

library(caret)

matrizLogis  <- confusionMatrix(Val.diabet$Diabetic, PredDiabet)

## Warning in confusionMatrix.default(Val.diabet$Diabetic, PredDiabet): Levels are
## not in the same order for reference and data. Refactoring data to match.

matrizLogis

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no   16  51
##        yes 185  20
##                                           
##                Accuracy : 0.1324          
##                  95% CI : (0.0944, 0.1785)
##     No Information Rate : 0.739           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.3966         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.07960         
##             Specificity : 0.28169         
##          Pos Pred Value : 0.23881         
##          Neg Pred Value : 0.09756         
##              Prevalence : 0.73897         
##          Detection Rate : 0.05882         
##    Detection Prevalence : 0.24632         
##       Balanced Accuracy : 0.18065         
##                                           
##        'Positive' Class : no              
##
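When class labels are assigned from predicted probabilities, the order of the factor levels matters: for a two-level factor response, glm with family = binomial models the probability of the second level. The sketch below, on simulated data with levels ordered "yes", "no", assigns the labels from the levels of the response itself. It uses a plain `table()` rather than caret, and all names are hypothetical.

```r
set.seed(3)  # arbitrary seed for the simulation
n <- 200
toy <- data.frame(x = rnorm(n))

# Response with levels ordered "yes", "no"
is_no <- rbinom(n, 1, plogis(1.5 * toy$x)) == 1
toy$Diabetic <- factor(ifelse(is_no, "no", "yes"), levels = c("yes", "no"))

fit <- glm(Diabetic ~ x, data = toy, family = binomial)

# Predicted probability of the SECOND level, here "no"
p <- predict(fit, toy, type = "response")

# p > 0.5 maps to the second level, otherwise to the first
lev  <- levels(toy$Diabetic)
pred <- factor(ifelse(p > 0.5, lev[2], lev[1]), levels = lev)

# Confusion table and accuracy
conf     <- table(Predicted = pred, Actual = toy$Diabetic)
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```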

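INTERPRETATION The matrix reports an accuracy of 0.1324 and a kappa of -0.40, far below the no-information rate of 0.739. Values this low point to inverted predicted labels rather than to a poor model: the levels of Diabetic are ordered yes, no, so glm models the probability of "no", while the code labels p > 0.5 as "yes". With the labels swapped, 185 + 51 = 236 of the 272 validation cases are classified correctly, an accuracy of about 0.87. We can also draw the ROC curve for different cut-off probabilities.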

library(pROC)

test_prob = predict(gfit2, newdata = Val.diabet, type = "response")
test_roc = roc(Val.diabet$Diabetic~test_prob, plot = TRUE, print.auc = TRUE)

## Setting levels: control = yes, case = no

## Setting direction: controls < cases

INTERPRETATION

The ROC curve is used for prediction models with a 0/1 response. It visualises the behaviour of the classifier trained on 70% of the data, with sensitivity on the y-axis and specificity on the x-axis. We obtain an area under the curve of 0.91, an acceptable model. Looking across the decision thresholds, this prediction model is good.
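The area under the ROC curve can also be computed without plotting, from the ranks of the predicted scores (the Mann-Whitney form of the AUC). The sketch below uses a hypothetical helper `auc_rank()` and four made-up scores; it is an illustration and not the computation pROC performs internally.

```r
# Hypothetical helper: AUC from ranks (Mann-Whitney statistic)
auc_rank <- function(scores, is_case) {
  r  <- rank(scores)      # average ranks handle ties
  n1 <- sum(is_case)      # number of cases
  n0 <- sum(!is_case)     # number of controls
  (sum(r[is_case]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up scores: higher values should indicate a case
scores  <- c(0.10, 0.40, 0.35, 0.80)
is_case <- c(FALSE, FALSE, TRUE, TRUE)

auc_value <- auc_rank(scores, is_case)
auc_value
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation of cases and controls.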

SUPPORT VECTOR MACHINES Another binary classification model is the Support Vector Machine. The package used here to fit these models is e1071. We fit the model to our data. WE CHECK WITH THE RADIAL KERNEL

library(e1071)

fitsm1 <- svm(Diabetic~., data = Train.diabet)
summary(fitsm1)

## 
## Call:
## svm(formula = Diabetic ~ ., data = Train.diabet)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  265
## 
##  ( 132 133 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  yes no

INTERPRETATION The summary shows that a classification model was fitted, with 265 support vectors, two classes, and two binary levels (yes and no): 132 for the left side (no) and 133 for the right side (yes).

Next we predict the response values and compute the confusion matrix.

library(caret)
predictedSVM = predict(fitsm1,Val.diabet)
matrizSVM1 = confusionMatrix(Val.diabet$Diabetic,predictedSVM)
matrizSVM1

Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes 196   9
       no   14  53
                                          
               Accuracy : 0.9154          
                 95% CI : (0.8758, 0.9456)
    No Information Rate : 0.7721          
    P-Value [Acc > NIR] : 3.803e-10       
                                          
                  Kappa : 0.7664          
                                          
 Mcnemar's Test P-Value : 0.4042          
                                          
            Sensitivity : 0.9333          
            Specificity : 0.8548          
         Pos Pred Value : 0.9561          
         Neg Pred Value : 0.7910          
             Prevalence : 0.7721          
         Detection Rate : 0.7206          
   Detection Prevalence : 0.7537          
      Balanced Accuracy : 0.8941          
                                          
       'Positive' Class : yes
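As a check on how the statistics above follow from the four counts, here is a short base R sketch that recomputes them by hand. The counts and labels are copied from the printed matrix; the variable names are illustrative only.

```r
# Counts copied from the confusion matrix printed above
# (rows = "Prediction", columns = "Reference", labels as printed).
cm <- matrix(c(196, 14, 9, 53), nrow = 2,
             dimnames = list(Prediction = c("yes", "no"),
                             Reference  = c("yes", "no")))

n <- sum(cm)                                          # 272 validation cases

accuracy    <- sum(diag(cm)) / n                      # agreement on the diagonal
sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])    # positive class is "yes"
specificity <- cm["no", "no"]   / sum(cm[, "no"])
ppv         <- cm["yes", "yes"] / sum(cm["yes", ])    # Pos Pred Value
npv         <- cm["no", "no"]   / sum(cm["no", ])     # Neg Pred Value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement compared with agreement expected by chance.
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Kappa = kappa_hat,
              Sensitivity = sensitivity, Specificity = specificity,
              PPV = ppv, NPV = npv, Balanced = balanced), 4))
```

The rounded values agree with the Accuracy, Kappa, Sensitivity, Specificity, predictive values and Balanced Accuracy reported in the output.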

INTERPRETATION

Comparison of the two classification models. The confusion matrix of the logistic model gives 185 for no and 51 for yes. With support vector machines we obtain 196 for no and 53 for yes, so the support vector machine gives a better fit.

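CHECKING THE POLYNOMIAL KERNEL MODEL

library(e1071)
# The kernel has to be passed as an argument of svm(); assigned on its own line
# it is ignored and the default radial kernel is fitted, as in the output below.
fitsm2 <- svm(Diabetic~., data = Train.diabet, kernel = "polynomial")
summary(fitsm2)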


Call:
svm(formula = Diabetic ~ ., data = Train.diabet)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 

Number of Support Vectors:  265

 ( 132 133 )


Number of Classes:  2 

Levels: 
 yes no

predictedSVM = predict(fitsm2,Val.diabet)
matrizSVM2 = confusionMatrix(Val.diabet$Diabetic,predictedSVM)
matrizSVM2

Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes 196   9
       no   14  53
                                          
               Accuracy : 0.9154          
                 95% CI : (0.8758, 0.9456)
    No Information Rate : 0.7721          
    P-Value [Acc > NIR] : 3.803e-10       
                                          
                  Kappa : 0.7664          
                                          
 Mcnemar's Test P-Value : 0.4042          
                                          
            Sensitivity : 0.9333          
            Specificity : 0.8548          
         Pos Pred Value : 0.9561          
         Neg Pred Value : 0.7910          
             Prevalence : 0.7721          
         Detection Rate : 0.7206          
   Detection Prevalence : 0.7537          
      Balanced Accuracy : 0.8941          
                                          
       'Positive' Class : yes

INTERPRETATION

With the polynomial kernel the number of support vectors increases: 202 for the left side (no) and 133 for the right side (yes), with a confusion matrix of 203 for no and 16 for yes. However, the accuracy dropped, the p-value dropped sharply, and the concordance is very low, so this kernel does not improve the model.

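CHECKING THE SIGMOID KERNEL MODEL

library(caret)
library(e1071)
# The kernel has to be passed as an argument of svm(); assigned on its own line
# it is ignored and the default radial kernel is fitted, as in the output below.
fitsm3 <- svm(Diabetic~., data = Train.diabet, kernel = "sigmoid")
summary(fitsm3)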


Call:
svm(formula = Diabetic ~ ., data = Train.diabet)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 

Number of Support Vectors:  265

 ( 132 133 )


Number of Classes:  2 

Levels: 
 yes no

predictedSVM3 = predict(fitsm3,Val.diabet)
matrizSVM3 = confusionMatrix(Val.diabet$Diabetic,predictedSVM3)
matrizSVM3

Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes 196   9
       no   14  53
                                          
               Accuracy : 0.9154          
                 95% CI : (0.8758, 0.9456)
    No Information Rate : 0.7721          
    P-Value [Acc > NIR] : 3.803e-10       
                                          
                  Kappa : 0.7664          
                                          
 Mcnemar's Test P-Value : 0.4042          
                                          
            Sensitivity : 0.9333          
            Specificity : 0.8548          
         Pos Pred Value : 0.9561          
         Neg Pred Value : 0.7910          
             Prevalence : 0.7721          
         Detection Rate : 0.7206          
   Detection Prevalence : 0.7537          
      Balanced Accuracy : 0.8941          
                                          
       'Positive' Class : yes

INTERPRETATION

This model is very similar to the radial one.

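CHECKING THE LINEAR KERNEL MODEL

library(e1071)
# The kernel has to be passed as an argument of svm(); assigned on its own line
# it is ignored and the default radial kernel is fitted, as in the output below.
fitsm4 <- svm(Diabetic~., data = Train.diabet, kernel = "linear")
summary(fitsm4)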


Call:
svm(formula = Diabetic ~ ., data = Train.diabet)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 

Number of Support Vectors:  265

 ( 132 133 )


Number of Classes:  2 

Levels: 
 yes no

predictedSVM4 = predict(fitsm4,Val.diabet)
matrizSVM4 = confusionMatrix(Val.diabet$Diabetic,predictedSVM4)
matrizSVM4

Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes 196   9
       no   14  53
                                          
               Accuracy : 0.9154          
                 95% CI : (0.8758, 0.9456)
    No Information Rate : 0.7721          
    P-Value [Acc > NIR] : 3.803e-10       
                                          
                  Kappa : 0.7664          
                                          
 Mcnemar's Test P-Value : 0.4042          
                                          
            Sensitivity : 0.9333          
            Specificity : 0.8548          
         Pos Pred Value : 0.9561          
         Neg Pred Value : 0.7910          
             Prevalence : 0.7721          
         Detection Rate : 0.7206          
   Detection Prevalence : 0.7537          
      Balanced Accuracy : 0.8941          
                                          
       'Positive' Class : yes

Accuracy <- setNames(c(matrizLogis$overall[1], matrizSVM1$overall[1], matrizSVM2$overall[1], matrizSVM3$overall[1], matrizSVM4$overall[1]),
                    c("Logistica", "SVM_RADIAL", "SVM_PLINOMIAL", "SVM_SIGMOID", "SVM_LINEAR"))
Accuracy

##     Logistica    SVM_RADIAL SVM_PLINOMIAL   SVM_SIGMOID    SVM_LINEAR 
##     0.1323529     0.9154412     0.9154412     0.9154412     0.9154412
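The kernel comparison can also be run in a single loop, which avoids repeating the fitting code for each kernel. The sketch below assumes the e1071 package is installed and uses two species of the built-in `iris` data as a stand-in, because `Train.diabet` and `Val.diabet` are not reproduced here. The names `dat`, `train`, `valid`, `fits` and `results` are hypothetical.

```r
library(e1071)

# Stand-in data (assumption): two iris species instead of the diabetes data.
set.seed(1)
dat   <- droplevels(subset(iris, Species != "setosa"))
idx   <- sample(nrow(dat), size = round(0.7 * nrow(dat)))  # 70% for training
train <- dat[idx, ]
valid <- dat[-idx, ]

kernels <- c("radial", "polynomial", "sigmoid", "linear")

# Fit one SVM per kernel; `kernel` is passed inside the svm() call.
fits <- lapply(kernels, function(k) svm(Species ~ ., data = train, kernel = k))
names(fits) <- kernels

# Number of support vectors and validation accuracy for each kernel.
results <- data.frame(
  kernel   = kernels,
  n_sv     = sapply(fits, function(f) f$tot.nSV),
  accuracy = sapply(fits, function(f) mean(predict(f, valid) == valid$Species))
)
print(results)
```

With the diabetes data, the same loop would use `Diabetic ~ .`, `Train.diabet` and `Val.diabet` in place of the stand-in objects. The accuracies from this toy split say nothing about the diabetes models.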

CLASS ASSIGNMENT

JOHANA BRAVO

2023-12-12

IMPORTING THE DATASET

DESCRIPTION OF THE VARIABLES

Conversion of the JunkFood variable

Conversion of the Stress variable

Diabetes prediction (binary response and the variables that influence it)

Classification models

Logistic regression