library(tidyverse)
library(psych)
library(stats)
library(dplyr)
library(tseries)
library(MVN)
library(ggplot2)
library(ggpubr)
library(GGally)
library(biotools)
library(reshape2)
library(caret) # confusionMatrix()
library(klaR) # group classification plots (partimat)
library(foreign) # to import .arff files
library(RWeka) # to import .arff files
library(kableExtra)
library(knitr)
library(DT) # data tables
library(reticulate)
options(scipen = 999)
options(digits = 3)
Discriminant Analysis
Description of the dataset
Title: Rice (Cammeo and Osmancik)
Link: https://archive.ics.uci.edu/dataset/545/rice+cammeo+and+osmancik
The dataset consists of certified rice grown in Turkey: the Osmancik variety, which has had a large planting area since 1997, and the Cammeo variety, grown since 2014, were selected for the study. In their general characteristics, Osmancik grains have a wide, long, glassy, and dull appearance, and Cammeo grains likewise look wide, long, glassy, and dull. A total of 3810 images of rice grains were taken for the two varieties, processed, and feature inferences were made, yielding 7 morphological features for each grain.
Variables:
Area: number of pixels within the boundaries of the rice grain.
Perimeter: circumference, computed from the distances between pixels around the grain boundary.
Major_Axis_Length: the longest line that can be drawn across the rice grain, i.e., the main-axis distance.
Minor_Axis_Length: the shortest line that can be drawn across the rice grain, i.e., the small-axis distance.
Eccentricity: how round the ellipse with the same moments as the rice grain is (see the formula after this list).
Convex_Area: pixel count of the smallest convex hull of the region formed by the rice grain.
Extent: ratio of the region formed by the rice grain to the bounding-box pixels.
Class: Cammeo or Osmancik rice.
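For reference, the eccentricity of an ellipse with semi-major axis a and semi-minor axis b follows the standard definition (assumed here to match the one used by the image-processing pipeline):

e = \sqrt{1 - b^{2}/a^{2}}, \qquad 0 \le e < 1

so a circle has e = 0, and values near 1 (the sample mean below is about 0.89) indicate elongated grains.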
Application in R
Initial exploration
ruta <- file.choose()
base <- read.arff(ruta)
# Check the structure and data
str(base)'data.frame': 3810 obs. of 8 variables:
$ Area : num 15231 14656 14634 13176 14688 ...
$ Perimeter : num 526 494 501 458 507 ...
$ Major_Axis_Length: num 230 206 214 193 212 ...
$ Minor_Axis_Length: num 85.1 91.7 87.8 87.4 89.3 ...
$ Eccentricity : num 0.929 0.895 0.912 0.892 0.907 ...
$ Convex_Area : num 15617 15072 14954 13368 15262 ...
$ Extent : num 0.573 0.615 0.693 0.641 0.646 ...
$ Class : Factor w/ 2 levels "Cammeo","Osmancik": 1 1 1 1 1 1 1 1 1 1 ...
head(base) Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity Convex_Area
1 15231 526 230 85.1 0.929 15617
2 14656 494 206 91.7 0.895 15072
3 14634 501 214 87.8 0.912 14954
4 13176 458 193 87.4 0.892 13368
5 14688 507 212 89.3 0.907 15262
6 13479 477 200 86.7 0.901 13786
Extent Class
1 0.573 Cammeo
2 0.615 Cammeo
3 0.693 Cammeo
4 0.641 Cammeo
5 0.646 Cammeo
6 0.658 Cammeo
tail(base) Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity
3805 12501 452 193 83.2 0.902
3806 11441 416 170 85.8 0.864
3807 11625 421 168 89.5 0.846
3808 12437 442 184 86.8 0.881
3809 9882 392 161 78.2 0.874
3810 11434 405 161 90.9 0.826
Convex_Area Extent Class
3805 12687 0.719 Osmancik
3806 11628 0.681 Osmancik
3807 11904 0.694 Osmancik
3808 12645 0.627 Osmancik
3809 10097 0.659 Osmancik
3810 11591 0.803 Osmancik
kable(rbind(head(base),tail(base))) %>% kable_styling()| Area | Perimeter | Major_Axis_Length | Minor_Axis_Length | Eccentricity | Convex_Area | Extent | Class | |
|---|---|---|---|---|---|---|---|---|
| 1 | 15231 | 526 | 230 | 85.1 | 0.929 | 15617 | 0.573 | Cammeo |
| 2 | 14656 | 494 | 206 | 91.7 | 0.895 | 15072 | 0.615 | Cammeo |
| 3 | 14634 | 501 | 214 | 87.8 | 0.912 | 14954 | 0.693 | Cammeo |
| 4 | 13176 | 458 | 193 | 87.4 | 0.892 | 13368 | 0.641 | Cammeo |
| 5 | 14688 | 507 | 212 | 89.3 | 0.907 | 15262 | 0.646 | Cammeo |
| 6 | 13479 | 477 | 200 | 86.7 | 0.901 | 13786 | 0.658 | Cammeo |
| 3805 | 12501 | 452 | 193 | 83.2 | 0.902 | 12687 | 0.719 | Osmancik |
| 3806 | 11441 | 416 | 170 | 85.8 | 0.864 | 11628 | 0.681 | Osmancik |
| 3807 | 11625 | 421 | 168 | 89.5 | 0.846 | 11904 | 0.694 | Osmancik |
| 3808 | 12437 | 442 | 184 | 86.8 | 0.881 | 12645 | 0.627 | Osmancik |
| 3809 | 9882 | 392 | 161 | 78.2 | 0.874 | 10097 | 0.659 | Osmancik |
| 3810 | 11434 | 405 | 161 | 90.9 | 0.826 | 11591 | 0.803 | Osmancik |
# Null values
paste("Total missing values:", sum(is.na(base)))[1] "Total missing values: 0"
# base <- na.omit(base)
# summary
summary(base) Area Perimeter Major_Axis_Length Minor_Axis_Length
Min. : 7551 Min. :359 Min. :145 Min. : 59.5
1st Qu.:11370 1st Qu.:426 1st Qu.:174 1st Qu.: 82.7
Median :12422 Median :449 Median :186 Median : 86.4
Mean :12668 Mean :454 Mean :189 Mean : 86.3
3rd Qu.:13950 3rd Qu.:484 3rd Qu.:204 3rd Qu.: 90.1
Max. :18913 Max. :548 Max. :239 Max. :107.5
Eccentricity Convex_Area Extent Class
Min. :0.777 Min. : 7723 Min. :0.497 Cammeo :1630
1st Qu.:0.872 1st Qu.:11626 1st Qu.:0.599 Osmancik:2180
Median :0.889 Median :12706 Median :0.645
Mean :0.887 Mean :12952 Mean :0.662
3rd Qu.:0.903 3rd Qu.:14284 3rd Qu.:0.727
Max. :0.948 Max. :19099 Max. :0.861
kable(describe(base[1:7])) %>% kable_styling()| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Area | 1 | 3810 | 12667.728 | 1732.368 | 12421.500 | 12604.852 | 1835.459 | 7551.000 | 18913.000 | 11362.000 | 0.325 | -0.433 | 28.066 |
| Perimeter | 2 | 3810 | 454.239 | 35.597 | 448.852 | 453.307 | 41.362 | 359.100 | 548.446 | 189.346 | 0.221 | -0.842 | 0.577 |
| Major_Axis_Length | 3 | 3810 | 188.776 | 17.449 | 185.810 | 188.215 | 20.862 | 145.264 | 239.010 | 93.746 | 0.260 | -0.953 | 0.283 |
| Minor_Axis_Length | 4 | 3810 | 86.314 | 5.730 | 86.435 | 86.388 | 5.493 | 59.532 | 107.542 | 48.010 | -0.135 | 0.558 | 0.093 |
| Eccentricity | 5 | 3810 | 0.887 | 0.021 | 0.889 | 0.888 | 0.022 | 0.777 | 0.948 | 0.171 | -0.449 | 0.068 | 0.000 |
| Convex_Area | 6 | 3810 | 12952.497 | 1776.972 | 12706.500 | 12888.298 | 1895.504 | 7723.000 | 19099.000 | 11376.000 | 0.320 | -0.468 | 28.788 |
| Extent | 7 | 3810 | 0.662 | 0.077 | 0.645 | 0.659 | 0.087 | 0.497 | 0.861 | 0.364 | 0.344 | -1.031 | 0.001 |
# Visualize graphically
boxplot(base$Area, col = "turquoise4", main = "Boxplot - Area")
boxplot(base$Perimeter, col = "darkolivegreen3", main = "Boxplot - Perimeter")
boxplot(base$Major_Axis_Length, col = "darkkhaki", main = "Boxplot - Major axis length")
boxplot(base$Minor_Axis_Length, col = "darkseagreen3", main = "Boxplot - Minor axis length")
boxplot(base$Eccentricity, col = "darkorange3", main = "Boxplot - Eccentricity")
boxplot(base$Convex_Area, col = "rosybrown4", main = "Boxplot - Convex area")
boxplot(base$Extent, col = "plum3", main = "Boxplot - Extent")
hist(base$Area, col = "turquoise4", main = "Histogram - Area", xlab = "Area", ylab = "Frequency")
hist(base$Perimeter, col = "darkolivegreen3", main = "Histogram - Perimeter", xlab = "Perimeter", ylab = "Frequency")
hist(base$Major_Axis_Length, col = "darkkhaki", main = "Histogram - Major axis length", xlab = "Major_Axis_Length", ylab = "Frequency")
hist(base$Minor_Axis_Length, col = "darkseagreen3", main = "Histogram - Minor axis length", xlab = "Minor_Axis_Length", ylab = "Frequency")
hist(base$Eccentricity, col = "darkorange3", main = "Histogram - Eccentricity", xlab = "Eccentricity", ylab = "Frequency")
hist(base$Convex_Area, col = "rosybrown4", main = "Histogram - Convex area", xlab = "Convex_Area", ylab = "Frequency")
hist(base$Extent, col = "plum3", main = "Histogram - Extent", xlab = "Extent", ylab = "Frequency")
# Graphical exploration splitting the data by group
pairs(x = base[,1:7], col = c("salmon", "skyblue")[base$Class], pch = 10)
pares <- list(
c("Area", "Perimeter"),
c("Area", "Major_Axis_Length"),
c("Area", "Minor_Axis_Length"),
c("Area", "Eccentricity"),
c("Area", "Convex_Area"),
c("Area", "Extent"),
c("Perimeter", "Major_Axis_Length"),
c("Perimeter", "Minor_Axis_Length"),
c("Perimeter", "Eccentricity"),
c("Perimeter", "Convex_Area"),
c("Perimeter", "Extent"),
c("Major_Axis_Length", "Minor_Axis_Length"),
c("Major_Axis_Length", "Eccentricity"),
c("Major_Axis_Length", "Convex_Area"),
c("Major_Axis_Length", "Extent"),
c("Minor_Axis_Length", "Eccentricity"),
c("Minor_Axis_Length", "Convex_Area"),
c("Minor_Axis_Length", "Extent"),
c("Eccentricity", "Convex_Area"),
c("Eccentricity", "Extent"),
c("Convex_Area", "Extent")
)
grafico <- list()
# Use a for loop to generate the plots
for (i in seq_along(pares)) {
  grafico[[i]] <- ggplot(base, aes(x = .data[[pares[[i]][1]]], y = .data[[pares[[i]][2]]], color = Class)) +
    geom_point() +
    scale_color_manual(values = c("Cammeo" = "salmon", "Osmancik" = "skyblue")) +
    theme_minimal() +
    ggtitle(paste(pares[[i]][1], " & ", pares[[i]][2]))
}
# Display the plots
grafico[[1]]
# grafico[[2]] through grafico[[21]] show the remaining variable pairs (figure output omitted)
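As an aside, GGally (attached at the top but not otherwise used) can draw the same class-colored panels in a single call; a minimal sketch:
# Compact alternative to the 21-plot loop: a full scatterplot matrix
# colored by class, using GGally::ggpairs (already attached above)
ggpairs(base, columns = 1:7, mapping = aes(color = Class))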
Outlier detection
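The squared Mahalanobis distance of an observation x from the sample mean vector \bar{x}, given the sample covariance matrix S, is

D^{2}(x) = (x - \bar{x})^{\top} S^{-1} (x - \bar{x}),

and under multivariate normality D^{2} approximately follows a chi-squared distribution with degrees of freedom equal to the number of variables (7 here), which is what motivates the chi-squared cutoff computed below.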
mahalanobis_dist <- mahalanobis(base[,1:7], colMeans(base[,1:7]), cov(base[,1:7]))
# Critical value to flag outliers (chi-squared quantile)
# Note: the df should equal the number of variables (7); df = 3 gives a
# lower cutoff and therefore flags more observations
valor_critico <- qchisq(0.95, df = 3)
valor_critico # view critical value[1] 7.81
atipicos <- which(mahalanobis_dist > valor_critico) # flag outliers
head(mahalanobis_dist) # view distances[1] 27.18 3.43 4.11 4.11 13.01 1.67
length(atipicos)[1] 1022
# base_noOutliers will exclude the outlying observations
base_noOutliers <- base[-atipicos, ]
Testing univariate and multivariate normality
# Univariate normality
# Jarque-Bera test for normality
# H0: The data follow a normal distribution
# Ha: The data do not follow a normal distribution
# Data including the 1022 outlying observations
for (i in 1:(ncol(base) - 1)) {
print(paste("Variable:", colnames(base)[i]))
print(jarque.bera.test(base[, i]))
}[1] "Variable: Area"
Jarque Bera Test
data: base[, i]
X-squared = 97, df = 2, p-value <0.0000000000000002
[1] "Variable: Perimeter"
Jarque Bera Test
data: base[, i]
X-squared = 143, df = 2, p-value <0.0000000000000002
[1] "Variable: Major_Axis_Length"
Jarque Bera Test
data: base[, i]
X-squared = 187, df = 2, p-value <0.0000000000000002
[1] "Variable: Minor_Axis_Length"
Jarque Bera Test
data: base[, i]
X-squared = 61, df = 2, p-value = 0.00000000000005
[1] "Variable: Eccentricity"
Jarque Bera Test
data: base[, i]
X-squared = 129, df = 2, p-value <0.0000000000000002
[1] "Variable: Convex_Area"
Jarque Bera Test
data: base[, i]
X-squared = 99, df = 2, p-value <0.0000000000000002
[1] "Variable: Extent"
Jarque Bera Test
data: base[, i]
X-squared = 244, df = 2, p-value <0.0000000000000002
# Data without the outlying observations
for (i in 1:(ncol(base_noOutliers) - 1)) {
print(paste("Variable:", colnames(base_noOutliers)[i]))
print(jarque.bera.test(base_noOutliers[, i]))
}[1] "Variable: Area"
Jarque Bera Test
data: base_noOutliers[, i]
X-squared = 136, df = 2, p-value <0.0000000000000002
[1] "Variable: Perimeter"
Jarque Bera Test
data: base_noOutliers[, i]
X-squared = 157, df = 2, p-value <0.0000000000000002
[1] "Variable: Major_Axis_Length"
Jarque Bera Test
data: base_noOutliers[, i]
X-squared = 180, df = 2, p-value <0.0000000000000002
[1] "Variable: Minor_Axis_Length"
Jarque Bera Test
data: base_noOutliers[, i]
X-squared = 19, df = 2, p-value = 0.00008
[1] "Variable: Eccentricity"
Jarque Bera Test
data: base_noOutliers[, i]
X-squared = 117, df = 2, p-value <0.0000000000000002
[1] "Variable: Convex_Area"
Jarque Bera Test
data: base_noOutliers[, i]
X-squared = 138, df = 2, p-value <0.0000000000000002
[1] "Variable: Extent"
Jarque Bera Test
data: base_noOutliers[, i]
X-squared = 199, df = 2, p-value <0.0000000000000002
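As a sanity check, the statistic can be recomputed by hand; a minimal sketch, assuming the standard formula JB = n(S^2/6 + (K - 3)^2/24) with sample skewness S and kurtosis K (under H0, JB follows a chi-squared distribution with 2 df):
# Hand-rolled Jarque-Bera as a cross-check against tseries (sketch)
jb_manual <- function(x) {
  n <- length(x)
  z <- x - mean(x)
  S <- (sum(z^3) / n) / (sum(z^2) / n)^(3 / 2) # sample skewness
  K <- (sum(z^4) / n) / (sum(z^2) / n)^2       # sample kurtosis
  JB <- n * (S^2 / 6 + (K - 3)^2 / 24)
  c(statistic = JB, p.value = pchisq(JB, df = 2, lower.tail = FALSE))
}
jb_manual(base$Area) # should agree closely with jarque.bera.test(base$Area)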
# Multivariate normality
### with outliers
mvn(data = base[1:7], mvnTest = "hz", multivariateOutlierMethod = "quan")$multivariateNormality
Test HZ p value MVN
1 Henze-Zirkler 9.61 0 NO
$univariateNormality
Test Variable Statistic p value Normality
1 Anderson-Darling Area 21.11 <0.001 NO
2 Anderson-Darling Perimeter 33.15 <0.001 NO
3 Anderson-Darling Major_Axis_Length 46.72 <0.001 NO
4 Anderson-Darling Minor_Axis_Length 2.54 <0.001 NO
5 Anderson-Darling Eccentricity 15.97 <0.001 NO
6 Anderson-Darling Convex_Area 22.13 <0.001 NO
7 Anderson-Darling Extent 62.92 <0.001 NO
$Descriptives
n Mean Std.Dev Median Min Max
Area 3810 12667.728 1732.3677 12421.500 7551.000 18913.000
Perimeter 3810 454.239 35.5971 448.852 359.100 548.446
Major_Axis_Length 3810 188.776 17.4487 185.810 145.264 239.010
Minor_Axis_Length 3810 86.314 5.7298 86.435 59.532 107.542
Eccentricity 3810 0.887 0.0208 0.889 0.777 0.948
Convex_Area 3810 12952.497 1776.9720 12706.500 7723.000 19099.000
Extent 3810 0.662 0.0772 0.645 0.497 0.861
25th 75th Skew Kurtosis
Area 11370.500 13950.000 0.325 -0.4334
Perimeter 426.145 483.684 0.221 -0.8418
Major_Axis_Length 174.354 203.550 0.260 -0.9532
Minor_Axis_Length 82.732 90.144 -0.135 0.5579
Eccentricity 0.872 0.903 -0.449 0.0678
Convex_Area 11626.250 14284.000 0.320 -0.4681
Extent 0.599 0.727 0.344 -1.0314
# the Royston test requires fewer than 2000 observations
# the first 1999 observations are used
mvn(data = base[1:1999,1:7], mvnTest = "royston", multivariatePlot = "qq")$multivariateNormality
Test H p value MVN
1 Royston 226 0.00000000000000000000000000000000000000000000000278 NO
$univariateNormality
Test Variable Statistic p value Normality
1 Anderson-Darling Area 4.27 <0.001 NO
2 Anderson-Darling Perimeter 18.14 <0.001 NO
3 Anderson-Darling Major_Axis_Length 23.03 <0.001 NO
4 Anderson-Darling Minor_Axis_Length 1.87 0.0001 NO
5 Anderson-Darling Eccentricity 20.77 <0.001 NO
6 Anderson-Darling Convex_Area 4.90 <0.001 NO
7 Anderson-Darling Extent 36.87 <0.001 NO
$Descriptives
n Mean Std.Dev Median Min Max
Area 1999 13686.081 1592.8481 13824.000 8499.000 18913.000
Perimeter 1999 476.797 31.1675 481.770 370.446 548.446
Major_Axis_Length 1999 200.164 15.0655 202.722 154.096 239.010
Minor_Axis_Length 1999 87.983 5.5457 88.033 67.695 107.542
Eccentricity 1999 0.897 0.0172 0.899 0.816 0.948
Convex_Area 1999 14002.171 1631.1453 14169.000 8648.000 19099.000
Extent 1999 0.655 0.0808 0.639 0.497 0.861
25th 75th Skew Kurtosis
Area 12577.000 14824.000 -0.2125 -0.249
Perimeter 457.165 499.289 -0.5438 -0.189
Major_Axis_Length 191.991 210.748 -0.5740 -0.160
Minor_Axis_Length 84.570 91.462 -0.0879 0.582
Eccentricity 0.888 0.908 -0.8356 1.142
Convex_Area 12877.500 15157.000 -0.2380 -0.262
Extent 0.585 0.721 0.3803 -1.051
### without outliers
mvn(data = base_noOutliers[1:7], mvnTest = "hz", multivariateOutlierMethod = "quan")$multivariateNormality
Test HZ p value MVN
1 Henze-Zirkler 4.66 0 NO
$univariateNormality
Test Variable Statistic p value Normality
1 Anderson-Darling Area 26.96 <0.001 NO
2 Anderson-Darling Perimeter 35.65 <0.001 NO
3 Anderson-Darling Major_Axis_Length 45.17 <0.001 NO
4 Anderson-Darling Minor_Axis_Length 1.23 0.0034 NO
5 Anderson-Darling Eccentricity 19.76 <0.001 NO
6 Anderson-Darling Convex_Area 27.71 <0.001 NO
7 Anderson-Darling Extent 52.25 <0.001 NO
$Descriptives
n Mean Std.Dev Median Min Max
Area 2788 12530.309 1413.6629 12308.500 9732.000 16513.000
Perimeter 2788 450.911 29.8622 445.165 390.957 528.176
Major_Axis_Length 2788 187.063 14.7857 183.978 159.499 222.672
Minor_Axis_Length 2788 86.221 4.3721 86.220 74.396 100.133
Eccentricity 2788 0.886 0.0173 0.887 0.842 0.921
Convex_Area 2788 12804.893 1450.8514 12583.000 9887.000 16950.000
Extent 2788 0.660 0.0717 0.642 0.534 0.823
25th 75th Skew Kurtosis
Area 11416.500 13617.500 0.3675 -0.796
Perimeter 427.029 475.957 0.3209 -0.969
Major_Axis_Length 174.787 199.940 0.3269 -1.060
Minor_Axis_Length 83.153 89.352 -0.0086 -0.404
Eccentricity 0.872 0.900 -0.2277 -0.897
Convex_Area 11665.000 13905.750 0.3714 -0.799
Extent 0.601 0.717 0.4337 -0.983
# the Royston test requires fewer than 2000 observations
# the first 1999 observations are used
mvn(data = base_noOutliers[1:1999,1:7], mvnTest = "royston", multivariatePlot = "qq")$multivariateNormality
Test H p value MVN
1 Royston 253 0.00000000000000000000000000000000000000000000000000000278 NO
$univariateNormality
Test Variable Statistic p value Normality
1 Anderson-Darling Area 13.451 <0.001 NO
2 Anderson-Darling Perimeter 21.185 <0.001 NO
3 Anderson-Darling Major_Axis_Length 27.134 <0.001 NO
4 Anderson-Darling Minor_Axis_Length 0.737 0.0548 YES
5 Anderson-Darling Eccentricity 21.130 <0.001 NO
6 Anderson-Darling Convex_Area 13.805 <0.001 NO
7 Anderson-Darling Extent 36.214 <0.001 NO
$Descriptives
n Mean Std.Dev Median Min Max
Area 1999 12867.031 1442.9019 12770.000 9732.000 16513.000
Perimeter 1999 458.527 30.3618 458.733 390.957 528.176
Major_Axis_Length 1999 190.890 15.0359 191.897 160.227 222.672
Minor_Axis_Length 1999 86.755 4.3251 86.784 74.396 100.133
Eccentricity 1999 0.889 0.0169 0.892 0.842 0.921
Convex_Area 1999 13153.583 1480.4424 13072.000 9887.000 16950.000
Extent 1999 0.657 0.0724 0.641 0.534 0.823
25th 75th Skew Kurtosis
Area 11701.500 14058.500 0.0716 -0.990
Perimeter 432.680 484.235 -0.0253 -1.128
Major_Axis_Length 177.357 203.897 -0.0424 -1.208
Minor_Axis_Length 83.756 89.887 -0.0579 -0.335
Eccentricity 0.877 0.902 -0.4762 -0.623
Convex_Area 11953.500 14388.500 0.0713 -0.994
Extent 0.596 0.715 0.4308 -0.984
Testing homogeneity of covariance matrices across groups
# Box's M test for homogeneity of covariance matrices
# H0: The group covariance matrices are equal
# Ha: The group covariance matrices are not equal
boxM(data = base[, 1:7], grouping = base[, 8])
Box's M-test for Homogeneity of Covariance Matrices
data: base[, 1:7]
Chi-Sq (approx.) = 7056, df = 28, p-value <0.0000000000000002
Since the p-value is far below 0.05, H0 is rejected: the two classes have different covariance matrices, which favors QDA (a separate covariance matrix per class) over LDA in the R application that follows.
Standardizing the data
# Scale the independent variables and append the qualitative variable
baseScale <- cbind(scale(base[,1:7]), base[,8])
# Convert to data frame
baseScale <- as.data.frame(baseScale)
# Name the qualitative variable
colnames(baseScale)[colnames(baseScale) == "V8"] <- "Class"
# New data frame with the group names and the standardized data
for (i in seq_len(nrow(baseScale))) {
if (baseScale[i, 8] == 1) {
baseScale[i,8]="Cammeo"
} else if (baseScale[i, 8] == 2) {
baseScale[i,8]="Osmancik"
}
}
# Variable "Class" de tipo "factor"
baseScale$Class <- as.factor(baseScale$Class)
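Note that cbind() coerces the Class factor to its integer codes (1 and 2), which is why the recode loop above is needed. A loop-free equivalent that keeps the factor intact, as a sketch:
# Scale the numeric columns in place and leave the Class factor untouched
baseScale_alt <- base
baseScale_alt[1:7] <- scale(base[1:7])
str(baseScale_alt)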
# Check the structure
str(baseScale)'data.frame': 3810 obs. of 8 variables:
$ Area : num 1.48 1.148 1.135 0.293 1.166 ...
$ Perimeter : num 2.004 1.126 1.317 0.115 1.487 ...
$ Major_Axis_Length: num 2.348 0.988 1.452 0.261 1.316 ...
$ Minor_Axis_Length: num -0.213 0.945 0.254 0.198 0.523 ...
$ Eccentricity : num 2.018 0.41 1.213 0.24 0.952 ...
$ Convex_Area : num 1.499 1.193 1.126 0.234 1.3 ...
$ Extent : num -1.153 -0.602 0.406 -0.275 -0.206 ...
$ Class : Factor w/ 2 levels "Cammeo","Osmancik": 1 1 1 1 1 1 1 1 1 1 ...
kable(rbind(head(baseScale),tail(baseScale))) %>% kable_styling()| Area | Perimeter | Major_Axis_Length | Minor_Axis_Length | Eccentricity | Convex_Area | Extent | Class | |
|---|---|---|---|---|---|---|---|---|
| 1 | 1.480 | 2.004 | 2.348 | -0.213 | 2.018 | 1.499 | -1.153 | Cammeo |
| 2 | 1.148 | 1.126 | 0.988 | 0.945 | 0.410 | 1.193 | -0.602 | Cammeo |
| 3 | 1.135 | 1.317 | 1.452 | 0.254 | 1.213 | 1.126 | 0.406 | Cammeo |
| 4 | 0.293 | 0.115 | 0.261 | 0.198 | 0.240 | 0.234 | -0.275 | Cammeo |
| 5 | 1.166 | 1.487 | 1.316 | 0.523 | 0.952 | 1.300 | -0.206 | Cammeo |
| 6 | 0.468 | 0.640 | 0.646 | 0.059 | 0.694 | 0.469 | -0.052 | Cammeo |
| 3805 | -0.096 | -0.069 | 0.227 | -0.544 | 0.729 | -0.149 | 0.736 | Osmancik |
| 3806 | -0.708 | -1.078 | -1.048 | -0.097 | -1.085 | -0.745 | 0.247 | Osmancik |
| 3807 | -0.602 | -0.923 | -1.207 | 0.550 | -1.970 | -0.590 | 0.419 | Osmancik |
| 3808 | -0.133 | -0.330 | -0.298 | 0.085 | -0.275 | -0.173 | -0.456 | Osmancik |
| 3809 | -1.608 | -1.740 | -1.581 | -1.414 | -0.599 | -1.607 | -0.037 | Osmancik |
| 3810 | -0.712 | -1.391 | -1.587 | 0.795 | -2.939 | -0.766 | 1.826 | Osmancik |
# summary
summary(baseScale) Area Perimeter Major_Axis_Length Minor_Axis_Length
Min. :-2.95 Min. :-2.673 Min. :-2.494 Min. :-4.67
1st Qu.:-0.75 1st Qu.:-0.789 1st Qu.:-0.827 1st Qu.:-0.63
Median :-0.14 Median :-0.151 Median :-0.170 Median : 0.02
Mean : 0.00 Mean : 0.000 Mean : 0.000 Mean : 0.00
3rd Qu.: 0.74 3rd Qu.: 0.827 3rd Qu.: 0.847 3rd Qu.: 0.67
Max. : 3.61 Max. : 2.646 Max. : 2.879 Max. : 3.70
Eccentricity Convex_Area Extent Class
Min. :-5.27 Min. :-2.94 Min. :-2.130 Cammeo :1630
1st Qu.:-0.70 1st Qu.:-0.75 1st Qu.:-0.817 Osmancik:2180
Median : 0.10 Median :-0.14 Median :-0.215
Mean : 0.00 Mean : 0.00 Mean : 0.000
3rd Qu.: 0.76 3rd Qu.: 0.75 3rd Qu.: 0.837
Max. : 2.94 Max. : 3.46 Max. : 2.578
kable(describe(baseScale[1:7])) %>% kable_styling()| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Area | 1 | 3810 | 0 | 1 | -0.142 | -0.036 | 1.060 | -2.95 | 3.60 | 6.56 | 0.325 | -0.433 | 0.016 |
| Perimeter | 2 | 3810 | 0 | 1 | -0.151 | -0.026 | 1.162 | -2.67 | 2.65 | 5.32 | 0.221 | -0.842 | 0.016 |
| Major_Axis_Length | 3 | 3810 | 0 | 1 | -0.170 | -0.032 | 1.196 | -2.49 | 2.88 | 5.37 | 0.260 | -0.953 | 0.016 |
| Minor_Axis_Length | 4 | 3810 | 0 | 1 | 0.021 | 0.013 | 0.959 | -4.67 | 3.71 | 8.38 | -0.135 | 0.558 | 0.016 |
| Eccentricity | 5 | 3810 | 0 | 1 | 0.105 | 0.043 | 1.060 | -5.27 | 2.94 | 8.20 | -0.449 | 0.068 | 0.016 |
| Convex_Area | 6 | 3810 | 0 | 1 | -0.138 | -0.036 | 1.067 | -2.94 | 3.46 | 6.40 | 0.320 | -0.468 | 0.016 |
| Extent | 7 | 3810 | 0 | 1 | -0.215 | -0.041 | 1.125 | -2.13 | 2.58 | 4.71 | 0.344 | -1.031 | 0.016 |
Computing prior probabilities
n1=0
n2=0
# Use a for loop to count the number of elements in each group
for (i in seq_len(nrow(baseScale))) {
if (baseScale[i, 8] == "Cammeo") {
n1 <- n1 + 1
} else if (baseScale[i, 8] == "Osmancik") {
n2 <- n2 + 1
}
}
n1[1] 1630
n2[1] 2180
# Probabilities
paste("Probability of class 'Cammeo': ", n1/(n1+n2))[1] "Probability of class 'Cammeo': 0.427821522309711"
paste("Probability of class 'Osmancik': ", n2/(n1+n2))[1] "Probability of class 'Osmancik': 0.572178477690289"
Building the discriminant function
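For context, qda() classifies with the quadratic discriminant score below (standard form, stated here for reference): each class k contributes its own mean vector \mu_k, covariance matrix \Sigma_k, and prior \pi_k, and an observation x is assigned to the class with the largest score:

\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x - \mu_k)^{\top}\Sigma_k^{-1}(x - \mu_k) + \log\pi_k

The class-specific \Sigma_k is what distinguishes QDA from LDA, and it is what the Box's M result above calls for.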
modelo_qda <- qda(Class ~ Area + Perimeter + Major_Axis_Length + Minor_Axis_Length + Eccentricity + Convex_Area + Extent, data = baseScale)
# Model summary
modelo_qda$lev[1] "Cammeo" "Osmancik"
modelo_qda$counts Cammeo Osmancik
1630 2180
modelo_qda$N[1] 3810
modelo_qda$prior Cammeo Osmancik
0.428 0.572
modelo_qda$callqda(formula = Class ~ Area + Perimeter + Major_Axis_Length +
Minor_Axis_Length + Eccentricity + Convex_Area + Extent,
data = baseScale)
modelo_qda$means Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity
Cammeo 0.863 0.933 0.957 0.428 0.681
Osmancik -0.645 -0.697 -0.716 -0.320 -0.509
Convex_Area Extent
Cammeo 0.868 -0.136
Osmancik -0.649 0.102
Classifying new observations
# add 3 new test observations
obs_prueba <- data.frame(
Area = c(16500, 11550, 15900),
Perimeter = c(700, 440, 400),
Major_Axis_Length = c(150, 180, 170),
Minor_Axis_Length = c(100, 85, 85),
Eccentricity = c(0.7, 0.88, 0.66),
Convex_Area = c(13500, 11960, 15110),
Extent = c(0.5, 0.6, 0.7)
)
str(obs_prueba)'data.frame': 3 obs. of 7 variables:
$ Area : num 16500 11550 15900
$ Perimeter : num 700 440 400
$ Major_Axis_Length: num 150 180 170
$ Minor_Axis_Length: num 100 85 85
$ Eccentricity : num 0.7 0.88 0.66
$ Convex_Area : num 13500 11960 15110
$ Extent : num 0.5 0.6 0.7
# Save the statistics needed to scale the new data:
baseScale2 <- scale(base[1:7])
media <- attr(baseScale2, "scaled:center")
desvio <- attr(baseScale2, "scaled:scale")
# Scale the new data
obs_prueba_scale <- scale(obs_prueba, center = media, scale = desvio)
obs_prueba_scale <- as.data.frame(obs_prueba_scale)
str(obs_prueba_scale)'data.frame': 3 obs. of 7 variables:
$ Area : num 2.212 -0.645 1.866
$ Perimeter : num 6.9 -0.4 -1.52
$ Major_Axis_Length: num -2.222 -0.503 -1.076
$ Minor_Axis_Length: num 2.389 -0.229 -0.229
$ Eccentricity : num -8.98 -0.33 -10.9
$ Convex_Area : num 0.308 -0.559 1.214
$ Extent : num -2.097 -0.802 0.493
# Classify the new data (the model was fit on standardized data, so the standardized observations must be passed)
predict(object = modelo_qda, newdata = obs_prueba_scale)$class
[1] Cammeo Cammeo Cammeo
Levels: Cammeo Osmancik
$posterior
Cammeo Osmancik
1 1 0
2 1 0
3 1 0
Model validation
Evaluating the classification errors
# classify the same observations used to fit the discriminant QDA model
predicciones <- predict(object = modelo_qda, newdata = baseScale[,1:7],
method = "predictive")
# Show as a table
table(baseScale$Class, predicciones$class,
dnn = c("Actual class", "Predicted class")) Predicted class
Actual class Cammeo Osmancik
Cammeo 1547 83
Osmancik 247 1933
# Error percentage
training_error <- mean(baseScale$Class != predicciones$class) * 100
paste("training_error=", round(training_error,2), "%")[1] "training_error= 8.66 %"
Confusion matrix
# Build the confusion matrix
confusionMatrix(predicciones$class, baseScale$Class)Confusion Matrix and Statistics
Reference
Prediction Cammeo Osmancik
Cammeo 1547 247
Osmancik 83 1933
Accuracy : 0.913
95% CI : (0.904, 0.922)
No Information Rate : 0.572
P-Value [Acc > NIR] : <0.0000000000000002
Kappa : 0.825
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.949
Specificity : 0.887
Pos Pred Value : 0.862
Neg Pred Value : 0.959
Prevalence : 0.428
Detection Rate : 0.406
Detection Prevalence : 0.471
Balanced Accuracy : 0.918
'Positive' Class : Cammeo
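Reading these off the matrix directly, with Cammeo as the positive class:

\text{Sensitivity} = \frac{1547}{1547 + 83} \approx 0.949, \qquad \text{Specificity} = \frac{1933}{1933 + 247} \approx 0.887, \qquad \text{PPV} = \frac{1547}{1547 + 247} \approx 0.862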
Plotting the classifications
# partimat(Class ~ Area + Perimeter + Major_Axis_Length + Minor_Axis_Length + Eccentricity + Convex_Area + Extent, data = baseScale, main="Discrimination with QDA",
# method = "qda", image.colors = c("salmon", "skyblue"),
# col.mean = "snow2")
library(combinat) # combinatorics
# Vector with the variable names
vars <- c("Area", "Perimeter", "Major_Axis_Length", "Minor_Axis_Length",
"Eccentricity", "Convex_Area", "Extent")
# Generate all possible combinations of 2 variables
comb_2vars <- combn(vars, 2, simplify = FALSE)
# Loop over each combination and plot
for (i in seq_along(comb_2vars)) {
vars_formula <- comb_2vars[[i]]
formula <- as.formula(paste("Class ~", paste(vars_formula, collapse = " + ")))
# Display on screen
partimat(formula,
data = baseScale,
method = "qda",
main = paste("QDA con:", paste(vars_formula, collapse = ", ")),
image.colors = c("salmon", "skyblue"),
col.mean = "snow2")
}
Application in Python
import numpy as np
import pandas as pd
import csv
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, classification_report, cohen_kappa_score
import itertools
#from statsmodels.stats.inter_rater import cohens_kappa
from scipy.stats import jarque_bera
import pingouin as pg
from scipy.spatial.distance import mahalanobis
from sklearn.covariance import EmpiricalCovariance
from bioinfokit.analys import stat
import seaborn as sns
from scipy.io import arff
Initial exploration
base=pd.read_csv(r"C:\Users\MINEDUCYT\Documents\Seminario\rice+cammeo+and+osmancik\Rice_Cammeo_Osmancik_.csv", sep=",")
pd.set_option('display.max_rows', None) # Show all rows
pd.set_option('display.max_columns', None) # Show all columns
# variable information
base.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3810 entries, 0 to 3809
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Area 3810 non-null int64
1 Perimeter 3810 non-null float64
2 Major_Axis_Length 3810 non-null float64
3 Minor_Axis_Length 3810 non-null float64
4 Eccentricity 3810 non-null float64
5 Convex_Area 3810 non-null int64
6 Extent 3810 non-null float64
7 Class 3810 non-null object
dtypes: float64(5), int64(2), object(1)
memory usage: 238.3+ KB
# Null values:
# A csv file was created in R from the "base" data frame and imported into Python; it contains no null values.
# Descriptive statistics
base.describe() Area Perimeter Major_Axis_Length Minor_Axis_Length \
count 3810.000000 3810.000000 3810.000000 3810.000000
mean 12667.727559 454.239180 188.776222 86.313750
std 1732.367706 35.597081 17.448679 5.729817
min 7551.000000 359.100006 145.264465 59.532406
25% 11370.500000 426.144753 174.353855 82.731695
50% 12421.500000 448.852493 185.810059 86.434647
75% 13950.000000 483.683746 203.550438 90.143677
max 18913.000000 548.445984 239.010498 107.542450
Eccentricity Convex_Area Extent
count 3810.000000 3810.000000 3810.000000
mean 0.886871 12952.496850 0.661934
std 0.020818 1776.972042 0.077239
min 0.777233 7723.000000 0.497413
25% 0.872402 11626.250000 0.598862
50% 0.889050 12706.500000 0.645361
75% 0.902588 14284.000000 0.726562
max 0.948007 19099.000000 0.861050
# Check
base.head() Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity \
0 15231 525.578979 229.749878 85.093788 0.928882
1 14656 494.311005 206.020065 91.730972 0.895405
2 14634 501.122009 214.106781 87.768288 0.912118
3 13176 458.342987 193.337387 87.448395 0.891861
4 14688 507.166992 211.743378 89.312454 0.906691
Convex_Area Extent Class
0 15617 0.572896 Cammeo
1 15072 0.615436 Cammeo
2 14954 0.693259 Cammeo
3 13368 0.640669 Cammeo
4 15262 0.646024 Cammeo
base.tail() Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity \
3805 11441 415.858002 170.486771 85.756592 0.864280
3806 11625 421.390015 167.714798 89.462570 0.845850
3807 12437 442.498993 183.572922 86.801979 0.881144
3808 9882 392.296997 161.193985 78.210480 0.874406
3809 11434 404.709991 161.079269 90.868195 0.825692
Convex_Area Extent Class
3805 11628 0.681012 Osmancik
3806 11904 0.694279 Osmancik
3807 12645 0.626739 Osmancik
3808 10097 0.659064 Osmancik
3809 11591 0.802949 Osmancik
# Boxplots
paleta_colores = np.array([
"#00868B", # turquoise4
"#A2CD5A", # darkolivegreen3
"#BDB76B", # darkkhaki
"#9BCD9B", # darkseagreen3
"#CD6600", # darkorange3
"#8B6969", # rosybrown4
"#CD96CD" # plum3
])
for i, columna in enumerate(base.columns[:7]):
    plt.clf()
    sns.boxplot(y=base[columna], color=paleta_colores[i])
    plt.title(f'Boxplot - {columna}')
    plt.show()
# Histograms
for i, columna in enumerate(base.columns[:7]):
    plt.clf()
    sns.histplot(base[columna], kde=True, bins=15, color=paleta_colores[i])
    plt.title(f'Histogram - {columna}')
    plt.ylabel("Frequency")
    plt.show()
# Pairwise visualization by class
plt.clf()
sns.pairplot(base, hue='Class', diag_kind='hist', markers=["o", "o"], palette=['salmon', 'skyblue'])
plt.show()
Testing univariate and multivariate normality
print("Normalidad univariante \n",
"Prueba de jarque bera para normalidad \n",
"Ho: Los datos siguen una distribución normal \n",
"Ha: Los datos no siguen una distribución normal \n")Normalidad univariante
Prueba de jarque bera para normalidad
Ho: Los datos siguen una distribución normal
Ha: Los datos no siguen una distribución normal
for columna in base.columns[:-1]:
    jb_stat, jb_p = jarque_bera(base[columna])
    print(f"Variable: {columna} | JB statistic = {jb_stat:.4f}, p-value = {jb_p:.4f} \n")Variable: Area | JB statistic = 96.7233, p-value = 0.0000
Variable: Perimeter | JB statistic = 143.2961, p-value = 0.0000
Variable: Major_Axis_Length | JB statistic = 186.8867, p-value = 0.0000
Variable: Minor_Axis_Length | JB statistic = 61.2948, p-value = 0.0000
Variable: Eccentricity | JB statistic = 128.8224, p-value = 0.0000
Variable: Convex_Area | JB statistic = 99.4744, p-value = 0.0000
Variable: Extent | JB statistic = 243.5291, p-value = 0.0000
# Multivariate normality
print("\nMultivariate normality - Henze-Zirkler (HZ):")
Multivariate normality - Henze-Zirkler (HZ):
hz_test = pg.multivariate_normality(base.iloc[:, 0:7], alpha=0.05) # all 7 predictors (0:6 would drop Extent)
print(hz_test)HZResults(hz=np.float64(16.667633968806907), pval=np.float64(0.0), normal=False)
Testing homogeneity of covariance matrices across groups
# Box's M test for homogeneity of covariance matrices
# H0: The group covariance matrices are equal
# Ha: The group covariance matrices are not equal
# 'Class' as a category
base2=base.copy()
base2['Class'] = base2['Class'].astype('category')
# Run Box's M test
resultado_mbox = stat()
resultado_mbox.box_m(df=base2, group_col='Class')
print("Box's M test for homogeneity of covariance matrices \n",
      "H0: The group covariance matrices are equal \n",
      "Ha: The group covariance matrices are not equal \n")
print("Box's M statistic:", resultado_mbox.boxm['test_stat'])
print("p-value:", resultado_mbox.boxm['p_value'])
Standardizing the data
# Scale the first 7 columns (independent variables)
scaler = StandardScaler()
Xscale = scaler.fit_transform(base.iloc[:, 0:7])
# Convert the scaled result to a DataFrame
baseScale = pd.DataFrame(Xscale, columns=base.columns[0:7])
# Add the 'Class' variable
baseScale['Class'] = base['Class'].copy()
# Replace values 1 and 2 in 'Class' with "Cammeo" and "Osmancik"
# (a no-op here, since Class is already labeled; kept to mirror the R workflow)
baseScale['Class'] = baseScale['Class'].replace({1: 'Cammeo', 2: 'Osmancik'})
# Convert 'Class' to category
baseScale['Class'] = baseScale['Class'].astype('category')
# Check the structure
baseScale.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3810 entries, 0 to 3809
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Area 3810 non-null float64
1 Perimeter 3810 non-null float64
2 Major_Axis_Length 3810 non-null float64
3 Minor_Axis_Length 3810 non-null float64
4 Eccentricity 3810 non-null float64
5 Convex_Area 3810 non-null float64
6 Extent 3810 non-null float64
7 Class 3810 non-null category
dtypes: category(1), float64(7)
memory usage: 212.3 KB
print(pd.concat([baseScale.head(), baseScale.tail()])) Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity \
0 1.479830 2.004354 2.348547 -0.212943 2.018337
1 1.147870 1.125853 0.988390 0.945568 0.410018
2 1.135169 1.317214 1.451908 0.253887 1.212956
3 0.293436 0.115300 0.261439 0.198051 0.239751
4 1.166345 1.487053 1.316442 0.523419 0.952221
3805 -0.708215 -1.078353 -1.048323 -0.097251 -1.085282
3806 -0.601988 -0.922926 -1.207208 0.549622 -1.970731
3807 -0.133204 -0.329851 -0.298245 0.085220 -0.275099
3808 -1.608257 -1.740320 -1.580971 -1.414414 -0.598821
3809 -0.712256 -1.391566 -1.587546 0.794972 -2.939160
Convex_Area Extent Class
0 1.499659 -1.152921 Cammeo
1 1.192918 -0.602079 Cammeo
2 1.126504 0.405611 Cammeo
3 0.233857 -0.275351 Cammeo
4 1.299855 -0.206013 Cammeo
3805 -0.745465 0.247031 Osmancik
3806 -0.590124 0.418815 Osmancik
3807 -0.173068 -0.455731 Osmancik
3808 -1.607156 -0.037168 Osmancik
3809 -0.766290 1.825947 Osmancik
# Summary
print(baseScale.describe()) Area Perimeter Major_Axis_Length Minor_Axis_Length \
count 3810.000000 3.810000e+03 3.810000e+03 3.810000e+03
mean 0.000000 -2.983907e-16 1.193563e-16 -4.177469e-16
std 1.000131 1.000131e+00 1.000131e+00 1.000131e+00
min -2.953991 -2.673019e+00 -2.494027e+00 -4.674645e+00
25% -0.748916 -7.893376e-01 -8.266678e-01 -6.252425e-01
50% -0.142152 -1.513437e-01 -1.700159e-01 2.110226e-02
75% 0.740282 8.272709e-01 8.468352e-01 6.685081e-01
max 3.605523 2.646823e+00 2.879351e+00 3.705439e+00
Eccentricity Convex_Area Extent
count 3.810000e+03 3.810000e+03 3.810000e+03
mean 7.758157e-16 -4.774250e-16 -4.475860e-16
std 1.000131e+00 1.000131e+00 1.000131e+00
min -5.267280e+00 -2.943312e+00 -2.130313e+00
25% -6.951263e-01 -7.464501e-01 -8.166890e-01
50% 1.047129e-01 -1.384541e-01 -2.145916e-01
75% 7.551068e-01 7.494085e-01 8.368337e-01
max 2.937147e+00 3.459430e+00 2.578260e+00
Computing prior probabilities
priori = baseScale['Class'].value_counts(normalize=True)
print("Prior probabilities:\n", priori)Prior probabilities:
Class
Osmancik 0.572178
Cammeo 0.427822
Name: proportion, dtype: float64
Building the discriminant function
modelo_lda = LinearDiscriminantAnalysis()
modelo_lda.fit(baseScale.iloc[:, 0:7], baseScale['Class'])LinearDiscriminantAnalysis()
############### Summary
# Classes
print("Classes:", modelo_lda.classes_)Classes: ['Cammeo' 'Osmancik']
# Number of observations per class
(unique, counts) = np.unique(baseScale['Class'], return_counts=True)
print("Counts:", dict(zip(unique, counts)))Counts: {'Cammeo': np.int64(1630), 'Osmancik': np.int64(2180)}
# Total observations
print("N total:", len(baseScale))N total: 3810
# Prior probabilities
print("Prior probabilities:", modelo_lda.priors_)Prior probabilities: [0.42782152 0.57217848]
# Class means
print("Means:")Means:
print(pd.DataFrame(modelo_lda.means_, columns=baseScale.columns[0:7], index=modelo_lda.classes_)) Area Perimeter Major_Axis_Length Minor_Axis_Length \
Cammeo 0.863189 0.932776 0.957354 0.428304
Osmancik -0.645412 -0.697443 -0.715820 -0.320246
Eccentricity Convex_Area Extent
Cammeo 0.681064 0.867843 -0.136148
Osmancik -0.509236 -0.648892 0.101798
# Linear discriminant function coefficients
coef_df = pd.DataFrame(modelo_lda.coef_, columns=baseScale.columns[0:7])
print("Discriminant function coefficients:")Discriminant function coefficients:
print(coef_df) Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity \
0 13.374122 3.367574 -5.263212 3.418791 1.351427
Convex_Area Extent
0 -20.028283 0.040976
# Intercept of the linear discriminant function
print("Intercept:")Intercept:
print(modelo_lda.intercept_[0])0.966318883957916
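With the coefficient vector w and intercept b reported above, scikit-learn's binary LDA reduces to a linear score (the library's standard convention for binary problems, noted here as background):

d(x) = w^{\top}x + b, \qquad \hat{y} = \text{Osmancik if } d(x) > 0, \text{ otherwise Cammeo,}

since classes_ lists Cammeo first and the positive side of the decision function maps to the second class.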
Classifying new observations
# New, unscaled observations
obs_prueba = pd.DataFrame({
'Area': [16500, 11550, 15900],
'Perimeter': [700, 440, 400],
'Major_Axis_Length': [150, 180, 170],
'Minor_Axis_Length': [100, 85, 85],
'Eccentricity': [0.7, 0.88, 0.66],
'Convex_Area': [13500, 11960, 15110],
'Extent': [0.5, 0.6, 0.7]
})
# Scale the new observations using the same parameters from training
obs_prueba_scale = pd.DataFrame(
    scaler.transform(obs_prueba), # uses the mean and sd from the earlier fit
    columns=obs_prueba.columns
)
# Classify the new observations
predicciones_nuevas = modelo_lda.predict(obs_prueba_scale)
print("Predictions:", predicciones_nuevas)Predictions: ['Osmancik' 'Osmancik' 'Cammeo']
Model validation
Evaluating the classification errors
# Classify the same observations used to fit the LDA model
predic_datos_originales_scale = modelo_lda.predict(baseScale.iloc[:, 0:7])
# Confusion matrix
matriz = confusion_matrix(baseScale['Class'], predic_datos_originales_scale)
print("Confusion matrix:")Confusion matrix:
print(matriz)[[1464 166]
[ 99 2081]]
# Training error percentage
training_error = (1 - accuracy_score(baseScale['Class'], predic_datos_originales_scale)) * 100
print(f"Training error: {training_error:.2f}%")Training error: 6.96%
Confusion matrix
# Plot the matrix
plt.clf()
disp = ConfusionMatrixDisplay(confusion_matrix=matriz, display_labels=modelo_lda.classes_)
disp.plot(cmap='Blues')<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x0000021DEFB6C2F0>
plt.title("Confusion Matrix - LDA")
plt.show()
# Summary
reporte = classification_report(baseScale['Class'], predic_datos_originales_scale, target_names=modelo_lda.classes_)
print("Model summary - classification on the training data:")Model summary - classification on the training data:
print(reporte) precision recall f1-score support
Cammeo 0.94 0.90 0.92 1630
Osmancik 0.93 0.95 0.94 2180
accuracy 0.93 3810
macro avg 0.93 0.93 0.93 3810
weighted avg 0.93 0.93 0.93 3810
kappa = cohen_kappa_score(baseScale['Class'], predic_datos_originales_scale)
print(f"Kappa index: {kappa:.4f}")Kappa index: 0.8572
A kappa of roughly 0.86 indicates agreement between predicted and actual classes well beyond what chance alone would produce.
# Hypothesis test for kappa
# kappa_result = cohens_kappa(matriz)
# print("Agreement between actual and predicted classes \n",
#       "Kappa test \n",
#       "H0: The observed agreement equals that expected by chance (index = 0) \n",
#       "Ha: The agreement is better than chance. \n")
# print("Kappa index:", round(kappa_result.kappa, 4))
# print("p-value:", round(kappa_result.pvalue, 4))
# print("95% confidence interval:", kappa_result.confint)
Plotting the classifications
# Independent variables
variables = ['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length',
             'Eccentricity', 'Convex_Area', 'Extent']
# Generate all 2-variable combinations
combinaciones = list(itertools.combinations(variables, 2))
# independent variables and the qualitative variable
X = baseScale[variables]
y = baseScale['Class']
# Create the LabelEncoder outside the loop for consistency
# It assigns a numeric label to each class,
# here 0=Cammeo, 1=Osmancik
le = LabelEncoder()
le.fit(y)LabelEncoder()
for var1, var2 in combinaciones:
    X_pair = X[[var1, var2]]
    lda_pair = LinearDiscriminantAnalysis()
    lda_pair.fit(X_pair, y)
    # Build the grid of points
    x_min, x_max = X_pair[var1].min() - 1, X_pair[var1].max() + 1
    y_min, y_max = X_pair[var2].min() - 1, X_pair[var2].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    # Predict on the grid
    Z = lda_pair.predict(grid)
    # Convert labels to numbers using the already-fitted encoder
    Z_num = le.transform(Z)
    Z_num = Z_num.reshape(xx.shape)
    plt.figure(figsize=(6, 5))
    plt.contourf(xx, yy, Z_num, alpha=0.3, levels=np.arange(len(le.classes_)+1)-0.5,
                 colors=['salmon', 'skyblue'])
    sns.scatterplot(data=baseScale, x=var1, y=var2, hue='Class',
                    palette={'Cammeo': 'salmon', 'Osmancik': 'skyblue'}, edgecolor='black')
    plt.title(f"LDA with: {var1} & {var2}")
    plt.xlabel(var1)
    plt.ylabel(var2)
    plt.tight_layout()
    plt.show()