Discriminant Analysis

Author

Daniel Perdomo

Publication date

June 19, 2025

Description of the dataset

Title: Rice (Cammeo and Osmancik)

Link: https://archive.ics.uci.edu/dataset/545/rice+cammeo+and+osmancik

A total of 3810 rice grain images were taken for the two species, and 7 morphological features were obtained for each grain.

Certified rice grown in TURKEY was selected for the study: the Osmancik species, which has had a large planted area since 1997, and the Cammeo species, cultivated since 2014. In general appearance, grains of both species are wide, long, glassy, and dull. The 3810 images were processed and the features were inferred from them.

Variables:

  1. Area: Number of pixels within the boundaries of the rice grain.

  2. Perimeter: Circumference, computed as the distance between pixels around the boundary of the rice grain.

  3. Major_Axis_Length: The longest line that can be drawn across the rice grain, i.e., the major-axis distance.

  4. Minor_Axis_Length: The shortest line that can be drawn across the rice grain, i.e., the minor-axis distance.

  5. Eccentricity: Measures how round the ellipse with the same moments as the rice grain is (see the sketch after this list).

  6. Convex_Area: Pixel count of the smallest convex hull of the region formed by the rice grain.

  7. Extent: Ratio of the region formed by the rice grain to the pixels of its bounding box.

  8. Class: Cammeo or Osmancik rice.
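For intuition, eccentricity relates to the two axis lengths through the standard ellipse formula e = sqrt(1 - (b/a)^2), where a and b are the semi-major and semi-minor axes. The R sketch below is an illustration of that formula, not the dataset authors' exact image-moment computation; it checks the relation against the first observation shown later in the data:

# Hedged sketch: ellipse eccentricity from its axis lengths
a <- 229.75 / 2  # semi-major axis of the first grain (Major_Axis_Length / 2)
b <- 85.09 / 2   # semi-minor axis of the first grain (Minor_Axis_Length / 2)
sqrt(1 - (b / a)^2)  # ~0.929, close to that grain's recorded Eccentricity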

R Application

library(tidyverse)
library(psych)
library(stats)
library(dplyr)
library(tseries)
library(MVN)
library(ggplot2)
library(ggpubr)
library(GGally)
library(biotools)
library(reshape2)
library(caret) # confusionMatrix()
library(klaR) # group plots
library(foreign) # to import .arff
library(RWeka) # to import .arff
library(kableExtra)
library(knitr)
library(DT) # data tables
library(reticulate)

options(scipen = 999)
options(digits = 3)   

Initial exploration

ruta <- file.choose()
base <- read.arff(ruta)

# Check structure and data
str(base)
'data.frame':   3810 obs. of  8 variables:
 $ Area             : num  15231 14656 14634 13176 14688 ...
 $ Perimeter        : num  526 494 501 458 507 ...
 $ Major_Axis_Length: num  230 206 214 193 212 ...
 $ Minor_Axis_Length: num  85.1 91.7 87.8 87.4 89.3 ...
 $ Eccentricity     : num  0.929 0.895 0.912 0.892 0.907 ...
 $ Convex_Area      : num  15617 15072 14954 13368 15262 ...
 $ Extent           : num  0.573 0.615 0.693 0.641 0.646 ...
 $ Class            : Factor w/ 2 levels "Cammeo","Osmancik": 1 1 1 1 1 1 1 1 1 1 ...
head(base)
   Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity Convex_Area
1 15231       526               230              85.1        0.929       15617
2 14656       494               206              91.7        0.895       15072
3 14634       501               214              87.8        0.912       14954
4 13176       458               193              87.4        0.892       13368
5 14688       507               212              89.3        0.907       15262
6 13479       477               200              86.7        0.901       13786
  Extent  Class
1  0.573 Cammeo
2  0.615 Cammeo
3  0.693 Cammeo
4  0.641 Cammeo
5  0.646 Cammeo
6  0.658 Cammeo
tail(base)
      Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity
3805 12501       452               193              83.2        0.902
3806 11441       416               170              85.8        0.864
3807 11625       421               168              89.5        0.846
3808 12437       442               184              86.8        0.881
3809  9882       392               161              78.2        0.874
3810 11434       405               161              90.9        0.826
     Convex_Area Extent    Class
3805       12687  0.719 Osmancik
3806       11628  0.681 Osmancik
3807       11904  0.694 Osmancik
3808       12645  0.627 Osmancik
3809       10097  0.659 Osmancik
3810       11591  0.803 Osmancik
kable(rbind(head(base),tail(base))) %>% kable_styling()
Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity Convex_Area Extent Class
1 15231 526 230 85.1 0.929 15617 0.573 Cammeo
2 14656 494 206 91.7 0.895 15072 0.615 Cammeo
3 14634 501 214 87.8 0.912 14954 0.693 Cammeo
4 13176 458 193 87.4 0.892 13368 0.641 Cammeo
5 14688 507 212 89.3 0.907 15262 0.646 Cammeo
6 13479 477 200 86.7 0.901 13786 0.658 Cammeo
3805 12501 452 193 83.2 0.902 12687 0.719 Osmancik
3806 11441 416 170 85.8 0.864 11628 0.681 Osmancik
3807 11625 421 168 89.5 0.846 11904 0.694 Osmancik
3808 12437 442 184 86.8 0.881 12645 0.627 Osmancik
3809 9882 392 161 78.2 0.874 10097 0.659 Osmancik
3810 11434 405 161 90.9 0.826 11591 0.803 Osmancik
# Missing values
paste("Total missing values:", sum(is.na(base)))
[1] "Total missing values: 0"
# base <- na.omit(base)

# summary
summary(base)
      Area         Perimeter   Major_Axis_Length Minor_Axis_Length
 Min.   : 7551   Min.   :359   Min.   :145       Min.   : 59.5    
 1st Qu.:11370   1st Qu.:426   1st Qu.:174       1st Qu.: 82.7    
 Median :12422   Median :449   Median :186       Median : 86.4    
 Mean   :12668   Mean   :454   Mean   :189       Mean   : 86.3    
 3rd Qu.:13950   3rd Qu.:484   3rd Qu.:204       3rd Qu.: 90.1    
 Max.   :18913   Max.   :548   Max.   :239       Max.   :107.5    
  Eccentricity    Convex_Area        Extent           Class     
 Min.   :0.777   Min.   : 7723   Min.   :0.497   Cammeo  :1630  
 1st Qu.:0.872   1st Qu.:11626   1st Qu.:0.599   Osmancik:2180  
 Median :0.889   Median :12706   Median :0.645                  
 Mean   :0.887   Mean   :12952   Mean   :0.662                  
 3rd Qu.:0.903   3rd Qu.:14284   3rd Qu.:0.727                  
 Max.   :0.948   Max.   :19099   Max.   :0.861                  
kable(describe(base[1:7])) %>% kable_styling()
vars n mean sd median trimmed mad min max range skew kurtosis se
Area 1 3810 12667.728 1732.368 12421.500 12604.852 1835.459 7551.000 18913.000 11362.000 0.325 -0.433 28.066
Perimeter 2 3810 454.239 35.597 448.852 453.307 41.362 359.100 548.446 189.346 0.221 -0.842 0.577
Major_Axis_Length 3 3810 188.776 17.449 185.810 188.215 20.862 145.264 239.010 93.746 0.260 -0.953 0.283
Minor_Axis_Length 4 3810 86.314 5.730 86.435 86.388 5.493 59.532 107.542 48.010 -0.135 0.558 0.093
Eccentricity 5 3810 0.887 0.021 0.889 0.888 0.022 0.777 0.948 0.171 -0.449 0.068 0.000
Convex_Area 6 3810 12952.497 1776.972 12706.500 12888.298 1895.504 7723.000 19099.000 11376.000 0.320 -0.468 28.788
Extent 7 3810 0.662 0.077 0.645 0.659 0.087 0.497 0.861 0.364 0.344 -1.031 0.001
# Visualize graphically
boxplot(base$Area, col = "turquoise4", main="Boxplot - Area")

boxplot(base$Perimeter, col = "darkolivegreen3", main="Boxplot - Perimeter")

boxplot(base$Major_Axis_Length, col = "darkkhaki", main="Boxplot - Major axis length")

boxplot(base$Minor_Axis_Length, col = "darkseagreen3", main="Boxplot - Minor axis length")

boxplot(base$Eccentricity, col = "darkorange3", main="Boxplot - Eccentricity")

boxplot(base$Convex_Area, col = "rosybrown4", main="Boxplot - Convex area")

boxplot(base$Extent, col = "plum3", main="Boxplot - Extent")

hist(base$Area, col="turquoise4", main="Histogram - Area", xlab="Area", ylab="Frequency")

hist(base$Perimeter, col="darkolivegreen3", main="Histogram - Perimeter", xlab="Perimeter", ylab="Frequency")

hist(base$Major_Axis_Length, col="darkkhaki", main="Histogram - Major axis length", xlab="Major_Axis_Length", ylab="Frequency")

hist(base$Minor_Axis_Length, col="darkseagreen3", main="Histogram - Minor axis length", xlab="Minor_Axis_Length", ylab="Frequency")

hist(base$Eccentricity, col="darkorange3", main="Histogram - Eccentricity", xlab="Eccentricity", ylab="Frequency")

hist(base$Convex_Area, col="rosybrown4", main="Histogram - Convex area", xlab="Convex_Area", ylab="Frequency")

hist(base$Extent, col="plum3", main="Histogram - Extent", xlab="Extent", ylab="Frequency")

# Graphical exploration splitting the data by group
pairs(x = base[,1:7], col = c("salmon", "skyblue")[base$Class], pch = 10)

pares <- list(
  c("Area", "Perimeter"), 
  c("Area", "Major_Axis_Length"), 
  c("Area", "Minor_Axis_Length"), 
  c("Area", "Eccentricity"), 
  c("Area", "Convex_Area"), 
  c("Area", "Extent"), 
  c("Perimeter", "Major_Axis_Length"), 
  c("Perimeter", "Minor_Axis_Length"), 
  c("Perimeter", "Eccentricity"), 
  c("Perimeter", "Convex_Area"), 
  c("Perimeter", "Extent"), 
  c("Major_Axis_Length", "Minor_Axis_Length"), 
  c("Major_Axis_Length", "Eccentricity"), 
  c("Major_Axis_Length", "Convex_Area"), 
  c("Major_Axis_Length", "Extent"), 
  c("Minor_Axis_Length", "Eccentricity"), 
  c("Minor_Axis_Length", "Convex_Area"), 
  c("Minor_Axis_Length", "Extent"), 
  c("Eccentricity", "Convex_Area"), 
  c("Eccentricity", "Extent"), 
  c("Convex_Area", "Extent")
)
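The 21 pairs above can also be generated programmatically; a one-line equivalent sketch using combn (the same approach used later for the partimat plots):

# Hedged equivalent: all 2-variable combinations without typing them out
pares <- combn(names(base)[1:7], 2, simplify = FALSE)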

grafico <- list()

# Using a for loop to generate the plots
# (aes_string() is deprecated in recent ggplot2 but still works; aes(.data[[...]]) is the modern idiom)
for (i in seq_along(pares)) {
  grafico[[i]] <- ggplot(base, aes_string(x = pares[[i]][1], y = pares[[i]][2], color = "Class")) +
    geom_point() +
    scale_color_manual(values = c("Cammeo"="salmon","Osmancik"="skyblue")) +
    theme_minimal() +
    ggtitle(paste(pares[[i]][1], " & ", pares[[i]][2]))
}

# Display plots
grafico
(The 21 scatter plots, one per pair of variables, are rendered here.)
Outlier detection

mahalanobis_dist <- mahalanobis(base[,1:7], colMeans(base[,1:7]), cov(base[,1:7]))

# Critical value to flag outliers (using the chi-squared distribution)
# Note: df is usually set to the number of variables (7 here); df = 3 is kept
# so the results below match the original run
valor_critico <- qchisq(0.95, df = 3)

valor_critico # view critical value
[1] 7.81
atipicos <- which(mahalanobis_dist > valor_critico) # Flag outliers

head(mahalanobis_dist) # view distances
[1] 27.18  3.43  4.11  4.11 13.01  1.67
length(atipicos)
[1] 1022
# base_noOutliers will contain the data without the outliers
base_noOutliers <- base[-atipicos, ]
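As an aside, with the degrees of freedom set to the number of variables (7), the cutoff is stricter and fewer points are flagged; a quick sketch of that variant (its results are not shown here):

# Hedged variant: chi-squared cutoff with df = number of variables
qchisq(0.95, df = 7)  # ~14.07 instead of 7.81
length(which(mahalanobis_dist > qchisq(0.95, df = 7)))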

Testing univariate and multivariate normality

# Univariate normality
# Jarque-Bera test for normality
# Ho: The data follow a normal distribution
# Ha: The data do not follow a normal distribution

# Data including the 1022 outlying observations

for (i in 1:(ncol(base) - 1)) {
  print(paste("Variable:", colnames(base)[i]))
  print(jarque.bera.test(base[, i]))
}
[1] "Variable: Area"

    Jarque Bera Test

data:  base[, i]
X-squared = 97, df = 2, p-value <0.0000000000000002

[1] "Variable: Perimeter"

    Jarque Bera Test

data:  base[, i]
X-squared = 143, df = 2, p-value <0.0000000000000002

[1] "Variable: Major_Axis_Length"

    Jarque Bera Test

data:  base[, i]
X-squared = 187, df = 2, p-value <0.0000000000000002

[1] "Variable: Minor_Axis_Length"

    Jarque Bera Test

data:  base[, i]
X-squared = 61, df = 2, p-value = 0.00000000000005

[1] "Variable: Eccentricity"

    Jarque Bera Test

data:  base[, i]
X-squared = 129, df = 2, p-value <0.0000000000000002

[1] "Variable: Convex_Area"

    Jarque Bera Test

data:  base[, i]
X-squared = 99, df = 2, p-value <0.0000000000000002

[1] "Variable: Extent"

    Jarque Bera Test

data:  base[, i]
X-squared = 244, df = 2, p-value <0.0000000000000002
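As a sanity check on what Jarque-Bera measures, the statistic can be recomputed from the sample skewness S and excess kurtosis K as JB = n/6 * (S^2 + K^2/4); a minimal base-R sketch for Area:

# Hedged sketch: Jarque-Bera statistic by hand for one variable
x  <- base$Area
n  <- length(x)
m2 <- mean((x - mean(x))^2)
S  <- mean((x - mean(x))^3) / m2^(3/2)  # sample skewness
K  <- mean((x - mean(x))^4) / m2^2 - 3  # excess kurtosis
n / 6 * (S^2 + K^2 / 4)  # ~97, matching the X-squared reported for Area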
# Data without the outlying observations
for (i in 1:(ncol(base_noOutliers) - 1)) {
  print(paste("Variable:", colnames(base_noOutliers)[i]))
  print(jarque.bera.test(base_noOutliers[, i]))
}
[1] "Variable: Area"

    Jarque Bera Test

data:  base_noOutliers[, i]
X-squared = 136, df = 2, p-value <0.0000000000000002

[1] "Variable: Perimeter"

    Jarque Bera Test

data:  base_noOutliers[, i]
X-squared = 157, df = 2, p-value <0.0000000000000002

[1] "Variable: Major_Axis_Length"

    Jarque Bera Test

data:  base_noOutliers[, i]
X-squared = 180, df = 2, p-value <0.0000000000000002

[1] "Variable: Minor_Axis_Length"

    Jarque Bera Test

data:  base_noOutliers[, i]
X-squared = 19, df = 2, p-value = 0.00008

[1] "Variable: Eccentricity"

    Jarque Bera Test

data:  base_noOutliers[, i]
X-squared = 117, df = 2, p-value <0.0000000000000002

[1] "Variable: Convex_Area"

    Jarque Bera Test

data:  base_noOutliers[, i]
X-squared = 138, df = 2, p-value <0.0000000000000002

[1] "Variable: Extent"

    Jarque Bera Test

data:  base_noOutliers[, i]
X-squared = 199, df = 2, p-value <0.0000000000000002
# Multivariate normality

### with outliers
mvn(data = base[1:7], mvnTest = "hz", multivariateOutlierMethod = "quan")

$multivariateNormality
           Test   HZ p value MVN
1 Henze-Zirkler 9.61       0  NO

$univariateNormality
              Test          Variable Statistic   p value Normality
1 Anderson-Darling       Area            21.11  <0.001      NO    
2 Anderson-Darling     Perimeter         33.15  <0.001      NO    
3 Anderson-Darling Major_Axis_Length     46.72  <0.001      NO    
4 Anderson-Darling Minor_Axis_Length      2.54  <0.001      NO    
5 Anderson-Darling   Eccentricity        15.97  <0.001      NO    
6 Anderson-Darling    Convex_Area        22.13  <0.001      NO    
7 Anderson-Darling      Extent           62.92  <0.001      NO    

$Descriptives
                     n      Mean   Std.Dev    Median      Min       Max
Area              3810 12667.728 1732.3677 12421.500 7551.000 18913.000
Perimeter         3810   454.239   35.5971   448.852  359.100   548.446
Major_Axis_Length 3810   188.776   17.4487   185.810  145.264   239.010
Minor_Axis_Length 3810    86.314    5.7298    86.435   59.532   107.542
Eccentricity      3810     0.887    0.0208     0.889    0.777     0.948
Convex_Area       3810 12952.497 1776.9720 12706.500 7723.000 19099.000
Extent            3810     0.662    0.0772     0.645    0.497     0.861
                       25th      75th   Skew Kurtosis
Area              11370.500 13950.000  0.325  -0.4334
Perimeter           426.145   483.684  0.221  -0.8418
Major_Axis_Length   174.354   203.550  0.260  -0.9532
Minor_Axis_Length    82.732    90.144 -0.135   0.5579
Eccentricity          0.872     0.903 -0.449   0.0678
Convex_Area       11626.250 14284.000  0.320  -0.4681
Extent                0.599     0.727  0.344  -1.0314
# the Royston test requires fewer than 2000 observations,
# so the first 1999 rows are used (note: not a random sample; they are mostly Cammeo grains)
mvn(data = base[1:1999,1:7], mvnTest = "royston", multivariatePlot = "qq")

$multivariateNormality
     Test   H                                              p value MVN
1 Royston 226 0.00000000000000000000000000000000000000000000000278  NO

$univariateNormality
              Test          Variable Statistic   p value Normality
1 Anderson-Darling       Area             4.27  <0.001      NO    
2 Anderson-Darling     Perimeter         18.14  <0.001      NO    
3 Anderson-Darling Major_Axis_Length     23.03  <0.001      NO    
4 Anderson-Darling Minor_Axis_Length      1.87  0.0001      NO    
5 Anderson-Darling   Eccentricity        20.77  <0.001      NO    
6 Anderson-Darling    Convex_Area         4.90  <0.001      NO    
7 Anderson-Darling      Extent           36.87  <0.001      NO    

$Descriptives
                     n      Mean   Std.Dev    Median      Min       Max
Area              1999 13686.081 1592.8481 13824.000 8499.000 18913.000
Perimeter         1999   476.797   31.1675   481.770  370.446   548.446
Major_Axis_Length 1999   200.164   15.0655   202.722  154.096   239.010
Minor_Axis_Length 1999    87.983    5.5457    88.033   67.695   107.542
Eccentricity      1999     0.897    0.0172     0.899    0.816     0.948
Convex_Area       1999 14002.171 1631.1453 14169.000 8648.000 19099.000
Extent            1999     0.655    0.0808     0.639    0.497     0.861
                       25th      75th    Skew Kurtosis
Area              12577.000 14824.000 -0.2125   -0.249
Perimeter           457.165   499.289 -0.5438   -0.189
Major_Axis_Length   191.991   210.748 -0.5740   -0.160
Minor_Axis_Length    84.570    91.462 -0.0879    0.582
Eccentricity          0.888     0.908 -0.8356    1.142
Convex_Area       12877.500 15157.000 -0.2380   -0.262
Extent                0.585     0.721  0.3803   -1.051
### without outliers
mvn(data = base_noOutliers[1:7], mvnTest = "hz", multivariateOutlierMethod = "quan")

$multivariateNormality
           Test   HZ p value MVN
1 Henze-Zirkler 4.66       0  NO

$univariateNormality
              Test          Variable Statistic   p value Normality
1 Anderson-Darling       Area            26.96  <0.001      NO    
2 Anderson-Darling     Perimeter         35.65  <0.001      NO    
3 Anderson-Darling Major_Axis_Length     45.17  <0.001      NO    
4 Anderson-Darling Minor_Axis_Length      1.23  0.0034      NO    
5 Anderson-Darling   Eccentricity        19.76  <0.001      NO    
6 Anderson-Darling    Convex_Area        27.71  <0.001      NO    
7 Anderson-Darling      Extent           52.25  <0.001      NO    

$Descriptives
                     n      Mean   Std.Dev    Median      Min       Max
Area              2788 12530.309 1413.6629 12308.500 9732.000 16513.000
Perimeter         2788   450.911   29.8622   445.165  390.957   528.176
Major_Axis_Length 2788   187.063   14.7857   183.978  159.499   222.672
Minor_Axis_Length 2788    86.221    4.3721    86.220   74.396   100.133
Eccentricity      2788     0.886    0.0173     0.887    0.842     0.921
Convex_Area       2788 12804.893 1450.8514 12583.000 9887.000 16950.000
Extent            2788     0.660    0.0717     0.642    0.534     0.823
                       25th      75th    Skew Kurtosis
Area              11416.500 13617.500  0.3675   -0.796
Perimeter           427.029   475.957  0.3209   -0.969
Major_Axis_Length   174.787   199.940  0.3269   -1.060
Minor_Axis_Length    83.153    89.352 -0.0086   -0.404
Eccentricity          0.872     0.900 -0.2277   -0.897
Convex_Area       11665.000 13905.750  0.3714   -0.799
Extent                0.601     0.717  0.4337   -0.983
# the Royston test requires fewer than 2000 observations,
# so the first 1999 rows are used (again, not a random sample)
mvn(data = base_noOutliers[1:1999,1:7], mvnTest = "royston", multivariatePlot = "qq")

$multivariateNormality
     Test   H                                                    p value MVN
1 Royston 253 0.00000000000000000000000000000000000000000000000000000278  NO

$univariateNormality
              Test          Variable Statistic   p value Normality
1 Anderson-Darling       Area           13.451  <0.001      NO    
2 Anderson-Darling     Perimeter        21.185  <0.001      NO    
3 Anderson-Darling Major_Axis_Length    27.134  <0.001      NO    
4 Anderson-Darling Minor_Axis_Length     0.737  0.0548      YES   
5 Anderson-Darling   Eccentricity       21.130  <0.001      NO    
6 Anderson-Darling    Convex_Area       13.805  <0.001      NO    
7 Anderson-Darling      Extent          36.214  <0.001      NO    

$Descriptives
                     n      Mean   Std.Dev    Median      Min       Max
Area              1999 12867.031 1442.9019 12770.000 9732.000 16513.000
Perimeter         1999   458.527   30.3618   458.733  390.957   528.176
Major_Axis_Length 1999   190.890   15.0359   191.897  160.227   222.672
Minor_Axis_Length 1999    86.755    4.3251    86.784   74.396   100.133
Eccentricity      1999     0.889    0.0169     0.892    0.842     0.921
Convex_Area       1999 13153.583 1480.4424 13072.000 9887.000 16950.000
Extent            1999     0.657    0.0724     0.641    0.534     0.823
                       25th      75th    Skew Kurtosis
Area              11701.500 14058.500  0.0716   -0.990
Perimeter           432.680   484.235 -0.0253   -1.128
Major_Axis_Length   177.357   203.897 -0.0424   -1.208
Minor_Axis_Length    83.756    89.887 -0.0579   -0.335
Eccentricity          0.877     0.902 -0.4762   -0.623
Convex_Area       11953.500 14388.500  0.0713   -0.994
Extent                0.596     0.715  0.4308   -0.984

Testing homogeneity of covariance matrices across groups

# Box's M test for homogeneity of covariance matrices
# Ho: The covariance matrices across groups are equal
# Ha: The covariance matrices across groups are not equal
# Rejecting Ho favors QDA, which allows class-specific covariance matrices
boxM(data = base[, 1:7], grouping = base[, 8])

    Box's M-test for Homogeneity of Covariance Matrices

data:  base[, 1:7]
Chi-Sq (approx.) = 7056, df = 28, p-value <0.0000000000000002

Standardizing the data

# Scale the independent variables and append the qualitative variable
# (note: cbind() coerces the factor to its numeric codes 1/2)
baseScale <- cbind(scale(base[,1:7]), base[,8])

# Convert to data frame
baseScale <- as.data.frame(baseScale)

# Name the qualitative variable
colnames(baseScale)[colnames(baseScale) == "V8"] <- "Class"


# Recode the numeric codes back to the group names
for (i in seq_len(nrow(baseScale))) { 
  if (baseScale[i, 8] == 1) {
    baseScale[i,8]="Cammeo"
  } else if (baseScale[i, 8] == 2) {
    baseScale[i,8]="Osmancik"
  }
}
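For reference, the same recode can be done without a loop; a vectorized sketch (it assumes the numeric codes 1/2 produced by cbind() above):

# Hedged alternative: recode and convert to factor in one step
baseScale$Class <- factor(baseScale$Class, levels = c(1, 2),
                          labels = c("Cammeo", "Osmancik"))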

# Make the "Class" variable a "factor"
baseScale$Class <- as.factor(baseScale$Class)

# Check structure
str(baseScale)
'data.frame':   3810 obs. of  8 variables:
 $ Area             : num  1.48 1.148 1.135 0.293 1.166 ...
 $ Perimeter        : num  2.004 1.126 1.317 0.115 1.487 ...
 $ Major_Axis_Length: num  2.348 0.988 1.452 0.261 1.316 ...
 $ Minor_Axis_Length: num  -0.213 0.945 0.254 0.198 0.523 ...
 $ Eccentricity     : num  2.018 0.41 1.213 0.24 0.952 ...
 $ Convex_Area      : num  1.499 1.193 1.126 0.234 1.3 ...
 $ Extent           : num  -1.153 -0.602 0.406 -0.275 -0.206 ...
 $ Class            : Factor w/ 2 levels "Cammeo","Osmancik": 1 1 1 1 1 1 1 1 1 1 ...
kable(rbind(head(baseScale),tail(baseScale))) %>% kable_styling()
Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity Convex_Area Extent Class
1 1.480 2.004 2.348 -0.213 2.018 1.499 -1.153 Cammeo
2 1.148 1.126 0.988 0.945 0.410 1.193 -0.602 Cammeo
3 1.135 1.317 1.452 0.254 1.213 1.126 0.406 Cammeo
4 0.293 0.115 0.261 0.198 0.240 0.234 -0.275 Cammeo
5 1.166 1.487 1.316 0.523 0.952 1.300 -0.206 Cammeo
6 0.468 0.640 0.646 0.059 0.694 0.469 -0.052 Cammeo
3805 -0.096 -0.069 0.227 -0.544 0.729 -0.149 0.736 Osmancik
3806 -0.708 -1.078 -1.048 -0.097 -1.085 -0.745 0.247 Osmancik
3807 -0.602 -0.923 -1.207 0.550 -1.970 -0.590 0.419 Osmancik
3808 -0.133 -0.330 -0.298 0.085 -0.275 -0.173 -0.456 Osmancik
3809 -1.608 -1.740 -1.581 -1.414 -0.599 -1.607 -0.037 Osmancik
3810 -0.712 -1.391 -1.587 0.795 -2.939 -0.766 1.826 Osmancik
# summary
summary(baseScale)
      Area         Perimeter      Major_Axis_Length Minor_Axis_Length
 Min.   :-2.95   Min.   :-2.673   Min.   :-2.494    Min.   :-4.67    
 1st Qu.:-0.75   1st Qu.:-0.789   1st Qu.:-0.827    1st Qu.:-0.63    
 Median :-0.14   Median :-0.151   Median :-0.170    Median : 0.02    
 Mean   : 0.00   Mean   : 0.000   Mean   : 0.000    Mean   : 0.00    
 3rd Qu.: 0.74   3rd Qu.: 0.827   3rd Qu.: 0.847    3rd Qu.: 0.67    
 Max.   : 3.61   Max.   : 2.646   Max.   : 2.879    Max.   : 3.70    
  Eccentricity    Convex_Area        Extent            Class     
 Min.   :-5.27   Min.   :-2.94   Min.   :-2.130   Cammeo  :1630  
 1st Qu.:-0.70   1st Qu.:-0.75   1st Qu.:-0.817   Osmancik:2180  
 Median : 0.10   Median :-0.14   Median :-0.215                  
 Mean   : 0.00   Mean   : 0.00   Mean   : 0.000                  
 3rd Qu.: 0.76   3rd Qu.: 0.75   3rd Qu.: 0.837                  
 Max.   : 2.94   Max.   : 3.46   Max.   : 2.578                  
kable(describe(baseScale[1:7])) %>% kable_styling()
vars n mean sd median trimmed mad min max range skew kurtosis se
Area 1 3810 0 1 -0.142 -0.036 1.060 -2.95 3.60 6.56 0.325 -0.433 0.016
Perimeter 2 3810 0 1 -0.151 -0.026 1.162 -2.67 2.65 5.32 0.221 -0.842 0.016
Major_Axis_Length 3 3810 0 1 -0.170 -0.032 1.196 -2.49 2.88 5.37 0.260 -0.953 0.016
Minor_Axis_Length 4 3810 0 1 0.021 0.013 0.959 -4.67 3.71 8.38 -0.135 0.558 0.016
Eccentricity 5 3810 0 1 0.105 0.043 1.060 -5.27 2.94 8.20 -0.449 0.068 0.016
Convex_Area 6 3810 0 1 -0.138 -0.036 1.067 -2.94 3.46 6.40 0.320 -0.468 0.016
Extent 7 3810 0 1 -0.215 -0.041 1.125 -2.13 2.58 4.71 0.344 -1.031 0.016

Computing the prior probabilities

n1=0
n2=0

# Use a for loop to count the number of elements in each group
for (i in seq_len(nrow(baseScale))) { 
  if (baseScale[i, 8] == "Cammeo") {
    n1 <- n1 + 1
  } else if (baseScale[i, 8] == "Osmancik") {
    n2 <- n2 + 1
  }
}

n1
[1] 1630
n2
[1] 2180
# Probabilities
paste("Prior probability of class 'Cammeo': ", n1/(n1+n2))
[1] "Prior probability of class 'Cammeo':  0.427821522309711"
paste("Prior probability of class 'Osmancik': ", n2/(n1+n2))
[1] "Prior probability of class 'Osmancik':  0.572178477690289"

Building the discriminant function

modelo_qda <- qda(Class ~ Area + Perimeter + Major_Axis_Length + Minor_Axis_Length + Eccentricity + Convex_Area + Extent, data = baseScale)

# Model summary
modelo_qda$lev
[1] "Cammeo"   "Osmancik"
modelo_qda$counts
  Cammeo Osmancik 
    1630     2180 
modelo_qda$N
[1] 3810
modelo_qda$prior
  Cammeo Osmancik 
   0.428    0.572 
modelo_qda$call
qda(formula = Class ~ Area + Perimeter + Major_Axis_Length + 
    Minor_Axis_Length + Eccentricity + Convex_Area + Extent, 
    data = baseScale)
modelo_qda$means
           Area Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity
Cammeo    0.863     0.933             0.957             0.428        0.681
Osmancik -0.645    -0.697            -0.716            -0.320       -0.509
         Convex_Area Extent
Cammeo         0.868 -0.136
Osmancik      -0.649  0.102
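For intuition, QDA assigns an observation x to the class k with the largest quadratic discriminant score delta_k(x) = -1/2 log|S_k| - 1/2 (x - mu_k)' S_k^{-1} (x - mu_k) + log(pi_k), where mu_k and S_k are the class mean and covariance and pi_k the prior. The sketch below re-derives the rule from this formula for the first observation; it is an illustration, not how MASS::qda computes internally:

# Hedged sketch: quadratic discriminant scores for one observation
qda_score <- function(x, mu, Sigma, prior) {
  -0.5 * log(det(Sigma)) - 0.5 * mahalanobis(x, mu, Sigma) + log(prior)
}
x1 <- as.numeric(baseScale[1, 1:7])
g  <- split(baseScale[, 1:7], baseScale$Class)
c(Cammeo   = qda_score(x1, colMeans(g$Cammeo),   cov(g$Cammeo),   0.428),
  Osmancik = qda_score(x1, colMeans(g$Osmancik), cov(g$Osmancik), 0.572))
# the class with the larger score is the predicted class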

Classifying new observations

# 3 new test observations will be added
obs_prueba <- data.frame(
  Area = c(16500, 11550, 15900),
  Perimeter = c(700, 440, 400),
  Major_Axis_Length = c(150, 180, 170),
  Minor_Axis_Length = c(100, 85, 85),
  Eccentricity = c(0.7, 0.88, 0.66),
  Convex_Area = c(13500, 11960, 15110),
  Extent = c(0.5, 0.6, 0.7)
)

str(obs_prueba)
'data.frame':   3 obs. of  7 variables:
 $ Area             : num  16500 11550 15900
 $ Perimeter        : num  700 440 400
 $ Major_Axis_Length: num  150 180 170
 $ Minor_Axis_Length: num  100 85 85
 $ Eccentricity     : num  0.7 0.88 0.66
 $ Convex_Area      : num  13500 11960 15110
 $ Extent           : num  0.5 0.6 0.7
# Save the statistics used to scale the new data:
baseScale2 <- scale(base[1:7])
media <- attr(baseScale2, "scaled:center")
desvio <- attr(baseScale2, "scaled:scale")

# Scale the new data
obs_prueba_scale <- scale(obs_prueba, center = media, scale = desvio)
obs_prueba_scale <- as.data.frame(obs_prueba_scale)

str(obs_prueba_scale)
'data.frame':   3 obs. of  7 variables:
 $ Area             : num  2.212 -0.645 1.866
 $ Perimeter        : num  6.9 -0.4 -1.52
 $ Major_Axis_Length: num  -2.222 -0.503 -1.076
 $ Minor_Axis_Length: num  2.389 -0.229 -0.229
 $ Eccentricity     : num  -8.98 -0.33 -10.9
 $ Convex_Area      : num  0.308 -0.559 1.214
 $ Extent           : num  -2.097 -0.802 0.493
# Classify the new data (note: the model was fit on standardized variables; see the sketch after this output)
predict(object = modelo_qda, newdata = obs_prueba)
$class
[1] Cammeo Cammeo Cammeo
Levels: Cammeo Osmancik

$posterior
  Cammeo Osmancik
1      1        0
2      1        0
3      1        0
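Note that modelo_qda was fit on the standardized variables, so strictly the scaled observations (obs_prueba_scale) are what should be passed to predict(); a sketch of that call, whose output is not reproduced here and may differ from the one above:

predict(object = modelo_qda, newdata = obs_prueba_scale)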

Model validation

Evaluating classification errors

# classify using the same observations used to fit the QDA discriminant model
predicciones <- predict(object = modelo_qda, newdata = baseScale[,1:7],
                        method = "predictive")

# Show as a table
table(baseScale$Class, predicciones$class,
      dnn = c("Actual class", "Predicted class"))
             Predicted class
Actual class  Cammeo Osmancik
    Cammeo      1547       83
    Osmancik     247     1933
# Error percentage
training_error <- mean(baseScale$Class != predicciones$class) * 100
paste("training_error=", round(training_error,2), "%")
[1] "training_error= 8.66 %"

Confusion matrix

# Build the confusion matrix
confusionMatrix(predicciones$class, baseScale$Class)
Confusion Matrix and Statistics

          Reference
Prediction Cammeo Osmancik
  Cammeo     1547      247
  Osmancik     83     1933
                                             
               Accuracy : 0.913              
                 95% CI : (0.904, 0.922)     
    No Information Rate : 0.572              
    P-Value [Acc > NIR] : <0.0000000000000002
                                             
                  Kappa : 0.825              
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.949              
            Specificity : 0.887              
         Pos Pred Value : 0.862              
         Neg Pred Value : 0.959              
             Prevalence : 0.428              
         Detection Rate : 0.406              
   Detection Prevalence : 0.471              
      Balanced Accuracy : 0.918              
                                             
       'Positive' Class : Cammeo             
                                             

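As a check on the reported Kappa, it can be recomputed from the confusion counts as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the agreement expected by chance; a minimal sketch:

# Hedged sketch: Cohen's kappa from the confusion table above
tab <- matrix(c(1547, 247, 83, 1933), nrow = 2, byrow = TRUE)
po  <- sum(diag(tab)) / sum(tab)
pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
(po - pe) / (1 - pe)  # ~0.825, matching the caret output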
Plotting the classifications

# partimat(Class ~ Area + Perimeter + Major_Axis_Length + Minor_Axis_Length + Eccentricity + Convex_Area + Extent, data = baseScale, main="Discrimination with QDA",
#          method = "qda", image.colors = c("salmon", "skyblue"),
#          col.mean = "snow2")


library(combinat) # combinatorics
# Vector with the variable names
vars <- c("Area", "Perimeter", "Major_Axis_Length", "Minor_Axis_Length",
          "Eccentricity", "Convex_Area", "Extent")

# Generate all possible 2-variable combinations
comb_2vars <- combn(vars, 2, simplify = FALSE)

# Loop over each combination and plot
for (i in seq_along(comb_2vars)) {
  
  vars_formula <- comb_2vars[[i]]
  formula <- as.formula(paste("Class ~", paste(vars_formula, collapse = " + ")))
  
  # Display on screen
  partimat(formula,
           data = baseScale,
           method = "qda",
           main = paste("QDA with:", paste(vars_formula, collapse = ", ")),
           image.colors = c("salmon", "skyblue"),
           col.mean = "snow2")
}

Python Application

import numpy as np
import pandas as pd
import csv
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, classification_report, cohen_kappa_score
import itertools
#from statsmodels.stats.inter_rater import cohens_kappa
from scipy.stats import jarque_bera
import pingouin as pg
from scipy.spatial.distance import mahalanobis
from sklearn.covariance import EmpiricalCovariance
from bioinfokit.analys import stat
import seaborn as sns
from scipy.io import arff

Initial exploration

base=pd.read_csv(r"C:\Users\MINEDUCYT\Documents\Seminario\rice+cammeo+and+osmancik\Rice_Cammeo_Osmancik_.csv", sep=",")

pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns

# variable information
base.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3810 entries, 0 to 3809
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Area               3810 non-null   int64  
 1   Perimeter          3810 non-null   float64
 2   Major_Axis_Length  3810 non-null   float64
 3   Minor_Axis_Length  3810 non-null   float64
 4   Eccentricity       3810 non-null   float64
 5   Convex_Area        3810 non-null   int64  
 6   Extent             3810 non-null   float64
 7   Class              3810 non-null   object 
dtypes: float64(5), int64(2), object(1)
memory usage: 238.3+ KB
# Missing values:
# A new csv file was created in R from the "base" data frame to import into Python; it contains no missing values.

# Descriptive statistics
base.describe()
               Area    Perimeter  Major_Axis_Length  Minor_Axis_Length  \
count   3810.000000  3810.000000        3810.000000        3810.000000   
mean   12667.727559   454.239180         188.776222          86.313750   
std     1732.367706    35.597081          17.448679           5.729817   
min     7551.000000   359.100006         145.264465          59.532406   
25%    11370.500000   426.144753         174.353855          82.731695   
50%    12421.500000   448.852493         185.810059          86.434647   
75%    13950.000000   483.683746         203.550438          90.143677   
max    18913.000000   548.445984         239.010498         107.542450   

       Eccentricity   Convex_Area       Extent  
count   3810.000000   3810.000000  3810.000000  
mean       0.886871  12952.496850     0.661934  
std        0.020818   1776.972042     0.077239  
min        0.777233   7723.000000     0.497413  
25%        0.872402  11626.250000     0.598862  
50%        0.889050  12706.500000     0.645361  
75%        0.902588  14284.000000     0.726562  
max        0.948007  19099.000000     0.861050  
# Check
base.head()
    Area   Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity  \
0  15231  525.578979         229.749878          85.093788      0.928882   
1  14656  494.311005         206.020065          91.730972      0.895405   
2  14634  501.122009         214.106781          87.768288      0.912118   
3  13176  458.342987         193.337387          87.448395      0.891861   
4  14688  507.166992         211.743378          89.312454      0.906691   

   Convex_Area    Extent   Class  
0        15617  0.572896  Cammeo  
1        15072  0.615436  Cammeo  
2        14954  0.693259  Cammeo  
3        13368  0.640669  Cammeo  
4        15262  0.646024  Cammeo  
base.tail()
       Area   Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity  \
3805  11441  415.858002         170.486771          85.756592      0.864280   
3806  11625  421.390015         167.714798          89.462570      0.845850   
3807  12437  442.498993         183.572922          86.801979      0.881144   
3808   9882  392.296997         161.193985          78.210480      0.874406   
3809  11434  404.709991         161.079269          90.868195      0.825692   

      Convex_Area    Extent     Class  
3805        11628  0.681012  Osmancik  
3806        11904  0.694279  Osmancik  
3807        12645  0.626739  Osmancik  
3808        10097  0.659064  Osmancik  
3809        11591  0.802949  Osmancik  
# Boxplots
paleta_colores = np.array([
    "#00868B",  # turquoise4
    "#A2CD5A",  # darkolivegreen3
    "#BDB76B",  # darkkhaki
    "#9BCD9B",  # darkseagreen3
    "#CD6600",  # darkorange3
    "#8B6969",  # rosybrown4
    "#CD96CD"   # plum3
])

for i, columna in enumerate(base.columns[:7]):
    plt.clf()
    sns.boxplot(y=base[columna], color=paleta_colores[i])
    plt.title(f'Boxplot - {columna}')
    plt.show()

# Histograms
for i, columna in enumerate(base.columns[:7]):
    plt.clf()
    sns.histplot(base[columna], kde=True, bins=15, color=paleta_colores[i])
    plt.title(f'Histogram - {columna}')
    plt.ylabel("Frequency")
    plt.show()

# Pairwise visualization by class
plt.clf()
sns.pairplot(base, hue='Class', diag_kind='hist', markers=["o", "o"], palette=['salmon', 'skyblue'])

plt.show()

Testing univariate and multivariate normality

print("Normalidad univariante \n",
      "Prueba de jarque bera para normalidad \n",
      "Ho: Los datos siguen una distribución normal \n",
      "Ha: Los datos no siguen una distribución normal \n")
Normalidad univariante 
 Prueba de jarque bera para normalidad 
 Ho: Los datos siguen una distribución normal 
 Ha: Los datos no siguen una distribución normal 
for columna in base.columns[:-1]:
    jb_stat, jb_p = jarque_bera(base[columna])
    print(f"Variable: {columna} | JB statistic = {jb_stat:.4f}, p-value = {jb_p:.4f} \n")
Variable: Area | JB statistic = 96.7233, p-value = 0.0000 

Variable: Perimeter | JB statistic = 143.2961, p-value = 0.0000 

Variable: Major_Axis_Length | JB statistic = 186.8867, p-value = 0.0000 

Variable: Minor_Axis_Length | JB statistic = 61.2948, p-value = 0.0000 

Variable: Eccentricity | JB statistic = 128.8224, p-value = 0.0000 

Variable: Convex_Area | JB statistic = 99.4744, p-value = 0.0000 

Variable: Extent | JB statistic = 243.5291, p-value = 0.0000 
# Multivariate normality
print("\nMultivariate normality - Henze-Zirkler (HZ):")

Multivariate normality - Henze-Zirkler (HZ):
# note: iloc[:, 0:6] takes only the first six columns, so Extent is left out;
# iloc[:, 0:7] would include all seven predictors
hz_test = pg.multivariate_normality(base.iloc[:, 0:6], alpha=0.05)
print(hz_test)
HZResults(hz=np.float64(16.667633968806907), pval=np.float64(0.0), normal=False)

Testing homogeneity of covariance matrices across groups

# Box's M test for homogeneity of covariance matrices
# Ho: The covariance matrices across groups are equal
# Ha: The covariance matrices across groups are not equal

# 'Class' as a category
base2=base.copy()
base2['Class'] = base2['Class'].astype('category')

# Run Box's M test
resultado_mbox = stat()
resultado_mbox.box_m(df=base2, group_col='Class')

print("Prueba de M box para homogeneidad entre covarianzas \n",
      "Ho: Las matrices de covarianzas entre grupos son iguales \n",
      "Ha: Las matrices de covarianzas entre grupo no son iguales \n")

print("Estadístico Box's M:", resultado_mbox.boxm['test_stat'])
print("Valor p:", resultado_mbox.boxm['p_value'])

Standardizing the data

# Scale the first 7 columns (independent variables)
scaler = StandardScaler()
Xscale = scaler.fit_transform(base.iloc[:, 0:7])

# Convert the scaled result to a DataFrame
baseScale = pd.DataFrame(Xscale, columns=base.columns[0:7])

# Append the 'Class' variable
baseScale['Class'] = base['Class'].copy()

# Replace values 1 and 2 in 'Class' with "Cammeo" and "Osmancik" (a no-op here, since the csv already stores the names; kept for parity with the R code)
baseScale['Class'] = baseScale['Class'].replace({1: 'Cammeo', 2: 'Osmancik'})

# Convert 'Class' to category
baseScale['Class'] = baseScale['Class'].astype('category')

# Check structure
baseScale.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3810 entries, 0 to 3809
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Area               3810 non-null   float64 
 1   Perimeter          3810 non-null   float64 
 2   Major_Axis_Length  3810 non-null   float64 
 3   Minor_Axis_Length  3810 non-null   float64 
 4   Eccentricity       3810 non-null   float64 
 5   Convex_Area        3810 non-null   float64 
 6   Extent             3810 non-null   float64 
 7   Class              3810 non-null   category
dtypes: category(1), float64(7)
memory usage: 212.3 KB
print(pd.concat([baseScale.head(), baseScale.tail()]))
          Area  Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity  \
0     1.479830   2.004354           2.348547          -0.212943      2.018337   
1     1.147870   1.125853           0.988390           0.945568      0.410018   
2     1.135169   1.317214           1.451908           0.253887      1.212956   
3     0.293436   0.115300           0.261439           0.198051      0.239751   
4     1.166345   1.487053           1.316442           0.523419      0.952221   
3805 -0.708215  -1.078353          -1.048323          -0.097251     -1.085282   
3806 -0.601988  -0.922926          -1.207208           0.549622     -1.970731   
3807 -0.133204  -0.329851          -0.298245           0.085220     -0.275099   
3808 -1.608257  -1.740320          -1.580971          -1.414414     -0.598821   
3809 -0.712256  -1.391566          -1.587546           0.794972     -2.939160   

      Convex_Area    Extent     Class  
0        1.499659 -1.152921    Cammeo  
1        1.192918 -0.602079    Cammeo  
2        1.126504  0.405611    Cammeo  
3        0.233857 -0.275351    Cammeo  
4        1.299855 -0.206013    Cammeo  
3805    -0.745465  0.247031  Osmancik  
3806    -0.590124  0.418815  Osmancik  
3807    -0.173068 -0.455731  Osmancik  
3808    -1.607156 -0.037168  Osmancik  
3809    -0.766290  1.825947  Osmancik  
# Summary
print(baseScale.describe())
              Area     Perimeter  Major_Axis_Length  Minor_Axis_Length  \
count  3810.000000  3.810000e+03       3.810000e+03       3.810000e+03   
mean      0.000000 -2.983907e-16       1.193563e-16      -4.177469e-16   
std       1.000131  1.000131e+00       1.000131e+00       1.000131e+00   
min      -2.953991 -2.673019e+00      -2.494027e+00      -4.674645e+00   
25%      -0.748916 -7.893376e-01      -8.266678e-01      -6.252425e-01   
50%      -0.142152 -1.513437e-01      -1.700159e-01       2.110226e-02   
75%       0.740282  8.272709e-01       8.468352e-01       6.685081e-01   
max       3.605523  2.646823e+00       2.879351e+00       3.705439e+00   

       Eccentricity   Convex_Area        Extent  
count  3.810000e+03  3.810000e+03  3.810000e+03  
mean   7.758157e-16 -4.774250e-16 -4.475860e-16  
std    1.000131e+00  1.000131e+00  1.000131e+00  
min   -5.267280e+00 -2.943312e+00 -2.130313e+00  
25%   -6.951263e-01 -7.464501e-01 -8.166890e-01  
50%    1.047129e-01 -1.384541e-01 -2.145916e-01  
75%    7.551068e-01  7.494085e-01  8.368337e-01  
max    2.937147e+00  3.459430e+00  2.578260e+00  

Computing the prior probabilities

priori = baseScale['Class'].value_counts(normalize=True)
print("Prior probabilities:\n", priori)
Prior probabilities:
 Class
Osmancik    0.572178
Cammeo      0.427822
Name: proportion, dtype: float64

Building the discriminant function

modelo_lda = LinearDiscriminantAnalysis()
modelo_lda.fit(baseScale.iloc[:, 0:7], baseScale['Class'])
LinearDiscriminantAnalysis()
############### Summary
# Classes
print("Classes:", modelo_lda.classes_)
Classes: ['Cammeo' 'Osmancik']
# Number of observations per class
(unique, counts) = np.unique(baseScale['Class'], return_counts=True)
print("Counts:", dict(zip(unique, counts)))
Counts: {'Cammeo': np.int64(1630), 'Osmancik': np.int64(2180)}
# Total number of observations
print("N total:", len(baseScale))
N total: 3810
# Prior probabilities
print("Prior probabilities:", modelo_lda.priors_)
Prior probabilities: [0.42782152 0.57217848]
# Class means
print("Means:")
Means:
print(pd.DataFrame(modelo_lda.means_, columns=baseScale.columns[0:7], index=modelo_lda.classes_))
              Area  Perimeter  Major_Axis_Length  Minor_Axis_Length  \
Cammeo    0.863189   0.932776           0.957354           0.428304   
Osmancik -0.645412  -0.697443          -0.715820          -0.320246   

          Eccentricity  Convex_Area    Extent  
Cammeo        0.681064     0.867843 -0.136148  
Osmancik     -0.509236    -0.648892  0.101798  
# Coefficients of the linear discriminant function
coef_df = pd.DataFrame(modelo_lda.coef_, columns=baseScale.columns[0:7])
print("Coefficients of the discriminant function:")
Coefficients of the discriminant function:
print(coef_df)
        Area  Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity  \
0  13.374122   3.367574          -5.263212           3.418791      1.351427   

   Convex_Area    Extent  
0   -20.028283  0.040976  
# Intercept of the linear discriminant function
print("Intercept:")
Intercept:
print(modelo_lda.intercept_[0])
0.966318883957916

Classifying new observations

# New, unscaled observations
obs_prueba = pd.DataFrame({
    'Area': [16500, 11550, 15900],
    'Perimeter': [700, 440, 400],
    'Major_Axis_Length': [150, 180, 170],
    'Minor_Axis_Length': [100, 85, 85],
    'Eccentricity': [0.7, 0.88, 0.66],
    'Convex_Area': [13500, 11960, 15110],
    'Extent': [0.5, 0.6, 0.7]
})

# Scale the new observations using the same parameters from training
obs_prueba_scale = pd.DataFrame(
    scaler.transform(obs_prueba), # Uses the mean and sd from the earlier fit
    columns=obs_prueba.columns
)

# Classify the new observations
predicciones_nuevas = modelo_lda.predict(obs_prueba_scale)
print("Predictions:", predicciones_nuevas)
Predictions: ['Osmancik' 'Osmancik' 'Cammeo']

Model validation

Evaluating classification errors

# Classify using the same observations used to fit the LDA discriminant model
predic_datos_originales_scale = modelo_lda.predict(baseScale.iloc[:, 0:7])

# Confusion matrix
matriz = confusion_matrix(baseScale['Class'], predic_datos_originales_scale)
print("Confusion matrix:")
Confusion matrix:
print(matriz)
[[1464  166]
 [  99 2081]]
# Training error percentage
training_error = (1 - accuracy_score(baseScale['Class'], predic_datos_originales_scale)) * 100
print(f"Training error: {training_error:.2f}%")
Training error: 6.96%

Confusion matrix

# Plot the matrix
plt.clf()
disp = ConfusionMatrixDisplay(confusion_matrix=matriz, display_labels=modelo_lda.classes_)
disp.plot(cmap='Blues')
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x0000021DEFB6C2F0>
plt.title("Matriz de Confusión - LDA")
plt.show()

# Summary
reporte = classification_report(baseScale['Class'], predic_datos_originales_scale, target_names=modelo_lda.classes_)
print("Model summary - classification on the training data:")
Model summary - classification on the training data:
print(reporte)
              precision    recall  f1-score   support

      Cammeo       0.94      0.90      0.92      1630
    Osmancik       0.93      0.95      0.94      2180

    accuracy                           0.93      3810
   macro avg       0.93      0.93      0.93      3810
weighted avg       0.93      0.93      0.93      3810
kappa = cohen_kappa_score(baseScale['Class'], predic_datos_originales_scale)
print(f"Kappa index: {kappa:.4f}")
Kappa index: 0.8572

# Apply the Kappa hypothesis test
# kappa_result = cohens_kappa(matriz)
# print("Agreement between actual and predicted classes \n",
#       "Kappa test \n",
#       "Ho: The observed agreement equals that expected by chance (index = 0) \n",
#       "Ha: The agreement is better than chance. \n")
# print("Kappa index:", round(kappa_result.kappa, 4))
# print("p-value:", round(kappa_result.pvalue, 4))
# print("95% confidence interval:", kappa_result.confint)

Plotting the classifications

# Independent variables
variables = ['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length',
             'Eccentricity', 'Convex_Area', 'Extent']

# Generate all 2-variable combinations
combinaciones = list(itertools.combinations(variables, 2))

# independent variables and the qualitative variable
X = baseScale[variables]
y = baseScale['Class']

# Create the LabelEncoder outside the loop for consistency
# It assigns a numeric label to the qualitative variable;
# in this case 0=Cammeo, 1=Osmancik
le = LabelEncoder()
le.fit(y)
LabelEncoder()
for var1, var2 in combinaciones:
    X_pair = X[[var1, var2]]
    lda_pair = LinearDiscriminantAnalysis()
    lda_pair.fit(X_pair, y)

    # Build the grid of points
    x_min, x_max = X_pair[var1].min() - 1, X_pair[var1].max() + 1
    y_min, y_max = X_pair[var2].min() - 1, X_pair[var2].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]

    # Predict on the grid
    Z = lda_pair.predict(grid)

    # Convert labels to numbers using the already-fitted encoder
    Z_num = le.transform(Z)
    Z_num = Z_num.reshape(xx.shape)

    plt.figure(figsize=(6, 5))
    plt.contourf(xx, yy, Z_num, alpha=0.3, levels=np.arange(len(le.classes_)+1)-0.5,
                 colors=['salmon', 'skyblue'])
    sns.scatterplot(data=baseScale, x=var1, y=var2, hue='Class', 
                    palette={'Cammeo': 'salmon', 'Osmancik': 'skyblue'}, edgecolor='black')
    plt.title(f"LDA con: {var1} & {var2}")
    plt.xlabel(var1)
    plt.ylabel(var2)
    plt.tight_layout()
    plt.show()