Linear Discriminant Analysis (LDA)

Author

Katerin Zepeda

Published

June 19, 2025

1 Discriminant Analysis in RStudio

1.1 Load the required libraries

library(readxl)
Warning: package 'readxl' was built under R version 4.3.3
library(MASS)
Warning: package 'MASS' was built under R version 4.3.2

1.2 Load and prepare the dataset

library(readxl)

# Read the Excel file
cancer <- read_excel("C:/Users/MINEDUCYT/Desktop/CICLOI_2025/SeminarioI/breast_cancer_wisconsin.xlsx")

# Show the first 10 rows
head(cancer,10)
# A tibble: 10 × 31
   `mean radius` `mean texture` `mean perimeter` `mean area` `mean smoothness`
           <dbl>          <dbl>            <dbl>       <dbl>             <dbl>
 1          18.0           10.4            123.        1001             0.118 
 2          20.6           17.8            133.        1326             0.0847
 3          19.7           21.2            130         1203             0.110 
 4          11.4           20.4             77.6        386.            0.142 
 5          20.3           14.3            135.        1297             0.100 
 6          12.4           15.7             82.6        477.            0.128 
 7          18.2           20.0            120.        1040             0.0946
 8          13.7           20.8             90.2        578.            0.119 
 9          13             21.8             87.5        520.            0.127 
10          12.5           24.0             84.0        476.            0.119 
# ℹ 26 more variables: `mean compactness` <dbl>, `mean concavity` <dbl>,
#   `mean concave points` <dbl>, `mean symmetry` <dbl>,
#   `mean fractal dimension` <dbl>, `radius error` <dbl>,
#   `texture error` <dbl>, `perimeter error` <dbl>, `area error` <dbl>,
#   `smoothness error` <dbl>, `compactness error` <dbl>,
#   `concavity error` <dbl>, `concave points error` <dbl>,
#   `symmetry error` <dbl>, `fractal dimension error` <dbl>, …

1.2.1 Convert the target variable to a factor

cancer$diagnosis <- as.factor(cancer$diagnosis)
head(cancer$diagnosis)
[1] 0 0 0 0 0 0
Levels: 0 1

The diagnosis variable has been converted to a factor, which is required before fitting a classification model such as LDA.

In the file, diagnosis is stored as a number (0 or 1). As a factor it represents two categorical classes. The coding below is the one consistent with the group means in section 1.4, where class 0 has the larger mean radius:

0: the "malignant" class

1: the "benign" class
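The same recoding can be sketched in plain Python, the language used in the second half of this report. The labels dictionary is an assumption for illustration: it follows the scikit-learn coding of this dataset (0 = malignant, 1 = benign), and the six codes are the ones printed by head(cancer$diagnosis) above.

```python
from collections import Counter

# Assumed label mapping (scikit-learn coding of this dataset); not part of the original code.
labels = {0: "malignant", 1: "benign"}

# First six diagnosis codes, as printed by head(cancer$diagnosis).
codes = [0, 0, 0, 0, 0, 0]

# Recode the numbers as categories, the role as.factor() plays in R.
categories = [labels[c] for c in codes]
counts = Counter(categories)

print(categories)
print(counts)
```

Keeping the mapping in one dictionary makes it easy to change if a different version of the file uses the opposite coding.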

1.2.2 Check the structure

str(cancer)
tibble [569 × 31] (S3: tbl_df/tbl/data.frame)
 $ mean radius            : num [1:569] 18 20.6 19.7 11.4 20.3 ...
 $ mean texture           : num [1:569] 10.4 17.8 21.2 20.4 14.3 ...
 $ mean perimeter         : num [1:569] 122.8 132.9 130 77.6 135.1 ...
 $ mean area              : num [1:569] 1001 1326 1203 386 1297 ...
 $ mean smoothness        : num [1:569] 0.1184 0.0847 0.1096 0.1425 0.1003 ...
 $ mean compactness       : num [1:569] 0.2776 0.0786 0.1599 0.2839 0.1328 ...
 $ mean concavity         : num [1:569] 0.3001 0.0869 0.1974 0.2414 0.198 ...
 $ mean concave points    : num [1:569] 0.1471 0.0702 0.1279 0.1052 0.1043 ...
 $ mean symmetry          : num [1:569] 0.242 0.181 0.207 0.26 0.181 ...
 $ mean fractal dimension : num [1:569] 0.0787 0.0567 0.06 0.0974 0.0588 ...
 $ radius error           : num [1:569] 1.095 0.543 0.746 0.496 0.757 ...
 $ texture error          : num [1:569] 0.905 0.734 0.787 1.156 0.781 ...
 $ perimeter error        : num [1:569] 8.59 3.4 4.58 3.44 5.44 ...
 $ area error             : num [1:569] 153.4 74.1 94 27.2 94.4 ...
 $ smoothness error       : num [1:569] 0.0064 0.00522 0.00615 0.00911 0.01149 ...
 $ compactness error      : num [1:569] 0.049 0.0131 0.0401 0.0746 0.0246 ...
 $ concavity error        : num [1:569] 0.0537 0.0186 0.0383 0.0566 0.0569 ...
 $ concave points error   : num [1:569] 0.0159 0.0134 0.0206 0.0187 0.0188 ...
 $ symmetry error         : num [1:569] 0.03 0.0139 0.0225 0.0596 0.0176 ...
 $ fractal dimension error: num [1:569] 0.00619 0.00353 0.00457 0.00921 0.00511 ...
 $ worst radius           : num [1:569] 25.4 25 23.6 14.9 22.5 ...
 $ worst texture          : num [1:569] 17.3 23.4 25.5 26.5 16.7 ...
 $ worst perimeter        : num [1:569] 184.6 158.8 152.5 98.9 152.2 ...
 $ worst area             : num [1:569] 2019 1956 1709 568 1575 ...
 $ worst smoothness       : num [1:569] 0.162 0.124 0.144 0.21 0.137 ...
 $ worst compactness      : num [1:569] 0.666 0.187 0.424 0.866 0.205 ...
 $ worst concavity        : num [1:569] 0.712 0.242 0.45 0.687 0.4 ...
 $ worst concave points   : num [1:569] 0.265 0.186 0.243 0.258 0.163 ...
 $ worst symmetry         : num [1:569] 0.46 0.275 0.361 0.664 0.236 ...
 $ worst fractal dimension: num [1:569] 0.1189 0.089 0.0876 0.173 0.0768 ...
 $ diagnosis              : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

The dataset has 569 observations and 31 variables. Thirty of them are numeric variables describing physical characteristics of cells taken from breast biopsies; the remaining one is diagnosis.

Converting diagnosis to a factor matters because it is the class that a classification model such as LDA predicts.

1.3 Split the data into training and test sets

set.seed(123)
sample_index <- sample(1:nrow(cancer), size = 0.7 * nrow(cancer)) 
train <- cancer[sample_index, ] 
test <- cancer[-sample_index, ] 


head(sample_index)
[1] 415 463 179 526 195 118
head(train)
# A tibble: 6 × 31
  `mean radius` `mean texture` `mean perimeter` `mean area` `mean smoothness`
          <dbl>          <dbl>            <dbl>       <dbl>             <dbl>
1         15.1            29.8             96.7        720.            0.0832
2         14.4            27.0             92.2        646.            0.0700
3         13.0            22.2             82.0        526.            0.0625
4          8.57           13.1             54.5        221.            0.104 
5         14.9            23.2            100.         671.            0.104 
6         14.9            16.7             98.6        682.            0.116 
# ℹ 26 more variables: `mean compactness` <dbl>, `mean concavity` <dbl>,
#   `mean concave points` <dbl>, `mean symmetry` <dbl>,
#   `mean fractal dimension` <dbl>, `radius error` <dbl>,
#   `texture error` <dbl>, `perimeter error` <dbl>, `area error` <dbl>,
#   `smoothness error` <dbl>, `compactness error` <dbl>,
#   `concavity error` <dbl>, `concave points error` <dbl>,
#   `symmetry error` <dbl>, `fractal dimension error` <dbl>, …
head(test)
# A tibble: 6 × 31
  `mean radius` `mean texture` `mean perimeter` `mean area` `mean smoothness`
          <dbl>          <dbl>            <dbl>       <dbl>             <dbl>
1          18.0           10.4            123.        1001             0.118 
2          20.6           17.8            133.        1326             0.0847
3          13             21.8             87.5        520.            0.127 
4          13.7           22.6             93.6        578.            0.113 
5          14.7           20.1             94.7        684.            0.0987
6          16.1           20.7            108.         799.            0.117 
# ℹ 26 more variables: `mean compactness` <dbl>, `mean concavity` <dbl>,
#   `mean concave points` <dbl>, `mean symmetry` <dbl>,
#   `mean fractal dimension` <dbl>, `radius error` <dbl>,
#   `texture error` <dbl>, `perimeter error` <dbl>, `area error` <dbl>,
#   `smoothness error` <dbl>, `compactness error` <dbl>,
#   `concavity error` <dbl>, `concave points error` <dbl>,
#   `symmetry error` <dbl>, `fractal dimension error` <dbl>, …

sample_index: holds the row indices drawn at random for training. The six shown (415, 463, 179, 526, 195 and 118) are only the first of the 398 indices selected, which is 70% of the rows of cancer.

train: the training data (70%). The output shows the first 6 rows of train and the first 5 of its 31 columns. These columns are measurements of tumor characteristics:

mean radius: Mean radius of the cells.

mean texture: Mean texture.

mean perimeter: Mean perimeter.

mean area: Mean area.

mean smoothness: Mean smoothness.

In the linear discriminant analysis (LDA), these attributes serve as predictor variables for classifying a tumor as benign or malignant.

test: the test data (30%), used to evaluate how well the model performs. For example, its first row has:

mean radius: 17.99

mean texture: 10.38

mean perimeter: 122.80

mean area: 1001.0

mean smoothness: 0.1184

These values describe that tumor quantitatively. In this analysis they are used to predict diagnosis, which in medical versions of the dataset is usually coded B for benign or M for malignant.
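The split itself can be sketched in plain Python. This only illustrates the idea: Python's random module does not reproduce the indices drawn by R's sample(), and split_indices is a hypothetical helper, not a function used elsewhere in this report.

```python
import random

def split_indices(n_rows, train_fraction=0.7, seed=123):
    """Return (train, test) row indices for a random split without replacement."""
    rng = random.Random(seed)
    n_train = int(train_fraction * n_rows)  # 0.7 * 569 = 398.3, truncated to 398
    train = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train)
    # Every row not drawn for training goes to the test set.
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test

train_idx, test_idx = split_indices(569)
print(len(train_idx), len(test_idx))  # 398 171
```

Fixing the seed makes the split repeatable, which is the purpose of set.seed(123) in the R code.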

1.4 Fit the LDA model

lda_model <- lda(diagnosis ~ ., data = train)

# Show a summary of the model
print(lda_model)
Call:
lda(diagnosis ~ ., data = train)

Prior probabilities of groups:
        0         1 
0.3492462 0.6507538 

Group means:
  `mean radius` `mean texture` `mean perimeter` `mean area` `mean smoothness`
0      17.28885       21.70583        114.03914    955.0360        0.10234547
1      12.14377       18.02320         78.09371    462.9328        0.09224819
  `mean compactness` `mean concavity` `mean concave points` `mean symmetry`
0          0.1419415       0.15481676            0.08541583       0.1925165
1          0.0810390       0.04729681            0.02568140       0.1752224
  `mean fractal dimension` `radius error` `texture error` `perimeter error`
0               0.06244108      0.5641899        1.240917          3.984662
1               0.06305023      0.2871579        1.213266          2.013821
  `area error` `smoothness error` `compactness error` `concavity error`
0     63.96309        0.006920676          0.03220353        0.04098705
1     21.47387        0.007235819          0.02212952        0.02706923
  `concave points error` `symmetry error` `fractal dimension error`
0            0.014970288       0.02049093               0.004145626
1            0.009969741       0.02064478               0.003778734
  `worst radius` `worst texture` `worst perimeter` `worst area`
0       20.73863        29.47806         138.45978    1360.3525
1       13.41039        23.72772          87.19842     561.9328
  `worst smoothness` `worst compactness` `worst concavity`
0          0.1450568           0.3723490         0.4432517
1          0.1251653           0.1850665         0.1689314
  `worst concave points` `worst symmetry` `worst fractal dimension`
0             0.17910417        0.3245036                0.09196381
1             0.07522487        0.2720641                0.07997440

Coefficients of linear discriminants:
                                    LD1
`mean radius`               1.258547915
`mean texture`             -0.030083088
`mean perimeter`           -0.107784270
`mean area`                -0.005176914
`mean smoothness`          -9.004044617
`mean compactness`         24.111561107
`mean concavity`           -4.749500675
`mean concave points`     -12.674770351
`mean symmetry`            -1.282530280
`mean fractal dimension`   -5.310758318
`radius error`             -0.882583945
`texture error`             0.380924695
`perimeter error`          -0.134564024
`area error`                0.001863364
`smoothness error`        -83.026413011
`compactness error`        -0.851244190
`concavity error`          16.816940282
`concave points error`    -48.437080252
`symmetry error`          -14.053438493
`fractal dimension error`  20.283804784
`worst radius`             -1.067839788
`worst texture`            -0.043361395
`worst perimeter`           0.008610691
`worst area`                0.006935269
`worst smoothness`          1.024035463
`worst compactness`        -0.291222971
`worst concavity`          -2.343462931
`worst concave points`     -2.842339599
`worst symmetry`           -2.536283098
`worst fractal dimension` -22.872838872

Prior probabilities of groups

The Prior probabilities of groups output (0: 0.3492462, 1: 0.6507538) gives the proportion of training observations in each class:

0: 34.9%, the malignant class

1: 65.1%, the benign class
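By default lda() estimates the priors as the class proportions of the training set. A minimal check in Python, assuming class counts of 139 and 259: these counts are not printed in the report, they are inferred from the priors and the 398 training rows (70% of 569, truncated).

```python
# Class counts in the training set (inferred from the printed priors, not shown in the output).
n_train = {0: 139, 1: 259}
total = sum(n_train.values())  # 398 training rows

# Prior of each class = its share of the training observations.
priors = {cls: n / total for cls, n in n_train.items()}

for cls, p in priors.items():
    print(cls, round(p, 7))
```

The rounded values agree with the 0.3492462 and 0.6507538 reported by lda().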

Group means

Next come the means of each variable for the two groups, 0 and 1. For example:

mean radius

Group 0: 17.28885

Group 1: 12.14377

Tumors in group 0 have a larger mean radius than those in group 1, so this variable could be an important discriminating feature.

Coefficients of the linear discriminant (LD1)

Positive coefficients: increasing the variable raises the LD1 score, which pushes the classification toward one class (here, class 1).

Negative coefficients: push the classification toward the other class (here, class 0).

The larger the absolute value, the more the LD1 score changes per unit of that variable. Because the variables are measured on very different scales, raw coefficients are not directly comparable across variables.

For example:

mean radius: 1.26, so with the other variables held fixed, a larger radius moves the observation toward group 1.

worst fractal dimension: -22.87, a large negative coefficient; higher values move the observation toward group 0.

mean smoothness: -9.00, so higher mean smoothness moves the observation toward group 0.
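Since the predictors are on very different scales, the size of a raw coefficient says little by itself. One rough way to see this, sketched in Python with values copied from the lda() output above, is to multiply each coefficient by the difference between the two group means: the result is how much that variable shifts the average LD1 score between the groups. The three variables are picked only for illustration.

```python
# Coefficients and group means copied from the lda() output above.
coef = {
    "mean radius": 1.258547915,
    "smoothness error": -83.026413011,
    "worst fractal dimension": -22.872838872,
}
mean_group0 = {
    "mean radius": 17.28885,
    "smoothness error": 0.006920676,
    "worst fractal dimension": 0.09196381,
}
mean_group1 = {
    "mean radius": 12.14377,
    "smoothness error": 0.007235819,
    "worst fractal dimension": 0.07997440,
}

# Shift in mean LD1 score (group 1 minus group 0) attributable to each variable.
shift = {
    name: coef[name] * (mean_group1[name] - mean_group0[name])
    for name in coef
}

for name, value in shift.items():
    print(f"{name}: {value:+.4f}")
```

With these numbers, mean radius accounts for a shift of about -6.48, while smoothness error, despite having the largest coefficient of the three, accounts for only about -0.03.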

1.5 Predict with the model

pred <- predict(lda_model, newdata = test)
head(pred)
$class
  [1] 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1
 [38] 1 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 0 1
 [75] 1 1 0 1 0 0 1 1 0 1 0 0 0 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 1 1 0 0
[112] 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0
[149] 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Levels: 0 1

$posterior
               0            1
1   9.999385e-01 6.150584e-05
2   9.985722e-01 1.427755e-03
3   9.794959e-01 2.050408e-02
4   9.070450e-01 9.295503e-02
5   9.947401e-01 5.259897e-03
6   9.998592e-01 1.407958e-04
7   9.999573e-01 4.273316e-05
8   3.619519e-05 9.999638e-01
9   9.995805e-01 4.194986e-04
10  9.998172e-01 1.827739e-04
11  9.993890e-01 6.110469e-04
12  9.272511e-01 7.274888e-02
13  1.386450e-05 9.999861e-01
14  9.999999e-01 7.435258e-08
15  9.399546e-01 6.004543e-02
16  9.999862e-01 1.378183e-05
17  2.661338e-07 9.999997e-01
18  4.889098e-04 9.995111e-01
19  3.281269e-03 9.967187e-01
20  9.940348e-01 5.965182e-03
21  3.864995e-06 9.999961e-01
22  3.391939e-06 9.999966e-01
23  9.993216e-01 6.784161e-04
24  9.990041e-01 9.959470e-04
25  3.991042e-04 9.996009e-01
26  9.992817e-01 7.183058e-04
27  9.952746e-01 4.725441e-03
28  9.999999e-01 1.437264e-07
29  8.304831e-01 1.695169e-01
30  9.999698e-01 3.020836e-05
31  2.524719e-01 7.475281e-01
32  9.910644e-01 8.935625e-03
33  4.164596e-04 9.995835e-01
34  8.102907e-01 1.897093e-01
35  5.668494e-04 9.994332e-01
36  1.000000e+00 1.360880e-09
37  1.010136e-02 9.898986e-01
38  2.698059e-03 9.973019e-01
39  1.673072e-04 9.998327e-01
40  3.746650e-04 9.996253e-01
41  6.708009e-01 3.291991e-01
42  1.764502e-05 9.999824e-01
43  5.535861e-06 9.999945e-01
44  8.924355e-01 1.075645e-01
45  1.511062e-03 9.984889e-01
46  3.666257e-02 9.633374e-01
47  1.056368e-04 9.998944e-01
48  9.870254e-01 1.297463e-02
49  9.998047e-01 1.953185e-04
50  1.267325e-06 9.999987e-01
51  9.999827e-01 1.734547e-05
52  6.227498e-04 9.993773e-01
53  9.945912e-01 5.408768e-03
54  4.033073e-05 9.999597e-01
55  9.999976e-01 2.406976e-06
56  9.716873e-01 2.831268e-02
57  8.720054e-01 1.279946e-01
58  1.104890e-03 9.988951e-01
59  2.858582e-05 9.999714e-01
60  5.356279e-06 9.999946e-01
61  3.744022e-03 9.962560e-01
62  3.993432e-10 1.000000e+00
63  3.758537e-02 9.624146e-01
64  9.999999e-01 6.059155e-08
65  5.815340e-01 4.184660e-01
66  5.214702e-06 9.999948e-01
67  1.000000e+00 1.866553e-12
68  2.148542e-01 7.851458e-01
69  9.987033e-01 1.296653e-03
70  1.817197e-06 9.999982e-01
71  9.979647e-01 2.035330e-03
72  1.036880e-03 9.989631e-01
73  9.999994e-01 6.096648e-07
74  1.043601e-05 9.999896e-01
75  1.174405e-04 9.998826e-01
76  2.575437e-02 9.742456e-01
77  9.999967e-01 3.288422e-06
78  1.173086e-01 8.826914e-01
79  9.998984e-01 1.015554e-04
80  9.999079e-01 9.212784e-05
81  1.766922e-01 8.233078e-01
82  5.076695e-02 9.492330e-01
83  9.816216e-01 1.837844e-02
84  1.259094e-04 9.998741e-01
85  9.795967e-01 2.040327e-02
86  9.999398e-01 6.024239e-05
87  9.142859e-01 8.571408e-02
88  2.015867e-02 9.798413e-01
89  3.936483e-05 9.999606e-01
90  1.958179e-07 9.999998e-01
91  9.999976e-01 2.443786e-06
92  1.901658e-05 9.999810e-01
93  9.757026e-05 9.999024e-01
94  2.729702e-06 9.999973e-01
95  9.877405e-01 1.225950e-02
96  1.436961e-05 9.999856e-01
97  3.815067e-04 9.996185e-01
98  9.946577e-01 5.342336e-03
99  6.051620e-03 9.939484e-01
100 9.999984e-01 1.610602e-06
101 7.941777e-02 9.205822e-01
102 2.583847e-04 9.997416e-01
103 3.661332e-03 9.963387e-01
104 3.519685e-04 9.996480e-01
105 9.982093e-01 1.790696e-03
106 9.998908e-01 1.091583e-04
107 4.547653e-04 9.995452e-01
108 3.476622e-04 9.996523e-01
109 2.454764e-05 9.999755e-01
110 7.700702e-01 2.299298e-01
111 9.999989e-01 1.144892e-06
112 9.999793e-01 2.073769e-05
113 8.940241e-04 9.991060e-01
114 3.473284e-03 9.965267e-01
115 9.173070e-05 9.999083e-01
116 7.122224e-01 2.877776e-01
117 2.009976e-04 9.997990e-01
118 1.012291e-04 9.998988e-01
119 1.409268e-05 9.999859e-01
120 9.999989e-01 1.054255e-06
121 2.708455e-04 9.997292e-01
122 1.181440e-04 9.998819e-01
123 2.019854e-02 9.798015e-01
124 6.645017e-05 9.999335e-01
125 2.682891e-01 7.317109e-01
126 1.100302e-03 9.988997e-01
127 8.787659e-04 9.991212e-01
128 9.999965e-01 3.496283e-06
129 2.468127e-03 9.975319e-01
130 2.807140e-04 9.997193e-01
131 3.584892e-06 9.999964e-01
132 9.942513e-06 9.999901e-01
133 9.990711e-01 9.288586e-04
134 3.689751e-03 9.963102e-01
135 3.114834e-03 9.968852e-01
136 2.576114e-03 9.974239e-01
137 2.266330e-02 9.773367e-01
138 9.999998e-01 1.635484e-07
139 1.829045e-03 9.981710e-01
140 3.240526e-01 6.759474e-01
141 3.475779e-05 9.999652e-01
142 3.112180e-04 9.996888e-01
143 1.114122e-02 9.888588e-01
144 2.365322e-03 9.976347e-01
145 9.999880e-01 1.199040e-05
146 8.339575e-03 9.916604e-01
147 9.999996e-01 3.890186e-07
148 9.911629e-01 8.837149e-03
149 9.959839e-01 4.016079e-03
150 4.691062e-02 9.530894e-01
151 9.997746e-01 2.254499e-04
152 1.169168e-03 9.988308e-01
153 9.999985e-01 1.544038e-06
154 8.950274e-03 9.910497e-01
155 1.530116e-04 9.998470e-01
156 2.423806e-03 9.975762e-01
157 1.165059e-03 9.988349e-01
158 2.115966e-06 9.999979e-01
159 1.551917e-06 9.999984e-01
160 1.853124e-04 9.998147e-01
161 3.389745e-06 9.999966e-01
162 7.087016e-06 9.999929e-01
163 7.332973e-05 9.999267e-01
164 5.633905e-06 9.999944e-01
165 2.052729e-04 9.997947e-01
166 4.637697e-07 9.999995e-01
167 1.686890e-05 9.999831e-01
168 2.698620e-04 9.997301e-01
169 3.556447e-02 9.644355e-01
170 3.501479e-07 9.999996e-01
171 9.999998e-01 1.752178e-07

$x
            LD1
1   -3.26432050
2   -2.44569686
3   -1.74736127
4   -1.33407064
5   -2.10538934
6   -3.04880511
7   -3.35908019
8    1.91967247
9   -2.76465592
10  -2.98089705
11  -2.66674013
12  -1.40357993
13   2.16936786
14  -5.01239943
15  -1.45705666
16  -3.65354000
17   3.19797736
18   1.24218021
19   0.74607546
20  -2.07246391
21   2.50174648
22   2.53571835
23  -2.63950867
24  -2.53952533
25   1.29501322
26  -2.62463169
27  -2.13341001
28  -4.84090121
29  -1.15478503
30  -3.44933656
31  -0.45886468
32  -1.96653537
33   1.28393270
34  -1.11909702
35   1.20367178
36  -6.05339093
37   0.45170599
38   0.79714883
39   1.52129185
40   1.31146191
41  -0.92652311
42   2.10662619
43   2.40825850
44  -1.29186238
45   0.94830331
46   0.10920451
47   1.64095589
48  -1.86842999
49  -2.96362101
50   2.79188723
51  -3.59369740
52   1.17918482
53  -2.09808814
54   1.89152065
55  -4.10759217
56  -1.66131465
57  -1.24058786
58   1.02987021
59   1.98108635
60   2.41683942
61   0.71162602
62   4.88980119
63   0.10248687
64  -5.06565339
65  -0.82693398
66   2.42380967
67  -7.76859653
68  -0.40410809
69  -2.47079305
70   2.69811324
71  -2.35328222
72   1.04641836
73  -4.46490812
74   2.24328441
75   1.61339059
76   0.20402559
77  -4.02639903
78  -0.21617438
79  -3.13382538
80  -3.15917870
81  -0.34087578
82   0.02067332
83  -1.77640335
84   1.59527036
85  -1.74867046
86  -3.26972154
87  -1.35724173
88   0.26925773
89   1.89782843
90   3.27781144
91  -4.10364297
92   2.08714763
93   1.66162690
94   2.59223713
95  -1.88337063
96   2.16005658
97   1.30675145
98  -2.10132120
99   0.58608196
100 -4.21213252
101 -0.10373620
102  1.40817915
103  0.71745879
104  1.32772809
105 -2.38666596
106 -3.11503812
107  1.26102692
108  1.33093237
109  2.02071509
110 -1.05581745
111 -4.30093938
112 -3.54721864
113  1.08502785
114  0.73122753
115  1.67768702
116 -0.97710465
117  1.47354585
118  1.65204693
119  2.16512027
120 -4.32239982
121  1.39592062
122  1.61183655
123  0.26873301
124  1.76158434
125 -0.48024084
126  1.03095405
127  1.08951103
128 -4.01045043
129  0.82038586
130  1.38660601
131  2.52132218
132  2.25588948
133 -2.55768874
134  0.71543952
135  0.75966366
136  0.80921519
137  0.23811879
138 -4.80728364
139  0.89852650
140 -0.55000410
141  1.93021691
142  1.35975620
143  0.42593745
144  0.83148314
145 -3.68977237
146  0.50203874
147 -4.58181285
148 -1.96944473
149 -2.17591874
150  0.04228472
151 -2.92628276
152  1.01513986
153 -4.22311489
154  0.48348940
155  1.54453633
156  0.82511248
157  1.01605700
158  2.65850607
159  2.73917456
160  1.49469144
161  2.53588675
162  2.34398342
163  1.73594897
164  2.40369043
165  1.46806812
166  3.05346339
167  2.11833078
168  1.39686750
169  0.11741354
170  3.12658896
171 -4.78935034

The $class component gives the predicted class (0 or 1, as described earlier) for each observation in the test set.

The $posterior component gives the posterior probability of belonging to each class. In the first row, for example, the probability that the observation belongs to class 0 is 99.99%, and so on for the remaining rows.

The $x component holds the projections of the observations onto the linear discriminant function (LD1). For example:

LD1

1 -3.26432050

2 -2.44569686

These scores are used to visualize the separation between the classes: the farther apart the scores of the two classes are, the better the model separates them.
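With two classes, the posterior probability is a logistic function of the LD1 score. The sketch below illustrates that link in Python. The slope and intercept are assumptions: they were back-solved from rows 1 and 8 of the output above rather than read from the fitted model, so they are approximate.

```python
import math

# Approximate values back-solved from rows 1 and 8 of the printed output
# (assumptions for illustration, not quantities reported by lda()).
SLOPE = 3.8432
INTERCEPT = 2.8490

def posterior_class1(ld1):
    """Posterior probability of class 1 for a given LD1 score."""
    log_odds = INTERCEPT + SLOPE * ld1
    return 1.0 / (1.0 + math.exp(-log_odds))

# Rows 3 and 4 of the test set, which were not used to solve for the constants.
for ld1 in (-1.74736127, -1.33407064):
    print(ld1, round(posterior_class1(ld1), 4))
```

The two results are close to the class 1 posteriors printed for rows 3 and 4 (about 0.0205 and 0.0930). Setting the log-odds to zero puts the decision threshold near LD1 = -0.74 rather than 0, which reflects the unequal priors.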

1.6 Confusion matrix

table(Predicted = pred$class, Actual = test$diagnosis)
         Actual
Predicted  0  1
        0 67  1
        1  6 97

In the confusion matrix, the columns are the actual classes:

Class 0: 67 + 6 = 73 observations

Class 1: 1 + 97 = 98 observations

The rows are the predictions. The model predicted class 0 correctly 67 times and was wrong once, predicting class 0 when the actual class was 1. It predicted class 1 correctly 97 times and was wrong 6 times, predicting class 1 when the actual class was 0.

The accuracy is the number of correct predictions divided by the total:

(67+97) / (67+1+6+97) = 164/171 = 95.91%
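The same arithmetic, plus the recall of each class, can be checked with a few lines of Python. The counts are taken from the confusion matrix above; the dictionary layout is just one convenient choice.

```python
# Confusion matrix counts keyed by (predicted, actual), copied from the table above.
counts = {(0, 0): 67, (0, 1): 1, (1, 0): 6, (1, 1): 97}

total = sum(counts.values())
correct = counts[(0, 0)] + counts[(1, 1)]
accuracy = correct / total

# Recall per class: correct predictions divided by the actual size of the class.
recall = {}
for cls in (0, 1):
    actual_total = sum(n for (_, act), n in counts.items() if act == cls)
    recall[cls] = counts[(cls, cls)] / actual_total

print(round(accuracy * 100, 2))
print({cls: round(r * 100, 2) for cls, r in recall.items()})
```

Recall is about 91.78% for class 0 (67 of 73) and 98.98% for class 1 (97 of 98), so most of the errors are class 0 cases predicted as class 1.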

1.7 Compute the accuracy

accuracy <- mean(pred$class == test$diagnosis)
cat("Model accuracy:", round(accuracy * 100, 2), "%\n")
Model accuracy: 95.91 %

The model correctly predicts 95.91% of the cases in the test set. That is well above the 57.3% obtained by always predicting the majority class (98 of 171).

1.8 Visualizing the discriminant space

par(mar = c(4, 4, 2, 1))
plot(lda_model, col = as.numeric(train$diagnosis))

In this plot, the x-axis shows the values of the linear discriminant LD1 and the y-axis shows the frequency.

The LD1 values for group 0 are concentrated in the negative range (from −6 to 0) and those for group 1 in the positive range (from 0 to 4). This indicates that LDA found a combination of variables that separates the groups well.

Because most observations in each group are well separated, the model is likely to classify new cases accurately.

2 Discriminant Analysis in Python

2.1 Load the required libraries

2.2 Load and prepare the dataset

import pandas as pd

# Read the Excel file
cancer = pd.read_excel("C:/Users/MINEDUCYT/Desktop/CICLOI_2025/SeminarioI/breast_cancer_wisconsin.xlsx")

# Show the first 10 rows
print(cancer.head(10))
   mean radius  mean texture  ...  worst fractal dimension  diagnosis
0        17.99         10.38  ...                  0.11890          0
1        20.57         17.77  ...                  0.08902          0
2        19.69         21.25  ...                  0.08758          0
3        11.42         20.38  ...                  0.17300          0
4        20.29         14.34  ...                  0.07678          0
5        12.45         15.70  ...                  0.12440          0
6        18.25         19.98  ...                  0.08368          0
7        13.71         20.83  ...                  0.11510          0
8        13.00         21.82  ...                  0.10720          0
9        12.46         24.04  ...                  0.20750          0

[10 rows x 31 columns]

2.2.1 Convert the target variable to a categorical type

cancer['diagnosis'] = cancer['diagnosis'].astype('category')

2.2.2 Check the structure

print(cancer.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   mean radius              569 non-null    float64 
 1   mean texture             569 non-null    float64 
 2   mean perimeter           569 non-null    float64 
 3   mean area                569 non-null    float64 
 4   mean smoothness          569 non-null    float64 
 5   mean compactness         569 non-null    float64 
 6   mean concavity           569 non-null    float64 
 7   mean concave points      569 non-null    float64 
 8   mean symmetry            569 non-null    float64 
 9   mean fractal dimension   569 non-null    float64 
 10  radius error             569 non-null    float64 
 11  texture error            569 non-null    float64 
 12  perimeter error          569 non-null    float64 
 13  area error               569 non-null    float64 
 14  smoothness error         569 non-null    float64 
 15  compactness error        569 non-null    float64 
 16  concavity error          569 non-null    float64 
 17  concave points error     569 non-null    float64 
 18  symmetry error           569 non-null    float64 
 19  fractal dimension error  569 non-null    float64 
 20  worst radius             569 non-null    float64 
 21  worst texture            569 non-null    float64 
 22  worst perimeter          569 non-null    float64 
 23  worst area               569 non-null    float64 
 24  worst smoothness         569 non-null    float64 
 25  worst compactness        569 non-null    float64 
 26  worst concavity          569 non-null    float64 
 27  worst concave points     569 non-null    float64 
 28  worst symmetry           569 non-null    float64 
 29  worst fractal dimension  569 non-null    float64 
 30  diagnosis                569 non-null    category
dtypes: category(1), float64(30)
memory usage: 134.2 KB
None
print(cancer.head())
   mean radius  mean texture  ...  worst fractal dimension  diagnosis
0        17.99         10.38  ...                  0.11890          0
1        20.57         17.77  ...                  0.08902          0
2        19.69         21.25  ...                  0.08758          0
3        11.42         20.38  ...                  0.17300          0
4        20.29         14.34  ...                  0.07678          0

[5 rows x 31 columns]

2.3 Split the data into training and test sets

import numpy as np

np.random.seed(123)
sample_index = np.random.choice(cancer.index, size=int(0.7 * len(cancer)), replace=False)

train = cancer.loc[sample_index]
test = cancer.drop(sample_index)

print(sample_index)
[333 273 201 178  85 500 216 297 209 469 270 335   9 285 430 200 429 107
 502  48 564 171 437 557 387  35 132 188  75 276   5 345 284 399  57 404
 347 480 102 517  54 326 138 204  11 518 499 212  13 199 196 368 466 454
 374 177 155 164 162 396 400 310 488 166 383  82 459 231 151  36 547 381
 205 458 182 170 523 272 156 274 496  34  33 159 527 338  15 327 131 402
  43 362 463 202 384 551  93 150 344  66 450 175 286 356 427  59 522 388
 314 179 192 246 211  72 190 134 386 118 316 331 560 529 334 292 460 482
 172 501  74 456  24 287 548  64 328 546 511  55 230 282 535 436 236 540
 260 410  26 498 537 352 309 105 550 114 229 520 444 181 223  49 472 457
  22 157 249 145 101 329  42 432  41 559 295 264 125 280 288 227  95  20
 470 503 185 534 203 393 464   0 184 142 237 361  30 474 148 120 191 217
 147 258 516 515 462 417 491 189 311 453  70  31 346 379 221  21 553 423
 408 367  79 350 337 275 299 121 443 210 160 376 439  91 234 261 248 226
 545 348  52 266 508  94 252 242 308 104 481 471 351 541 554 415  78 241
 336 289 530 369   6 416 263 277 318 267 447  37 392 128 558 455 165 298
 332  38 320  90  12 524 144 426 313 486  25 421 163  81 461 505  19  29
 543  71  83 291  53 278 173 406 422 490 531  32 492   4 218 239 235  84
 127 143 126 152 398 240  80 117  23 479 317 405  61 108 566  44 306  89
 232 397 446 168 431 220 167 225   7 441 115 494 378 245 136 478 452 363
 514 525 425 370 509 293 532 183 219 279 353 448 228  10 542  28 407 207
 124  69 161 194 137 442 521 112 325 100 343 568 341 403 506 485 110 283
 281 187 122 389  92 493 195  62 556 174 507  45 302 528 130 513  50  97
 197 215]
print(train)
     mean radius  mean texture  ...  worst fractal dimension  diagnosis
333       11.250         14.78  ...                  0.07418          1
273        9.742         15.67  ...                  0.08175          1
201       17.540         19.32  ...                  0.07867          0
178       13.010         22.22  ...                  0.05843          1
85        18.460         18.52  ...                  0.08579          0
..           ...           ...  ...                      ...        ...
513       14.580         13.66  ...                  0.07048          1
50        11.760         21.60  ...                  0.06563          1
97         9.787         19.94  ...                  0.08988          1
197       18.080         21.84  ...                  0.06558          0
215       13.860         16.93  ...                  0.10590          0

[398 rows x 31 columns]
print(test)
     mean radius  mean texture  ...  worst fractal dimension  diagnosis
1          20.57         17.77  ...                  0.08902          0
2          19.69         21.25  ...                  0.08758          0
3          11.42         20.38  ...                  0.17300          0
8          13.00         21.82  ...                  0.10720          0
14         13.73         22.61  ...                  0.14310          0
..           ...           ...  ...                      ...        ...
561        11.20         29.37  ...                  0.05905          1
562        15.22         30.62  ...                  0.14090          0
563        20.92         25.09  ...                  0.09873          0
565        20.13         28.25  ...                  0.06637          0
567        20.60         29.33  ...                  0.12400          0

[171 rows x 31 columns]

2.4 Fit the LDA model

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Separate the features (X) and the label (y)
X_train = train.drop(columns=['diagnosis'])
y_train = train['diagnosis']

# Create and train the LDA model
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
LinearDiscriminantAnalysis()
# Show a summary (classes and coefficients)
print("Classes:", lda_model.classes_)
Classes: [0 1]
print("Coefficients:", lda_model.coef_)
Coefficients: [[ 1.14219035e+00 -1.80891232e-01 -5.91631038e-02 -3.30453377e-03
   3.88253197e+01  8.56999707e+01 -2.03700166e+00 -1.04195020e+02
   3.62275956e+00 -1.53395422e+02 -1.06975894e+01 -1.55720222e-01
   6.41203912e-01  1.24907030e-02 -1.19549832e+02  1.11070346e+01
   7.20156421e+01 -1.99560582e+02 -4.35753804e+01 -8.04824641e+00
  -3.18633067e+00 -7.27382809e-02 -3.91873975e-02  2.02460913e-02
  -4.19062608e+01 -3.91923884e+00 -1.73066677e+01  1.75734995e+01
  -1.04663411e+01 -3.27061705e+01]]
print("Intercepts:", lda_model.intercept_)
Intercepts: [58.00224196]
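
With two classes, `coef_` holds a single row of weights and `intercept_` a single constant, so the fitted rule reduces to one linear score per observation. The sketch below checks that relationship on scikit-learn's bundled copy of the Wisconsin data (`load_breast_cancer`), which is assumed here, not confirmed, to match the Excel file used in this document. The names `lda`, `scores`, and `proba_class1` are illustrative only.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: the bundled dataset stands in for the Excel file used in this document.
X, y = load_breast_cancer(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)

# Linear score: one weight per feature plus a constant.
scores = X @ lda.coef_.ravel() + lda.intercept_[0]

# The score matches decision_function; a positive score selects classes_[1].
same_scores = np.allclose(scores, lda.decision_function(X))
same_labels = np.array_equal(lda.classes_[(scores > 0).astype(int)], lda.predict(X))

# In the binary case the posterior of classes_[1] is the sigmoid of the score.
proba_class1 = expit(scores)
same_proba = np.allclose(proba_class1, lda.predict_proba(X)[:, 1])

print(same_scores, same_labels, same_proba)
```

Because the features are on very different scales, the raw magnitudes of the coefficients are not directly comparable as measures of importance.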

2.5 Predict with the model

# Separate the features of the test set
X_test = test.drop(columns=['diagnosis'])
print(X_test)
     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
1          20.57         17.77  ...          0.2750                  0.08902
2          19.69         21.25  ...          0.3613                  0.08758
3          11.42         20.38  ...          0.6638                  0.17300
8          13.00         21.82  ...          0.4378                  0.10720
14         13.73         22.61  ...          0.3596                  0.14310
..           ...           ...  ...             ...                      ...
561        11.20         29.37  ...          0.1566                  0.05905
562        15.22         30.62  ...          0.4089                  0.14090
563        20.92         25.09  ...          0.2929                  0.09873
565        20.13         28.25  ...          0.2572                  0.06637
567        20.60         29.33  ...          0.4087                  0.12400

[171 rows x 30 columns]
# Make predictions
pred = lda_model.predict(X_test)
print(pred)
[0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0
 1 0 1 1 1 1 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0
 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0
 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 1 1 1 1 1
 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0]
pred_proba = lda_model.predict_proba(X_test)
print(pred_proba)
[[9.95041889e-01 4.95811120e-03]
 [9.99994440e-01 5.55960044e-06]
 [9.99998747e-01 1.25348893e-06]
 [9.75096914e-01 2.49030860e-02]
 [8.49407433e-01 1.50592567e-01]
 [9.94610348e-01 5.38965208e-03]
 [9.99954241e-01 4.57589089e-05]
 [9.99989985e-01 1.00150793e-05]
 [9.99358928e-01 6.41072313e-04]
 [6.25695385e-01 3.74304615e-01]
 [1.07547232e-02 9.89245277e-01]
 [1.04537431e-07 9.99999895e-01]
 [9.96531941e-01 3.46805910e-03]
 [1.28672710e-04 9.99871327e-01]
 [9.99948816e-01 5.11841269e-05]
 [1.09097677e-04 9.99890902e-01]
 [7.74806233e-06 9.99992252e-01]
 [3.65776861e-06 9.99996342e-01]
 [9.79051000e-01 2.09489997e-02]
 [2.54495325e-04 9.99745505e-01]
 [2.53771272e-03 9.97462287e-01]
 [6.56367868e-02 9.34363213e-01]
 [6.44714157e-04 9.99355286e-01]
 [9.94438686e-01 5.56131403e-03]
 [2.39903622e-01 7.60096378e-01]
 [9.99999929e-01 7.12687303e-08]
 [2.85450802e-02 9.71454920e-01]
 [1.58048444e-04 9.99841952e-01]
 [8.64317915e-04 9.99135682e-01]
 [4.09559091e-01 5.90440909e-01]
 [2.10170399e-05 9.99978983e-01]
 [1.08487446e-02 9.89151255e-01]
 [9.46339884e-02 9.05366012e-01]
 [1.22049402e-02 9.87795060e-01]
 [3.13969214e-06 9.99996860e-01]
 [1.28225452e-09 9.99999999e-01]
 [9.97935560e-01 2.06443977e-03]
 [3.09893469e-03 9.96901065e-01]
 [9.99939153e-01 6.08470529e-05]
 [3.60629927e-02 9.63937007e-01]
 [8.19013468e-02 9.18098653e-01]
 [2.65145829e-06 9.99997349e-01]
 [1.88795844e-07 9.99999811e-01]
 [9.57119383e-01 4.28806169e-02]
 [6.92208248e-01 3.07791752e-01]
 [8.07862121e-05 9.99919214e-01]
 [1.17317402e-05 9.99988268e-01]
 [9.51998962e-02 9.04800104e-01]
 [1.74486089e-05 9.99982551e-01]
 [4.63448209e-04 9.99536552e-01]
 [1.58184215e-04 9.99841816e-01]
 [9.99999747e-01 2.53323577e-07]
 [9.20965040e-01 7.90349596e-02]
 [9.94816113e-01 5.18388680e-03]
 [9.93553295e-01 6.44670473e-03]
 [1.10667435e-06 9.99998893e-01]
 [1.09725577e-01 8.90274423e-01]
 [2.95104373e-01 7.04895627e-01]
 [9.99796875e-01 2.03125078e-04]
 [1.09574831e-05 9.99989043e-01]
 [6.09456900e-03 9.93905431e-01]
 [9.99782688e-01 2.17311643e-04]
 [1.16149561e-01 8.83850439e-01]
 [5.13534719e-04 9.99486465e-01]
 [9.95364288e-01 4.63571180e-03]
 [2.65825794e-02 9.73417421e-01]
 [9.99997410e-01 2.58966818e-06]
 [1.44142525e-04 9.99855857e-01]
 [9.70764038e-01 2.92359623e-02]
 [9.99970342e-01 2.96577722e-05]
 [1.19634293e-01 8.80365707e-01]
 [9.99916821e-01 8.31791722e-05]
 [9.97319823e-01 2.68017681e-03]
 [9.99993126e-01 6.87375854e-06]
 [9.41964827e-01 5.80351733e-02]
 [9.99828068e-01 1.71932387e-04]
 [3.45636385e-04 9.99654364e-01]
 [3.03179937e-06 9.99996968e-01]
 [1.26799614e-05 9.99987320e-01]
 [1.11588951e-02 9.88841105e-01]
 [1.69097304e-06 9.99998309e-01]
 [1.70891329e-07 9.99999829e-01]
 [9.99999533e-01 4.67389033e-07]
 [7.64335290e-05 9.99923566e-01]
 [2.55542386e-06 9.99997445e-01]
 [9.28697331e-06 9.99990713e-01]
 [7.69088492e-05 9.99923091e-01]
 [7.92411960e-07 9.99999208e-01]
 [1.18145721e-04 9.99881854e-01]
 [2.52181109e-06 9.99997478e-01]
 [4.82832325e-05 9.99951717e-01]
 [9.83200612e-01 1.67993881e-02]
 [2.63833645e-04 9.99736166e-01]
 [9.99999965e-01 3.51983982e-08]
 [1.12039707e-04 9.99887960e-01]
 [9.23659397e-01 7.63406026e-02]
 [9.99999556e-01 4.43830617e-07]
 [6.19260924e-02 9.38073908e-01]
 [1.04323972e-04 9.99895676e-01]
 [5.12745158e-07 9.99999487e-01]
 [3.49957354e-06 9.99996500e-01]
 [1.28808783e-04 9.99871191e-01]
 [4.36959456e-04 9.99563041e-01]
 [1.94609279e-06 9.99998054e-01]
 [5.45220818e-04 9.99454779e-01]
 [1.86040635e-05 9.99981396e-01]
 [1.76110141e-04 9.99823890e-01]
 [9.96085777e-01 3.91422290e-03]
 [9.99999598e-01 4.02286660e-07]
 [1.27051343e-04 9.99872949e-01]
 [9.96779976e-01 3.22002428e-03]
 [9.99901540e-01 9.84601602e-05]
 [1.82065787e-03 9.98179342e-01]
 [2.08998839e-03 9.97910012e-01]
 [3.45750842e-03 9.96542492e-01]
 [6.99378232e-05 9.99930062e-01]
 [6.85765472e-01 3.14234528e-01]
 [1.19267604e-06 9.99998807e-01]
 [2.56756257e-06 9.99997432e-01]
 [4.09987879e-04 9.99590012e-01]
 [6.50149230e-04 9.99349851e-01]
 [9.40790308e-05 9.99905921e-01]
 [3.11541383e-03 9.96884586e-01]
 [3.27213132e-05 9.99967279e-01]
 [7.25681710e-07 9.99999274e-01]
 [1.15872485e-01 8.84127515e-01]
 [8.17165530e-01 1.82834470e-01]
 [3.46966857e-04 9.99653033e-01]
 [6.64659852e-05 9.99933534e-01]
 [2.64912376e-04 9.99735088e-01]
 [6.15623913e-04 9.99384376e-01]
 [2.62719719e-06 9.99997373e-01]
 [9.99806748e-01 1.93252061e-04]
 [4.00037057e-03 9.95999629e-01]
 [9.86134737e-01 1.38652631e-02]
 [1.72293570e-03 9.98277064e-01]
 [1.78065211e-03 9.98219348e-01]
 [7.03275720e-03 9.92967243e-01]
 [9.99984453e-01 1.55467425e-05]
 [9.99571794e-01 4.28206024e-04]
 [2.02096250e-01 7.97903750e-01]
 [7.33173317e-05 9.99926683e-01]
 [9.99895681e-01 1.04319106e-04]
 [2.22635349e-03 9.97773647e-01]
 [1.30675148e-03 9.98693249e-01]
 [4.73421954e-03 9.95265780e-01]
 [1.00462578e-04 9.99899537e-01]
 [3.80919920e-04 9.99619080e-01]
 [2.88492962e-02 9.71150704e-01]
 [9.99998271e-01 1.72878758e-06]
 [6.67924152e-01 3.32075848e-01]
 [1.27364141e-02 9.87263586e-01]
 [4.57858161e-04 9.99542142e-01]
 [2.53516258e-06 9.99997465e-01]
 [4.87816178e-05 9.99951218e-01]
 [9.96524194e-01 3.47580583e-03]
 [5.79898502e-04 9.99420101e-01]
 [5.33746974e-03 9.94662530e-01]
 [9.99083817e-01 9.16183251e-04]
 [1.06657052e-01 8.93342948e-01]
 [4.37575398e-05 9.99956242e-01]
 [2.05053307e-06 9.99997949e-01]
 [2.22635801e-04 9.99777364e-01]
 [3.53938806e-03 9.96460612e-01]
 [7.80460721e-03 9.92195393e-01]
 [1.83892772e-04 9.99816107e-01]
 [1.15929179e-05 9.99988407e-01]
 [9.99999949e-01 5.05830180e-08]
 [9.99996581e-01 3.41906522e-06]
 [9.99954824e-01 4.51756629e-05]
 [1.00000000e+00 1.76719844e-10]]
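
Each row of `pred_proba` sums to 1, and its columns follow the order of `lda_model.classes_`, so the first column is the posterior of class 0 and the second that of class 1. The predicted label is the class with the larger posterior, which for two classes amounts to a 0.5 cutoff. The sketch below illustrates this and a stricter cutoff. It uses scikit-learn's bundled dataset and an arbitrary 70/30 split, both assumptions that differ from the split used above, and the 0.9 threshold is purely hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Assumption: bundled dataset and an arbitrary 70/30 split, not the split used above.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

proba = lda.predict_proba(X_te)  # columns follow the order of lda.classes_
default_pred = lda.classes_[np.argmax(proba, axis=1)]
matches_predict = np.array_equal(default_pred, lda.predict(X_te))

# Hypothetical stricter rule: assign class 1 only when its posterior is at least 0.9.
threshold = 0.9
strict_pred = np.where(proba[:, 1] >= threshold, lda.classes_[1], lda.classes_[0])

print("argmax reproduces predict():", matches_predict)
print("class 1 count, default vs strict:",
      (default_pred == 1).sum(), (strict_pred == 1).sum())
```

Raising the threshold can only move observations from class 1 to class 0, which may be useful when the two kinds of error carry different costs.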

2.6 Confusion matrix

import pandas as pd

# Build a cross-tabulation of predicted vs. actual labels
conf_matrix = pd.crosstab(index=pred, columns=test['diagnosis'],
                          rownames=['Predicted'], colnames=['Actual'])

print(conf_matrix)
Actual      0    1
Predicted         
0          58    0
1           8  105
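
The table above places predicted labels on rows and actual labels on columns: all 105 observations of class 1 were recovered, while 8 of the 66 observations of class 0 were assigned to class 1. The sketch below rebuilds label vectors from those printed counts (the vectors `y_true` and `y_pred` are reconstructions, not the original data) to show that `sklearn.metrics.confusion_matrix` uses the opposite orientation, and to derive per-class recall and precision.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Label vectors rebuilt from the counts printed above (order is irrelevant here):
# 58 actual-0 predicted 0, 8 actual-0 predicted 1, 105 actual-1 predicted 1.
y_true = np.array([0] * 58 + [0] * 8 + [1] * 105)
y_pred = np.array([0] * 58 + [1] * 8 + [1] * 105)

# scikit-learn puts actual classes on rows and predicted classes on columns,
# i.e. the transpose of the crosstab shown above.
cm = confusion_matrix(y_true, y_pred)

recall = cm.diagonal() / cm.sum(axis=1)     # share of each actual class recovered
precision = cm.diagonal() / cm.sum(axis=0)  # share of each predicted class that is right

print(cm)
print("recall per class:   ", recall.round(4))
print("precision per class:", precision.round(4))
```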

2.7 Compute accuracy

from sklearn.metrics import accuracy_score

# Compute accuracy
accuracy = accuracy_score(test['diagnosis'], pred)

# Show the result
print(f"Model accuracy: {accuracy * 100:.2f} %")
Model accuracy: 95.32 %
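
The reported figure can be checked by hand from the confusion matrix: 58 + 105 = 163 correct predictions out of 171 test observations. Since the classes are unequal in size (66 versus 105), a minimal sketch of balanced accuracy, the mean of the per-class recalls, is included as a complementary figure; it is an addition for illustration and not part of the original analysis.

```python
import numpy as np

# Counts copied from the confusion matrix above (rows = predicted, columns = actual).
conf = np.array([[58, 0],
                 [8, 105]])

# Accuracy: correctly classified observations (the diagonal) over all observations.
accuracy = np.trace(conf) / conf.sum()

# Per-class recall uses column totals here because columns hold the actual classes.
recall_per_class = conf.diagonal() / conf.sum(axis=0)
balanced_accuracy = recall_per_class.mean()

print(f"accuracy: {accuracy * 100:.2f} %")
print(f"balanced accuracy: {balanced_accuracy * 100:.2f} %")
```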

2.8 Visualizing the discriminant space

import matplotlib.pyplot as plt

# Get the LDA projection (LD1)
X_lda = lda_model.transform(X_train)

# Map each class to a color. The labels here are coded 0/1 (see lda_model.classes_),
# so the keys must be 0 and 1, not 'B' and 'M'.
# Assumption: 0 = malignant, 1 = benign, as in scikit-learn's copy of this dataset.
color_map = {0: 'red', 1: 'blue'}
colors = train['diagnosis'].map(color_map)

# Draw a one-dimensional scatter plot
plt.figure(figsize=(10, 2))
plt.scatter(X_lda[:, 0], [0] * len(X_lda), c=colors, alpha=0.6, edgecolor='k')
plt.xlabel("LD1")
plt.yticks([])  # Hide the Y axis
([], [])
plt.title("LDA projection - LD1")
plt.grid(True, axis='x')
plt.tight_layout()
plt.show()
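
With two classes LDA yields a single discriminant axis, which is why the plot is one-dimensional. A numeric summary of LD1 by class can complement the picture when many points overlap on the line. The following is a sketch under the assumption that scikit-learn's bundled dataset stands in for the training set above; the names `ld1` and `summary` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: the bundled dataset stands in for the training set used above.
X, y = load_breast_cancer(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)

# Two classes give at most one discriminant axis, so the projection has one column.
ld1 = lda.transform(X)
print(ld1.shape)

# Mean, spread, and range of LD1 within each class.
summary = (
    pd.DataFrame({"LD1": ld1[:, 0], "diagnosis": y})
    .groupby("diagnosis")["LD1"]
    .agg(["mean", "std", "min", "max"])
)
print(summary)
```

The overlap between the two classes' ranges on LD1 gives a rough indication of where misclassifications are likely to occur.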