library(readxl)
Warning: package 'readxl' was built under R version 4.3.3
library(MASS)
Warning: package 'MASS' was built under R version 4.3.2
# Read the Excel file
cancer <- read_excel("C:/Users/MINEDUCYT/Desktop/CICLOI_2025/SeminarioI/breast_cancer_wisconsin.xlsx")
# View the first 10 rows
head(cancer, 10)
# A tibble: 10 × 31
`mean radius` `mean texture` `mean perimeter` `mean area` `mean smoothness`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 18.0 10.4 123. 1001 0.118
2 20.6 17.8 133. 1326 0.0847
3 19.7 21.2 130 1203 0.110
4 11.4 20.4 77.6 386. 0.142
5 20.3 14.3 135. 1297 0.100
6 12.4 15.7 82.6 477. 0.128
7 18.2 20.0 120. 1040 0.0946
8 13.7 20.8 90.2 578. 0.119
9 13 21.8 87.5 520. 0.127
10 12.5 24.0 84.0 476. 0.119
# ℹ 26 more variables: `mean compactness` <dbl>, `mean concavity` <dbl>,
# `mean concave points` <dbl>, `mean symmetry` <dbl>,
# `mean fractal dimension` <dbl>, `radius error` <dbl>,
# `texture error` <dbl>, `perimeter error` <dbl>, `area error` <dbl>,
# `smoothness error` <dbl>, `compactness error` <dbl>,
# `concavity error` <dbl>, `concave points error` <dbl>,
# `symmetry error` <dbl>, `fractal dimension error` <dbl>, …
cancer$diagnosis <- as.factor(cancer$diagnosis)
head(cancer$diagnosis)
[1] 0 0 0 0 0 0
Levels: 0 1
The diagnosis variable has been converted to a factor, which is required to fit classification models such as LDA.
Although diagnosis is stored as a number (0 or 1), as a factor it now represents two categorical classes:
0: the “malignant” class
1: the “benign” class
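As a rough Python analogue of the factor conversion, the sketch below turns a 0/1 column into a categorical one. The six toy values are an illustrative assumption and are not read from the Excel file.

```python
import pandas as pd

# Toy 0/1 diagnosis column (illustrative values, not the real data).
diagnosis = pd.Series([0, 0, 1, 0, 1, 1], name="diagnosis")

# As a numeric column, the codes can be averaged as if they were quantities.
numeric_mean = diagnosis.mean()

# As a categorical column, the values are class labels with a fixed set of levels.
diagnosis_cat = diagnosis.astype("category")
levels = list(diagnosis_cat.cat.categories)
counts = diagnosis_cat.value_counts().sort_index().to_dict()

print(levels)
print(counts)
```

The levels play the same role as `Levels: 0 1` in the R output: they tell the classifier which classes exist.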
str(cancer)
tibble [569 × 31] (S3: tbl_df/tbl/data.frame)
$ mean radius : num [1:569] 18 20.6 19.7 11.4 20.3 ...
$ mean texture : num [1:569] 10.4 17.8 21.2 20.4 14.3 ...
$ mean perimeter : num [1:569] 122.8 132.9 130 77.6 135.1 ...
$ mean area : num [1:569] 1001 1326 1203 386 1297 ...
$ mean smoothness : num [1:569] 0.1184 0.0847 0.1096 0.1425 0.1003 ...
$ mean compactness : num [1:569] 0.2776 0.0786 0.1599 0.2839 0.1328 ...
$ mean concavity : num [1:569] 0.3001 0.0869 0.1974 0.2414 0.198 ...
$ mean concave points : num [1:569] 0.1471 0.0702 0.1279 0.1052 0.1043 ...
$ mean symmetry : num [1:569] 0.242 0.181 0.207 0.26 0.181 ...
$ mean fractal dimension : num [1:569] 0.0787 0.0567 0.06 0.0974 0.0588 ...
$ radius error : num [1:569] 1.095 0.543 0.746 0.496 0.757 ...
$ texture error : num [1:569] 0.905 0.734 0.787 1.156 0.781 ...
$ perimeter error : num [1:569] 8.59 3.4 4.58 3.44 5.44 ...
$ area error : num [1:569] 153.4 74.1 94 27.2 94.4 ...
$ smoothness error : num [1:569] 0.0064 0.00522 0.00615 0.00911 0.01149 ...
$ compactness error : num [1:569] 0.049 0.0131 0.0401 0.0746 0.0246 ...
$ concavity error : num [1:569] 0.0537 0.0186 0.0383 0.0566 0.0569 ...
$ concave points error : num [1:569] 0.0159 0.0134 0.0206 0.0187 0.0188 ...
$ symmetry error : num [1:569] 0.03 0.0139 0.0225 0.0596 0.0176 ...
$ fractal dimension error: num [1:569] 0.00619 0.00353 0.00457 0.00921 0.00511 ...
$ worst radius : num [1:569] 25.4 25 23.6 14.9 22.5 ...
$ worst texture : num [1:569] 17.3 23.4 25.5 26.5 16.7 ...
$ worst perimeter : num [1:569] 184.6 158.8 152.5 98.9 152.2 ...
$ worst area : num [1:569] 2019 1956 1709 568 1575 ...
$ worst smoothness : num [1:569] 0.162 0.124 0.144 0.21 0.137 ...
$ worst compactness : num [1:569] 0.666 0.187 0.424 0.866 0.205 ...
$ worst concavity : num [1:569] 0.712 0.242 0.45 0.687 0.4 ...
$ worst concave points : num [1:569] 0.265 0.186 0.243 0.258 0.163 ...
$ worst symmetry : num [1:569] 0.46 0.275 0.361 0.664 0.236 ...
$ worst fractal dimension: num [1:569] 0.1189 0.089 0.0876 0.173 0.0768 ...
$ diagnosis : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
The dataset has 569 observations and 31 variables: 30 numeric variables describing physical characteristics of cells taken from breast biopsies, plus diagnosis.
Converting diagnosis to a factor matters because it is the class to be predicted by a classification model such as LDA.
set.seed(123)
sample_index <- sample(1:nrow(cancer), size = 0.7 * nrow(cancer))
train <- cancer[sample_index, ]
test <- cancer[-sample_index, ]
head(sample_index)
[1] 415 463 179 526 195 118
head(train)
# A tibble: 6 × 31
`mean radius` `mean texture` `mean perimeter` `mean area` `mean smoothness`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 15.1 29.8 96.7 720. 0.0832
2 14.4 27.0 92.2 646. 0.0700
3 13.0 22.2 82.0 526. 0.0625
4 8.57 13.1 54.5 221. 0.104
5 14.9 23.2 100. 671. 0.104
6 14.9 16.7 98.6 682. 0.116
# ℹ 26 more variables: `mean compactness` <dbl>, `mean concavity` <dbl>,
# `mean concave points` <dbl>, `mean symmetry` <dbl>,
# `mean fractal dimension` <dbl>, `radius error` <dbl>,
# `texture error` <dbl>, `perimeter error` <dbl>, `area error` <dbl>,
# `smoothness error` <dbl>, `compactness error` <dbl>,
# `concavity error` <dbl>, `concave points error` <dbl>,
# `symmetry error` <dbl>, `fractal dimension error` <dbl>, …
head(test)
# A tibble: 6 × 31
`mean radius` `mean texture` `mean perimeter` `mean area` `mean smoothness`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 18.0 10.4 123. 1001 0.118
2 20.6 17.8 133. 1326 0.0847
3 13 21.8 87.5 520. 0.127
4 13.7 22.6 93.6 578. 0.113
5 14.7 20.1 94.7 684. 0.0987
6 16.1 20.7 108. 799. 0.117
# ℹ 26 more variables: `mean compactness` <dbl>, `mean concavity` <dbl>,
# `mean concave points` <dbl>, `mean symmetry` <dbl>,
# `mean fractal dimension` <dbl>, `radius error` <dbl>,
# `texture error` <dbl>, `perimeter error` <dbl>, `area error` <dbl>,
# `smoothness error` <dbl>, `compactness error` <dbl>,
# `concavity error` <dbl>, `concave points error` <dbl>,
# `symmetry error` <dbl>, `fractal dimension error` <dbl>, …
sample_index: the randomly selected row indices used for training. The first six are 415, 463, 179, 526, 195 and 118; in total they cover 70% of the rows of cancer and define the training set.
train: the training data (70%). The output shows the first 6 rows of the training set and the first 5 of its 31 columns. These columns are different measurements of tumor characteristics:
mean radius: mean cell radius.
mean texture: mean texture.
mean perimeter: mean perimeter.
mean area: mean area.
mean smoothness: mean smoothness.
In our linear discriminant analysis (LDA), these attributes serve as predictors for classifying a tumor as benign or malignant.
test: the test data (30%), used to evaluate how well the model performs. For example, the first row shows:
mean radius: 17.99
mean texture: 10.38
mean perimeter: 122.80
mean area: 1001.0
mean smoothness: 0.1184
These values describe that tumor quantitatively. In this analysis they are used to predict diagnosis, which in clinical notation is usually written B for benign or M for malignant and is coded here as 0 and 1.
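The sizes of the two subsets follow directly from the 70% rule. The sketch below reproduces the arithmetic with NumPy; the seed and generator are assumptions for illustration and will not reproduce the indices drawn by R's sample().

```python
import numpy as np

n_rows = 569            # number of observations reported by str(cancer)
train_fraction = 0.7

# Illustrative generator; it does not match R's random number stream.
rng = np.random.default_rng(123)

# 0.7 * 569 = 398.3, which is truncated to 398 training rows.
n_train = int(train_fraction * n_rows)
train_idx = rng.choice(n_rows, size=n_train, replace=False)

# Every row that was not drawn for training goes to the test set.
test_idx = np.setdiff1d(np.arange(n_rows), train_idx)

print(n_train, len(test_idx))
```

Sampling without replacement guarantees that no row appears in both subsets.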
lda_model <- lda(diagnosis ~ ., data = train)
# Show the model summary
print(lda_model)
Call:
lda(diagnosis ~ ., data = train)
Prior probabilities of groups:
0 1
0.3492462 0.6507538
Group means:
`mean radius` `mean texture` `mean perimeter` `mean area` `mean smoothness`
0 17.28885 21.70583 114.03914 955.0360 0.10234547
1 12.14377 18.02320 78.09371 462.9328 0.09224819
`mean compactness` `mean concavity` `mean concave points` `mean symmetry`
0 0.1419415 0.15481676 0.08541583 0.1925165
1 0.0810390 0.04729681 0.02568140 0.1752224
`mean fractal dimension` `radius error` `texture error` `perimeter error`
0 0.06244108 0.5641899 1.240917 3.984662
1 0.06305023 0.2871579 1.213266 2.013821
`area error` `smoothness error` `compactness error` `concavity error`
0 63.96309 0.006920676 0.03220353 0.04098705
1 21.47387 0.007235819 0.02212952 0.02706923
`concave points error` `symmetry error` `fractal dimension error`
0 0.014970288 0.02049093 0.004145626
1 0.009969741 0.02064478 0.003778734
`worst radius` `worst texture` `worst perimeter` `worst area`
0 20.73863 29.47806 138.45978 1360.3525
1 13.41039 23.72772 87.19842 561.9328
`worst smoothness` `worst compactness` `worst concavity`
0 0.1450568 0.3723490 0.4432517
1 0.1251653 0.1850665 0.1689314
`worst concave points` `worst symmetry` `worst fractal dimension`
0 0.17910417 0.3245036 0.09196381
1 0.07522487 0.2720641 0.07997440
Coefficients of linear discriminants:
LD1
`mean radius` 1.258547915
`mean texture` -0.030083088
`mean perimeter` -0.107784270
`mean area` -0.005176914
`mean smoothness` -9.004044617
`mean compactness` 24.111561107
`mean concavity` -4.749500675
`mean concave points` -12.674770351
`mean symmetry` -1.282530280
`mean fractal dimension` -5.310758318
`radius error` -0.882583945
`texture error` 0.380924695
`perimeter error` -0.134564024
`area error` 0.001863364
`smoothness error` -83.026413011
`compactness error` -0.851244190
`concavity error` 16.816940282
`concave points error` -48.437080252
`symmetry error` -14.053438493
`fractal dimension error` 20.283804784
`worst radius` -1.067839788
`worst texture` -0.043361395
`worst perimeter` 0.008610691
`worst area` 0.006935269
`worst smoothness` 1.024035463
`worst compactness` -0.291222971
`worst concavity` -2.343462931
`worst concave points` -2.842339599
`worst symmetry` -2.536283098
`worst fractal dimension` -22.872838872
Prior probabilities of the groups
The output shows Prior probabilities of groups: 0: 0.3492462, 1: 0.6507538. These are the proportions of observations in each class of the training set:
0: 34.9%, probably malignant
1: 65.1%, probably benign
Group means:
Next come the means of each variable for groups 0 and 1. For example:
mean radius
Group 0: 17.28885
Group 1: 12.14377
Tumors in group 0 have a larger mean radius than those in group 1, which could be an important discriminating feature.
Coefficients of the linear discriminant (LD1)
Positive coefficients: push the classification toward one class (for example, 1).
Negative coefficients: toward the other class (for example, 0).
The larger the absolute value, the more influence the variable has on the classification. Because the predictors are not standardized, the magnitudes also reflect each variable's units.
For example:
mean radius: 1.26, so with the other predictors held fixed, a larger radius pushes toward group 1.
worst fractal dimension: -22.87, a variable with a large weight; higher values push toward group 0.
mean smoothness: -9.00, so higher mean smoothness pushes toward group 0.
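By default the priors are simply class proportions, so they can be traced back to counts. The sketch below assumes the training set has 398 rows (70% of 569, truncated); the per-class counts are inferred from the printed priors, not read from the data.

```python
# Priors copied from print(lda_model).
priors = {"0": 0.3492462, "1": 0.6507538}

# Assumed training-set size: int(0.7 * 569).
n_train = 398

# Recover the class counts implied by the priors.
counts = {label: round(p * n_train) for label, p in priors.items()}

# Recompute the proportions from the counts to check the round trip.
recomputed = {label: c / n_train for label, c in counts.items()}

print(counts)
print(recomputed)
```

Under that assumption, the priors correspond to 139 training cases in class 0 and 259 in class 1.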
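A coefficient only matters in combination with the scale of its variable. The sketch below multiplies three LD1 coefficients from the output by hypothetical centered predictor values; the inputs are invented for illustration, and the result is a partial sum over 3 of the 30 predictors, not a real LD1 score.

```python
import numpy as np

names = ["mean radius", "mean smoothness", "worst fractal dimension"]

# LD1 coefficients copied from the model output (a subset of the 30).
coef = np.array([1.258547915, -9.004044617, -22.872838872])

# Hypothetical centered values (observation minus overall center); not real data.
# Radius varies by several units, the other two by hundredths or less.
x_centered = np.array([2.0, 0.01, 0.005])

# Each variable contributes coefficient * centered value to the score.
contributions = coef * x_centered
partial_score = contributions.sum()

for name, value in zip(names, contributions):
    print(f"{name}: {value:+.4f}")
print(f"partial score: {partial_score:+.4f}")
```

With these inputs, mean radius dominates the sum even though its coefficient is the smallest in absolute value, which is why raw coefficient size alone can overstate a variable's weight.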
pred <- predict(lda_model, newdata = test)
head(pred)
$class
[1] 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1
[38] 1 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 0 1
[75] 1 1 0 1 0 0 1 1 0 1 0 0 0 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 1 1 0 0
[112] 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0
[149] 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Levels: 0 1
$posterior
0 1
1 9.999385e-01 6.150584e-05
2 9.985722e-01 1.427755e-03
3 9.794959e-01 2.050408e-02
4 9.070450e-01 9.295503e-02
5 9.947401e-01 5.259897e-03
6 9.998592e-01 1.407958e-04
7 9.999573e-01 4.273316e-05
8 3.619519e-05 9.999638e-01
9 9.995805e-01 4.194986e-04
10 9.998172e-01 1.827739e-04
11 9.993890e-01 6.110469e-04
12 9.272511e-01 7.274888e-02
13 1.386450e-05 9.999861e-01
14 9.999999e-01 7.435258e-08
15 9.399546e-01 6.004543e-02
16 9.999862e-01 1.378183e-05
17 2.661338e-07 9.999997e-01
18 4.889098e-04 9.995111e-01
19 3.281269e-03 9.967187e-01
20 9.940348e-01 5.965182e-03
21 3.864995e-06 9.999961e-01
22 3.391939e-06 9.999966e-01
23 9.993216e-01 6.784161e-04
24 9.990041e-01 9.959470e-04
25 3.991042e-04 9.996009e-01
26 9.992817e-01 7.183058e-04
27 9.952746e-01 4.725441e-03
28 9.999999e-01 1.437264e-07
29 8.304831e-01 1.695169e-01
30 9.999698e-01 3.020836e-05
31 2.524719e-01 7.475281e-01
32 9.910644e-01 8.935625e-03
33 4.164596e-04 9.995835e-01
34 8.102907e-01 1.897093e-01
35 5.668494e-04 9.994332e-01
36 1.000000e+00 1.360880e-09
37 1.010136e-02 9.898986e-01
38 2.698059e-03 9.973019e-01
39 1.673072e-04 9.998327e-01
40 3.746650e-04 9.996253e-01
41 6.708009e-01 3.291991e-01
42 1.764502e-05 9.999824e-01
43 5.535861e-06 9.999945e-01
44 8.924355e-01 1.075645e-01
45 1.511062e-03 9.984889e-01
46 3.666257e-02 9.633374e-01
47 1.056368e-04 9.998944e-01
48 9.870254e-01 1.297463e-02
49 9.998047e-01 1.953185e-04
50 1.267325e-06 9.999987e-01
51 9.999827e-01 1.734547e-05
52 6.227498e-04 9.993773e-01
53 9.945912e-01 5.408768e-03
54 4.033073e-05 9.999597e-01
55 9.999976e-01 2.406976e-06
56 9.716873e-01 2.831268e-02
57 8.720054e-01 1.279946e-01
58 1.104890e-03 9.988951e-01
59 2.858582e-05 9.999714e-01
60 5.356279e-06 9.999946e-01
61 3.744022e-03 9.962560e-01
62 3.993432e-10 1.000000e+00
63 3.758537e-02 9.624146e-01
64 9.999999e-01 6.059155e-08
65 5.815340e-01 4.184660e-01
66 5.214702e-06 9.999948e-01
67 1.000000e+00 1.866553e-12
68 2.148542e-01 7.851458e-01
69 9.987033e-01 1.296653e-03
70 1.817197e-06 9.999982e-01
71 9.979647e-01 2.035330e-03
72 1.036880e-03 9.989631e-01
73 9.999994e-01 6.096648e-07
74 1.043601e-05 9.999896e-01
75 1.174405e-04 9.998826e-01
76 2.575437e-02 9.742456e-01
77 9.999967e-01 3.288422e-06
78 1.173086e-01 8.826914e-01
79 9.998984e-01 1.015554e-04
80 9.999079e-01 9.212784e-05
81 1.766922e-01 8.233078e-01
82 5.076695e-02 9.492330e-01
83 9.816216e-01 1.837844e-02
84 1.259094e-04 9.998741e-01
85 9.795967e-01 2.040327e-02
86 9.999398e-01 6.024239e-05
87 9.142859e-01 8.571408e-02
88 2.015867e-02 9.798413e-01
89 3.936483e-05 9.999606e-01
90 1.958179e-07 9.999998e-01
91 9.999976e-01 2.443786e-06
92 1.901658e-05 9.999810e-01
93 9.757026e-05 9.999024e-01
94 2.729702e-06 9.999973e-01
95 9.877405e-01 1.225950e-02
96 1.436961e-05 9.999856e-01
97 3.815067e-04 9.996185e-01
98 9.946577e-01 5.342336e-03
99 6.051620e-03 9.939484e-01
100 9.999984e-01 1.610602e-06
101 7.941777e-02 9.205822e-01
102 2.583847e-04 9.997416e-01
103 3.661332e-03 9.963387e-01
104 3.519685e-04 9.996480e-01
105 9.982093e-01 1.790696e-03
106 9.998908e-01 1.091583e-04
107 4.547653e-04 9.995452e-01
108 3.476622e-04 9.996523e-01
109 2.454764e-05 9.999755e-01
110 7.700702e-01 2.299298e-01
111 9.999989e-01 1.144892e-06
112 9.999793e-01 2.073769e-05
113 8.940241e-04 9.991060e-01
114 3.473284e-03 9.965267e-01
115 9.173070e-05 9.999083e-01
116 7.122224e-01 2.877776e-01
117 2.009976e-04 9.997990e-01
118 1.012291e-04 9.998988e-01
119 1.409268e-05 9.999859e-01
120 9.999989e-01 1.054255e-06
121 2.708455e-04 9.997292e-01
122 1.181440e-04 9.998819e-01
123 2.019854e-02 9.798015e-01
124 6.645017e-05 9.999335e-01
125 2.682891e-01 7.317109e-01
126 1.100302e-03 9.988997e-01
127 8.787659e-04 9.991212e-01
128 9.999965e-01 3.496283e-06
129 2.468127e-03 9.975319e-01
130 2.807140e-04 9.997193e-01
131 3.584892e-06 9.999964e-01
132 9.942513e-06 9.999901e-01
133 9.990711e-01 9.288586e-04
134 3.689751e-03 9.963102e-01
135 3.114834e-03 9.968852e-01
136 2.576114e-03 9.974239e-01
137 2.266330e-02 9.773367e-01
138 9.999998e-01 1.635484e-07
139 1.829045e-03 9.981710e-01
140 3.240526e-01 6.759474e-01
141 3.475779e-05 9.999652e-01
142 3.112180e-04 9.996888e-01
143 1.114122e-02 9.888588e-01
144 2.365322e-03 9.976347e-01
145 9.999880e-01 1.199040e-05
146 8.339575e-03 9.916604e-01
147 9.999996e-01 3.890186e-07
148 9.911629e-01 8.837149e-03
149 9.959839e-01 4.016079e-03
150 4.691062e-02 9.530894e-01
151 9.997746e-01 2.254499e-04
152 1.169168e-03 9.988308e-01
153 9.999985e-01 1.544038e-06
154 8.950274e-03 9.910497e-01
155 1.530116e-04 9.998470e-01
156 2.423806e-03 9.975762e-01
157 1.165059e-03 9.988349e-01
158 2.115966e-06 9.999979e-01
159 1.551917e-06 9.999984e-01
160 1.853124e-04 9.998147e-01
161 3.389745e-06 9.999966e-01
162 7.087016e-06 9.999929e-01
163 7.332973e-05 9.999267e-01
164 5.633905e-06 9.999944e-01
165 2.052729e-04 9.997947e-01
166 4.637697e-07 9.999995e-01
167 1.686890e-05 9.999831e-01
168 2.698620e-04 9.997301e-01
169 3.556447e-02 9.644355e-01
170 3.501479e-07 9.999996e-01
171 9.999998e-01 1.752178e-07
$x
LD1
1 -3.26432050
2 -2.44569686
3 -1.74736127
4 -1.33407064
5 -2.10538934
6 -3.04880511
7 -3.35908019
8 1.91967247
9 -2.76465592
10 -2.98089705
11 -2.66674013
12 -1.40357993
13 2.16936786
14 -5.01239943
15 -1.45705666
16 -3.65354000
17 3.19797736
18 1.24218021
19 0.74607546
20 -2.07246391
21 2.50174648
22 2.53571835
23 -2.63950867
24 -2.53952533
25 1.29501322
26 -2.62463169
27 -2.13341001
28 -4.84090121
29 -1.15478503
30 -3.44933656
31 -0.45886468
32 -1.96653537
33 1.28393270
34 -1.11909702
35 1.20367178
36 -6.05339093
37 0.45170599
38 0.79714883
39 1.52129185
40 1.31146191
41 -0.92652311
42 2.10662619
43 2.40825850
44 -1.29186238
45 0.94830331
46 0.10920451
47 1.64095589
48 -1.86842999
49 -2.96362101
50 2.79188723
51 -3.59369740
52 1.17918482
53 -2.09808814
54 1.89152065
55 -4.10759217
56 -1.66131465
57 -1.24058786
58 1.02987021
59 1.98108635
60 2.41683942
61 0.71162602
62 4.88980119
63 0.10248687
64 -5.06565339
65 -0.82693398
66 2.42380967
67 -7.76859653
68 -0.40410809
69 -2.47079305
70 2.69811324
71 -2.35328222
72 1.04641836
73 -4.46490812
74 2.24328441
75 1.61339059
76 0.20402559
77 -4.02639903
78 -0.21617438
79 -3.13382538
80 -3.15917870
81 -0.34087578
82 0.02067332
83 -1.77640335
84 1.59527036
85 -1.74867046
86 -3.26972154
87 -1.35724173
88 0.26925773
89 1.89782843
90 3.27781144
91 -4.10364297
92 2.08714763
93 1.66162690
94 2.59223713
95 -1.88337063
96 2.16005658
97 1.30675145
98 -2.10132120
99 0.58608196
100 -4.21213252
101 -0.10373620
102 1.40817915
103 0.71745879
104 1.32772809
105 -2.38666596
106 -3.11503812
107 1.26102692
108 1.33093237
109 2.02071509
110 -1.05581745
111 -4.30093938
112 -3.54721864
113 1.08502785
114 0.73122753
115 1.67768702
116 -0.97710465
117 1.47354585
118 1.65204693
119 2.16512027
120 -4.32239982
121 1.39592062
122 1.61183655
123 0.26873301
124 1.76158434
125 -0.48024084
126 1.03095405
127 1.08951103
128 -4.01045043
129 0.82038586
130 1.38660601
131 2.52132218
132 2.25588948
133 -2.55768874
134 0.71543952
135 0.75966366
136 0.80921519
137 0.23811879
138 -4.80728364
139 0.89852650
140 -0.55000410
141 1.93021691
142 1.35975620
143 0.42593745
144 0.83148314
145 -3.68977237
146 0.50203874
147 -4.58181285
148 -1.96944473
149 -2.17591874
150 0.04228472
151 -2.92628276
152 1.01513986
153 -4.22311489
154 0.48348940
155 1.54453633
156 0.82511248
157 1.01605700
158 2.65850607
159 2.73917456
160 1.49469144
161 2.53588675
162 2.34398342
163 1.73594897
164 2.40369043
165 1.46806812
166 3.05346339
167 2.11833078
168 1.39686750
169 0.11741354
170 3.12658896
171 -4.78935034
The $class component gives the predicted class (0 or 1, as described earlier) for each observation in the test set.
The $posterior component shows the posterior probability of belonging to each class. In the first row, for example, there is a 99.99% probability that the observation belongs to class 0, and so on for the others.
The $x component holds the projections of the observations onto the linear discriminant function (LD1), for example:
LD1
1 -3.26432050
2 -2.44569686
These scores are used to visualize the separation between the classes: the farther apart the scores of the two classes are, the better the model separates them.
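The link between the posterior probabilities and the predicted class can be checked by hand: the predicted class is the one with the largest posterior. The sketch below uses rows 1, 8 and 31 copied from the posterior output; the choice of rows is arbitrary.

```python
import numpy as np

# Rows 1, 8 and 31 of the posterior output (columns: class 0, class 1).
posterior = np.array([
    [9.999385e-01, 6.150584e-05],
    [3.619519e-05, 9.999638e-01],
    [2.524719e-01, 7.475281e-01],
])
classes = np.array([0, 1])

# The predicted class is the column with the highest posterior probability.
predicted = classes[posterior.argmax(axis=1)]

# The winning probability shows how confident each prediction is.
confidence = posterior.max(axis=1)

print(predicted.tolist())
print(confidence.round(4).tolist())
```

Row 31 is assigned to class 1 with a posterior of only about 0.75, so it is a less clear-cut case than rows 1 and 8.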
table(Predicted = pred$class, Actual = test$diagnosis)
Actual
Predicted 0 1
0 67 1
1 6 97
The confusion matrix gives the actual class totals in its columns:
Class 0: 67 + 6 = 73 observations
Class 1: 1 + 97 = 98 observations
Reading the rows, the model correctly predicted class 0 in 67 cases and predicted class 0 once when the true class was 1; it correctly predicted class 1 in 97 cases and predicted class 1 in 6 cases where the true class was 0.
The accuracy is computed as follows:
(67+97) / (67+1+6+97) = 164/171 = 95.91%
accuracy <- mean(pred$class == test$diagnosis)
cat("Exactitud del modelo:", round(accuracy * 100, 2), "%\n")
Exactitud del modelo: 95.91 %
The model correctly predicts 95.91% of the cases in our test set, which is fairly good performance.
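The same figure can be recomputed from the confusion matrix alone. The sketch below does so with NumPy and also derives the per-class recall, an extra measure that the report does not compute, included here only to show where the 7 errors fall.

```python
import numpy as np

# Confusion matrix from table(Predicted, Actual): rows = predicted, columns = actual.
conf = np.array([[67, 1],
                 [6, 97]])

# Accuracy: correct predictions (the diagonal) over all test cases.
accuracy = np.trace(conf) / conf.sum()

# Per-class recall: correct predictions over the actual total of each class (column sums).
recall = np.diag(conf) / conf.sum(axis=0)

print(round(accuracy * 100, 2))
print((recall * 100).round(2).tolist())
```

Six of the seven errors are class 0 cases predicted as class 1, so the recall is lower for class 0 than for class 1.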
par(mar = c(4, 4, 2, 1))
plot(lda_model, col = as.numeric(train$diagnosis))
In this plot, the x-axis shows the values of the linear discriminant LD1 and the y-axis shows the frequency.
The LD1 values for group 0 are concentrated in the negative range (from −6 to 0) and those for group 1 in the positive range (from 0 to 4). This indicates that the LDA model found a combination of variables that separates the groups well.
Because most observations in each group are well separated, the LDA model will probably classify new cases with good accuracy.
import pandas as pd
# Read the Excel file
cancer = pd.read_excel("C:/Users/MINEDUCYT/Desktop/CICLOI_2025/SeminarioI/breast_cancer_wisconsin.xlsx")
# View the first 10 rows
print(cancer.head(10))
mean radius mean texture ... worst fractal dimension diagnosis
0 17.99 10.38 ... 0.11890 0
1 20.57 17.77 ... 0.08902 0
2 19.69 21.25 ... 0.08758 0
3 11.42 20.38 ... 0.17300 0
4 20.29 14.34 ... 0.07678 0
5 12.45 15.70 ... 0.12440 0
6 18.25 19.98 ... 0.08368 0
7 13.71 20.83 ... 0.11510 0
8 13.00 21.82 ... 0.10720 0
9 12.46 24.04 ... 0.20750 0
[10 rows x 31 columns]
cancer['diagnosis'] = cancer['diagnosis'].astype('category')
print(cancer.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mean radius 569 non-null float64
1 mean texture 569 non-null float64
2 mean perimeter 569 non-null float64
3 mean area 569 non-null float64
4 mean smoothness 569 non-null float64
5 mean compactness 569 non-null float64
6 mean concavity 569 non-null float64
7 mean concave points 569 non-null float64
8 mean symmetry 569 non-null float64
9 mean fractal dimension 569 non-null float64
10 radius error 569 non-null float64
11 texture error 569 non-null float64
12 perimeter error 569 non-null float64
13 area error 569 non-null float64
14 smoothness error 569 non-null float64
15 compactness error 569 non-null float64
16 concavity error 569 non-null float64
17 concave points error 569 non-null float64
18 symmetry error 569 non-null float64
19 fractal dimension error 569 non-null float64
20 worst radius 569 non-null float64
21 worst texture 569 non-null float64
22 worst perimeter 569 non-null float64
23 worst area 569 non-null float64
24 worst smoothness 569 non-null float64
25 worst compactness 569 non-null float64
26 worst concavity 569 non-null float64
27 worst concave points 569 non-null float64
28 worst symmetry 569 non-null float64
29 worst fractal dimension 569 non-null float64
30 diagnosis 569 non-null category
dtypes: category(1), float64(30)
memory usage: 134.2 KB
None
print(cancer.head())
mean radius mean texture ... worst fractal dimension diagnosis
0 17.99 10.38 ... 0.11890 0
1 20.57 17.77 ... 0.08902 0
2 19.69 21.25 ... 0.08758 0
3 11.42 20.38 ... 0.17300 0
4 20.29 14.34 ... 0.07678 0
[5 rows x 31 columns]
import numpy as np
np.random.seed(123)
sample_index = np.random.choice(cancer.index, size=int(0.7 * len(cancer)), replace=False)
train = cancer.loc[sample_index]
test = cancer.drop(sample_index)
print(sample_index)
[333 273 201 178 85 500 216 297 209 469 270 335 9 285 430 200 429 107
502 48 564 171 437 557 387 35 132 188 75 276 5 345 284 399 57 404
347 480 102 517 54 326 138 204 11 518 499 212 13 199 196 368 466 454
374 177 155 164 162 396 400 310 488 166 383 82 459 231 151 36 547 381
205 458 182 170 523 272 156 274 496 34 33 159 527 338 15 327 131 402
43 362 463 202 384 551 93 150 344 66 450 175 286 356 427 59 522 388
314 179 192 246 211 72 190 134 386 118 316 331 560 529 334 292 460 482
172 501 74 456 24 287 548 64 328 546 511 55 230 282 535 436 236 540
260 410 26 498 537 352 309 105 550 114 229 520 444 181 223 49 472 457
22 157 249 145 101 329 42 432 41 559 295 264 125 280 288 227 95 20
470 503 185 534 203 393 464 0 184 142 237 361 30 474 148 120 191 217
147 258 516 515 462 417 491 189 311 453 70 31 346 379 221 21 553 423
408 367 79 350 337 275 299 121 443 210 160 376 439 91 234 261 248 226
545 348 52 266 508 94 252 242 308 104 481 471 351 541 554 415 78 241
336 289 530 369 6 416 263 277 318 267 447 37 392 128 558 455 165 298
332 38 320 90 12 524 144 426 313 486 25 421 163 81 461 505 19 29
543 71 83 291 53 278 173 406 422 490 531 32 492 4 218 239 235 84
127 143 126 152 398 240 80 117 23 479 317 405 61 108 566 44 306 89
232 397 446 168 431 220 167 225 7 441 115 494 378 245 136 478 452 363
514 525 425 370 509 293 532 183 219 279 353 448 228 10 542 28 407 207
124 69 161 194 137 442 521 112 325 100 343 568 341 403 506 485 110 283
281 187 122 389 92 493 195 62 556 174 507 45 302 528 130 513 50 97
197 215]
print(train)
mean radius mean texture ... worst fractal dimension diagnosis
333 11.250 14.78 ... 0.07418 1
273 9.742 15.67 ... 0.08175 1
201 17.540 19.32 ... 0.07867 0
178 13.010 22.22 ... 0.05843 1
85 18.460 18.52 ... 0.08579 0
.. ... ... ... ... ...
513 14.580 13.66 ... 0.07048 1
50 11.760 21.60 ... 0.06563 1
97 9.787 19.94 ... 0.08988 1
197 18.080 21.84 ... 0.06558 0
215 13.860 16.93 ... 0.10590 0
[398 rows x 31 columns]
print(test)
mean radius mean texture ... worst fractal dimension diagnosis
1 20.57 17.77 ... 0.08902 0
2 19.69 21.25 ... 0.08758 0
3 11.42 20.38 ... 0.17300 0
8 13.00 21.82 ... 0.10720 0
14 13.73 22.61 ... 0.14310 0
.. ... ... ... ... ...
561 11.20 29.37 ... 0.05905 1
562 15.22 30.62 ... 0.14090 0
563 20.92 25.09 ... 0.09873 0
565 20.13 28.25 ... 0.06637 0
567 20.60 29.33 ... 0.12400 0
[171 rows x 31 columns]
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Separate features (X) and label (y)
X_train = train.drop(columns=['diagnosis'])
y_train = train['diagnosis']
# Create and train the LDA model
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
LinearDiscriminantAnalysis()
# Show summary (coefficients and classes)
print("Clases:", lda_model.classes_)
Clases: [0 1]
print("Coeficientes:", lda_model.coef_)
Coeficientes: [[ 1.14219035e+00 -1.80891232e-01 -5.91631038e-02 -3.30453377e-03
3.88253197e+01 8.56999707e+01 -2.03700166e+00 -1.04195020e+02
3.62275956e+00 -1.53395422e+02 -1.06975894e+01 -1.55720222e-01
6.41203912e-01 1.24907030e-02 -1.19549832e+02 1.11070346e+01
7.20156421e+01 -1.99560582e+02 -4.35753804e+01 -8.04824641e+00
-3.18633067e+00 -7.27382809e-02 -3.91873975e-02 2.02460913e-02
-4.19062608e+01 -3.91923884e+00 -1.73066677e+01 1.75734995e+01
-1.04663411e+01 -3.27061705e+01]]
print("Interceptos:", lda_model.intercept_)
Interceptos: [58.00224196]
# Separate the features of the test set
X_test = test.drop(columns=['diagnosis'])
print(X_test)
mean radius mean texture ... worst symmetry worst fractal dimension
1 20.57 17.77 ... 0.2750 0.08902
2 19.69 21.25 ... 0.3613 0.08758
3 11.42 20.38 ... 0.6638 0.17300
8 13.00 21.82 ... 0.4378 0.10720
14 13.73 22.61 ... 0.3596 0.14310
.. ... ... ... ... ...
561 11.20 29.37 ... 0.1566 0.05905
562 15.22 30.62 ... 0.4089 0.14090
563 20.92 25.09 ... 0.2929 0.09873
565 20.13 28.25 ... 0.2572 0.06637
567 20.60 29.33 ... 0.4087 0.12400
[171 rows x 30 columns]
# Make predictions
pred = lda_model.predict(X_test)
print(pred)
[0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0
1 0 1 1 1 1 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0
0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0
0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 1 1 1 1 1
1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0]
pred_proba = lda_model.predict_proba(X_test)
print(pred_proba)
[[9.95041889e-01 4.95811120e-03]
[9.99994440e-01 5.55960044e-06]
[9.99998747e-01 1.25348893e-06]
[9.75096914e-01 2.49030860e-02]
[8.49407433e-01 1.50592567e-01]
[9.94610348e-01 5.38965208e-03]
[9.99954241e-01 4.57589089e-05]
[9.99989985e-01 1.00150793e-05]
[9.99358928e-01 6.41072313e-04]
[6.25695385e-01 3.74304615e-01]
[1.07547232e-02 9.89245277e-01]
[1.04537431e-07 9.99999895e-01]
[9.96531941e-01 3.46805910e-03]
[1.28672710e-04 9.99871327e-01]
[9.99948816e-01 5.11841269e-05]
[1.09097677e-04 9.99890902e-01]
[7.74806233e-06 9.99992252e-01]
[3.65776861e-06 9.99996342e-01]
[9.79051000e-01 2.09489997e-02]
[2.54495325e-04 9.99745505e-01]
[2.53771272e-03 9.97462287e-01]
[6.56367868e-02 9.34363213e-01]
[6.44714157e-04 9.99355286e-01]
[9.94438686e-01 5.56131403e-03]
[2.39903622e-01 7.60096378e-01]
[9.99999929e-01 7.12687303e-08]
[2.85450802e-02 9.71454920e-01]
[1.58048444e-04 9.99841952e-01]
[8.64317915e-04 9.99135682e-01]
[4.09559091e-01 5.90440909e-01]
[2.10170399e-05 9.99978983e-01]
[1.08487446e-02 9.89151255e-01]
[9.46339884e-02 9.05366012e-01]
[1.22049402e-02 9.87795060e-01]
[3.13969214e-06 9.99996860e-01]
[1.28225452e-09 9.99999999e-01]
[9.97935560e-01 2.06443977e-03]
[3.09893469e-03 9.96901065e-01]
[9.99939153e-01 6.08470529e-05]
[3.60629927e-02 9.63937007e-01]
[8.19013468e-02 9.18098653e-01]
[2.65145829e-06 9.99997349e-01]
[1.88795844e-07 9.99999811e-01]
[9.57119383e-01 4.28806169e-02]
[6.92208248e-01 3.07791752e-01]
[8.07862121e-05 9.99919214e-01]
[1.17317402e-05 9.99988268e-01]
[9.51998962e-02 9.04800104e-01]
[1.74486089e-05 9.99982551e-01]
[4.63448209e-04 9.99536552e-01]
[1.58184215e-04 9.99841816e-01]
[9.99999747e-01 2.53323577e-07]
[9.20965040e-01 7.90349596e-02]
[9.94816113e-01 5.18388680e-03]
[9.93553295e-01 6.44670473e-03]
[1.10667435e-06 9.99998893e-01]
[1.09725577e-01 8.90274423e-01]
[2.95104373e-01 7.04895627e-01]
[9.99796875e-01 2.03125078e-04]
[1.09574831e-05 9.99989043e-01]
[6.09456900e-03 9.93905431e-01]
[9.99782688e-01 2.17311643e-04]
[1.16149561e-01 8.83850439e-01]
[5.13534719e-04 9.99486465e-01]
[9.95364288e-01 4.63571180e-03]
[2.65825794e-02 9.73417421e-01]
[9.99997410e-01 2.58966818e-06]
[1.44142525e-04 9.99855857e-01]
[9.70764038e-01 2.92359623e-02]
[9.99970342e-01 2.96577722e-05]
[1.19634293e-01 8.80365707e-01]
[9.99916821e-01 8.31791722e-05]
[9.97319823e-01 2.68017681e-03]
[9.99993126e-01 6.87375854e-06]
[9.41964827e-01 5.80351733e-02]
[9.99828068e-01 1.71932387e-04]
[3.45636385e-04 9.99654364e-01]
[3.03179937e-06 9.99996968e-01]
[1.26799614e-05 9.99987320e-01]
[1.11588951e-02 9.88841105e-01]
[1.69097304e-06 9.99998309e-01]
[1.70891329e-07 9.99999829e-01]
[9.99999533e-01 4.67389033e-07]
[7.64335290e-05 9.99923566e-01]
[2.55542386e-06 9.99997445e-01]
[9.28697331e-06 9.99990713e-01]
[7.69088492e-05 9.99923091e-01]
[7.92411960e-07 9.99999208e-01]
[1.18145721e-04 9.99881854e-01]
[2.52181109e-06 9.99997478e-01]
[4.82832325e-05 9.99951717e-01]
[9.83200612e-01 1.67993881e-02]
[2.63833645e-04 9.99736166e-01]
[9.99999965e-01 3.51983982e-08]
[1.12039707e-04 9.99887960e-01]
[9.23659397e-01 7.63406026e-02]
[9.99999556e-01 4.43830617e-07]
[6.19260924e-02 9.38073908e-01]
[1.04323972e-04 9.99895676e-01]
[5.12745158e-07 9.99999487e-01]
[3.49957354e-06 9.99996500e-01]
[1.28808783e-04 9.99871191e-01]
[4.36959456e-04 9.99563041e-01]
[1.94609279e-06 9.99998054e-01]
[5.45220818e-04 9.99454779e-01]
[1.86040635e-05 9.99981396e-01]
[1.76110141e-04 9.99823890e-01]
[9.96085777e-01 3.91422290e-03]
[9.99999598e-01 4.02286660e-07]
[1.27051343e-04 9.99872949e-01]
[9.96779976e-01 3.22002428e-03]
[9.99901540e-01 9.84601602e-05]
[1.82065787e-03 9.98179342e-01]
[2.08998839e-03 9.97910012e-01]
[3.45750842e-03 9.96542492e-01]
[6.99378232e-05 9.99930062e-01]
[6.85765472e-01 3.14234528e-01]
[1.19267604e-06 9.99998807e-01]
[2.56756257e-06 9.99997432e-01]
[4.09987879e-04 9.99590012e-01]
[6.50149230e-04 9.99349851e-01]
[9.40790308e-05 9.99905921e-01]
[3.11541383e-03 9.96884586e-01]
[3.27213132e-05 9.99967279e-01]
[7.25681710e-07 9.99999274e-01]
[1.15872485e-01 8.84127515e-01]
[8.17165530e-01 1.82834470e-01]
[3.46966857e-04 9.99653033e-01]
[6.64659852e-05 9.99933534e-01]
[2.64912376e-04 9.99735088e-01]
[6.15623913e-04 9.99384376e-01]
[2.62719719e-06 9.99997373e-01]
[9.99806748e-01 1.93252061e-04]
[4.00037057e-03 9.95999629e-01]
[9.86134737e-01 1.38652631e-02]
[1.72293570e-03 9.98277064e-01]
[1.78065211e-03 9.98219348e-01]
[7.03275720e-03 9.92967243e-01]
[9.99984453e-01 1.55467425e-05]
[9.99571794e-01 4.28206024e-04]
[2.02096250e-01 7.97903750e-01]
[7.33173317e-05 9.99926683e-01]
[9.99895681e-01 1.04319106e-04]
[2.22635349e-03 9.97773647e-01]
[1.30675148e-03 9.98693249e-01]
[4.73421954e-03 9.95265780e-01]
[1.00462578e-04 9.99899537e-01]
[3.80919920e-04 9.99619080e-01]
[2.88492962e-02 9.71150704e-01]
[9.99998271e-01 1.72878758e-06]
[6.67924152e-01 3.32075848e-01]
[1.27364141e-02 9.87263586e-01]
[4.57858161e-04 9.99542142e-01]
[2.53516258e-06 9.99997465e-01]
[4.87816178e-05 9.99951218e-01]
[9.96524194e-01 3.47580583e-03]
[5.79898502e-04 9.99420101e-01]
[5.33746974e-03 9.94662530e-01]
[9.99083817e-01 9.16183251e-04]
[1.06657052e-01 8.93342948e-01]
[4.37575398e-05 9.99956242e-01]
[2.05053307e-06 9.99997949e-01]
[2.22635801e-04 9.99777364e-01]
[3.53938806e-03 9.96460612e-01]
[7.80460721e-03 9.92195393e-01]
[1.83892772e-04 9.99816107e-01]
[1.15929179e-05 9.99988407e-01]
[9.99999949e-01 5.05830180e-08]
[9.99996581e-01 3.41906522e-06]
[9.99954824e-01 4.51756629e-05]
[1.00000000e+00 1.76719844e-10]]
import pandas as pd
# Build a cross-tabulation of predicted vs. actual classes
conf_matrix = pd.crosstab(index=pred, columns=test['diagnosis'],
                          rownames=['Predicted'], colnames=['Actual'])
print(conf_matrix)
Actual 0 1
Predicted
0 58 0
1 8 105
from sklearn.metrics import accuracy_score
# Compute accuracy
accuracy = accuracy_score(test['diagnosis'], pred)
# Show the result
print(f"Exactitud del modelo: {accuracy * 100:.2f} %")
Exactitud del modelo: 95.32 %
import matplotlib.pyplot as plt
# Get the LDA projection (LD1)
X_lda = lda_model.transform(X_train)
# Build a color map by class; diagnosis is coded 0/1 in this dataset, not B/M
# (assumed coding: 0 = malignant, 1 = benign)
color_map = {0: 'red', 1: 'blue'}
colors = train['diagnosis'].astype(int).map(color_map)
# Create the one-dimensional scatter plot
plt.figure(figsize=(10, 2))
plt.scatter(X_lda[:, 0], [0] * len(X_lda), c=colors, alpha=0.6, edgecolor='k')
plt.xlabel("LD1")
# Remove the Y axis ticks
plt.yticks([])
plt.title("LDA projection - LD1")
plt.grid(True, axis='x')
plt.tight_layout()
plt.show()