List and characteristics of the variables in the dataset:
## description Variable Type
## 1 Hypothyroidism Hypothyroidism factor
## 2 Code Code character
## 3 Owner Owner character
## 4 Patient Patient character
## 5 Breed Breed factor
## 6 Gender Gender factor
## 7 Age in years Age_years numeric
## 8 Age in months Age_months numeric
## 9 Age Age numeric
## 10 Weight (Kg) Weight numeric
## 11 Body Condition Score BCS numeric
## 12 Thyroxine (T4 nmol/l) T4_nmol.l numeric
## 13 Thyroxine (T4 µg/dl) T4_µg.dl numeric
## 14 T4 post TSH (nmol/ml) T4.post.TSH..nmol.ml. numeric
## 15 T4 post TSH (µg/dl) T4.post.TSH..µg.dl. numeric
## 16 Increase from baseline Increase.from.baseline numeric
## 17 Thyroid-stimulating hormone (TSH ng/ml) TSH_ng.ml numeric
## 18 Increased TSH Increased.TSH factor
## 19 Cholesterol (mg/dl) Cholesterol_mg.dl numeric
## 20 Increased Cholesterol Increased.Cholesterol factor
## 21 Triglycerides Triglycerides numeric
## 22 Creatinine Creatinine numeric
## 23 Alanine Aminotransferase (ALT) ALT numeric
## 24 Aspartate Aminotransferase (AST) AST numeric
## 25 Hemoglobin (Hgb) Hgb numeric
## 26 Hematocrit (Hct) Hct numeric
## 27 Red blood cells (RBC) RBC numeric
## 28 Mean corpuscular volume (MCV) MCV numeric
## 29 Mean corpuscular hemoglobin concentration (MCHC) MCHC numeric
## 30 urine-specific gravity USG numeric
## 31 urine protein-to-creatinine UPC numeric
## 32 Asthenia Asthenia factor
## 33 Lethargy/Depression Lethargy_Depression factor
## 34 Polyuria and polydipsia (PU/PD) PU_PD factor
## 35 Appetite Appetite factor
## 36 Obesity Obesity factor
## 37 Alopecia Alopecia factor
## 38 Dermatopathy Dermatopathy factor
## 39 Neurological alterations Neuro_alterations factor
## 40 Other symptoms other.symptoms character
## 41 Reason for the test Reason.for.test character
## 42 Diagnosis Diagnosis character
Some of this information is derived from a simple patient history (e.g. Asthenia, Lethargy_Depression, etc.), some results from routine tests (Cholesterol_mg.dl, Hct, ...), and some comes from more specific analyses linked to the disease (T4_nmol.l, TSH_ng.ml).
For the information obtained through laboratory analyses (T4_nmol.l, TSH_ng.ml, Cholesterol_mg.dl, Hct, ...), the veterinarians indicated that there may be issues with how these quantities are measured, so it may be appropriate to express them as classes.
In summary, the variables/information fall into three groups: patient history, routine tests, and disease-specific tests.
Presence of missing data and anomalous values
## Variable n.missing perc.missing
## 1 Hypothyroidism 0 0.0000000
## 2 Breed 0 0.0000000
## 3 Gender 0 0.0000000
## 4 Age 0 0.0000000
## 5 T4_nmol.l 0 0.0000000
## 6 T4_µg.dl 0 0.0000000
## 7 TSH_ng.ml 0 0.0000000
## 8 Increased.TSH 0 0.0000000
## 9 Increased.Cholesterol 0 0.0000000
## 10 Asthenia 0 0.0000000
## 11 Lethargy_Depression 0 0.0000000
## 12 PU_PD 0 0.0000000
## 13 Obesity 0 0.0000000
## 14 Alopecia 0 0.0000000
## 15 Dermatopathy 0 0.0000000
## 16 Neuro_alterations 0 0.0000000
## 17 T4_cl 0 0.0000000
## 18 TSH_cl 0 0.0000000
## 19 Appetite 1 0.3174603
## 20 Weight 3 0.9523810
## 21 Creatinine 8 2.5396825
## 22 Hct 8 2.5396825
## 23 Hct_cl 8 2.5396825
## 24 ALT 9 2.8571429
## 25 Hgb 9 2.8571429
## 26 RBC 9 2.8571429
## 27 MCV 9 2.8571429
## 28 Cholesterol_mg.dl 12 3.8095238
## 29 MCHC 12 3.8095238
## 30 Cholesterol_cl 12 3.8095238
## 31 AST 17 5.3968254
## 32 Triglycerides 171 54.2857143
## 33 USG 205 65.0793651
## 34 T4.post.TSH..nmol.ml. 209 66.3492063
## 35 T4.post.TSH..µg.dl. 209 66.3492063
## 36 Increase.from.baseline 209 66.3492063
## 37 BCS 230 73.0158730
## 38 UPC 250 79.3650794
I drop the variables with 5% or more missing values:
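A minimal sketch of this screening step (the dataset name `dogs` is an assumption):

```r
# Count and rank missing values per variable, then drop those with
# 5% or more missing.
miss <- data.frame(Variable     = names(dogs),
                   n.missing    = colSums(is.na(dogs)),
                   perc.missing = 100 * colMeans(is.na(dogs)),
                   row.names    = NULL)
miss[order(miss$n.missing), ]              # the table above

dogs_red <- dogs[, miss$perc.missing < 5]  # reduced dataset
summary(dogs_red)                          # the overview below
```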
## Code Hypothyroidism Breed Gender Age Weight T4_nmol.l T4_µg.dl TSH_ng.ml Increased.TSH
## Length:315 No :233 meticcio : 85 C: 37 Min. : 1.250 Min. : 1.00 Min. : 1.29 Min. :0.1006 Min. :0.0100 No :246
## Class :character Yes: 82 dobermann : 25 F: 58 1st Qu.: 6.250 1st Qu.: 14.93 1st Qu.: 6.68 1st Qu.:0.5210 1st Qu.:0.1000 Yes: 69
## Mode :character labrador : 18 M:126 Median : 8.917 Median : 26.30 Median :16.00 Median :1.2480 Median :0.1700
## golden retriever: 14 S: 94 Mean : 8.704 Mean : 28.19 Mean :16.94 Mean :1.3216 Mean :0.4017
## setter inglese : 8 3rd Qu.:11.292 3rd Qu.: 37.50 3rd Qu.:23.15 3rd Qu.:1.8057 3rd Qu.:0.3200
## pinscher : 7 Max. :17.250 Max. :323.00 Max. :51.86 Max. :4.0451 Max. :8.9000
## (Other) :158 NA's :3
## Cholesterol_mg.dl Increased.Cholesterol Creatinine ALT Hgb Hct RBC MCV MCHC Asthenia
## Min. : 76.0 No :166 Min. :0.410 Min. : 8.00 Min. : 1.30 Min. : 4.90 Min. : 483000 Min. :54.80 Min. :21.2 No :187
## 1st Qu.: 227.5 Yes:149 1st Qu.:0.790 1st Qu.: 40.00 1st Qu.:12.93 1st Qu.:38.60 1st Qu.:5792500 1st Qu.:65.72 1st Qu.:33.0 Yes:128
## Median : 316.0 Median :0.950 Median : 61.50 Median :14.95 Median :43.80 Median :6510000 Median :68.00 Median :33.8
## Mean : 374.7 Mean :1.005 Mean : 97.74 Mean :14.77 Mean :43.62 Mean :6438343 Mean :67.84 Mean :33.9
## 3rd Qu.: 446.0 3rd Qu.:1.145 3rd Qu.: 103.00 3rd Qu.:16.60 3rd Qu.:49.00 3rd Qu.:7065000 3rd Qu.:70.30 3rd Qu.:34.9
## Max. :2025.0 Max. :6.300 Max. :1285.00 Max. :21.10 Max. :60.70 Max. :9420000 Max. :82.00 Max. :41.3
## NA's :12 NA's :8 NA's :9 NA's :9 NA's :8 NA's :9 NA's :9 NA's :12
## Lethargy_Depression PU_PD Appetite Obesity Alopecia Dermatopathy Neuro_alterations Cholesterol_cl Hct_cl T4_cl TSH_cl
## No :209 No :262 Low : 31 No :240 No :196 No :243 No :227 Normal : 79 Anemia: 56 Unmeasurable: 76 Normal/Low:244
## Yes:106 Yes: 53 Normal:238 Yes: 75 Yes:119 Yes: 72 Yes: 88 High :153 Normal:251 Low : 33 High : 71
## High : 45 More than double: 71 NA's : 8 Normal :206
## NA's : 1 NA's : 12
Anomalous value for Weight (323 kg).
Are the values Cholesterol = 2025, ALT > 1200 and AST > 1000 plausible?
(see Saeys, 2007, "A review of feature selection techniques in bioinformatics")
Two issues to address:

- identify the variables that most influence the target variable Hypothyroidism;
- check whether some explanatory variables are strongly correlated with one another and can therefore be excluded.
## attributes importance
## 1 T4_nmol.l 0.439602976
## 2 T4_µg.dl 0.439602976
## 3 T4_cl 0.410476462
## 4 TSH_ng.ml 0.288541997
## 5 Increased.TSH 0.281307285
## 6 TSH_cl 0.268059807
## 7 Breed 0.208001945
## 8 Cholesterol_mg.dl 0.124270429
## 9 Cholesterol_cl 0.119704846
## 10 Hgb 0.092413165
## 11 RBC 0.090852342
## 12 Hct 0.074769725
## 13 Increased.Cholesterol 0.058339750
## 14 Lethargy_Depression 0.046991070
## 15 Hct_cl 0.043764557
## 16 Creatinine 0.042732723
## 17 Alopecia 0.035400534
## 18 Obesity 0.021026575
## 19 Asthenia 0.014613008
## 20 Dermatopathy 0.012015703
## 21 Appetite 0.006361986
## 22 Neuro_alterations 0.004726434
## 23 PU_PD 0.004631117
## 24 Gender 0.004576536
## 25 Age 0.000000000
## 26 Weight 0.000000000
## 27 ALT 0.000000000
## 28 MCV 0.000000000
## 29 MCHC 0.000000000
Information Gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute; entropy (a central quantity in Information Theory) characterizes the (im)purity of an arbitrary collection of examples:

infogain = H(Class) + H(Attribute) − H(Class, Attribute)

where H(X) is Shannon's entropy of a variable X and H(X, Y) is the joint Shannon entropy of X and Y.
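A minimal sketch of this computation, following the formula above (`dogs_red` as the reduced dataset is an assumption):

```r
# Shannon entropy of a discrete variable
entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# infogain = H(Class) + H(Attribute) - H(Class, Attribute);
# numeric attributes must first be discretized (FSelector does this
# internally)
infogain <- function(class, attribute) {
  entropy(class) + entropy(attribute) - entropy(paste(class, attribute))
}

# Off-the-shelf equivalent with the FSelector package:
# FSelector::information.gain(Hypothyroidism ~ ., data = dogs_red)
```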
## Age Weight T4_nmol.l T4_µg.dl TSH_ng.ml Cholesterol_mg.dl Creatinine ALT Hgb Hct RBC MCV MCHC
## Age 1.000 -0.086 0.004 0.004 -0.086 -0.063 -0.013 0.127 -0.089 -0.105 -0.093 0.029 0.007
## Weight -0.086 1.000 -0.154 -0.154 -0.012 0.113 0.179 -0.137 -0.166 -0.142 -0.119 -0.104 0.015
## T4_nmol.l 0.004 -0.154 1.000 1.000 -0.359 -0.398 -0.150 -0.011 0.339 0.303 0.348 0.036 0.089
## T4_µg.dl 0.004 -0.154 1.000 1.000 -0.359 -0.398 -0.150 -0.011 0.339 0.303 0.348 0.036 0.089
## TSH_ng.ml -0.086 -0.012 -0.359 -0.359 1.000 0.315 0.166 -0.006 -0.208 -0.201 -0.214 -0.007 -0.033
## Cholesterol_mg.dl -0.063 0.113 -0.398 -0.398 0.315 1.000 0.093 0.076 -0.210 -0.195 -0.235 0.065 -0.046
## Creatinine -0.013 0.179 -0.150 -0.150 0.166 0.093 1.000 -0.091 -0.205 -0.177 -0.192 0.051 0.006
## ALT 0.127 -0.137 -0.011 -0.011 -0.006 0.076 -0.091 1.000 -0.002 0.076 0.033 0.120 -0.071
## Hgb -0.089 -0.166 0.339 0.339 -0.208 -0.210 -0.205 -0.002 1.000 0.824 0.819 0.131 0.225
## Hct -0.105 -0.142 0.303 0.303 -0.201 -0.195 -0.177 0.076 0.824 1.000 0.859 0.200 -0.058
## RBC -0.093 -0.119 0.348 0.348 -0.214 -0.235 -0.192 0.033 0.819 0.859 1.000 -0.130 0.021
## MCV 0.029 -0.104 0.036 0.036 -0.007 0.065 0.051 0.120 0.131 0.200 -0.130 1.000 -0.172
## MCHC 0.007 0.015 0.089 0.089 -0.033 -0.046 0.006 -0.071 0.225 -0.058 0.021 -0.172 1.000
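The matrix above can be reproduced along these lines (Pearson correlations on pairwise-complete cases are an assumption; the variable list matches the matrix header):

```r
num_vars <- c("Age", "Weight", "T4_nmol.l", "T4_µg.dl", "TSH_ng.ml",
              "Cholesterol_mg.dl", "Creatinine", "ALT", "Hgb", "Hct",
              "RBC", "MCV", "MCHC")
# pairwise.complete.obs, since several variables still contain NAs
round(cor(dogs_red[, num_vars], use = "pairwise.complete.obs"), 3)
```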
A Mann-Whitney test is used to check whether, for each continuous variable, there is a significant difference between the two levels of the dependent variable Hypothyroidism:
## variable W p.value signif
## 1 Age 11205.0 1.988916e-02 *
## 2 Weight 7947.5 3.459423e-02 *
## 3 T4_nmol.l 18658.0 6.446435e-38 ***
## 4 T4_µg.dl 18658.0 6.446435e-38 ***
## 5 TSH_ng.ml 2280.0 1.101113e-24 ***
## 6 Cholesterol_mg.dl 3461.0 2.555047e-16 ***
## 7 Creatinine 6562.5 1.574912e-04 **
## 8 ALT 8209.0 2.219874e-01
## 9 Hgb 14052.0 4.692396e-13 ***
## 10 Hct 13987.0 1.768019e-12 ***
## 11 RBC 14194.5 9.916545e-14 ***
## 12 MCV 9356.5 7.213665e-01
## 13 MCHC 9440.5 5.057451e-01
(Boxplots of each continuous variable by Hypothyroidism status; 79 observations with missing values were removed from the plots.)
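A sketch of the tests above, reusing `num_vars` from the correlation sketch: one `wilcox.test` per continuous variable, comparing the two Hypothyroidism groups.

```r
mw <- do.call(rbind, lapply(num_vars, function(v) {
  tst <- wilcox.test(dogs_red[[v]] ~ dogs_red$Hypothyroidism)
  data.frame(variable = v,
             W        = unname(tst$statistic),
             p.value  = tst$p.value)
}))
mw[order(mw$p.value), ]
```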
Variables with low information gain and those highly correlated with retained variables are excluded from the analysis; specifically, the following are used:
## [1] "T4_nmol.l" "TSH_ng.ml" "Cholesterol_mg.dl"
## [4] "Hct" "Creatinine"
Hct was chosen over the correlated Hgb and RBC, on the veterinarians' advice.
I am unsure whether to keep Creatinine.
Qualitative variables
Breed issue: there seems to be a relationship with breed, but the dataset currently contains 83 distinct breed levels; could they be aggregated?
For some breeds the differences are trivial to fix (pit bull / pitbull; pastore maremmano-abruzzese / pastore maremmano abruzzese), but too many levels would still remain: does it make sense to define 7-8 groups?
Quantitative variables in classes
As an alternative to the variables Increased.TSH and Increased.Cholesterol, a standard classification can be used for the corresponding variables TSH_ng.ml and Cholesterol_mg.dl; classed variables can likewise be defined for T4_nmol.l and Hct, as in the sketch below.
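A sketch of how such classed variables can be built with `cut()`; the cut points below are placeholders, not the clinical reference values, and the class labels follow the levels shown in the summary above.

```r
# Hypothetical thresholds, for illustration only
dogs_red$TSH_cl <- cut(dogs_red$TSH_ng.ml,
                       breaks = c(-Inf, 0.5, Inf),
                       labels = c("Normal/Low", "High"))
dogs_red$Cholesterol_cl <- cut(dogs_red$Cholesterol_mg.dl,
                               breaks = c(-Inf, 270, 540, Inf),
                               labels = c("Normal", "High", "More than double"))
```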
A chi-squared test is used to check the independence between Hypothyroidism and each of the qualitative variables:
## variable X.squared df p.value signif
## 1 Gender 2.707214 3 4.390027e-01
## 2 Increased.TSH 182.690742 1 1.252984e-41 ***
## 3 Increased.Cholesterol 34.119684 1 5.182437e-09 ***
## 4 Asthenia 8.541705 1 3.471003e-03 **
## 5 Lethargy_Depression 29.261257 1 6.324773e-08 ***
## 6 PU_PD 2.175105 1 1.402600e-01
## 7 Appetite 3.113449 2 2.108255e-01
## 8 Obesity 13.035472 1 3.056461e-04 **
## 9 Alopecia 21.534621 1 3.474983e-06 ***
## 10 Dermatopathy 7.170476 1 7.411311e-03 **
## 11 Neuro_alterations 2.395061 1 1.217190e-01
## 12 Cholesterol_cl 76.783662 2 2.121484e-17 ***
## 13 Hct_cl 27.805310 1 1.341574e-07 ***
## 14 T4_cl 237.167775 2 3.159891e-52 ***
## 15 TSH_cl 174.744392 1 6.808088e-40 ***
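A sketch of the tests above: one `chisq.test` per qualitative variable against Hypothyroidism.

```r
qual_vars <- c("Gender", "Increased.TSH", "Increased.Cholesterol",
               "Asthenia", "Lethargy_Depression", "PU_PD", "Appetite",
               "Obesity", "Alopecia", "Dermatopathy", "Neuro_alterations",
               "Cholesterol_cl", "Hct_cl", "T4_cl", "TSH_cl")
chi <- do.call(rbind, lapply(qual_vars, function(v) {
  tst <- chisq.test(table(dogs_red$Hypothyroidism, dogs_red[[v]]))
  data.frame(variable  = v,
             X.squared = unname(tst$statistic),
             df        = unname(tst$parameter),
             p.value   = tst$p.value)
}))
chi
```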
Qualitative variables to be used:
## [1] "T4_cl" "TSH_cl" "Cholesterol_cl"
## [4] "Hct_cl" "Lethargy_Depression" "Alopecia"
## [7] "Obesity" "Asthenia" "Dermatopathy"
The aim is to define a model that predicts the presence of hypothyroidism. Four specifications are compared:

model 1: basic variables, all qualitative [qual_rid]
Hypothyroidism ~ Cholesterol_cl + Hct_cl + Lethargy_Depression + Alopecia + Obesity + Asthenia + Dermatopathy
model 2: basic variables, quantitative and qualitative [quan_rid]
Hypothyroidism ~ Cholesterol_mg.dl + Hct + Lethargy_Depression + Alopecia + Obesity + Asthenia + Dermatopathy
model 3: including the disease-specific test variables, all qualitative [qual_all]
Hypothyroidism ~ T4_cl + TSH_cl + Cholesterol_cl + Hct_cl + Lethargy_Depression + Alopecia + Obesity + Asthenia + Dermatopathy
model 4: including the disease-specific test variables, quantitative and qualitative [quan_all]
Hypothyroidism ~ T4_nmol.l + TSH_ng.ml + Cholesterol_mg.dl + Hct + Lethargy_Depression + Alopecia + Obesity + Asthenia + Dermatopathy
For each of the models identified, the task is to determine which classification algorithm provides the best predictive ability for the target variable Hypothyroidism. The algorithms used are: Naive Bayes (NB), logistic regression (Logistic), gradient boosting (GBM), support vector machines (SVM), random forests (rForest) and classification trees (rpart).
The algorithms' predictive ability was assessed with the training/test dataset method: the data were split into a random training sample containing 75% of the observations and a test dataset with the remaining 25%. The training dataset was used to build the classification rules; the test dataset was then used to assess the algorithm's actual predictive ability on data not involved in building those rules. Predictive ability was measured with the following indices: F1 score, Accuracy, Cohen's Kappa and AUC.
To counter problems linked to the random selection of the training and test samples, the sampling was replicated 500 times, yielding 500 measurements of each predictive-ability index.
The analyses were carried out excluding the observations with missing data.
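A sketch of this evaluation scheme for one algorithm (Naive Bayes shown; the object `dogs_cc` for the complete-case data and the use of the caret and e1071 packages are assumptions):

```r
library(caret)   # createDataPartition, confusionMatrix
library(e1071)   # naiveBayes

# Hypothetical complete-case dataset for model 4 [quan_all]
dogs_cc <- na.omit(dogs_red[, c("Hypothyroidism", "T4_nmol.l", "TSH_ng.ml",
                                "Cholesterol_mg.dl", "Hct",
                                "Lethargy_Depression", "Alopecia",
                                "Obesity", "Asthenia", "Dermatopathy")])

set.seed(1)
acc <- replicate(500, {
  # random 75/25 split, stratified on the target
  idx  <- createDataPartition(dogs_cc$Hypothyroidism, p = 0.75, list = FALSE)
  fit  <- naiveBayes(Hypothyroidism ~ ., data = dogs_cc[idx, ])
  pred <- predict(fit, newdata = dogs_cc[-idx, ])
  confusionMatrix(pred, dogs_cc$Hypothyroidism[-idx],
                  positive = "Yes")$overall["Accuracy"]
})
summary(acc)   # distribution of the 500 accuracy estimates
```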
An alternative model-evaluation method could be repeated k-fold cross-validation: the dataset is divided into k subsamples, and in turn k-1 of them are used for training and the remaining one for testing, with the whole procedure repeated n times.
In terms of accuracy, there is an overall improvement moving from model 1 to model 2 and then to the models including the T4 and TSH variables. Models 3 and 4, in which these two variables are expressed as classes or as quantitative values respectively, show no substantial differences.
| model | method | F1 | Accuracy | Kappa | AUC |
|---|---|---|---|---|---|
| qual_rid | NB | 0.6664 | 0.8192 | 0.5427 | 0.8460 |
| | Logistic | 0.6389 | 0.8195 | 0.5205 | 0.8362 |
| | GBM | 0.6349 | 0.8205 | 0.5187 | 0.8357 |
| | SVM | 0.6201 | 0.8074 | 0.4921 | 0.8296 |
| | rForest | 0.5969 | 0.8167 | 0.4852 | 0.8380 |
| | rpart | 0.5929 | 0.8074 | 0.4712 | 0.7770 |
| quan_rid | NB | 0.7242 | 0.8633 | 0.6341 | 0.8835 |
| | SVM | 0.7201 | 0.8631 | 0.6309 | 0.8752 |
| | Logistic | 0.7112 | 0.8579 | 0.6182 | 0.8691 |
| | GBM | 0.6776 | 0.8471 | 0.5804 | 0.8608 |
| | rForest | 0.6625 | 0.8426 | 0.5637 | 0.8732 |
| | rpart | 0.6121 | 0.7992 | 0.4786 | 0.7851 |
| qual_all | NB | 0.9127 | 0.9534 | 0.8807 | 0.9834 |
| | GBM | 0.9032 | 0.9477 | 0.8672 | 0.9803 |
| | rForest | 0.8921 | 0.9431 | 0.8532 | 0.9771 |
| | Logistic | 0.8869 | 0.9397 | 0.8456 | 0.9763 |
| | rpart | 0.8738 | 0.9323 | 0.8274 | 0.9474 |
| | SVM | 0.8718 | 0.9334 | 0.8266 | 0.9781 |
| quan_all | GBM | 0.9134 | 0.9527 | 0.8806 | 0.9868 |
| | rForest | 0.9104 | 0.9521 | 0.8774 | 0.9863 |
| | NB | 0.8949 | 0.9463 | 0.8587 | 0.9836 |
| | rpart | 0.8893 | 0.9415 | 0.8494 | 0.9436 |
| | SVM | 0.8868 | 0.9398 | 0.8456 | 0.9863 |
| | Logistic | 0.8843 | 0.9379 | 0.8415 | 0.9838 |
Across the models analysed, the algorithm giving the best predictive performance varies with the indicator used. In terms of the F1 score, considered the most suitable indicator for unbalanced samples (being the harmonic mean of recall and precision), Naive Bayes performs best for models 1, 2 and 3, while for model 4 random forest and GBM come out on top.
Variable importance
(Variable-importance plots for each model and method.)
Could the models be simplified further?
If the goal is an application that lets end users quantify the probability that a patient is hypothyroid, a simpler interface may be desirable.
This means reducing the number of pieces of information to be entered, without a loss of predictive ability.
As can be seen from the previous plots, within each model the various methods produce different rankings of variable importance.
To select the relevant variables for each method and model, a wrapper method was used. For each of the four models, the list of all possible combinations of explanatory variables was generated, from 2 up to V (V = number of variables in the model); for each combination and each method, Accuracy, Kappa and F1 were computed with repeated K-fold cross-validation. (K-fold cross-validation randomly divides the data into K blocks of roughly equal size. Each block is left out in turn and the other K-1 blocks are used to train the model. The held-out block is predicted and these predictions are summarized into some performance measure, e.g. accuracy, Kappa, F1 score. The K estimates of performance are averaged to get the overall resampled estimate. Repeated K-fold CV does the same, more than once.)
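A sketch of this wrapper search for one model/method pair, assuming the caret package (fold and repeat counts are assumptions; `method = "nb"` requires the klaR package):

```r
library(caret)

# All combinations of 2..V variables for model 1 [qual_rid]
vars <- c("Cholesterol_cl", "Hct_cl", "Lethargy_Depression", "Alopecia",
          "Obesity", "Asthenia", "Dermatopathy")
combos <- unlist(lapply(2:length(vars), combn, x = vars, simplify = FALSE),
                 recursive = FALSE)

# Repeated 10-fold cross-validation, 5 repeats
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

acc <- sapply(combos, function(v) {
  f   <- reformulate(v, response = "Hypothyroidism")
  fit <- train(f, data = dogs_cc, method = "nb", trControl = ctrl)
  max(fit$results$Accuracy)
})
combos[[which.max(acc)]]   # best-scoring combination for this method
```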
As the following table shows, reducing the number of variables generally improves the performance of the various methods:
| model | method | variables | F1 | Accuracy | Kappa |
|---|---|---|---|---|---|
| qual_rid | NB | Cholesterol_cl, Hct_cl, Lethargy_Depression, Alopecia | 0.6722879 | 0.8331388 | 0.5614605 |
| | GB | Cholesterol_cl, Hct_cl, Alopecia, Obesity | 0.6699684 | 0.8417368 | 0.5678996 |
| | LR | Cholesterol_cl, Hct_cl, Lethargy_Depression, Alopecia, Obesity, Dermatopathy | 0.6622315 | 0.8308051 | 0.5507083 |
| | RF | Cholesterol_cl, Hct_cl, Obesity, Dermatopathy | 0.6397103 | 0.8396161 | 0.5416011 |
| | SV | Cholesterol_cl, Hct_cl, Alopecia, Obesity, Asthenia | 0.6244554 | 0.8072303 | 0.4959919 |
| | CT | Cholesterol_cl, Obesity | 0.6182639 | 0.8071066 | 0.4899868 |
| quan_rid | NB | Cholesterol_mg.dl, Hct, Lethargy_Depression, Alopecia, Obesity, Dermatopathy | 0.7391153 | 0.8724583 | 0.6567123 |
| | SV | Cholesterol_mg.dl, Hct, Lethargy_Depression, Alopecia, Dermatopathy | 0.7353160 | 0.8733229 | 0.6536291 |
| | LR | Cholesterol_mg.dl, Hct, Lethargy_Depression, Alopecia | 0.7309085 | 0.8669349 | 0.6428263 |
| | RF | Cholesterol_mg.dl, Hct, Lethargy_Depression, Alopecia, Obesity, Asthenia | 0.6844513 | 0.8523544 | 0.5902075 |
| | GB | ALL | 0.6658634 | 0.8337316 | 0.5571375 |
| | CT | Cholesterol_mg.dl, Hct, Lethargy_Depression, Asthenia | 0.6576579 | 0.8348533 | 0.5531177 |
| qual_all | NB | T4_cl, TSH_cl, Cholesterol_cl, Hct_cl, Lethargy_Depression | 0.9359539 | 0.9653424 | 0.9118712 |
| | LR | T4_cl, TSH_cl | 0.9220547 | 0.9566991 | 0.8915479 |
| | GB | T4_cl, TSH_cl | 0.9198252 | 0.9562176 | 0.8893624 |
| | RF | T4_cl, TSH_cl | 0.9184423 | 0.9550045 | 0.8868662 |
| | CT | T4_cl, TSH_cl, Obesity | 0.8991319 | 0.9433005 | 0.8590729 |
| | SV | T4_cl, Lethargy_Depression, Alopecia, Obesity | 0.8805999 | 0.9361673 | 0.8367642 |
| quan_all | RF | T4_nmol.l, TSH_ng.ml, Hct, Alopecia | 0.9320108 | 0.9648773 | 0.9079539 |
| | LR | T4_nmol.l, Cholesterol_mg.dl, Lethargy_Depression, Dermatopathy | 0.9219543 | 0.9570454 | 0.8919865 |
| | GB | T4_nmol.l, TSH_ng.ml, Hct, Obesity, Asthenia, Dermatopathy | 0.9216226 | 0.9584909 | 0.8932769 |
| | SV | T4_nmol.l, TSH_ng.ml, Lethargy_Depression, Asthenia | 0.9203423 | 0.9564712 | 0.8901038 |
| | NB | T4_nmol.l, Dermatopathy | 0.9145703 | 0.9518603 | 0.8808681 |
| | CT | T4_nmol.l, TSH_ng.ml, Cholesterol_mg.dl, Hct, Lethargy_Depression, Obesity | 0.9100100 | 0.9499311 | 0.8748876 |
Note also that the same combinations of variables tend to be the ones giving the best results for all methods.
It should also be stressed that, within the same model and method, the variable combinations giving the best results do not differ significantly from one another.
Considering only the Naive Bayes method in the qual_rid model, and selecting the 10 variable combinations with the best results in terms of F1 score:
| variables | mean.F1s | sd.F1s | mean.Accuracy | sd.Accuracy |
|---|---|---|---|---|
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia | 0.6722879 | 0.0916673 | 0.8331388 | 0.0424534 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia,Obesity,Dermatopathy | 0.6717556 | 0.0830683 | 0.8267432 | 0.0429851 |
| ALL | 0.6699881 | 0.0896321 | 0.8225279 | 0.0508599 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia,Obesity,Asthenia | 0.6698422 | 0.0961234 | 0.8227509 | 0.0566542 |
| Cholesterol_cl,Hct_cl,Alopecia,Obesity,Dermatopathy | 0.6682124 | 0.0762412 | 0.8352865 | 0.0426273 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia,Obesity | 0.6611154 | 0.0932299 | 0.8249472 | 0.0403212 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia,Dermatopathy | 0.6573141 | 0.0809044 | 0.8206604 | 0.0473017 |
| Cholesterol_cl,Lethargy_Depression,Alopecia,Obesity,Dermatopathy | 0.6563939 | 0.0886870 | 0.8203639 | 0.0387466 |
| Cholesterol_cl,Lethargy_Depression,Alopecia,Obesity | 0.6542365 | 0.0839609 | 0.8212951 | 0.0427165 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Obesity,Dermatopathy | 0.6527167 | 0.0999858 | 0.8246731 | 0.0429195 |
The means do not differ significantly from one another.
Analysis of variance:
## Df Sum Sq Mean Sq F value Pr(>F)
## tmp1NBr$variables 9 0.027 0.003035 0.386 0.942
## Residuals 490 3.848 0.007854
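The ANOVA above can be reproduced along these lines (`tmp1NBr` holds the 500 resampled scores, 50 per combination; the column name `F1s` is an assumption):

```r
# One-way ANOVA of the F1 scores across the 10 best variable combinations
summary(aov(tmp1NBr$F1s ~ tmp1NBr$variables))
```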
Based on the previous results, a different method and variable set could be defined for each model; however, this could be confusing when entering data into the app. Since there are groups of variables whose quality indices are not significantly different, one could instead identify the subset of variables common to the four models that yields the best result.
Confusion matrix:
| Predicted \ True | Yes | No | Total |
|---|---|---|---|
| Yes | True Positive | False Positive | P' |
| No | False Negative | True Negative | N' |
| Total | P | N | T |
True positive (TP): Prediction is Yes and X is hypothyroid
True negative (TN): Prediction is No and X is healthy
False positive (FP): Prediction is Yes and X is healthy
False negative (FN): Prediction is No and X is hypothyroid
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Precision = TP/(TP+FP)
Recall (aka Sensitivity) = TP/(TP+FN)
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
Specificity = TN/(TN+FP)
Balanced Accuracy = (Recall + Specificity) / 2
Cohen’s Kappa = (Observed accuracy - Expected accuracy) / (1 - Expected accuracy) where Expected accuracy = (P * P’)/T² + (N * N’)/T²
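A minimal sketch computing the indices above from the four confusion-matrix counts (the counts in the usage line are illustrative only):

```r
metrics <- function(TP, FP, FN, TN) {
  tot         <- TP + FP + FN + TN
  accuracy    <- (TP + TN) / tot
  precision   <- TP / (TP + FP)
  recall      <- TP / (TP + FN)              # sensitivity
  specificity <- TN / (TN + FP)
  f1          <- 2 * recall * precision / (recall + precision)
  # Expected accuracy = (P * P')/T^2 + (N * N')/T^2
  expected    <- ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / tot^2
  kappa       <- (accuracy - expected) / (1 - expected)
  c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1,
    Specificity = specificity, Balanced = (recall + specificity) / 2,
    Kappa = kappa)
}
metrics(TP = 70, FP = 10, FN = 12, TN = 223)   # illustrative counts
```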
The ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. It plots two parameters: the True Positive Rate, TPR = TP / (TP + FN), and the False Positive Rate, FPR = FP / (FP + TN), at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both false positives and true positives.
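A sketch of the AUC computation with the pROC package (an assumption), reusing `fit` and `idx` from the train/test sketch above:

```r
library(pROC)
# type = "raw" returns class probabilities from e1071::naiveBayes
prob    <- predict(fit, newdata = dogs_cc[-idx, ], type = "raw")[, "Yes"]
roc_obj <- roc(response = dogs_cc$Hypothyroidism[-idx], predictor = prob)
auc(roc_obj)
```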
The decision tree technique is used in classification to detect criteria for dividing the individuals of a population into n predetermined classes (in many cases, n = 2). We start by choosing the variable which, through its categories, best separates the individuals of each class, yielding sub-populations, called nodes, each containing the largest possible proportion of individuals of a single class; the same operation is then repeated on each new node obtained, until no further separation of the individuals is possible or desirable (according to criteria which depend on the type of tree). The construction is such that each of the terminal nodes (the leaves) consists mainly of individuals of a single class. An individual is assigned to a leaf, and therefore to a certain class, with a reasonably high probability, when it conforms to all the rules for reaching that leaf. The set of rules for all the leaves forms the classification model. [Tufféry, 2011]
A random forest (RF) is an ensemble classifier consisting of many DTs, similar to the way a forest is a collection of many trees. DTs that are grown very deep often overfit the training data, resulting in high variation in the classification outcome for small changes in the input data. They are very sensitive to their training data, which makes them error-prone on the test dataset. The different DTs of an RF are trained on different parts of the training dataset. To classify a new sample, its input vector is passed down each DT of the forest. Each DT then considers a different part of that input vector and gives a classification outcome. The forest then chooses the classification having the most 'votes' (for a discrete classification outcome) or the average over all trees in the forest (for a numeric classification outcome). Since the RF algorithm considers the outcomes of many different DTs, it can reduce the variance that results from considering a single DT for the same dataset. [Uddin, 2019]
In gradient boosting decision trees, many weak learners are combined into one strong learner; the weak learners here are the individual decision trees. All the trees are connected in series and each tree tries to minimise the error of the previous one. Because of this sequential connection, boosting algorithms are usually slow to learn, but also highly accurate; in statistical learning, models that learn slowly tend to perform better. The weak learners are fit so that each new learner fits the residuals of the previous step, improving the model. The final model aggregates the results of each step, yielding a strong learner. A loss function is used to detect the residuals: for instance, mean squared error (MSE) can be used for a regression task and logarithmic loss (log loss) for classification tasks. It is worth noting that existing trees in the model do not change when a new tree is added; the added decision tree fits the residuals of the current model.
Naïve Bayes (NB) is a classification technique based on Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B). The theorem describes the probability of an event based on prior knowledge of conditions related to that event. The classifier assumes that a particular feature in a class is not directly related to any other feature, although the features of that class could be interdependent. [for a full description see Tufféry, 2011]
The support vector machine (SVM) algorithm can classify both linear and non-linear data. It first maps each data item into an n-dimensional feature space, where n is the number of features. It then identifies the hyperplane that separates the data items into two classes while maximising the marginal distance for both classes and minimising the classification errors. The marginal distance for a class is the distance between the decision hyperplane and its nearest instance belonging to that class. More formally, each data point is first plotted as a point in an n-dimensional space (where n is the number of features), with the value of each feature being the value of a specific coordinate. To perform the classification, we then need to find the hyperplane that separates the two classes with the maximum margin. [Uddin, 2019; Tufféry, 2011]
Logistic regression (LR) can be considered an extension of ordinary regression; it models a dichotomous variable which usually represents the occurrence or non-occurrence of an event. LR finds the probability that a new instance belongs to a certain class; since it is a probability, the outcome lies between 0 and 1. Therefore, to use LR as a binary classifier, a threshold must be chosen to differentiate the two classes: for example, a probability higher than 0.50 for an input instance classifies it as 'class A', otherwise as 'class B'. The LR model can be generalized to model a categorical variable with more than two values; this generalized version is known as multinomial logistic regression. [Uddin, 2019; Tufféry, 2011]