List and characteristics of the variables in the dataset:
## description Variable Type
## 1 Hypothyroidism Hypothyroidism factor
## 2 Code Code character
## 3 Owner Owner character
## 4 Patient Patient character
## 5 Breed Breed factor
## 6 Gender Gender factor
## 7 Age in years Age_years numeric
## 8 Age in months Age_months numeric
## 9 Age Age numeric
## 10 Weight (Kg) Weight numeric
## 11 Body Condition Score BCS numeric
## 12 Thyroxine (T4 nmol/l) T4_nmol.l numeric
## 13 Thyroxine (T4 µg/dl) T4_µg.dl numeric
## 14 T4 post TSH (nmol/ml) T4.post.TSH..nmol.ml. numeric
## 15 T4 post TSH (µg/dl) T4.post.TSH..µg.dl. numeric
## 16 Increase from baseline Increase.from.baseline numeric
## 17 Thyroid-stimulating hormone (TSH ng/ml) TSH_ng.ml numeric
## 18 Increased TSH Increased.TSH factor
## 19 Cholesterol (mg/dl) Cholesterol_mg.dl numeric
## 20 Increased Cholesterol Increased.Cholesterol factor
## 21 Triglycerides Triglycerides numeric
## 22 Creatinine Creatinine numeric
## 23 Alanine Aminotransferase (ALT) ALT numeric
## 24 Aspartate Aminotransferase (AST) AST numeric
## 25 Hemoglobin (Hgb) Hgb numeric
## 26 Hematocrit (Hct) Hct numeric
## 27 Red blood cells (RBC) RBC numeric
## 28 Mean corpuscular volume (MCV) MCV numeric
## 29 Mean corpuscular hemoglobin concentration (MCHC) MCHC numeric
## 30 urine-specific gravity USG numeric
## 31 urine protein-to-creatinine UPC numeric
## 32 Asthenia Asthenia factor
## 33 Lethargy/Depression Lethargy_Depression factor
## 34 Polyuria and polydipsia (PU/PD) PU_PD factor
## 35 Appetite Appetite factor
## 36 Obesity Obesity factor
## 37 Alopecia Alopecia factor
## 38 Dermatopathy Dermatopathy factor
## 39 Neurological alterations Neuro_alterations factor
## 40 Other symptoms other.symptoms character
## 41 Reason for the test Reason.for.test character
## 42 Diagnosis Diagnosis character
Some of this information is derived from a simple patient history (e.g. Asthenia, Lethargy_Depression, etc.), some results from routine tests (Cholesterol_mg.dl, Hct, ...), and some comes from more specific analyses linked to the disease (T4_nmol.l, TSH_ng.ml).
For the information obtained through laboratory analyses (T4_nmol.l, TSH_ng.ml, Cholesterol_mg.dl, Hct, ...), the veterinarians indicated that there may be issues with how these quantities are measured, so it may be appropriate to express them as classes.
In summary, the variables/information fall into three groups: patient history, routine tests, and disease-specific tests.
Presence of missing data and anomalous values
## Variable n.missing perc.missing
## 1 Hypothyroidism 0 0.0000000
## 2 Breed 0 0.0000000
## 3 Gender 0 0.0000000
## 4 Age 0 0.0000000
## 5 T4_nmol.l 0 0.0000000
## 6 T4_µg.dl 0 0.0000000
## 7 TSH_ng.ml 0 0.0000000
## 8 Increased.TSH 0 0.0000000
## 9 Increased.Cholesterol 0 0.0000000
## 10 Asthenia 0 0.0000000
## 11 Lethargy_Depression 0 0.0000000
## 12 PU_PD 0 0.0000000
## 13 Obesity 0 0.0000000
## 14 Alopecia 0 0.0000000
## 15 Dermatopathy 0 0.0000000
## 16 Neuro_alterations 0 0.0000000
## 17 T4_cl 0 0.0000000
## 18 TSH_cl 0 0.0000000
## 19 Appetite 1 0.3174603
## 20 Weight 3 0.9523810
## 21 Creatinine 8 2.5396825
## 22 Hct 8 2.5396825
## 23 Hct_cl 8 2.5396825
## 24 ALT 9 2.8571429
## 25 Hgb 9 2.8571429
## 26 RBC 9 2.8571429
## 27 MCV 9 2.8571429
## 28 Cholesterol_mg.dl 12 3.8095238
## 29 MCHC 12 3.8095238
## 30 Cholesterol_cl 12 3.8095238
## 31 AST 17 5.3968254
## 32 Triglycerides 171 54.2857143
## 33 USG 205 65.0793651
## 34 T4.post.TSH..nmol.ml. 209 66.3492063
## 35 T4.post.TSH..µg.dl. 209 66.3492063
## 36 Increase.from.baseline 209 66.3492063
## 37 BCS 230 73.0158730
## 38 UPC 250 79.3650794
I drop the variables with 5% or more missing values:
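A minimal sketch of this screening step (the dataset name `dogs` is an assumption):

```r
# Count and rank missing values per variable, then drop those with
# 5% or more missing.
miss <- data.frame(Variable     = names(dogs),
                   n.missing    = colSums(is.na(dogs)),
                   perc.missing = 100 * colMeans(is.na(dogs)),
                   row.names    = NULL)
miss[order(miss$n.missing), ]              # the table above

dogs_red <- dogs[, miss$perc.missing < 5]  # reduced dataset
summary(dogs_red)                          # the overview below
```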
## Code Hypothyroidism Breed Gender Age Weight T4_nmol.l T4_µg.dl TSH_ng.ml Increased.TSH
## Length:315 No :233 meticcio : 85 C: 37 Min. : 1.250 Min. : 1.00 Min. : 1.29 Min. :0.1006 Min. :0.0100 No :246
## Class :character Yes: 82 dobermann : 25 F: 58 1st Qu.: 6.250 1st Qu.: 14.93 1st Qu.: 6.68 1st Qu.:0.5210 1st Qu.:0.1000 Yes: 69
## Mode :character labrador : 18 M:126 Median : 8.917 Median : 26.30 Median :16.00 Median :1.2480 Median :0.1700
## golden retriever: 14 S: 94 Mean : 8.704 Mean : 28.19 Mean :16.94 Mean :1.3216 Mean :0.4017
## setter inglese : 8 3rd Qu.:11.292 3rd Qu.: 37.50 3rd Qu.:23.15 3rd Qu.:1.8057 3rd Qu.:0.3200
## pinscher : 7 Max. :17.250 Max. :323.00 Max. :51.86 Max. :4.0451 Max. :8.9000
## (Other) :158 NA's :3
## Cholesterol_mg.dl Increased.Cholesterol Creatinine ALT Hgb Hct RBC MCV MCHC Asthenia
## Min. : 76.0 No :166 Min. :0.410 Min. : 8.00 Min. : 1.30 Min. : 4.90 Min. : 483000 Min. :54.80 Min. :21.2 No :187
## 1st Qu.: 227.5 Yes:149 1st Qu.:0.790 1st Qu.: 40.00 1st Qu.:12.93 1st Qu.:38.60 1st Qu.:5792500 1st Qu.:65.72 1st Qu.:33.0 Yes:128
## Median : 316.0 Median :0.950 Median : 61.50 Median :14.95 Median :43.80 Median :6510000 Median :68.00 Median :33.8
## Mean : 374.7 Mean :1.005 Mean : 97.74 Mean :14.77 Mean :43.62 Mean :6438343 Mean :67.84 Mean :33.9
## 3rd Qu.: 446.0 3rd Qu.:1.145 3rd Qu.: 103.00 3rd Qu.:16.60 3rd Qu.:49.00 3rd Qu.:7065000 3rd Qu.:70.30 3rd Qu.:34.9
## Max. :2025.0 Max. :6.300 Max. :1285.00 Max. :21.10 Max. :60.70 Max. :9420000 Max. :82.00 Max. :41.3
## NA's :12 NA's :8 NA's :9 NA's :9 NA's :8 NA's :9 NA's :9 NA's :12
## Lethargy_Depression PU_PD Appetite Obesity Alopecia Dermatopathy Neuro_alterations Cholesterol_cl Hct_cl T4_cl TSH_cl
## No :209 No :262 Low : 31 No :240 No :196 No :243 No :227 Normal : 79 Anemia: 56 Unmeasurable: 76 Normal/Low:244
## Yes:106 Yes: 53 Normal:238 Yes: 75 Yes:119 Yes: 72 Yes: 88 High :153 Normal:251 Low : 33 High : 71
## High : 45 More than double: 71 NA's : 8 Normal :206
## NA's : 1 NA's : 12
Anomalous value for Weight (323 kg).
Are the values Cholesterol = 2025, ALT > 1200 and AST > 1000 plausible?
(see Saeys, 2007, "A review of feature selection techniques in bioinformatics")
Two issues to address:

- identify the variables that most influence the target variable Hypothyroidism;
- check whether some explanatory variables are strongly correlated with one another and can therefore be excluded.
## attributes importance
## 1 T4_nmol.l 0.439602976
## 2 T4_µg.dl 0.439602976
## 3 T4_cl 0.410476462
## 4 TSH_ng.ml 0.288541997
## 5 Increased.TSH 0.281307285
## 6 TSH_cl 0.268059807
## 7 Breed 0.208001945
## 8 Cholesterol_mg.dl 0.124270429
## 9 Cholesterol_cl 0.119704846
## 10 Hgb 0.092413165
## 11 RBC 0.090852342
## 12 Hct 0.074769725
## 13 Increased.Cholesterol 0.058339750
## 14 Lethargy_Depression 0.046991070
## 15 Hct_cl 0.043764557
## 16 Creatinine 0.042732723
## 17 Alopecia 0.035400534
## 18 Obesity 0.021026575
## 19 Asthenia 0.014613008
## 20 Dermatopathy 0.012015703
## 21 Appetite 0.006361986
## 22 Neuro_alterations 0.004726434
## 23 PU_PD 0.004631117
## 24 Gender 0.004576536
## 25 Age 0.000000000
## 26 Weight 0.000000000
## 27 ALT 0.000000000
## 28 MCV 0.000000000
## 29 MCHC 0.000000000
Information Gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute; entropy (a central quantity in Information Theory) characterizes the (im)purity of an arbitrary collection of examples:

infogain = H(Class) + H(Attribute) − H(Class, Attribute)

where H(X) is Shannon's entropy of a variable X and H(X, Y) is the joint Shannon entropy of X and Y.
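A minimal sketch of this computation, following the formula above (`dogs_red` as the reduced dataset is an assumption):

```r
# Shannon entropy of a discrete variable
entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# infogain = H(Class) + H(Attribute) - H(Class, Attribute);
# numeric attributes must first be discretized (FSelector does this
# internally)
infogain <- function(class, attribute) {
  entropy(class) + entropy(attribute) - entropy(paste(class, attribute))
}

# Off-the-shelf equivalent with the FSelector package:
# FSelector::information.gain(Hypothyroidism ~ ., data = dogs_red)
```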
## Age Weight T4_nmol.l T4_µg.dl TSH_ng.ml Cholesterol_mg.dl Creatinine ALT Hgb Hct RBC MCV MCHC
## Age 1.000 -0.086 0.004 0.004 -0.086 -0.063 -0.013 0.127 -0.089 -0.105 -0.093 0.029 0.007
## Weight -0.086 1.000 -0.154 -0.154 -0.012 0.113 0.179 -0.137 -0.166 -0.142 -0.119 -0.104 0.015
## T4_nmol.l 0.004 -0.154 1.000 1.000 -0.359 -0.398 -0.150 -0.011 0.339 0.303 0.348 0.036 0.089
## T4_µg.dl 0.004 -0.154 1.000 1.000 -0.359 -0.398 -0.150 -0.011 0.339 0.303 0.348 0.036 0.089
## TSH_ng.ml -0.086 -0.012 -0.359 -0.359 1.000 0.315 0.166 -0.006 -0.208 -0.201 -0.214 -0.007 -0.033
## Cholesterol_mg.dl -0.063 0.113 -0.398 -0.398 0.315 1.000 0.093 0.076 -0.210 -0.195 -0.235 0.065 -0.046
## Creatinine -0.013 0.179 -0.150 -0.150 0.166 0.093 1.000 -0.091 -0.205 -0.177 -0.192 0.051 0.006
## ALT 0.127 -0.137 -0.011 -0.011 -0.006 0.076 -0.091 1.000 -0.002 0.076 0.033 0.120 -0.071
## Hgb -0.089 -0.166 0.339 0.339 -0.208 -0.210 -0.205 -0.002 1.000 0.824 0.819 0.131 0.225
## Hct -0.105 -0.142 0.303 0.303 -0.201 -0.195 -0.177 0.076 0.824 1.000 0.859 0.200 -0.058
## RBC -0.093 -0.119 0.348 0.348 -0.214 -0.235 -0.192 0.033 0.819 0.859 1.000 -0.130 0.021
## MCV 0.029 -0.104 0.036 0.036 -0.007 0.065 0.051 0.120 0.131 0.200 -0.130 1.000 -0.172
## MCHC 0.007 0.015 0.089 0.089 -0.033 -0.046 0.006 -0.071 0.225 -0.058 0.021 -0.172 1.000
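The matrix above can be reproduced along these lines (Pearson correlations on pairwise-complete cases are an assumption; the variable list matches the matrix header):

```r
num_vars <- c("Age", "Weight", "T4_nmol.l", "T4_µg.dl", "TSH_ng.ml",
              "Cholesterol_mg.dl", "Creatinine", "ALT", "Hgb", "Hct",
              "RBC", "MCV", "MCHC")
# pairwise.complete.obs, since several variables still contain NAs
round(cor(dogs_red[, num_vars], use = "pairwise.complete.obs"), 3)
```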
A Mann-Whitney test is used to check whether, for each continuous variable, there is a significant difference between the two levels of the dependent variable Hypothyroidism:
## variable W p.value signif
## 1 Age 11205.0 1.988916e-02 *
## 2 Weight 7947.5 3.459423e-02 *
## 3 T4_nmol.l 18658.0 6.446435e-38 ***
## 4 T4_µg.dl 18658.0 6.446435e-38 ***
## 5 TSH_ng.ml 2280.0 1.101113e-24 ***
## 6 Cholesterol_mg.dl 3461.0 2.555047e-16 ***
## 7 Creatinine 6562.5 1.574912e-04 **
## 8 ALT 8209.0 2.219874e-01
## 9 Hgb 14052.0 4.692396e-13 ***
## 10 Hct 13987.0 1.768019e-12 ***
## 11 RBC 14194.5 9.916545e-14 ***
## 12 MCV 9356.5 7.213665e-01
## 13 MCHC 9440.5 5.057451e-01
(Boxplots of each continuous variable by Hypothyroidism status; 79 observations with missing values were removed from the plots.)
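A sketch of the tests above, reusing `num_vars` from the correlation sketch: one `wilcox.test` per continuous variable, comparing the two Hypothyroidism groups.

```r
mw <- do.call(rbind, lapply(num_vars, function(v) {
  tst <- wilcox.test(dogs_red[[v]] ~ dogs_red$Hypothyroidism)
  data.frame(variable = v,
             W        = unname(tst$statistic),
             p.value  = tst$p.value)
}))
mw[order(mw$p.value), ]
```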
Variables with low information gain and those highly correlated with retained variables are excluded from the analysis; specifically, the following are used:
## [1] "T4_nmol.l" "TSH_ng.ml" "Cholesterol_mg.dl"
## [4] "Hct" "Creatinine"
Hct was chosen over the correlated Hgb and RBC, on the veterinarians' advice.
I am unsure whether to keep Creatinine.
Qualitative variables
Breed issue: there seems to be a relationship with breed, but the dataset currently contains 83 distinct breed levels; could they be aggregated?
For some breeds the differences are trivial to fix (pit bull / pitbull; pastore maremmano-abruzzese / pastore maremmano abruzzese), but too many levels would still remain: does it make sense to define 7-8 groups?
Quantitative variables in classes
As an alternative to the variables Increased.TSH and Increased.Cholesterol, a standard classification can be used for the corresponding variables TSH_ng.ml and Cholesterol_mg.dl; classed variables can likewise be defined for T4_nmol.l and Hct, as in the sketch below.
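A sketch of how such classed variables can be built with `cut()`; the cut points below are placeholders, not the clinical reference values, and the class labels follow the levels shown in the summary above.

```r
# Hypothetical thresholds, for illustration only
dogs_red$TSH_cl <- cut(dogs_red$TSH_ng.ml,
                       breaks = c(-Inf, 0.5, Inf),
                       labels = c("Normal/Low", "High"))
dogs_red$Cholesterol_cl <- cut(dogs_red$Cholesterol_mg.dl,
                               breaks = c(-Inf, 270, 540, Inf),
                               labels = c("Normal", "High", "More than double"))
```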
A chi-squared test is used to check the independence between Hypothyroidism and each of the qualitative variables:
## variable X.squared df p.value signif
## 1 Gender 2.707214 3 4.390027e-01
## 2 Increased.TSH 182.690742 1 1.252984e-41 ***
## 3 Increased.Cholesterol 34.119684 1 5.182437e-09 ***
## 4 Asthenia 8.541705 1 3.471003e-03 **
## 5 Lethargy_Depression 29.261257 1 6.324773e-08 ***
## 6 PU_PD 2.175105 1 1.402600e-01
## 7 Appetite 3.113449 2 2.108255e-01
## 8 Obesity 13.035472 1 3.056461e-04 **
## 9 Alopecia 21.534621 1 3.474983e-06 ***
## 10 Dermatopathy 7.170476 1 7.411311e-03 **
## 11 Neuro_alterations 2.395061 1 1.217190e-01
## 12 Cholesterol_cl 76.783662 2 2.121484e-17 ***
## 13 Hct_cl 27.805310 1 1.341574e-07 ***
## 14 T4_cl 237.167775 2 3.159891e-52 ***
## 15 TSH_cl 174.744392 1 6.808088e-40 ***
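A sketch of the tests above: one `chisq.test` per qualitative variable against Hypothyroidism.

```r
qual_vars <- c("Gender", "Increased.TSH", "Increased.Cholesterol",
               "Asthenia", "Lethargy_Depression", "PU_PD", "Appetite",
               "Obesity", "Alopecia", "Dermatopathy", "Neuro_alterations",
               "Cholesterol_cl", "Hct_cl", "T4_cl", "TSH_cl")
chi <- do.call(rbind, lapply(qual_vars, function(v) {
  tst <- chisq.test(table(dogs_red$Hypothyroidism, dogs_red[[v]]))
  data.frame(variable  = v,
             X.squared = unname(tst$statistic),
             df        = unname(tst$parameter),
             p.value   = tst$p.value)
}))
chi
```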
Qualitative variables to be used:
## [1] "T4_cl" "TSH_cl" "Cholesterol_cl"
## [4] "Hct_cl" "Lethargy_Depression" "Alopecia"
## [7] "Obesity" "Asthenia" "Dermatopathy"
The aim is to define a model that predicts the presence of hypothyroidism. Four specifications are compared:

model 1: basic variables, all qualitative [qual_rid]
Hypothyroidism ~ Cholesterol_cl + Hct_cl + Lethargy_Depression + Alopecia + Obesity + Asthenia + Dermatopathy
model 2: basic variables, quantitative and qualitative [quan_rid]
Hypothyroidism ~ Cholesterol_mg.dl + Hct + Lethargy_Depression + Alopecia + Obesity + Asthenia + Dermatopathy
model 3: including the disease-specific test variables, all qualitative [qual_all]
Hypothyroidism ~ T4_cl + TSH_cl + Cholesterol_cl + Hct_cl + Lethargy_Depression + Alopecia + Obesity + Asthenia + Dermatopathy
model 4: including the disease-specific test variables, quantitative and qualitative [quan_all]
Hypothyroidism ~ T4_nmol.l + TSH_ng.ml + Cholesterol_mg.dl + Hct + Lethargy_Depression + Alopecia + Obesity + Asthenia + Dermatopathy
For each of the models identified, the task is to determine which classification algorithm provides the best predictive ability for the target variable Hypothyroidism. The algorithms used are: Naive Bayes (NB), logistic regression (Logistic), gradient boosting (GBM), support vector machines (SVM), random forests (rForest) and classification trees (rpart).
The algorithms' predictive ability was assessed with the training/test dataset method: the data were split into a random training sample containing 75% of the observations and a test dataset with the remaining 25%. The training dataset was used to build the classification rules; the test dataset was then used to assess the algorithm's actual predictive ability on data not involved in building those rules. Predictive ability was measured with the following indices: F1 score, Accuracy, Cohen's Kappa and AUC.
To counter problems linked to the random selection of the training and test samples, the sampling was replicated 500 times, yielding 500 measurements of each predictive-ability index.
The analyses were carried out excluding the observations with missing data.
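A sketch of this evaluation scheme for one algorithm (Naive Bayes shown; the object `dogs_cc` for the complete-case data and the use of the caret and e1071 packages are assumptions):

```r
library(caret)   # createDataPartition, confusionMatrix
library(e1071)   # naiveBayes

# Hypothetical complete-case dataset for model 4 [quan_all]
dogs_cc <- na.omit(dogs_red[, c("Hypothyroidism", "T4_nmol.l", "TSH_ng.ml",
                                "Cholesterol_mg.dl", "Hct",
                                "Lethargy_Depression", "Alopecia",
                                "Obesity", "Asthenia", "Dermatopathy")])

set.seed(1)
acc <- replicate(500, {
  # random 75/25 split, stratified on the target
  idx  <- createDataPartition(dogs_cc$Hypothyroidism, p = 0.75, list = FALSE)
  fit  <- naiveBayes(Hypothyroidism ~ ., data = dogs_cc[idx, ])
  pred <- predict(fit, newdata = dogs_cc[-idx, ])
  confusionMatrix(pred, dogs_cc$Hypothyroidism[-idx],
                  positive = "Yes")$overall["Accuracy"]
})
summary(acc)   # distribution of the 500 accuracy estimates
```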
An alternative model-evaluation method could be repeated k-fold cross-validation: the dataset is divided into k subsamples, and in turn k-1 of them are used for training and the remaining one for testing, with the whole procedure repeated n times.
In terms of accuracy, there is an overall improvement moving from model 1 to model 2 and then to the models including the T4 and TSH variables. Models 3 and 4, in which these two variables are expressed as classes or as quantitative values respectively, show no substantial differences.
| model | method | F1 | Accuracy | Kappa | AUC |
|---|---|---|---|---|---|
| qual_rid | NB | 0.6664 | 0.8192 | 0.5427 | 0.8460 |
| | Logistic | 0.6389 | 0.8195 | 0.5205 | 0.8362 |
| | GBM | 0.6349 | 0.8205 | 0.5187 | 0.8357 |
| | SVM | 0.6201 | 0.8074 | 0.4921 | 0.8296 |
| | rForest | 0.5969 | 0.8167 | 0.4852 | 0.8380 |
| | rpart | 0.5929 | 0.8074 | 0.4712 | 0.7770 |
| quan_rid | NB | 0.7242 | 0.8633 | 0.6341 | 0.8835 |
| | SVM | 0.7201 | 0.8631 | 0.6309 | 0.8752 |
| | Logistic | 0.7112 | 0.8579 | 0.6182 | 0.8691 |
| | GBM | 0.6776 | 0.8471 | 0.5804 | 0.8608 |
| | rForest | 0.6625 | 0.8426 | 0.5637 | 0.8732 |
| | rpart | 0.6121 | 0.7992 | 0.4786 | 0.7851 |
| qual_all | NB | 0.9127 | 0.9534 | 0.8807 | 0.9834 |
| | GBM | 0.9032 | 0.9477 | 0.8672 | 0.9803 |
| | rForest | 0.8921 | 0.9431 | 0.8532 | 0.9771 |
| | Logistic | 0.8869 | 0.9397 | 0.8456 | 0.9763 |
| | rpart | 0.8738 | 0.9323 | 0.8274 | 0.9474 |
| | SVM | 0.8718 | 0.9334 | 0.8266 | 0.9781 |
| quan_all | GBM | 0.9134 | 0.9527 | 0.8806 | 0.9868 |
| | rForest | 0.9104 | 0.9521 | 0.8774 | 0.9863 |
| | NB | 0.8949 | 0.9463 | 0.8587 | 0.9836 |
| | rpart | 0.8893 | 0.9415 | 0.8494 | 0.9436 |
| | SVM | 0.8868 | 0.9398 | 0.8456 | 0.9863 |
| | Logistic | 0.8843 | 0.9379 | 0.8415 | 0.9838 |
Across the models analysed, the algorithm giving the best predictive performance varies with the indicator used. In terms of the F1 score, considered the most suitable indicator for unbalanced samples (being the harmonic mean of recall and precision), Naive Bayes performs best for models 1, 2 and 3, while for model 4 random forest and GBM come out on top.
Variable importance
(Variable-importance plots for each model and method.)
Could the models be simplified further?
If the goal is an application that lets end users quantify the probability that a patient is hypothyroid, a simpler interface may be desirable.
This means reducing the number of pieces of information to be entered, without a loss of predictive ability.
As can be seen from the previous plots, within each model the various methods produce different rankings of variable importance.
To select the relevant variables for each method and model, a wrapper method was used. For each of the four models, the list of all possible combinations of explanatory variables was generated, from 2 up to V (V = number of variables in the model); for each combination and each method, Accuracy, Kappa and F1 were computed with repeated K-fold cross-validation. (K-fold cross-validation randomly divides the data into K blocks of roughly equal size. Each block is left out in turn and the other K-1 blocks are used to train the model. The held-out block is predicted and these predictions are summarized into some performance measure, e.g. accuracy, Kappa, F1 score. The K estimates of performance are averaged to get the overall resampled estimate. Repeated K-fold CV does the same, more than once.)
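A sketch of this wrapper search for one model/method pair, assuming the caret package (fold and repeat counts are assumptions; `method = "nb"` requires the klaR package):

```r
library(caret)

# All combinations of 2..V variables for model 1 [qual_rid]
vars <- c("Cholesterol_cl", "Hct_cl", "Lethargy_Depression", "Alopecia",
          "Obesity", "Asthenia", "Dermatopathy")
combos <- unlist(lapply(2:length(vars), combn, x = vars, simplify = FALSE),
                 recursive = FALSE)

# Repeated 10-fold cross-validation, 5 repeats
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

acc <- sapply(combos, function(v) {
  f   <- reformulate(v, response = "Hypothyroidism")
  fit <- train(f, data = dogs_cc, method = "nb", trControl = ctrl)
  max(fit$results$Accuracy)
})
combos[[which.max(acc)]]   # best-scoring combination for this method
```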
As the following table shows, reducing the number of variables generally improves the performance of the various methods:
| model | method | variables | F1 | Accuracy | Kappa |
|---|---|---|---|---|---|
| qual_rid | NB | Cholesterol_cl, Hct_cl, Lethargy_Depression, Alopecia | 0.6722879 | 0.8331388 | 0.5614605 |
| | GB | Cholesterol_cl, Hct_cl, Alopecia, Obesity | 0.6699684 | 0.8417368 | 0.5678996 |
| | LR | Cholesterol_cl, Hct_cl, Lethargy_Depression, Alopecia, Obesity, Dermatopathy | 0.6622315 | 0.8308051 | 0.5507083 |
| | RF | Cholesterol_cl, Hct_cl, Obesity, Dermatopathy | 0.6397103 | 0.8396161 | 0.5416011 |
| | SV | Cholesterol_cl, Hct_cl, Alopecia, Obesity, Asthenia | 0.6244554 | 0.8072303 | 0.4959919 |
| | CT | Cholesterol_cl, Obesity | 0.6182639 | 0.8071066 | 0.4899868 |
| quan_rid | NB | Cholesterol_mg.dl, Hct, Lethargy_Depression, Alopecia, Obesity, Dermatopathy | 0.7391153 | 0.8724583 | 0.6567123 |
| | SV | Cholesterol_mg.dl, Hct, Lethargy_Depression, Alopecia, Dermatopathy | 0.7353160 | 0.8733229 | 0.6536291 |
| | LR | Cholesterol_mg.dl, Hct, Lethargy_Depression, Alopecia | 0.7309085 | 0.8669349 | 0.6428263 |
| | RF | Cholesterol_mg.dl, Hct, Lethargy_Depression, Alopecia, Obesity, Asthenia | 0.6844513 | 0.8523544 | 0.5902075 |
| | GB | ALL | 0.6658634 | 0.8337316 | 0.5571375 |
| | CT | Cholesterol_mg.dl, Hct, Lethargy_Depression, Asthenia | 0.6576579 | 0.8348533 | 0.5531177 |
| qual_all | NB | T4_cl, TSH_cl, Cholesterol_cl, Hct_cl, Lethargy_Depression | 0.9359539 | 0.9653424 | 0.9118712 |
| | LR | T4_cl, TSH_cl | 0.9220547 | 0.9566991 | 0.8915479 |
| | GB | T4_cl, TSH_cl | 0.9198252 | 0.9562176 | 0.8893624 |
| | RF | T4_cl, TSH_cl | 0.9184423 | 0.9550045 | 0.8868662 |
| | CT | T4_cl, TSH_cl, Obesity | 0.8991319 | 0.9433005 | 0.8590729 |
| | SV | T4_cl, Lethargy_Depression, Alopecia, Obesity | 0.8805999 | 0.9361673 | 0.8367642 |
| quan_all | RF | T4_nmol.l, TSH_ng.ml, Hct, Alopecia | 0.9320108 | 0.9648773 | 0.9079539 |
| | LR | T4_nmol.l, Cholesterol_mg.dl, Lethargy_Depression, Dermatopathy | 0.9219543 | 0.9570454 | 0.8919865 |
| | GB | T4_nmol.l, TSH_ng.ml, Hct, Obesity, Asthenia, Dermatopathy | 0.9216226 | 0.9584909 | 0.8932769 |
| | SV | T4_nmol.l, TSH_ng.ml, Lethargy_Depression, Asthenia | 0.9203423 | 0.9564712 | 0.8901038 |
| | NB | T4_nmol.l, Dermatopathy | 0.9145703 | 0.9518603 | 0.8808681 |
| | CT | T4_nmol.l, TSH_ng.ml, Cholesterol_mg.dl, Hct, Lethargy_Depression, Obesity | 0.9100100 | 0.9499311 | 0.8748876 |
Note also that the same combinations of variables tend to be the ones giving the best results for all methods.
It should also be stressed that, within the same model and method, the variable combinations giving the best results do not differ significantly from one another.
Considering only the Naive Bayes method in the qual_rid model, and selecting the 10 variable combinations with the best results in terms of F1 score:
| variables | mean.F1s | sd.F1s | mean.Accuracy | sd.Accuracy |
|---|---|---|---|---|
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia | 0.6722879 | 0.0916673 | 0.8331388 | 0.0424534 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia,Obesity,Dermatopathy | 0.6717556 | 0.0830683 | 0.8267432 | 0.0429851 |
| ALL | 0.6699881 | 0.0896321 | 0.8225279 | 0.0508599 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia,Obesity,Asthenia | 0.6698422 | 0.0961234 | 0.8227509 | 0.0566542 |
| Cholesterol_cl,Hct_cl,Alopecia,Obesity,Dermatopathy | 0.6682124 | 0.0762412 | 0.8352865 | 0.0426273 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia,Obesity | 0.6611154 | 0.0932299 | 0.8249472 | 0.0403212 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Alopecia,Dermatopathy | 0.6573141 | 0.0809044 | 0.8206604 | 0.0473017 |
| Cholesterol_cl,Lethargy_Depression,Alopecia,Obesity,Dermatopathy | 0.6563939 | 0.0886870 | 0.8203639 | 0.0387466 |
| Cholesterol_cl,Lethargy_Depression,Alopecia,Obesity | 0.6542365 | 0.0839609 | 0.8212951 | 0.0427165 |
| Cholesterol_cl,Hct_cl,Lethargy_Depression,Obesity,Dermatopathy | 0.6527167 | 0.0999858 | 0.8246731 | 0.0429195 |
The means do not differ significantly from one another.
Analysis of variance:
## Df Sum Sq Mean Sq F value Pr(>F)
## tmp1NBr$variables 9 0.027 0.003035 0.386 0.942
## Residuals 490 3.848 0.007854
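The ANOVA above can be reproduced along these lines (`tmp1NBr` holds the 500 resampled scores, 50 per combination; the column name `F1s` is an assumption):

```r
# One-way ANOVA of the F1 scores across the 10 best variable combinations
summary(aov(tmp1NBr$F1s ~ tmp1NBr$variables))
```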
Based on the previous results, a different method and variable set could be defined for each model; however, this could be confusing when entering data into the app. Since there are groups of variables whose quality indices are not significantly different, one could instead identify the subset of variables common to the four models that yields the best result.
Confusion matrix:
| Predicted \ True | Yes | No | Total |
|---|---|---|---|
| Yes | True Positive | False Positive | P' |
| No | False Negative | True Negative | N' |
| Total | P | N | T |
True positive (TP): Prediction is Yes and X is hypothyroid
True negative (TN): Prediction is No and X is healthy
False positive (FP): Prediction is Yes and X is healthy
False negative (FN): Prediction is No and X is hypothyroid
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Precision = TP/(TP+FP)
Recall (aka Sensitivity) = TP/(TP+FN)
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
Specificity = TN/(TN+FP)
Balanced Accuracy = (Recall + Specificity) / 2
Cohen’s Kappa = (Observed accuracy - Expected accuracy) / (1 - Expected accuracy) where Expected accuracy = (P * P’)/T² + (N * N’)/T²
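A minimal sketch computing the indices above from the four confusion-matrix counts (the counts in the usage line are illustrative only):

```r
metrics <- function(TP, FP, FN, TN) {
  tot         <- TP + FP + FN + TN
  accuracy    <- (TP + TN) / tot
  precision   <- TP / (TP + FP)
  recall      <- TP / (TP + FN)              # sensitivity
  specificity <- TN / (TN + FP)
  f1          <- 2 * recall * precision / (recall + precision)
  # Expected accuracy = (P * P')/T^2 + (N * N')/T^2
  expected    <- ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / tot^2
  kappa       <- (accuracy - expected) / (1 - expected)
  c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1,
    Specificity = specificity, Balanced = (recall + specificity) / 2,
    Kappa = kappa)
}
metrics(TP = 70, FP = 10, FN = 12, TN = 223)   # illustrative counts
```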
The ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. It plots two parameters: the True Positive Rate, TPR = TP / (TP + FN), and the False Positive Rate, FPR = FP / (FP + TN), at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both false positives and true positives.
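A sketch of the AUC computation with the pROC package (an assumption), reusing `fit` and `idx` from the train/test sketch above:

```r
library(pROC)
# type = "raw" returns class probabilities from e1071::naiveBayes
prob    <- predict(fit, newdata = dogs_cc[-idx, ], type = "raw")[, "Yes"]
roc_obj <- roc(response = dogs_cc$Hypothyroidism[-idx], predictor = prob)
auc(roc_obj)
```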
The decision tree technique is used in classification to detect criteria for dividing the individuals of a population into n predetermined classes (in many cases, n = 2). We start by choosing the variable which, through its categories, best separates the individuals of each class, yielding sub-populations, called nodes, each containing the largest possible proportion of individuals of a single class; the same operation is then repeated on each new node obtained, until no further separation of the individuals is possible or desirable (according to criteria which depend on the type of tree). The construction is such that each of the terminal nodes (the leaves) consists mainly of individuals of a single class. An individual is assigned to a leaf, and therefore to a certain class, with a reasonably high probability, when it conforms to all the rules for reaching that leaf. The set of rules for all the leaves forms the classification model. [Tufféry, 2011]
A random forest (RF) is an ensemble classifier consisting of many DTs, similar to the way a forest is a collection of many trees. DTs that are grown very deep often overfit the training data, resulting in high variation in the classification outcome for small changes in the input data. They are very sensitive to their training data, which makes them error-prone on the test dataset. The different DTs of an RF are trained on different parts of the training dataset. To classify a new sample, its input vector is passed down each DT of the forest. Each DT then considers a different part of that input vector and gives a classification outcome. The forest then chooses the classification having the most 'votes' (for a discrete classification outcome) or the average over all trees in the forest (for a numeric classification outcome). Since the RF algorithm considers the outcomes of many different DTs, it can reduce the variance that results from considering a single DT for the same dataset. [Uddin, 2019]
In gradient boosting decision trees, many weak learners are combined into one strong learner; the weak learners here are the individual decision trees. All the trees are connected in series and each tree tries to minimise the error of the previous one. Because of this sequential connection, boosting algorithms are usually slow to learn, but also highly accurate; in statistical learning, models that learn slowly tend to perform better. The weak learners are fit so that each new learner fits the residuals of the previous step, improving the model. The final model aggregates the results of each step, yielding a strong learner. A loss function is used to detect the residuals: for instance, mean squared error (MSE) can be used for a regression task and logarithmic loss (log loss) for classification tasks. It is worth noting that existing trees in the model do not change when a new tree is added; the added decision tree fits the residuals of the current model.
Naïve Bayes (NB) is a classification technique based on Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B). The theorem describes the probability of an event based on prior knowledge of conditions related to that event. The classifier assumes that a particular feature in a class is not directly related to any other feature, although the features of that class could be interdependent. [for a full description see Tufféry, 2011]
The support vector machine (SVM) algorithm can classify both linear and non-linear data. It first maps each data item into an n-dimensional feature space, where n is the number of features. It then identifies the hyperplane that separates the data items into two classes while maximising the marginal distance for both classes and minimising the classification errors. The marginal distance for a class is the distance between the decision hyperplane and its nearest instance belonging to that class. More formally, each data point is first plotted as a point in an n-dimensional space (where n is the number of features), with the value of each feature being the value of a specific coordinate. To perform the classification, we then need to find the hyperplane that separates the two classes with the maximum margin. [Uddin, 2019; Tufféry, 2011]
Logistic regression (LR) can be considered an extension of ordinary regression; it models a dichotomous variable which usually represents the occurrence or non-occurrence of an event. LR finds the probability that a new instance belongs to a certain class; since it is a probability, the outcome lies between 0 and 1. Therefore, to use LR as a binary classifier, a threshold must be chosen to differentiate the two classes: for example, a probability higher than 0.50 for an input instance classifies it as 'class A', otherwise as 'class B'. The LR model can be generalized to model a categorical variable with more than two values; this generalized version is known as multinomial logistic regression. [Uddin, 2019; Tufféry, 2011]