Fouille de données avec R pour la data science et l’intelligence
artificielle
III.TD 3 : Partie I - MANOVA
Badr TAJINI – ESIEE Paris
Source : Bertrand Roudier – ESIEE Paris
Code complété par Thibault LE GUIDEVAIS, Elysee MUGABIRE et Ahmed DIAKITE
Ce TD a pour objectif de réaliser une analyse de variance multivariée (MANOVA) en développant une fonction dédiée. Nous utilisons dans un premier temps un jeu de données simple (exemple vu en cours) ou tous les résultats intermédiaires vous sont fournis.
Une fois votre code intégré dans une fonction, vous vérifiez vos résultats en comparaison avec les résultats fournis par la fonction manova de R sur un jeu de données de volumétrie plus importante.
Une partie du code que vous allez développer nous servira pour la suite lorsque nous aborderons l’analyse factorielle discriminante ; principalement le calcul des inerties inter classes, intra classes et totales
La somme des carrés des écarts total (SCT = Inertie Totale) est la résultante (comme en ANOVA) de la sommes des carrés intra classes (SCresiduelle = SC Intra ) et de la somme des carrés inter classes (SC Ecart = SC Inter) \[\sum\limits_{i = 1}^n {{d^2}({x_i},g)} = \sum\limits_{j = 1}^k {{n_j}{d^2}} ({g_j},g) + \sum\limits_{j = 1}^k {\sum\limits_{i = 1}^{{n_k}} {{d^2}({x_{i,j}},{g_j})} } \]
Si nous considérons le jeu de données comme étant la population, nous pouvons directement estimer les variances inter classes et intra classes : \[\frac{1}{n}\sum\limits_{i = 1}^n {{d^2}({x_i},g)} = \frac{1}{n}\sum\limits_{j = 1}^k {{n_j}{d^2}} ({g_j},g) + \frac{1}{n}\sum\limits_{j = 1}^k {\sum\limits_{i = 1}^{{n_k}} {{d^2}({x_{i,j}},{g_j})} } \] Pour réaliser le test de comparaison des groupes, nous calculons
Avant de calculer les inerties (SS), et pour rendre le code le plus générique possible, nous devons créer :
Nous utilisons le fichier : MANOVA_DATASET.csv. Ce jeu de données comprend
rmq: Il s’agit ici d’étudier l’existence d’une différence entre la composition chimique de différentes des poteries antiques retrouvées dans des fouilles archéologiques.
rm(list=ls())
library(kableExtra)
library(help = "datasets")
# Récupération de nos données
data <- read.csv("MANOVA_DATASET.csv", header = TRUE)
head(data)
## Site Al Fe Mg Ca Na
## 1 Llanedyrn 14.4 7.00 4.30 0.15 0.51
## 2 Llanedyrn 13.8 7.08 3.43 0.12 0.17
## 3 Llanedyrn 14.6 7.09 3.88 0.13 0.20
## 4 Llanedyrn 11.5 6.37 5.64 0.16 0.14
## 5 Llanedyrn 13.8 7.06 5.34 0.20 0.20
## 6 Llanedyrn 10.9 6.26 3.47 0.17 0.22
data$Class <- data$Site
data$Site <- NULL
head(data)
## Al Fe Mg Ca Na Class
## 1 14.4 7.00 4.30 0.15 0.51 Llanedyrn
## 2 13.8 7.08 3.43 0.12 0.17 Llanedyrn
## 3 14.6 7.09 3.88 0.13 0.20 Llanedyrn
## 4 11.5 6.37 5.64 0.16 0.14 Llanedyrn
## 5 13.8 7.06 5.34 0.20 0.20 Llanedyrn
## 6 10.9 6.26 3.47 0.17 0.22 Llanedyrn
Les données sont les suivantes :
data$Class <- as.factor(substr(data$Class,1,2))
head(data)
## Al Fe Mg Ca Na Class
## 1 14.4 7.00 4.30 0.15 0.51 Ll
## 2 13.8 7.08 3.43 0.12 0.17 Ll
## 3 14.6 7.09 3.88 0.13 0.20 Ll
## 4 11.5 6.37 5.64 0.16 0.14 Ll
## 5 13.8 7.06 5.34 0.20 0.20 Ll
## 6 10.9 6.26 3.47 0.17 0.22 Ll
save(data, file = 'MANOVA_DATASET.Rda')
summary(data)
## Al Fe Mg Ca
## Min. :10.10 Min. :0.920 Min. :0.530 Min. :0.0100
## 1st Qu.:11.95 1st Qu.:1.700 1st Qu.:0.670 1st Qu.:0.0600
## Median :13.80 Median :5.465 Median :3.825 Median :0.1550
## Mean :14.49 Mean :4.468 Mean :3.142 Mean :0.1465
## 3rd Qu.:17.45 3rd Qu.:6.590 3rd Qu.:4.503 3rd Qu.:0.2150
## Max. :20.80 Max. :7.090 Max. :7.230 Max. :0.3100
## Na Class
## Min. :0.0300 As: 5
## 1st Qu.:0.0500 Ca: 2
## Median :0.1500 Is: 5
## Mean :0.1585 Ll:14
## 3rd Qu.:0.2150
## Max. :0.5400
data %>% kbl(digits=3) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "center", latex_options = 'stripped') %>%
scroll_box( height = "250px")
| Al | Fe | Mg | Ca | Na | Class |
|---|---|---|---|---|---|
| 14.4 | 7.00 | 4.30 | 0.15 | 0.51 | Ll |
| 13.8 | 7.08 | 3.43 | 0.12 | 0.17 | Ll |
| 14.6 | 7.09 | 3.88 | 0.13 | 0.20 | Ll |
| 11.5 | 6.37 | 5.64 | 0.16 | 0.14 | Ll |
| 13.8 | 7.06 | 5.34 | 0.20 | 0.20 | Ll |
| 10.9 | 6.26 | 3.47 | 0.17 | 0.22 | Ll |
| 10.1 | 4.26 | 4.26 | 0.20 | 0.18 | Ll |
| 11.6 | 5.78 | 5.91 | 0.18 | 0.16 | Ll |
| 11.1 | 5.49 | 4.52 | 0.29 | 0.30 | Ll |
| 13.4 | 6.92 | 7.23 | 0.28 | 0.20 | Ll |
| 12.4 | 6.13 | 5.69 | 0.22 | 0.54 | Ll |
| 13.1 | 6.64 | 5.51 | 0.31 | 0.24 | Ll |
| 12.7 | 6.69 | 4.45 | 0.20 | 0.22 | Ll |
| 12.5 | 6.44 | 3.94 | 0.22 | 0.23 | Ll |
| 11.8 | 5.44 | 3.94 | 0.30 | 0.04 | Ca |
| 11.6 | 5.39 | 3.77 | 0.29 | 0.06 | Ca |
| 18.3 | 1.28 | 0.67 | 0.03 | 0.03 | Is |
| 15.8 | 2.39 | 0.63 | 0.01 | 0.04 | Is |
| 18.0 | 1.50 | 0.67 | 0.01 | 0.06 | Is |
| 18.0 | 1.88 | 0.68 | 0.01 | 0.04 | Is |
| 20.8 | 1.51 | 0.72 | 0.07 | 0.10 | Is |
| 17.7 | 1.12 | 0.56 | 0.06 | 0.06 | As |
| 18.3 | 1.14 | 0.67 | 0.06 | 0.05 | As |
| 16.7 | 0.92 | 0.53 | 0.01 | 0.05 | As |
| 14.8 | 2.74 | 0.67 | 0.03 | 0.05 | As |
| 19.1 | 1.64 | 0.60 | 0.10 | 0.03 | As |
Résultat attendu du nouveau dataset après la création de notre pipeline ELT (Extract-Load-Transform) :
N <- nrow(data)
paste("Nombre d'individus N = ",N)
## [1] "Nombre d'individus N = 26"
Le calcul du nombre de variables prédictives doit être réalisé de manière automatique. Pour y parvenir, on peut, par exemple, identifier les colonnes des variables numériques et calculer la longueur du vecteur des identifiants (utilisation des fonctions which et sapply)
P <- ncol(data)-1
paste("Nombre de variables prédictives P = ", P)
## [1] "Nombre de variables prédictives P = 5"
X <- subset(data, select = -c(Class))
head(X)
## Al Fe Mg Ca Na
## 1 14.4 7.00 4.30 0.15 0.51
## 2 13.8 7.08 3.43 0.12 0.17
## 3 14.6 7.09 3.88 0.13 0.20
## 4 11.5 6.37 5.64 0.16 0.14
## 5 13.8 7.06 5.34 0.20 0.20
## 6 10.9 6.26 3.47 0.17 0.22
Y <- data$Class
Y
## [1] Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ca Ca Is Is Is Is Is As As As As
## [26] As
## Levels: As Ca Is Ll
K <- length(levels(Y))
K
## [1] 4
Pour y parvenir, nous pouvons utiliser la fonction split. Les éléments sont les suivants :
XK <- split(X, Y)
XK
## $As
## Al Fe Mg Ca Na
## 22 17.7 1.12 0.56 0.06 0.06
## 23 18.3 1.14 0.67 0.06 0.05
## 24 16.7 0.92 0.53 0.01 0.05
## 25 14.8 2.74 0.67 0.03 0.05
## 26 19.1 1.64 0.60 0.10 0.03
##
## $Ca
## Al Fe Mg Ca Na
## 15 11.8 5.44 3.94 0.30 0.04
## 16 11.6 5.39 3.77 0.29 0.06
##
## $Is
## Al Fe Mg Ca Na
## 17 18.3 1.28 0.67 0.03 0.03
## 18 15.8 2.39 0.63 0.01 0.04
## 19 18.0 1.50 0.67 0.01 0.06
## 20 18.0 1.88 0.68 0.01 0.04
## 21 20.8 1.51 0.72 0.07 0.10
##
## $Ll
## Al Fe Mg Ca Na
## 1 14.4 7.00 4.30 0.15 0.51
## 2 13.8 7.08 3.43 0.12 0.17
## 3 14.6 7.09 3.88 0.13 0.20
## 4 11.5 6.37 5.64 0.16 0.14
## 5 13.8 7.06 5.34 0.20 0.20
## 6 10.9 6.26 3.47 0.17 0.22
## 7 10.1 4.26 4.26 0.20 0.18
## 8 11.6 5.78 5.91 0.18 0.16
## 9 11.1 5.49 4.52 0.29 0.30
## 10 13.4 6.92 7.23 0.28 0.20
## 11 12.4 6.13 5.69 0.22 0.54
## 12 13.1 6.64 5.51 0.31 0.24
## 13 12.7 6.69 4.45 0.20 0.22
## 14 12.5 6.44 3.94 0.22 0.23
NK <- table(Y)
NK
## Y
## As Ca Is Ll
## 5 2 5 14
GK <- lapply(XK, colMeans)
GK
## $As
## Al Fe Mg Ca Na
## 17.320 1.512 0.606 0.052 0.048
##
## $Ca
## Al Fe Mg Ca Na
## 11.700 5.415 3.855 0.295 0.050
##
## $Is
## Al Fe Mg Ca Na
## 18.180 1.712 0.674 0.026 0.054
##
## $Ll
## Al Fe Mg Ca Na
## 12.5642857 6.3721429 4.8264286 0.2021429 0.2507143
G <- lapply(X, mean)
G
## $Al
## [1] 14.49231
##
## $Fe
## [1] 4.467692
##
## $Mg
## [1] 3.141538
##
## $Ca
## [1] 0.1465385
##
## $Na
## [1] 0.1584615
La Somme des carrés totaux (Inertie Totale) correspond à la sommes des carrés des distances entre l’ensemble des observations et la moyenne générale :
\[{I_{total}} = \sum\limits_{i = 1}^n {{d^2}({x_i},g)} \]
Nous pouvons la calculer directement à l’aide du calcul matriciel suivant : \[{I_{Tot}} = SST = {(X - G)^t} \times (X - G)\]
Pour y parvenir nous devons :
Au finale, la matrice (SS_tot) est la suivante :
X_matrix <- X
colnames(X_matrix) <- NULL
rownames(X_matrix) <- NULL
X_matrix <- as.matrix(X_matrix)
X_matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 14.4 7.00 4.30 0.15 0.51
## [2,] 13.8 7.08 3.43 0.12 0.17
## [3,] 14.6 7.09 3.88 0.13 0.20
## [4,] 11.5 6.37 5.64 0.16 0.14
## [5,] 13.8 7.06 5.34 0.20 0.20
## [6,] 10.9 6.26 3.47 0.17 0.22
## [7,] 10.1 4.26 4.26 0.20 0.18
## [8,] 11.6 5.78 5.91 0.18 0.16
## [9,] 11.1 5.49 4.52 0.29 0.30
## [10,] 13.4 6.92 7.23 0.28 0.20
## [11,] 12.4 6.13 5.69 0.22 0.54
## [12,] 13.1 6.64 5.51 0.31 0.24
## [13,] 12.7 6.69 4.45 0.20 0.22
## [14,] 12.5 6.44 3.94 0.22 0.23
## [15,] 11.8 5.44 3.94 0.30 0.04
## [16,] 11.6 5.39 3.77 0.29 0.06
## [17,] 18.3 1.28 0.67 0.03 0.03
## [18,] 15.8 2.39 0.63 0.01 0.04
## [19,] 18.0 1.50 0.67 0.01 0.06
## [20,] 18.0 1.88 0.68 0.01 0.04
## [21,] 20.8 1.51 0.72 0.07 0.10
## [22,] 17.7 1.12 0.56 0.06 0.06
## [23,] 18.3 1.14 0.67 0.06 0.05
## [24,] 16.7 0.92 0.53 0.01 0.05
## [25,] 14.8 2.74 0.67 0.03 0.05
## [26,] 19.1 1.64 0.60 0.10 0.03
G_tot <- matrix(rep(as.numeric(G), N), nrow = N, ncol = P, byrow = TRUE)
G_tot
## [,1] [,2] [,3] [,4] [,5]
## [1,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [2,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [3,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [4,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [5,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [6,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [7,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [8,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [9,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [10,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [11,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [12,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [13,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [14,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [15,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [16,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [17,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [18,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [19,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [20,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [21,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [22,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [23,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [24,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [25,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
## [26,] 14.49231 4.467692 3.141538 0.1465385 0.1584615
Diff_matrix <- X_matrix - G_tot
Diff_matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.09230769 2.5323077 1.1584615 0.003461538 0.351538462
## [2,] -0.69230769 2.6123077 0.2884615 -0.026538462 0.011538462
## [3,] 0.10769231 2.6223077 0.7384615 -0.016538462 0.041538462
## [4,] -2.99230769 1.9023077 2.4984615 0.013461538 -0.018461538
## [5,] -0.69230769 2.5923077 2.1984615 0.053461538 0.041538462
## [6,] -3.59230769 1.7923077 0.3284615 0.023461538 0.061538462
## [7,] -4.39230769 -0.2076923 1.1184615 0.053461538 0.021538462
## [8,] -2.89230769 1.3123077 2.7684615 0.033461538 0.001538462
## [9,] -3.39230769 1.0223077 1.3784615 0.143461538 0.141538462
## [10,] -1.09230769 2.4523077 4.0884615 0.133461538 0.041538462
## [11,] -2.09230769 1.6623077 2.5484615 0.073461538 0.381538462
## [12,] -1.39230769 2.1723077 2.3684615 0.163461538 0.081538462
## [13,] -1.79230769 2.2223077 1.3084615 0.053461538 0.061538462
## [14,] -1.99230769 1.9723077 0.7984615 0.073461538 0.071538462
## [15,] -2.69230769 0.9723077 0.7984615 0.153461538 -0.118461538
## [16,] -2.89230769 0.9223077 0.6284615 0.143461538 -0.098461538
## [17,] 3.80769231 -3.1876923 -2.4715385 -0.116538462 -0.128461538
## [18,] 1.30769231 -2.0776923 -2.5115385 -0.136538462 -0.118461538
## [19,] 3.50769231 -2.9676923 -2.4715385 -0.136538462 -0.098461538
## [20,] 3.50769231 -2.5876923 -2.4615385 -0.136538462 -0.118461538
## [21,] 6.30769231 -2.9576923 -2.4215385 -0.076538462 -0.058461538
## [22,] 3.20769231 -3.3476923 -2.5815385 -0.086538462 -0.098461538
## [23,] 3.80769231 -3.3276923 -2.4715385 -0.086538462 -0.108461538
## [24,] 2.20769231 -3.5476923 -2.6115385 -0.136538462 -0.108461538
## [25,] 0.30769231 -1.7276923 -2.4715385 -0.116538462 -0.108461538
## [26,] 4.60769231 -2.8276923 -2.5415385 -0.046538462 -0.128461538
SS_tot <- t(Diff_matrix) %*% Diff_matrix
SS_tot
## [,1] [,2] [,3] [,4] [,5]
## [1,] 223.898462 -142.215462 -130.201692 -5.7826923 -4.7833077
## [2,] -142.215462 145.172462 118.272092 4.6665923 5.3927077
## [3,] -130.201692 118.272092 118.780138 4.6445385 4.7381615
## [4,] -5.782692 4.666592 4.644538 0.2561885 0.1648615
## [5,] -4.783308 5.392708 4.738162 0.1648615 0.4575385
Nous allons, dans un premier temps calculer, pour chaque groupe, le somme des carrés des écarts entre les individus de ce groupe et la moyenne de chaque groupe. Les Inerties intra partielles sont stockées dans une liste (SS_partiel_Intra).
Pour chaque classe, nous calculons la SS intra (partielle) : \[S{S_{{\text{k}}{\text{, intra}}}} = \sum\limits_{i = 1}^{{n_k}} {{d^2}({x_{i,j}},{g_j})} = {({X_{ik}} - {G_k})^t} \times ({X_{ik}} - {G_k})\]
Les résultats sont les suivants:
SS_tot_intra <- list()
for (key in names(XK)) {
XK_matrix <- XK[[key]]
colnames(XK_matrix) <- NULL
rownames(XK_matrix) <- NULL
XK_matrix <- as.matrix(XK_matrix)
GK_matrix <- matrix(rep(as.numeric(GK[[key]]), nrow(XK_matrix)), nrow = nrow(XK_matrix), byrow = TRUE)
Diff_matrix_inter <- XK_matrix - GK_matrix
colnames(Diff_matrix_inter) <- NULL
rownames(Diff_matrix_inter) <- NULL
print(Diff_matrix_inter)
SS_tot_intra[[key]] <- t(Diff_matrix_inter) %*% Diff_matrix_inter
}
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.38 -0.392 -0.046 0.008 0.012
## [2,] 0.98 -0.372 0.064 0.008 0.002
## [3,] -0.62 -0.592 -0.076 -0.042 0.002
## [4,] -2.52 1.228 0.064 -0.022 0.002
## [5,] 1.78 0.128 -0.006 0.048 -0.018
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1 0.025 0.085 0.005 -0.01
## [2,] -0.1 -0.025 -0.085 -0.005 0.01
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.12 -0.432 -0.004 0.004 -0.024
## [2,] -2.38 0.678 -0.044 -0.016 -0.014
## [3,] -0.18 -0.212 -0.004 -0.016 0.006
## [4,] -0.18 0.168 0.006 -0.016 -0.014
## [5,] 2.62 -0.202 0.046 0.044 0.046
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.83571429 0.627857143 -0.5264286 -0.052142857 0.25928571
## [2,] 1.23571429 0.707857143 -1.3964286 -0.082142857 -0.08071429
## [3,] 2.03571429 0.717857143 -0.9464286 -0.072142857 -0.05071429
## [4,] -1.06428571 -0.002142857 0.8135714 -0.042142857 -0.11071429
## [5,] 1.23571429 0.687857143 0.5135714 -0.002142857 -0.05071429
## [6,] -1.66428571 -0.112142857 -1.3564286 -0.032142857 -0.03071429
## [7,] -2.46428571 -2.112142857 -0.5664286 -0.002142857 -0.07071429
## [8,] -0.96428571 -0.592142857 1.0835714 -0.022142857 -0.09071429
## [9,] -1.46428571 -0.882142857 -0.3064286 0.087857143 0.04928571
## [10,] 0.83571429 0.547857143 2.4035714 0.077857143 -0.05071429
## [11,] -0.16428571 -0.242142857 0.8635714 0.017857143 0.28928571
## [12,] 0.53571429 0.267857143 0.6835714 0.107857143 -0.01071429
## [13,] 0.13571429 0.317857143 -0.3764286 -0.002142857 -0.03071429
## [14,] -0.06428571 0.067857143 -0.8864286 0.017857143 -0.02071429
SS_tot_intra
## $As
## [,1] [,2] [,3] [,4] [,5]
## [1,] 11.0080 -3.01320 -0.07960 0.17780 -0.03180
## [2,] -3.0132 2.16688 0.11704 -0.00212 -0.00648
## [3,] -0.0796 0.11704 0.01612 0.00164 -0.00034
## [4,] 0.1778 -0.00212 0.00164 0.00468 -0.00088
## [5,] -0.0318 -0.00648 -0.00034 -0.00088 0.00048
##
## $Ca
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.020 0.00500 0.01700 0.00100 -0.0020
## [2,] 0.005 0.00125 0.00425 0.00025 -0.0005
## [3,] 0.017 0.00425 0.01445 0.00085 -0.0017
## [4,] 0.001 0.00025 0.00085 0.00005 -0.0001
## [5,] -0.002 -0.00050 -0.00170 -0.00010 0.0002
##
## $Is
## [,1] [,2] [,3] [,4] [,5]
## [1,] 12.6080 -2.18680 0.22440 0.15960 0.15240
## [2,] -2.1868 0.76028 -0.03554 -0.02076 -0.01204
## [3,] 0.2244 -0.03554 0.00412 0.00268 0.00272
## [4,] 0.1596 -0.02076 0.00268 0.00272 0.00228
## [5,] 0.1524 -0.01204 0.00272 0.00228 0.00312
##
## $Ll
## [,1] [,2] [,3] [,4] [,5]
## [1,] 24.6521429 12.27507143 0.44621429 -0.231928571 0.470357143
## [2,] 12.2750714 8.02243571 0.44130714 -0.132564286 0.085778571
## [3,] 0.4462143 0.44130714 15.39492143 0.430207143 0.026935714
## [4,] -0.2319286 -0.13256429 0.43020714 0.044035714 0.008778571
## [5,] 0.4703571 0.08577857 0.02693571 0.008778571 0.195492857
Les K matrices sont ensuite additionnées pour obtenir l’inertie Intra (SS_Intra),
\[S{S_{{\text{intra }}}} = \sum\limits_{j = 1}^k {\sum\limits_{i = 1}^{{n_k}} {{d^2}({x_{i,j}},{g_j})} } = \sum\limits_{j = 1}^k {{{({X_{ik}} - {G_k})}^t} \times ({X_{ik}} - {G_k})} \]
SS_intra <- Reduce(`+`, SS_tot_intra)
SS_intra
## [,1] [,2] [,3] [,4] [,5]
## [1,] 48.2881429 7.08007143 0.60801429 0.10647143 0.58895714
## [2,] 7.0800714 10.95084571 0.52705714 -0.15519429 0.06675857
## [3,] 0.6080143 0.52705714 15.42961143 0.43537714 0.02761571
## [4,] 0.1064714 -0.15519429 0.43537714 0.05148571 0.01007857
## [5,] 0.5889571 0.06675857 0.02761571 0.01007857 0.19929286
L’inertie inter classe s’obtient directement par différence.
\[S{S_{{\text{inter}}}} = S{S_{tot}} - S{S_{{\text{intra}}}}\]
SS_inter <- SS_tot - SS_intra
SS_inter
## [,1] [,2] [,3] [,4] [,5]
## [1,] 175.610319 -149.295533 -130.809707 -5.8891637 -5.3722648
## [2,] -149.295533 134.221616 117.745035 4.8217866 5.3259491
## [3,] -130.809707 117.745035 103.350527 4.2091613 4.7105458
## [4,] -5.889164 4.821787 4.209161 0.2047027 0.1547830
## [5,] -5.372265 5.325949 4.710546 0.1547830 0.2582456
\[\Lambda = \frac{{\left| {{I_W}} \right|}}{{\left| {{I_B} + {I_W}} \right|}} = \frac{{\left| {S{S_{{\text{intra}}}}} \right|}}{{\left| {S{S_{{\text{inter}}}} + S{S_{{\text{intra}}}}} \right|}}\]
lambda <- det(SS_intra) / (det(SS_inter + SS_intra))
lambda
## [1] 0.01230091
\[ - \left( {n - 1 - \frac{{P + K}}{2}} \right)\ln (\Lambda )\]
correction <- -(N-1-(P+K)/2)*log(lambda)
correction
## [1] 90.16069
degre <- K-1
alpha <- 0.05
valeur_critique <- qchisq(1 - alpha, degre)
valeur_critique
## [1] 7.814728
La valeur critique est très inférieure à la valeur corrigée. En conclusion, on rejette l’hypothèse nulle d’égalité des moyennes. On peut donc affirmer que les catégories différent très significativement. Au TD III partie 2, nous allons réaliser le même type de test sur des plans factoriels (Analyse Factorielle Discriminante) ce qui permettra d’obtenir des représentations graphiques de positionnement des différents groupes et des variables associées qui sont essentielles pour de la fouille de données qu’elles soient réalisées en R ou en Python !
Nous comparons maintenant les résultats avec la fonction manova de R
X
## Al Fe Mg Ca Na
## 1 14.4 7.00 4.30 0.15 0.51
## 2 13.8 7.08 3.43 0.12 0.17
## 3 14.6 7.09 3.88 0.13 0.20
## 4 11.5 6.37 5.64 0.16 0.14
## 5 13.8 7.06 5.34 0.20 0.20
## 6 10.9 6.26 3.47 0.17 0.22
## 7 10.1 4.26 4.26 0.20 0.18
## 8 11.6 5.78 5.91 0.18 0.16
## 9 11.1 5.49 4.52 0.29 0.30
## 10 13.4 6.92 7.23 0.28 0.20
## 11 12.4 6.13 5.69 0.22 0.54
## 12 13.1 6.64 5.51 0.31 0.24
## 13 12.7 6.69 4.45 0.20 0.22
## 14 12.5 6.44 3.94 0.22 0.23
## 15 11.8 5.44 3.94 0.30 0.04
## 16 11.6 5.39 3.77 0.29 0.06
## 17 18.3 1.28 0.67 0.03 0.03
## 18 15.8 2.39 0.63 0.01 0.04
## 19 18.0 1.50 0.67 0.01 0.06
## 20 18.0 1.88 0.68 0.01 0.04
## 21 20.8 1.51 0.72 0.07 0.10
## 22 17.7 1.12 0.56 0.06 0.06
## 23 18.3 1.14 0.67 0.06 0.05
## 24 16.7 0.92 0.53 0.01 0.05
## 25 14.8 2.74 0.67 0.03 0.05
## 26 19.1 1.64 0.60 0.10 0.03
Y
## [1] Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ca Ca Is Is Is Is Is As As As As
## [26] As
## Levels: As Ca Is Ll
manova_res <- manova(as.matrix(X) ~ Y)
summary(manova_res)
## Df Pillai approx F num Df den Df Pr(>F)
## Y 3 1.5539 4.2984 15 60 2.413e-05 ***
## Residuals 22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explication : Nous avons une valeur de test de 1.5539 ce qui nettement supérieur à 0,0123. Nous pensons que c’est parce que le test réalisé est le test de Pillai et non de Wilks. Ainsi regardons comment effectuer le bon test :
X
## Al Fe Mg Ca Na
## 1 14.4 7.00 4.30 0.15 0.51
## 2 13.8 7.08 3.43 0.12 0.17
## 3 14.6 7.09 3.88 0.13 0.20
## 4 11.5 6.37 5.64 0.16 0.14
## 5 13.8 7.06 5.34 0.20 0.20
## 6 10.9 6.26 3.47 0.17 0.22
## 7 10.1 4.26 4.26 0.20 0.18
## 8 11.6 5.78 5.91 0.18 0.16
## 9 11.1 5.49 4.52 0.29 0.30
## 10 13.4 6.92 7.23 0.28 0.20
## 11 12.4 6.13 5.69 0.22 0.54
## 12 13.1 6.64 5.51 0.31 0.24
## 13 12.7 6.69 4.45 0.20 0.22
## 14 12.5 6.44 3.94 0.22 0.23
## 15 11.8 5.44 3.94 0.30 0.04
## 16 11.6 5.39 3.77 0.29 0.06
## 17 18.3 1.28 0.67 0.03 0.03
## 18 15.8 2.39 0.63 0.01 0.04
## 19 18.0 1.50 0.67 0.01 0.06
## 20 18.0 1.88 0.68 0.01 0.04
## 21 20.8 1.51 0.72 0.07 0.10
## 22 17.7 1.12 0.56 0.06 0.06
## 23 18.3 1.14 0.67 0.06 0.05
## 24 16.7 0.92 0.53 0.01 0.05
## 25 14.8 2.74 0.67 0.03 0.05
## 26 19.1 1.64 0.60 0.10 0.03
Y
## [1] Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ll Ca Ca Is Is Is Is Is As As As As
## [26] As
## Levels: As Ca Is Ll
manova_res <- manova(as.matrix(X) ~ Y)
summary(manova_res, test='Wilks')
## Df Wilks approx F num Df den Df Pr(>F)
## Y 3 0.012301 13.088 15 50.091 1.84e-12 ***
## Residuals 22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A partir du code que vous avez développé, construire une fonction générique (MANOVA) qui retourne sous forme de listes :
Nous testons cette fonction avec le fichier iris fourni par defaut dans R Cette fonction nous servira au prochain TD lorsque nous réaliserons une analyse factorielle discriminante
my_manova <- function(X, Y) {
final <- list()
final$N <- nrow(X)
final$P <- ncol(X) - 1
K <- length(levels(Y))
XK <- split(X, Y)
final$NK <- table(Y)
final$GK <- lapply(XK, colMeans)
final$G <- lapply(X, mean)
# Transformer le dataframe en matrice (NxP)
X_matrix <- X
X_matrix <- as.matrix(X_matrix)
G_tot <- matrix(rep(as.numeric(final$G), final$N),
nrow = nrow(X_matrix),
byrow = TRUE)
Diff_matrix <- X_matrix - G_tot
final$SS_tot <- t(Diff_matrix) %*% Diff_matrix
SS_tot_intra <- list()
for (key in names(XK)) {
XK_matrix <- as.matrix(XK[[key]])
GK_matrix <- matrix(rep(as.numeric(final$GK[[key]]),
nrow(XK_matrix)),
nrow = nrow(XK_matrix),
byrow = TRUE)
# Calcul de la différence et stockage dans la liste
Diff_matrix_inter <- XK_matrix - GK_matrix
SS_tot_intra[[key]] <- t(Diff_matrix_inter) %*% Diff_matrix_inter
}
final$SS_intra <- Reduce(`+`, SS_tot_intra)
final$SS_inter <- final$SS_tot - final$SS_intra
final$lambda <- det(final$SS_intra) / (det(final$SS_inter + final$SS_intra))
correction <- -(final$N - 1 - (final$P + K)/2) * log(final$lambda)
degre <- (K-1)
final$proba <- qchisq(1 - alpha, degre)
return(final)
}
Le jeux de données est le suivant
test <- my_manova(X,Y)
test
## $N
## [1] 26
##
## $P
## [1] 4
##
## $NK
## Y
## As Ca Is Ll
## 5 2 5 14
##
## $GK
## $GK$As
## Al Fe Mg Ca Na
## 17.320 1.512 0.606 0.052 0.048
##
## $GK$Ca
## Al Fe Mg Ca Na
## 11.700 5.415 3.855 0.295 0.050
##
## $GK$Is
## Al Fe Mg Ca Na
## 18.180 1.712 0.674 0.026 0.054
##
## $GK$Ll
## Al Fe Mg Ca Na
## 12.5642857 6.3721429 4.8264286 0.2021429 0.2507143
##
##
## $G
## $G$Al
## [1] 14.49231
##
## $G$Fe
## [1] 4.467692
##
## $G$Mg
## [1] 3.141538
##
## $G$Ca
## [1] 0.1465385
##
## $G$Na
## [1] 0.1584615
##
##
## $SS_tot
## Al Fe Mg Ca Na
## Al 223.898462 -142.215462 -130.201692 -5.7826923 -4.7833077
## Fe -142.215462 145.172462 118.272092 4.6665923 5.3927077
## Mg -130.201692 118.272092 118.780138 4.6445385 4.7381615
## Ca -5.782692 4.666592 4.644538 0.2561885 0.1648615
## Na -4.783308 5.392708 4.738162 0.1648615 0.4575385
##
## $SS_intra
## Al Fe Mg Ca Na
## Al 48.2881429 7.08007143 0.60801429 0.10647143 0.58895714
## Fe 7.0800714 10.95084571 0.52705714 -0.15519429 0.06675857
## Mg 0.6080143 0.52705714 15.42961143 0.43537714 0.02761571
## Ca 0.1064714 -0.15519429 0.43537714 0.05148571 0.01007857
## Na 0.5889571 0.06675857 0.02761571 0.01007857 0.19929286
##
## $SS_inter
## Al Fe Mg Ca Na
## Al 175.610319 -149.295533 -130.809707 -5.8891637 -5.3722648
## Fe -149.295533 134.221616 117.745035 4.8217866 5.3259491
## Mg -130.809707 117.745035 103.350527 4.2091613 4.7105458
## Ca -5.889164 4.821787 4.209161 0.2047027 0.1547830
## Na -5.372265 5.325949 4.710546 0.1547830 0.2582456
##
## $lambda
## [1] 0.01230091
##
## $proba
## [1] 7.814728
Explication : Nous retrouvons bien les mêmes valeurs !
iris_data <- iris
head(iris_data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Y_iris <- iris_data$Species
X_iris <- subset(iris_data, select = -c(Species))
test_iris <- my_manova(X_iris,Y_iris)
test_iris
## $N
## [1] 150
##
## $P
## [1] 3
##
## $NK
## Y
## setosa versicolor virginica
## 50 50 50
##
## $GK
## $GK$setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.006 3.428 1.462 0.246
##
## $GK$versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.936 2.770 4.260 1.326
##
## $GK$virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 6.588 2.974 5.552 2.026
##
##
## $G
## $G$Sepal.Length
## [1] 5.843333
##
## $G$Sepal.Width
## [1] 3.057333
##
## $G$Petal.Length
## [1] 3.758
##
## $G$Petal.Width
## [1] 1.199333
##
##
## $SS_tot
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 102.168333 -6.322667 189.8730 76.92433
## Sepal.Width -6.322667 28.306933 -49.1188 -18.12427
## Petal.Length 189.873000 -49.118800 464.3254 193.04580
## Petal.Width 76.924333 -18.124267 193.0458 86.56993
##
## $SS_intra
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 38.9562 13.6300 24.6246 5.6450
## Sepal.Width 13.6300 16.9620 8.1208 4.8084
## Petal.Length 24.6246 8.1208 27.2226 6.2718
## Petal.Width 5.6450 4.8084 6.2718 6.1566
##
## $SS_inter
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 63.21213 -19.95267 165.2484 71.27933
## Sepal.Width -19.95267 11.34493 -57.2396 -22.93267
## Petal.Length 165.24840 -57.23960 437.1028 186.77400
## Petal.Width 71.27933 -22.93267 186.7740 80.41333
##
## $lambda
## [1] 0.02343863
##
## $proba
## [1] 5.991465
l’utilisation de la fonction manova de R conduit à la même valeur du lambda… Une fois de plus, vous avez bien travaillé !
manova_iris <- manova(as.matrix(X_iris) ~ Y_iris)
summary(manova_iris, test='Wilks')
## Df Wilks approx F num Df den Df Pr(>F)
## Y_iris 2 0.023439 199.15 8 288 < 2.2e-16 ***
## Residuals 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explication : On trouve 0,023439, validant notre fonction