Your IT consulting firm offers you a new assignment at the ministère de l’Intérieur (Ministry of the Interior), as part of the fight against organised crime, with the Office central pour la répression du faux monnayage. Your mission, should you choose to accept it: build an algorithm that detects counterfeit banknotes.
To introduce your analysis, give a brief description of the data (univariate and bivariate analyses).
You will carry out a principal component analysis (PCA) of the sample, following all of these steps:
Comment on the results obtained at each of these steps. The variable giving the True/False nature of the banknote will be used as an illustrative variable.
Apply a clustering algorithm, then analyse the result. Visualise the resulting partition in the first factorial plane of the PCA, then analyse it.
Model the data with a logistic regression. From it, you will build a program able to make a prediction for a banknote, that is, to decide whether it is genuine or counterfeit. For each banknote, your classification algorithm must return the probability that the banknote is genuine. If this probability is greater than or equal to 0.5, the banknote is considered genuine; otherwise it is considered counterfeit.
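The decision rule stated in the brief (probability of being genuine, then a 0.5 cut-off) can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated rather than read from the banknote file, a single predictor named `margin_low` is used for brevity, and `predict_genuine()` is a hypothetical helper, not the program delivered for the mission.

```r
# Simulated, overlapping data (illustrative only, not the banknote file)
set.seed(42)
n <- 200
toy <- data.frame(
  margin_low = c(rnorm(n / 2, mean = 4.1, sd = 0.5),
                 rnorm(n / 2, mean = 5.0, sd = 0.5)),
  is_genuine = factor(rep(c("True", "False"), each = n / 2),
                      levels = c("False", "True"))
)

# Logistic regression: glm() models the probability of the second
# factor level, here "True"
fit <- glm(is_genuine ~ margin_low, data = toy, family = binomial)

# Hypothetical helper: probability of being genuine, then the 0.5 rule
predict_genuine <- function(model, newdata, threshold = 0.5) {
  p <- predict(model, newdata = newdata, type = "response")
  data.frame(newdata,
             proba_true = p,
             prediction = ifelse(p >= threshold, "True", "False"))
}

res <- predict_genuine(fit, data.frame(margin_low = c(3.8, 5.6)))
print(res)
```

Declaring the outcome as a two-level factor with "True" as the second level is what makes the fitted probability read as "probability that the banknote is genuine".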
Dataset containing the geometric characteristics of banknotes. For each banknote, we know:
the length of the banknote (in mm);
the height of the banknote (measured on the left side, in mm);
the height of the banknote (measured on the right side, in mm);
the margin between the upper edge of the banknote and its image (in mm);
the margin between the lower edge of the banknote and its image (in mm);
the diagonal of the banknote (in mm).
-> Packages only need to be installed once; the install calls are commented out to avoid reinstalling them every time the script is loaded
#install.packages("FactoMineR", dependencies=T)
#install.packages("factoextra",dependencies=T)
#install.packages(c("Factoshiny"))
#install.packages("ggpubr")
-> Libraries are loaded each time the script starts
-> corrplot must be loaded after ggplot2 to avoid a warning
library(ggplot2)
library(FactoMineR)
library(factoextra)
library(ggpubr)
library(dplyr)
library(corrplot)
options(ggrepel.max.overlaps = Inf)
theme_update(
plot.title = element_text(color="#3876C2", size=18, face="bold",hjust = 0.5),
axis.title.x = element_text(color="black", size=14, face="bold",hjust = 0.5),
axis.title.y = element_text(color="black", size=14, face="bold",vjust = 0.5)
)
-> define directory variables to make switching operating systems easier
repertoire_sources <- "/home/user/Documents/formations/oc/Formation Analyst/P6/sources/"
repertoire_images <- "/home/user/Documents/formations/oc/Formation Analyst/P6/images/"
fic_notes <- paste (repertoire_sources,"notes.csv",sep = "")
-> the import settings (separator, decimal mark, header, etc.) were chosen after inspecting the source files in a text editor or a spreadsheet
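The same inspection can also be done from R before importing. The sketch below is an assumption-laden illustration: it writes a tiny temporary file with made-up values (not the real notes.csv), prints its first raw lines, then imports it with the settings read off those lines.

```r
# Write a tiny illustrative file (made-up values) to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("diagonal,length",
             "171.81,112.83",
             "171.94,111.21"), tmp)

# Peek at the raw text to identify the header, separator and decimal mark
first_lines <- readLines(tmp, n = 2)
print(first_lines)

# Import with the settings observed above: comma separator, dot decimal
toy <- read.csv(tmp, header = TRUE, sep = ",", dec = ".")
str(toy)
```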
df_notes <- read.csv2(fic_notes,header = TRUE, sep = ",", dec = ".")
head(df_notes)
summary(df_notes)
 is_genuine         diagonal        height_left     height_right    margin_low      margin_up       length
 Length:170         Min.   :171.0   Min.   :103.2   Min.   :103.1   Min.   :3.540   Min.   :2.270   Min.   :110.0
 Class :character   1st Qu.:171.7   1st Qu.:103.8   1st Qu.:103.7   1st Qu.:4.050   1st Qu.:3.013   1st Qu.:111.9
 Mode  :character   Median :171.9   Median :104.1   Median :104.0   Median :4.450   Median :3.170   Median :112.8
                    Mean   :171.9   Mean   :104.1   Mean   :103.9   Mean   :4.612   Mean   :3.170   Mean   :112.6
                    3rd Qu.:172.1   3rd Qu.:104.3   3rd Qu.:104.2   3rd Qu.:5.128   3rd Qu.:3.330   3rd Qu.:113.3
                    Max.   :173.0   Max.   :104.9   Max.   :105.0   Max.   :6.280   Max.   :3.680   Max.   :114.0
vrais_billets <- filter(df_notes, is_genuine == 'True')
summary(vrais_billets)
 is_genuine         diagonal        height_left     height_right    margin_low      margin_up       length
 Length:100         Min.   :171.0   Min.   :103.2   Min.   :103.1   Min.   :3.540   Min.   :2.270   Min.   :111.8
 Class :character   1st Qu.:171.8   1st Qu.:103.7   1st Qu.:103.6   1st Qu.:3.900   1st Qu.:2.938   1st Qu.:113.0
 Mode  :character   Median :172.0   Median :103.9   Median :103.8   Median :4.080   Median :3.070   Median :113.2
                    Mean   :172.0   Mean   :104.0   Mean   :103.8   Mean   :4.144   Mean   :3.055   Mean   :113.2
                    3rd Qu.:172.2   3rd Qu.:104.1   3rd Qu.:104.0   3rd Qu.:4.383   3rd Qu.:3.192   3rd Qu.:113.5
                    Max.   :172.8   Max.   :104.9   Max.   :105.0   Max.   :5.040   Max.   :3.530   Max.   :114.0
faux_billets <- filter(df_notes, is_genuine == 'False')
summary(faux_billets)
 is_genuine         diagonal        height_left     height_right    margin_low      margin_up       length
 Length:70          Min.   :171.4   Min.   :103.8   Min.   :103.4   Min.   :3.820   Min.   :2.980   Min.   :110.0
 Class :character   1st Qu.:171.7   1st Qu.:104.1   1st Qu.:104.0   1st Qu.:4.952   1st Qu.:3.185   1st Qu.:111.3
 Mode  :character   Median :171.9   Median :104.2   Median :104.2   Median :5.265   Median :3.335   Median :111.8
                    Mean   :171.9   Mean   :104.2   Mean   :104.1   Mean   :5.282   Mean   :3.335   Mean   :111.7
                    3rd Qu.:172.0   3rd Qu.:104.4   3rd Qu.:104.3   3rd Qu.:5.702   3rd Qu.:3.450   3rd Qu.:112.0
                    Max.   :173.0   Max.   :104.7   Max.   :104.9   Max.   :6.280   Max.   :3.680   Max.   :113.6
sapply(df_notes, function(x) sum(is.na(x)))
is_genuine diagonal height_left height_right margin_low margin_up length
0 0 0 0 0 0 0
We determined earlier that the discriminating variables are margin_low and length.
This is confirmed by the p-value of Student's t-test for these two variables. A test on another variable gives a p-value above 5%.
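The t-test itself is not shown in the script. A minimal sketch of such a comparison, assuming simulated data (group sizes follow the summaries above, but every value is made up) and using Welch's variant, which is the default of `t.test()`, could look like this:

```r
# Simulated stand-in for the banknote data (values are NOT the real ones)
set.seed(1)
toy <- data.frame(
  is_genuine = rep(c("True", "False"), times = c(100, 70)),
  margin_low = c(rnorm(100, mean = 4.1, sd = 0.3),
                 rnorm(70, mean = 5.3, sd = 0.55)),
  diagonal   = rnorm(170, mean = 171.9, sd = 0.3)
)

# Two-sample t-tests comparing genuine and counterfeit banknotes
tt_margin <- t.test(margin_low ~ is_genuine, data = toy)
tt_diag   <- t.test(diagonal ~ is_genuine, data = toy)

print(c(margin_low = tt_margin$p.value, diagonal = tt_diag$p.value))
```

A p-value below 0.05 leads to rejecting equality of the two group means at the 5% level; on the real data the same call would be made with `df_notes` in place of `toy`.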
# multiplot: display several plots on one page
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
data <- df_notes
p1 <- ggplot(data, aes(x=diagonal, fill=is_genuine, color=is_genuine)) + geom_density(alpha=.3)
p2 <- ggplot(data, aes(x=height_left, fill=is_genuine, color=is_genuine)) + geom_density(alpha=.3)
p3 <- ggplot(data, aes(x=height_right, fill=is_genuine, color=is_genuine))+ geom_density(alpha=.3)
p4 <- ggplot(data, aes(x=margin_low, fill=is_genuine, color=is_genuine))+ geom_density(alpha=.3)
p5 <- ggplot(data, aes(x=margin_up, fill=is_genuine, color=is_genuine)) + geom_density(alpha=.3)
p6 <- ggplot(data, aes(x=length, fill=is_genuine, color=is_genuine)) + geom_density(alpha=.3)
multiplot(p1, p2, p3, p4,p5,p6, cols=2)
df_notes_vf<-df_notes
df_notes_vf$is_genuine[df_notes$is_genuine == "True"] <- "1"
df_notes_vf$is_genuine[df_notes$is_genuine == "False"] <- "0"
df_notes_vf_numeric <- transform(df_notes_vf, is_genuine = as.numeric(is_genuine))
df_notes_vf_numeric
# corrplot(Matrice, method = "circle")
matrice_cor <- cor(df_notes_vf_numeric[,2:7])
corrplot(matrice_cor, type="upper", sig.level = 0.01)
df_notes_vf_scaled <- data.frame(df_notes['is_genuine'], scale(df_notes[,2:7]))
res.pca <- PCA(df_notes_vf_scaled, scale.unit = FALSE, quali.sup=1, graph = FALSE)
eig.val <- get_eigenvalue(res.pca)
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 75))
-> There is an elbow at the second dimension, after which the decrease is more regular.
However, the first 3 principal components may be needed to reach 70% of the total information (inertia).
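The number of components needed to reach a given share of inertia can be read from the cumulative percentages of the eigenvalues. The sketch below uses base R's `prcomp()` on simulated data as a stand-in for the FactoMineR result, so the figures it prints are not those of the banknote PCA.

```r
# Simulated data with six variables (stand-in for the six measurements)
set.seed(2)
X <- matrix(rnorm(170 * 6), ncol = 6)
X[, 2] <- X[, 1] + rnorm(170, sd = 0.5)  # introduce some correlation

# Standardised PCA: eigenvalues are the squared standard deviations
pca <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- pca$sdev^2
pct <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct)

# First component at which the cumulative share reaches 70%
n_axes <- which(cum_pct >= 70)[1]

print(round(data.frame(eigenvalue = eig, pct = pct, cum_pct = cum_pct), 2))
print(n_axes)
```

With the FactoMineR object used in this script, the same cumulative column is available in the table returned by `get_eigenvalue(res.pca)`.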
var <- get_pca_var(res.pca)
corrplot(var$cos2, is.corr=FALSE,addCoef.col = "black", title="Qualité de représentation des variables", mar=c(0,0,1,0))
-> in the first component, margin_low and length are the best represented
-> in the second, margin_low is still slightly represented
-> in the third, it is the distance (presumably diagonal) that is best represented
## CONCLUSION: we can work on factorial planes 1-2 and 1-3; further analyses will tell us more
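The cos2 values discussed above are the squared correlations between each variable and each component. As an illustration only, the sketch below recomputes them with base R's `prcomp()` on simulated data (variables `v1` to `v4` are made up), rather than from the `res.pca` object.

```r
# Simulated data with four named variables
set.seed(3)
X <- matrix(rnorm(170 * 4), ncol = 4,
            dimnames = list(NULL, paste0("v", 1:4)))
X[, "v2"] <- X[, "v1"] + rnorm(170, sd = 0.4)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# For a standardised PCA, variable coordinates are the correlations
# between variables and components: loading * component std deviation
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Quality of representation: squared coordinates
cos2 <- var_coord^2
print(round(cos2, 3))
```

Across all components the cos2 of a variable sum to 1, so a variable with a low cos2 on a given plane is poorly represented there and should not be interpreted from that plane.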
img_graphe_correlation<-fviz_pca_var(res.pca, col.var = "cos2", axes = c(1,2),
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoids overlapping text
)
# Save the plot explicitly: ggsave() is a function call, not a layer to add with "+"
ggsave("img_graphe_correlation.png", plot = img_graphe_correlation, width = 11, height = 8)
plot(img_graphe_correlation)
img_graphe_correlation_1_3<-fviz_pca_var(res.pca, col.var = "cos2", axes = c(1,3),
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoids overlapping text
)
# Save the plot explicitly: ggsave() is a function call, not a layer to add with "+"
ggsave("img_graphe_correlation_1_3.png", plot = img_graphe_correlation_1_3, width = 11, height = 8)
plot(img_graphe_correlation_1_3)
The correlation circle helps interpret the principal axes (axes of inertia).
fviz_pca_biplot(res.pca, habillage = 1, addEllipses =TRUE, pointsize = 1, labelsize = 3, axes = c(1,2) )
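Hierarchical agglomerative clustering (CAH in French), 2 clusters. The clustering groups the banknotes into two clusters based on the variables; the clusters are then compared with is_genuine to see whether they separate genuine from counterfeit banknotes.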
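The principle of a two-cluster hierarchical clustering can be sketched with base R alone. This is a stand-in under stated assumptions: it uses `hclust()` with Ward's criterion on simulated two-dimensional data, whereas the script relies on FactoMineR's `HCPC()` applied to the PCA result, so the two procedures are not identical.

```r
# Two simulated groups of 50 points each (illustrative data)
set.seed(4)
toy <- rbind(
  matrix(rnorm(50 * 2, mean = 0), ncol = 2),
  matrix(rnorm(50 * 2, mean = 4), ncol = 2)
)
truth <- rep(c("A", "B"), each = 50)

# Euclidean distances on standardised data, Ward agglomeration
d <- dist(scale(toy))
tree <- hclust(d, method = "ward.D2")

# Cut the tree into 2 clusters and compare with the known groups
groups <- cutree(tree, k = 2)
tab <- table(truth, groups)
print(tab)
```

The clustering never sees `truth`; the cross-table is only used afterwards to check how well the clusters match the known groups, which is the role is_genuine plays here.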
res.hcpc <- HCPC(res.pca, nb.clust=2, graph = FALSE) #
head(res.hcpc)
$data.clust
$desc.var
Link between the cluster variable and the categorical variables (chi-square test)
=================================================================================
p.value df
is_genuine 1.699353e-31 1
Description of each cluster by the categories
=============================================
$`1`
Cla/Mod Mod/Cla Global p.value v.test
is_genuine=True 92.000000 98.924731 58.82353 3.02179e-37 12.75243
is_genuine=False 1.428571 1.075269 41.17647 3.02179e-37 -12.75243
$`2`
Cla/Mod Mod/Cla Global p.value v.test
is_genuine=False 98.57143 89.61039 41.17647 3.02179e-37 12.75243
is_genuine=True 8.00000 10.38961 58.82353 3.02179e-37 -12.75243
Link between the cluster variable and the quantitative variables
================================================================
Eta2 P-value
length 0.6380042 6.544704e-39
margin_low 0.5511088 4.963822e-31
height_right 0.4301258 2.844171e-22
height_left 0.3460892 3.295328e-17
margin_up 0.3417477 5.780388e-17
Description of each cluster by quantitative variables
=====================================================
$`1`
v.test Mean in category Overall mean sd in category Overall sd p.value
length 10.383771 0.7246604 1.625579e-15 0.3725574 0.9970545 2.939300e-25
margin_up -7.599695 -0.5303659 4.501709e-16 0.8122175 0.9970545 2.968301e-14
height_left -7.647815 -0.5337241 -1.626105e-14 0.8634866 0.9970545 2.044229e-14
height_right -8.525917 -0.5950049 -6.826402e-15 0.7329393 0.9970545 1.516041e-17
margin_low -9.650771 -0.6735060 -5.177639e-16 0.4933067 0.9970545 4.878742e-22
$`2`
v.test Mean in category Overall mean sd in category Overall sd p.value
margin_low 9.650771 0.8134553 -5.177639e-16 0.8314510 0.9970545 4.878742e-22
height_right 8.525917 0.7186422 -6.826402e-15 0.7758459 0.9970545 1.516041e-17
height_left 7.647815 0.6446278 -1.626105e-14 0.7312085 0.9970545 2.044229e-14
margin_up 7.599695 0.6405718 4.501709e-16 0.8049586 0.9970545 2.968301e-14
length -10.383771 -0.8752392 1.625579e-15 0.7917512 0.9970545 2.939300e-25
$desc.axes
Link between the cluster variable and the quantitative variables
================================================================
Eta2 P-value
Dim.1 0.78592540 4.052533e-58
Dim.2 0.05255236 2.637868e-03
Description of each cluster by quantitative variables
=====================================================
$`1`
v.test Mean in category Overall mean sd in category Overall sd p.value
Dim.2 2.980159 0.2387163 -3.618429e-17 1.035101 1.144411 2.880985e-03
Dim.1 -11.524816 -1.3570557 -4.065375e-17 0.844393 1.682299 9.892284e-31
$`2`
v.test Mean in category Overall mean sd in category Overall sd p.value
Dim.1 11.524816 1.6390413 -4.065375e-17 0.6902571 1.682299 9.892284e-31
Dim.2 -2.980159 -0.2883197 -3.618429e-17 1.2022770 1.144411 2.880985e-03
$desc.ind
$desc.ind$para
Cluster: 1
47 90 86 60 82
0.3706149 0.5573561 0.6140290 0.6383198 0.6876754
------------------------------------------------------------------------------------------
Cluster: 2
133 106 112 114 115
0.5059074 0.5332767 0.7610073 0.8182491 0.8357569
$desc.ind$dist
Cluster: 1
5 50 40 30 41
5.352441 4.952443 4.782645 4.743077 4.511329
------------------------------------------------------------------------------------------
Cluster: 2
167 123 1 113 159
5.283726 5.183932 4.831765 4.821886 4.703969
$call
$call$t
$call$t$res
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 170 individuals, described by 7 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$quali.sup" "results for the supplementary categorical variables"
12 "$quali.sup$coord" "coord. for the supplementary categories"
13 "$quali.sup$v.test" "v-test of the supplementary categories"
14 "$call" "summary statistics"
15 "$call$centre" "mean of the variables"
16 "$call$ecart.type" "standard error of the variables"
17 "$call$row.w" "weights for the individuals"
18 "$call$col.w" "weights for the variables"
$call$t$tree
Call:
flashClust::hclust(d = dissi, method = method, members = weight)
Cluster method : ward
Distance : euclidean
Number of objects: 170
$call$t$nb.clust
[1] 3
$call$t$within
[1] 5.7725629478 3.5216096062 2.8822349752 2.5372368521 2.3086311796 2.1053342348 1.9201623452 1.7959504723
[9] 1.7063752694 1.6195953318 1.5359365894 1.4547471615 1.3777404740 1.3164514466 1.2566719232 1.1979150006
[17] 1.1397298450 1.1030633286 1.0696086745 1.0368277160 1.0042270707 0.9728749971 0.9421426698 0.9119650895
[25] 0.8826535120 0.8534067171 0.8263212996 0.7998404412 0.7745263155 0.7513042491 0.7283599048 0.7060653274
[33] 0.6839042923 0.6618414631 0.6410198650 0.6204314992 0.6009762585 0.5818849219 0.5631246791 0.5467696155
[41] 0.5316927053 0.5166275196 0.5019226130 0.4873178525 0.4729759978 0.4605117283 0.4484657905 0.4364247177
[49] 0.4244189952 0.4131806274 0.4020348461 0.3918511297 0.3821537846 0.3726518839 0.3636523046 0.3551266323
[57] 0.3466205645 0.3381958500 0.3300306225 0.3220887811 0.3141524188 0.3063450403 0.2986927636 0.2917950204
[65] 0.2851479070 0.2785355837 0.2719462901 0.2654219703 0.2589358184 0.2525715591 0.2462446200 0.2399330629
[73] 0.2336562982 0.2274164150 0.2216801391 0.2159710077 0.2103574465 0.2049645162 0.1996316330 0.1943987634
[81] 0.1892868147 0.1842691704 0.1795417302 0.1749930629 0.1706271957 0.1663459239 0.1621049400 0.1579526700
[89] 0.1539625511 0.1500147348 0.1462088092 0.1425218640 0.1388739686 0.1352669373 0.1316781386 0.1281720358
[97] 0.1247158258 0.1212929774 0.1179024910 0.1145955337 0.1112923850 0.1080190806 0.1047561909 0.1016464603
[105] 0.0985446912 0.0955912941 0.0926878351 0.0898082492 0.0869456325 0.0841238094 0.0814814059 0.0788436790
[113] 0.0762264160 0.0736958415 0.0712136259 0.0687760048 0.0664423531 0.0641393942 0.0619265224 0.0597179228
[121] 0.0575252891 0.0553645906 0.0532388010 0.0511207781 0.0491402619 0.0471978350 0.0454091505 0.0436497883
[129] 0.0419026869 0.0402297986 0.0385958479 0.0369779644 0.0353606230 0.0337528408 0.0322260007 0.0307458373
[137] 0.0292808281 0.0278229367 0.0263931419 0.0249647376 0.0235615049 0.0221752711 0.0208736976 0.0195995187
[145] 0.0183913104 0.0172273613 0.0161350854 0.0150861818 0.0140839056 0.0131047823 0.0121381514 0.0112078259
[153] 0.0102990755 0.0094095326 0.0085317046 0.0076708425 0.0068802012 0.0060924760 0.0053867782 0.0047136514
[161] 0.0041009399 0.0035224122 0.0029515071 0.0024678124 0.0019874070 0.0015346056 0.0010914685 0.0006937041
[169] 0.0003096729
$call$t$inert.gain
[1] 2.2509533416 0.6393746311 0.3449981230 0.2286056725 0.2032969448 0.1851718896 0.1242118729 0.0895752029
[9] 0.0867799376 0.0836587424 0.0811894279 0.0770066875 0.0612890274 0.0597795234 0.0587569226 0.0581851556
[17] 0.0366665165 0.0334546541 0.0327809585 0.0326006454 0.0313520735 0.0307323274 0.0301775802 0.0293115775
[25] 0.0292467949 0.0270854175 0.0264808585 0.0253141257 0.0232220664 0.0229443443 0.0222945774 0.0221610352
[33] 0.0220628292 0.0208215981 0.0205883658 0.0194552407 0.0190913366 0.0187602429 0.0163550636 0.0150769102
[41] 0.0150651857 0.0147049066 0.0146047605 0.0143418547 0.0124642694 0.0120459378 0.0120410728 0.0120057225
[49] 0.0112383678 0.0111457813 0.0101837164 0.0096973451 0.0095019008 0.0089995792 0.0085256723 0.0085060678
[57] 0.0084247145 0.0081652275 0.0079418414 0.0079363622 0.0078073785 0.0076522768 0.0068977432 0.0066471133
[65] 0.0066123233 0.0065892937 0.0065243197 0.0064861519 0.0063642593 0.0063269391 0.0063115571 0.0062767646
[73] 0.0062398832 0.0057362759 0.0057091313 0.0056135612 0.0053929303 0.0053328832 0.0052328697 0.0051119487
[81] 0.0050176443 0.0047274402 0.0045486673 0.0043658673 0.0042812717 0.0042409839 0.0041522700 0.0039901189
[89] 0.0039478163 0.0038059256 0.0036869453 0.0036478954 0.0036070313 0.0035887987 0.0035061028 0.0034562100
[97] 0.0034228483 0.0033904865 0.0033069573 0.0033031488 0.0032733043 0.0032628898 0.0031097306 0.0031017692
[105] 0.0029533970 0.0029034590 0.0028795859 0.0028626167 0.0028218231 0.0026424035 0.0026377269 0.0026172629
[113] 0.0025305745 0.0024822156 0.0024376211 0.0023336517 0.0023029588 0.0022128718 0.0022085996 0.0021926338
[121] 0.0021606984 0.0021257896 0.0021180229 0.0019805162 0.0019424270 0.0017886844 0.0017593622 0.0017471014
[129] 0.0016728883 0.0016339507 0.0016178835 0.0016173414 0.0016077822 0.0015268400 0.0014801634 0.0014650092
[137] 0.0014578913 0.0014297949 0.0014284043 0.0014032327 0.0013862338 0.0013015735 0.0012741789 0.0012082083
[145] 0.0011639491 0.0010922759 0.0010489036 0.0010022762 0.0009791233 0.0009666309 0.0009303255 0.0009087504
[153] 0.0008895429 0.0008778279 0.0008608621 0.0007906413 0.0007877252 0.0007056978 0.0006731268 0.0006127115
[161] 0.0005785276 0.0005709052 0.0004836947 0.0004804054 0.0004528015 0.0004431370 0.0003977644 0.0003840312
[169] 0.0003096729
$call$t$quot
[1] 0.8184425 0.8803019 0.9098998 0.9119405 0.9120463 0.9353118 0.9501238 0.9491437
$call$min
[1] 3
$call$max
[1] 10
$call$X
$call$bw.before.consol
[1] 2.250953
$call$bw.after.consol
[1] 2.296803
$call$vec
[1] FALSE
$call$call
HCPC(res = res.pca, nb.clust = 2, graph = FALSE)
img_dendrogramme <- fviz_dend(res.hcpc,
labelsize = 2,
cex = 0.5, # Text size
palette = "jco", # Colour palette, see ?ggpubr::ggpar
rect = TRUE, rect_fill = TRUE, # Draw a rectangle around each group
rect_border = "jco", # Rectangle colour
labels_track_height = 0.8 # Leave more room for the labels
)
plot(img_dendrogramme)
img_factor_map <- fviz_cluster(res.hcpc,
labelsize = 5,
repel = FALSE, # Set to TRUE to avoid overlapping labels
show.clust.cent = TRUE, # Show the cluster centres
palette = "jco", # Colour palette, see ?ggpubr::ggpar
ggtheme = theme_minimal(),
main = "Factor map"
)
# Save the plot explicitly: ggsave() is a function call, not a layer to add with "+"
ggsave("P6_img_factormap.png", plot = img_factor_map, width = 11, height = 8)
plot(img_factor_map)
df_billets_clust <- res.hcpc$data.clust
# Relabel the clusters so they can be compared with is_genuine:
# cluster 1 corresponds to the genuine banknotes, cluster 2 to the counterfeit ones.
df_billets_clust$clust <- ifelse(df_billets_clust$clust == "1", "True", "False")
# Confusion matrix built from the actual labels and the labels assigned by the HCA
set.seed(123)
mc_clust = table(df_billets_clust$is_genuine,df_billets_clust$clust)
mc_clust
False True
False 69 1
True 8 92
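The confusion matrix above can be summarised by an accuracy figure. The sketch below re-enters the four counts printed above by hand (rows are actual labels, columns are cluster labels); the object names are illustrative only.

```r
# Counts copied from the confusion matrix above (filled column by column)
mc <- matrix(c(69, 8, 1, 92), nrow = 2,
             dimnames = list(actual = c("False", "True"),
                             predicted = c("False", "True")))

# Correctly grouped banknotes sit on the diagonal
accuracy <- sum(diag(mc)) / sum(mc)
error_count <- sum(mc) - sum(diag(mc))

# Share of each actual class that lands in the matching cluster
recall <- diag(mc) / rowSums(mc)

print(round(accuracy, 3))   # 161 / 170, about 0.947
print(error_count)          # 9
print(round(recall, 3))
```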
#reskmeansCR$cluster
#df_notes_vf_numeric$is_genuine
set.seed(123)
MC = table( df_notes_vf_numeric$is_genuine, res.hcpc$data.clust[,8])
MC
1 2
0 1 69
1 92 8
We treat is_genuine as an illustrative (supplementary) variable: it takes no part in building the analysis, but it helps interpret it.
The variable margin_low is also strongly correlated with length and with is_genuine.
The biplot of variables and individuals shows two well-separated groups along length and margin_low. The further an individual lies in the direction of length, the more likely it is a genuine banknote; the further it lies in the direction of margin_low, the more likely it is a counterfeit.
df_notes_vf_numeric[,1:7]
res.pca = PCA(df_notes_vf_numeric[,1:7], scale.unit=TRUE,axes = c(1,3), graph=T, quali.sup = 1)
set.seed(123) # set the seed before kmeans() so that its random starts are reproducible
reskmeans<- kmeans(df_notes_vf_numeric[,2:7],centers = 2, nstart=5)
K-means on the centred and scaled data frame gives a lower-quality partition, judging by the ratio of the inertia explained by the partition to the total inertia
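The ratio mentioned above (inertia explained by the partition over total inertia) is available directly from a `kmeans()` result. The sketch below compares raw and standardised inputs on simulated data, so the printed ratios say nothing about the banknote data; `ratio()` is a small illustrative helper.

```r
# Two simulated groups (illustrative data)
set.seed(5)
toy <- rbind(matrix(rnorm(100 * 2, mean = 0), ncol = 2),
             matrix(rnorm(100 * 2, mean = 3), ncol = 2))

# K-means on the raw data and on the centred, scaled data
km_raw    <- kmeans(toy, centers = 2, nstart = 5)
km_scaled <- kmeans(scale(toy), centers = 2, nstart = 5)

# Between-cluster sum of squares divided by total sum of squares
ratio <- function(km) km$betweenss / km$totss

print(c(raw = ratio(km_raw), scaled = ratio(km_scaled)))
```

A ratio closer to 1 means the clusters account for more of the total inertia; note that the two ratios are computed in different units when one input is scaled and the other is not.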
#reskmeans$cluster
#df_notes_vf_numeric$is_genuine
set.seed(123)
MC = table( df_notes_vf_numeric$is_genuine, reskmeans$cluster)
MC
1 2
0 68 2
1 1 99
Judging by the confusion matrices, k-means (3 misclassified banknotes out of 170) gives a better result here than the HCA (9 misclassified)
#data<-df_notes_vf_numeric
data<-df_notes
#install.packages("caret")
#install.packages("e1071")
library(caret)
Loading required package: lattice
library(lattice)
library(e1071)
print(dim(data))
[1] 170 7
head(data)
nrow(data)
[1] 170
data <- transform(data, is_genuine = as.character(is_genuine))
# Genuine banknotes are coded 1.
data$is_genuine[data$is_genuine=='True']<-1
# Counterfeit banknotes are coded 0.
data$is_genuine[data$is_genuine=='False']<-0
data <- transform(data, is_genuine = as.numeric(is_genuine))
head(data)
#data$is_genuine= as.factor(data$is_genuine)
set.seed(123)
trainIndex <-createDataPartition(data$is_genuine,p=0.8,list=F)
print(length(trainIndex))
[1] 136
#first 10 individuals of the training sample
print(head(trainIndex,10))
Resample1
[1,] 1
[2,] 4
[3,] 5
[4,] 6
[5,] 7
[6,] 8
[7,] 9
[8,] 11
[9,] 12
[10,] 13
#data frame for the training individuals
dataTrain <-data[trainIndex,]
print(dim(dataTrain))
[1] 136 7
dataTest <-data[-trainIndex,]
print(dim(dataTest))
[1] 34 7
print(table(dataTrain$is_genuine))
0 1
58 78
print(prop.table(table(data$is_genuine)))
0 1
0.4117647 0.5882353
print(prop.table(table(dataTrain$is_genuine)))
0 1
0.4264706 0.5735294
print(prop.table(table(dataTest$is_genuine)))
0 1
0.3529412 0.6470588
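The class proportions above differ slightly between the full data, the training set and the test set. A split that preserves the class proportions can be sketched in base R as follows, assuming only the class counts reported earlier (70 counterfeit, 100 genuine); this is an illustration, not the partition produced by `createDataPartition()` in the script.

```r
set.seed(123)

# Outcome vector with the class counts reported above (0 = counterfeit)
y <- rep(c(0, 1), times = c(70, 100))

# Sample 80% of the row indices within each class separately
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(
  lapply(idx_by_class, function(idx) sample(idx, size = round(0.8 * length(idx)))),
  use.names = FALSE
)

train_y <- y[train_idx]
test_y  <- y[-train_idx]

print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```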
test<- lm(is_genuine~.,data=dataTrain)
test
Call:
lm(formula = is_genuine ~ ., data = dataTrain)
Coefficients:
(Intercept) diagonal height_left height_right margin_low margin_up length
-14.38596 -0.00225 0.10299 -0.11018 -0.36888 -0.64928 0.17629
default_glm_mod = train(
form = is_genuine ~ .,
data = data,
trControl = trainControl(method = "cv", number = 5),
method = "glmStepAIC",
family = "binomial"
)
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
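The two glm.fit warnings above are typical of a logistic regression fitted to classes that the predictors separate perfectly (complete separation). The toy example below, built on made-up data where one predictor splits the two classes exactly, reproduces this kind of warning and collects the messages; it is a sketch for illustration and does not prove that separation is the cause in this script.

```r
# Made-up data: x <= 5 is always "False", x >= 6 is always "True"
x <- 1:10
y <- factor(rep(c("False", "True"), each = 5), levels = c("False", "True"))

# Fit the model while collecting warning messages instead of printing them
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(unique(msgs))
print(round(fitted(fit), 3))
```

The outcome is declared as a two-level factor here, which is what the first message above recommends for a classification problem.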
Start: AIC=14
.outcome ~ diagonal + height_left + height_right + margin_low +
margin_up + length
Df Deviance AIC
- diagonal 1 0.000 12.000
- height_right 1 0.000 12.000
- height_left 1 0.000 12.000
<none> 0.000 14.000
- margin_up 1 7.622 19.622
- length 1 9.606 21.606
- margin_low 1 37.587 49.587
Step: AIC=12
.outcome ~ height_left + height_right + margin_low + margin_up +
length
Df Deviance AIC
- height_right 1 0.000 10.000
- height_left 1 0.000 10.000
<none> 0.000 12.000
- margin_up 1 8.111 18.111
- length 1 10.780 20.780
- margin_low 1 40.446 50.446
Step: AIC=10
.outcome ~ height_left + margin_low + margin_up + length
Df Deviance AIC
- height_left 1 0.000 8.000
<none> 0.000 10.000
- margin_up 1 8.257 16.257
- length 1 11.048 19.048
- margin_low 1 43.740 51.740
Step: AIC=8
.outcome ~ margin_low + margin_up + length
Df Deviance AIC
<none> 0.000 8.000
- margin_up 1 8.259 14.259
- length 1 11.148 17.148
- margin_low 1 46.722 52.722
Start: AIC=14
.outcome ~ diagonal + height_left + height_right + margin_low +
margin_up + length
Df Deviance AIC
- diagonal 1 0.0000 12.000
- height_left 1 0.0000 12.000
- height_right 1 0.0000 12.000
- margin_up 1 0.0000 12.000
<none> 0.0000 14.000
- length 1 7.5789 19.579
- margin_low 1 23.2244 35.224
Step: AIC=12
.outcome ~ height_left + height_right + margin_low + margin_up +
length
Df Deviance AIC
- height_left 1 0.0000 10.000
- height_right 1 0.0000 10.000
<none> 0.0000 12.000
- margin_up 1 6.1447 16.145
- length 1 7.9797 17.980
- margin_low 1 29.1631 39.163
Step: AIC=10
.outcome ~ height_right + margin_low + margin_up + length
Df Deviance AIC
- height_right 1 0.0000 8.000
<none> 0.0000 10.000
- margin_up 1 6.7383 14.738
- length 1 8.5560 16.556
- margin_low 1 29.5859 37.586
Step: AIC=8
.outcome ~ margin_low + margin_up + length
Df Deviance AIC
<none> 0.000 8.000
- margin_up 1 6.738 12.738
- length 1 9.204 15.204
- margin_low 1 38.651 44.651
Start: AIC=14
.outcome ~ diagonal + height_left + height_right + margin_low +
margin_up + length
Df Deviance AIC
- height_left 1 0.000 12.000
- height_right 1 0.000 12.000
- diagonal 1 0.000 12.000
- length 1 0.000 12.000
<none> 0.000 14.000
- margin_up 1 5.332 17.332
- margin_low 1 37.863 49.863
Step: AIC=12
.outcome ~ diagonal + height_right + margin_low + margin_up +
length
Df Deviance AIC
- diagonal 1 0.000 10.000
- height_right 1 0.000 10.000
- length 1 0.000 10.000
<none> 0.000 12.000
- margin_up 1 5.591 15.591
- margin_low 1 37.983 47.983
Step: AIC=10
.outcome ~ height_right + margin_low + margin_up + length
Df Deviance AIC
- height_right 1 0.000 8.000
- length 1 0.000 8.000
<none> 0.000 10.000
- margin_up 1 5.742 13.742
- margin_low 1 42.481 50.481
Step: AIC=8
.outcome ~ margin_low + margin_up + length
Df Deviance AIC
- length 1 0.000 6.000
<none> 0.000 8.000
- margin_up 1 5.749 11.749
- margin_low 1 52.699 58.699
Step: AIC=6
.outcome ~ margin_low + margin_up
Df Deviance AIC
<none> 0.000 6.000
- margin_up 1 59.851 63.851
- margin_low 1 124.178 128.178
Start: AIC=14
.outcome ~ diagonal + height_left + height_right + margin_low +
margin_up + length
Df Deviance AIC
- diagonal 1 0.000 12.000
- height_right 1 0.000 12.000
- height_left 1 0.000 12.000
<none> 0.000 14.000
- margin_up 1 7.949 19.949
- length 1 10.436 22.436
- margin_low 1 35.314 47.314
Step: AIC=12
.outcome ~ height_left + height_right + margin_low + margin_up +
length
Df Deviance AIC
- height_right 1 0.000 10.000
- height_left 1 0.000 10.000
<none> 0.000 12.000
- margin_up 1 8.250 18.250
- length 1 10.436 20.436
- margin_low 1 44.743 54.743
Step: AIC=10
.outcome ~ height_left + margin_low + margin_up + length
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- height_left 1 0.000 8.000
<none> 0.000 10.000
- margin_up 1 8.358 16.358
- length 1 10.459 18.459
- margin_low 1 48.824 56.824
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=8
.outcome ~ margin_low + margin_up + length
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
<none> 0.000 8.000
- margin_up 1 8.378 14.378
- length 1 11.037 17.037
- margin_low 1 52.645 58.645
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Start: AIC=14
.outcome ~ diagonal + height_left + height_right + margin_low +
margin_up + length
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- diagonal 1 0.0000 12.000
- height_right 1 0.0000 12.000
- height_left 1 0.0000 12.000
- margin_up 1 0.0000 12.000
<none> 0.0000 14.000
- length 1 8.1912 20.191
- margin_low 1 27.3806 39.381
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=12
.outcome ~ height_left + height_right + margin_low + margin_up +
length
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- height_right 1 0.0000 10.000
- height_left 1 0.0000 10.000
- margin_up 1 0.0000 10.000
<none> 0.0000 12.000
- length 1 9.8466 19.847
- margin_low 1 30.3930 40.393
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=10
.outcome ~ height_left + margin_low + margin_up + length
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- height_left 1 0.000 8.000
- margin_up 1 0.000 8.000
<none> 0.000 10.000
- length 1 11.183 19.183
- margin_low 1 33.019 41.019
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=8
.outcome ~ margin_low + margin_up + length
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- margin_up 1 0.000 6.000
<none> 0.000 8.000
- length 1 11.269 17.269
- margin_low 1 37.695 43.695
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=6
.outcome ~ margin_low + length
Df Deviance AIC
<none> 0.000 6.000
- margin_low 1 41.685 45.685
- length 1 62.303 66.303
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Start: AIC=14
.outcome ~ diagonal + height_left + height_right + margin_low +
margin_up + length
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- diagonal 1 0.000 12.000
- height_right 1 0.000 12.000
- height_left 1 0.000 12.000
<none> 0.000 14.000
- margin_up 1 8.265 20.265
- length 1 11.198 23.198
- margin_low 1 42.342 54.342
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=12
.outcome ~ height_left + height_right + margin_low + margin_up +
length
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- height_right 1 0.000 10.000
- height_left 1 0.000 10.000
<none> 0.000 12.000
- margin_up 1 8.568 18.568
- length 1 12.462 22.462
- margin_low 1 47.782 57.782
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=10
.outcome ~ height_left + margin_low + margin_up + length
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- height_left 1 0.000 8.000
<none> 0.000 10.000
- margin_up 1 8.585 16.585
- length 1 12.716 20.716
- margin_low 1 53.624 61.624
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=8
.outcome ~ margin_low + margin_up + length
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
<none> 0.000 8.000
- margin_up 1 8.586 14.586
- length 1 12.721 18.721
- margin_low 1 57.812 63.812
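# Note: is_genuine is numeric here, so caret treats this as a regression and fits a Gaussian model;
# pass a two-level factor outcome to obtain a logistic regression instead.
m_lr <- train(is_genuine ~ ., data = dataTrain, method = "glmStepAIC", trControl = trainControl("none"))  # trControl = fitControl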
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
Start: AIC=-102.63
.outcome ~ diagonal + height_left + height_right + margin_low +
margin_up + length
Df Deviance AIC
- diagonal 1 3.3284 -104.630
<none> 3.3284 -102.632
- height_left 1 3.3845 -102.358
- height_right 1 3.3961 -101.893
- length 1 4.7762 -55.513
- margin_up 1 5.2373 -42.981
- margin_low 1 6.7733 -8.003
Step: AIC=-104.63
.outcome ~ height_left + height_right + margin_low + margin_up +
length
Df Deviance AIC
<none> 3.3284 -104.630
- height_left 1 3.3882 -104.208
- height_right 1 3.3990 -103.775
- length 1 4.7789 -57.438
- margin_up 1 5.2816 -43.834
- margin_low 1 7.2308 -1.115
print(summary(m_lr$finalModel))
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-0.45002 -0.10279 0.00227 0.10643 0.36675
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.63075 6.48615 -2.256 0.0258 *
height_left 0.10210 0.06679 1.529 0.1288
height_right -0.11073 0.06668 -1.661 0.0992 .
margin_low -0.36837 0.02984 -12.346 < 2e-16 ***
margin_up -0.64874 0.07428 -8.734 1.05e-14 ***
length 0.17633 0.02343 7.527 7.67e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.02560321)
Null deviance: 33.2647 on 135 degrees of freedom
Residual deviance: 3.3284 on 130 degrees of freedom
AIC: -104.63
Number of Fisher Scoring iterations: 2
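The summary above reports a Gaussian family because the outcome was passed as a numeric 0/1 column. As a minimal sketch of the difference, the block below fits both variants with base `glm` on a synthetic data frame. `toy` and its values are hypothetical and are not the real notes; noise is added so that the classes overlap and the fit converges.

```r
# Synthetic stand-in for the training set (hypothetical values, not the real notes)
set.seed(42)
n <- 200
toy <- data.frame(
  margin_low = rnorm(n, mean = 4.5, sd = 0.7),
  length     = rnorm(n, mean = 112.5, sd = 0.9)
)
# Noisy outcome so that the two classes overlap
eta <- -2 * (toy$margin_low - 4.5) + 1.5 * (toy$length - 112.5)
toy$is_genuine <- rbinom(n, size = 1, prob = plogis(eta))

# Numeric 0/1 outcome with the default family: a Gaussian linear model
fit_gaussian <- glm(is_genuine ~ ., data = toy)

# Two-level factor outcome with a binomial family: a logistic regression
toy$is_genuine <- factor(toy$is_genuine, levels = c(0, 1), labels = c("False", "True"))
fit_logit <- glm(is_genuine ~ ., data = toy, family = binomial(link = "logit"))

range(fitted(fit_gaussian))  # linear scores, may fall outside [0, 1]
range(fitted(fit_logit))     # probabilities of "True", always within [0, 1]
```

With a factor outcome, the fitted values are probabilities of the second level (`"True"`), which is what the 0.5 decision rule expects.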
modele <- glm(is_genuine ~ margin_low + margin_up + length, data = data, family=binomial(link="logit"))
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
# Prediction on the test set
pred.res <- predict(default_glm_mod, newdata = dataTest)
# Distribution of the predicted values
table(pred.res)
pred.res
2.22044604925031e-16 3.36519837106077e-13 3.08897338398342e-10 0.999999991661286 1
10 1 1 1 21
# Decision rule: a probability greater than or equal to 0.5 means a genuine note
pred.res <- ifelse(pred.res >= 0.5, "True", "False")
dataTest$is_genuine
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
dataTest$pred <- pred.res
dataTest <- transform(dataTest, pred = as.factor(pred))
dataTest$is_genuine[dataTest$is_genuine == "1"] <- "True"
dataTest$is_genuine[dataTest$is_genuine == "0"] <- "False"
dataTest
set.seed(123)
MC = table( dataTest$is_genuine, dataTest$pred)
MC
False True
False 12 0
True 0 22
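As a small illustration, the usual summary metrics can be derived from this confusion matrix. The sketch below re-enters the counts shown above by hand, assuming rows are the actual classes and columns the predicted ones, as in the `table()` call.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
MC <- matrix(c(12, 0,
               0, 22),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("False", "True"),
                             predicted = c("False", "True")))

accuracy    <- sum(diag(MC)) / sum(MC)                    # share of correct predictions
sensitivity <- MC["True", "True"] / sum(MC["True", ])     # genuine notes correctly accepted
specificity <- MC["False", "False"] / sum(MC["False", ])  # counterfeit notes correctly rejected

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

All three metrics equal 1 on these 34 test notes, which is consistent with the perfect-separation warnings raised by `glm.fit`.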
summary(m_lr)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-0.45002 -0.10279 0.00227 0.10643 0.36675
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.63075 6.48615 -2.256 0.0258 *
height_left 0.10210 0.06679 1.529 0.1288
height_right -0.11073 0.06668 -1.661 0.0992 .
margin_low -0.36837 0.02984 -12.346 < 2e-16 ***
margin_up -0.64874 0.07428 -8.734 1.05e-14 ***
length 0.17633 0.02343 7.527 7.67e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.02560321)
Null deviance: 33.2647 on 135 degrees of freedom
Residual deviance: 3.3284 on 130 degrees of freedom
AIC: -104.63
Number of Fisher Scoring iterations: 2
# Prediction on the test set (m_lr is a Gaussian fit, so these are linear scores, not probabilities)
pred <- predict(m_lr, newdata = dataTest)
# Distribution of the predicted values
print(table(pred))
pred
-0.288922116221507 -0.164985973152636 -0.146087142732345 -0.0361436259450549 -0.0214389890448459 -0.0191442486616396
1 1 1 1 1 1
0.0254033891163701 0.0718979695464412 0.235371181737662 0.340730215671069 0.387760271588281 0.439124878075962
1 1 1 1 1 1
0.495541570210069 0.648673367312153 0.651271730742877 0.777116223412456 0.790764631702817 0.813729784539092
1 1 1 1 1 1
0.816157373247556 0.844662642192681 0.855458881416993 0.868105489105375 0.882192200962251 0.888773413040425
1 1 1 1 1 1
0.983634553456525 0.984244859276597 1.00107311893466 1.01042262669148 1.04001273362791 1.05789933497058
1 1 1 1 1 1
1.06421572343166 1.07228573257934 1.11567058462171 1.14053429494833
1 1 1 1
dataTest
summary(dataTest)
is_genuine diagonal height_left height_right margin_low margin_up length
Length:34 Min. :171.6 Min. :103.5 Min. :103.2 Min. :3.640 Min. :2.700 Min. :111.0
Class :character 1st Qu.:171.8 1st Qu.:103.9 1st Qu.:103.8 1st Qu.:4.020 1st Qu.:3.022 1st Qu.:111.9
Mode :character Median :172.0 Median :104.0 Median :104.0 Median :4.395 Median :3.165 Median :112.8
Mean :171.9 Mean :104.1 Mean :104.0 Mean :4.584 Mean :3.154 Mean :112.6
3rd Qu.:172.1 3rd Qu.:104.2 3rd Qu.:104.2 3rd Qu.:4.935 3rd Qu.:3.300 3rd Qu.:113.4
Max. :172.3 Max. :104.7 Max. :104.9 Max. :6.190 Max. :3.660 Max. :113.8
pred
False:12
True :22
summary(pred)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.2889 0.2617 0.8022 0.6066 0.9841 1.1405
head(pred)
2 3 10 15 18 20
1.1156706 1.0578993 0.6512717 0.9842449 1.0642157 1.0010731
print(dataTest$is_genuine)
[1] "True" "True" "True" "True" "True" "True" "True" "True" "True" "True" "True" "True" "True" "True"
[15] "True" "True" "True" "True" "True" "True" "True" "True" "False" "False" "False" "False" "False" "False"
[29] "False" "False" "False" "False" "False" "False"
#mat <-confusionMatrix(data=pred,reference=dataTest$is_genuine,positive="True")
#print(mat)
model <- glm(is_genuine ~ margin_low + margin_up + length, data = df_notes_vf_numeric, family=binomial(link="logit"))
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
# Load the example notes to classify (comma-separated, "." as decimal mark)
fic_exemple <- paste0(repertoire_sources, "example.csv")
df_exemple <- read.csv(fic_exemple, header = TRUE, sep = ",", dec = ".")
head(df_exemple)
# Prediction on the example notes
pred <- predict(m_lr, newdata = df_exemple)
# Distribution of the predicted values
print(table(pred))
pred
-0.118141595741676 0.110210622375941 0.134463092522076 0.849434579974368 1.02726845628282
1 1 1 1 1
pred
1 2 3 4 5
0.1102106 -0.1181416 0.1344631 0.8494346 1.0272685
# m_lr returns a continuous score, so threshold it at 0.5 to obtain the class label
# (comparing the raw scores with "1" and "2" never matched any row)
df_exemple$is_genuine <- ifelse(pred >= 0.5, "True", "False")
df_billets <- df_notes
df_billets$id <- rownames(df_billets)
df_data_complet <- merge(df_billets, df_exemple, all = T, sort = FALSE)
rownames(df_data_complet) <- df_data_complet$id
dim(df_data_complet)
[1] 175 8
#df_billets <- transform(df_billets, id = as.factor(id))
# PCA on the original notes, with the predicted example notes as supplementary individuals
res.pca <- PCA(df_data_complet[,c(1:7)], quali.sup = 1, ind.sup = 171:175, scale.unit = TRUE, graph=FALSE)
# Plot the predicted notes against the genuine and counterfeit groups
graphe <- fviz_pca_ind(res.pca, geom.ind = "point",pointsize = 1,habillage=1,addEllipses = TRUE)
graphe <- fviz_add(graphe, res.pca$ind.sup$coord)
graphe
# Load the notes to classify (comma-separated, "." as decimal mark)
fic_a_predire <- paste0(repertoire_sources, "example.csv")
df_a_predire <- read.csv(fic_a_predire, header = TRUE, sep = ",", dec = ".")
head(df_a_predire)
# Prediction on the notes to classify
pred <- predict(m_lr, newdata = df_a_predire)
# Distribution of the predicted values
print(table(pred))
pred
-0.118141595741676 0.110210622375941 0.134463092522076 0.849434579974368 1.02726845628282
1 1 1 1 1
pred
1 2 3 4 5
0.1102106 -0.1181416 0.1344631 0.8494346 1.0272685
# m_lr returns a continuous score, so threshold it at 0.5 to obtain the class label
# (comparing the raw scores with "1" and "2" never matched any row)
df_a_predire$is_genuine <- ifelse(pred >= 0.5, "True", "False")
df_billets <- df_notes
df_billets$id <- rownames(df_billets)
df_data_complet <- merge(df_billets, df_a_predire, all = T, sort = FALSE)
rownames(df_data_complet) <- df_data_complet$id
dim(df_data_complet)
[1] 175 8
#df_billets <- transform(df_billets, id = as.factor(id))
# PCA on the original notes, with the predicted notes as supplementary individuals
res.pca <- PCA(df_data_complet[,c(1:7)], quali.sup = 1, ind.sup = 171:175, scale.unit = TRUE, graph=FALSE)
# Plot the predicted notes against the genuine and counterfeit groups
graphe <- fviz_pca_ind(res.pca, geom.ind = "point",pointsize = 1,habillage=1,addEllipses = TRUE)
graphe <- fviz_add(graphe, res.pca$ind.sup$coord)
graphe
# Use the fitted logistic model to predict whether each note is genuine
fitted.results <- predict(model, newdata = df_exemple, type = "response")
print(fitted.results)
1 2 3 4 5
2.220446e-16 2.220446e-16 2.220446e-16 1.000000e+00 1.000000e+00
# Decision rule: a probability greater than or equal to 0.5 means a genuine note
fitted.results <- ifelse(fitted.results >= 0.5, "True", "False")
# Add the is_genuine column to the df_exemple data frame
df_exemple$is_genuine <- fitted.results
df_exemple <- transform(df_exemple, is_genuine = as.factor(is_genuine))
df_exemple
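The prediction steps above can be wrapped in a single reusable function. The following is only a sketch: `predict_notes`, `toy`, `toy_model` and `new_notes` are hypothetical names, and the model is trained on synthetic, overlapping data (not the real notes) so that the example runs on its own and `glm` converges.

```r
# Hypothetical helper: returns the probability of being genuine and the resulting label
predict_notes <- function(model, newdata, threshold = 0.5) {
  proba <- predict(model, newdata = newdata, type = "response")
  data.frame(
    newdata,
    proba_genuine = proba,
    # A probability greater than or equal to the threshold means a genuine note
    is_genuine = factor(ifelse(proba >= threshold, "True", "False"),
                        levels = c("False", "True"))
  )
}

# Synthetic training data (assumed values, for illustration only)
set.seed(1)
n <- 300
toy <- data.frame(
  margin_low = rnorm(n, 4.5, 0.7),
  margin_up  = rnorm(n, 3.1, 0.25),
  length     = rnorm(n, 112.5, 0.9)
)
eta <- -2 * (toy$margin_low - 4.5) - 2 * (toy$margin_up - 3.1) +
  1.5 * (toy$length - 112.5)
toy$is_genuine <- rbinom(n, size = 1, prob = plogis(eta))

toy_model <- glm(is_genuine ~ margin_low + margin_up + length,
                 data = toy, family = binomial(link = "logit"))

# Two made-up notes to classify
new_notes <- data.frame(margin_low = c(4.0, 5.5),
                        margin_up  = c(3.0, 3.4),
                        length     = c(113.2, 111.5))
predict_notes(toy_model, new_notes)
```

Returning the probability next to the label keeps the 0.5 rule visible and makes it easy to try another threshold.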