These are methods used when we want to explain a target variable as a function of a large number of explanatory variables.
We must have a response variable \(y\) (the output) and a large set of explanatory variables \(x_1,\ldots,x_p\) (the inputs).
Advantages of using decision trees

Example:
> library(tree)
> Pollute <- read.delim("Pollute.txt")
> colnames(Pollute)
[1] "Pollution" "Temp" "Industry" "Population" "Wind"
[6] "Rain" "Wet.days"
> model1<-tree(Pollute)
> plot(model1)
> text(model1)

The value shown at the bottom of each leaf is the mean of the sub-sample obtained by following the path down the tree.
Example:

spam: data collected at the Hewlett–Packard laboratories: 4601 emails, of which 1813 are flagged as spam, with 57 explanatory variables. The 6 explanatory variables retained in spam7:

- crl.tot, the total length of words written in capitals;
- dollar, the frequency of the symbol $;
- bang, the frequency of the symbol !;
- money, the frequency of the word “money”;
- n000, the frequency of “000”;
- make, the frequency of the word “make”.

> library(DAAG)
Loading required package: lattice
> data(spam7)
> colnames(spam7)
[1] "crl.tot" "dollar" "bang" "money" "n000" "make" "yesno"
> model2 = tree(yesno~.,spam7)
> plot(model2)
> text(model2)

The response variable \(y\) is categorical with \(K\) classes.
Splitting stops when the subgroups reach a minimum size, or when no further improvement is possible.
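In the rpart package used later in these notes, these stopping rules correspond to control parameters. A minimal illustration (the values shown are rpart's documented defaults):

> library(rpart)
> # minsplit: smallest node size eligible for a split; minbucket: smallest
> # allowed leaf; cp: minimum relative improvement required for a split
> ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0.01)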
Impurity measure \(\mbox{Imp}(t)\) of a node \(t\), for classification into \(K\) classes: \[ \mbox{Imp}(t) = \phi(p(t))\;;\; p(t) =\left[p(k\mid t),\; k\in \mathcal{K}\right]\] where \(\phi\) is a non-negative function of \(p(t)\) satisfying the following conditions:
\(\mbox{Imp}(t)\) is maximal when all the classes are mixed in “equal shares”, and minimal when the node contains only one class.
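A quick numerical check (ours, not from the original notes) of this property for the entropy impurity, with \(K=4\) classes:

> phiK <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # entropy of a proportion vector
> phiK(rep(1/4, 4))    # maximal: all classes in equal shares (= log 4)
[1] 1.386294
> phiK(c(1, 0, 0, 0))  # minimal: the node contains a single class
[1] 0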
Examples of impurity measures
\(K=2\), \(\phi(p)=-p\log p- (1-p)\log (1-p)\)
> phi1=function(p) -p*log(p)-(1-p)*log(1-p)
> phi2=function(p) 1-p^2-(1-p)^2
> curve(phi1,from = 0,to=1,col="red",ylab="",xlab=expression(p))
> curve(phi2,from = 0,to=1,col="blue",add=T)
> legend("topright",text.col = c("red","blue"),legend = c("Entropie","Gini"))Le taux d’erreur pour un nœud \(t\) s’exprime par \[ R(t)=\displaystyle\sum_{k\in \mathcal{K}\, k\not=Y(t)} p(k\mid t)\] ou \[R(t)=\left[1- \displaystyle\frac{N_{Y(t)}(t)}{N(t)}\right]\]
If \(\mathcal{T}\) denotes the set of terminal nodes, the error rate of a tree \(T\) is \[R(T)=\displaystyle\sum_{t\in \mathcal{T}} \displaystyle\frac{N(t)}{n} R(t)\]
Proposition: \(R(T)=\displaystyle\frac{1}{n}\left(n-\displaystyle\sum_{t\in \mathcal{T}} N_{Y(t)}(t)\right)\)
Conclusion: \(R(T)\in [0,1]\) and \(R(T)\rightarrow 0\) as the size of \(T\) increases.
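As a small illustration (ours), \(R(T)\) computed from the terminal-node class counts of a toy tree, checking the proposition:

> # one row per terminal node t, one column per class; N_{Y(t)}(t) is the row maximum
> counts <- rbind(c(40, 5), c(3, 30), c(10, 12))
> n <- sum(counts)
> sum(apply(counts, 1, function(ct) sum(ct)/n * (1 - max(ct)/sum(ct))))
[1] 0.18
> (n - sum(apply(counts, 1, max)))/n   # the proposition gives the same value
[1] 0.18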
The cost-complexity criterion is \[R_\alpha(T)=R(T)+\alpha |T|\] where \(|T|\) is the number of terminal nodes of \(T\); \(\alpha\) is called the complexity parameter.
Proposition: let \(t\) be a non-terminal node and \(T(t)\) the subtree rooted at \(t\). Then \[R(T(t))\leq R(t)\]
Hence the subtree \(T(t)\) is no longer worth keeping, and should be pruned back to the single node \(t\), as soon as \[R_\alpha(t)<R_\alpha(T(t)) \]
Since \(R_\alpha(t)=R(t)+\alpha\) and \(R_\alpha(T(t))=R(T(t))+\alpha|T(t)|\), we get
\[\begin{array}{rcl} R_\alpha(t)<R_\alpha(T(t)) &\iff & R(t)+\alpha < R(T(t))+\alpha|T(t)| \\ &\iff & \displaystyle\frac{R(t)-R(T(t))}{|T(t)|-1} <\alpha \end{array}\]
Let \(T\) be the set of non-terminal nodes of a tree. For \(t\in T\), set \[ g(t,T)=\displaystyle\frac{R(t)-R(T(t))}{|T(t)|-1}\] Since \(T\) is a finite set, let \[t_{min}=\mbox{argmin}_{t\in T}\; g(t,T)\] and put \(\alpha=g(t_{min},T)\). Then for every \(t\in T\), \[\alpha \leq \displaystyle\frac{R(t)-R(T(t))}{|T(t)|-1} \;\iff\; R_\alpha(t) \geq R_\alpha(T(t))\] \(t_{min}\) is called the weakest link of the tree.
Result: pruning successively at the weakest link produces a decreasing sequence of trees, and the final tree is selected according to its predictive power.
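A minimal sketch (ours) of the weakest-link search, with hypothetical values of \(R(t)\), \(R(T(t))\) and \(|T(t)|\) for three internal nodes:

> g <- function(Rt, RTt, leaves) (Rt - RTt)/(leaves - 1)
> Rt     <- c(0.30, 0.20, 0.15)   # hypothetical node errors R(t)
> RTt    <- c(0.10, 0.12, 0.13)   # hypothetical subtree errors R(T(t))
> leaves <- c(4, 3, 2)            # hypothetical numbers of leaves |T(t)|
> gvals <- g(Rt, RTt, leaves)
> round(gvals, 4)
[1] 0.0667 0.0400 0.0200
> which.min(gvals)   # the weakest link; the next pruning uses alpha = min(gvals)
[1] 3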
In R, with the function rpart, one obtains:
- CP: the values of \(\alpha\), from largest to smallest (\(\alpha\) decreases as the tree grows);
- rel error: the relative error \(R(T)\);
- xerror: the error estimated by cross-validation;
- xstd: the standard deviation of the cross-validated error;
- 1-SE rule of thumb: all trees with xerror \(\leq\) minimal xerror + xstd are considered equivalent; choose the simplest one.

Back to the spam7 data:

> library(rpart)
> library(DAAG)
> data(spam7)
> set.seed(1235)
> spam.rpart <- rpart(formula = yesno~ crl.tot + dollar + bang +
+ money + n000 + make, method="class", data=spam7)
> printcp(spam.rpart)
Classification tree:
rpart(formula = yesno ~ crl.tot + dollar + bang + money + n000 +
make, data = spam7, method = "class")
Variables actually used in tree construction:
[1] bang crl.tot dollar
Root node error: 1813/4601 = 0.39404
n= 4601
CP nsplit rel error xerror xstd
1 0.476558 0 1.00000 1.00000 0.018282
2 0.075565 1 0.52344 0.54275 0.015341
3 0.011583 3 0.37231 0.38720 0.013453
4 0.010480 4 0.36073 0.38776 0.013461
5 0.010000 5 0.35025 0.38555 0.013429

> pred=predict(object = spam.rpart,newdata = spam7,type="class")
> mc<-table(spam7$yesno,pred)
> mc
pred
n y
n 2517 271
y 364 1449

> rowSums(mc)
n y
2788 1813

and the root node error:
> rowSums(mc)[2]/(sum(mc))
y
0.3940448

> err.resub=1-((mc[1,1]+mc[2,2])/sum(mc))
> err.resub
[1] 0.1380135

or
> (mc[2,1]+mc[1,2])/sum(mc)
[1] 0.1380135

It is also equal to the relative error times the root node error:
> 0.35025*0.39404
[1] 0.1380125

We now compute the 10-fold cross-validated error by hand. Sample size:
> n<-nrow(spam7)

Number of blocks:
> K<-10

Draw a random permutation of \(\{1,\ldots,n\}\):
> alea=runif(n)
> rang=rank(alea)
> rang[1:10]
[1] 2352 2013 203 4520 4427 773 1190 4586 3716 1061

Size of each block:
> taille<-n%/%K
> taille
Composition of the blocks:
> block=(rang-1)%/%taille+1
> table(block)
block
1 2 3 4 5 6 7 8 9 10 11
460 460 460 460 460 460 460 460 460 460 1
> block[block==11]=10
> block=as.factor(block)

Computing the cross-validated error:
> err.cv=numeric(0)
> for(k in 1:K){
+ arbre=rpart(yesno~.,data=spam7[block!=k,],method="class")
+ pred=predict(object = arbre,newdata = spam7[block==k,],type="class")
+ mc<-table(spam7$yesno[block==k],pred)
+ err=1-((mc[1,1]+mc[2,2])/sum(mc))
+ err.cv=rbind(err.cv,err)
+ }
> err.cv
[,1]
err 0.1326087
err 0.1434783
err 0.1152174
err 0.1913043
err 0.1521739
err 0.1413043
err 0.1586957
err 0.1608696
err 0.1500000
err 0.1453362
> mean(err.cv)
[1] 0.1490988

Compare with rpart's built-in cross-validation, via the cptable:
> spam.rpart$cptable
CP nsplit rel error xerror xstd
1 0.47655819 0 1.0000000 1.0000000 0.01828190
2 0.07556536 1 0.5234418 0.5427468 0.01534080
3 0.01158301 3 0.3723111 0.3872035 0.01345307
4 0.01047987 4 0.3607281 0.3877551 0.01346092
5 0.01000000 5 0.3502482 0.3855488 0.01342945
> i=which.min(spam.rpart$cptable[,"xerror"])
> threshold=spam.rpart$cptable[i,"xerror"]+spam.rpart$cptable[i,"xstd"]
> j=which(spam.rpart$cptable[,"xerror"]<=threshold)
> spam.rpart$cptable[j,"CP"]
3 4 5
0.01158301 0.01047987 0.01000000
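To actually prune at the simplest tree retained by the 1-SE rule, we can reuse the indices `j` computed above; a minimal sketch (ours, not in the original notes):

> cp.1se <- max(spam.rpart$cptable[j, "CP"])  # largest retained CP = simplest tree
> spam.prune <- prune(spam.rpart, cp = cp.1se)
> plot(spam.prune)
> text(spam.prune)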
And graphically:

> plotcp(spam.rpart)

Example: bodyfat contains records on 71 women, with various body measurements (age, waist and hip circumference, ...) and a response variable \(y\), DEXfat, a measure of body fat.

> library(rpart)
> library(TH.data)
Loading required package: survival
Attaching package: 'survival'
The following object is masked from 'package:DAAG':
lung
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:DAAG':
hills
Attaching package: 'TH.data'
The following object is masked from 'package:MASS':
geyser
> data("bodyfat")
> head(bodyfat)
age DEXfat waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b
47 57 41.68 100.0 112.0 7.1 9.4 4.42 4.95
48 65 43.29 99.5 116.5 6.5 8.9 4.63 5.01
49 59 35.41 96.0 108.5 6.2 8.9 4.12 4.74
50 58 22.79 72.0 96.5 6.1 9.2 4.03 4.48
51 60 36.42 89.5 100.5 7.1 10.0 4.24 4.68
52 61 24.13 83.5 97.0 6.5 8.8 3.55 4.06
anthro3c anthro4
47 4.50 6.13
48 4.48 6.37
49 4.60 5.82
50 3.91 5.66
51 4.15 5.91
52 3.64 5.14

> bodyfat_rpart <- rpart(DEXfat ~ .,data = bodyfat,control = rpart.control(minsplit = 10,cp=0.01))
> printcp(bodyfat_rpart)
Regression tree:
rpart(formula = DEXfat ~ ., data = bodyfat, control = rpart.control(minsplit = 10,
cp = 0.01))
Variables actually used in tree construction:
[1] anthro3c hipcirc kneebreadth waistcirc
Root node error: 8536/71 = 120.23
n= 71
CP nsplit rel error xerror xstd
1 0.662895 0 1.000000 1.02131 0.167487
2 0.093763 1 0.337105 0.41410 0.095259
3 0.083593 2 0.243342 0.44026 0.091325
4 0.045075 3 0.159749 0.34150 0.070391
5 0.029568 4 0.114674 0.28971 0.061322
6 0.013751 5 0.085107 0.27427 0.069508
7 0.010000 6 0.071355 0.26664 0.068680

> library("partykit")
Loading required package: grid
> plot(as.party(bodyfat_rpart), tp_args = list(id = FALSE))

The relative error of the full tree can be recovered from the residuals:

> residuals=bodyfat$DEXfat-predict(bodyfat_rpart,bodyfat)
> SSE=sum(residuals^2)
> SST=sum((bodyfat$DEXfat-mean(bodyfat$DEXfat))^2)
> SSE/SST
[1] 0.07135526

Applying the 1-SE rule:

> opt <- which.min(bodyfat_rpart$cptable[,"xerror"])
> SE1=bodyfat_rpart$cptable[opt,"xerror"]+bodyfat_rpart$cptable[opt,"xstd"]
> SE1
[1] 0.3353165
> which(bodyfat_rpart$cptable[,"xerror"]<=SE1)
5 6 7
5 6 7

We choose the cp at row 5 of the table (4 splits, i.e. 5 leaves).
> cp=bodyfat_rpart$cptable[5,'CP']
> cp
[1] 0.02956768

Pruning the tree:
> bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
> plot(as.party(bodyfat_prune), tp_args = list(id = FALSE))

Observed versus predicted values:

> DEXfat_pred <- predict(bodyfat_prune, newdata = bodyfat)
> xlim <- range(bodyfat$DEXfat)
> ylim <- range(DEXfat_pred)
>
> plot(DEXfat_pred ~ DEXfat, data = bodyfat, xlab = "Observed",
+ ylab = "Predicted", ylim = xlim, xlim = xlim)
> abline(a = 0, b = 1,lwd=2)

We now re-estimate the decision tree 100 times; since rpart's internal cross-validation is random, the selected complexity parameter varies from run to run.
We write an R function that returns the optimal number of splits, the cp and the xerror according to the 1-SE rule, given the cptable of a fitted tree.
> compt_cps=function(cp){
+ opt <- which.min(cp[,"xerror"])
+ SE1=cp[opt,"xerror"]+cp[opt,"xstd"]
+ i=which(cp[,"xerror"]<=SE1)[1]
+ return(list(CP=cp[i,1],nsplit=cp[i,2],xerror=cp[i,4]))
+ }

A function that refits the tree and extracts nsplit, cp and xerror:

> execute=function(){
+ cp <- bodyfat_rpart <- rpart(DEXfat ~ .,data = bodyfat,control = rpart.control(minsplit = 10,cp=0.01))$cptable
+ compt_cps(cp)
+ }
> res=replicate(100,execute())

The selected cp values:

> summary(unlist(res[1,]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01000 0.02957 0.02957 0.02853 0.02957 0.04508

The number of splits:
> table(unlist(res[2,]))
3 4 5 6
18 58 23 1

The `xerror`:
> summary(unlist(res[3,]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1760 0.2651 0.2908 0.2881 0.3124 0.3721

Exercise: complete the following R code and draw the estimated decision tree.
> library(DAAG)
> library(rpart)
> data(mifem)
> summary(mifem)
outcome age yronset premi smstat diabetes
live:974 Min. :35.00 Min. :85.00 y :311 c :390 y :248
dead:321 1st Qu.:57.00 1st Qu.:87.00 n :928 x :280 n :978
Median :63.00 Median :89.00 nk: 56 n :522 nk: 69
Mean :60.92 Mean :88.79 nk:103
3rd Qu.:66.00 3rd Qu.:91.00
Max. :69.00 Max. :93.00
highbp hichol angina stroke
y :813 y :452 y :472 y : 153
n :406 n :655 n :724 n :1063
nk: 76 nk:188 nk: 99 nk: 79
> mifem.rpart <- rpart(outcome ~ yronset + premi + smstat + highbp, method="class",data = mifem, cp = 0.004)
> printcp(mifem.rpart)
Classification tree:
rpart(formula = outcome ~ yronset + premi + smstat + highbp,data = mifem, method = "class", cp = 0.004)
Variables actually used in tree construction:
[1] ......... ........ smstat yronset
Root node error: ........ = ........
n= 1295
CP nsplit rel error xerror xstd
1 0.1869159 0 1.00000 1.00000 0.048405
2 0.0046729 1 0.81308 0.72308 0.044972
3 ......... 5 0.79439 0.65801 0.045479
> print(mifem.rpart)
n= 1295
node), split, n, loss, yval, (yprob)
      * denotes terminal node
1) root 1295 321 ........ (0.7521236 .......)
2) highbp=y,n 1219 ..253.. live (0.7924528 0.2075472)
4) smstat=c,x,n 1169 229 live (0.8041061 0.1958939) *
5) smstat=.nk.. 50 24 ....... (0.5200000 0.4800000)
10) yronset>=92.5 10 3 live (0.7000000 0.3000000) *
11) yronset< 92.5 40 19 dead (0.4750000 0.5250000)
22) yronset< 91.5 33 16 live (0.5151515 0.4848485)
44) premi=n 22 9 live (0.5909091 0.4090909) *
45) premi=y,nk 11 4 dead (0.3636364 0.6363636) *
23) yronset>=91.5 7 2 dead (0.2857143 0.7142857) *
3) highbp=nk ...... 8 dead (0.1052632 0.8947368) *

> library(MASS)
> library(rpart)

rock: data on twelve core samples from petroleum reservoirs, each sampled at 4 cross-sections. Permeability was measured on each core sample, and for each cross-section the total pore area, the total pore perimeter and the shape were recorded.

> data(rock)
> summary(rock)
area peri shape perm
Min. : 1016 Min. : 308.6 Min. :0.09033 Min. : 6.30
1st Qu.: 5305 1st Qu.:1414.9 1st Qu.:0.16226 1st Qu.: 76.45
Median : 7487 Median :2536.2 Median :0.19886 Median : 130.50
Mean : 7188 Mean :2682.2 Mean :0.21811 Mean : 415.45
3rd Qu.: 8870 3rd Qu.:3989.5 3rd Qu.:0.26267 3rd Qu.: 777.50
Max. :12212 Max. :4864.2 Max. :0.46413 Max. :1300.00

The goal is to explain the permeability (the variable perm) as a function of the other variables.
We model log(perm):

> rpart.fit1 <- rpart(log(perm) ~ ., data=rock, method="anova")
> printcp(rpart.fit1)
Regression tree:
rpart(formula = log(perm) ~ ., data = rock, method = "anova")
Variables actually used in tree construction:
[1] area peri
Root node error: 126.93/48 = 2.6444
n= 48
CP nsplit rel error xerror xstd
1 0.630569 0 1.00000 1.04026 0.165910
2 0.106765 1 0.36943 0.39957 0.071683
3 0.031791 2 0.26267 0.31573 0.071265
4 0.010000 3 0.23087 0.31851 0.065237

The root node error:
> sum((log(rock$perm) - mean(log(rock$perm)))^2)
[1] 126.9322
> n <- dim(rock)[1]
> sum((log(rock$perm) - mean(log(rock$perm)))^2)/n
[1] 2.64442

The error after one split:
> rpart.fit1a=prune(rpart.fit1,cp=0.10677)
> printcp(rpart.fit1a)
Regression tree:
rpart(formula = log(perm) ~ ., data = rock, method = "anova")
Variables actually used in tree construction:
[1] peri
Root node error: 126.93/48 = 2.6444
n= 48
CP nsplit rel error xerror xstd
1 0.63057 0 1.00000 1.04026 0.165910
2 0.10677 1 0.36943 0.39957 0.071683
> rpart.fit1a
n= 48
node), split, n, deviance, yval
* denotes terminal node
1) root 48 126.93220 5.107463
2) peri>=2536.195 24 30.01282 3.816150 *
3) peri< 2536.195 24 16.87989 6.398775 *
> unique(predict(rpart.fit1a))
[1] 3.816150 6.398775

The prediction for one branch:
> x1=log(rock$perm)[rock$peri>=2536.195]
> mean(x1)
[1] 3.81615

And for the other branch:
> x2=log(rock$perm)[rock$peri<2536.195]
> mean(x2)
[1] 6.398775

Computing the error:
> sum((x1-mean(x1))^2)
[1] 30.01282
> sum((x1-mean(x1))^2)/length(x1)
[1] 1.250534
> sum((x2-mean(x2))^2)
[1] 16.87989
> sum((x2-mean(x2))^2)/length(x2)
[1] 0.7033287
> SSE=sum((x1-mean(x1))^2)+sum((x2-mean(x2))^2)
> SST=sum((log(rock$perm) - mean(log(rock$perm)))^2)
> relative.error=SSE/SST
> relative.error
[1] 0.3694313

Refitting the same model (the xerror column changes because the internal cross-validation is random):

> rpart.fit1 <- rpart(log(perm) ~ ., data=rock, method="anova")
> printcp(rpart.fit1)
Regression tree:
rpart(formula = log(perm) ~ ., data = rock, method = "anova")
Variables actually used in tree construction:
[1] area peri
Root node error: 126.93/48 = 2.6444
n= 48
CP nsplit rel error xerror xstd
1 0.630569 0 1.00000 1.02309 0.16312
2 0.106765 1 0.36943 0.52345 0.16500
3 0.031791 2 0.26267 0.49240 0.16938
4 0.010000 3 0.23087 0.46660 0.14145

Let us re-estimate the model after centering and scaling the explanatory variables.
> X <- rock[,-4]
> X <- sweep(X,2,apply(X,2,mean))
> X <- sweep(X,2,apply(X,2,sd),"/")
> summary(X)
area peri shape
Min. :-2.2996 Min. :-1.6579 Min. :-1.5304
1st Qu.:-0.7014 1st Qu.:-0.8852 1st Qu.:-0.6689
Median : 0.1115 Median :-0.1020 Median :-0.2305
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6266 3rd Qu.: 0.9131 3rd Qu.: 0.5337
Max. : 1.8720 Max. : 1.5241 Max. : 2.9464
> apply(X,2,sd)
area peri shape
1 1 1
> X <- cbind(X,logperm=log(rock$perm))

Estimating the second tree:
> rpart.fit2 <- rpart(logperm ~ ., data=X, method="anova")

Plotting the two trees:
> library("partykit")
> par(mfrow=c(1,2))
> plot(as.party(rpart.fit1), tp_args = list(id = FALSE),main="Sans transformation")
> plot(as.party(rpart.fit2), tp_args = list(id = FALSE),main="Avec transformation",add=T)

The same relative errors (standardizing the predictors does not change the splits):
> printcp(rpart.fit1)
Regression tree:
rpart(formula = log(perm) ~ ., data = rock, method = "anova")
Variables actually used in tree construction:
[1] area peri
Root node error: 126.93/48 = 2.6444
n= 48
CP nsplit rel error xerror xstd
1 0.630569 0 1.00000 1.02309 0.16312
2 0.106765 1 0.36943 0.52345 0.16500
3 0.031791 2 0.26267 0.49240 0.16938
4 0.010000 3 0.23087 0.46660 0.14145
> printcp(rpart.fit2)
Regression tree:
rpart(formula = logperm ~ ., data = X, method = "anova")
Variables actually used in tree construction:
[1] area peri
Root node error: 126.93/48 = 2.6444
n= 48
CP nsplit rel error xerror xstd
1 0.630569 0 1.00000 1.05908 0.164460
2 0.106765 1 0.36943 0.39854 0.071869
3 0.031791 2 0.26267 0.31130 0.067921
4 0.010000 3 0.23087 0.30447 0.060382

rpart partitions the data:

> rpart.fit1
n= 48
node), split, n, deviance, yval
* denotes terminal node
1) root 48 126.932200 5.107463
2) peri>=2536.195 24 30.012820 3.816150
4) area< 8221 9 12.232200 2.846043 *
5) area>=8221 15 4.228663 4.398214 *
3) peri< 2536.195 24 16.879890 6.398775
6) area< 4544 7 7.433435 5.759762 *
7) area>=4544 17 5.411119 6.661898 *

We can draw the corresponding partition of the (peri, area) plane:

> par(mfrow=c(1,1))
> attach(rock)
> plot(peri, area)
> abline(v=2536.195)
> lines(c(2536.195,5000),c(8221,8221))
> lines(c(0,2536.195),c(4544,4544))
> text(4000,4000,"n=9\nybar=2.85")
> text(3000,11000,"n=15\nybar=4.40")
> text(1500,2000,"n=7\nybar=5.76")
> text(1500,11000,"n=17\nybar=6.66")
> detach(rock)

We now compare the tree with other regression methods, using leave-one-out cross-validation:

- gam: generalized additive models;
- ppr: projection pursuit regression;
- lm: the linear model.

> library(gam)
Loading required package: splines
Loading required package: foreach
Loaded gam 1.12
> n <- dim(X)[1]
> err <- matrix(NA,nrow=5,ncol=n)
> dimnames(err) <- list(c("lm","gam(shape)","gam(all)","rpart","ppreg"),NULL)
>
> for (i in 1:n) {
+ XX <- X[-i,]
+ X.pred <- X[i,]
+ resp <- X$logperm[i]
+ fit <- lm(logperm ~ .,data=XX)
+ err[1,i] <- resp - predict(fit,X.pred)
+ fit <- gam(logperm ~ area + peri + s(shape), data=XX)
+ err[2,i] <- resp - predict(fit,X.pred)
+ fit <- gam(logperm ~ s(area) + s(peri) + s(shape), data=XX)
+ err[3,i] <- resp - predict(fit,X.pred)
+ fit <- rpart(logperm ~ ., data=XX)
+ err[4,i] <- resp - predict(fit,X.pred)
+ fit <- ppr(logperm ~ area + peri + shape,
+ data = XX,nterms = 3, max.terms = 5)
+ err[5,i] <- resp - predict(fit,X.pred)
+ }
>
> round(apply(err^2,1,sum)/n,2)
lm gam(shape) gam(all) rpart ppreg
0.86 0.93 1.21 1.01 0.87
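These are the mean squared leave-one-out prediction errors. As a rough complement (ours), a standard error for each method helps judge whether the differences are meaningful:

> mse <- apply(err^2, 1, mean)
> se  <- apply(err^2, 1, sd)/sqrt(n)   # standard error of each mean squared error
> round(rbind(mse, se), 2)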
Exercise: complete the following R code.

> colnames(dt2)
[1] "salary" "age" "score" "gender" "education" "region"
[7] "activite" "matrimonial" ............. "age_c" "ln.salary"
> dim(dt2)
[1] .......... ........
> library(plyr)
> length(levels(dt2$age_c))
[1] 4
> dt2$age_c=mapvalues(dt2$age_c,from=levels(dt2$age_c),to=c("[18,30)","[30,50)",
+ "50 and more",...........))
> xtabs(~dt2$age_c)
dt2$age_c
[18,30) [30,50) 50 and more
9724 16326 .....
> contdata <- ddply(dt2, c("age_c","gender","privepublic"), summarise,
+ N = sum(!is.na(salary)),
+ N.na = sum(is.na(salary)==T),
+ mean = mean(salary, na.rm=TRUE),
+ sd = sd(salary, na.rm=TRUE),
+ se = sd / sqrt(N)
+ )
> contdata
age_c gender privepublic N N.na mean sd se
1 [18,30) MASCULIN Public 1530 0 385.0157 202.4247 5.175089
2 [18,30) MASCULIN Private 3365 0 264.2663 180.1593 3.105733
3 [18,30) FEMININ Public 1009 0 453.3845 216.7245 6.822796
4 [18,30) FEMININ Private 3820 0 210.4243 150.8134 2.440103
5 [30,50) MASCULIN Public 6909 0 454.3584 229.5869 2.762101
6 [30,50) MASCULIN Private 4350 0 343.8248 218.1429 3.307474
7 [30,50) FEMININ Public 2783 0 481.4344 222.0035 4.208266
8 [30,50) FEMININ Private 2284 0 247.0394 143.3684 2.999890
9 50 and more MASCULIN Public 2376 0 445.0825 273.9328 5.619800
10 50 and more MASCULIN Private 872 0 346.3372 293.3663 9.934636
11 50 and more FEMININ Public 456 0 466.7741 229.5051 10.747562
12 50 and more FEMININ Private 143 0 289.9021 302.1567 25.267614
> xtabs(~privepublic+age_c,data=dt2)
age_c
privepublic [18,30) [30,50) 50 and more
Public 2539 .... ....
Private 7185 6634 1015
> xtabs(~privepublic+gender,data=dt2)
gender
privepublic MASCULIN FEMININ
Public 10815 ....
Private .... 6247
> sum(!is.na(dt2$salary))
[1] 29897
> dt2$med.salary=1*(dt2$salary>median(dt2$salary))
> dt2$med.salary=as.factor(dt2$med.salary)
> xtabs(~dt2$med.salary)
dt2$med.salary
0 1
..... 14182
> library(.......)
> set.seed(1235)
> dt2.rpart <- rpart(formula = ..................................
+ ,method="........", data=dt2)
> printcp(dt2.rpart)
Classification tree:
rpart(formula = med.salary ~ age_c + gender + education + privepublic,data = dt2,
method = ".......")
Variables actually used in tree construction:
[1] education privepublic
Root node error: 14182/....... = 0.47436
n= 29897
CP nsplit rel error xerror xstd
1 0.476449 0 1.00000 1.00000 0.0060880
2 0.109011 1 0.52355 0.52355 0.0052677
3 0.037935 2 0.41454 0.41454 0.0048458
4 0.010000 3 0.37660 0.37660 0.0046702
> pred=predict(object = dt2.rpart,newdata = dt2,type="class")
> xtabs(~pred)
pred
0 1
16946 .....
> mc<-table(dt2$med.salary,pred)
> mc
pred
0 1
0 13660 ....
1 3286 10896
> rowSums(mc)[2]/(sum(mc))
1
........
> err.resub=1-((mc[1,1]+mc[2,2])/sum(mc))
> err.resub
[1] ........
> dt2.rpart
n= .......
node), split, n, loss, yval, (yprob)
      * denotes terminal node
1) root 29897 14182 0 (0.5256380 ........)
2) privepublic=Private 14834 3272 0 (......... 0.2205744)
4) education=NEANT,PRIMAIRE ASSIMIL,SECONDAIRE 13684 2428 0 (0.8225665 0.1774335) *
5) education=SUPER. 1150 306 1 (0.2660870 0.7339130) *
3) privepublic=......... 15063 4153 1 (0.2757087 0.7242913)
6) education=NEANT,PRIMAIRE ASSIMIL 3262 858 ..... (0.7369712 0.2630288) *
7) education=SECONDAIRE,SUPER. 11801 1749 ..... (0.1482078 0.8517922) *
How do we “split” a node \(t\)?

Let \(x_j\) be a variable we want to use to split the sub-sample in \(t\) into two child nodes \(t_L\) and \(t_R\).

For example, if \(x_j\) is quantitative, we look for a real number \(a\) in the range of values of \(x_j\) such that \[t_L(a)=\{w\in t \mbox{ such that } x_j(w)\leq a\}\;\mbox{ and }\;t_R(a)=\{w\in t \mbox{ such that } x_j(w)> a\}\] \(a\) is chosen as the value maximizing the function \[x\longmapsto \mbox{Imp}(t)-\pi_L\, \mbox{Imp}(t_L(x))-\pi_R\, \mbox{Imp}(t_R(x))\] where \[\pi_L=\displaystyle\frac{N(t_L(x))}{N(t)}\;\mbox{ and }\;\pi_R=\displaystyle\frac{N(t_R(x))}{N(t)}\]
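A minimal R sketch (ours; `best.split` is a hypothetical helper, not a function of rpart or tree) of this exhaustive search for a quantitative variable, using the Gini impurity:

> gini <- function(y) { p <- table(y)/length(y); 1 - sum(p^2) }
> best.split <- function(x, y) {
+   cand <- sort(unique(x))
+   cand <- (cand[-1] + cand[-length(cand)])/2   # midpoints between observed values
+   gain <- sapply(cand, function(a) {
+     L <- y[x <= a]; R <- y[x > a]
+     gini(y) - length(L)/length(y)*gini(L) - length(R)/length(y)*gini(R)
+   })
+   list(a = cand[which.max(gain)], gain = max(gain))
+ }
> best.split(iris$Petal.Length, iris$Species)   # recovers the classical split at 2.45
$a
[1] 2.45

$gain
[1] 0.3333333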