Micro Projeto em R de Machine Learning, nele usaremos decisions trees e depois floresta aleatória para classificar escolas como Públicas ou Privadas baseadas em suas características. Será utilizado o dataframe College que vem com a livraria ISLR.
Carregando as livrarias:
install.packages("ggplot2","ISLR","dplyr","caToolds","rpart","rpart.plot","randomForest")
library(ggplot2)
library(ISLR)
library(dplyr)
library(caTools)
library(rpart)
Montando e exibindo o dataframe:
df <- College
head(College)
Análise Exploratória de Dados
Exploração de dados básica antes de começarmos a construção dos modelos.
Scatterplot da taxa de graduação vs custos de sala e inscrição:
ggplot(df,aes(Room.Board,Grad.Rate)) + geom_point(aes(color=Private))

Faz sentido ver que os colégios privados tem custos mais elevados.
Agora um histograma mostrando os estudantes de tempo integral colorido em privado ou não:
ggplot(df,aes(F.Undergrad)) + geom_histogram(aes(fill=Private),color="black",bins=50)

Histograma da taxa de graduação, colorida se privada ou não:
ggplot(df,aes(Grad.Rate)) + geom_histogram(aes(fill=Private),color="black",bins=50)

Tem um colégio com a taxa de graduação acima de 100%. Vamos encontrá-lo e fixa-lo.
subset(df,Grad.Rate>100)
Arrumando:
df['Cazenovia College','Grad.Rate'] <- 100
Divisão de treino e teste:
set.seed(101)
sample = sample.split(df$Private,SplitRatio=0.7)
train = subset(df,sample==T)
test = subset(df,sample==F)
Árvore de Decisão
Construindo o modelo e fazendo predições:
tree <- rpart(Private ~ . ,method="class",data=train)
tree.preds <- predict(tree,test)
head(tree.preds)
No Yes
Adrian College 0.003311258 0.9966887
Alfred University 0.003311258 0.9966887
Allegheny College 0.003311258 0.9966887
Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887
Alma College 0.003311258 0.9966887
Amherst College 0.003311258 0.9966887
Criando uma coluna nomeada “Private”, com as variáveis Sim/Não para indicar se o colégio é público ou não, para combinar com o dataframe original, assim podemos comparar os resultados facilmente:
tree.preds <- as.data.frame(tree.preds)
tree.preds$Private <- ifelse(tree.preds$Yes > 0.5,"Sim","Não")
Checando a Matriz de confusão:
table(tree.preds$Private,test$Private)
No Yes
Não 57 9
Sim 7 160
Visualizando a árvore:
library(rpart.plot)
prp(tree)

Floresta Aleatória
Carregando a livraria:
library(randomForest)
Construindo o modelo:
rf.model <- randomForest(Private ~ ., data=train,importance=TRUE)
Fazendo as predições:
p <- predict(rf.model,test)
table(p,test$Private)
p No Yes
No 57 6
Yes 7 163
Nós vemos que o modelo performou melhor que a árvore de decisão.
Projeto baseado no repositório de Sajal Sharma, elaborado em 11/12/2016. Link: https://rpubs.com/sajal_sharma/micro_dt_rf
Dados contidos na livraria ISLR, utilizados apenas para fins educacionais.
Traduzido e formatado por Alexsander Santos Psendziuk em 2020. Contato: alexpsendziuk@gmail.com
LS0tCm91dHB1dDoKICBodG1sX25vdGVib29rOiBkZWZhdWx0CiAgaHRtbF9kb2N1bWVudDogZGVmYXVsdAotLS0KTWljcm8gUHJvamV0byBlbSBSIGRlIE1hY2hpbmUgTGVhcm5pbmcsIG5lbGUgdXNhcmVtb3MgZGVjaXNpb25zIHRyZWVzIGUgZGVwb2lzIGZsb3Jlc3RhIGFsZWF0w7NyaWEgcGFyYSBjbGFzc2lmaWNhciBlc2NvbGFzIGNvbW8gUMO6YmxpY2FzIG91IFByaXZhZGFzIGJhc2VhZGFzIGVtIHN1YXMgY2FyYWN0ZXLDrXN0aWNhcy4gU2Vyw6EgdXRpbGl6YWRvIG8gZGF0YWZyYW1lIENvbGxlZ2UgcXVlIHZlbSBjb20gYSBsaXZyYXJpYSBJU0xSLgoKQ2FycmVnYW5kbyBhcyBsaXZyYXJpYXM6CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9Cmluc3RhbGwucGFja2FnZXMoImdncGxvdDIiLCJJU0xSIiwiZHBseXIiLCJjYVRvb2xkcyIsInJwYXJ0IiwicnBhcnQucGxvdCIsInJhbmRvbUZvcmVzdCIpCmxpYnJhcnkoZ2dwbG90MikKbGlicmFyeShJU0xSKQpsaWJyYXJ5KGRwbHlyKQpsaWJyYXJ5KGNhVG9vbHMpCmxpYnJhcnkocnBhcnQpCmBgYAoKTW9udGFuZG8gZSBleGliaW5kbyBvIGRhdGFmcmFtZToKYGBge3IgZmlnLmhlaWdodD04LCBmaWcud2lkdGg9OH0KZGYgPC0gQ29sbGVnZQpoZWFkKENvbGxlZ2UpCmBgYAoKQW7DoWxpc2UgRXhwbG9yYXTDs3JpYSBkZSBEYWRvcwoKRXhwbG9yYcOnw6NvIGRlIGRhZG9zIGLDoXNpY2EgYW50ZXMgZGUgY29tZcOnYXJtb3MgYSBjb25zdHJ1w6fDo28gZG9zIG1vZGVsb3MuCgpTY2F0dGVycGxvdCBkYSB0YXhhIGRlIGdyYWR1YcOnw6NvIHZzIGN1c3RvcyBkZSBzYWxhIGUgaW5zY3Jpw6fDo286CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CmdncGxvdChkZixhZXMoUm9vbS5Cb2FyZCxHcmFkLlJhdGUpKSArIGdlb21fcG9pbnQoYWVzKGNvbG9yPVByaXZhdGUpKQpgYGAKCkZheiBzZW50aWRvIHZlciBxdWUgb3MgY29sw6lnaW9zIHByaXZhZG9zIHRlbSBjdXN0b3MgbWFpcyBlbGV2YWRvcy4KCkFnb3JhIHVtIGhpc3RvZ3JhbWEgbW9zdHJhbmRvIG9zIGVzdHVkYW50ZXMgZGUgdGVtcG8gaW50ZWdyYWwgY29sb3JpZG8gZW0gcHJpdmFkbyBvdSBuw6NvOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpnZ3Bsb3QoZGYsYWVzKEYuVW5kZXJncmFkKSkgKyBnZW9tX2hpc3RvZ3JhbShhZXMoZmlsbD1Qcml2YXRlKSxjb2xvcj0iYmxhY2siLGJpbnM9NTApCmBgYAoKSGlzdG9ncmFtYSBkYSB0YXhhIGRlIGdyYWR1YcOnw6NvLCBjb2xvcmlkYSBzZSBwcml2YWRhIG91ICBuw6NvOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpnZ3Bsb3QoZGYsYWVzKEdyYWQuUmF0ZSkpICsgZ2VvbV9oaXN0b2dyYW0oYWVzKGZpbGw9UHJpdmF0ZSksY29sb3I9ImJsYWNrIixiaW5zPTUwKQpgYGAKClRlbSB1bSBjb2zDqWdpbyBjb20gYSB0YXhhIGRlIGdyYWR1YcOnw6NvIGFjaW1hIGRlIDEwMCUuIFZhbW9zIGVuY29udHLDoS1sbyBlIGZpeGEtbG8uCmBgYHtyfQpzdWJzZXQoZGYsR3JhZC5SYXRlPjEwMCkKYGBgCgpBcnJ1bWFuZG86CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CmRmWydDYXplbm92aWEgQ29sbGVnZScsJ0dyYWQuUmF0ZSddIDwtIDEwMApgYGAKCkRpdmlzw6NvIGRlIHRyZWlubyBlIHRlc3RlOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpzZXQuc2VlZCgxMDEpCgpzYW1wbGUgPSBzYW1wbGUuc3BsaXQoZGYkUHJpdmF0ZSxTcGxpdFJhdGlvPTAuNykKdHJhaW4gPSBzdWJzZXQoZGYsc2FtcGxlPT1UKQp0ZXN0ID0gc3Vic2V0KGRmLHNhbXBsZT09RikKYGBgCgrDgXJ2b3JlIGRlIERlY2lzw6NvCgpDb25zdHJ1aW5kbyBvIG1vZGVsbyBlIGZhemVuZG8gcHJlZGnDp8O1ZXM6CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CnRyZWUgPC0gcnBhcnQoUHJpdmF0ZSB+IC4gLG1ldGhvZD0iY2xhc3MiLGRhdGE9dHJhaW4pCnRyZWUucHJlZHMgPC0gcHJlZGljdCh0cmVlLHRlc3QpCmhlYWQodHJlZS5wcmVkcykKYGBgCgpDcmlhbmRvIHVtYSBjb2x1bmEgbm9tZWFkYSDigJxQcml2YXRl4oCdLCBjb20gYXMgdmFyacOhdmVpcyBTaW0vTsOjbyBwYXJhIGluZGljYXIgc2UgbyBjb2zDqWdpbyDDqSBww7pibGljbyBvdSBuw6NvLCBwYXJhIGNvbWJpbmFyIGNvbSBvIGRhdGFmcmFtZSBvcmlnaW5hbCwgYXNzaW0gcG9kZW1vcyBjb21wYXJhciBvcyByZXN1bHRhZG9zIGZhY2lsbWVudGU6CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CnRyZWUucHJlZHMgPC0gYXMuZGF0YS5mcmFtZSh0cmVlLnByZWRzKQp0cmVlLnByZWRzJFByaXZhdGUgPC0gaWZlbHNlKHRyZWUucHJlZHMkWWVzID4gMC41LCJTaW0iLCJOw6NvIikKYGBgCgpDaGVjYW5kbyBhIE1hdHJpeiBkZSBjb25mdXPDo286CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CnRhYmxlKHRyZWUucHJlZHMkUHJpdmF0ZSx0ZXN0JFByaXZhdGUpCmBgYAoKVmlzdWFsaXphbmRvIGEgw6Fydm9yZToKYGBge3J9CmxpYnJhcnkocnBhcnQucGxvdCkKcHJwKHRyZWUpCmBgYAoKRmxvcmVzdGEgQWxlYXTDs3JpYQoKQ2FycmVnYW5kbyBhIGxpdnJhcmlhOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpsaWJyYXJ5KHJhbmRvbUZvcmVzdCkKYGBgCgpDb25zdHJ1aW5kbyBvIG1vZGVsbzoKYGBge3IgZmlnLmhlaWdodD04LCBmaWcud2lkdGg9OH0KcmYubW9kZWwgPC0gcmFuZG9tRm9yZXN0KFByaXZhdGUgfiAuLCBkYXRhPXRyYWluLGltcG9ydGFuY2U9VFJVRSkKYGBgCgpGYXplbmRvIGFzIHByZWRpw6fDtWVzOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpwIDwtIHByZWRpY3QocmYubW9kZWwsdGVzdCkKdGFibGUocCx0ZXN0JFByaXZhdGUpCmBgYAoKTsOzcyB2ZW1vcyBxdWUgbyBtb2RlbG8gcGVyZm9ybW91IG1lbGhvciBxdWUgYSDDoXJ2b3JlIGRlIGRlY2lzw6NvLgoKUHJvamV0byBiYXNlYWRvIG5vIHJlcG9zaXTDs3JpbyBkZSBTYWphbCBTaGFybWEsIGVsYWJvcmFkbyBlbSAxMS8xMi8yMDE2LiBMaW5rOiBodHRwczovL3JwdWJzLmNvbS9zYWphbF9zaGFybWEvbWljcm9fZHRfcmYKCkRhZG9zIGNvbnRpZG9zIG5hIGxpdnJhcmlhIElTTFIsIHV0aWxpemFkb3MgYXBlbmFzIHBhcmEgZmlucyBlZHVjYWNpb25haXMuCgpUcmFkdXppZG8gZSBmb3JtYXRhZG8gcG9yIEFsZXhzYW5kZXIgU2FudG9zIFBzZW5keml1ayBlbSAyMDIwLiBDb250YXRvOiBhbGV4cHNlbmR6aXVrQGdtYWlsLmNvbQo=