Micro Projeto em R de Machine Learning, nele usaremos decisions trees e depois floresta aleatória para classificar escolas como Públicas ou Privadas baseadas em suas características. Será utilizado o dataframe College que vem com a livraria ISLR.

Carregando as livrarias:

install.packages("ggplot2","ISLR","dplyr","caToolds","rpart","rpart.plot","randomForest")
library(ggplot2)
library(ISLR)
library(dplyr)
library(caTools)
library(rpart)

Montando e exibindo o dataframe:

df <- College
head(College)

Análise Exploratória de Dados

Exploração de dados básica antes de começarmos a construção dos modelos.

Scatterplot da taxa de graduação vs custos de sala e inscrição:

ggplot(df,aes(Room.Board,Grad.Rate)) + geom_point(aes(color=Private))

Faz sentido ver que os colégios privados tem custos mais elevados.

Agora um histograma mostrando os estudantes de tempo integral colorido em privado ou não:

ggplot(df,aes(F.Undergrad)) + geom_histogram(aes(fill=Private),color="black",bins=50)

Histograma da taxa de graduação, colorida se privada ou não:

ggplot(df,aes(Grad.Rate)) + geom_histogram(aes(fill=Private),color="black",bins=50)

Tem um colégio com a taxa de graduação acima de 100%. Vamos encontrá-lo e fixa-lo.

subset(df,Grad.Rate>100)

Arrumando:

df['Cazenovia College','Grad.Rate'] <- 100

Divisão de treino e teste:

set.seed(101)

sample = sample.split(df$Private,SplitRatio=0.7)
train = subset(df,sample==T)
test = subset(df,sample==F)

Árvore de Decisão

Construindo o modelo e fazendo predições:

tree <- rpart(Private ~ . ,method="class",data=train)
tree.preds <- predict(tree,test)
head(tree.preds)
                                                 No       Yes
Adrian College                          0.003311258 0.9966887
Alfred University                       0.003311258 0.9966887
Allegheny College                       0.003311258 0.9966887
Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887
Alma College                            0.003311258 0.9966887
Amherst College                         0.003311258 0.9966887

Criando uma coluna nomeada “Private”, com as variáveis Sim/Não para indicar se o colégio é público ou não, para combinar com o dataframe original, assim podemos comparar os resultados facilmente:

tree.preds <- as.data.frame(tree.preds)
tree.preds$Private <- ifelse(tree.preds$Yes > 0.5,"Sim","Não")

Checando a Matriz de confusão:

table(tree.preds$Private,test$Private)
     
       No Yes
  Não  57   9
  Sim   7 160

Visualizando a árvore:

library(rpart.plot)
prp(tree)

Floresta Aleatória

Carregando a livraria:

library(randomForest)

Construindo o modelo:

rf.model <- randomForest(Private ~ ., data=train,importance=TRUE)

Fazendo as predições:

p <- predict(rf.model,test)
table(p,test$Private)
     
p      No Yes
  No   57   6
  Yes   7 163

Nós vemos que o modelo performou melhor que a árvore de decisão.

Projeto baseado no repositório de Sajal Sharma, elaborado em 11/12/2016. Link: https://rpubs.com/sajal_sharma/micro_dt_rf

Dados contidos na livraria ISLR, utilizados apenas para fins educacionais.

Traduzido e formatado por Alexsander Santos Psendziuk em 2020. Contato:

LS0tCm91dHB1dDoKICBodG1sX25vdGVib29rOiBkZWZhdWx0CiAgaHRtbF9kb2N1bWVudDogZGVmYXVsdAotLS0KTWljcm8gUHJvamV0byBlbSBSIGRlIE1hY2hpbmUgTGVhcm5pbmcsIG5lbGUgdXNhcmVtb3MgZGVjaXNpb25zIHRyZWVzIGUgZGVwb2lzIGZsb3Jlc3RhIGFsZWF0w7NyaWEgcGFyYSBjbGFzc2lmaWNhciBlc2NvbGFzIGNvbW8gUMO6YmxpY2FzIG91IFByaXZhZGFzIGJhc2VhZGFzIGVtIHN1YXMgY2FyYWN0ZXLDrXN0aWNhcy4gU2Vyw6EgdXRpbGl6YWRvIG8gZGF0YWZyYW1lIENvbGxlZ2UgcXVlIHZlbSBjb20gYSBsaXZyYXJpYSBJU0xSLgoKQ2FycmVnYW5kbyBhcyBsaXZyYXJpYXM6CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9Cmluc3RhbGwucGFja2FnZXMoImdncGxvdDIiLCJJU0xSIiwiZHBseXIiLCJjYVRvb2xkcyIsInJwYXJ0IiwicnBhcnQucGxvdCIsInJhbmRvbUZvcmVzdCIpCmxpYnJhcnkoZ2dwbG90MikKbGlicmFyeShJU0xSKQpsaWJyYXJ5KGRwbHlyKQpsaWJyYXJ5KGNhVG9vbHMpCmxpYnJhcnkocnBhcnQpCmBgYAoKTW9udGFuZG8gZSBleGliaW5kbyBvIGRhdGFmcmFtZToKYGBge3IgZmlnLmhlaWdodD04LCBmaWcud2lkdGg9OH0KZGYgPC0gQ29sbGVnZQpoZWFkKENvbGxlZ2UpCmBgYAoKQW7DoWxpc2UgRXhwbG9yYXTDs3JpYSBkZSBEYWRvcwoKRXhwbG9yYcOnw6NvIGRlIGRhZG9zIGLDoXNpY2EgYW50ZXMgZGUgY29tZcOnYXJtb3MgYSBjb25zdHJ1w6fDo28gZG9zIG1vZGVsb3MuCgpTY2F0dGVycGxvdCBkYSB0YXhhIGRlIGdyYWR1YcOnw6NvIHZzIGN1c3RvcyBkZSBzYWxhIGUgaW5zY3Jpw6fDo286CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CmdncGxvdChkZixhZXMoUm9vbS5Cb2FyZCxHcmFkLlJhdGUpKSArIGdlb21fcG9pbnQoYWVzKGNvbG9yPVByaXZhdGUpKQpgYGAKCkZheiBzZW50aWRvIHZlciBxdWUgb3MgY29sw6lnaW9zIHByaXZhZG9zIHRlbSBjdXN0b3MgbWFpcyBlbGV2YWRvcy4KCkFnb3JhIHVtIGhpc3RvZ3JhbWEgbW9zdHJhbmRvIG9zIGVzdHVkYW50ZXMgZGUgdGVtcG8gaW50ZWdyYWwgY29sb3JpZG8gZW0gcHJpdmFkbyBvdSBuw6NvOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpnZ3Bsb3QoZGYsYWVzKEYuVW5kZXJncmFkKSkgKyBnZW9tX2hpc3RvZ3JhbShhZXMoZmlsbD1Qcml2YXRlKSxjb2xvcj0iYmxhY2siLGJpbnM9NTApCmBgYAoKSGlzdG9ncmFtYSBkYSB0YXhhIGRlIGdyYWR1YcOnw6NvLCBjb2xvcmlkYSBzZSBwcml2YWRhIG91ICBuw6NvOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpnZ3Bsb3QoZGYsYWVzKEdyYWQuUmF0ZSkpICsgZ2VvbV9oaXN0b2dyYW0oYWVzKGZpbGw9UHJpdmF0ZSksY29sb3I9ImJsYWNrIixiaW5zPTUwKQpgYGAKClRlbSB1bSBjb2zDqWdpbyBjb20gYSB0YXhhIGRlIGdyYWR1YcOnw6NvIGFjaW1hIGRlIDEwMCUuIFZhbW9zIGVuY29udHLDoS1sbyBlIGZpeGEtbG8uCmBgYHtyfQpzdWJzZXQoZGYsR3JhZC5SYXRlPjEwMCkKYGBgCgpBcnJ1bWFuZG86CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CmRmWydDYXplbm92aWEgQ29sbGVnZScsJ0dyYWQuUmF0ZSddIDwtIDEwMApgYGAKCkRpdmlzw6NvIGRlIHRyZWlubyBlIHRlc3RlOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpzZXQuc2VlZCgxMDEpCgpzYW1wbGUgPSBzYW1wbGUuc3BsaXQoZGYkUHJpdmF0ZSxTcGxpdFJhdGlvPTAuNykKdHJhaW4gPSBzdWJzZXQoZGYsc2FtcGxlPT1UKQp0ZXN0ID0gc3Vic2V0KGRmLHNhbXBsZT09RikKYGBgCgrDgXJ2b3JlIGRlIERlY2lzw6NvCgpDb25zdHJ1aW5kbyBvIG1vZGVsbyBlIGZhemVuZG8gcHJlZGnDp8O1ZXM6CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CnRyZWUgPC0gcnBhcnQoUHJpdmF0ZSB+IC4gLG1ldGhvZD0iY2xhc3MiLGRhdGE9dHJhaW4pCnRyZWUucHJlZHMgPC0gcHJlZGljdCh0cmVlLHRlc3QpCmhlYWQodHJlZS5wcmVkcykKYGBgCgpDcmlhbmRvIHVtYSBjb2x1bmEgbm9tZWFkYSDigJxQcml2YXRl4oCdLCBjb20gYXMgdmFyacOhdmVpcyBTaW0vTsOjbyBwYXJhIGluZGljYXIgc2UgbyBjb2zDqWdpbyDDqSBww7pibGljbyBvdSBuw6NvLCBwYXJhIGNvbWJpbmFyIGNvbSBvIGRhdGFmcmFtZSBvcmlnaW5hbCwgYXNzaW0gcG9kZW1vcyBjb21wYXJhciBvcyByZXN1bHRhZG9zIGZhY2lsbWVudGU6CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CnRyZWUucHJlZHMgPC0gYXMuZGF0YS5mcmFtZSh0cmVlLnByZWRzKQp0cmVlLnByZWRzJFByaXZhdGUgPC0gaWZlbHNlKHRyZWUucHJlZHMkWWVzID4gMC41LCJTaW0iLCJOw6NvIikKYGBgCgpDaGVjYW5kbyBhIE1hdHJpeiBkZSBjb25mdXPDo286CmBgYHtyIGZpZy5oZWlnaHQ9OCwgZmlnLndpZHRoPTh9CnRhYmxlKHRyZWUucHJlZHMkUHJpdmF0ZSx0ZXN0JFByaXZhdGUpCmBgYAoKVmlzdWFsaXphbmRvIGEgw6Fydm9yZToKYGBge3J9CmxpYnJhcnkocnBhcnQucGxvdCkKcHJwKHRyZWUpCmBgYAoKRmxvcmVzdGEgQWxlYXTDs3JpYQoKQ2FycmVnYW5kbyBhIGxpdnJhcmlhOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpsaWJyYXJ5KHJhbmRvbUZvcmVzdCkKYGBgCgpDb25zdHJ1aW5kbyBvIG1vZGVsbzoKYGBge3IgZmlnLmhlaWdodD04LCBmaWcud2lkdGg9OH0KcmYubW9kZWwgPC0gcmFuZG9tRm9yZXN0KFByaXZhdGUgfiAuLCBkYXRhPXRyYWluLGltcG9ydGFuY2U9VFJVRSkKYGBgCgpGYXplbmRvIGFzIHByZWRpw6fDtWVzOgpgYGB7ciBmaWcuaGVpZ2h0PTgsIGZpZy53aWR0aD04fQpwIDwtIHByZWRpY3QocmYubW9kZWwsdGVzdCkKdGFibGUocCx0ZXN0JFByaXZhdGUpCmBgYAoKTsOzcyB2ZW1vcyBxdWUgbyBtb2RlbG8gcGVyZm9ybW91IG1lbGhvciBxdWUgYSDDoXJ2b3JlIGRlIGRlY2lzw6NvLgoKUHJvamV0byBiYXNlYWRvIG5vIHJlcG9zaXTDs3JpbyBkZSBTYWphbCBTaGFybWEsIGVsYWJvcmFkbyBlbSAxMS8xMi8yMDE2LiBMaW5rOiBodHRwczovL3JwdWJzLmNvbS9zYWphbF9zaGFybWEvbWljcm9fZHRfcmYKCkRhZG9zIGNvbnRpZG9zIG5hIGxpdnJhcmlhIElTTFIsIHV0aWxpemFkb3MgYXBlbmFzIHBhcmEgZmlucyBlZHVjYWNpb25haXMuCgpUcmFkdXppZG8gZSBmb3JtYXRhZG8gcG9yIEFsZXhzYW5kZXIgU2FudG9zIFBzZW5keml1ayBlbSAyMDIwLiBDb250YXRvOiBhbGV4cHNlbmR6aXVrQGdtYWlsLmNvbQo=