1 Prelúdio e Objetivos

Carregando os pacotes a serem usados.

setwd('/home/heitor/Área de Trabalho/R Projects/Análise Macro/Lab 7')
library(tidyverse)   # general
library(rpart)       # my standard tree package
library(rpart.plot)  # extension to rpart
#library(party)       # 1 alternative package
library(tree)        # 2 alternative package
library(gt)          # to report's tables
library(plotly)      # interacting with plot

2 Preparando os Dados

Os dados são informações de antecedentes e características de tomadores de empréstimos e se eles se tornaram ou não inadimplentes. Devemos converter toda a variável em caracteres para fatores, pois elas realmente representam níveis.

dd <- read_csv("credit.csv") %>% as_tibble()
dd[sapply(dd, is.character)] <- lapply(dd[sapply(dd, is.character)], as.factor)
attach(dd)

Uma rápida olhada na natureza dos dados:

dd %>% summary()

##    checking_balance months_loan_duration   credit_history
##  < 0 DM    :274     Min.   : 4.0         critical :293   
##  > 200 DM  : 63     1st Qu.:12.0         good     :530   
##  1 - 200 DM:269     Median :18.0         perfect  : 40   
##  unknown   :394     Mean   :20.9         poor     : 88   
##                     3rd Qu.:24.0         very good: 49   
##                     Max.   :72.0                         
##                  purpose        amount           savings_balance
##  business            : 97   Min.   :  250   < 100 DM     :603   
##  car                 :337   1st Qu.: 1366   > 1000 DM    : 48   
##  car0                : 12   Median : 2320   100 - 500 DM :103   
##  education           : 59   Mean   : 3271   500 - 1000 DM: 63   
##  furniture/appliances:473   3rd Qu.: 3972   unknown      :183   
##  renovations         : 22   Max.   :18424                       
##   employment_duration percent_of_income years_at_residence      age       
##  < 1 year   :172      Min.   :1.000     Min.   :1.000      Min.   :19.00  
##  > 7 years  :253      1st Qu.:2.000     1st Qu.:2.000      1st Qu.:27.00  
##  1 - 4 years:339      Median :3.000     Median :3.000      Median :33.00  
##  4 - 7 years:174      Mean   :2.973     Mean   :2.845      Mean   :35.55  
##  unemployed : 62      3rd Qu.:4.000     3rd Qu.:4.000      3rd Qu.:42.00  
##                       Max.   :4.000     Max.   :4.000      Max.   :75.00  
##  other_credit  housing    existing_loans_count         job        dependents   
##  bank :139    other:108   Min.   :1.000        management:148   Min.   :1.000  
##  none :814    own  :713   1st Qu.:1.000        skilled   :630   1st Qu.:1.000  
##  store: 47    rent :179   Median :1.000        unemployed: 22   Median :1.000  
##                           Mean   :1.407        unskilled :200   Mean   :1.155  
##                           3rd Qu.:2.000                         3rd Qu.:1.000  
##                           Max.   :4.000                         Max.   :2.000  
##  phone     default  
##  no :596   no :700  
##  yes:404   yes:300  
##                     
##                     
##                     
##

dd %>% glimpse()

## Rows: 1,000
## Columns: 17
## $ checking_balance     <fct> < 0 DM, 1 - 200 DM, unknown, < 0 DM, < 0 DM, unkn…
## $ months_loan_duration <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12…
## $ credit_history       <fct> critical, good, critical, good, poor, good, good,…
## $ purpose              <fct> furniture/appliances, furniture/appliances, educa…
## $ amount               <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3…
## $ savings_balance      <fct> unknown, < 100 DM, < 100 DM, < 100 DM, < 100 DM, …
## $ employment_duration  <fct> > 7 years, 1 - 4 years, 4 - 7 years, 4 - 7 years,…
## $ percent_of_income    <dbl> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4…
## $ years_at_residence   <dbl> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4…
## $ age                  <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 2…
## $ other_credit         <fct> none, none, none, none, none, none, none, none, n…
## $ housing              <fct> own, own, own, other, other, other, own, rent, ow…
## $ existing_loans_count <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2…
## $ job                  <fct> skilled, skilled, unskilled, skilled, skilled, un…
## $ dependents           <dbl> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ phone                <fct> yes, no, no, no, no, yes, no, yes, no, no, no, no…
## $ default              <fct> no, yes, no, no, yes, no, no, no, no, yes, yes, y…

3 Algumas Visualizações

gg2 <- ggplot(dd) +
  geom_bar(aes(credit_history, fill=default)) +
  labs(title    = 'Proporção de Defaults por Classificação de Histórico',
       x        = 'Classificações do Histórico de Empréstimos',
       y        = 'Contagem')

ggplotly(gg2)

gg3 <- ggplot(dd) +
  geom_bar(aes(job, fill=default)) +
  labs(title    = 'Proporção de Defaults por Classificação de Emprego',
       x        = 'Tipos de Empregos',
       y        = 'Contagem')
ggplotly(gg3)

gg4 <- ggplot(dd) +
  geom_bar(aes(purpose, fill=default)) +
  labs(title    = 'Proporção de Defaults por Aplicação do Empréstimo',
       x        = 'Destinações',
       y        = 'Contagem')
ggplotly(gg4)

gg5 <- ggplot(dd) +
  geom_bar(aes(savings_balance, fill=default)) +
  labs(title    = 'Proporção de Defaults por Grupos de Poupança',
       x        = 'Agrupamentos de Poupança',
       y        = 'Contagem')
ggplotly(gg5)

4 Dividindo as Amostras de Treino & Teste

Foi aleatoriamente escolhida uma amostra de treino, sobre o qual o modelo será feito, de 80%. Portanto, o teste será feito na amostra restante, 20%. A aleatoriedade está “fixa” em 777 para fins de replicação.

set.seed(777)
train <- dd %>% slice_sample(.,prop= 0.8)
sid   <- as.numeric(rownames(train))
test  <- dd[-sid,]
remove(sid)

5 Árvores

Crontruiremos uma árvore de classificação com os dados de treino, usando todas as demais variáveis como explicativas de default.

tree1 <- rpart(formula =  default~.,
               data    = train,
               method  = 'class') #"class", classification; "anova", regression

tree1 %>% printcp()

## 
## Classification tree:
## rpart(formula = default ~ ., data = train, method = "class")
## 
## Variables actually used in tree construction:
## [1] amount               checking_balance     credit_history      
## [4] employment_duration  months_loan_duration purpose             
## [7] savings_balance     
## 
## Root node error: 234/800 = 0.2925
## 
## n= 800 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.038462      0   1.00000 1.00000 0.054986
## 2 0.019231      4   0.80342 0.94444 0.054047
## 3 0.015670      6   0.76496 0.96581 0.054419
## 4 0.010000      9   0.71795 0.97863 0.054636

tree1 %>% rpart.plot()

tree1 %>% plotcp()

Usaremos agora o modelo treinado, tree1 para prever os dados do teste.

pred_tree1 <- predict(tree1,
                      newdata = test,
                      type    = 'matrix')
pred_tree1 %>% head()

##   [,1] [,2] [,3]      [,4]      [,5]    [,6]
## 1    1  319   49 0.8668478 0.1331522 0.46000
## 2    1   76   31 0.7102804 0.2897196 0.13375
## 3    1   76   31 0.7102804 0.2897196 0.13375
## 4    1  319   49 0.8668478 0.1331522 0.46000
## 5    1   76   31 0.7102804 0.2897196 0.13375
## 6    2   47   82 0.3643411 0.6356589 0.16125

Transformaremos os resultados previstos em uma Tabela de Confusão, comp_1, de onde conseguimos ver a taxa de acurácia, de falsos negativo e positivo.

comp_1 <- data.frame(Original = test$default,
                     Previsão = pred_tree1[,1]) %>%
  mutate( Previsão = recode( Previsão, '1'='no', '2'='yes'))
 
comp_1 <- ftable(comp_1)

acc1     <- sum(diag(comp_1))/sum(comp_1)
falsYes1 <- comp_1[1,2]/sum(comp_1[,2])
falsNo1  <- comp_1[2,1]/sum(comp_1[,1])

6 Poda

Agora podaremos a primeira árvore, escolhendo uma penalização, CP de acordo com o menor erro relativo \(1-R^2.\), para formar tree2, e com a menor soma dos quadrados dos erros dos resíduos para formar tree3.

tree2 <- prune(tree1,
               cp= tree1$cptable[which.min(tree1$cptable[,"rel error"]),"CP"])

tree3 <-  prune(tree1,
                cp= tree1$cptable[which.min(tree1$cptable[,"xerror"]),"CP"])
# select the complexity parameter associated with the smallest cross-validated error

tree2 %>% rpart.plot()

tree3 %>% rpart.plot()

Usaremos as duas modalidades de poda para fazer a previsão:

pred_tree2 <- predict(tree2,
                      newdata = test,
                      type    = 'matrix')
pred_tree3 <- predict(tree3,
                      newdata = test,
                      type    = 'matrix')

Faremos agora a Tabela de Confusão para ambas as podas.

comp_2 <- data.frame(Original = test$default,
                     Previsão = pred_tree2[,1]) %>%
  mutate( Previsão = recode( Previsão, '1'='no', '2'='yes'))

comp_3 <- data.frame(Original = test$default,
                     Previsão = pred_tree3[,1]) %>%
  mutate( Previsão = recode( Previsão, '1'='no', '2'='yes'))

comp_2 <- ftable(comp_2)
comp_3 <- ftable(comp_3)
comp_2

##          Previsão  no yes
## Original                 
## no                123  16
## yes                29  32

comp_3

##          Previsão  no yes
## Original                 
## no                126  13
## yes                31  30

Arquivaremos os resultados obtidos até agora num data frame novo, compare.

acc2     <- sum(diag(comp_2))/sum(comp_2)
falsYes2 <- comp_2[1,2]/sum(comp_2[,2])
falsNo2  <- comp_2[2,1]/sum(comp_2[,1])

acc3     <- sum(diag(comp_3))/sum(comp_3)
falsYes3 <- comp_3[1,2]/sum(comp_3[,2])
falsNo3  <- comp_3[2,1]/sum(comp_3[,1])

compare <- data.frame(Modelos = c('Tree 1', 'Tree 2', 'Tree 3'),
                      Accuracy = c(acc1, acc2, acc3),
                      Falso_No = c(falsNo1, falsNo2, falsNo3),
                      Falso_Yes = c(falsYes1, falsYes2, falsYes3))
compare %>% gt()

Modelos	Accuracy	Falso_No	Falso_Yes
Tree 1	0.775	0.1907895	0.3333333
Tree 2	0.775	0.1907895	0.3333333
Tree 3	0.780	0.1974522	0.3023256

7 Tree Package

Fiz essa árvore usando outro pacote, para comparar os resultados.

tree4 <- tree(default~.,train)
tree4 %>% summary()

## 
## Classification tree:
## tree(formula = default ~ ., data = train)
## Variables actually used in tree construction:
## [1] "checking_balance"     "amount"               "employment_duration" 
## [4] "months_loan_duration" "credit_history"       "savings_balance"     
## [7] "age"                 
## Number of terminal nodes:  10 
## Residual mean deviance:  0.9566 = 755.7 / 790 
## Misclassification error rate: 0.235 = 188 / 800

tree4 %>% plot()
tree4 %>% text()

cv_t4 <- cv.tree(tree4, FUN = prune.misclass)
cv_t4

## $size
## [1] 10  6  5  1
## 
## $dev
## [1] 227 227 229 240
## 
## $k
## [1]  -Inf  0.00  3.00 10.75
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

par(mfrow=c(1,2))
plot(cv_t4$size, cv_t4$dev, type="b")
plot(cv_t4$k,    cv_t4$dev, type="b")

tree5 <- prune.misclass(tree4, best = cv_t4$size[which.min(cv_t4$dev)])
tree6 <- prune.misclass(tree4, k = cv_t4$k[which.min(cv_t4$dev)])
tree5 %>% summary()

## 
## Classification tree:
## tree(formula = default ~ ., data = train)
## Variables actually used in tree construction:
## [1] "checking_balance"     "amount"               "employment_duration" 
## [4] "months_loan_duration" "credit_history"       "savings_balance"     
## [7] "age"                 
## Number of terminal nodes:  10 
## Residual mean deviance:  0.9566 = 755.7 / 790 
## Misclassification error rate: 0.235 = 188 / 800

tree6 %>% summary()

##        Length Class  Mode     
## size   6      -none- numeric  
## dev    1      -none- function 
## k      1      -none- numeric  
## method 1      -none- character

pred_tree4 <- predict(tree4,test,type="class")
#pred_tree5 <- predict(tree5,test,type="class")
#pred_tree6 <- predict(tree6,test,type="class")

comp_4 <- data.frame(Original = test$default,
                     Previsão = pred_tree4)
comp_4 <- ftable(comp_4)
comp_4

##          Previsão  no yes
## Original                 
## no                119  20
## yes                27  34

acc4       <- sum(diag(comp_4))/sum(comp_4)
falsYes4   <- comp_4[1,2]/sum(comp_4[,2])
falsNo4    <- comp_4[2,1]/sum(comp_4[,1])

nova_linha <- c('Tree 4', acc4, falsYes4, falsNo4)

compare <- rbind(compare, nova_linha)

remove(acc1, acc2, acc3, acc4,
       falsNo1, falsNo2, falsNo3, falsNo4,
       falsYes1, falsYes2, falsYes3, falsYes4, nova_linha)

compare %>% gt()

Modelos	Accuracy	Falso_No	Falso_Yes
Tree 1	0.775	0.190789473684211	0.333333333333333
Tree 2	0.775	0.190789473684211	0.333333333333333
Tree 3	0.78	0.197452229299363	0.302325581395349
Tree 4	0.765	0.37037037037037	0.184931506849315

Percebemos, portanto, que a melhor árvore é a terceira, pela acurácia geral. Vamos seguir com ela.

8 Visualização da Estimação

Fiz algumas visualizações para tomar mais noção de como estão os dados. Coloquei lado-a-lado Default original e o previsto:

test$Fit_default  <- as.factor(pred_tree2[,1])
test <- test %>% mutate( Fit_default = recode( Fit_default, '1'='no', '2'='yes'))
attach(test)

## The following objects are masked from dd:
## 
##     age, amount, checking_balance, credit_history, default, dependents,
##     employment_duration, existing_loans_count, housing, job,
##     months_loan_duration, other_credit, percent_of_income, phone,
##     purpose, savings_balance, years_at_residence

gg1 <- ggplot(test, aes(x=amount, y=months_loan_duration,
                 colour=default, shape=Fit_default)) +
  geom_point() +  # poderia colocar o aes em _point, mas, para o lm, precisa ser em gg.
  geom_smooth(method=lm, se=FALSE, aes(colour=default,
                                       linetype=Fit_default)) +
  labs(title    = 'Montante do Empréstimo por Duração',
       subtitle = 'Separado por Default e Default Previsto',
       x        = 'Montante do Empréstimo',
       y        = 'Duração em Meses')

ggplotly(gg1)

## `geom_smooth()` using formula 'y ~ x'

Árvores de Decisão

Heitor Gabriel S. Monteiro