Bibliotecas Utlizadas

Primeiramente vamos importar as bibliotecas necessárias para esse script ser executado.

library(dplyr)
library(reshape2)
library(GGally)
library(ggplot2)
library(corrplot)
library(caret)
library(rpart)
library(C50)
library(gmodels)

Descrição da atividade

Nessa etapa você vai aplicar os algoritmos de classificação vistos até agora para prever evasão de alunos no curso de computação.

O cenário é o seguinte: o(a) aluno(a) cursou o primeiro período inteiro e queremos prever se ele(a) se matriculará ou não no segundo período. Se ele(a) não se matriculou é porque abandonou o curso ou solicitou desligamento. De forma mais específica:

Separe os dados em treino e teste;
Use como atributos as médias das disciplinas mais o atributo que você criou na parte 1 (fique a vontade para criar mais atributos);
Treine modelos de regressão logística;
Treine modelos de árvore de decisão;
Interprete os coeficientes da regressão. Quais atributos parecem ser mais importantes?;
Reporte acurácia, precision e recall no treino e teste. Como você avalia os resultados? Justifique sua resposta.

Note que para os passos acima não é necessário usar validação cruzada.

Controle overfitting usando validação-cruzada (ridge e lasso na regressão logística e condições de “early stopping” nas árvores de decisão, por exemplo, profundidade da árvore);
Reporte acurácia, precision e recall da validação-cruzada e teste (para os melhores modelos);
Aplique o melhor modelo a você mesmo(a) usando seu histórico e reporte a predição e resultado.

0. Funções auxiliares e variáveis globais

Para preparar e transformar os dados foram usadas as funções e variáveis auxiliares apresentadas nesta seção.

COL_DADOS = c("matricula", "cod_disciplina", "disciplina", "ano", "periodo", "media", "evadiu")

COL_T = c("matricula", "AV", "C1", "IC", "LP1", "LPT", "P1", "evadiu", "ano", "periodo")

NOTA_REP = 5
NOTA_FINAL = 7

INICIO_DISC = 2
FIM_DISC = 7
NUM_DISC = 6

select_treino <- function(ano_periodo) {
  return(as.integer(ano_periodo) >= 20091  & as.integer(ano_periodo) <= 20142)
}

calcula_status <- function(C1, AV, IC, LP1, LPT, P1) {
  return (
    (C1 < NOTA_REP) & (AV < NOTA_REP) & (IC < NOTA_REP) & (LP1 < NOTA_REP) & 
    (LPT < NOTA_REP) & (P1 < NOTA_REP)
  )
}

calcula_cra <- function(C1, AV, IC, LP1, LPT, P1) {
  return((C1 + AV + IC + LP1 + LPT + P1)/NUM_DISC)
}

calcula_cra_cc <- function(IC, LP1, P1) {
  return((IC + LP1 + P1)/3)
}

calcula_num_finais <- function(C1, AV, IC, LP1, LPT, P1) {
  return(
    (C1 < NOTA_FINAL) + (AV < NOTA_FINAL) + (IC < NOTA_FINAL) + (LP1 < NOTA_FINAL) + 
    (LPT < NOTA_FINAL) + (P1 < NOTA_FINAL)
  )
}

calcula_num_NA <- function(C1, AV, IC, LP1, LPT, P1) {
  return( 
    NUM_DISC - (is.finite(C1) + is.finite(AV) + is.finite(IC) + is.finite(LP1) + 
    is.finite(LPT) + is.finite(P1))
  )
}

decide_na <- function(col, num, media) {
  return(ifelse(is.na(col), ifelse(num > 3, 0, media), col))
}

1. Separe os dados em treino e teste

1.1 Separando os dados

Para fazer a separação iremos considerar os dados de [2009.1, 2014.2] como treino e [2015.1, 2015.2] como teste. Os demais anos foram retirados, pois vários não apresentaram evasões (isso foi verificado na parte 1) e portanto retirá-los pode diminuir um pouco o desbalanceamento dos dados (Iremos tratar o desbalanceamento na parte 3).

Além disso, não foi feita uma separação aleatória para evitar que tentemos “prever o passado usando o futuro”, além de que uma amostra aleatória poderia ser muito desbalanceada se não particionada corretamente.

# recebe dados
dados <- read.csv('~/DataAnalysis2/lab3/treino.csv')
dados <- na.omit(dados)

# renomeia colunas
colnames(dados) <- COL_DADOS

# cria coluna para auxiliar separação dos dados
dados$ano_periodo <- paste(as.character(dados$ano), as.character(dados$periodo), sep="")

# separação dos dados
dados <- dados %>% mutate(treino=select_treino(ano_periodo))
aux_split <- split(dados, dados$treino)

teste <- aux_split[[1]]
teste <- teste[teste$ano == 2015,] # apenas o ano de 2015 eh considerado

treino <- aux_split[[2]]

1.2 Transformando as linhas em colunas para facilitar o treinamento do modelo

# Treino

alunos.evadiu <- treino %>%
  group_by(matricula) %>% select(matricula, evadiu, ano, periodo) %>% unique()

treino <- treino %>%
  group_by(matricula, disciplina) %>%
  ungroup() %>%
  select(matricula, disciplina, media) %>%
  mutate(disciplina = as.factor(gsub(" ", ".", disciplina))) %>%
  dcast(matricula ~ disciplina, mean) %>% merge(alunos.evadiu)

## Using media as value column: use value.var to override.

colnames(treino) <- COL_T

# Teste

alunos.evadiu <- teste %>%
  group_by(matricula) %>% select(matricula, evadiu, ano, periodo) %>% unique()

teste <- teste %>%
  group_by(matricula, disciplina) %>%
  ungroup() %>%
  select(matricula, disciplina, media) %>%
  mutate(disciplina = as.factor(gsub(" ", ".", disciplina))) %>%
  dcast(matricula ~ disciplina, mean) %>% merge(alunos.evadiu)

## Using media as value column: use value.var to override.

colnames(teste) <- COL_T

1.3 Tratando NAs

Apesar de inicialmente termos retirado as linhas com NA, muitos alunos não apresentam a média de todas as disciplinas, ao trocar as linhas pelas colunas (dcast) essas ausências foram substituídas por NAs.

Assim, foi escolhida a seguinte estratégia para tratar os novos NAs:

Será criada uma nova coluna contendo o número de NAs por linha. Tal coluna poderá ser utilizada como atributo para a classificação já que é notável o número de NAs nos dados dos alunos que evadem, provavelmente por diferentes motivos (evasão no meio do período, perder a disciplina por falta, etc.)
Depois disso se o número de NAs for > 3 então substituiremos os NAs por 0, caso contrário serão substituídos pela média das notas presentes

# Treino
treino$num_NA <- 
  calcula_num_NA(treino$C1, treino$AV, treino$IC, treino$LP1, treino$LPT, treino$P1)

treino$media_NA <- 
  rowMeans(subset(treino, select = c(C1, AV, IC, LP1, LPT, P1)), na.rm = TRUE)

for (i in INICIO_DISC:FIM_DISC) {
  treino[, i] <- decide_na(treino[, i], treino$num_NA, treino$media_NA)
}

treino <- subset(treino, select = -c(media_NA, ano, periodo))

# Teste
teste$num_NA <- calcula_num_NA(teste$C1, teste$AV, teste$IC, teste$LP1, teste$LPT, teste$P1)

teste$media_NA <- rowMeans(subset(teste, select = c(C1, AV, IC, LP1, LPT, P1)), na.rm = TRUE)

for (i in INICIO_DISC:FIM_DISC) {
  teste[, i] <- decide_na(teste[, i], teste$num_NA, teste$media_NA)
}

teste <- subset(teste, select = -c(media_NA, ano, periodo))

2. Use como atributos as médias das disciplinas mais o atributo que você criou na parte 1 (fique a vontade para criar mais atributos)

2.1 Adicionando atributos

Iremos adicionar as variáveis utilizadas na parte 1 da atividade, de modo que após essas transformações os dados de treino e teste estarão prontos para serem usados para gerar e testar modelos, e terão as seguintes colunas:

matricula: identificador do aluno
AV: média em Álgebra Vetorial
C1: média em Cálculo 1
IC: média em Introdução a Computação
LP1: média em Lab. de Programação 1
LPT: média em Leitura e Produção de Texto
P1: média em Programação 1
evadiu: variável de classificação
num_NA: número de NAs
status: TRUE se reprovou todas as disciplinas
cra: média das disciplinas
cra_cc: média nas disciplinas de cc
num_finais: número de finais feitas

# Treino
treino$status <- 
  calcula_status(treino$C1, treino$AV, treino$IC, treino$LP1, treino$LPT, treino$P1)

treino$cra <-
  calcula_cra(treino$C1, treino$AV, treino$IC, treino$LP1, treino$LPT, treino$P1)

treino$cra_cc <- 
  calcula_cra_cc(treino$IC, treino$LP1, treino$P1)

treino$num_finais <- 
  calcula_num_finais(treino$C1, treino$AV, treino$IC, treino$LP1, treino$LPT, treino$P1)

# Teste
teste$status <- 
  calcula_status(teste$C1, teste$AV, teste$IC, teste$LP1, teste$LPT, teste$P1)

teste$cra <-
  calcula_cra(teste$C1, teste$AV, teste$IC, teste$LP1, teste$LPT, teste$P1)

teste$cra_cc <- 
  calcula_cra_cc(teste$IC, teste$LP1, teste$P1)

teste$num_finais <- 
  calcula_num_finais(teste$C1, teste$AV, teste$IC, teste$LP1, teste$LPT, teste$P1)

2.2 Verificando desbalanceamento no treino

Nesta parte do laboratório não iremos tratar o desbalanceamento, mas iremos analisar o quão desbalanceados estão os dados de treino buscando escolher um conjunto de dados de treino que o minimize.

num_evasoes <- treino %>% summarise(num_evasoes = sum(evadiu), num_alunos = n())

# gráfico
num_evasoes.melt <- num_evasoes %>% melt()

## No id variables; using all as measure variables

ggplot(num_evasoes.melt, aes(x = factor(variable), y = value)) + geom_bar(stat="identity") +
  geom_text(aes(label= value), vjust=0) + xlab("") + ylab("Número de alunos")

Dentro dos dados de treino apenas ~10.5% são dados de evasões enquanto os demais são dados referentes a alunos que não evadiram. O desbalanceamento dos dados originais era de ~8.6%, então, apesar de temos diminuido levemente o desbalanceamento, uma abordagem para balancear os dados pode ajudar e poderá ser usada posteriormente.

2.3 Modelos utilizados

Para os próximos passos iremos usar os dados de treino para gerar modelos de classificação para os dados e validar a acurácia utilizando os dados de teste.

Iremos utilizar 2 modelos:

Modelo global: um modelo que não utiliza as médias das disciplinas para a classificação, mas apenas atributos que poderiam ser utilizados para qualquer curso. Portanto iremos utilizar os atributos: cra, status, num_NA, num_finais.
Modelo computação: um modelo que além dos atributos do modelo global também leva em conta a média das disciplinas do primeiro período de Ciência da Computação.

Assim, com os próximos passos teremos uma noção de qual modelo se sai melhor e se levar a média das disciplinas individualmente tem um impacto notável no classificador.

3. Treine modelos de regressão logística

Regressão Logística é um modelo de regressão em que a variável dependente é categórica (ou seja discreta), o funcionamento para geração do modelo é similiar a regressão linear, mas não buscamos uma função que se adeque (tenha menor erro) aos dados, e sim uma função que melhor “separe” os dados de forma a classificá-los.

Para treinar um modelo de regressão logística no R iremos usar a função glm.

3.1 Modelo global

treino$evadiu <- factor(treino$evadiu)
treino$status <- factor(treino$status)

teste$evadiu <- factor(teste$evadiu)
teste$status <- factor(teste$status)

model.reg.global <- train(evadiu ~ cra + num_NA + status + num_finais,
                          data=treino, method="glm", family="binomial", na.action = na.omit)

pred.reg.global <- predict(model.reg.global, newdata=teste)
treino.pred.reg.global <- predict(model.reg.global, newdata=treino)

acc.reg.global <- confusionMatrix(pred.reg.global, teste$evadiu)
acc.reg.global

## $positive
## [1] "FALSE"
## 
## $table
##           Reference
## Prediction FALSE TRUE
##      FALSE   167    3
##      TRUE      5   14
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##    0.957671958    0.754465736    0.918303056    0.981551293    0.910052910 
## AccuracyPValue  McnemarPValue 
##    0.009831525    0.723673610 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9709302            0.8235294            0.9823529 
##       Neg Pred Value            Precision               Recall 
##            0.7368421            0.9823529            0.9709302 
##                   F1           Prevalence       Detection Rate 
##            0.9766082            0.9100529            0.8835979 
## Detection Prevalence    Balanced Accuracy 
##            0.8994709            0.8972298 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

3.2 Modelo computação

model.reg.cc <- train(evadiu ~ cra + num_NA + status + num_finais + C1 + AV + 
                      IC + LP1 + LPT + P1 + cra_cc,
                       data=treino,
                       method="glm",
                       family="binomial",
                       na.action = na.omit)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

pred.reg.cc <- predict(model.reg.cc, newdata=teste)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

treino.pred.reg.cc <- predict(model.reg.cc, newdata=treino)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

acc.reg.cc <- confusionMatrix(pred.reg.cc, teste$evadiu)
acc.reg.cc

## $positive
## [1] "FALSE"
## 
## $table
##           Reference
## Prediction FALSE TRUE
##      FALSE   167    3
##      TRUE      5   14
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##    0.957671958    0.754465736    0.918303056    0.981551293    0.910052910 
## AccuracyPValue  McnemarPValue 
##    0.009831525    0.723673610 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9709302            0.8235294            0.9823529 
##       Neg Pred Value            Precision               Recall 
##            0.7368421            0.9823529            0.9709302 
##                   F1           Prevalence       Detection Rate 
##            0.9766082            0.9100529            0.8835979 
## Detection Prevalence    Balanced Accuracy 
##            0.8994709            0.8972298 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

3.3 Conclusão

Avaliando apenas a acurácia nos testes vemos que ambos tem a mesma acurácia. A precisão e recall serão calculados (tanto para treino quanto para teste) e comparados conjuntamente entre vários modelos na Seção 6.

4. Treine modelos de árvore de decisão

4.1 Modelo global

model.tree.global <- C5.0(evadiu ~ num_finais + num_NA + cra + status, data = treino)
plot(model.tree.global)

pred.tree.global <- predict(model.tree.global, teste, type='class')
treino.pred.tree.global <- predict(model.tree.global, treino, type='class')

acc.tree.global <- confusionMatrix(pred.tree.global, teste$evadiu)
acc.tree.global

## $positive
## [1] "FALSE"
## 
## $table
##           Reference
## Prediction FALSE TRUE
##      FALSE   167    3
##      TRUE      5   14
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##    0.957671958    0.754465736    0.918303056    0.981551293    0.910052910 
## AccuracyPValue  McnemarPValue 
##    0.009831525    0.723673610 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9709302            0.8235294            0.9823529 
##       Neg Pred Value            Precision               Recall 
##            0.7368421            0.9823529            0.9709302 
##                   F1           Prevalence       Detection Rate 
##            0.9766082            0.9100529            0.8835979 
## Detection Prevalence    Balanced Accuracy 
##            0.8994709            0.8972298 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

4.2 Modelo computação

model.tree.cc <-  C5.0(evadiu ~ cra + num_NA + status + num_finais + C1 + AV + 
                      IC + LP1 + LPT + P1 + cra_cc, data = treino)
plot(model.tree.cc)

pred.tree.cc <- predict(model.tree.cc, teste, type='class')
treino.pred.tree.cc <- predict(model.tree.cc, treino, type='class')

acc.tree.cc <- confusionMatrix(pred.tree.cc, teste$evadiu)
acc.tree.cc

## $positive
## [1] "FALSE"
## 
## $table
##           Reference
## Prediction FALSE TRUE
##      FALSE   167    4
##      TRUE      5   13
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##     0.95238095     0.71664168     0.91153141     0.97799735     0.91005291 
## AccuracyPValue  McnemarPValue 
##     0.02137354     1.00000000 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9709302            0.7647059            0.9766082 
##       Neg Pred Value            Precision               Recall 
##            0.7222222            0.9766082            0.9709302 
##                   F1           Prevalence       Detection Rate 
##            0.9737609            0.9100529            0.8835979 
## Detection Prevalence    Balanced Accuracy 
##            0.9047619            0.8678181 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

4.3 Conclusão

Considerando apenas acurácia, o modelo global se saiu melhor com acurácia de ~95.77% enquanto o modelo computação obteve acurácia de ~95.24%. No entanto a diferença é bem pequena.

5. Interprete os coeficientes da regressão. Quais atributos parecem ser mais importantes?

summary(model.reg.global)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1944  -0.2634  -0.2206  -0.1817   2.8828  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.9164     1.7375   0.527 0.597909    
## cra          -0.6096     0.1957  -3.115 0.001842 ** 
## num_NA        0.1059     0.1579   0.671 0.502485    
## statusTRUE    2.4058     0.6378   3.772 0.000162 ***
## num_finais   -0.1812     0.2168  -0.835 0.403451    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 396.26  on 629  degrees of freedom
## Residual deviance: 209.97  on 625  degrees of freedom
## AIC: 219.97
## 
## Number of Fisher Scoring iterations: 6

summary(model.reg.cc)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0086  -0.2706  -0.2086  -0.1630   2.9851  
## 
## Coefficients: (2 not defined because of singularities)
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  2.624264   1.945237   1.349   0.1773  
## cra         -2.594655   2.006185  -1.293   0.1959  
## num_NA      -0.000685   0.180980  -0.004   0.9970  
## statusTRUE   1.867198   0.801056   2.331   0.0198 *
## num_finais  -0.271367   0.219430  -1.237   0.2162  
## C1           0.315905   0.362808   0.871   0.3839  
## AV           0.528855   0.338882   1.561   0.1186  
## IC           0.357497   0.388793   0.920   0.3578  
## LP1          0.472461   0.629596   0.750   0.4530  
## LPT          0.118268   0.362583   0.326   0.7443  
## P1                 NA         NA      NA       NA  
## cra_cc             NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 396.26  on 629  degrees of freedom
## Residual deviance: 198.84  on 620  degrees of freedom
## AIC: 218.84
## 
## Number of Fisher Scoring iterations: 6

summary(model.tree.global)

## 
## Call:
## C5.0.formula(formula = evadiu ~ num_finais + num_NA + cra + status, data
##  = treino)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Mar  8 10:38:45 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 630 cases (5 attributes) from undefined.data
## 
## Decision tree:
## 
## status = FALSE: FALSE (588/24)
## status = TRUE: TRUE (42/6)
## 
## 
## Evaluation on training data (630 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       2   30( 4.8%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     564     6    (a): class FALSE
##      24    36    (b): class TRUE
## 
## 
##  Attribute usage:
## 
##  100.00% status
## 
## 
## Time: 0.0 secs

summary(model.tree.cc)

## 
## Call:
## C5.0.formula(formula = evadiu ~ cra + num_NA + status + num_finais + C1
##  + AV + IC + LP1 + LPT + P1 + cra_cc, data = treino)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Mar  8 10:38:46 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 630 cases (12 attributes) from undefined.data
## 
## Decision tree:
## 
## status = TRUE:
## :...LP1 <= 2.5: TRUE (31/1)
## :   LP1 > 2.5:
## :   :...AV <= 2.6: FALSE (8/3)
## :       AV > 2.6: TRUE (3)
## status = FALSE:
## :...IC > 4.4: FALSE (560/17)
##     IC <= 4.4:
##     :...num_finais <= 4: FALSE (5)
##         num_finais > 4:
##         :...C1 <= 2: FALSE (17/2)
##             C1 > 2: TRUE (6/1)
## 
## 
## Evaluation on training data (630 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       7   24( 3.8%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     568     2    (a): class FALSE
##      22    38    (b): class TRUE
## 
## 
##  Attribute usage:
## 
##  100.00% status
##   93.33% IC
##    6.67% LP1
##    4.44% num_finais
##    3.65% C1
##    1.75% AV
## 
## 
## Time: 0.0 secs

Segundo as árvores de decisão, Status sem dúvidas é o atributo mais importante sendo utilizado 100% das vezes, ou seja em toda decisão o Status é avaliado, inclusive sendo o único atributo avaliado na árvore de decisão do modelo global e ainda sim obtendo maior acurácia que a árvore mais “complexa”. Outro atributo evidenciado pela árvore do modelo computação é a média em IC.

Já segundo as regressões logísticas Status também é uma das variáveis mais importantes, seguida por CRA no modelo global.

Aparentemente alguns dos demais atributos adicionados (como: num_NA, LPT, …) não estão ajudando muito na classificação e seu uso podem ser repensados para a próxima parte da atividade.

6. Reporte acurácia, precision e recall no treino e teste. Como você avalia os resultados? Justifique sua resposta.

Antes de analisar esses parâmetros, irei rapidamente explicar seu significado e o que implica para esta análise em específico.

Acurácia: quanto por cento acertamos sobre o total, ou seja: (TP+TN)/(TP + FN + FP + TN)
Precisão: quanto por cento dos que previmos que EVADIU realmente evadiram: TP/(TP + FP)
Recall: quanto por cento dos que evadiram nós previmos como EVADIU: TP/(TP + FN)

Portanto queremos um modelo com uma maior acurácia, precisão e recall. Priorizando acurácia e recall, pois prever que X irá evadir e X não evade é menos “grave” do que prever que Y não irá evadir, mas Y evade.

6.1 No treino

# Calculando
treino.reg.global <- confusionMatrix(treino.pred.reg.global, treino$evadiu)
treino.reg.cc <- confusionMatrix(treino.pred.reg.cc, treino$evadiu)

treino.tree.global <- confusionMatrix(treino.pred.tree.global, treino$evadiu)
treino.tree.cc <- confusionMatrix(treino.pred.tree.global, treino$evadiu)


# Acurácia
df <- data_frame('accuracy' = c(treino.reg.global$overall['Accuracy'],
                                treino.reg.cc$overall['Accuracy'],
                                treino.tree.global$overall['Accuracy'],
                                treino.tree.cc$overall['Accuracy']),
                 'model' = c('Reg. Global', 'Reg. CC', 'Tree. Global', 'Tree. CC'))

ggplot(data = df, aes(x=model, y=accuracy)) + geom_bar(stat="identity") + coord_cartesian(ylim=c(0.94, 0.96))

# Precisão
df <- data_frame('precision' = c(treino.reg.global$byClass['Precision'],
                                treino.reg.cc$byClass['Precision'],
                                treino.tree.global$byClass['Precision'],
                                treino.tree.cc$byClass['Precision']),
                 'model' = c('Reg. Global', 'Reg. CC', 'Tree. Global', 'Tree. CC'))


ggplot(data = df, aes(x=model, y=precision)) + geom_bar(stat="identity") + coord_cartesian(ylim=c(0.94, 0.97))

# Recall
df <- data_frame('Recall' = c(treino.reg.global$byClass['Recall'],
                                treino.reg.cc$byClass['Recall'],
                                treino.tree.global$byClass['Recall'],
                                treino.tree.cc$byClass['Recall']),
                 'model' = c('Reg. Global', 'Reg. CC', 'Tree. Global', 'Tree. CC'))


ggplot(data = df, aes(x=model, y=Recall)) + geom_bar(stat="identity") + coord_cartesian(ylim=c(0.97, 0.99))

6.2 No teste

# Acurácia
df <- data_frame('accuracy' = c(acc.reg.global$overall['Accuracy'],
                                acc.reg.cc$overall['Accuracy'],
                                acc.tree.global$overall['Accuracy'],
                                acc.tree.cc$overall['Accuracy']),
                 'model' = c('Reg. Global', 'Reg. CC', 'Tree. Global', 'Tree. CC'))

ggplot(data = df, aes(x=model, y=accuracy)) + geom_bar(stat="identity") + coord_cartesian(ylim=c(0.94, 0.96))

# Precisão
df <- data_frame('precision' = c(acc.reg.global$byClass['Precision'],
                                acc.reg.cc$byClass['Precision'],
                                acc.tree.global$byClass['Precision'],
                                acc.tree.cc$byClass['Precision']),
                 'model' = c('Reg. Global', 'Reg. CC', 'Tree. Global', 'Tree. CC'))


ggplot(data = df, aes(x=model, y=precision)) + geom_bar(stat="identity") + coord_cartesian(ylim=c(0.97, 0.985))

# Recall
df <- data_frame('Recall' = c(acc.reg.global$byClass['Recall'],
                                acc.reg.cc$byClass['Recall'],
                                acc.tree.global$byClass['Recall'],
                                acc.tree.cc$byClass['Recall']),
                 'model' = c('Reg. Global', 'Reg. CC', 'Tree. Global', 'Tree. CC'))


ggplot(data = df, aes(x=model, y=Recall)) + geom_bar(stat="identity") + coord_cartesian(ylim=c(0.97, 0.972))

Treino

No treino todos os modelos se comportaram de forma similar apresentando acurácia de ~0.952, precisão de ~0.959 e recall de ~0.988 execto o modelo computação que utilzou regressão logística que apresentou acurácia e precisão maiorres, porém recall um pouco menor.

O alto recall indica que os alunos que evadiram estão sendo identificados corretamente o que é bom, porém a precisão um pouco menor deve ser devido a falsos positivos que estão sendo identificados.

Teste

Já no teste novamente todos os modelos apresentam resultados bem similares, com acurácia ~0.957, precisão de ~0.982 e recall de ~0.971, e apenas um apresentou precisão e acurácia levemente menor que os demais.

7. Controle overfitting usando validação-cruzada (ridge e lasso na regressão logística e condições de “early stopping” nas árvores de decisão, por exemplo, profundidade da árvore)

7.1 Usando regressão logística

7.1.1 Modelo global

ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)

ctr.model.reg.global <- train(evadiu ~ cra + num_NA + status + num_finais,
                   data=treino,
                   method="glm",
                   family="binomial",
                   trControl=ctrl,
                   na.action = na.omit)

pred.ctr.reg.global <- predict(ctr.model.reg.global, teste)
treino.ctr.reg.global <- predict(ctr.model.reg.global, treino)

acc.ctr.reg.global <- confusionMatrix(pred.ctr.reg.global, teste$evadiu)
acc.ctr.reg.global

## $positive
## [1] "FALSE"
## 
## $table
##           Reference
## Prediction FALSE TRUE
##      FALSE   167    3
##      TRUE      5   14
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##    0.957671958    0.754465736    0.918303056    0.981551293    0.910052910 
## AccuracyPValue  McnemarPValue 
##    0.009831525    0.723673610 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9709302            0.8235294            0.9823529 
##       Neg Pred Value            Precision               Recall 
##            0.7368421            0.9823529            0.9709302 
##                   F1           Prevalence       Detection Rate 
##            0.9766082            0.9100529            0.8835979 
## Detection Prevalence    Balanced Accuracy 
##            0.8994709            0.8972298 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

7.1.2 Modelo computação

ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)

ctr.model.reg.cc <- train(evadiu ~ cra + num_NA + status + num_finais + C1 + AV + 
                      IC + LP1 + LPT + P1 + cra_cc,
                   data=treino,
                   method="glm",
                   family="binomial",
                   trControl=ctrl,
                   na.action = na.omit)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

pred.ctr.reg.cc <- predict(ctr.model.reg.cc, teste)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

treino.ctr.reg.cc <- predict(ctr.model.reg.cc, treino)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

acc.ctr.reg.cc <- confusionMatrix(pred.ctr.reg.cc, teste$evadiu)
acc.ctr.reg.cc

## $positive
## [1] "FALSE"
## 
## $table
##           Reference
## Prediction FALSE TRUE
##      FALSE   167    3
##      TRUE      5   14
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##    0.957671958    0.754465736    0.918303056    0.981551293    0.910052910 
## AccuracyPValue  McnemarPValue 
##    0.009831525    0.723673610 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9709302            0.8235294            0.9823529 
##       Neg Pred Value            Precision               Recall 
##            0.7368421            0.9823529            0.9709302 
##                   F1           Prevalence       Detection Rate 
##            0.9766082            0.9100529            0.8835979 
## Detection Prevalence    Balanced Accuracy 
##            0.8994709            0.8972298 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

7.1.3 Conclusão

Avaliando apenas a acurácia nos testes vemos que ambos tem a mesma acurácia. A precisão e recall serão calculados e comparados conjuntamente entre vários modelos na Seção 8.

7.2 Usando árvore de decisão

Foi treinado um modelo utilizando earlyStopping e controlando a altura da árvore, porém a altura da árvore já está “pequena” (2), portanto esses métodos não mostraram melhoras no modelo. Bucando uma melhora, foi utilizada uma nova biblioteca para gerar a árvore: rpart.

Segundo a documentação de rpart é utilizada cross-validation entre outros métodos para garantir um melhor resultado na geração do modelo, mais detalhes em [4].

7.2.1 Modelo global

model.rp.global <- rpart(evadiu ~ cra + num_NA + status + num_finais, data = treino, method = 'class')

plot(model.rp.global)
text(model.rp.global, pretty=0)

pred.rp.global <- predict(model.rp.global, teste, type='class')

acc.rp.global <- confusionMatrix(pred.rp.global, teste$evadiu)
acc.rp.global

## $positive
## [1] "FALSE"
## 
## $table
##           Reference
## Prediction FALSE TRUE
##      FALSE   167    3
##      TRUE      5   14
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##    0.957671958    0.754465736    0.918303056    0.981551293    0.910052910 
## AccuracyPValue  McnemarPValue 
##    0.009831525    0.723673610 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9709302            0.8235294            0.9823529 
##       Neg Pred Value            Precision               Recall 
##            0.7368421            0.9823529            0.9709302 
##                   F1           Prevalence       Detection Rate 
##            0.9766082            0.9100529            0.8835979 
## Detection Prevalence    Balanced Accuracy 
##            0.8994709            0.8972298 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

7.2.2 Modelo computação

model.rp.cc <- rpart(evadiu ~ AV + C1 + LPT + LP1 + P1 + IC + cra + cra_cc + num_NA + status + num_finais, data = treino, method = 'class')

plot(model.rp.cc)
text(model.rp.cc, pretty=0)

pred.rp.cc <- predict(model.rp.cc, teste, type='class')

acc.rp.cc <- confusionMatrix(pred.rp.cc, teste$evadiu)
acc.rp.cc

## $positive
## [1] "FALSE"
## 
## $table
##           Reference
## Prediction FALSE TRUE
##      FALSE   170    4
##      TRUE      2   13
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##    0.968253968    0.795232936    0.932186000    0.988262375    0.910052910 
## AccuracyPValue  McnemarPValue 
##    0.001442486    0.683091398 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9883721            0.7647059            0.9770115 
##       Neg Pred Value            Precision               Recall 
##            0.8666667            0.9770115            0.9883721 
##                   F1           Prevalence       Detection Rate 
##            0.9826590            0.9100529            0.8994709 
## Detection Prevalence    Balanced Accuracy 
##            0.9206349            0.8765390 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

7.2.3 Conclusão

Utilizando apenas a acurácia como parâmetro de comparação o modelo computação se saiu melhor com acurácia de ~0.97.

8. Reporte acurácia, precision e recall da validação-cruzada no teste (para os melhores modelos)

Considerando apenas os novos modelos (que utilizam métodos para controle de overfitting) e apenas os dados de teste como parâmetro, obtivemos o seguinte resultado.

# Acurácia
df <- data_frame('accuracy' = c(acc.ctr.reg.global$overall['Accuracy'],
                                acc.ctr.reg.cc$overall['Accuracy'],
                                acc.rp.global$overall['Accuracy'],
                                acc.rp.cc$overall['Accuracy']),
                 'model' = c('Reg. Global', 'Reg. CC', 'Tree. Global', 'Tree. CC'))

ggplot(data = df, aes(x=model, y=accuracy)) + geom_bar(stat="identity") + coord_cartesian(ylim=c(0.94, 0.97))

# Precisão
df <- data_frame('precision' = c(acc.ctr.reg.global$byClass['Precision'],
                                acc.ctr.reg.cc$byClass['Precision'],
                                acc.rp.global$byClass['Precision'],
                                acc.rp.cc$byClass['Precision']),
                 'model' = c('Reg. Global', 'Reg. CC', 'Tree. Global', 'Tree. CC'))


ggplot(data = df, aes(x=model, y=precision)) + geom_bar(stat="identity") + coord_cartesian(ylim=c(0.97, 0.985))

# Recall
df <- data_frame('Recall' = c(acc.ctr.reg.global$byClass['Recall'],
                                acc.ctr.reg.cc$byClass['Recall'],
                                acc.rp.global$byClass['Recall'],
                                acc.rp.cc$byClass['Recall']),
                 'model' = c('Reg. Global', 'Reg. CC', 'Tree. Global', 'Tree. CC'))


ggplot(data = df, aes(x=model, y=Recall)) + geom_bar(stat="identity") + coord_cartesian(ylim=c(0.97, 0.99))

Conclusão

Novamente todos os modelos apresentaram resultados similares, exceto a árvore de decisão que utilizou o modelo computação este modelo apresentou um recall bastante superior no teste (quase 0.99%), uma precisão um pouco inferior, mas uma acurácia notavelmente superior aos demais.

9. Aplique o melhor modelo a você mesmo(a) usando seu histórico e reporte a predição e resultado.

Como visto nas análises anteriores vários modelos obtiveram resultados bastante similares, dito isso irei utilizar dois modelos para esta seção, um que apresentou resultados similares aos demais, e outro que apresentou um resultado mais singular.

Escolhi um dos modelos que apresentou um bom resultado da primeira análise (sem métodos para controle de overfitting), que é o modelo que utiliza Regressão Logística considerando atributos globais e o o outro modelo escolhido foi feito na segunda análise (com métodos para controle de overfitting) utilizando árvore de decisão apenas com atributos de CC, este modelo foi o que conseguiu um maior recall e acurácia indicando que poderá ser um ótimo classificador para este problema.

Segue abaixo o resultado usando ambos classificadores.

teste.marianne <- data.frame(C1 = 8.3, AV = 10, LPT = 9.2, P1 = 10, IC=9.9, LP1 = 10, num_NA = 0, status=as.factor(FALSE), cra=9.56, cra_cc=9.96, num_finais = 0)

pred.marianne.1 <- predict(model.reg.global, newdata = teste.marianne)
pred.marianne.2 <- predict(model.rp.cc, newdata = teste.marianne)

pred.marianne.1

## [1] FALSE
## Levels: FALSE TRUE

pred.marianne.2

##      FALSE       TRUE
## 1 0.967128 0.03287197

Ambos os modelos previram a não evasão como esperado, além disso o segundo modelo além de prever afirma que há a probabilidade de ~0.967 de não evasão que é um valor bem alto (próximo de 100%) e indica o “quão certo” desta classificação o modelo está.

Referências

1. sobre C5.0

2. cross validation linear regression

3. logistic regression in r

4. rpart

AD2 2016.2 - Lab3 - Parte 2

Marianne Linhares

28 de fevereiro de 2016