Classification Prediction

Gustavo Monteiro

November 18, 2018

The database used in this study consists of data on the votes that candidates for the Federal Chamber of Deputies received in the years 2006 and 2010 (source: http://www.tse.jus.br), as well as information on their campaigns, party, schooling, among other attributes.

Loading data.
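The setup chunk is not shown in this report; judging by the functions called below, the libraries loaded would be something like the following (the package-to-function mapping here is my assumption, reconstructed from the function names, not part of the original):

library(dplyr)            # %>%, sample_frac, anti_join, mutate, summarise, count
library(readr)            # write_csv
library(ggplot2)          # boxplots and model performance plots
library(ggmosaic)         # geom_mosaic
library(dataPreparation)  # build_encoding, one_hot_encoder
library(MLmetrics)        # F1_Score
library(caret)            # train, trainControl, varImp
library(rattle)           # fancyRpartPlot
library(DMwR)             # SMOTE backend used by trainControl(sampling = "smote")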

set.seed(1)

data <- read.csv("train.csv")
data$id <- 1:nrow(data)

data %>% 
  sample_frac(.9) -> train

test <- read.csv("test.csv")
anti_join(data, train, by = 'id') -> validation

Let’s explore the features of the data

train %>%
  glimpse()
## Observations: 6,860
## Variables: 25
## $ ano                                   <int> 2006, 2006, 2010, 2010, ...
## $ sequencial_candidato                  <dbl> 1.0590e+04, 1.1295e+04, ...
## $ nome                                  <fct> PAULO DUARTE, ANTÔNIO MA...
## $ uf                                    <fct> PR, MG, MA, SP, MG, SP, ...
## $ partido                               <fct> PSDC, PSC, PC do B, PTN,...
## $ quantidade_doacoes                    <int> 1, 3, 5, 3, 1, 27, 23, 4...
## $ quantidade_doadores                   <int> 1, 3, 5, 3, 1, 24, 17, 4...
## $ total_receita                         <dbl> 305.90, 3670.00, 13030.0...
## $ media_receita                         <dbl> 305.90, 1223.33, 2606.00...
## $ recursos_de_outros_candidatos.comites <dbl> 305.90, 2340.00, 0.00, 0...
## $ recursos_de_pessoas_fisicas           <dbl> 0.0, 1330.0, 1700.0, 100...
## $ recursos_de_pessoas_juridicas         <dbl> 0.00, 0.00, 11330.00, 0....
## $ recursos_proprios                     <dbl> 0.00, 0.00, 0.00, 0.00, ...
## $ recursos_de_partido_politico          <dbl> 0.0, 0.0, 0.0, 354.5, 0....
## $ quantidade_despesas                   <int> 1, 3, 7, 3, 1, 75, 22, 2...
## $ quantidade_fornecedores               <int> 1, 3, 6, 3, 1, 69, 22, 1...
## $ total_despesa                         <dbl> 305.90, 3670.00, 12230.0...
## $ media_despesa                         <dbl> 305.90, 1223.33, 1747.14...
## $ cargo                                 <fct> DEPUTADO FEDERAL, DEPUTA...
## $ sexo                                  <fct> MASCULINO, MASCULINO, MA...
## $ grau                                  <fct> SUPERIOR INCOMPLETO, SUP...
## $ estado_civil                          <fct> CASADO(A), DIVORCIADO(A)...
## $ ocupacao                              <fct> EMPRESÁRIO, ESTUDANTE, B...
## $ situacao                              <fct> nao_eleito, nao_eleito, ...
## $ id                                    <int> 2024, 2836, 4366, 6920, ...

Next, we will explore the class balance of the data attributes, which is an important point to prevent overfitting in prediction models.

Feature by feature, we will show the balance with respect to the target variable, which in this study is situacao (whether a candidate was elected or not), and comment on whether the classes are imbalanced.
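Before going feature by feature, a quick added check (not in the original output) of the raw balance of the target itself; the decision tree’s root node further below indicates roughly 14% eleito against 86% nao_eleito:

train %>%
  count(situacao) %>%         # class counts
  mutate(prop = n / sum(n))   # and their proportions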

recursos_proprios

train %>%
  ggplot(aes(situacao,recursos_proprios)) +
  geom_boxplot()

imbalanced: high resource values are more common among the non-elected

recursos_de_partido_politico

train %>%
  ggplot(aes(situacao,
             recursos_de_partido_politico)) +
  geom_boxplot()

imbalanced: high resource values are more common among the elected

recursos_de_outros_candidatos.comites

train %>%
  ggplot(aes(situacao,
             recursos_de_outros_candidatos.comites)) +
  geom_boxplot()

the same situation as recursos_de_partido_politico

recursos_de_pessoas_fisicas

train %>%
  ggplot(aes(situacao,
             recursos_de_pessoas_fisicas)) +
  geom_boxplot() 

imbalanced, with lower value ranges for the non-elected and higher ones for the elected

recursos_de_pessoas_juridicas

train %>%
  ggplot(aes(situacao,
             recursos_de_pessoas_juridicas)) +
  geom_boxplot()

imbalanced across a range of values

quantidade_doacoes

train %>%
  ggplot(aes(situacao,
             quantidade_doacoes)) +
  geom_boxplot()

imbalanced toward very high donation counts

quantidade_doadores

train %>%
  ggplot(aes(situacao,
             quantidade_doadores)) +
  geom_boxplot()

the same situation as quantidade_doacoes

media_receita

train %>%
  ggplot(aes(situacao,
             media_receita)) +
  geom_boxplot()

there are cases of very high values only among the non-elected

total_receita

train %>%
  ggplot(aes(situacao,
             total_receita)) +
  geom_boxplot()

there are ranges of values present only among the non-elected

sexo

train %>%
  ggplot() +
  geom_mosaic(aes(x = product(sexo, situacao),
                  fill = sexo)) +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

grau

train %>%
  ggplot() +
  geom_mosaic(aes(x = product(grau, situacao),
                  fill = grau)) +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

estado_civil

train %>%
  ggplot() +
  geom_mosaic(aes(x = product(estado_civil, situacao),
                  fill = estado_civil)) +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

We can see that there is a clear imbalance in some variables, especially values that are present only among the elected or only among the non-elected. This can cause overfitting problems. Among the possible correction strategies, this study will use a hybrid one: undersampling the more common class and oversampling the less common one.
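As a minimal sketch of what this correction looks like in practice: the model fits below pass sampling = "smote" to trainControl, which applies SMOTE inside each resampling iteration (synthetic oversampling of the minority class combined with undersampling of the majority class). caret also offers standalone helpers, shown here for illustration only (ctrl_smote, balanced_down and balanced_up are hypothetical names, not used for the reported results):

# rebalancing inside each resample, as the model fits below do
ctrl_smote <- trainControl(method = "cv", number = 5, sampling = "smote")

# standalone alternatives (illustrative only)
balanced_down <- downSample(x = select(train, -situacao),
                            y = train$situacao, yname = "situacao")
balanced_up   <- upSample(x = select(train, -situacao),
                          y = train$situacao, yname = "situacao")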

The next step builds a one-hot encoding to transform the sexo feature into indicator (dummy) variables.

# Build the encoding on the training set and reuse it for every split, so the
# resulting dummy columns stay consistent across train, validation and test
encoding <- build_encoding(dataSet = train,
                           cols = c("sexo"),
                           verbose = F)

train <- one_hot_encoder(dataSet = train,
                         encoding = encoding,
                         drop = TRUE,
                         verbose = F)

validation <- one_hot_encoder(dataSet = validation,
                              encoding = encoding,
                              drop = TRUE,
                              verbose = F)

test <- one_hot_encoder(dataSet = test,
                        encoding = encoding,
                        drop = TRUE,
                        verbose = F)

Creating a custom function to evaluate the F1 metric of a model.

f1 <- function(data, lev = NULL, model = NULL) {
  # Summary function in the format caret expects: a data.frame with obs/pred
  # columns comes in, a named metric vector goes out
  f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
  c(F1 = f1_val)
}

F_Measure <- function(expected, predicted, ...) {
  # Flags each prediction as a false negative (fn) or false positive (fp)
  data.frame(expected=expected,
             prediction=predicted) %>%
      mutate(fn = ifelse(expected == "eleito" &
                         prediction == "nao_eleito",1,0),
             fp = ifelse(expected == "nao_eleito" &
                         prediction == "eleito",1,0)) -> result
  result  # return the flagged data.frame explicitly
}
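As a quick sanity check, both helpers can be exercised on toy vectors (an added illustration, not part of the original analysis):

# f1: caret passes a data.frame with obs/pred columns plus the class levels;
# here precision = 1 and recall = 0.5 for "eleito", so F1 should be 0.667
toy <- data.frame(obs  = factor(c("eleito", "eleito", "nao_eleito"),
                                levels = c("eleito", "nao_eleito")),
                  pred = factor(c("eleito", "nao_eleito", "nao_eleito"),
                                levels = c("eleito", "nao_eleito")))
f1(toy, lev = levels(toy$obs))

# F_Measure: expect one false negative and one false positive
F_Measure(expected  = c("eleito", "eleito", "nao_eleito"),
          predicted = c("eleito", "nao_eleito", "eleito")) %>%
  summarise(false_negative = sum(fn), false_positive = sum(fp))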

And we will remove all variables that are unnecessary or have little to no variance, because they contribute nothing to the models.

## Removing variables that are unnecessary or have minimal variance
train <- train %>% select(-nome, -ano, -ocupacao, -partido,
                          -estado_civil, -grau, -cargo,
                          -sequencial_candidato, -uf, -id)

validation <- validation %>% select(-nome, -ano, -ocupacao, -partido,
                                    -estado_civil, -grau, -cargo,
                                    -sequencial_candidato, -uf, -id)

# sequencial_candidato is kept in test: it is needed later as the submission ID
test <- test %>% select(-nome, -ano, -ocupacao, -partido,
                        -estado_civil, -grau, -cargo, -uf)

Finally, we can start to create the five models that we will test here, beginning with the decision tree.

Decision Tree

set.seed(1)

# Bootstrap resampling with SMOTE rebalancing (this control object is reused
# later by the AdaBoost model)
cctrl <- trainControl(summaryFunction = f1, classProbs = TRUE, method = "boot", sampling = "smote")
# Grid of complexity-parameter (cp) values to tune
lambda <- expand.grid(.cp = seq(from=0, to=0.02, by=0.005))

tree <- train(situacao ~ .,
              train,
              method="rpart",
              metric = "F1",
              tuneGrid = lambda,
              preProc = c("center", "scale"),
              trControl = trainControl(summaryFunction = f1, classProbs = TRUE, method = "boot"))

Resultant model

varImp(tree)
## rpart variable importance
## 
##                                        Overall
## total_receita                         100.0000
## total_despesa                          99.6657
## recursos_de_pessoas_juridicas          87.2176
## quantidade_fornecedores                61.8274
## quantidade_despesas                    61.4545
## quantidade_doadores                     3.6181
## recursos_proprios                       2.8066
## media_despesa                           1.5733
## recursos_de_partido_politico            1.4601
## media_receita                           0.9026
## quantidade_doacoes                      0.8577
## recursos_de_pessoas_fisicas             0.6909
## sexo.MASCULINO                          0.0000
## recursos_de_outros_candidatos.comites   0.0000
## sexo.FEMININO                           0.0000

and its best tuning

tree$finalModel
## n= 6860 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 6860 931 nao_eleito (0.13571429 0.86428571)  
##    2) total_receita>=0.06795873 1280 505 eleito (0.60546875 0.39453125)  
##      4) recursos_de_pessoas_juridicas>=1.03738 472 103 eleito (0.78177966 0.21822034) *
##      5) recursos_de_pessoas_juridicas< 1.03738 808 402 eleito (0.50247525 0.49752475)  
##       10) total_despesa>=0.6243804 310 123 eleito (0.60322581 0.39677419)  
##         20) recursos_proprios< 3.169534 278 100 eleito (0.64028777 0.35971223) *
##         21) recursos_proprios>=3.169534 32   9 nao_eleito (0.28125000 0.71875000) *
##       11) total_despesa< 0.6243804 498 219 nao_eleito (0.43975904 0.56024096)  
##         22) recursos_proprios>=-0.08563123 301 150 eleito (0.50166113 0.49833887)  
##           44) quantidade_despesas< 0.3188294 202  91 eleito (0.54950495 0.45049505)  
##             88) quantidade_doadores>=-0.0920028 158  62 eleito (0.60759494 0.39240506) *
##             89) quantidade_doadores< -0.0920028 44  15 nao_eleito (0.34090909 0.65909091) *
##           45) quantidade_despesas>=0.3188294 99  40 nao_eleito (0.40404040 0.59595960) *
##         23) recursos_proprios< -0.08563123 197  68 nao_eleito (0.34517766 0.65482234) *
##    3) total_receita< 0.06795873 5580 156 nao_eleito (0.02795699 0.97204301) *

Let’s see the final tree

# plot the model
plot(tree$finalModel, uniform=TRUE,
     main="Classification Tree")
text(tree$finalModel, all=TRUE, cex=.8)

and the same tree, but prettier

fancyRpartPlot(tree$finalModel)

Model validation using false positive and false negative percentages of predictions

data.frame(validation %>%
  select(-situacao) %>%
  predict(object=tree,.) %>%
  F_Measure(validation$situacao,.) %>% 
  summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
##   false_negative false_positive
## 1     0.03412073     0.04724409

and now the F1 performance graph of this model for all hyperparameters tested

ggplot(tree)

the F1 metric is medium to good, and in the validation predictions the false positives and false negatives together account for about 8%; for a medium F1, these are good results
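A full confusion matrix on the validation set complements these rates (an added check, not part of the original output):

confusionMatrix(predict(tree, validation), validation$situacao, positive = "eleito")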

Logistic regression

set.seed(1)
rlGrid <- expand.grid( cost = c(200,2,0.02),
                       loss = c("L2_dual"),
                       epsilon = c(0.001, 0.01) )

cctrl2 <- trainControl(summaryFunction = f1, classProbs = TRUE, method = "cv", number = 5, sampling = "smote")

logistic <- train(situacao ~ ., train,
                  method = "regLogistic",
                  trControl = cctrl2,
                  metric = "F1",
                  preProc = c("center", "scale"),
                  tuneGrid = rlGrid)
## Loading required package: grid

Resultant model

varImp(logistic)
## ROC curve variable importance
## 
##                                       Importance
## total_despesa                             100.00
## total_receita                              99.96
## quantidade_fornecedores                    90.21
## quantidade_despesas                        90.01
## recursos_de_pessoas_juridicas              89.79
## quantidade_doacoes                         85.81
## quantidade_doadores                        85.59
## media_receita                              82.48
## recursos_de_pessoas_fisicas                80.05
## media_despesa                              64.85
## recursos_proprios                          52.92
## recursos_de_outros_candidatos.comites      39.83
## recursos_de_partido_politico               32.78
## sexo.FEMININO                               0.00
## sexo.MASCULINO                              0.00

and its best tuning

logistic$finalModel
## $TypeDetail
## [1] "L2-regularized logistic regression dual (L2R_LR_DUAL)"
## 
## $Type
## [1] 7
## 
## $W
##      quantidade_doacoes quantidade_doadores total_receita media_receita
## [1,]         0.06761115            -1.51121    -0.3568663    -0.6368428
##      recursos_de_outros_candidatos.comites recursos_de_pessoas_fisicas
## [1,]                             0.0215394                  -0.8941101
##      recursos_de_pessoas_juridicas recursos_proprios
## [1,]                    -0.6634031        -0.1554319
##      recursos_de_partido_politico quantidade_despesas
## [1,]                    0.6104716            1.200539
##      quantidade_fornecedores total_despesa media_despesa sexo.FEMININO
## [1,]             -0.02511319       -2.6219    -0.4423231    0.01841145
##      sexo.MASCULINO       Bias
## [1,]    -0.01841145 -0.4740965
## 
## $Bias
## [1] 1
## 
## $ClassNames
## [1] nao_eleito eleito    
## Levels: eleito nao_eleito
## 
## $NbClass
## [1] 2
## 
## $xNames
##  [1] "quantidade_doacoes"                   
##  [2] "quantidade_doadores"                  
##  [3] "total_receita"                        
##  [4] "media_receita"                        
##  [5] "recursos_de_outros_candidatos.comites"
##  [6] "recursos_de_pessoas_fisicas"          
##  [7] "recursos_de_pessoas_juridicas"        
##  [8] "recursos_proprios"                    
##  [9] "recursos_de_partido_politico"         
## [10] "quantidade_despesas"                  
## [11] "quantidade_fornecedores"              
## [12] "total_despesa"                        
## [13] "media_despesa"                        
## [14] "sexo.FEMININO"                        
## [15] "sexo.MASCULINO"                       
## 
## $problemType
## [1] "Classification"
## 
## $tuneValue
##   cost    loss epsilon
## 3    2 L2_dual   0.001
## 
## $obsLevels
## [1] "eleito"     "nao_eleito"
## attr(,"ordered")
## [1] FALSE
## 
## $param
## list()
## 
## attr(,"class")
## [1] "LiblineaR"

Model validation using false positive and false negative percentages of predictions

data.frame(validation %>%
  select(-situacao) %>%
  predict(object=logistic,.) %>%
  F_Measure(validation$situacao,.) %>% 
  summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
##   false_negative false_positive
## 1     0.02099738     0.06299213

and now the F1 performance graph of this model for all hyperparameters tested

ggplot(logistic)

the F1 metric is medium, and in the validation predictions the false positives and false negatives together account for about 8%; for a medium F1, these are good results, with more false positives than false negatives.
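Since classProbs = TRUE was set, the class probabilities behind these predictions can also be inspected (an added illustration, assuming the regLogistic wrapper exposes probabilities, which caret’s model library does for this method):

predict(logistic, newdata = validation, type = "prob") %>% head()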

KNN

set.seed(1)
neighborsGrid <- expand.grid(.k = seq(from=1, to=50, by=1))

knn <- train(situacao ~ ., train,
             metric = "F1",
             method = "knn",
             tuneGrid = neighborsGrid,
             trControl = cctrl2)
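Note that, unlike the other models, this call does not pass preProc = c("center", "scale"); since KNN is distance-based, a centered and scaled variant is worth trying (a sketch under that assumption; knn_scaled is a hypothetical name and its results may differ from those reported below):

# same fit, but with explicit centering/scaling (illustrative variant)
knn_scaled <- train(situacao ~ ., train,
                    metric = "F1",
                    method = "knn",
                    tuneGrid = neighborsGrid,
                    preProc = c("center", "scale"),
                    trControl = cctrl2)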

Resultant model

varImp(knn)
## ROC curve variable importance
## 
##                                       Importance
## total_despesa                             100.00
## total_receita                              99.96
## quantidade_fornecedores                    90.21
## quantidade_despesas                        90.01
## recursos_de_pessoas_juridicas              89.79
## quantidade_doacoes                         85.81
## quantidade_doadores                        85.59
## media_receita                              82.48
## recursos_de_pessoas_fisicas                80.05
## media_despesa                              64.85
## recursos_proprios                          52.92
## recursos_de_outros_candidatos.comites      39.83
## recursos_de_partido_politico               32.78
## sexo.FEMININO                               0.00
## sexo.MASCULINO                              0.00

and its best tuning

knn$bestTune
##     k
## 35 35

Model validation using false positive and false negative percentages of predictions

data.frame(validation %>%
  select(-situacao) %>%
  predict(object=knn,.) %>%
  F_Measure(validation$situacao,.) %>% 
  summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
##   false_negative false_positive
## 1    0.005249344      0.1102362

and now the F1 performance graph of this model for all hyperparameters tested

ggplot(knn)

the F1 performance is similar to the decision tree’s, but the error profile on the validation set differs: very few false negatives and about 11% false positives.

AdaBoost

set.seed(1)
grid <- expand.grid(mfinal = (1:3)*3, maxdepth = c(1, 2),
                    coeflearn = c("Breiman", "Freund", "Zhu"))

seeds <- vector(mode = "list", length = nrow(train) + 1)
seeds <- lapply(seeds, function(x) 1:20)

cctrl2 <- trainControl(summaryFunction = f1, classProbs = TRUE, method = "cv", number = 5)
# Note: the fit below actually reuses cctrl (bootstrap + SMOTE), defined for
# the decision tree, rather than the cctrl2 object above
adaboost <- train(situacao ~ ., train,
                  method = "AdaBoost.M1",
                  trControl = cctrl,
                  na.action = na.exclude,
                  tuneGrid = grid,
                  metric = "F1",
                  preProc = c("center", "scale"))

Resultant model

varImp(adaboost)
## AdaBoost.M1 variable importance
## 
##                                        Overall
## total_receita                         100.0000
## total_despesa                          24.3953
## recursos_de_pessoas_juridicas           4.3038
## recursos_proprios                       4.1647
## media_despesa                           1.4067
## recursos_de_pessoas_fisicas             0.3218
## quantidade_doadores                     0.0000
## quantidade_fornecedores                 0.0000
## quantidade_doacoes                      0.0000
## sexo.FEMININO                           0.0000
## recursos_de_partido_politico            0.0000
## recursos_de_outros_candidatos.comites   0.0000
## quantidade_despesas                     0.0000
## media_receita                           0.0000
## sexo.MASCULINO                          0.0000

Model validation using false positive and false negative percentages of predictions

data.frame(validation %>%
  select(-situacao) %>%
  predict(object=adaboost,.) %>%
  F_Measure(validation$situacao,.) %>% 
  summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
##   false_negative false_positive
## 1    0.002624672      0.1154856

and now the F1 performance graph of this model for all hyperparameters tested

ggplot(adaboost)

the F1 metric is medium, with a curious validation result: a minimal percentage of false negatives and more than 11% of false positives.

In all models, and most explicitly in the decision tree, variables for money raised and spent in the campaign seem to be the most important, and show what really makes a difference for a candidate to be elected or not.
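caret can also plot these importance rankings directly, which makes the comparison easier to eyeball (an added illustration):

# top 10 variables of the decision tree's importance ranking
plot(varImp(tree), top = 10)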

Next, as an extra model, we will test a big and famous one: gradient boosting.

# Fixed seed lists so that the cross-validation resampling is reproducible
seeds <- vector(mode = "list", length = nrow(train) + 1)
seeds <- lapply(seeds, function(x) 1:20)

cctrl3 <- trainControl(
  summaryFunction = f1,
  classProbs = TRUE,
  method = "cv",
  number = 10,
  seeds = seeds)

gbm <- train(situacao ~ ., train,
             method = "gbm",
             trControl = cctrl3,
             metric = "F1",
             na.action = na.exclude,
             preProc = c("center", "scale"),
             tuneGrid = expand.grid(interaction.depth = c(1, 2, 3, 4, 5),
                                    shrinkage = c(.12, .13),
                                    n.trees = c(55, 56, 57, 58, 59),
                                    n.minobsinnode = c(2, 3)),
             verbose = FALSE,
             distribution = "adaboost")

and its best tuning

gbm$bestTune
##   n.trees interaction.depth shrinkage n.minobsinnode
## 6      55                 1      0.12              3

Model validation using false positive and false negative percentages of predictions

data.frame(validation %>%
  select(-situacao) %>%
  predict(object=gbm,.) %>%
  F_Measure(validation$situacao,.) %>% 
  summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
##   false_negative false_positive
## 1      0.0328084     0.04593176

and now the F1 performance graph of this model for all hyperparameters tested

ggplot(gbm)

the F1 metric is the best achieved among the models created in this study, even though the validation performance in false positives/negatives is practically the same as that of the other models.

Results

The results of this study show that the models achieve reasonable performance on the F1 metric, and the gradient boosting model was chosen for submission to the Kaggle competition. We also saw that, in general, variables referring to financial investment and corporate support were the features that showed the greatest relevance in all models.

Generating predictions for the test dataframe to submit to the related Kaggle competition.

# Build the submission file: candidate id plus predicted class
prediction <- predict(gbm, test)
data_out <- data.frame(ID = test$sequencial_candidato, Predicted = prediction)
data_out$ID <- as.character(data_out$ID)
data_out %>% write_csv(path = "response.csv")