Introduction¶

Data Analysis and Classification with Neural Networks on a subset of data about polls for the 2006 and 2010 elections in Brazil for the "Câmara Federal de Deputados". Data was taken from the TSE portal which originally encompassed approximately 7300 candidates.

Data Overview¶

The variables¶

The response variable is the variable that you are interesting in making measurements and conclusions on.

A predictor variable is a variable used to predict another variable.

Our response variable will be "situacao", we want to study how well the predictor variables can help predict its behavior and how they impact in the linear regression.

Each item corresponds to a candidate, the attributes of each item are as follows:¶

ano : Year at which the election took place.
sequencial_candidato : Sequential ID to map the candidates
nome : Name of the candidate
uf : Federate state to which the candidate belongs.
partido : Political party to which the candidate belongs.
quantidade_doacoes : Number of donations received during political campaign.
quantidade_doadores : Number of donors that contributed to the candidate's political campaign.
total_receita : Total revenue.
media_receita : Mean revenue.
recursos_de_outros_candidatos.comites : Revenue from other candidate's committees.
recursos_de_pessoas_fisicas : Revenue from individuals.
recursos_de_pessoas_juridicas : Revenue from legal entities.
recursos_proprios : Revenue from personal resources.
recursos_de_partido_politico : Revenue from political party.
quantidade_despesas : Number of expenses.
quantidade_fornecedores : Number of suppliers.
total_despesa : Total expenditure.
media_despesa : Mean expenditure.
cargo : Position.
sexo : Sex.
grau : Level of education.
estado_civil : Marital status.
ocupacao : Candidate's occupation up to the election.
situacao : Whether the candidate was elected.

Loading Data¶

library(dataPreparation)
library(tidyverse)
library(janitor)
library(magrittr)
library(GGally)
library(caret)
library(keras)
library(ROSE)
library(here)

theme_set(theme_bw())

Loading required package: lubridate

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date

Loading required package: stringr
Loading required package: Matrix
Loading required package: progress
dataPreparation 0.3.5
Type dataPrepNews() to see new features/changes/bug fixes.
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1.9000     ✔ readr   1.2.0     
✔ tibble  1.4.2          ✔ purrr   0.2.4     
✔ tidyr   0.8.0          ✔ dplyr   0.7.4     
✔ ggplot2 2.2.1.9000     ✔ forcats 0.3.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::as.difftime() masks base::as.difftime()
✖ lubridate::date()        masks base::date()
✖ tidyr::expand()          masks Matrix::expand()
✖ dplyr::filter()          masks stats::filter()
✖ lubridate::intersect()   masks base::intersect()
✖ dplyr::lag()             masks stats::lag()
✖ lubridate::setdiff()     masks base::setdiff()
✖ lubridate::union()       masks base::union()
✖ dplyr::vars()            masks ggplot2::vars()

Attaching package: ‘magrittr’

The following object is masked from ‘package:purrr’:

    set_names

The following object is masked from ‘package:tidyr’:

    extract


Attaching package: ‘GGally’

The following object is masked from ‘package:dplyr’:

    nasa

Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

Loaded ROSE 0.0-3

here() starts at /kaggle/working

Attaching package: ‘here’

The following object is masked from ‘package:lubridate’:

    here

readr::read_csv(here::here('../input/train_class.csv'),
                progress = FALSE,
                local=readr::locale("br"),
                col_types = cols(ano = col_integer(),
                                 sequencial_candidato = col_character(),
                                 quantidade_doacoes = col_integer(),
                                 quantidade_doadores = col_integer(),
                                 total_receita = col_double(),
                                 media_receita = col_double(),
                                 recursos_de_outros_candidatos.comites = col_double(),
                                 recursos_de_pessoas_fisicas = col_double(),
                                 recursos_de_pessoas_juridicas = col_double(),
                                 recursos_proprios = col_double(),
                                 `recursos_de_partido_politico` = col_double(),
                                 quantidade_despesas = col_integer(),
                                 quantidade_fornecedores = col_integer(),
                                 total_despesa = col_double(),
                                 media_despesa = col_double(),
                                 situacao = col_character(),
                                 .default = col_character())) %>%
  mutate(sequencial_candidato = as.numeric(sequencial_candidato),
         estado_civil = as.factor(estado_civil),
         ocupacao = as.factor(ocupacao),
         partido = as.factor(partido),
         cargo = as.factor(cargo),
         nome = as.factor(nome),
         grau = as.factor(grau),
         sexo = as.factor(sexo),
         uf = as.factor(uf)) -> data

data %>%
  glimpse()

Observations: 7,622
Variables: 24
$ ano                                   <int> 2006, 2006, 2006, 2006, 2006,...
$ sequencial_candidato                  <dbl> 10001, 10002, 10002, 10002, 1...
$ nome                                  <fct> JOSÉ LUIZ NOGUEIRA DE SOUSA, ...
$ uf                                    <fct> AP, RO, AP, MS, RO, AP, PI, M...
$ partido                               <fct> PT, PT, PT, PRONA, PT, PT, PC...
$ quantidade_doacoes                    <int> 6, 13, 17, 6, 48, 8, 6, 14, 2...
$ quantidade_doadores                   <int> 6, 13, 16, 6, 48, 8, 6, 7, 2,...
$ total_receita                         <dbl> 16600.00, 22826.00, 158120.80...
$ media_receita                         <dbl> 2766.67, 1755.85, 9301.22, 50...
$ recursos_de_outros_candidatos.comites <dbl> 0.00, 6625.00, 2250.00, 0.00,...
$ recursos_de_pessoas_fisicas           <dbl> 9000.00, 15000.00, 34150.00, ...
$ recursos_de_pessoas_juridicas         <dbl> 6300.00, 1000.00, 62220.80, 1...
$ recursos_proprios                     <dbl> 1300.00, 201.00, 59500.00, 75...
$ recursos_de_partido_politico          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 11...
$ quantidade_despesas                   <int> 14, 24, 123, 8, 133, 38, 9, 1...
$ quantidade_fornecedores               <int> 14, 23, 108, 8, 120, 37, 9, 1...
$ total_despesa                         <dbl> 16583.60, 20325.99, 146011.70...
$ media_despesa                         <dbl> 1184.54, 846.92, 1187.09, 375...
$ cargo                                 <fct> DEPUTADO FEDERAL, DEPUTADO FE...
$ sexo                                  <fct> MASCULINO, FEMININO, FEMININO...
$ grau                                  <fct> ENSINO MÉDIO COMPLETO, SUPERI...
$ estado_civil                          <fct> CASADO(A), SOLTEIRO(A), VIÚVO...
$ ocupacao                              <fct> VEREADOR, SERVIDOR PÚBLICO ES...
$ situacao                              <chr> "nao_eleito", "nao_eleito", "...

data %>%
  map_df(function(x) sum(is.na(x))) %>%
  gather(feature, num_nulls) %>%
  arrange(desc(num_nulls))

Data Exploration¶

Imbalance on class distribution¶

data %>%
  ggplot(aes(situacao)) +
  geom_bar() +
  labs(x="Situation", y="Absolute Frequency")

data %>%
  group_by(situacao) %>%
  summarise(num = n()) %>%
  ungroup() %>%
  mutate(total = sum(num),
         proportion = num/total)

br>

There's a strong imbalance in the class distribution of the dataset with around 13% of the entries in the class "eleito" (elected).¶

This imbalance can lead to a bias in the model that will learn to overlook the less frequent classes. Such bias can have a negative impact in the model generalization and its performance.
- We can restore balance by removing instances from the most frequent class $undersampling$.
- We can restore balance by adding instances from the most frequent class $oversampling$.

data %>% 
  select(-ano,
         -sequencial_candidato,
         -nome) %>%
  select(
    quantidade_doacoes,
    quantidade_doadores,
    total_receita,
    media_receita,
    recursos_de_outros_candidatos.comites,
    recursos_de_pessoas_fisicas,
    recursos_de_pessoas_juridicas,
    recursos_proprios,
    `recursos_de_partido_politico`) %>%
  na.omit() %>%
  ggcorr(palette = "RdBu", label = TRUE,
       hjust = 0.95, label_size = 3,size = 3,
       nbreaks = 5, layout.exp = 5) +
  ggtitle("Correlation plot for employed variables")

Predictors such as quantidade_doacoes (Number of Donations) and quantidade_doadores (Number of Donors) are highly correlated and therefore redundant.

Preparing data¶

Splitting data¶

set.seed(107)

data$id <- 1:nrow(data)

data %>% 
  dplyr::sample_frac(.8) -> train

cat("#### Train Shape",
    "\n##### Observations: ",nrow(train),
    "\n##### Variables: ",ncol(train))

#### Train Shape 
##### Observations:  6098 
##### Variables:  25

dplyr::anti_join(data, 
                 train, 
                 by = 'id') -> test

cat("#### Test Shape",
    "\n##### Observations: ",nrow(test),
    "\n##### Variables: ",ncol(test))

#### Test Shape 
##### Observations:  1524 
##### Variables:  25

train %>%
    select(-ano,-nome,-id,-sequencial_candidato) -> train

test %>%
    select(-ano,-nome,-id,-sequencial_candidato) -> test

train %>%
  dplyr::select_if(.,is.numeric) -> train.numeric

train %>%
  dplyr::select_if(.,negate(is.numeric)) -> train.categorical

test %>%
  dplyr::select_if(.,is.numeric) -> test.numeric

test %>%
  dplyr::select_if(.,negate(is.numeric)) -> test.categorical

train.numeric %>%
  preProcess(.,method = c("center","scale")) -> processParams

processParams %>%
  predict(.,train.numeric) -> train.numeric 

processParams %>% 
  predict(.,test.numeric) -> test.numeric 

processParams

Created from 6098 samples and 13 variables

Pre-processing:
  - centered (13)
  - ignored (0)
  - scaled (13)

train.numeric %>%
  dplyr::bind_cols(train.categorical) -> train

test.numeric %>%
  dplyr::bind_cols(test.categorical) -> test

Generate Balanced Data with ROSE algorithm¶

train %>%
clean_names() %>% 
ROSE(situacao ~ .,
     data =.,
     seed = 107) %$%
data -> train.rose

cat("#### Train Shape",
    "\n##### Observations: ",nrow(train.rose),
    "\n##### Variables: ",ncol(train.rose))

#### Train Shape 
##### Observations:  6098 
##### Variables:  21

train.rose %>%
  group_by(situacao) %>%
  summarise(num = n()) %>%
  ungroup() %>%
  mutate(total = sum(num),
         proportion = num/total)

One hot encoding¶

encoding <- build_encoding(dataSet = train,
                          cols = c("uf","sexo","grau","ocupacao",
                                   "partido","estado_civil"),
                          verbose = F)

train <- one_hot_encoder(dataSet = train,
                          encoding = encoding,
                          drop = TRUE,
                          verbose = F)

cat("#### Train Shape",
    "\n##### Observations: ",nrow(train),
    "\n##### Variables: ",ncol(train))

Warning message in `[.data.table`(dataSet, , `:=`((new_cols), 0)):
“Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.”

#### Train Shape 
##### Observations:  6098 
##### Variables:  263

train.rose <- one_hot_encoder(dataSet = train.rose,
                              encoding = encoding,
                              drop = TRUE,
                              verbose = F)

cat("#### Balanced Data Shape",
    "\n##### Observations: ",nrow(train.rose),
    "\n##### Variables: ",ncol(train.rose))

#### Balanced Data Shape 
##### Observations:  6098 
##### Variables:  263

test <- one_hot_encoder(dataSet = test,
                          encoding = encoding,
                          drop = TRUE,
                          verbose = F)

cat("#### Test Data Shape",
    "\n##### Observations: ",nrow(test),
    "\n##### Variables: ",ncol(test))

#### Test Data Shape 
##### Observations:  1524 
##### Variables:  263

Near Zero Variance Predictors¶

train %>%
  nearZeroVar(saveMetrics = TRUE) %>%
  tibble::rownames_to_column("variable") %>%
  filter(nzv == T) %>% 
  pull(variable) -> near_zero_vars

train %>% 
    select(-one_of(near_zero_vars)) -> train

train.rose %>% 
    select(-one_of(near_zero_vars)) -> train.rose

test %>%
    select(-one_of(near_zero_vars)) -> test


near_zero_vars %>% 
  glimpse()

 chr [1:224] "cargo" "uf.AC" "uf.AL" "uf.AM" "uf.AP" "uf.BA" "uf.CE" ...

Conform data to Keras¶

split_target_predictors <- function(df) {
    
  df %>%
    select(-situacao) %>%
    as.matrix() -> x_data

  df %>% 
    select(situacao) %>%  
    dummyVars(" ~ situacao", data = .,levelsOnly = TRUE) -> dmy

  df %>%
    select(situacao) %>%  
    data.frame(predict(dmy, newdata = .)) %>% 
    select(-situacao) %>%
    as.matrix() -> y_data
    
  dimnames(x_data) <- NULL
  dimnames(y_data) <- NULL
    
  newData <- list("predictors" = x_data, "target" = y_data)
  return(newData)
}

Prepare test data¶

test %>%
   split_target_predictors() -> x_test
  
# extract target and predictors
y_test <- x_test$target
x_test <- x_test$predictors

Prepare submission data¶

readr::read_csv(here::here('../input/test_class.csv'),
                progress = FALSE,
                local=readr::locale("br"),
                col_types = cols(ano = col_integer(),
                                 sequencial_candidato = col_character(),
                                 quantidade_doacoes = col_integer(),
                                 quantidade_doadores = col_integer(),
                                 total_receita = col_double(),
                                 media_receita = col_double(),
                                 recursos_de_outros_candidatos.comites = col_double(),
                                 recursos_de_pessoas_fisicas = col_double(),
                                 recursos_de_pessoas_juridicas = col_double(),
                                 recursos_proprios = col_double(),
                                 `recursos_de_partido_politico` = col_double(),
                                 quantidade_despesas = col_integer(),
                                 quantidade_fornecedores = col_integer(),
                                 total_despesa = col_double(),
                                 media_despesa = col_double(),
                                 .default = col_character())) %>%
  mutate(sequencial_candidato = as.numeric(sequencial_candidato),
         estado_civil = as.factor(estado_civil),
         ocupacao = as.factor(ocupacao),
         partido = as.factor(partido),
         cargo = as.factor(cargo),
         nome = as.factor(nome),
         grau = as.factor(grau),
         sexo = as.factor(sexo),
         uf = as.factor(uf)) -> submit_data

submit_data %$% 
  sequencial_candidato -> Id

submit_data %>%
  glimpse()

Observations: 4,592
Variables: 23
$ ano                                   <int> 2014, 2014, 2014, 2014, 2014,...
$ sequencial_candidato                  <dbl> 1e+10, 1e+10, 1e+10, 1e+10, 1...
$ nome                                  <fct> EMERSON DA SILVA SANTOS, GERA...
$ uf                                    <fct> AC, AC, AC, AC, AC, AC, AC, A...
$ partido                               <fct> PSOL, PSOL, PSB, PT, PT, PT, ...
$ quantidade_doacoes                    <int> 3, 5, 40, 29, 160, 4, 48, 15,...
$ quantidade_doadores                   <int> 3, 5, 38, 29, 146, 3, 48, 13,...
$ total_receita                         <dbl> 1580.00, 3180.00, 336793.13, ...
$ media_receita                         <dbl> 526.67, 636.00, 8419.83, 5404...
$ recursos_de_outros_candidatos.comites <dbl> 0.00, 0.00, 1923.07, 39122.32...
$ recursos_de_pessoas_fisicas           <dbl> 1500.00, 3100.00, 65700.00, 6...
$ recursos_de_pessoas_juridicas         <dbl> 0.00, 0.00, 154170.06, 17000....
$ recursos_proprios                     <dbl> 0.00, 0.00, 115000.00, 6813.1...
$ recursos_de_partido_politico          <dbl> 80.00, 80.00, 0.00, 25000.00,...
$ quantidade_despesas                   <int> 3, 6, 145, 136, 518, 12, 336,...
$ quantidade_fornecedores               <int> 3, 5, 139, 121, 354, 12, 281,...
$ total_despesa                         <dbl> 1580.00, 3130.02, 326869.78, ...
$ media_despesa                         <dbl> 526.67, 521.67, 2254.27, 1772...
$ cargo                                 <fct> DEPUTADO FEDERAL, DEPUTADO FE...
$ sexo                                  <fct> MASCULINO, MASCULINO, MASCULI...
$ grau                                  <fct> ENSINO MÉDIO COMPLETO, SUPERI...
$ estado_civil                          <fct> SOLTEIRO(A), SOLTEIRO(A), CAS...
$ ocupacao                              <fct> CORRETOR DE IMÓVEIS, SEGUROS,...

submit_data %>%
    select(-ano,-nome,-sequencial_candidato) -> submit_data

submit_data %>%
  dplyr::select_if(.,is.numeric) -> submit_data.numeric

submit_data %>%
  dplyr::select_if(.,negate(is.numeric)) -> submit_data.categorical

processParams %>% 
  predict(.,submit_data.numeric) -> submit_data.numeric 

submit_data.numeric %>%
  dplyr::bind_cols(submit_data.categorical) -> submit_data

submit_data <- one_hot_encoder(dataSet = submit_data,
                               encoding = encoding,
                               drop = TRUE,
                               verbose = F)

submit_data %>%
    select(-one_of(near_zero_vars)) -> submit_data

submit_data %>% 
glimpse()

Observations: 4,592
Variables: 38
$ quantidade_doacoes                    <dbl> -0.19978723, -0.18415635, 0.0...
$ quantidade_doadores                   <dbl> -0.20625998, -0.18832944, 0.1...
$ total_receita                         <dbl> -0.38636694, -0.38280827, 0.3...
$ media_receita                         <dbl> -0.343680979, -0.336117447, 0...
$ recursos_de_outros_candidatos.comites <dbl> -0.190636839, -0.190636839, -...
$ recursos_de_pessoas_fisicas           <dbl> -0.3296413, -0.3056035, 0.634...
$ recursos_de_pessoas_juridicas         <dbl> -0.31502228, -0.31502228, 0.2...
$ recursos_proprios                     <dbl> -0.16885161, -0.16885161, 0.5...
$ recursos_de_partido_politico          <dbl> -0.191460411, -0.191460411, -...
$ quantidade_despesas                   <dbl> -0.305124337, -0.298150602, 0...
$ quantidade_fornecedores               <dbl> -0.30825230, -0.30254043, 0.0...
$ total_despesa                         <dbl> -0.39456340, -0.39063257, 0.4...
$ media_despesa                         <dbl> -0.123162519, -0.123902300, 0...
$ uf.MG                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ uf.PR                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ uf.RJ                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ uf.RS                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ uf.SP                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ sexo.FEMININO                         <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
$ sexo.MASCULINO                        <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,...
$ `grau.ENSINO FUNDAMENTAL COMPLETO`    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ `grau.ENSINO MÉDIO COMPLETO`          <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
$ `grau.SUPERIOR COMPLETO`              <int> 0, 1, 0, 1, 1, 0, 1, 1, 1, 1,...
$ `grau.SUPERIOR INCOMPLETO`            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ ocupacao.ADVOGADO                     <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
$ ocupacao.DEPUTADO                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
$ ocupacao.EMPRESÁRIO                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ ocupacao.OUTROS                       <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PDT                           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PMDB                          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PSB                           <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PSDB                          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PT                            <int> 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,...
$ partido.PTB                           <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
$ partido.PV                            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ estado.civil.CASADO.A.                <int> 0, 0, 1, 1, 1, 0, 1, 0, 1, 1,...
$ estado.civil.DIVORCIADO.A.            <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
$ estado.civil.SOLTEIRO.A.              <int> 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,...

submit_data %>%
  as.matrix() -> x_submit
  
dimnames(x_submit) <- NULL

Base Neural Networks¶

# Initialize a sequential model
model.simple <- keras_model_sequential()

# Add layers to the model
model.simple %>%
  layer_dense(units = 8, activation = 'relu', input_shape = c(38)) %>% 
  layer_dense(units = 2, activation = 'softmax')

summary(model.simple)

________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
================================================================================
dense_1 (Dense)                     (None, 8)                       312         
________________________________________________________________________________
dense_2 (Dense)                     (None, 2)                       18          
================================================================================
Total params: 330
Trainable params: 330
Non-trainable params: 0
________________________________________________________________________________

# Initialize a sequential model
model.complex <- keras_model_sequential()

# Add layers to the model
model.complex %>%
  layer_dense(units = 8, activation = 'relu', input_shape = c(38)) %>% 
  layer_dense(units = 6, activation = 'relu') %>% 
  layer_dense(units = 10) %>% 
  layer_dense(units = 2, activation = 'softmax')

summary(model.complex)

________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
================================================================================
dense_3 (Dense)                     (None, 8)                       312         
________________________________________________________________________________
dense_4 (Dense)                     (None, 6)                       54          
________________________________________________________________________________
dense_5 (Dense)                     (None, 10)                      70          
________________________________________________________________________________
dense_6 (Dense)                     (None, 2)                       22          
================================================================================
Total params: 458
Trainable params: 458
Non-trainable params: 0
________________________________________________________________________________

Cross Validation and Tuning¶

neuralNetCV <- function(df,
                        model,
                        k=5,
                        loss_method = 'binary_crossentropy',
                        optim_approach = 'adam',
                        summary_metrics='accuracy',
                        epochs = 200,
                        batch_size = 5,
                        validation_split = 0.2) {
    
  model %>% 
    compile(
        loss = loss_method,
        optimizer = optim_approach,
        metrics = summary_metrics)
  df %>%
    mutate(folds = sample(rep_len(1:k, nrow(.)))) -> df
  
  result <-  data.frame("loss"=c(),metrics=c()) 
    
  for(f in unique(df$folds)){

  # split into train/validation
    df %>%
      filter(folds == f) -> train_df
    
    df %>%
      filter(folds != f) -> valid_df
      
 # Remove auxiliary column
    train_df %>%
      select(-folds) -> train_df
      
    valid_df %>%
      select(-folds) -> valid_df
      
  # create matrices
    train_df %>%
      split_target_predictors() -> x_train
    
    valid_df %>%
      split_target_predictors() -> x_valid
  
  # extract target and predictors
    y_train <- x_train$target
    x_train <- x_train$predictors
  
    y_valid <- x_valid$target
    x_valid <- x_valid$predictors
    
  # Train model
  history <- model %>% fit(
      x_train, y_train,
      epochs = epochs,
      batch_size = batch_size, 
      validation_split = validation_split)
  
  # Evaluate the model
    model %>%
      evaluate(x_valid,
               y_valid,
               batch_size = 128) -> score
  
  score %>%
      as.data.frame() -> temporary
      
  result <- rbind(result,temporary)
  }
 cvsummary <- list("result"=result,
                   "history"=history,
                   "model"=model)   
 return(cvsummary);
}

tuneNeuralNetwork <- function(model,data,paramsGrid,
                              target,k=5) {
   environment(neuralNetCV) <- environment()    

    best_accuracy <- 0
    best_loss <- 0
    best_tune <- NULL
    best_history <- NULL
    best_model <- NULL
    optmizer <- NULL
    losses <- c()
    acc <- c()

     for(i in 1:nrow(paramsGrid)) {  
        row <- paramsGrid[i,]
        if (row$optim_approach == "sgd") 
            optmizer <- optimizer_sgd(lr = 0.01)
        else if (row$optim_approach == "rmsprop") 
            optmizer <- optimizer_rmsprop(lr = 0.001, rho = 0.9)
        else if (row$optim_approach == "adam") 
            optmizer <- optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999)

        # apply CV Kfold to particular set of params
        neuralNetCV(data,
                    model=model,
                    k=k,
                    loss_method = paste0(row$loss_method),
                    optim_approach = optmizer,
                    summary_metrics= paste0(row$summary_metrics),
                    epochs = paste0(row$epochs),
                    batch_size = paste0(row$batch_size),
                    validation_split = paste0(row$validation_split)) -> cvsummary

        cvsummary %$%
        result %$%
        acc %>%
        mean() -> mean_accuracy

        cvsummary %$%
        result %$%
        loss %>%
        mean() -> mean_loss
       
        losses <- append(losses,mean_loss)  
        acc <- append(acc, mean_accuracy)

        if(mean_accuracy > best_accuracy) {
            best_accuracy <- mean_accuracy
            best_history <- cvsummary$history
            best_model <- cvsummary$model
            best_result <- mean_accuracy
            best_loss <- mean_loss
            best_tune <- row
        }    
    }
    paramsGrid$loss <- losses
    paramsGrid$accuracy <- acc
    best_tune$accuracy <- best_accuracy
    best_tune$loss <- best_loss

    result <- list("history"=best_history,
                   "best_tune"=best_tune,
                   "iterations"=paramsGrid,
                   "best_model"=best_model)
    return(result)

    }

Tuning Parameters¶

optm <- c("sgd", "adam")
validation_split <- c(0.2,0.4)

paramsGrid <- expand.grid(optim_approach=optm,
                          loss_method = 'binary_crossentropy',
                          summary_metrics='accuracy',
                          epochs = 200,
                          batch_size = 5,
                          validation_split = validation_split)

paramsGrid

Simplified Neural Net with Unbalaced data¶

tuneNeuralNetwork(model.simple,
                  train,paramsGrid,
                  k=5) -> tunedSimple

Test Score¶

tunedSimple %$%
    best_model %>%
    evaluate(x_test,
             y_test,
             batch_size = 128) -> score.simple

score.simple

Visualize Train Fit¶

tunedSimple  %$%
    history %>%
    plot()

Make Predictions¶

tunedSimple %$% 
    best_model %>%
    predict_classes(x_submit, batch_size = 128) -> classes.simple

data.frame(Id=Id,
          Predicted=classes.simple) -> submission.simple

submission.simple %>%
  mutate(Id = as.character(Id)) %>%
  write_csv("nn_submission_simple.csv")

Simplified Neural Network with Balanced data (ROSE)¶

tuneNeuralNetwork(model.simple,
                  train.rose,paramsGrid,
                  k=5) -> tunedSimpleRose

Test Score¶

tunedSimpleRose %$%
    best_model %>%
    evaluate(x_test,
             y_test,
             batch_size = 128) -> score.simple.rose

score.simple.rose

Visualize Train Fit¶

tunedSimpleRose  %$%
    history %>%
    plot()

Make Predictions¶

tunedSimpleRose %$% 
    best_model %>%
    predict_classes(x_submit, batch_size = 128) -> classes.simple.rose

data.frame(Id=Id,
          Predicted=classes.simple.rose) -> submission.simple.rose

submission.simple.rose %>%
  mutate(Id = as.character(Id)) %>%
  write_csv("nn_submission_simple_rose.csv")

Slightly more complex Neural Networn with unbalaced data¶

tuneNeuralNetwork(model.complex,
                  train,paramsGrid,
                  k=5) -> tunedComplex

Test Score¶

tunedComplex %$%
    best_model %>%
    evaluate(x_test,
             y_test,
             batch_size = 128) -> score.complex

score.complex

Visualize Train Fit¶

tunedComplex  %$%
    history %>%
    plot()

Make Predictions¶

tunedComplex %$% 
    best_model %>%
    predict_classes(x_submit, batch_size = 128) -> classes.complex

data.frame(Id=Id,
          Predicted=classes.complex) -> submission.complex

submission.complex %>%
  mutate(Id = as.character(Id)) %>%
  write_csv("nn_submission_complex.csv")

Slightly more complex Neural Networn with balaced data (ROSE)¶

tuneNeuralNetwork(model.complex,
                  train.rose,paramsGrid,
                  k=5) -> tunedComplexRose

Test Score¶

tunedComplexRose %$%
    best_model %>%
    evaluate(x_test,
             y_test,
             batch_size = 128) -> score.complex.rose

score.complex.rose

Visualize Train Fit¶

tunedComplexRose  %$%
    history %>%
    plot()

Make Predictions¶

tunedComplexRose %$% 
    best_model %>%
    predict_classes(x_submit, batch_size = 128) -> classes.complex.rose

data.frame(Id=Id,
          Predicted=classes.complex.rose) -> submission.complex.rose

submission.complex.rose %>%
  mutate(Id = as.character(Id)) %>%
  write_csv("nn_submission_complex_rose.csv")

feature	num_nulls
ano	0
sequencial_candidato	0
nome	0
uf	0
partido	0
quantidade_doacoes	0
quantidade_doadores	0
total_receita	0
media_receita	0
recursos_de_outros_candidatos.comites	0
recursos_de_pessoas_fisicas	0
recursos_de_pessoas_juridicas	0
recursos_proprios	0
recursos_de_partido_politico	0
quantidade_despesas	0
quantidade_fornecedores	0
total_despesa	0
media_despesa	0
cargo	0
sexo	0
grau	0
estado_civil	0
ocupacao	0
situacao	0

optim_approach	loss_method	summary_metrics	epochs	batch_size	validation_split
sgd	binary_crossentropy	accuracy	200	5	0.2
adam	binary_crossentropy	accuracy	200	5	0.2
sgd	binary_crossentropy	accuracy	200	5	0.4
adam	binary_crossentropy	accuracy	200	5	0.4

situacao	num	total	proportion
eleito	1026	7622	0.1346103
nao_eleito	6596	7622	0.8653897

situacao	num	total	proportion
eleito	3030	6098	0.4968842
nao_eleito	3068	6098	0.5031158