Introduction


Data Analysis and Classification with Neural Networks on a subset of data about polls for the 2006 and 2010 elections in Brazil for the "Câmara Federal de Deputados". Data was taken from the TSE portal which originally encompassed approximately 7300 candidates.




Data Overview

The variables


The response variable is the variable that you are interesting in making measurements and conclusions on.

A predictor variable is a variable used to predict another variable.

Our response variable will be "situacao", we want to study how well the predictor variables can help predict its behavior and how they impact in the linear regression.


Each item corresponds to a candidate, the attributes of each item are as follows:

  • ano : Year at which the election took place.
  • sequencial_candidato : Sequential ID to map the candidates
  • nome : Name of the candidate
  • uf : Federate state to which the candidate belongs.
  • partido : Political party to which the candidate belongs.
  • quantidade_doacoes : Number of donations received during political campaign.
  • quantidade_doadores : Number of donors that contributed to the candidate's political campaign.
  • total_receita : Total revenue.
  • media_receita : Mean revenue.
  • recursos_de_outros_candidatos.comites : Revenue from other candidate's committees.
  • recursos_de_pessoas_fisicas : Revenue from individuals.
  • recursos_de_pessoas_juridicas : Revenue from legal entities.
  • recursos_proprios : Revenue from personal resources.
  • recursos_de_partido_politico : Revenue from political party.
  • quantidade_despesas : Number of expenses.
  • quantidade_fornecedores : Number of suppliers.
  • total_despesa : Total expenditure.
  • media_despesa : Mean expenditure.
  • cargo : Position.
  • sexo : Sex.
  • grau : Level of education.
  • estado_civil : Marital status.
  • ocupacao : Candidate's occupation up to the election.
  • situacao : Whether the candidate was elected.


Loading Data

In [1]:
library(dataPreparation)
library(tidyverse)
library(janitor)
library(magrittr)
library(GGally)
library(caret)
library(keras)
library(ROSE)
library(here)

theme_set(theme_bw())
Loading required package: lubridate

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date

Loading required package: stringr
Loading required package: Matrix
Loading required package: progress
dataPreparation 0.3.5
Type dataPrepNews() to see new features/changes/bug fixes.
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1.9000     ✔ readr   1.2.0     
✔ tibble  1.4.2          ✔ purrr   0.2.4     
✔ tidyr   0.8.0          ✔ dplyr   0.7.4     
✔ ggplot2 2.2.1.9000     ✔ forcats 0.3.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::as.difftime() masks base::as.difftime()
✖ lubridate::date()        masks base::date()
✖ tidyr::expand()          masks Matrix::expand()
✖ dplyr::filter()          masks stats::filter()
✖ lubridate::intersect()   masks base::intersect()
✖ dplyr::lag()             masks stats::lag()
✖ lubridate::setdiff()     masks base::setdiff()
✖ lubridate::union()       masks base::union()
✖ dplyr::vars()            masks ggplot2::vars()

Attaching package: ‘magrittr’

The following object is masked from ‘package:purrr’:

    set_names

The following object is masked from ‘package:tidyr’:

    extract


Attaching package: ‘GGally’

The following object is masked from ‘package:dplyr’:

    nasa

Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

Loaded ROSE 0.0-3

here() starts at /kaggle/working

Attaching package: ‘here’

The following object is masked from ‘package:lubridate’:

    here

In [2]:
readr::read_csv(here::here('../input/train_class.csv'),
                progress = FALSE,
                local=readr::locale("br"),
                col_types = cols(ano = col_integer(),
                                 sequencial_candidato = col_character(),
                                 quantidade_doacoes = col_integer(),
                                 quantidade_doadores = col_integer(),
                                 total_receita = col_double(),
                                 media_receita = col_double(),
                                 recursos_de_outros_candidatos.comites = col_double(),
                                 recursos_de_pessoas_fisicas = col_double(),
                                 recursos_de_pessoas_juridicas = col_double(),
                                 recursos_proprios = col_double(),
                                 `recursos_de_partido_politico` = col_double(),
                                 quantidade_despesas = col_integer(),
                                 quantidade_fornecedores = col_integer(),
                                 total_despesa = col_double(),
                                 media_despesa = col_double(),
                                 situacao = col_character(),
                                 .default = col_character())) %>%
  mutate(sequencial_candidato = as.numeric(sequencial_candidato),
         estado_civil = as.factor(estado_civil),
         ocupacao = as.factor(ocupacao),
         partido = as.factor(partido),
         cargo = as.factor(cargo),
         nome = as.factor(nome),
         grau = as.factor(grau),
         sexo = as.factor(sexo),
         uf = as.factor(uf)) -> data

data %>%
  glimpse()
Observations: 7,622
Variables: 24
$ ano                                   <int> 2006, 2006, 2006, 2006, 2006,...
$ sequencial_candidato                  <dbl> 10001, 10002, 10002, 10002, 1...
$ nome                                  <fct> JOSÉ LUIZ NOGUEIRA DE SOUSA, ...
$ uf                                    <fct> AP, RO, AP, MS, RO, AP, PI, M...
$ partido                               <fct> PT, PT, PT, PRONA, PT, PT, PC...
$ quantidade_doacoes                    <int> 6, 13, 17, 6, 48, 8, 6, 14, 2...
$ quantidade_doadores                   <int> 6, 13, 16, 6, 48, 8, 6, 7, 2,...
$ total_receita                         <dbl> 16600.00, 22826.00, 158120.80...
$ media_receita                         <dbl> 2766.67, 1755.85, 9301.22, 50...
$ recursos_de_outros_candidatos.comites <dbl> 0.00, 6625.00, 2250.00, 0.00,...
$ recursos_de_pessoas_fisicas           <dbl> 9000.00, 15000.00, 34150.00, ...
$ recursos_de_pessoas_juridicas         <dbl> 6300.00, 1000.00, 62220.80, 1...
$ recursos_proprios                     <dbl> 1300.00, 201.00, 59500.00, 75...
$ recursos_de_partido_politico          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 11...
$ quantidade_despesas                   <int> 14, 24, 123, 8, 133, 38, 9, 1...
$ quantidade_fornecedores               <int> 14, 23, 108, 8, 120, 37, 9, 1...
$ total_despesa                         <dbl> 16583.60, 20325.99, 146011.70...
$ media_despesa                         <dbl> 1184.54, 846.92, 1187.09, 375...
$ cargo                                 <fct> DEPUTADO FEDERAL, DEPUTADO FE...
$ sexo                                  <fct> MASCULINO, FEMININO, FEMININO...
$ grau                                  <fct> ENSINO MÉDIO COMPLETO, SUPERI...
$ estado_civil                          <fct> CASADO(A), SOLTEIRO(A), VIÚVO...
$ ocupacao                              <fct> VEREADOR, SERVIDOR PÚBLICO ES...
$ situacao                              <chr> "nao_eleito", "nao_eleito", "...
In [3]:
data %>%
  map_df(function(x) sum(is.na(x))) %>%
  gather(feature, num_nulls) %>%
  arrange(desc(num_nulls))
featurenum_nulls
ano 0
sequencial_candidato 0
nome 0
uf 0
partido 0
quantidade_doacoes 0
quantidade_doadores 0
total_receita 0
media_receita 0
recursos_de_outros_candidatos.comites0
recursos_de_pessoas_fisicas 0
recursos_de_pessoas_juridicas 0
recursos_proprios 0
recursos_de_partido_politico 0
quantidade_despesas 0
quantidade_fornecedores 0
total_despesa 0
media_despesa 0
cargo 0
sexo 0
grau 0
estado_civil 0
ocupacao 0
situacao 0

Data Exploration

Imbalance on class distribution

In [4]:
data %>%
  ggplot(aes(situacao)) +
  geom_bar() +
  labs(x="Situation", y="Absolute Frequency")
In [5]:
data %>%
  group_by(situacao) %>%
  summarise(num = n()) %>%
  ungroup() %>%
  mutate(total = sum(num),
         proportion = num/total)
situacaonumtotalproportion
eleito 1026 7622 0.1346103
nao_eleito6596 7622 0.8653897

br>

There's a strong imbalance in the class distribution of the dataset with around 13% of the entries in the class "eleito" (elected).

  • This imbalance can lead to a bias in the model that will learn to overlook the less frequent classes. Such bias can have a negative impact in the model generalization and its performance.
    • We can restore balance by removing instances from the most frequent class $undersampling$.
    • We can restore balance by adding instances from the most frequent class $oversampling$.
In [6]:
data %>% 
  select(-ano,
         -sequencial_candidato,
         -nome) %>%
  select(
    quantidade_doacoes,
    quantidade_doadores,
    total_receita,
    media_receita,
    recursos_de_outros_candidatos.comites,
    recursos_de_pessoas_fisicas,
    recursos_de_pessoas_juridicas,
    recursos_proprios,
    `recursos_de_partido_politico`) %>%
  na.omit() %>%
  ggcorr(palette = "RdBu", label = TRUE,
       hjust = 0.95, label_size = 3,size = 3,
       nbreaks = 5, layout.exp = 5) +
  ggtitle("Correlation plot for employed variables")
  • Predictors such as quantidade_doacoes (Number of Donations) and quantidade_doadores (Number of Donors) are highly correlated and therefore redundant.

Preparing data

Splitting data

In [7]:
set.seed(107)

data$id <- 1:nrow(data)

data %>% 
  dplyr::sample_frac(.8) -> train

cat("#### Train Shape",
    "\n##### Observations: ",nrow(train),
    "\n##### Variables: ",ncol(train))
#### Train Shape 
##### Observations:  6098 
##### Variables:  25
In [8]:
dplyr::anti_join(data, 
                 train, 
                 by = 'id') -> test

cat("#### Test Shape",
    "\n##### Observations: ",nrow(test),
    "\n##### Variables: ",ncol(test))
#### Test Shape 
##### Observations:  1524 
##### Variables:  25
In [9]:
train %>%
    select(-ano,-nome,-id,-sequencial_candidato) -> train

test %>%
    select(-ano,-nome,-id,-sequencial_candidato) -> test
In [10]:
train %>%
  dplyr::select_if(.,is.numeric) -> train.numeric

train %>%
  dplyr::select_if(.,negate(is.numeric)) -> train.categorical

test %>%
  dplyr::select_if(.,is.numeric) -> test.numeric

test %>%
  dplyr::select_if(.,negate(is.numeric)) -> test.categorical
In [11]:
train.numeric %>%
  preProcess(.,method = c("center","scale")) -> processParams

processParams %>%
  predict(.,train.numeric) -> train.numeric 

processParams %>% 
  predict(.,test.numeric) -> test.numeric 

processParams
Created from 6098 samples and 13 variables

Pre-processing:
  - centered (13)
  - ignored (0)
  - scaled (13)
In [12]:
train.numeric %>%
  dplyr::bind_cols(train.categorical) -> train

test.numeric %>%
  dplyr::bind_cols(test.categorical) -> test

Generate Balanced Data with ROSE algorithm

In [13]:
train %>%
clean_names() %>% 
ROSE(situacao ~ .,
     data =.,
     seed = 107) %$%
data -> train.rose

cat("#### Train Shape",
    "\n##### Observations: ",nrow(train.rose),
    "\n##### Variables: ",ncol(train.rose))
#### Train Shape 
##### Observations:  6098 
##### Variables:  21
In [14]:
train.rose %>%
  group_by(situacao) %>%
  summarise(num = n()) %>%
  ungroup() %>%
  mutate(total = sum(num),
         proportion = num/total)
situacaonumtotalproportion
eleito 3030 6098 0.4968842
nao_eleito3068 6098 0.5031158

One hot encoding

In [15]:
encoding <- build_encoding(dataSet = train,
                          cols = c("uf","sexo","grau","ocupacao",
                                   "partido","estado_civil"),
                          verbose = F)

train <- one_hot_encoder(dataSet = train,
                          encoding = encoding,
                          drop = TRUE,
                          verbose = F)

cat("#### Train Shape",
    "\n##### Observations: ",nrow(train),
    "\n##### Variables: ",ncol(train))
Warning message in `[.data.table`(dataSet, , `:=`((new_cols), 0)):
“Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.”
#### Train Shape 
##### Observations:  6098 
##### Variables:  263
In [16]:
train.rose <- one_hot_encoder(dataSet = train.rose,
                              encoding = encoding,
                              drop = TRUE,
                              verbose = F)

cat("#### Balanced Data Shape",
    "\n##### Observations: ",nrow(train.rose),
    "\n##### Variables: ",ncol(train.rose))
#### Balanced Data Shape 
##### Observations:  6098 
##### Variables:  263
In [17]:
test <- one_hot_encoder(dataSet = test,
                          encoding = encoding,
                          drop = TRUE,
                          verbose = F)

cat("#### Test Data Shape",
    "\n##### Observations: ",nrow(test),
    "\n##### Variables: ",ncol(test))
#### Test Data Shape 
##### Observations:  1524 
##### Variables:  263

Near Zero Variance Predictors

In [18]:
train %>%
  nearZeroVar(saveMetrics = TRUE) %>%
  tibble::rownames_to_column("variable") %>%
  filter(nzv == T) %>% 
  pull(variable) -> near_zero_vars

train %>% 
    select(-one_of(near_zero_vars)) -> train

train.rose %>% 
    select(-one_of(near_zero_vars)) -> train.rose

test %>%
    select(-one_of(near_zero_vars)) -> test


near_zero_vars %>% 
  glimpse()
 chr [1:224] "cargo" "uf.AC" "uf.AL" "uf.AM" "uf.AP" "uf.BA" "uf.CE" ...

Conform data to Keras

In [19]:
split_target_predictors <- function(df) {
    
  df %>%
    select(-situacao) %>%
    as.matrix() -> x_data

  df %>% 
    select(situacao) %>%  
    dummyVars(" ~ situacao", data = .,levelsOnly = TRUE) -> dmy

  df %>%
    select(situacao) %>%  
    data.frame(predict(dmy, newdata = .)) %>% 
    select(-situacao) %>%
    as.matrix() -> y_data
    
  dimnames(x_data) <- NULL
  dimnames(y_data) <- NULL
    
  newData <- list("predictors" = x_data, "target" = y_data)
  return(newData)
}

Prepare test data

In [20]:
test %>%
   split_target_predictors() -> x_test
  
# extract target and predictors
y_test <- x_test$target
x_test <- x_test$predictors

Prepare submission data

In [21]:
readr::read_csv(here::here('../input/test_class.csv'),
                progress = FALSE,
                local=readr::locale("br"),
                col_types = cols(ano = col_integer(),
                                 sequencial_candidato = col_character(),
                                 quantidade_doacoes = col_integer(),
                                 quantidade_doadores = col_integer(),
                                 total_receita = col_double(),
                                 media_receita = col_double(),
                                 recursos_de_outros_candidatos.comites = col_double(),
                                 recursos_de_pessoas_fisicas = col_double(),
                                 recursos_de_pessoas_juridicas = col_double(),
                                 recursos_proprios = col_double(),
                                 `recursos_de_partido_politico` = col_double(),
                                 quantidade_despesas = col_integer(),
                                 quantidade_fornecedores = col_integer(),
                                 total_despesa = col_double(),
                                 media_despesa = col_double(),
                                 .default = col_character())) %>%
  mutate(sequencial_candidato = as.numeric(sequencial_candidato),
         estado_civil = as.factor(estado_civil),
         ocupacao = as.factor(ocupacao),
         partido = as.factor(partido),
         cargo = as.factor(cargo),
         nome = as.factor(nome),
         grau = as.factor(grau),
         sexo = as.factor(sexo),
         uf = as.factor(uf)) -> submit_data

submit_data %$% 
  sequencial_candidato -> Id

submit_data %>%
  glimpse()
Observations: 4,592
Variables: 23
$ ano                                   <int> 2014, 2014, 2014, 2014, 2014,...
$ sequencial_candidato                  <dbl> 1e+10, 1e+10, 1e+10, 1e+10, 1...
$ nome                                  <fct> EMERSON DA SILVA SANTOS, GERA...
$ uf                                    <fct> AC, AC, AC, AC, AC, AC, AC, A...
$ partido                               <fct> PSOL, PSOL, PSB, PT, PT, PT, ...
$ quantidade_doacoes                    <int> 3, 5, 40, 29, 160, 4, 48, 15,...
$ quantidade_doadores                   <int> 3, 5, 38, 29, 146, 3, 48, 13,...
$ total_receita                         <dbl> 1580.00, 3180.00, 336793.13, ...
$ media_receita                         <dbl> 526.67, 636.00, 8419.83, 5404...
$ recursos_de_outros_candidatos.comites <dbl> 0.00, 0.00, 1923.07, 39122.32...
$ recursos_de_pessoas_fisicas           <dbl> 1500.00, 3100.00, 65700.00, 6...
$ recursos_de_pessoas_juridicas         <dbl> 0.00, 0.00, 154170.06, 17000....
$ recursos_proprios                     <dbl> 0.00, 0.00, 115000.00, 6813.1...
$ recursos_de_partido_politico          <dbl> 80.00, 80.00, 0.00, 25000.00,...
$ quantidade_despesas                   <int> 3, 6, 145, 136, 518, 12, 336,...
$ quantidade_fornecedores               <int> 3, 5, 139, 121, 354, 12, 281,...
$ total_despesa                         <dbl> 1580.00, 3130.02, 326869.78, ...
$ media_despesa                         <dbl> 526.67, 521.67, 2254.27, 1772...
$ cargo                                 <fct> DEPUTADO FEDERAL, DEPUTADO FE...
$ sexo                                  <fct> MASCULINO, MASCULINO, MASCULI...
$ grau                                  <fct> ENSINO MÉDIO COMPLETO, SUPERI...
$ estado_civil                          <fct> SOLTEIRO(A), SOLTEIRO(A), CAS...
$ ocupacao                              <fct> CORRETOR DE IMÓVEIS, SEGUROS,...
In [22]:
submit_data %>%
    select(-ano,-nome,-sequencial_candidato) -> submit_data

submit_data %>%
  dplyr::select_if(.,is.numeric) -> submit_data.numeric

submit_data %>%
  dplyr::select_if(.,negate(is.numeric)) -> submit_data.categorical

processParams %>% 
  predict(.,submit_data.numeric) -> submit_data.numeric 

submit_data.numeric %>%
  dplyr::bind_cols(submit_data.categorical) -> submit_data

submit_data <- one_hot_encoder(dataSet = submit_data,
                               encoding = encoding,
                               drop = TRUE,
                               verbose = F)

submit_data %>%
    select(-one_of(near_zero_vars)) -> submit_data

submit_data %>% 
glimpse()
Observations: 4,592
Variables: 38
$ quantidade_doacoes                    <dbl> -0.19978723, -0.18415635, 0.0...
$ quantidade_doadores                   <dbl> -0.20625998, -0.18832944, 0.1...
$ total_receita                         <dbl> -0.38636694, -0.38280827, 0.3...
$ media_receita                         <dbl> -0.343680979, -0.336117447, 0...
$ recursos_de_outros_candidatos.comites <dbl> -0.190636839, -0.190636839, -...
$ recursos_de_pessoas_fisicas           <dbl> -0.3296413, -0.3056035, 0.634...
$ recursos_de_pessoas_juridicas         <dbl> -0.31502228, -0.31502228, 0.2...
$ recursos_proprios                     <dbl> -0.16885161, -0.16885161, 0.5...
$ recursos_de_partido_politico          <dbl> -0.191460411, -0.191460411, -...
$ quantidade_despesas                   <dbl> -0.305124337, -0.298150602, 0...
$ quantidade_fornecedores               <dbl> -0.30825230, -0.30254043, 0.0...
$ total_despesa                         <dbl> -0.39456340, -0.39063257, 0.4...
$ media_despesa                         <dbl> -0.123162519, -0.123902300, 0...
$ uf.MG                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ uf.PR                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ uf.RJ                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ uf.RS                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ uf.SP                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ sexo.FEMININO                         <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
$ sexo.MASCULINO                        <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,...
$ `grau.ENSINO FUNDAMENTAL COMPLETO`    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ `grau.ENSINO MÉDIO COMPLETO`          <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
$ `grau.SUPERIOR COMPLETO`              <int> 0, 1, 0, 1, 1, 0, 1, 1, 1, 1,...
$ `grau.SUPERIOR INCOMPLETO`            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ ocupacao.ADVOGADO                     <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
$ ocupacao.DEPUTADO                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
$ ocupacao.EMPRESÁRIO                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ ocupacao.OUTROS                       <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PDT                           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PMDB                          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PSB                           <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PSDB                          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ partido.PT                            <int> 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,...
$ partido.PTB                           <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
$ partido.PV                            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ estado.civil.CASADO.A.                <int> 0, 0, 1, 1, 1, 0, 1, 0, 1, 1,...
$ estado.civil.DIVORCIADO.A.            <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
$ estado.civil.SOLTEIRO.A.              <int> 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
In [23]:
submit_data %>%
  as.matrix() -> x_submit
  
dimnames(x_submit) <- NULL

Base Neural Networks

In [24]:
# Initialize a sequential model
model.simple <- keras_model_sequential()

# Add layers to the model
model.simple %>%
  layer_dense(units = 8, activation = 'relu', input_shape = c(38)) %>% 
  layer_dense(units = 2, activation = 'softmax')
In [25]:
summary(model.simple)
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
================================================================================
dense_1 (Dense)                     (None, 8)                       312         
________________________________________________________________________________
dense_2 (Dense)                     (None, 2)                       18          
================================================================================
Total params: 330
Trainable params: 330
Non-trainable params: 0
________________________________________________________________________________
In [26]:
# Initialize a sequential model
model.complex <- keras_model_sequential()

# Add layers to the model
model.complex %>%
  layer_dense(units = 8, activation = 'relu', input_shape = c(38)) %>% 
  layer_dense(units = 6, activation = 'relu') %>% 
  layer_dense(units = 10) %>% 
  layer_dense(units = 2, activation = 'softmax')
In [27]:
summary(model.complex)
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
================================================================================
dense_3 (Dense)                     (None, 8)                       312         
________________________________________________________________________________
dense_4 (Dense)                     (None, 6)                       54          
________________________________________________________________________________
dense_5 (Dense)                     (None, 10)                      70          
________________________________________________________________________________
dense_6 (Dense)                     (None, 2)                       22          
================================================================================
Total params: 458
Trainable params: 458
Non-trainable params: 0
________________________________________________________________________________

Cross Validation and Tuning

In [28]:
neuralNetCV <- function(df,
                        model,
                        k=5,
                        loss_method = 'binary_crossentropy',
                        optim_approach = 'adam',
                        summary_metrics='accuracy',
                        epochs = 200,
                        batch_size = 5,
                        validation_split = 0.2) {
    
  model %>% 
    compile(
        loss = loss_method,
        optimizer = optim_approach,
        metrics = summary_metrics)
  df %>%
    mutate(folds = sample(rep_len(1:k, nrow(.)))) -> df
  
  result <-  data.frame("loss"=c(),metrics=c()) 
    
  for(f in unique(df$folds)){

  # split into train/validation
    df %>%
      filter(folds == f) -> train_df
    
    df %>%
      filter(folds != f) -> valid_df
      
 # Remove auxiliary column
    train_df %>%
      select(-folds) -> train_df
      
    valid_df %>%
      select(-folds) -> valid_df
      
  # create matrices
    train_df %>%
      split_target_predictors() -> x_train
    
    valid_df %>%
      split_target_predictors() -> x_valid
  
  # extract target and predictors
    y_train <- x_train$target
    x_train <- x_train$predictors
  
    y_valid <- x_valid$target
    x_valid <- x_valid$predictors
    
  # Train model
  history <- model %>% fit(
      x_train, y_train,
      epochs = epochs,
      batch_size = batch_size, 
      validation_split = validation_split)
  
  # Evaluate the model
    model %>%
      evaluate(x_valid,
               y_valid,
               batch_size = 128) -> score
  
  score %>%
      as.data.frame() -> temporary
      
  result <- rbind(result,temporary)
  }
 cvsummary <- list("result"=result,
                   "history"=history,
                   "model"=model)   
 return(cvsummary);
}
In [29]:
tuneNeuralNetwork <- function(model,data,paramsGrid,
                              target,k=5) {
   environment(neuralNetCV) <- environment()    

    best_accuracy <- 0
    best_loss <- 0
    best_tune <- NULL
    best_history <- NULL
    best_model <- NULL
    optmizer <- NULL
    losses <- c()
    acc <- c()

     for(i in 1:nrow(paramsGrid)) {  
        row <- paramsGrid[i,]
        if (row$optim_approach == "sgd") 
            optmizer <- optimizer_sgd(lr = 0.01)
        else if (row$optim_approach == "rmsprop") 
            optmizer <- optimizer_rmsprop(lr = 0.001, rho = 0.9)
        else if (row$optim_approach == "adam") 
            optmizer <- optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999)

        # apply CV Kfold to particular set of params
        neuralNetCV(data,
                    model=model,
                    k=k,
                    loss_method = paste0(row$loss_method),
                    optim_approach = optmizer,
                    summary_metrics= paste0(row$summary_metrics),
                    epochs = paste0(row$epochs),
                    batch_size = paste0(row$batch_size),
                    validation_split = paste0(row$validation_split)) -> cvsummary

        cvsummary %$%
        result %$%
        acc %>%
        mean() -> mean_accuracy

        cvsummary %$%
        result %$%
        loss %>%
        mean() -> mean_loss
       
        losses <- append(losses,mean_loss)  
        acc <- append(acc, mean_accuracy)

        if(mean_accuracy > best_accuracy) {
            best_accuracy <- mean_accuracy
            best_history <- cvsummary$history
            best_model <- cvsummary$model
            best_result <- mean_accuracy
            best_loss <- mean_loss
            best_tune <- row
        }    
    }
    paramsGrid$loss <- losses
    paramsGrid$accuracy <- acc
    best_tune$accuracy <- best_accuracy
    best_tune$loss <- best_loss

    result <- list("history"=best_history,
                   "best_tune"=best_tune,
                   "iterations"=paramsGrid,
                   "best_model"=best_model)
    return(result)

    }

Tuning Parameters

In [30]:
optm <- c("sgd", "adam")
validation_split <- c(0.2,0.4)

paramsGrid <- expand.grid(optim_approach=optm,
                          loss_method = 'binary_crossentropy',
                          summary_metrics='accuracy',
                          epochs = 200,
                          batch_size = 5,
                          validation_split = validation_split)

paramsGrid
optim_approachloss_methodsummary_metricsepochsbatch_sizevalidation_split
sgd binary_crossentropyaccuracy 200 5 0.2
adam binary_crossentropyaccuracy 200 5 0.2
sgd binary_crossentropyaccuracy 200 5 0.4
adam binary_crossentropyaccuracy 200 5 0.4

Simplified Neural Net with Unbalaced data

In [31]:
tuneNeuralNetwork(model.simple,
                  train,paramsGrid,
                  k=5) -> tunedSimple

Test Score

In [32]:
tunedSimple %$%
    best_model %>%
    evaluate(x_test,
             y_test,
             batch_size = 128) -> score.simple

score.simple
$loss
0.695294341957319
$acc
0.904855643201062

Visualize Train Fit

In [33]:
tunedSimple  %$%
    history %>%
    plot()

Make Predictions

In [34]:
tunedSimple %$% 
    best_model %>%
    predict_classes(x_submit, batch_size = 128) -> classes.simple

data.frame(Id=Id,
          Predicted=classes.simple) -> submission.simple

submission.simple %>%
  mutate(Id = as.character(Id)) %>%
  write_csv("nn_submission_simple.csv")

Simplified Neural Network with Balanced data (ROSE)

In [35]:
tuneNeuralNetwork(model.simple,
                  train.rose,paramsGrid,
                  k=5) -> tunedSimpleRose

Test Score

In [36]:
tunedSimpleRose %$%
    best_model %>%
    evaluate(x_test,
             y_test,
             batch_size = 128) -> score.simple.rose

score.simple.rose
$loss
1.16133909205126
$acc
0.896325461038454

Visualize Train Fit

In [37]:
tunedSimpleRose  %$%
    history %>%
    plot()

Make Predictions

In [38]:
tunedSimpleRose %$% 
    best_model %>%
    predict_classes(x_submit, batch_size = 128) -> classes.simple.rose

data.frame(Id=Id,
          Predicted=classes.simple.rose) -> submission.simple.rose

submission.simple.rose %>%
  mutate(Id = as.character(Id)) %>%
  write_csv("nn_submission_simple_rose.csv")

Slightly more complex Neural Networn with unbalaced data

In [39]:
tuneNeuralNetwork(model.complex,
                  train,paramsGrid,
                  k=5) -> tunedComplex

Test Score

In [40]:
tunedComplex %$%
    best_model %>%
    evaluate(x_test,
             y_test,
             batch_size = 128) -> score.complex

score.complex
$loss
0.453967891408624
$acc
0.902230971285052

Visualize Train Fit

In [41]:
tunedComplex  %$%
    history %>%
    plot()

Make Predictions

In [42]:
tunedComplex %$% 
    best_model %>%
    predict_classes(x_submit, batch_size = 128) -> classes.complex

data.frame(Id=Id,
          Predicted=classes.complex) -> submission.complex

submission.complex %>%
  mutate(Id = as.character(Id)) %>%
  write_csv("nn_submission_complex.csv")

Slightly more complex Neural Networn with balaced data (ROSE)

In [43]:
tuneNeuralNetwork(model.complex,
                  train.rose,paramsGrid,
                  k=5) -> tunedComplexRose

Test Score

In [44]:
tunedComplexRose %$%
    best_model %>%
    evaluate(x_test,
             y_test,
             batch_size = 128) -> score.complex.rose

score.complex.rose
$loss
1.12783700325514
$acc
0.889107611704999

Visualize Train Fit

In [45]:
tunedComplexRose  %$%
    history %>%
    plot()

Make Predictions

In [46]:
tunedComplexRose %$% 
    best_model %>%
    predict_classes(x_submit, batch_size = 128) -> classes.complex.rose

data.frame(Id=Id,
          Predicted=classes.complex.rose) -> submission.complex.rose

submission.complex.rose %>%
  mutate(Id = as.character(Id)) %>%
  write_csv("nn_submission_complex_rose.csv")