The dataset used in this study consists of the votes received by candidates for the Federal Chamber of Deputies in 2006 and 2010 (source: http://www.tse.jus.br), along with information on each candidate's campaign finances, party, schooling, …
Loading data.
set.seed(1)
data <- read.csv("train.csv")
data$id <- 1:nrow(data)
data %>%
sample_frac(.9) -> train
test <- read.csv("test.csv")
anti_join(data, train, by = 'id') -> validation
Let’s explore the data features.
train %>%
glimpse()
## Observations: 6,860
## Variables: 25
## $ ano <int> 2006, 2006, 2010, 2010, ...
## $ sequencial_candidato <dbl> 1.0590e+04, 1.1295e+04, ...
## $ nome <fct> PAULO DUARTE, ANTÔNIO MA...
## $ uf <fct> PR, MG, MA, SP, MG, SP, ...
## $ partido <fct> PSDC, PSC, PC do B, PTN,...
## $ quantidade_doacoes <int> 1, 3, 5, 3, 1, 27, 23, 4...
## $ quantidade_doadores <int> 1, 3, 5, 3, 1, 24, 17, 4...
## $ total_receita <dbl> 305.90, 3670.00, 13030.0...
## $ media_receita <dbl> 305.90, 1223.33, 2606.00...
## $ recursos_de_outros_candidatos.comites <dbl> 305.90, 2340.00, 0.00, 0...
## $ recursos_de_pessoas_fisicas <dbl> 0.0, 1330.0, 1700.0, 100...
## $ recursos_de_pessoas_juridicas <dbl> 0.00, 0.00, 11330.00, 0....
## $ recursos_proprios <dbl> 0.00, 0.00, 0.00, 0.00, ...
## $ recursos_de_partido_politico <dbl> 0.0, 0.0, 0.0, 354.5, 0....
## $ quantidade_despesas <int> 1, 3, 7, 3, 1, 75, 22, 2...
## $ quantidade_fornecedores <int> 1, 3, 6, 3, 1, 69, 22, 1...
## $ total_despesa <dbl> 305.90, 3670.00, 12230.0...
## $ media_despesa <dbl> 305.90, 1223.33, 1747.14...
## $ cargo <fct> DEPUTADO FEDERAL, DEPUTA...
## $ sexo <fct> MASCULINO, MASCULINO, MA...
## $ grau <fct> SUPERIOR INCOMPLETO, SUP...
## $ estado_civil <fct> CASADO(A), DIVORCIADO(A)...
## $ ocupacao <fct> EMPRESÁRIO, ESTUDANTE, B...
## $ situacao <fct> nao_eleito, nao_eleito, ...
## $ id <int> 2024, 2836, 4366, 6920, ...
Next, we will explore the class balance of the data attributes, which is an important point to prevent overfitting in the prediction models.
Feature by feature, we will show the balance with respect to the target variable, which in this study is situacao (whether a deputy was elected or not), and comment on whether the classes are unbalanced.
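Before going feature by feature, here is a quick hedged sketch of the balance of the target variable itself (the decision tree output later in this report shows roughly 86% nao_eleito at the root, so a strong imbalance is expected):
train %>%
  count(situacao) %>%                 # candidates per outcome class
  mutate(proportion = n / sum(n))     # share of each class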
recursos_proprios
train %>%
ggplot(aes(situacao,recursos_proprios)) +
geom_boxplot()
Unbalanced: high values of this resource are more common among the non-elected.
recursos_de_partido_politico
train %>%
ggplot(aes(situacao,
recursos_de_partido_politico)) +
geom_boxplot()
Unbalanced: high values of this resource are more common among the elected.
recursos_de_outros_candidatos.comites
train %>%
ggplot(aes(situacao,
recursos_de_outros_candidatos.comites)) +
geom_boxplot()
The same situation as recursos_de_partido_politico.
recursos_de_pessoas_fisicas
train %>%
ggplot(aes(situacao,
recursos_de_pessoas_fisicas)) +
geom_boxplot()
Unbalanced: values concentrate in a lower range for the non-elected and a higher range for the elected.
recursos_de_pessoas_juridicas
train %>%
ggplot(aes(situacao,
recursos_de_pessoas_juridicas)) +
geom_boxplot()
Unbalanced over a range of values.
quantidade_doacoes
train %>%
ggplot(aes(situacao,
quantidade_doacoes)) +
geom_boxplot()
Unbalanced, with some very high donation counts.
quantidade_doadores
train %>%
ggplot(aes(situacao,
quantidade_doadores)) +
geom_boxplot()
The same situation as quantidade_doacoes.
media_receita
train %>%
ggplot(aes(situacao,
media_receita)) +
geom_boxplot()
There are cases of very high values only among the non-elected.
total_receita
train %>%
ggplot(aes(situacao,
total_receita)) +
geom_boxplot()
Some ranges of values are present only among the non-elected.
sexo
train %>%
ggplot() +
geom_mosaic(aes(x = product(sexo, situacao),
fill=sexo)) +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
grau
train %>%
ggplot() +
geom_mosaic(aes(x = product(grau, situacao),
fill=grau)) +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
estado_civil
train %>%
ggplot() +
geom_mosaic(aes(x = product(estado_civil, situacao),
fill=estado_civil)) +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
We can see that there is a clear imbalance in some variables, especially values that appear only among the elected or only among the non-elected. This can cause overfitting problems; among the possible strategies to correct it, this study will use a hybrid one, undersampling the more common class and oversampling the less common one.
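As a hedged sketch of how such rebalancing is wired into caret, this is what the sampling = "smote" argument used in some trainControl calls below does (ctrl_balanced is only an illustrative name, not an object reused later):
# "smote" combines synthetic oversampling of the minority class with
# undersampling of the majority; "down" and "up" are pure alternatives
ctrl_balanced <- trainControl(method = "cv",
                              number = 5,
                              classProbs = TRUE,
                              sampling = "smote")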
The next step builds an encoding to transform the sexo feature into one-hot (dummy) variables.
encoding <- build_encoding(dataSet = train,
cols = c("sexo"),
verbose = F)
train <- one_hot_encoder(dataSet = train,
encoding = encoding,
drop = TRUE,
verbose = F)
encoding <- build_encoding(dataSet = validation,
cols = c("sexo"),
verbose = F)
validation <- one_hot_encoder(dataSet = validation,
encoding = encoding,
drop = TRUE,
verbose = F)
encoding <- build_encoding(dataSet = test,
cols = c("sexo"),
verbose = F)
test <- one_hot_encoder(dataSet = test,
encoding = encoding,
drop = TRUE,
verbose = F)
Creating a personal function to evaluate the F1 metric of a model.
f1 <- function(data, lev = NULL, model = NULL) {
f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
c(F1 = f1_val)
}
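As a hedged sketch of how caret invokes this summary function during resampling (toy below is a hypothetical mini-resample, not data from this study):
toy <- data.frame(obs  = factor(c("eleito", "eleito", "nao_eleito"),
                                levels = c("eleito", "nao_eleito")),
                  pred = factor(c("eleito", "nao_eleito", "nao_eleito"),
                                levels = c("eleito", "nao_eleito")))
f1(toy, lev = levels(toy$obs))  # F1 for the first level, "eleito": here 2/3
We also define a helper that flags each prediction as a false negative or a false positive; it is used to validate every model below.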
F_Measure <- function(expected, predicted, ...) {
# Flag each prediction as a false negative (fn: an elected candidate
# predicted as nao_eleito) or a false positive (fp: the reverse)
data.frame(expected=expected,
prediction=predicted) %>%
mutate(fn = ifelse(expected == "eleito" &
prediction == "nao_eleito",1,0),
fp = ifelse(expected == "nao_eleito" &
prediction == "eleito",1,0))
}
And we will remove all variables that are unnecessary for the models or have little to no variance.
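As a hedged sketch (assuming caret is loaded, as the modeling below requires), caret’s nearZeroVar can confirm which columns carry little or no variance before we drop columns by hand:
nearZeroVar(train, saveMetrics = TRUE)  # reports freqRatio, percentUnique and an nzv flag per column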
## Removing variables that are unnecessary or have minimal variance
train <- train %>% select(-nome,-ano, -ocupacao, -partido,
-estado_civil, -grau, -cargo,
-sequencial_candidato, -uf, -id)
validation <- validation %>% select(-nome,-ano, -ocupacao, -partido,
-estado_civil, -grau, -cargo,
-sequencial_candidato, -uf, -id)
test <- test %>% select(-nome,-ano, -ocupacao, -partido,
-estado_civil, -grau, -cargo, -uf)
Finally, we can start to create the five models that we will test here, beginning with a decision tree.
Decision Tree
set.seed(1)
cctrl <- trainControl(summaryFunction = f1, classProbs = TRUE, method = "boot", sampling = "smote")
lambda <- expand.grid(.cp = seq(from=0, to=0.02, by=0.005))
tree <- train(situacao ~ .,
train,
method="rpart",
metric = "F1",
tuneGrid = lambda,
preProc = c("center", "scale"),
trControl = trainControl(summaryFunction = f1, classProbs = TRUE, method = "boot"))
Resultant model
varImp(tree)
## rpart variable importance
##
## Overall
## total_receita 100.0000
## total_despesa 99.6657
## recursos_de_pessoas_juridicas 87.2176
## quantidade_fornecedores 61.8274
## quantidade_despesas 61.4545
## quantidade_doadores 3.6181
## recursos_proprios 2.8066
## media_despesa 1.5733
## recursos_de_partido_politico 1.4601
## media_receita 0.9026
## quantidade_doacoes 0.8577
## recursos_de_pessoas_fisicas 0.6909
## sexo.MASCULINO 0.0000
## recursos_de_outros_candidatos.comites 0.0000
## sexo.FEMININO 0.0000
and the best tuning found for it
tree$finalModel
## n= 6860
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 6860 931 nao_eleito (0.13571429 0.86428571)
## 2) total_receita>=0.06795873 1280 505 eleito (0.60546875 0.39453125)
## 4) recursos_de_pessoas_juridicas>=1.03738 472 103 eleito (0.78177966 0.21822034) *
## 5) recursos_de_pessoas_juridicas< 1.03738 808 402 eleito (0.50247525 0.49752475)
## 10) total_despesa>=0.6243804 310 123 eleito (0.60322581 0.39677419)
## 20) recursos_proprios< 3.169534 278 100 eleito (0.64028777 0.35971223) *
## 21) recursos_proprios>=3.169534 32 9 nao_eleito (0.28125000 0.71875000) *
## 11) total_despesa< 0.6243804 498 219 nao_eleito (0.43975904 0.56024096)
## 22) recursos_proprios>=-0.08563123 301 150 eleito (0.50166113 0.49833887)
## 44) quantidade_despesas< 0.3188294 202 91 eleito (0.54950495 0.45049505)
## 88) quantidade_doadores>=-0.0920028 158 62 eleito (0.60759494 0.39240506) *
## 89) quantidade_doadores< -0.0920028 44 15 nao_eleito (0.34090909 0.65909091) *
## 45) quantidade_despesas>=0.3188294 99 40 nao_eleito (0.40404040 0.59595960) *
## 23) recursos_proprios< -0.08563123 197 68 nao_eleito (0.34517766 0.65482234) *
## 3) total_receita< 0.06795873 5580 156 nao_eleito (0.02795699 0.97204301) *
Let’s see the final tree
# plot the model
plot(tree$finalModel, uniform=TRUE,
main="Classification Tree")
text(tree$finalModel, all=TRUE, cex=.8)
and the same tree, but prettier
fancyRpartPlot(tree$finalModel)
Model validation using false positive and false negative percentages of predictions
data.frame(validation %>%
select(-situacao) %>%
predict(object=tree,.) %>%
F_Measure(validation$situacao,.) %>%
summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
## false_negative false_positive
## 1 0.03412073 0.04724409
and now the F1 performance graph of this model across all hyperparameters tested
ggplot(tree)
The F1 metric is medium to good, and in the validation predictions false positives and false negatives together account for about 8%; for a medium F1, these are good results.
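For completeness, a hedged sketch of computing the validation F1 directly with MLmetrics (the same package used inside the f1 summary function):
F1_Score(y_true = validation$situacao,
         y_pred = predict(tree, validation),
         positive = "eleito")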
Logistic regression
set.seed(1)
rlGrid <- expand.grid( cost = c(200,2,0.02),
loss = c("L2_dual"),
epsilon = c(0.001, 0.01) )
cctrl2 <- trainControl(summaryFunction = f1, classProbs = TRUE, method = "cv", number = 5, sampling = "smote")
logistic <- train(situacao ~ ., train,
method = "regLogistic",
trControl = cctrl2,
metric = "F1",
preProc = c("center", "scale"),
tuneGrid = rlGrid)
## Loading required package: grid
Resultant model
varImp(logistic)
## ROC curve variable importance
##
## Importance
## total_despesa 100.00
## total_receita 99.96
## quantidade_fornecedores 90.21
## quantidade_despesas 90.01
## recursos_de_pessoas_juridicas 89.79
## quantidade_doacoes 85.81
## quantidade_doadores 85.59
## media_receita 82.48
## recursos_de_pessoas_fisicas 80.05
## media_despesa 64.85
## recursos_proprios 52.92
## recursos_de_outros_candidatos.comites 39.83
## recursos_de_partido_politico 32.78
## sexo.FEMININO 0.00
## sexo.MASCULINO 0.00
and the best tuning found for it
logistic$finalModel
## $TypeDetail
## [1] "L2-regularized logistic regression dual (L2R_LR_DUAL)"
##
## $Type
## [1] 7
##
## $W
## quantidade_doacoes quantidade_doadores total_receita media_receita
## [1,] 0.06761115 -1.51121 -0.3568663 -0.6368428
## recursos_de_outros_candidatos.comites recursos_de_pessoas_fisicas
## [1,] 0.0215394 -0.8941101
## recursos_de_pessoas_juridicas recursos_proprios
## [1,] -0.6634031 -0.1554319
## recursos_de_partido_politico quantidade_despesas
## [1,] 0.6104716 1.200539
## quantidade_fornecedores total_despesa media_despesa sexo.FEMININO
## [1,] -0.02511319 -2.6219 -0.4423231 0.01841145
## sexo.MASCULINO Bias
## [1,] -0.01841145 -0.4740965
##
## $Bias
## [1] 1
##
## $ClassNames
## [1] nao_eleito eleito
## Levels: eleito nao_eleito
##
## $NbClass
## [1] 2
##
## $xNames
## [1] "quantidade_doacoes"
## [2] "quantidade_doadores"
## [3] "total_receita"
## [4] "media_receita"
## [5] "recursos_de_outros_candidatos.comites"
## [6] "recursos_de_pessoas_fisicas"
## [7] "recursos_de_pessoas_juridicas"
## [8] "recursos_proprios"
## [9] "recursos_de_partido_politico"
## [10] "quantidade_despesas"
## [11] "quantidade_fornecedores"
## [12] "total_despesa"
## [13] "media_despesa"
## [14] "sexo.FEMININO"
## [15] "sexo.MASCULINO"
##
## $problemType
## [1] "Classification"
##
## $tuneValue
## cost loss epsilon
## 3 2 L2_dual 0.001
##
## $obsLevels
## [1] "eleito" "nao_eleito"
## attr(,"ordered")
## [1] FALSE
##
## $param
## list()
##
## attr(,"class")
## [1] "LiblineaR"
Model validation using false positive and false negative percentages of predictions
data.frame(validation %>%
select(-situacao) %>%
predict(object=logistic,.) %>%
F_Measure(validation$situacao,.) %>%
summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
## false_negative false_positive
## 1 0.02099738 0.06299213
and now the F1 performance graph of this model across all hyperparameters tested
ggplot(logistic)
The F1 metric is medium, and in the validation predictions false positives and false negatives together account for about 8%; for a medium F1, these are good results, with more false positives than false negatives.
KNN
set.seed(1)
neighborsGrid <- expand.grid(.k = seq(from=1, to=50, by=1))
knn <- train(situacao ~ ., train,
metric = "F1",
method = "knn",
tuneGrid = neighborsGrid,
trControl = cctrl2)
Resultant model
varImp(knn)
## ROC curve variable importance
##
## Importance
## total_despesa 100.00
## total_receita 99.96
## quantidade_fornecedores 90.21
## quantidade_despesas 90.01
## recursos_de_pessoas_juridicas 89.79
## quantidade_doacoes 85.81
## quantidade_doadores 85.59
## media_receita 82.48
## recursos_de_pessoas_fisicas 80.05
## media_despesa 64.85
## recursos_proprios 52.92
## recursos_de_outros_candidatos.comites 39.83
## recursos_de_partido_politico 32.78
## sexo.FEMININO 0.00
## sexo.MASCULINO 0.00
and the best tuning found for it
knn$bestTune
##     k
## 35 35
Model validation using false positive and false negative percentages of predictions
data.frame(validation %>%
select(-situacao) %>%
predict(object=knn,.) %>%
F_Measure(validation$situacao,.) %>%
summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
## false_negative false_positive
## 1 0.005249344 0.1102362
and now the F1 performance graph of this model across all hyperparameters tested
ggplot(knn)
The F1 metric performance is similar to the decision tree’s, but with a different error profile on the validation set: very few false negatives and more than 11% false positives.
AdaBoost
set.seed(1)
grid <- expand.grid(mfinal = (1:3)*3, maxdepth = c(1, 2),
coeflearn = c("Breiman", "Freund", "Zhu"))
seeds <- vector(mode = "list", length = nrow(train) + 1)
seeds <- lapply(seeds, function(x) 1:20)
cctrl2 <- trainControl(summaryFunction = f1, classProbs = TRUE, method = "cv", number = 5)
adaboost <- train(situacao ~ ., train,
method = "AdaBoost.M1",
trControl = cctrl,
na.action = na.exclude,
tuneGrid = grid,
metric = "F1",
preProc = c("center", "scale"))
Resultant model
varImp(adaboost)
## AdaBoost.M1 variable importance
##
## Overall
## total_receita 100.0000
## total_despesa 24.3953
## recursos_de_pessoas_juridicas 4.3038
## recursos_proprios 4.1647
## media_despesa 1.4067
## recursos_de_pessoas_fisicas 0.3218
## quantidade_doadores 0.0000
## quantidade_fornecedores 0.0000
## quantidade_doacoes 0.0000
## sexo.FEMININO 0.0000
## recursos_de_partido_politico 0.0000
## recursos_de_outros_candidatos.comites 0.0000
## quantidade_despesas 0.0000
## media_receita 0.0000
## sexo.MASCULINO 0.0000
Model validation using false positive and false negative percentages of predictions
data.frame(validation %>%
select(-situacao) %>%
predict(object=adaboost,.) %>%
F_Measure(validation$situacao,.) %>%
summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
## false_negative false_positive
## 1 0.002624672 0.1154856
and now the F1 performance graph of this model across all hyperparameters tested
ggplot(adaboost)
The F1 metric is medium, with a curious validation result: a minimal percentage of false negatives but more than 11% false positives.
In all models, and shown most explicitly in the decision tree, the variables related to campaign money seem to be the most important, and show what really makes a difference for a candidate to be elected or not.
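A hedged sketch to eyeball this across the four fitted models at once (it only uses objects created above):
list(tree = tree, logistic = logistic, knn = knn, adaboost = adaboost) %>%
  purrr::map(varImp)  # prints each model's importance ranking in turn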
Next, we will test a famous model, gradient boosting, as an extra one.
seeds <- vector(mode = "list", length = nrow(train) + 1)
seeds <- lapply(seeds, function(x) 1:20)
cctrl3 <- trainControl(
summaryFunction = f1,
classProbs = TRUE,
method = "cv",
number = 10,
seeds = seeds)
gbm <- train(situacao ~ ., train,
method = "gbm",
trControl = cctrl3,
metric = "F1",
na.action = na.exclude,
preProc = c("center", "scale"),
tuneGrid = expand.grid(interaction.depth = c(1, 2, 3, 4, 5),
shrinkage = c(.12, .13),
n.trees = c(55, 56, 57, 58, 59),
n.minobsinnode = c(2,3)),
verbose = FALSE,
distribution = "adaboost")
the best tuning found for it
gbm$bestTune
##   n.trees interaction.depth shrinkage n.minobsinnode
## 6 55 1 0.12 3
Model validation using false positive and false negative percentages of predictions
data.frame(validation %>%
select(-situacao) %>%
predict(object=gbm,.) %>%
F_Measure(validation$situacao,.) %>%
summarise(false_negative = sum(fn) / nrow(validation), false_positive = sum(fp) / nrow(validation)))
## false_negative false_positive
## 1 0.0328084 0.04593176
and now the F1 performance graph of this model across all hyperparameters tested
ggplot(gbm)
The F1 metric is the best achieved among the models created in this study, even though its validation performance in false positives/negatives is practically the same as the other models’.
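And, as a hedged sketch, the validation error profile of all five models can be gathered into one table with the F_Measure helper defined earlier:
list(tree = tree, logistic = logistic, knn = knn,
     adaboost = adaboost, gbm = gbm) %>%
  purrr::map_dfr(function(m) {
    F_Measure(validation$situacao, predict(m, validation)) %>%
      summarise(false_negative = sum(fn) / nrow(validation),
                false_positive = sum(fp) / nrow(validation))
  }, .id = "model")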
Results
The results of this study show that the models achieve reasonable performance on the F1 metric, with the gradient boosting model being the one chosen for submission to the Kaggle competition. We also saw that, in general, the variables referring to financial investment and corporate support were the features that showed the greatest relevance across all models.
Generating predictions for the test dataframe to submit to the related Kaggle competition.
prediction <- predict(gbm, test)
data_out <- data.frame(ID = test$sequencial_candidato, Predicted = prediction)
data_out$ID <-as.character(data_out$ID)
data_out %>% write_csv(path = "response.csv")
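A quick hedged sanity check of the submission format before uploading (the ID and Predicted columns built above):
head(data_out)  # should show character IDs alongside the predicted classes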