The database used in this study consists of data on the votes that candidates for the Federal Chamber of Deputies received in 2006 and 2010 (source: http://www.tse.jus.br), as well as information on each candidate's campaign, party, schooling, and other characteristics.

Loading data.
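
The chunks in this section rely on a few packages: dplyr (for %>% and select), caret (for train, trainControl and varImp) and ggplot2 (for the plots). Loading them up front keeps the code below self-contained.

library(dplyr)   # data manipulation: %>% and select()
library(caret)   # model training, cross-validation and variable importance
library(ggplot2) # tuning and importance plots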

train <- read.csv("train.csv")
test <- read.csv("test.csv")

Removing categorical variables.

train <- train %>% select(-nome, -uf, -estado_civil, 
                          -partido, -ocupacao, -ano, 
                          -cargo, -grau, -sexo, 
                          -sequencial_candidato)
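
As an optional sanity check, the remaining columns can be inspected; they should be the votos target plus the numeric predictors (the model summaries below report 13 predictors).

str(train)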

To build the prediction model, we will compare three types of regression: ridge, lasso, and knn, which are run below.

Ridge regression.

set.seed(1)

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
lambdaGrid <- expand.grid(lambda = 10^seq(10, -2, length=100))

# model using ridge regression
model <- train(votos ~ ., 
               data = train,
               method = "ridge",
               trControl = fitControl,
               preProcess = c('scale', 'center'),
               tuneGrid = lambdaGrid,
               na.action = na.omit)
model
## Ridge Regression 
## 
## 7476 samples
##   13 predictors
## 
## Pre-processing: scaled (13), centered (13) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 6729, 6729, 6728, 6729, 6728, 6729, ... 
## Resampling results across tuning parameters:
## 
##   lambda        RMSE       Rsquared   MAE     
##   1.000000e-02   37992.18  0.4029785  16917.96
##   1.321941e-02   38028.99  0.4024646  16918.64
##   1.747528e-02   38074.06  0.4018611  16916.32
##   2.310130e-02   38127.82  0.4011888  16909.12
##   3.053856e-02   38191.23  0.4004698  16894.77
##   4.037017e-02   38266.21  0.3997206  16870.82
##   5.336699e-02   38356.25  0.3989470  16834.55
##   7.054802e-02   38466.99  0.3981398  16782.72
##   9.326033e-02   38607.21  0.3972726  16710.82
##   1.232847e-01   38790.02  0.3963037  16615.54
##   1.629751e-01   39034.91  0.3951806  16492.44
##   2.154435e-01   39370.79  0.3938494  16335.49
##   2.848036e-01   39840.63  0.3922681  16139.24
##   3.764936e-01   40507.88  0.3904209  15905.99
##   4.977024e-01   41464.25  0.3883290  15638.74
##   6.579332e-01   42836.95  0.3860532  15361.10
##   8.697490e-01   44790.93  0.3836857  15154.35
##   1.149757e+00   47519.77  0.3813346  15343.81
##   1.519911e+00   51219.24  0.3791034  16896.20
##   2.009233e+00   56044.17  0.3770752  19500.74
##   2.656088e+00   62058.68  0.3753024  22802.30
##   3.511192e+00   69197.66  0.3738051  26690.85
##   4.641589e+00   77254.86  0.3725763  31020.24
##   6.135907e+00   85904.66  0.3715908  35632.55
##   8.111308e+00   94752.21  0.3708140  40321.60
##   1.072267e+01  103398.10  0.3702096  44886.85
##   1.417474e+01  111498.64  0.3697435  49143.23
##   1.873817e+01  118806.32  0.3693863  52969.71
##   2.477076e+01  125183.55  0.3691137  56301.25
##   3.274549e+01  130592.92  0.3689061  59124.41
##   4.328761e+01  135073.52  0.3687485  61462.65
##   5.722368e+01  138713.09  0.3686289  63361.32
##   7.564633e+01  141623.27  0.3685382  64879.22
##   1.000000e+02  143921.24  0.3684695  66078.10
##   1.321941e+02  145717.95  0.3684174  67015.29
##   1.747528e+02  147111.97  0.3683780  67742.30
##   2.310130e+02  148187.10  0.3683482  68302.96
##   3.053856e+02  149012.50  0.3683256  68733.51
##   4.037017e+02  149643.94  0.3683086  69062.91
##   5.336699e+02  150125.69  0.3682956  69314.21
##   7.054802e+02  150492.49  0.3682859  69505.54
##   9.326033e+02  150771.33  0.3682785  69650.99
##   1.232847e+03  150983.05  0.3682729  69761.42
##   1.629751e+03  151143.66  0.3682686  69845.20
##   2.154435e+03  151265.41  0.3682654  69908.73
##   2.848036e+03  151357.66  0.3682630  69956.86
##   3.764936e+03  151427.54  0.3682612  69993.32
##   4.977024e+03  151480.44  0.3682598  70020.93
##   6.579332e+03  151520.49  0.3682587  70041.83
##   8.697490e+03  151550.80  0.3682580  70057.64
##   1.149757e+04  151573.74  0.3682574  70069.61
##   1.519911e+04  151591.09  0.3682569  70078.67
##   2.009233e+04  151604.23  0.3682566  70085.52
##   2.656088e+04  151614.16  0.3682563  70090.70
##   3.511192e+04  151621.68  0.3682561  70094.63
##   4.641589e+04  151627.37  0.3682560  70097.59
##   6.135907e+04  151631.67  0.3682558  70099.84
##   8.111308e+04  151634.93  0.3682558  70101.54
##   1.072267e+05  151637.39  0.3682557  70102.82
##   1.417474e+05  151639.25  0.3682556  70103.79
##   1.873817e+05  151640.66  0.3682556  70104.53
##   2.477076e+05  151641.73  0.3682556  70105.09
##   3.274549e+05  151642.53  0.3682556  70105.51
##   4.328761e+05  151643.14  0.3682555  70105.83
##   5.722368e+05  151643.60  0.3682555  70106.07
##   7.564633e+05  151643.95  0.3682555  70106.25
##   1.000000e+06  151644.22  0.3682555  70106.39
##   1.321941e+06  151644.42  0.3682555  70106.49
##   1.747528e+06  151644.57  0.3682555  70106.57
##   2.310130e+06  151644.68  0.3682555  70106.63
##   3.053856e+06  151644.77  0.3682555  70106.67
##   4.037017e+06  151644.83  0.3682555  70106.71
##   5.336699e+06  151644.88  0.3682555  70106.73
##   7.054802e+06  151644.92  0.3682555  70106.75
##   9.326033e+06  151644.95  0.3682555  70106.77
##   1.232847e+07  151644.97  0.3682555  70106.78
##   1.629751e+07  151644.99  0.3682555  70106.79
##   2.154435e+07  151645.00  0.3682555  70106.79
##   2.848036e+07  151645.01  0.3682555  70106.80
##   3.764936e+07  151645.02  0.3682555  70106.80
##   4.977024e+07  151645.02  0.3682555  70106.81
##   6.579332e+07  151645.03  0.3682555  70106.81
##   8.697490e+07  151645.03  0.3682555  70106.81
##   1.149757e+08  151645.03  0.3682555  70106.81
##   1.519911e+08  151645.03  0.3682555  70106.81
##   2.009233e+08  151645.03  0.3682555  70106.81
##   2.656088e+08  151645.04  0.3682555  70106.81
##   3.511192e+08  151645.04  0.3682555  70106.81
##   4.641589e+08  151645.04  0.3682555  70106.81
##   6.135907e+08  151645.04  0.3682555  70106.81
##   8.111308e+08  151645.04  0.3682555  70106.81
##   1.072267e+09  151645.04  0.3682555  70106.81
##   1.417474e+09  151645.04  0.3682555  70106.81
##   1.873817e+09  151645.04  0.3682555  70106.81
##   2.477076e+09  151645.04  0.3682555  70106.81
##   3.274549e+09  151645.04  0.3682555  70106.81
##   4.328761e+09  151645.04  0.3682555  70106.81
##   5.722368e+09  151645.04  0.3682555  70106.81
##   7.564633e+09  151645.04  0.3682555  70106.81
##   1.000000e+10  151645.04  0.3682555  70106.81
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.01.
ggplot(model)
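
Beyond the tuning plot, the selected lambda and its cross-validated error can be read straight off the fitted caret object:

model$bestTune           # lambda chosen by cross-validation
min(model$results$RMSE)  # the corresponding (smallest) cross-validated RMSE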

Variable importance for ridge regression.

ggplot(varImp(model))

Lasso regression.

set.seed(1)
# for caret's "lasso" method the tuning parameter is the L1 fraction, not lambda
fractionGrid <- expand.grid(fraction = seq(10^-8, 0.01, length = 20))
model_lasso <- train(votos ~ ., 
                     data = train, 
                     method = "lasso", 
                     tuneGrid = fractionGrid,
                     preProc = c("center", "scale"),
                     trControl = fitControl)
ggplot(model_lasso)

model_lasso
## The lasso 
## 
## 7476 samples
##   13 predictors
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 6729, 6729, 6728, 6729, 6728, 6729, ... 
## Resampling results across tuning parameters:
## 
##   fraction      RMSE      Rsquared   MAE     
##   0.0000000100  46964.90  0.4034546  28624.33
##   0.0005263253  38315.10  0.3960888  19639.19
##   0.0010526405  38452.07  0.4009972  17766.75
##   0.0015789558  38428.58  0.4039639  17536.65
##   0.0021052711  38488.95  0.4029602  17512.22
##   0.0026315863  38542.52  0.4020506  17506.00
##   0.0031579016  38566.62  0.4016407  17504.58
##   0.0036842168  38564.98  0.4016362  17503.38
##   0.0042105321  38563.35  0.4016316  17502.18
##   0.0047368474  38561.73  0.4016267  17500.98
##   0.0052631626  38560.12  0.4016214  17499.77
##   0.0057894779  38558.53  0.4016158  17498.57
##   0.0063157932  38556.95  0.4016098  17497.37
##   0.0068421084  38555.39  0.4016036  17496.17
##   0.0073684237  38553.84  0.4015970  17494.98
##   0.0078947389  38552.30  0.4015901  17493.79
##   0.0084210542  38550.78  0.4015828  17492.59
##   0.0089473695  38549.28  0.4015753  17491.40
##   0.0094736847  38547.79  0.4015674  17490.21
##   0.0100000000  38546.31  0.4015592  17489.02
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.0005263253.

Variable importance for lasso regression.

ggplot(varImp(model_lasso))
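
Since the lasso can shrink coefficients exactly to zero, it is also worth checking which predictors survive at the selected fraction. A sketch using the underlying elasticnet fit stored by caret (for method = "lasso", the finalModel slot holds an enet object):

# coefficients of the final lasso fit at the cross-validated fraction;
# predictors shrunk to exactly zero have been dropped by the lasso
predict(model_lasso$finalModel,
        type = "coefficients",
        mode = "fraction",
        s = model_lasso$bestTune$fraction)$coefficients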

Knn regression.

set.seed(1)
# k (the number of neighbours) must be an integer
k <- expand.grid(k = seq(25, 50, by = 1))
model_knn <- train(votos ~ ., 
                   data = train, 
                   method = "knn", 
                   tuneGrid = k,
                   preProc = c("center", "scale"),
                   trControl = fitControl)
ggplot(model_knn)

Variable importance for knn regression.

ggplot(varImp(model_knn))

Results

For the tests done, we see that the ranking of the most useful predictors is essentially the same in all models, as expected, with only occasional swaps of position; all of them identify total_receita, total_despesa, and recursos_de_pessoas_juridicas as the most significant variables.

Comparing the modelling approaches by [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation), the knn model performed best: k = 32 gave the best result, with an RMSE close to 30,100, while the lasso regression's best result was around 38.5K and the ridge's around 38K.
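
caret's resamples() makes this comparison concrete by collecting the cross-validation metrics of the three fits side by side (a sketch; it assumes the three models above were trained with the same fitControl, as configured here):

results <- resamples(list(ridge = model, lasso = model_lasso, knn = model_knn))
summary(results)                  # RMSE, Rsquared and MAE per model
dotplot(results, metric = "RMSE") # visual comparison of the RMSE distributions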

Next, a knn model will be created, since knn proved to be the approach with the best results, using only the best predictors; this model will be used to predict votes in the 2014 elections in Brazil.

set.seed(1)
# use + rather than * so only the three main effects enter the model
# (in an R formula, * would also add all their interaction terms)
model_knn_topper <- train(votos ~ total_receita + total_despesa + recursos_de_pessoas_juridicas, 
                          data = train, 
                          method = "knn", 
                          tuneGrid = k,
                          preProc = c("center", "scale"),
                          trControl = fitControl)
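
With the model trained, predictions for the 2014 data follow from predict(); this is a sketch that assumes test.csv carries the same predictor columns (total_receita, total_despesa and recursos_de_pessoas_juridicas) as the training data.

# scoring step: predict votes for the 2014 candidates in the test set
predictions <- predict(model_knn_topper, newdata = test)
head(predictions)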