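Loading the required libraries
library(dplyr)  # %>% and select()
library(caret)  # train(), trainControl(), varImp(); also attaches ggplot2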
Loading the data, removing columns with little predictive relevance (the candidate's name, office, sequential number, and occupation), and replacing missing values with 0
train <- read.csv("./train.csv")
train <- train %>% select(-cargo, -nome, -sequencial_candidato, -ocupacao)
train[is.na(train)] <- 0
1. Using all available variables, tune (using cross-validation): (i) a Ridge regression model, (ii) a Lasso regression model, and (iii) a KNN model. For the linear regression models, the parameter to tune is lambda (the coefficient penalty); for KNN, it is the number of neighbors.
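As a reminder of what is being tuned, the two penalized least-squares objectives can be written as below. The notation is the standard textbook one and is an assumption here, not something taken from the assignment: \lambda \ge 0 is the penalty strength, \beta_0 the intercept, and \beta the coefficient vector over p predictors.

```latex
\begin{align}
\hat{\beta}^{\text{ridge}} &= \arg\min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2} \\
\hat{\beta}^{\text{lasso}} &= \arg\min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\end{align}
```

Note that caret's "lasso" method is parameterized by fraction (the L1 norm of the coefficients relative to the L1 norm of the full least-squares solution, between 0 and 1) rather than by lambda directly, which is why the Lasso output reports fraction. A smaller fraction corresponds to a stronger penalty.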
fitControl <- trainControl(method = "cv",
                           number = 10,
                           search = "random")
preProcValues <- c("center", "scale", "nzv")
model.ridge <- train(votos ~ .,
                     data = train,
                     trControl = fitControl,
                     method = "ridge",
                     preProcess = preProcValues,
                     tuneLength = 15)
model.ridge
## Ridge Regression
##
## 7476 samples
## 19 predictor
##
## Pre-processing: centered (32), scaled (32), remove (49)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 6729, 6728, 6728, 6728, 6728, 6729, ...
## Resampling results across tuning parameters:
##
##   lambda        RMSE      Rsquared   MAE
##   1.031154e-05  39113.25  0.3945467  16348.62
##   2.971823e-05  39113.91  0.3945407  16348.75
##   7.648806e-05  39115.49  0.3945263  16349.04
##   1.668854e-04  39118.55  0.3944980  16349.59
##   1.296040e-03  39156.34  0.3941311  16355.81
##   1.652158e-03  39167.84  0.3940152  16357.48
##   2.026988e-03  39179.68  0.3938949  16359.11
##   3.490212e-03  39223.01  0.3934482  16364.31
##   9.574590e-03  39363.18  0.3920181  16374.58
##   7.812353e-02  40019.37  0.3882094  16298.58
##   2.054584e-01  40935.67  0.3865904  16166.29
##   2.259082e-01  41090.46  0.3863749  16160.24
##   6.328502e-01  44694.34  0.3827507  16667.57
##   1.088405e+00  49458.80  0.3799824  18221.68
##   3.029257e+00  69475.28  0.3746666  27437.13
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1.031154e-05.
model.lasso <- train(votos ~ .,
                     data = train,
                     trControl = fitControl,
                     method = "lasso",
                     preProcess = preProcValues,
                     tuneLength = 15)
model.lasso
## The lasso
##
## 7476 samples
## 19 predictor
##
## Pre-processing: centered (32), scaled (32), remove (49)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 6728, 6728, 6729, 6729, 6728, 6728, ...
## Resampling results across tuning parameters:
##
##   fraction    RMSE      Rsquared   MAE
##   0.04225047  38512.11  0.4145641  16341.36
##   0.12701449  39011.83  0.4107561  16402.23
##   0.13882840  39113.85  0.4101744  16410.71
##   0.17289607  39445.32  0.4085883  16435.17
##   0.31684908  41301.15  0.4040472  16538.54
##   0.33504165  41572.86  0.4036883  16551.60
##   0.42398068  42977.43  0.4023705  16615.47
##   0.56322331  45348.54  0.4012084  16715.45
##   0.57165515  45496.73  0.4011595  16721.51
##   0.65483152  46978.60  0.4007610  16781.23
##   0.77116897  49098.25  0.4003840  16864.77
##   0.87289337  50982.87  0.4001606  16937.81
##   0.93775147  52195.54  0.4000519  16984.39
##   0.96585522  52723.19  0.4000110  17004.57
##   0.98287705  53043.36  0.3999878  17016.79
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.04225047.
fitControl <- trainControl(method = "cv",
                           number = 5)
preProcValues <- c("center", "scale", "nzv")
model.knn <- train(votos ~ .,
                   data = train,
                   method = "knn",
                   trControl = fitControl,
                   preProcess = preProcValues,
                   tuneLength = 15)
model.knn
## k-Nearest Neighbors
##
## 7476 samples
## 19 predictor
##
## Pre-processing: centered (32), scaled (32), remove (49)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5980, 5980, 5980, 5983, 5981
## Resampling results across tuning parameters:
##
##   k   RMSE      Rsquared   MAE
##    5  35463.65  0.4557163  13862.66
##    7  35063.29  0.4699271  13695.80
##    9  35112.39  0.4704779  13737.42
##   11  35072.00  0.4743449  13846.13
##   13  35071.55  0.4760923  13933.54
##   15  35160.97  0.4742608  14037.76
##   17  35191.44  0.4741594  14071.83
##   19  35251.99  0.4730065  14193.97
##   21  35380.34  0.4695652  14284.63
##   23  35369.97  0.4712983  14311.12
##   25  35439.69  0.4702958  14333.16
##   27  35552.89  0.4669965  14406.56
##   29  35585.14  0.4661299  14462.73
##   31  35657.46  0.4644162  14497.99
##   33  35676.69  0.4647745  14529.24
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
2. Compare the three models in terms of their cross-validated RMSE.
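Reading off the best row of each results table gives the comparison directly: KNN has the lowest cross-validated RMSE, followed by Lasso and then Ridge. A minimal sketch that tabulates those values is shown below. The numbers are hard-coded from the printed output rather than recomputed, and the data frame and column names are illustrative; in a live session, `min(model.ridge$results$RMSE)` and its counterparts should give the same figures. Keep in mind that KNN was tuned with 5-fold cross-validation while Ridge and Lasso used 10 folds, so the comparison is not strictly like-for-like.

```r
# Best cross-validated RMSE per model, copied from the caret output in this document.
cv_rmse <- data.frame(
  model      = c("Ridge", "Lasso", "KNN"),
  best_param = c("lambda = 1.031154e-05", "fraction = 0.04225047", "k = 7"),
  rmse       = c(39113.25, 38512.11, 35063.29)
)

# Sort from lowest (best) to highest RMSE.
cv_rmse <- cv_rmse[order(cv_rmse$rmse), ]
print(cv_rmse)

# Relative RMSE gap of each model to the best one, in percent.
gap_pct <- round(100 * (cv_rmse$rmse / min(cv_rmse$rmse) - 1), 1)
print(gap_pct)
```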
3. Which variables are the most important according to the Ridge and Lasso regression models? Were any variables discarded by the Lasso? Which ones?
ggplot(varImp(model.ridge))
ggplot(varImp(model.lasso))
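The importance plots rank the variables but do not by themselves show which coefficients the Lasso shrank exactly to zero. One way to check is to extract the coefficients at the selected fraction. caret's "lasso" method fits the model with `elasticnet::enet`, so the sketch below calls that package directly on synthetic data, since train.csv is not reproduced here. The toy predictors `x1` to `x5`, the response, and the fraction of 0.5 are all assumptions for illustration; for the model above, the analogous call would presumably use `model.lasso$finalModel` and `model.lasso$bestTune$fraction`.

```r
library(elasticnet)

# Synthetic data (assumption): only x1 and x2 actually drive the response.
set.seed(42)
n <- 200
x <- matrix(rnorm(n * 5), ncol = 5,
            dimnames = list(NULL, paste0("x", 1:5)))
y <- 3 * x[, "x1"] - 2 * x[, "x2"] + rnorm(n)

# lambda = 0 removes the quadratic penalty, so enet() computes the lasso path.
lasso_fit <- enet(x, y, lambda = 0)

# Coefficients at a given fraction of the full-solution L1 norm.
coefs <- predict(lasso_fit, s = 0.5, type = "coefficients",
                 mode = "fraction")$coefficients

# Predictors whose coefficient is exactly zero were dropped by the lasso.
dropped <- colnames(x)[coefs == 0]
kept    <- colnames(x)[coefs != 0]
print(coefs)
print(dropped)
```

Because the fraction selected above (about 0.04) is very small, the penalty is strong and a large share of the coefficients can be expected to be zero, though the exact list has to come from running this against the fitted model.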
4. Retrain the best model on all the data, using the best parameter values found and without cross-validation.
best.grid <- expand.grid(k = model.knn$bestTune)
best.model <- train(votos ~ .,
                    data = train,
                    method = "knn",
                    tuneGrid = best.grid)
best.model
## k-Nearest Neighbors
##
## 7476 samples
## 19 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7476, 7476, 7476, 7476, 7476, 7476, ...
## Resampling results:
##
##   RMSE      Rsquared   MAE
##   35330.85  0.4670892  13344.2
##
## Tuning parameter 'k' was held constant at a value of 7
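The output above shows that this refit still ran 25 bootstrap resamples and applied no pre-processing: `train()` falls back to bootstrapping when no `trControl` is passed, and `preProcess` was not carried over from the tuning run, so KNN distances are computed on unscaled predictors. A hedged sketch of a refit that skips resampling with `trainControl(method = "none")` and keeps centering and scaling follows. It runs on a synthetic data frame (`toy`, with made-up predictors `x1` and `x2`) and a fixed k = 7 as stand-ins; in this document's setting one would presumably pass `data = train`, `tuneGrid = model.knn$bestTune`, and `preProcess = preProcValues` instead.

```r
library(caret)

# Synthetic stand-in for the training data (assumption); x2 is on a much
# larger scale than x1, which is where centering and scaling matter for KNN.
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100, sd = 50))
toy$votos <- 10 * toy$x1 + 0.1 * toy$x2 + rnorm(100)

# method = "none" fits a single model on all rows with no resampling;
# it requires a tuneGrid with exactly one row.
final_fit <- train(votos ~ .,
                   data = toy,
                   method = "knn",
                   preProcess = c("center", "scale"),
                   trControl = trainControl(method = "none"),
                   tuneGrid = data.frame(k = 7))

preds <- predict(final_fit, toy)
head(preds)
```

With no resampling there is no RMSE estimate attached to the final fit; the cross-validated figure from the tuning step remains the performance estimate.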
5. Use this last trained model to predict the test data available in the challenge we created on the Kaggle platform.
test <- read.csv("./test.csv")
submission <- test %>%
  select(sequencial_candidato)
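test <- test %>%
  select(-sequencial_candidato,
         -nome,
         -cargo,
         -ocupacao)
# Mirror the training-set imputation (NA -> 0). This assumes test.csv may
# contain missing values, which would otherwise drop rows at prediction time.
test[is.na(test)] <- 0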
predictions <- predict(best.model, test)
submission$votos <- predictions
submission <- submission %>%
  select(ID = sequencial_candidato,
         votos = votos)
write.csv(x = submission,
          file = "sample_submission.csv",
          row.names = FALSE)