1. Colombian presidential elections
Since 2020 some consulting and data companies have been making electoral
surveys to find out Colombian voters preferences about presidential
candidates. However, over the last years we have seen how electoral
outcomes don’t match with a lot of surveys in events like: USA president
election in 2018 or Colombian Referendum for peace in 2016. These
failures have contributed to generate discussions inside parametric
field and have opened the door for new approaches as forecast Bayesian
models. These models don’t need representative samples as ARIMA(p,d,q)
or conditional heteroskedasticity family.
\(~\)
Some of these new models work with data from parametric surveys and try
to make a forecast. In Colombia the “Consejo Nacional
Electoral” (CNE) has register of all electoral surveys.
Although at this moment (15-march-2022) the CNE hasn’t loaded data from
presidential surveys in 2022.
\(~\)
I will show you how to download survey data available in Wikipedia with
“web-scraping” so We could obtain tables with electoral data. Also it is
possible to export R dataframes for Excel sheet (“xlsx”
library) or directly into Google Sheets (“googlesheets4”
library).
2. Data selected
We can find most Colombian electoral surveys results at Wikipedia
page for presidential elections in 2022
Please remember over time web page will be update with new data or some
adjustments.
3. Packages for make a web scraping with R
The Rvest library allows us to do “Web Scraping” with the R language.
Some functions as html_nodes() and htm_table() are
very useful to read html pages and tables from those files.
\(~\)
4. Coding step by step
a. First step
I will remove all objects at RStudio environment and then load libraries for this assignment.
remove(list = ls())
library(dplyr) ## To make pipelines with pipe operator (%>%)
library(dplyr) ## To make pipelines with pipe operator (%>%)
library(rvest) ## Web-scraping package
library(janitor) ## To put first rows like columns title
library(lubridate) ## Transform to Date format in R
library(knitr) ## To show tables in HTML format
library(xlsx) ## Export dataframes in xlsx format
library(googlesheets4) ## Import or export sheets between Google Drive and R
library(scales)b. Second step
I will save the link into an object called url and later I
built a new list (tablas) throughout a pipeline. The function
read_html() allows read data from web pages,
html_nodes() extracts one list with nodes which are surveys
tables, and finally html_table() parses html tables as
dataframes.
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2022_Colombian_presidential_election"
tablas <- url %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = TRUE)c. Third step
I called the list as tablas, this list has nine objects
(actually are nine tables). For this example I’ll use the fifth table
which has presidential surveys from 2022 by five consulting companies.
This table has thirtytwo rows (surveys) and nine columns (presidential
candidates). I’ll change some characters like percent sign (%) for empty
space, or commas for dots throughout new functions:
cambio <- function(x){
x %>%
gsub(",", ".", .) %>%
gsub("%", "", .) %>%
sub("\\[.*", "", .)
}
numerico <- function(j){
j <- as.numeric(as.character(j))
return(j)
}
Those new Functions are: cambio() to change commas and
percentages. Function numerico() to convert character data for
numeric.
\(~\)
d. Fourth step
Now I extract the fifth table, then turn it in dataframe class, select
first row as columns titles and later apply cambio() across all
dataframe columns. We need to use pipe operator (dplyr library)
to build a pipeline. The new dataframe’s name will be
tabla_1.
tabla_1 <- tablas[[5]] %>%
data.frame() %>%
filter(!row_number() %in% c(1)) %>%
mutate(across(.fns = cambio))e. Fifth step
Although I have applied cambio() for all columns the next step
is select columns with numeric values (however R still recognizes them
as characters). With class function we can parse the classes of
dataframe’s columns, throughout sapply:
sapply(tabla_1, class)## Polling.organisation.client Dates.conducted
## "character" "character"
## Samplesize Gustavo.PetroPHC
## "character" "character"
## Rodolfo.HernándezLIGA Blank
## "character" "character"
## No.vote Unsure.No.answer
## "character" "character"
## Margin.of.Error
## "character"
g. Sixth step
We need to remove a special character (“–”) and replace it with NA at
columns seven and eight.
tabla_1[, 7:8] <- sapply(tabla_1[, 7:8],
function(x) gsub("–", NA, x))f. Seventh step
At the third column (Samplesize) I will remove character dot. Now it’s
necessary to change columns character class for numeric class, with
numerico() and sapply().
tabla_1$Samplesize <- sub("\\.", "",
tabla_1$Samplesize, perl=TRUE)
tabla_1[, 3:ncol(tabla_1)] <- sapply(tabla_1[, 3:ncol(tabla_1)],
function(x) numerico(x))
Now we can divide columns four to nine by 100.
tabla_1[, 4:ncol(tabla_1)] <- sapply(tabla_1[, 4:ncol(tabla_1)],
function(x) x/100)Now I rename every column in our Data-Frame (tabla_1)
colnames(tabla_1) <- c("Encuestadora", "fecha", "Tamaño muestral",
"Gustavo Petro", "Rodolfo Hernández",
"Voto en Blanco", "No vota", "Indeciso",
"Margen de error")The second column (fecha) doesn’t have a Date format. That’s why we replace special characters and with dplyr package convert to Date.
tabla_1 <- tabla_1 %>%
mutate(fecha = gsub(".*–", "", fecha),
fecha = gsub(" ", "-", fecha),
fecha = dmy(fecha))The table is ready and you can view it:
kable(tabla_1, caption = "Encuestas de canditados presidenciales 2022")| Encuestadora | fecha | Tamaño muestral | Gustavo Petro | Rodolfo Hernández | Voto en Blanco | No vota | Indeciso | Margen de error |
|---|---|---|---|---|---|---|---|---|
| AtlasIntel | 2022-06-11 | 4467 | 0.4750 | 0.5020 | 0.0240 | NA | NA | 0.0100 |
| YanHaas | 2022-06-10 | 1234 | 0.4500 | 0.3500 | 0.1300 | NA | 0.0700 | 0.0320 |
| GAD3 | 2022-06-10 | 5236 | 0.4710 | 0.4790 | 0.0500 | NA | NA | 0.0140 |
| Guarumo | 2022-06-09 | 2029 | 0.4650 | 0.4820 | 0.0530 | NA | NA | 0.0250 |
| GAD3 | 2022-06-09 | 4834 | 0.4810 | 0.4680 | 0.0510 | NA | NA | 0.0130 |
| Mosqueteros | 2022-06-08 | 6000 | 0.4383 | 0.4468 | 0.0354 | NA | 0.0795 | 0.0127 |
| GAD3 | 2022-06-08 | 4438 | 0.4850 | 0.4670 | 0.0490 | NA | NA | 0.0150 |
| Invamer | 2022-06-07 | 1329 | 0.4720 | 0.4820 | 0.0470 | NA | NA | 0.0269 |
| GAD3 | 2022-06-07 | 4041 | 0.4780 | 0.4710 | 0.0510 | NA | NA | 0.0160 |
| GAD3 | 2022-06-06 | 3240 | 0.4680 | 0.4780 | 0.0540 | NA | NA | 0.0170 |
| GAD3 | 2022-06-04 | 3240 | 0.4680 | 0.4810 | 0.0510 | NA | NA | 0.0130 |
| Guarumo | 2022-06-04 | 1958 | 0.4330 | 0.4640 | 0.0840 | NA | 0.0190 | 0.0250 |
| YanHaas | 2022-06-03 | 1234 | 0.4200 | 0.4100 | 0.1300 | NA | 0.0500 | 0.0320 |
| GAD3 | 2022-06-03 | 2840 | 0.4630 | 0.4790 | 0.0580 | NA | NA | 0.0190 |
| CNC | 2022-06-02 | 2172 | 0.4490 | 0.4100 | 0.0300 | 0.017 | 0.0940 | 0.0210 |
| GAD3 | 2022-06-02 | 553 | 0.4560 | 0.5040 | 0.0400 | NA | NA | 0.0210 |
| GAD3 | 2022-06-01 | 555 | 0.4510 | 0.5230 | 0.0260 | NA | NA | 0.0240 |
| GAD3 | 2022-05-31 | 1199 | 0.4480 | 0.5250 | 0.0270 | NA | NA | 0.0290 |
| CNC | 2022-05-31 | 1200 | 0.3900 | 0.4100 | 0.0500 | 0.010 | 0.1400 | 0.0280 |
| AtlasIntel | 2022-05-20 | 2781 | 0.3980 | 0.4820 | NA | NA | NA | 0.0200 |
| Guarumo | 2022-05-19 | 2258 | 0.4520 | 0.4150 | 0.0860 | NA | 0.0470 | 0.0250 |
| CNC | 2022-05-19 | 4412 | 0.4050 | 0.4050 | 0.0370 | 0.093 | 0.0600 | 0.0150 |
| Mosqueteros | 2022-05-18 | 6000 | 0.4530 | 0.3500 | 0.1970 | NA | NA | 0.0127 |
| Invamer | 2022-05-18 | 1195 | 0.5000 | 0.4740 | 0.0260 | NA | NA | 0.0283 |
| YanHaas | 2022-05-07 | 1232 | 0.4600 | 0.3100 | NA | 0.230 | NA | 0.0320 |
| Guarumo | 2022-04-29 | 2132 | 0.4450 | 0.3710 | 0.0860 | NA | 0.0980 | 0.0250 |
| Invamer | 2022-04-27 | 1243 | 0.5600 | 0.4060 | 0.0340 | NA | NA | 0.0255 |
| CNC | 2022-04-21 | 4599 | 0.4410 | 0.3410 | 0.0490 | 0.114 | 0.0550 | 0.0140 |
| CNC | 2022-04-07 | 1965 | 0.4000 | 0.3800 | 0.0400 | 0.100 | 0.0800 | 0.0220 |
| CNC | 2022-03-31 | 4206 | 0.4260 | 0.3440 | 0.0450 | 0.124 | 0.0410 | 0.0150 |
| YanHaas | 2022-03-19 | 1236 | 0.4100 | 0.3200 | NA | 0.270 | NA | 0.0320 |
If you want a percentage data, You will need to use percent() from scales library:
tabla_1[, 4:ncol(tabla_1)] <- sapply(tabla_1[, 4:ncol(tabla_1)],
function(x) scales::percent(x, digits = 1))kable(tabla_1, caption = "Encuestas de canditados presidenciales 2022")| Encuestadora | fecha | Tamaño muestral | Gustavo Petro | Rodolfo Hernández | Voto en Blanco | No vota | Indeciso | Margen de error |
|---|---|---|---|---|---|---|---|---|
| AtlasIntel | 2022-06-11 | 4467 | 47.50% | 50.20% | 2.40% | NA | NA | 1.00% |
| YanHaas | 2022-06-10 | 1234 | 45.00% | 35.00% | 13.00% | NA | 7.00% | 3.20% |
| GAD3 | 2022-06-10 | 5236 | 47.10% | 47.90% | 5.00% | NA | NA | 1.40% |
| Guarumo | 2022-06-09 | 2029 | 46.50% | 48.20% | 5.30% | NA | NA | 2.50% |
| GAD3 | 2022-06-09 | 4834 | 48.10% | 46.80% | 5.10% | NA | NA | 1.30% |
| Mosqueteros | 2022-06-08 | 6000 | 43.83% | 44.68% | 3.54% | NA | 7.95% | 1.27% |
| GAD3 | 2022-06-08 | 4438 | 48.50% | 46.70% | 4.90% | NA | NA | 1.50% |
| Invamer | 2022-06-07 | 1329 | 47.20% | 48.20% | 4.70% | NA | NA | 2.69% |
| GAD3 | 2022-06-07 | 4041 | 47.80% | 47.10% | 5.10% | NA | NA | 1.60% |
| GAD3 | 2022-06-06 | 3240 | 46.80% | 47.80% | 5.40% | NA | NA | 1.70% |
| GAD3 | 2022-06-04 | 3240 | 46.80% | 48.10% | 5.10% | NA | NA | 1.30% |
| Guarumo | 2022-06-04 | 1958 | 43.30% | 46.40% | 8.40% | NA | 1.90% | 2.50% |
| YanHaas | 2022-06-03 | 1234 | 42.00% | 41.00% | 13.00% | NA | 5.00% | 3.20% |
| GAD3 | 2022-06-03 | 2840 | 46.30% | 47.90% | 5.80% | NA | NA | 1.90% |
| CNC | 2022-06-02 | 2172 | 44.90% | 41.00% | 3.00% | 1.7% | 9.40% | 2.10% |
| GAD3 | 2022-06-02 | 553 | 45.60% | 50.40% | 4.00% | NA | NA | 2.10% |
| GAD3 | 2022-06-01 | 555 | 45.10% | 52.30% | 2.60% | NA | NA | 2.40% |
| GAD3 | 2022-05-31 | 1199 | 44.80% | 52.50% | 2.70% | NA | NA | 2.90% |
| CNC | 2022-05-31 | 1200 | 39.00% | 41.00% | 5.00% | 1.0% | 14.00% | 2.80% |
| AtlasIntel | 2022-05-20 | 2781 | 39.80% | 48.20% | NA | NA | NA | 2.00% |
| Guarumo | 2022-05-19 | 2258 | 45.20% | 41.50% | 8.60% | NA | 4.70% | 2.50% |
| CNC | 2022-05-19 | 4412 | 40.50% | 40.50% | 3.70% | 9.3% | 6.00% | 1.50% |
| Mosqueteros | 2022-05-18 | 6000 | 45.30% | 35.00% | 19.70% | NA | NA | 1.27% |
| Invamer | 2022-05-18 | 1195 | 50.00% | 47.40% | 2.60% | NA | NA | 2.83% |
| YanHaas | 2022-05-07 | 1232 | 46.00% | 31.00% | NA | 23.0% | NA | 3.20% |
| Guarumo | 2022-04-29 | 2132 | 44.50% | 37.10% | 8.60% | NA | 9.80% | 2.50% |
| Invamer | 2022-04-27 | 1243 | 56.00% | 40.60% | 3.40% | NA | NA | 2.55% |
| CNC | 2022-04-21 | 4599 | 44.10% | 34.10% | 4.90% | 11.4% | 5.50% | 1.40% |
| CNC | 2022-04-07 | 1965 | 40.00% | 38.00% | 4.00% | 10.0% | 8.00% | 2.20% |
| CNC | 2022-03-31 | 4206 | 42.60% | 34.40% | 4.50% | 12.4% | 4.10% | 1.50% |
| YanHaas | 2022-03-19 | 1236 | 41.00% | 32.00% | NA | 27.0% | NA | 3.20% |