1. Colombian presidential elections

Since 2020 some consulting and data companies have been making electoral surveys to find out Colombian voters preferences about presidential candidates. However, over the last years we have seen how electoral outcomes don’t match with a lot of surveys in events like: USA president election in 2018 or Colombian Referendum for peace in 2016. These failures have contributed to generate discussions inside parametric field and have opened the door for new approaches as forecast Bayesian models. These models don’t need representative samples as ARIMA(p,d,q) or conditional heteroskedasticity family.
\(~\)
Some of these new models work with data from parametric surveys and try to make a forecast. In Colombia the “Consejo Nacional Electoral” (CNE) has register of all electoral surveys. Although at this moment (15-march-2022) the CNE hasn’t loaded data from presidential surveys in 2022.
\(~\)
I will show you how to download survey data available in Wikipedia with “web-scraping” so We could obtain tables with electoral data. Also it is possible to export R dataframes for Excel sheet (“xlsx” library) or directly into Google Sheets (“googlesheets4” library).


2. Data selected

We can find most Colombian electoral surveys results at Wikipedia page for presidential elections in 2022
Please remember over time web page will be update with new data or some adjustments.


3. Packages for make a web scraping with R

The Rvest library allows us to do “Web Scraping” with the R language. Some functions as html_nodes() and htm_table() are very useful to read html pages and tables from those files.

\(~\)

4. Coding step by step

a. First step

I will remove all objects at RStudio environment and then load libraries for this assignment.

remove(list = ls())

library(dplyr)               ## To make pipelines with pipe operator (%>%)
library(dplyr)               ## To make pipelines with pipe operator (%>%)
library(rvest)               ## Web-scraping package
library(janitor)             ## To put first rows like columns title
library(lubridate)           ## Transform to Date format in R
library(knitr)               ## To show tables in HTML format
library(xlsx)                ## Export dataframes in xlsx format
library(googlesheets4)       ## Import or export sheets between Google Drive and R
library(scales)

b. Second step

I will save the link into an object called url and later I built a new list (tablas) throughout a pipeline. The function read_html() allows read data from web pages, html_nodes() extracts one list with nodes which are surveys tables, and finally html_table() parses html tables as dataframes.
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2022_Colombian_presidential_election"

tablas <- url %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE)

c. Third step

I called the list as tablas, this list has nine objects (actually are nine tables). For this example I’ll use the fifth table which has presidential surveys from 2022 by five consulting companies. This table has thirtytwo rows (surveys) and nine columns (presidential candidates). I’ll change some characters like percent sign (%) for empty space, or commas for dots throughout new functions:
cambio <- function(x){ 
  x %>%
    gsub(",", ".", .) %>% 
    gsub("%", "", .) %>% 
    sub("\\[.*", "", .)
}

numerico <- function(j){
  j <- as.numeric(as.character(j))
  return(j)
}
Those new Functions are: cambio() to change commas and percentages. Function numerico() to convert character data for numeric.

\(~\)

d. Fourth step

Now I extract the fifth table, then turn it in dataframe class, select first row as columns titles and later apply cambio() across all dataframe columns. We need to use pipe operator (dplyr library) to build a pipeline. The new dataframe’s name will be tabla_1.
tabla_1 <- tablas[[5]] %>% 
  data.frame() %>% 
  filter(!row_number() %in% c(1)) %>% 
  mutate(across(.fns = cambio))

e. Fifth step

Although I have applied cambio() for all columns the next step is select columns with numeric values (however R still recognizes them as characters). With class function we can parse the classes of dataframe’s columns, throughout sapply:
sapply(tabla_1, class)
## Polling.organisation.client             Dates.conducted 
##                 "character"                 "character" 
##                  Samplesize            Gustavo.PetroPHC 
##                 "character"                 "character" 
##       Rodolfo.HernándezLIGA                       Blank 
##                 "character"                 "character" 
##                     No.vote            Unsure.No.answer 
##                 "character"                 "character" 
##             Margin.of.Error 
##                 "character"

g. Sixth step

We need to remove a special character (“–”) and replace it with NA at columns seven and eight.
tabla_1[, 7:8] <- sapply(tabla_1[, 7:8], 
                         function(x) gsub("–", NA, x))

f. Seventh step

At the third column (Samplesize) I will remove character dot. Now it’s necessary to change columns character class for numeric class, with numerico() and sapply().
tabla_1$Samplesize <- sub("\\.", "", 
                           tabla_1$Samplesize, perl=TRUE)

tabla_1[, 3:ncol(tabla_1)] <- sapply(tabla_1[, 3:ncol(tabla_1)], 
                         function(x) numerico(x))
Now we can divide columns four to nine by 100.
tabla_1[, 4:ncol(tabla_1)] <- sapply(tabla_1[, 4:ncol(tabla_1)], 
                                     function(x) x/100)

Now I rename every column in our Data-Frame (tabla_1)

colnames(tabla_1) <- c("Encuestadora", "fecha", "Tamaño muestral", 
                       "Gustavo Petro", "Rodolfo Hernández", 
                       "Voto en Blanco", "No vota", "Indeciso", 
                       "Margen de error")

The second column (fecha) doesn’t have a Date format. That’s why we replace special characters and with dplyr package convert to Date.

tabla_1 <- tabla_1 %>% 
  mutate(fecha = gsub(".*–", "", fecha), 
         fecha = gsub(" ", "-", fecha), 
         fecha = dmy(fecha))

The table is ready and you can view it:

kable(tabla_1, caption = "Encuestas de canditados presidenciales 2022")
Encuestas de canditados presidenciales 2022
Encuestadora fecha Tamaño muestral Gustavo Petro Rodolfo Hernández Voto en Blanco No vota Indeciso Margen de error
AtlasIntel 2022-06-11 4467 0.4750 0.5020 0.0240 NA NA 0.0100
YanHaas 2022-06-10 1234 0.4500 0.3500 0.1300 NA 0.0700 0.0320
GAD3 2022-06-10 5236 0.4710 0.4790 0.0500 NA NA 0.0140
Guarumo 2022-06-09 2029 0.4650 0.4820 0.0530 NA NA 0.0250
GAD3 2022-06-09 4834 0.4810 0.4680 0.0510 NA NA 0.0130
Mosqueteros 2022-06-08 6000 0.4383 0.4468 0.0354 NA 0.0795 0.0127
GAD3 2022-06-08 4438 0.4850 0.4670 0.0490 NA NA 0.0150
Invamer 2022-06-07 1329 0.4720 0.4820 0.0470 NA NA 0.0269
GAD3 2022-06-07 4041 0.4780 0.4710 0.0510 NA NA 0.0160
GAD3 2022-06-06 3240 0.4680 0.4780 0.0540 NA NA 0.0170
GAD3 2022-06-04 3240 0.4680 0.4810 0.0510 NA NA 0.0130
Guarumo 2022-06-04 1958 0.4330 0.4640 0.0840 NA 0.0190 0.0250
YanHaas 2022-06-03 1234 0.4200 0.4100 0.1300 NA 0.0500 0.0320
GAD3 2022-06-03 2840 0.4630 0.4790 0.0580 NA NA 0.0190
CNC 2022-06-02 2172 0.4490 0.4100 0.0300 0.017 0.0940 0.0210
GAD3 2022-06-02 553 0.4560 0.5040 0.0400 NA NA 0.0210
GAD3 2022-06-01 555 0.4510 0.5230 0.0260 NA NA 0.0240
GAD3 2022-05-31 1199 0.4480 0.5250 0.0270 NA NA 0.0290
CNC 2022-05-31 1200 0.3900 0.4100 0.0500 0.010 0.1400 0.0280
AtlasIntel 2022-05-20 2781 0.3980 0.4820 NA NA NA 0.0200
Guarumo 2022-05-19 2258 0.4520 0.4150 0.0860 NA 0.0470 0.0250
CNC 2022-05-19 4412 0.4050 0.4050 0.0370 0.093 0.0600 0.0150
Mosqueteros 2022-05-18 6000 0.4530 0.3500 0.1970 NA NA 0.0127
Invamer 2022-05-18 1195 0.5000 0.4740 0.0260 NA NA 0.0283
YanHaas 2022-05-07 1232 0.4600 0.3100 NA 0.230 NA 0.0320
Guarumo 2022-04-29 2132 0.4450 0.3710 0.0860 NA 0.0980 0.0250
Invamer 2022-04-27 1243 0.5600 0.4060 0.0340 NA NA 0.0255
CNC 2022-04-21 4599 0.4410 0.3410 0.0490 0.114 0.0550 0.0140
CNC 2022-04-07 1965 0.4000 0.3800 0.0400 0.100 0.0800 0.0220
CNC 2022-03-31 4206 0.4260 0.3440 0.0450 0.124 0.0410 0.0150
YanHaas 2022-03-19 1236 0.4100 0.3200 NA 0.270 NA 0.0320

If you want a percentage data, You will need to use percent() from scales library:

tabla_1[, 4:ncol(tabla_1)] <- sapply(tabla_1[, 4:ncol(tabla_1)], 
                         function(x) scales::percent(x, digits = 1))
kable(tabla_1, caption = "Encuestas de canditados presidenciales 2022")
Encuestas de canditados presidenciales 2022
Encuestadora fecha Tamaño muestral Gustavo Petro Rodolfo Hernández Voto en Blanco No vota Indeciso Margen de error
AtlasIntel 2022-06-11 4467 47.50% 50.20% 2.40% NA NA 1.00%
YanHaas 2022-06-10 1234 45.00% 35.00% 13.00% NA 7.00% 3.20%
GAD3 2022-06-10 5236 47.10% 47.90% 5.00% NA NA 1.40%
Guarumo 2022-06-09 2029 46.50% 48.20% 5.30% NA NA 2.50%
GAD3 2022-06-09 4834 48.10% 46.80% 5.10% NA NA 1.30%
Mosqueteros 2022-06-08 6000 43.83% 44.68% 3.54% NA 7.95% 1.27%
GAD3 2022-06-08 4438 48.50% 46.70% 4.90% NA NA 1.50%
Invamer 2022-06-07 1329 47.20% 48.20% 4.70% NA NA 2.69%
GAD3 2022-06-07 4041 47.80% 47.10% 5.10% NA NA 1.60%
GAD3 2022-06-06 3240 46.80% 47.80% 5.40% NA NA 1.70%
GAD3 2022-06-04 3240 46.80% 48.10% 5.10% NA NA 1.30%
Guarumo 2022-06-04 1958 43.30% 46.40% 8.40% NA 1.90% 2.50%
YanHaas 2022-06-03 1234 42.00% 41.00% 13.00% NA 5.00% 3.20%
GAD3 2022-06-03 2840 46.30% 47.90% 5.80% NA NA 1.90%
CNC 2022-06-02 2172 44.90% 41.00% 3.00% 1.7% 9.40% 2.10%
GAD3 2022-06-02 553 45.60% 50.40% 4.00% NA NA 2.10%
GAD3 2022-06-01 555 45.10% 52.30% 2.60% NA NA 2.40%
GAD3 2022-05-31 1199 44.80% 52.50% 2.70% NA NA 2.90%
CNC 2022-05-31 1200 39.00% 41.00% 5.00% 1.0% 14.00% 2.80%
AtlasIntel 2022-05-20 2781 39.80% 48.20% NA NA NA 2.00%
Guarumo 2022-05-19 2258 45.20% 41.50% 8.60% NA 4.70% 2.50%
CNC 2022-05-19 4412 40.50% 40.50% 3.70% 9.3% 6.00% 1.50%
Mosqueteros 2022-05-18 6000 45.30% 35.00% 19.70% NA NA 1.27%
Invamer 2022-05-18 1195 50.00% 47.40% 2.60% NA NA 2.83%
YanHaas 2022-05-07 1232 46.00% 31.00% NA 23.0% NA 3.20%
Guarumo 2022-04-29 2132 44.50% 37.10% 8.60% NA 9.80% 2.50%
Invamer 2022-04-27 1243 56.00% 40.60% 3.40% NA NA 2.55%
CNC 2022-04-21 4599 44.10% 34.10% 4.90% 11.4% 5.50% 1.40%
CNC 2022-04-07 1965 40.00% 38.00% 4.00% 10.0% 8.00% 2.20%
CNC 2022-03-31 4206 42.60% 34.40% 4.50% 12.4% 4.10% 1.50%
YanHaas 2022-03-19 1236 41.00% 32.00% NA 27.0% NA 3.20%