Summary

I became very interested in COVID-19 data when my friends, co-workers, and family started to express fear based on data provided by the media (data the media is walking back daily due to inaccuracy). I decided to automate the Worldometers data from the backend to simplify it for non-data analysts.

I then thought: how many people in my community use R every day but are too busy building their own scripts to have time to build a web-scraping tool for this public data? So now … here is your public web-scraping tool to automate the Worldometers data and aggregate it as you like!

Load Libraries and URL Source

First, we will need to load in our R libraries.

library(rvest)       # read_html(), html_nodes(), html_text()
library(stringr)     # str_split()
library(readr)       # number-parsing helpers such as parse_number()
library(magrittr)    # the %>% pipe

Next, I used the ‘read_html’ function to grab the Worldometers countries page.

URL_WORLD              <- read_html("https://www.worldometers.info/coronavirus/#countries")
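Since the request goes over the network and the page layout changes over time, a defensive variant could wrap the call in ‘tryCatch’ so a failure surfaces immediately instead of propagating an empty page downstream. This is a sketch I am adding, not part of the original workflow:

URL_WORLD              <- tryCatch(
  read_html("https://www.worldometers.info/coronavirus/#countries"),
  error = function(e) stop("Could not reach Worldometers: ", conditionMessage(e))
)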

Extracting HTML Nodes

After researching the Worldometers page, I found that the most consistent HTML node path was the table rows, “//tr”. Unlike others who have extracted data from this site, I did so through an XPath call, an approach I could not find previously attempted.

The strings come with newline (“\n”) characters, which I used to split them. The column names were pulled from the first string of the first list element. I combined everything into a data frame with a nested ‘Reduce’ and ‘rbind’ function call.

COVID_WORLD            <- URL_WORLD %>% html_nodes(xpath = "//tr") %>% html_text()
COVID_WORLD            <- str_split(COVID_WORLD, "\n")
WORLD_NAMES            <- COVID_WORLD[[1]][1:12]
COVID_WORLD            <- data.frame(Reduce(rbind, COVID_WORLD))
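As a quick sanity check (my addition; the exact values shift as the site updates), you can confirm that twelve header labels were captured and that the data frame has one row per table row:

WORLD_NAMES          # the 12 header labels taken from the first table row
dim(COVID_WORLD)     # rows = one per "tr" node; columns from the "\n" split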

Cleaning and Formatting the Data

The next phase creates dynamic references for trimming the table, since the data within the Worldometers site changes and the total rows would skew any analysis results. My goal was to exact-match the countries’ page format, keeping only the country rows.

WORLD_Sub_A            <- grep("All",COVID_WORLD$X13)               # locate countries rows to keep
WORLD_Sub_B            <- grep("Total:", COVID_WORLD$X1)            # locate total rows to remove
WORLD_Sub_1            <- WORLD_Sub_A[1] + 1                        # first position
WORLD_Sub_2            <- WORLD_Sub_B[1] - 1                        # last position
COVID_WORLD            <- COVID_WORLD[ WORLD_Sub_1 : WORLD_Sub_2, 1:12]
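Because these positions are computed from marker text, a layout change on the site would silently break the subset. A small guard placed right after the two grep() calls (my addition, not in the original script) can fail loudly instead:

if (length(WORLD_Sub_A) == 0 || length(WORLD_Sub_B) == 0) {
  stop("Worldometers table layout changed: marker rows not found")
}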

This phase just cleans up the ‘TotalCases’ column to remove spaces and commas.

colnames(COVID_WORLD)       <- WORLD_NAMES
COVID_WORLD$TotalCases      <- gsub(" ", "", COVID_WORLD$TotalCases)
COVID_WORLD$TotalCases      <- gsub(",", "", COVID_WORLD$TotalCases)

Here, I rename the columns, convert ‘TotalCases’ to numeric, and sort the data set by ‘TotalCases’ in descending order.

colnames(COVID_WORLD)  <- c("Country","TotalCases","NewCases","TotalDeaths","NewDeaths","TotalRecovered",
                            "ActiveCases", "SeriousCritical","Tot Cases/1M pop","Deaths/1M pop",
                            "TotalTests","Tests/1M pop")
COVID_WORLD$TotalCases      <- as.numeric(COVID_WORLD$TotalCases)
COVID_WORLD                 <- COVID_WORLD[ order(COVID_WORLD$TotalCases,decreasing = T),]
head(COVID_WORLD[,1:6])
##      Country TotalCases NewCases TotalDeaths NewDeaths TotalRecovered
## X.8      USA     819175     +431     45,343        +25         82,973
## X.9    Spain     204178              21,282                    82,514
## X.10   Italy     183957              24,648                    51,600
## X.11  France     158050              20,796                    39,181
## X.12 Germany     148453               5,086                    99,400
## X.13      UK     129044              17,337                       N/A
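As the printed output shows, the remaining count columns still carry commas, “+” signs, and “N/A” entries. Here is a sketch of a one-pass cleanup (my addition, using the already-loaded readr package); ‘parse_number’ strips grouping commas and non-numeric prefixes, and “N/A” becomes NA with a parsing warning:

count_cols                  <- setdiff(colnames(COVID_WORLD), "Country")
COVID_WORLD[count_cols]     <- lapply(COVID_WORLD[count_cols],
                                      function(x) parse_number(as.character(x)))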

Quick Analysis

Here is a quick analysis of the data to show that the extraction works end-to-end; nothing was copied and pasted together by hand, so you can use this as a tool.

InterQtRange <- quantile(x = COVID_WORLD$TotalCases, 0.75) - quantile(x = COVID_WORLD$TotalCases, 0.25)
LowerBound <- quantile(x = COVID_WORLD$TotalCases, 0.25) - 1.5 * InterQtRange
UpperBound <- quantile(x = COVID_WORLD$TotalCases, 0.75) + 1.5 * InterQtRange
COVID_WORLD_Bar <- COVID_WORLD[COVID_WORLD$TotalCases >=  UpperBound, ]
list(
"Distribution of Total Cases for All Countries"=summary(COVID_WORLD$TotalCases),
"Counties with Total Cases Above Upper Bound"=paste(length(COVID_WORLD_Bar$TotalCases)," countries out of ", length(COVID_WORLD$TotalCases), " countries are considered outliers"),
"Country Outliers"= sort(as.character(COVID_WORLD_Bar$Country))
)
## $`Distribution of Total Cases for All Countries`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      1.0     56.2    419.0  12100.5   3160.2 819175.0 
## 
## $`Countries with Total Cases Above Upper Bound`
## [1] "32 countries out of 212 countries are considered outliers"
## 
## $`Country Outliers`
##  [1] "Austria"      "Belgium"      "Brazil"       "Canada"       "Chile"       
##  [6] "China"        "Ecuador"      "France"       "Germany"      "India"       
## [11] "Iran"         "Ireland"      "Israel"       "Italy"        "Japan"       
## [16] "Mexico"       "Netherlands"  "Pakistan"     "Peru"         "Poland"      
## [21] "Portugal"     "Romania"      "Russia"       "S. Korea"     "Saudi Arabia"
## [26] "Singapore"    "Spain"        "Sweden"       "Switzerland"  "Turkey"      
## [31] "UK"           "USA"
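To visualize the outlier countries, a minimal base-R sketch (my addition; the original analysis stops at the list above) could chart COVID_WORLD_Bar directly:

barplot(COVID_WORLD_Bar$TotalCases,
        names.arg = as.character(COVID_WORLD_Bar$Country),
        las = 2, cex.names = 0.7,
        main = "Countries Above the Upper Outlier Bound",
        ylab = "Total Cases")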

As an added bonus, I have put together a table of county cards, updated daily from Johns Hopkins University, that includes the country demographics not provided in the JHU dashboard. The demographics data was provided by Simple Maps. You can access the code and the demographics table with the county cards at my GitHub repo.

Citations

Worldometer Info

Johns Hopkins Dashboard

Simple Maps

My github repo

citation("tidyverse")
## 
##   Wickham et al., (2019). Welcome to the tidyverse. Journal of Open
##   Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Welcome to the {tidyverse}},
##     author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
##     year = {2019},
##     journal = {Journal of Open Source Software},
##     volume = {4},
##     number = {43},
##     pages = {1686},
##     doi = {10.21105/joss.01686},
##   }