I became very interested in COVID-19 data when my friends, co-workers and family started to express fear based on data provided by the media (which is being retracted almost daily due to inaccuracy). I decided to automate the collection of Worldometer data from the back end to simplify it for ‘non-data analysts’.
I then thought about how many people in my community use R every day but are too busy building their own scripts to have time to write a web-scraping tool for this public data. So now … here is your public web-scraping tool to automate Worldometers data and aggregate it as you like!
First, we will need to load in our R libraries.
library(rvest)     # web scraping: read_html(), html_nodes(), html_text()
library(stringr)   # string manipulation: str_split()
library(readr)     # reading/parsing helpers
library(magrittr)  # the %>% pipe
Next, I used the ‘read_html’ function to grab the Worldometers countries page.
URL_WORLD <- read_html("https://www.worldometers.info/coronavirus/#countries")
After inspecting the Worldometers page, I found the most consistent HTML node path to be “//tr”. Although others have extracted data from this site, I was only able to make it work through an XPath call, an approach I had not seen attempted before.
The strings come with a newline (“\n”) separator, which I used to split them. The column names were pulled from the first element of the first list. I then combined everything with a nested ‘Reduce’ and ‘rbind’ function call.
COVID_WORLD <- URL_WORLD %>% html_nodes(xpath = "//tr") %>% html_text()
COVID_WORLD <- str_split(COVID_WORLD, "\n")
WORLD_NAMES <- COVID_WORLD[[1]][1:12]
COVID_WORLD <- data.frame(Reduce(rbind, COVID_WORLD))
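Before trimming the table, a quick sanity check (my addition, not part of the original post) confirms the scrape came back in the expected shape:
dim(COVID_WORLD)  # one row per <tr> scraped; columns are the split string pieces
WORLD_NAMES       # the twelve column labels taken from the first row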
The next phase creates dynamic references for trimming the table, since the data on the Worldometers site changes and the total rows would skew any analysis results. My goal was to keep exactly the country rows, matching the format of the countries page.
WORLD_Sub_A <- grep("All",COVID_WORLD$X13) # locate countries rows to keep
WORLD_Sub_B <- grep("Total:", COVID_WORLD$X1) # locate total rows to remove
WORLD_Sub_1 <- WORLD_Sub_A[1] + 1 # first position
WORLD_Sub_2 <- WORLD_Sub_B[1] - 1 # last position
COVID_WORLD <- COVID_WORLD[ WORLD_Sub_1 : WORLD_Sub_2, 1:12]
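As a quick check that the dynamic trimming kept only country rows (again, my addition for illustration), inspect the first and last surviving rows:
head(COVID_WORLD, 1)  # should be the first country row, not a header row
tail(COVID_WORLD, 1)  # should be the last country row, not a 'Total:' row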
This phase just cleans up the ‘TotalCases’ column to remove spaces and commas.
colnames(COVID_WORLD) <- WORLD_NAMES
COVID_WORLD$TotalCases <- gsub(" ", "", COVID_WORLD$TotalCases)
COVID_WORLD$TotalCases <- gsub(",", "", COVID_WORLD$TotalCases)
Here, I rename the columns, convert ‘TotalCases’ to numeric, and sort the data set by ‘TotalCases’ in descending order.
colnames(COVID_WORLD) <- c("Country","TotalCases","NewCases","TotalDeaths","NewDeaths","TotalRecovered",
"ActiveCases", "SeriousCritical","Tot Cases/1M pop","Deaths/1M pop",
"TotalTests","Tests/1M pop")
COVID_WORLD$TotalCases <- as.numeric(COVID_WORLD$TotalCases)
COVID_WORLD <- COVID_WORLD[ order(COVID_WORLD$TotalCases,decreasing = T),]
head(COVID_WORLD[,1:6])
## Country TotalCases NewCases TotalDeaths NewDeaths TotalRecovered
## X.8 USA 819175 +431 45,343 +25 82,973
## X.9 Spain 204178 21,282 82,514
## X.10 Italy 183957 24,648 51,600
## X.11 France 158050 20,796 39,181
## X.12 Germany 148453 5,086 99,400
## X.13 UK 129044 17,337 N/A
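In the output above, the remaining count columns (NewCases, TotalDeaths, and so on) still contain commas and are stored as text. If you want them numeric as well, the same gsub/as.numeric pattern can be reused; here is a minimal sketch (not part of the original workflow, and the column list is an assumption based on the names set above):
# Clean and convert the remaining count columns (illustrative)
count_cols <- c("NewCases", "TotalDeaths", "NewDeaths", "TotalRecovered",
                "ActiveCases", "SeriousCritical", "TotalTests")
for (col in count_cols) {
  # strips spaces, commas and leading '+'; entries such as "N/A" become NA with a warning
  COVID_WORLD[[col]] <- as.numeric(gsub("[ ,+]", "", COVID_WORLD[[col]]))
}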
Here is a quick analysis of the data to show that the extraction works end to end; nothing was copied and pasted together, so you can use this as a tool.
InterQtRange <- quantile(x = COVID_WORLD$TotalCases, 0.75) - quantile(x = COVID_WORLD$TotalCases, 0.25)
LowerBound <- quantile(x = COVID_WORLD$TotalCases, 0.25) - 1.5 * InterQtRange
UpperBound <- quantile(x = COVID_WORLD$TotalCases, 0.75) + 1.5 * InterQtRange
COVID_WORLD_Bar <- COVID_WORLD[COVID_WORLD$TotalCases >= UpperBound, ]
list(
"Distribution of Total Cases for All Countries"=summary(COVID_WORLD$TotalCases),
"Counties with Total Cases Above Upper Bound"=paste(length(COVID_WORLD_Bar$TotalCases)," countries out of ", length(COVID_WORLD$TotalCases), " countries are considered outliers"),
"Country Outliers"= sort(as.character(COVID_WORLD_Bar$Country))
)
## $`Distribution of Total Cases for All Countries`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 56.2 419.0 12100.5 3160.2 819175.0
##
## $`Countries with Total Cases Above Upper Bound`
## [1] "32 countries out of 212 countries are considered outliers"
##
## $`Country Outliers`
## [1] "Austria" "Belgium" "Brazil" "Canada" "Chile"
## [6] "China" "Ecuador" "France" "Germany" "India"
## [11] "Iran" "Ireland" "Israel" "Italy" "Japan"
## [16] "Mexico" "Netherlands" "Pakistan" "Peru" "Poland"
## [21] "Portugal" "Romania" "Russia" "S. Korea" "Saudi Arabia"
## [26] "Singapore" "Spain" "Sweden" "Switzerland" "Turkey"
## [31] "UK" "USA"
For an added bonus, I have put together a table of county cards, updated daily from Johns Hopkins University, that includes demographics not provided in the JHU dashboard. The demographics data was provided by Simple Maps. You can access the code and the demographics table with the county cards at my GitHub repo.
citation("tidyverse")
##
## Wickham et al., (2019). Welcome to the tidyverse. Journal of Open
## Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Welcome to the {tidyverse}},
## author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
## year = {2019},
## journal = {Journal of Open Source Software},
## volume = {4},
## number = {43},
## pages = {1686},
## doi = {10.21105/joss.01686},
## }