WORLD STATISTICS ON THE CORONAVIRUS

written by Leo Langat Maina

Web Scraping

This is the term used for extracting data from a website. Sometimes websites do not offer links to CSV or XLS files to download. It is still quite possible to capture such data, because the information a browser uses to render a page is received as text from a server, and this text is normally written in HyperText Markup Language (HTML).
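
To make that idea concrete, here is a minimal sketch of how the rvest package turns such text into a data frame. The HTML snippet and the names raw_html, small_page and small_table are made up for illustration; they stand in for the text a real server would send.

library(rvest)

# a tiny, made-up HTML snippet standing in for the text a server would send
raw_html <- '<table>
  <tr><th>Country</th><th>Cases</th></tr>
  <tr><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>25</td></tr>
</table>'

small_page  <- read_html(raw_html)                                  # parse the raw HTML text
small_table <- small_page %>% html_node("table") %>% html_table()   # pull the table out as a data frame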

So, essentially, I have downloaded this page, imported it into R, and written a program to extract the information I need from it.

library(rvest)
## Loading required package: xml2
url <- "https://www.worldometers.info/coronavirus/"
CoV_19 <- read_html(url)
class(CoV_19)  # the object is an xml_document, i.e. the page parsed as a general markup (XML/HTML) document
## [1] "xml_document" "xml_node"
table <- CoV_19 %>% html_nodes("table")  # extract all the <table> nodes from the page
table <- table[[2]]  # the second one is my table of interest (id "main_table_countries_yesterday")
table
## {html_node}
## <table id="main_table_countries_yesterday" class="table table-bordered table-hover main_table_countries" style="width:100%;margin-top: 0px !important;display:none;">
## [1] <thead><tr>\n<th width="1%">#</th>\n<th width="100">Country,<br>Other</th ...
## [2] <tbody>\n<tr class="total_row_world row_continent" data-continent="Asia"  ...
## [3] <tbody class="body_continents">\n<tr class="row_continent total_row" data ...
## [4] <tbody class="total_row_body body_world"><tr class="total_row">\n<td></td ...
table <- table %>% html_table()
class(table)  # now I have my table as a data frame
## [1] "data.frame"

The table is now a data frame, but I still have to do some data wrangling because the data is not tidy. You will notice that some of the columns that should be numeric are characters because of the commas used as thousands separators. I will have to convert the numbers contained in these character strings into usable numeric types before I can create plots and visuals and analyse the data.

I achieve this with the following libraries.

library(stringr)
library(dplyr)
library(readr)
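
As a quick illustration on made-up strings (not the scraped table), str_detect() flags the commas and parse_number() strips them before converting to a number:

str_detect(c("1,402,356", "85"), ",")  # TRUE FALSE: only the first value contains a comma
parse_number("1,402,356")              # 1402356: the commas are dropped and a numeric value is returned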

commas <- function(x) any(str_detect(x, ","))  # does any value in this column contain a comma?
table %>% summarize_all(commas)
##    # Country,Other TotalCases NewCases TotalDeaths NewDeaths TotalRecovered
## 1 NA         FALSE       TRUE     TRUE        TRUE      TRUE           TRUE
##   NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop
## 1         TRUE        TRUE             TRUE             TRUE          TRUE
##   TotalTests Tests/1M pop Population Continent 1 Caseevery X ppl
## 1       TRUE         TRUE       TRUE     FALSE              TRUE
##   1 Deathevery X ppl 1 Testevery X ppl
## 1               TRUE              TRUE
#parse_number(table$TotalCases)  # parse_number() removes non-numeric characters before coercing to a number; I use it here to preview the output
#parse_number(table$NewDeaths)
newtab <- table %>% mutate_at(3:15, parse_number)    # TotalCases through Population
newtab <- newtab %>% mutate_at(17:19, parse_number)  # the "1 Case/Death/Test every X ppl" columns
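
As a quick check (this output is not part of the original run), the converted columns should now report a numeric class:

sapply(newtab[, c("TotalCases", "NewDeaths", "Population")], class)  # expected: "numeric" for each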

myDataTab <- newtab[-c(1:8, 224:231), ]  # drop the aggregate rows at the top and bottom of the table (continent/world totals), keeping country-level rows
myDataTab %>% group_by(Continent) %>% summarise(n=n()) %>% knitr::kable()
|Continent         |  n|
|:-----------------|--:|
|                  |  2|
|Africa            | 57|
|Asia              | 49|
|Australia/Oceania |  6|
|Europe            | 48|
|North America     | 39|
|South America     | 14|
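
With the numeric columns in place, the data can be plotted. The sketch below is one possible visual, assuming ggplot2 is installed and the scraped column names (Country,Other and TotalCases) are unchanged; top10 is a name I introduce here. It shows the ten countries reporting the most cases.

library(ggplot2)

top10 <- myDataTab %>%
  arrange(desc(TotalCases)) %>%  # order countries by reported cases
  head(10)                       # keep the ten largest

ggplot(top10, aes(x = reorder(`Country,Other`, TotalCases), y = TotalCases)) +
  geom_col() +
  coord_flip() +                 # horizontal bars so country names stay readable
  labs(x = "Country", y = "Total reported cases",
       title = "Ten countries with the most reported COVID-19 cases")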