syntaxed by Leo Langat Maina
Web scraping is the term used for extracting data from a website. Sometimes websites do not offer links to CSV or XLS files for download. It is still possible to capture such data because the information a browser uses to render a page is received from the server as text. This text is normally written in HyperText Markup Language (HTML).
So essentially I download the page, import it into R, and write a program to extract the information I need from it.
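To make the idea concrete, here is a minimal sketch that parses HTML held in a plain character string, with no network access. The string `page_text`, its contents, and the name `demo` are hypothetical stand-ins for a real server response, not data from the page used in this post.

```r
library(rvest)

# Hypothetical page: a literal HTML string standing in for a server response.
page_text <- paste0(
  "<html><body><table>",
  "<tr><th>Country</th><th>TotalCases</th></tr>",
  "<tr><td>Kenya</td><td>1,234</td></tr>",
  "<tr><td>Peru</td><td>56,789</td></tr>",
  "</table></body></html>"
)

page <- read_html(page_text)          # parse the text into an xml_document
demo <- page %>% html_nodes("table")  # collect every <table> node
demo <- html_table(demo[[1]])         # convert the first table to a data frame
demo
```

The counts come back as character strings because of the commas, which is the same problem the real table has.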
library(rvest)
## Loading required package: xml2
url <- "https://www.worldometers.info/coronavirus/"
CoV_19 <- read_html(url)
class(CoV_19) # the parsed page is stored as an XML document object
## [1] "xml_document" "xml_node"
table <- CoV_19 %>% html_nodes("table") # extract every <table> node from the page
table <- table[[2]] # my table of interest
table
## {html_node}
## <table id="main_table_countries_yesterday" class="table table-bordered table-hover main_table_countries" style="width:100%;margin-top: 0px !important;display:none;">
## [1] <thead><tr>\n<th width="1%">#</th>\n<th width="100">Country,<br>Other</th ...
## [2] <tbody>\n<tr class="total_row_world row_continent" data-continent="Asia" ...
## [3] <tbody class="body_continents">\n<tr class="row_continent total_row" data ...
## [4] <tbody class="total_row_body body_world"><tr class="total_row">\n<td></td ...
table <- table %>% html_table
class(table) # now I have my table as a data frame
## [1] "data.frame"
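Picking the table by position (`[[2]]`) breaks if the site adds or reorders tables. A possibly more robust sketch is to select it by the `id` attribute shown in the node output above. The two-table HTML below is hypothetical; only the id `main_table_countries_yesterday` is taken from that output, and `by_id` is an illustrative name.

```r
library(rvest)

# Hypothetical two-table page; only the id comes from the real output.
html <- paste0(
  '<table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>',
  '<table id="main_table_countries_yesterday">',
  '<tr><th>Country</th></tr><tr><td>Kenya</td></tr></table>'
)

by_id <- read_html(html) %>%
  html_nodes("#main_table_countries_yesterday") %>%  # CSS id selector
  html_table()                                       # list of data frames
by_id <- by_id[[1]]                                  # the single match
by_id
```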
The class of table is now a data frame. However, I still have to do some data wrangling because the data is not tidy. Notice that some columns that should be numeric are stored as characters because of the commas. I have to convert the numeric data contained in these character strings into usable data types before I can create plots and analyse the data.
I achieve this with the following libraries.
library(stringr)
library(dplyr)
library(readr)
commas <- function(x) any(str_detect(x, ",")) # TRUE if any value in a column contains a comma
table %>% summarize_all(commas) # funs() is deprecated in dplyr, so the function is passed directly
## # Country,Other TotalCases NewCases TotalDeaths NewDeaths TotalRecovered
## 1 NA FALSE TRUE TRUE TRUE TRUE TRUE
## NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop
## 1 TRUE TRUE TRUE TRUE TRUE
## TotalTests Tests/1M pop Population Continent 1 Caseevery X ppl
## 1 TRUE TRUE TRUE FALSE TRUE
## 1 Deathevery X ppl 1 Testevery X ppl
## 1 TRUE TRUE
#parse_number(table$TotalCases) # parse_number() drops non-numeric characters before coercing the string to a number; I use it here to inspect the output
#parse_number(table$NewDeaths)
newtab <- table %>% mutate_at(3:15, parse_number)
newtab <- newtab %>% mutate_at(17:19, parse_number)
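As a small illustration of the conversion step, the sketch below applies `parse_number()` to a toy data frame. The data frame `toy` and its values are hypothetical. It uses `across()`, which assumes dplyr 1.0.0 or later, and selects columns by name rather than by position, so it is an alternative to the `mutate_at()` calls above and not the same code.

```r
library(dplyr)
library(readr)

# Hypothetical rows mimicking the scraped character columns.
toy <- data.frame(
  Country    = c("Kenya", "Peru"),
  TotalCases = c("1,234", "56,789"),
  NewCases   = c("12", "3,400"),
  stringsAsFactors = FALSE
)

single <- parse_number("1,234")  # the grouping comma is dropped

# Convert the named columns in one step (assumes dplyr >= 1.0.0).
toy_num <- toy %>%
  mutate(across(c(TotalCases, NewCases), parse_number))
str(toy_num)
```

Selecting by name keeps the code working if the site inserts a new column and shifts the positions.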
myDataTab <- newtab[-c(1:8,224:231),]
myDataTab %>% group_by(Continent) %>% summarise(n=n()) %>% knitr::kable()
| Continent | n |
|---|---|
|  | 2 |
| Africa | 57 |
| Asia | 49 |
| Australia/Oceania | 6 |
| Europe | 48 |
| North America | 39 |
| South America | 14 |
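Two rows in the summary have an empty Continent. One way to make such rows explicit is to turn the empty string into `NA` before counting. This is a sketch on hypothetical data (`toy`, `Ship A`), using `count()` as a shorthand for `group_by()` followed by `summarise(n = n())`.

```r
library(dplyr)

# Hypothetical rows; "Ship A" stands in for an entry with no continent.
toy <- data.frame(
  Country   = c("Kenya", "Peru", "Ship A"),
  Continent = c("Africa", "South America", ""),
  stringsAsFactors = FALSE
)

by_continent <- toy %>%
  mutate(Continent = na_if(Continent, "")) %>%  # empty string becomes NA
  count(Continent)                              # one row per continent with n
by_continent
```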