In a previous effort, I used the John Hopkins University Data as my main source. Upon discovering several artefacts in the data such as COVID19 resurrections and other unexplained anomalies I decided to switch my data source to Our World in Data (OWiD).
library(dplyr)
We download the data via the raw URL of where the .csv
file sits in the OWiD Github repo.
covid = read.csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
## save local copy
write.csv(covid, '../owid-data.csv')
A quick look into the structure of the data shows us that the data is quite rich and many features have already been pre-computed. We won’t need many of these features for our simple dashboard:
str(covid)
## 'data.frame': 52441 obs. of 41 variables:
## $ iso_code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
## $ location : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ date : chr "2019-12-31" "2020-01-01" "2020-01-02" "2020-01-03" ...
## $ total_cases : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_cases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed : num NA NA NA NA NA NA 0 0 0 0 ...
## $ total_deaths : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_smoothed : num NA NA NA NA NA NA 0 0 0 0 ...
## $ total_cases_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_cases_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed_per_million : num NA NA NA NA NA NA 0 0 0 0 ...
## $ total_deaths_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_smoothed_per_million: num NA NA NA NA NA NA 0 0 0 0 ...
## $ total_tests : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed_per_thousand: num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_per_case : num NA NA NA NA NA NA NA NA NA NA ...
## $ positive_rate : num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_units : chr "" "" "" "" ...
## $ stringency_index : num NA 0 0 0 0 0 0 0 0 0 ...
## $ population : num 38928341 38928341 38928341 38928341 38928341 ...
## $ population_density : num 54.4 54.4 54.4 54.4 54.4 ...
## $ median_age : num 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
## $ aged_65_older : num 2.58 2.58 2.58 2.58 2.58 ...
## $ aged_70_older : num 1.34 1.34 1.34 1.34 1.34 ...
## $ gdp_per_capita : num 1804 1804 1804 1804 1804 ...
## $ extreme_poverty : num NA NA NA NA NA NA NA NA NA NA ...
## $ cardiovasc_death_rate : num 597 597 597 597 597 ...
## $ diabetes_prevalence : num 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
## $ female_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ male_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ handwashing_facilities : num 37.7 37.7 37.7 37.7 37.7 ...
## $ hospital_beds_per_thousand : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ life_expectancy : num 64.8 64.8 64.8 64.8 64.8 ...
## $ human_development_index : num 0.498 0.498 0.498 0.498 0.498 0.498 0.498 0.498 0.498 0.498 ...
We see that the table consists of day-wise and country-wise features. For example, the human development index is only computed every few years and is hence a country-wise feature. Other features such as the new_deaths
are day-wise features in the sense that they reflect the number of daily cases at any given day in any given country. We are interested in cumulative events (deaths, confirmed cases, recoveries, number of tests) and daily events, hence we will make one table for each use case.
First of all we will select only the column that are useful to us (country identifier such as name or ISO code are useful, especially for maps):
covid = covid %>%
select(iso_code,
continent,
location,
date,
total_cases,
new_cases,
total_deaths,
new_deaths,
total_tests,
new_tests
)
Before moving on, we will also change the date
column from character type to Date type:
covid$date = as.Date(covid$date)
Note: We actually won’t use tests data as it has been very sparsely collected, see it for yourself in the chart below.
library(naniar)
vis_miss(covid %>% arrange(date) %>% select(contains(c('total','new'))))
As you can see above roughly 62% of testing data is missing, where as data missingness for confirmed cases has been in average below 6.88% and missingness for deaths has been below 23.84%.
covid = covid %>% select(-contains('tests'))
Now we split our data into cumulative
and daily
## this line selects all columns from covid except those that contain
## the word new
cumulative = covid %>% select(-contains('new'))
daily = covid %>% select(-contains('total'))
Using the JHU data source, we observed negative daily cases and negative deaths (see first section)[#note-data-source]. For our peace of mind we will check if this data has such artefacts.
any(daily$new_cases < 0)
## [1] TRUE
any(daily$new_deaths < 0)
## [1] TRUE
At this point, I could freak out (LOL) but I won’t. Someone pointed at the issue of negative cases and deaths in the OWiD repository. A collaborator from the OWiD team described how this is actually due to countries overestimating their death toll and sending an update to entities such as the European Center for Disease Control (ECDC) at later dates. Not much can be done about this. At this point I decided to go ahead with using the OWiD data as it is very clean and it is regularly maintained by a professional team. Also the open issue count in the OWiD repo (8 at time of reading) is more encouraging than that of the JHU repo (1305 at the time of reading).
library(ggplot2)
daily %>%
filter(iso_code == 'ESP') %>%
ggplot(aes(x = date,y = new_cases))+
geom_point(color = 'red')+
ggtitle('Spain\'s Daily COVID cases')
Data collector do not control and often do not seem to know how nations report statistics. Spain seems to be chunk reports of daily cases every few days hence the rapidly fluctuating number of daily confirmed cases during the 2nd wave of the pandemic.
daily %>%
filter(iso_code == 'GBR') %>%
ggplot(aes(x = date,y = new_cases))+
geom_point(color = 'darkblue')+
ggtitle('UK Daily Confirmed Cases')
cumulative %>%
filter(iso_code == 'GBR') %>%
ggplot(aes(x = date,y = total_deaths))+
geom_point(color = 'darkgreen')+
ggtitle('Italy\'s Cumulative Confirmed Cases')+
ylab('Total number of confirmed cases (log10 scale)')
The OWiD data is extremely useful and little to no data cleaning was needed (in comparison to the cleaning of the JHU data).
saveRDS(daily, '../ProcessedData/daily.rds')
saveRDS(cumulative, '../ProcessedData/cumulative.rds')