Note - Data Source

In a previous effort, I used the John Hopkins University Data as my main source. Upon discovering several artefacts in the data such as COVID19 resurrections and other unexplained anomalies I decided to switch my data source to Our World in Data (OWiD).

Load useful libraries

library(dplyr)

Download data

We download the data via the raw URL of where the .csv file sits in the OWiD Github repo.

covid = read.csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')

## save local copy
write.csv(covid, '../owid-data.csv')

A quick look into the structure of the data shows us that the data is quite rich and many features have already been pre-computed. We won’t need many of these features for our simple dashboard:

str(covid)

## 'data.frame':    52441 obs. of  41 variables:
##  $ iso_code                       : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ continent                      : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ location                       : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ date                           : chr  "2019-12-31" "2020-01-01" "2020-01-02" "2020-01-03" ...
##  $ total_cases                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_cases                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed             : num  NA NA NA NA NA NA 0 0 0 0 ...
##  $ total_deaths                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths_smoothed            : num  NA NA NA NA NA NA 0 0 0 0 ...
##  $ total_cases_per_million        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_cases_per_million          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed_per_million : num  NA NA NA NA NA NA 0 0 0 0 ...
##  $ total_deaths_per_million       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths_per_million         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths_smoothed_per_million: num  NA NA NA NA NA NA 0 0 0 0 ...
##  $ total_tests                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests                      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests_per_thousand       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_per_thousand         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed_per_thousand: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_per_case                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ positive_rate                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_units                    : chr  "" "" "" "" ...
##  $ stringency_index               : num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ population                     : num  38928341 38928341 38928341 38928341 38928341 ...
##  $ population_density             : num  54.4 54.4 54.4 54.4 54.4 ...
##  $ median_age                     : num  18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
##  $ aged_65_older                  : num  2.58 2.58 2.58 2.58 2.58 ...
##  $ aged_70_older                  : num  1.34 1.34 1.34 1.34 1.34 ...
##  $ gdp_per_capita                 : num  1804 1804 1804 1804 1804 ...
##  $ extreme_poverty                : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ cardiovasc_death_rate          : num  597 597 597 597 597 ...
##  $ diabetes_prevalence            : num  9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
##  $ female_smokers                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ male_smokers                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ handwashing_facilities         : num  37.7 37.7 37.7 37.7 37.7 ...
##  $ hospital_beds_per_thousand     : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ life_expectancy                : num  64.8 64.8 64.8 64.8 64.8 ...
##  $ human_development_index        : num  0.498 0.498 0.498 0.498 0.498 0.498 0.498 0.498 0.498 0.498 ...

We see that the table consists of day-wise and country-wise features. For example, the human development index is only computed every few years and is hence a country-wise feature. Other features such as the new_deaths are day-wise features in the sense that they reflect the number of daily cases at any given day in any given country. We are interested in cumulative events (deaths, confirmed cases, recoveries, number of tests) and daily events, hence we will make one table for each use case.

First of all we will select only the column that are useful to us (country identifier such as name or ISO code are useful, especially for maps):

covid = covid %>% 
  select(iso_code,
         continent,
         location,
         date,
         total_cases,
         new_cases,
         total_deaths,
         new_deaths,
         total_tests,
         new_tests
         )

Before moving on, we will also change the date column from character type to Date type:

covid$date = as.Date(covid$date)

Note: We actually won’t use tests data as it has been very sparsely collected, see it for yourself in the chart below.

library(naniar)
vis_miss(covid %>% arrange(date) %>% select(contains(c('total','new'))))

As you can see above roughly 62% of testing data is missing, where as data missingness for confirmed cases has been in average below 6.88% and missingness for deaths has been below 23.84%.

covid = covid %>% select(-contains('tests'))

Now we split our data into cumulative and daily

## this line selects all columns from covid except those that contain
## the word new
cumulative = covid %>% select(-contains('new'))

daily = covid %>% select(-contains('total'))

Using the JHU data source, we observed negative daily cases and negative deaths (see first section)[#note-data-source]. For our peace of mind we will check if this data has such artefacts.

any(daily$new_cases < 0)

## [1] TRUE

any(daily$new_deaths < 0)

## [1] TRUE

At this point, I could freak out (LOL) but I won’t. Someone pointed at the issue of negative cases and deaths in the OWiD repository. A collaborator from the OWiD team described how this is actually due to countries overestimating their death toll and sending an update to entities such as the European Center for Disease Control (ECDC) at later dates. Not much can be done about this. At this point I decided to go ahead with using the OWiD data as it is very clean and it is regularly maintained by a professional team. Also the open issue count in the OWiD repo (8 at time of reading) is more encouraging than that of the JHU repo (1305 at the time of reading).

Plots for diagnostics

library(ggplot2)
daily %>% 
  filter(iso_code == 'ESP') %>%
  ggplot(aes(x = date,y = new_cases))+
  geom_point(color = 'red')+
  ggtitle('Spain\'s Daily COVID cases')

Data collector do not control and often do not seem to know how nations report statistics. Spain seems to be chunk reports of daily cases every few days hence the rapidly fluctuating number of daily confirmed cases during the 2nd wave of the pandemic.

daily %>% 
  filter(iso_code == 'GBR') %>%
  ggplot(aes(x = date,y = new_cases))+
  geom_point(color = 'darkblue')+
  ggtitle('UK Daily Confirmed Cases')

cumulative %>% 
  filter(iso_code == 'GBR') %>%
  ggplot(aes(x = date,y = total_deaths))+
  geom_point(color = 'darkgreen')+
  ggtitle('Italy\'s Cumulative Confirmed Cases')+
  ylab('Total number of confirmed cases (log10 scale)')

Save data

The OWiD data is extremely useful and little to no data cleaning was needed (in comparison to the cleaning of the JHU data).

saveRDS(daily, '../ProcessedData/daily.rds')
saveRDS(cumulative, '../ProcessedData/cumulative.rds')

COVID-19 Data Cleaning

Luis Chaves

2020

Note - Data Source

Load useful libraries

Download data

Plots for diagnostics

Save data