Global Open Access COVID-19 Epidemiological Data: What is it? How good is the data quality? How can we improve it?

1. What is Open Access COVID-19 Epidemiological Data?

Coronavirus disease 2019 (COVID-19) has been spreading rapidly across the globe. Basic aggregate data such as the total numbers of tests, new cases, and deaths have been published regularly and frequently, providing critical information to understand the magnitude, pace, and location of the epidemic. Compared to previous disease outbreaks, we see substantially more publicly accessible data (Dong 2020, WHO 2020) and more use of such data by media and researchers. This rapid and transparent information has been extremely valuable not only for public health authorities but also for the general public.

In addition to the aggregate data, a research group published Open Access Epidemiological Data, a centralized global database with information on each individual case, based on reports from WHO, Ministries of Health, and Chinese health authorities. The rapid action to establish and share this open access, machine-readable, regularly updated, de-identified case-level data is beyond commendable. It can answer questions that require information disaggregated by specific background characteristics. The database has 34 variables covering age, sex, sub-national location, presence of underlying chronic illness, travel history, and dates of symptom onset, hospitalization, and laboratory confirmation. Further methodological details about the database were recently published. The research team collaborated with data curators who thoroughly assessed source data from individual countries as well as WHO, and applied advanced data management techniques to create and update the large database. Anybody can view and access it on their GitHub. As a data enthusiast, I was thrilled to see it, and I admire the research team's efforts.

2. What does the data quality look like?

As any good public health data scientist would, I also started assessing the data quality. Unfortunately, the quality of the case-level data is suboptimal.

First, timeliness, one of the most important requirements for outbreak data, is poor. As of April 5, 2020, 8:33 PM EDT, a total of 261558 cases from 116 countries were included in the database. This is about 20% of the total confirmed cases available on the same date from other currently available sources. Since there is no easily accessible information on the date of the source data, we cannot assess the actual time between laboratory confirmation, reporting of the result in the source data, and inclusion of the case in the database. Nevertheless, considering the exponential increase in new cases around the globe, I again admire the perseverance of the team.

2.1 Overall completeness of reported age and sex

Second, completeness (i.e., the extent to which an actual and reasonable value is recorded, for example age 57, not 575 or missing) is one of the most important attributes of high quality data, as analyses based on less complete data can produce biased results. When completeness of age and sex information, the two most basic demographic characteristics in epidemiologic studies, is assessed among the currently available 261558 cases, only 5.1% and 5.1% have complete information for age and sex, respectively.
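To make the definition concrete, here is a minimal Python sketch of one way completeness could be computed. It is not the code behind the numbers above: the 0 to 120 plausibility range, the accepted sex labels, and the function names are all assumptions made for illustration.

```python
import re

MAX_PLAUSIBLE_AGE = 120  # assumption: upper bound for a "reasonable" age


def is_complete_age(value):
    """Return True if the entry holds an actual, plausible age or age range."""
    if value is None:
        return False
    text = str(value).strip()
    # Pull out every number, so "57", "20-29", and "60+" can all be checked.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return False  # empty string, "N/A", and similar count as missing
    return all(n <= MAX_PLAUSIBLE_AGE for n in numbers)


def is_complete_sex(value):
    """Return True if the entry is one of the assumed valid labels."""
    return value is not None and str(value).strip().lower() in {"male", "female"}


def completeness(values, is_complete):
    """Percentage of entries that pass the given completeness check."""
    values = list(values)
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if is_complete(v)) / len(values)


# Hypothetical entries, not taken from the database.
ages = ["57", "575", "", None, "20-29", "N/A"]
sexes = ["male", "Female", "", None, "unknown"]
print(round(completeness(ages, is_complete_age), 1))   # 33.3
print(round(completeness(sexes, is_complete_sex), 1))  # 40.0
```

In this toy example only "57" and "20-29" count as complete ages; "575" is recorded but not reasonable, so it is treated the same as a missing value.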

Even among cases with complete age, the units used vary substantially across countries, which complicates pooled data analysis. Further, age is sometimes reported in units that are too wide for epidemiologic studies. Among cases with age, 76% have age reported in single years, 0% in 5-year age groups, 18% in 10-year age groups, and 6% in age groups wider than 10 years. Open-ended age groups, which are typically used for older populations, are used in 0.5% of cases with age in the database, but the groups often cover the general adult population rather than older populations (e.g., ‘20 and above’, ‘18 and above’) or start at a relatively low age for an older-population group (e.g., ‘60 and above’). This is problematic, since mortality varies greatly by age even within the open range of 60 and above. See, for example, the age-specific mortality rates in South Korea.
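The grouping of age entries by reporting unit can be sketched as follows. This is an illustration only: the input formats ("57", "20-29", "60+", "60 and above"), the category labels, and the `age_unit` function are assumptions, and the source data may use other formats that this sketch would label as unrecognized.

```python
import re
from collections import Counter


def age_unit(value):
    """Classify an age entry by the width of the unit it is reported in."""
    text = str(value).strip().lower()
    # Open-ended groups such as "60+", "60-", or "60 and above".
    if re.fullmatch(r"\d+\s*(\+|and above|-)", text):
        return "open-ended"
    # Closed ranges such as "20-29"; both bounds are treated as inclusive.
    match = re.fullmatch(r"(\d+)\s*-\s*(\d+)", text)
    if match:
        width = int(match.group(2)) - int(match.group(1)) + 1
        if width <= 1:
            return "single year"
        if width <= 5:
            return "5-year group"   # widths of 2 to 5 years
        if width <= 10:
            return "10-year group"  # widths of 6 to 10 years
        return "wider than 10 years"
    # A plain number such as "57".
    if re.fullmatch(r"\d+(\.\d+)?", text):
        return "single year"
    return "unrecognized"


# Hypothetical entries, not taken from the database.
entries = ["57", "20-24", "20-29", "18-64", "60+", "20 and above"]
counts = Counter(age_unit(e) for e in entries)
for unit, n in counts.items():
    print(f"{unit}: {100 * n / len(entries):.0f}%")
```

A classification like this makes it easy to see how many records would have to be dropped or coarsened before pooling data across countries.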

2.2 Completeness of reported age and sex by country

Data completeness varies substantially across countries. Figure 1 shows completeness of age and sex reporting by country among the 45 countries that have 50 or more confirmed cases in this database. It ranges from 0% to 100% (in Singapore) for both reported age and sex of patients. There is no clear relationship between the total number of cases and completeness (e.g., the fewer the cases, the higher the completeness) (Figure 2).
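A by-country breakdown with a minimum case threshold could be sketched like this. The record layout (dictionaries with "country", "age", and "sex" keys) and the function name are assumptions, and for brevity the check here only tests whether a value is present, which is cruder than checking that it is also reasonable.

```python
from collections import defaultdict


def completeness_by_country(cases, field, min_cases=50):
    """Percentage of cases with a non-empty `field`, per country.

    Countries with fewer than `min_cases` cases are left out.
    """
    totals = defaultdict(int)
    complete = defaultdict(int)
    for case in cases:
        country = case["country"]
        totals[country] += 1
        if str(case.get(field) or "").strip():
            complete[country] += 1
    return {
        country: 100.0 * complete[country] / n
        for country, n in totals.items()
        if n >= min_cases
    }


# Hypothetical records: country A has 60 cases (30 with age), country B has 10.
cases = (
    [{"country": "A", "age": "57", "sex": "male"}] * 30
    + [{"country": "A", "age": "", "sex": ""}] * 30
    + [{"country": "B", "age": "40", "sex": "female"}] * 10
)
print(completeness_by_country(cases, "age"))  # {'A': 50.0}
```

Country B is dropped because it falls below the 50-case threshold, even though all of its records are complete.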

Figure 1. Completeness of age and sex reporting by country (among 45 countries with 50 or more confirmed cases in the database)

Figure 2. Relationship between completeness of reporting and the number of cases (among 45 countries with 50 or more confirmed cases in the database)

2.3 Completeness of reported age and sex by country: How has it changed over time?

More interestingly, cumulative completeness for the first 50, 100, 150, and 200 confirmed cases also varied greatly between countries as they moved into crisis mode (Figures 3.A and 3.B).

  • Singapore, the Philippines, South Africa, Canada, Mexico, and Australia have very high completeness for both sex and age among the first 50 as well as throughout the first 200 cases. However, even in South Africa and Canada, the overall completeness is low for all cases currently included in the database (1348 cases for South Africa and 1710 for Canada as of April 5, 2020). In Singapore, the Philippines, Mexico, and Australia, 307, 512, 1389, and 544 cases are currently included in the database, respectively, and the overall completeness remains high for these countries.

  • In most countries, however, data completeness decreased as the number of cases increased. In South Korea, for example, completeness was initially very high, exceeding 90%, but dropped rapidly over the first 200 cases, potentially because the number of cases increased exponentially over a very short period (the total number of cases increased from 30 to 346 over 6 days) (KCDC 2020).

  • The pattern in Germany is unique in that data completeness improved over the first 200 cases.
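Cumulative completeness at each cutoff can be sketched as below. The input is assumed to be a list of True/False flags for one country, already sorted by confirmation order; the function name and the example series are hypothetical and only mimic the shape of a decline, not any country's actual data.

```python
def cumulative_completeness(flags, cutoffs=(50, 100, 150, 200)):
    """Percentage of complete records among the first n cases, for each cutoff n.

    `flags` holds one boolean per case, in order of confirmation.
    Cutoffs larger than the number of available cases are skipped.
    """
    result = {}
    for n in cutoffs:
        if len(flags) >= n:
            result[n] = 100.0 * sum(flags[:n]) / n
    return result


# Synthetic series: the first 45 cases are complete, the next 155 are not.
flags = [True] * 45 + [False] * 155
print(cumulative_completeness(flags))
# {50: 90.0, 100: 45.0, 150: 30.0, 200: 22.5}
```

Because each value is cumulative, a country that stops recording age or sex after its first few dozen cases shows a steady downward slope rather than an abrupt drop to zero.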

Figure 3.A. Completeness of age reporting for the first 50, 100, 150, and 200 confirmed cases by country (among 34 countries with 200 or more confirmed cases in the database)

Figure 3.B. Completeness of sex reporting for the first 50, 100, 150, and 200 confirmed cases by country (among 34 countries with 200 or more confirmed cases in the database)
Note: Countries are sorted in descending order of completeness for the first 50 cases and then for the first 200 cases. ‘All cases’ refers to all cases currently included in the database. China is excluded from this figure, since it was the first country affected by the epidemic.

3. What can we do to improve the database?

Patient care and epidemic control are top priorities in the middle of a crisis. Countries can understandably be delayed in sharing case-level data with WHO and other countries, even while more detailed and complete data are available and shared within each individual country - see examples in China, South Korea, and the US. However, in rapidly unfolding pandemic outbreaks, sharing data across countries is critical. Having open access, good quality case-level data will help us understand the global pandemic and develop the best strategies to combat it.

A streamlined and coordinated epidemic surveillance data system should be developed and used by countries. The database system should include a limited number of essential variables that are critical for developing rapid and specific interventions. For data collection, data entry tools should be developed to minimize human error, while ensuring that the units of data are useful for research and programmatic responses. Importantly, the system should improve processes for reporting and sharing case-level data with WHO as well as the public to optimize timeliness. Without high quality source data from individual countries, efforts to establish a global database have only limited value. After this immediate crisis is over, this should be a top priority for WHO, member countries, and the global health data community.

  • Last updated on: 2020-04-06
  • For typos, errors, and questions, contact me at www.isquared.global

Making Data Delicious, One Byte at a Time, in good times and bad times.