Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.
Objective
The objective of this visualisation is to help the viewer understand and compare the prevalence of Covid-19 in countries in Europe, the Middle East, and North Africa. When put in the context of the associated article, the objective shifts toward helping the viewer understand how a severe lack of medical staff, equipment and testing in certain countries (e.g. Hungary) can distort our perception of the Covid-19 pandemic.
The visualisation’s target audience is people living in Europe, the Middle East, and North Africa, or who know people living in these regions, and who are interested in understanding their risk of contracting Covid-19. When put in the context of the associated article, the target audience expands to include charities, foreign governments and people concerned about epidemiology who wish to understand Covid-19 case numbers and what drives them, especially in eastern Europe.
The visualisation chosen had the following three main issues:
Not all circles represent the same thing. Circles appear to represent countries but, on closer inspection, some represent countries while others represent cities. For example, the Russian cities of Moscow and St. Petersburg each have their own circle, but no circle represents Russia as a whole. Because no annotation highlights this inconsistency, the viewer could mistake a circle for a Russian city as representing Russia itself and conclude that Russia has relatively few cases compared with other countries. In fact, as of September 3rd, 2020, Russia had over one million cases of Covid-19, the most of any country in the visualisation. Changing what the circles represent without notifying the viewer is unethical because it misrepresents the data and understates the case numbers of some countries relative to others. This bias is also dangerous: viewers who live in the affected countries could be lulled into a false sense of security about their risk of contracting Covid-19.
The number of confirmed cases of Covid-19 is an inappropriate measure of the risk and spread of Covid-19 because it does not account for a country’s population and therefore cannot support valid comparisons. A country with a relatively large population (e.g. Turkey) is likely to have a large number of cases simply because it has more people, not necessarily because Covid-19 is spreading rapidly there. Likewise, Covid-19 may be spreading rapidly within a relatively small country (e.g. Qatar), yet its number of cases may appear small simply because it has fewer people. The viewer therefore cannot draw accurate comparisons about the spread or risk of Covid-19 from case counts alone; a relative measure, such as the number of cases per 1,000 people, is needed to make valid comparisons between countries.
The use of circles to represent numeric values makes it difficult to compare countries, particularly those with similar case numbers. For example, the viewer is likely to have trouble judging whether there are more cases of Covid-19 in Poland or Belarus because the circles representing them have similar areas. According to Cleveland and McGill, humans have difficulty judging area accurately (particularly circular area); position and length should therefore be used in preference to area to improve accuracy in data visualisations (Cleveland & McGill, 1985).
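The per-capita argument in the second issue can be illustrated with a short calculation. This is a minimal sketch using hypothetical figures (the country labels and numbers are made up for illustration and are not taken from the data): the country with fewer total cases turns out to have the far higher rate per 1,000 people.

```r
# Hypothetical figures chosen only to illustrate the calculation
cases      <- c(large_country = 200000, small_country = 100000)
population <- c(large_country = 80000000, small_country = 2500000)

# Cases per 1,000 people: divide by population, then scale by 1,000
cases_per_1k <- cases / population * 1000

print(cases_per_1k)
```

Ranking by raw counts would put the large country first, whereas ranking by cases per 1,000 people reverses the order.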
References
The Guardian (2020). Lack of testing raises fears of coronavirus surge in eastern Europe. Retrieved September 3, 2020, from The Guardian website: https://www.theguardian.com/world/2020/mar/29/lack-of-testing-raises-fears-of-coronavirus-surge-in-eastern-europe
Cleveland, W. S., & McGill, R. (1985). Graphical Perception and Graphical Methods for Analyzing Scientific Data. Science, 229(4716), 828-833.
The following code was used to fix the issues identified in the original.
# Load required packages
library(readr)
library(readxl)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
# Assign URL for original data source
git_url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/09-02-2020.csv"
#
#---------------------------------------------------------------------
# Import Covid-19 data from GitHub
#---------------------------------------------------------------------
covid <- read_csv(url(git_url))
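# Rename "West Bank and Gaza" and shorten the country column name
covid <- covid %>%
  mutate(Country_Region = str_replace(Country_Region, "West Bank and Gaza", "Palestine")) %>%
  rename(Country = Country_Region)
# Sum all rows for each country so that every country is represented by a single total
covid_country <- covid %>%
  group_by(Country) %>%
  summarise(Active_cases = sum(Active), Total_cases = sum(Confirmed))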
#
#----------------------------------------------------------------------------
# Import Region information from excel file sourced from the World Bank
#----------------------------------------------------------------------------
# Set the working directory to the location of the World Bank data (adjust this path for your own machine)
setwd("c:/Users/james/OneDrive/Desktop/Uni Studies/RMIT/Data Visualisation - MATH2270/Assignment 2")
# Read the country metadata (code, region, income group and name) from the file downloaded from the World Bank
regions <- read_excel("API_SP.POP.TOTL_DS2_en_excel_v2_1307373.xls", sheet = "Metadata - Countries") %>%
  select(c(1, 2, 3, 5))
# Replace the TableName "Russian Federation" with "Russia" so it will match to the Covid-19 data
regions$TableName <- regions$TableName %>% str_replace("Russian Federation", "Russia")
regions <- regions %>% rename("Income Group" = IncomeGroup)
#
#---------------------------------------------------------------------
# Join Region information to the Covid-19 data
#---------------------------------------------------------------------
covid_regions <- left_join(covid_country, regions, by = c("Country" = "TableName"))
#
#---------------------------------------------------------------------
# Filter data to just Europe, Middle East and North Africa
#---------------------------------------------------------------------
covid_regions_filtered <- covid_regions %>%
  filter(Region %in% c("Europe & Central Asia", "Middle East & North Africa"))
#
#-------------------------------------------------------------------------
# Import Population data from excel file sourced from the World Bank
#-------------------------------------------------------------------------
pop <- read_excel("API_SP.POP.TOTL_DS2_en_excel_v2_1307373.xls", sheet = "Data", skip = 3)
# Keep the country name and the 2019 population, selected by name rather than by position
pop_select <- pop %>% select(`Country Name`, Population = `2019`)
# Rename countries so they match the names used in the Covid-19 data
# fixed() matches the text literally, so the "." in "Rep." is not treated as a regex wildcard
pop_select$`Country Name` <- pop_select$`Country Name` %>%
  str_replace(fixed("Russian Federation"), "Russia") %>%
  str_replace(fixed("Slovak Republic"), "Slovakia") %>%
  str_replace(fixed("Iran, Islamic Rep."), "Iran")
#
#---------------------------------------------------------------------
# Join Population data to Covid-19 + Region data
#---------------------------------------------------------------------
covid_pop <- left_join(covid_regions_filtered, pop_select, by = c("Country" = "Country Name"))
# Add a column to display a cases per-capita measure
covid_per_cap <- covid_pop %>% mutate(cases_per_1k = (Total_cases / Population)*1000)
#
#---------------------------------------------------------------------
# Import Testing data from github - Our World in Data (OWID)
#---------------------------------------------------------------------
# Read every column as text using the file's own header row, then convert the testing column to numeric
owid_tests <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv", col_types = cols(.default = col_character()))
tests <- owid_tests %>%
  filter(date == "2020-09-03") %>%
  select(iso_code, new_tests_smoothed_per_thousand) %>%
  mutate(new_tests_smoothed_per_thousand = as.numeric(new_tests_smoothed_per_thousand))
#
#---------------------------------------------------------------------
# Join Testing data to Covid-19 + Region + Population data
#---------------------------------------------------------------------
# Join on the ISO country code, then reorder the columns so the two plotted measures come last
covid_tests <- left_join(covid_per_cap, tests, by = c("Country Code" = "iso_code")) %>%
  select(4, 1, 5:7, 2:3, 8:9)
#
#---------------------------------------------------------------------
# Order the data in preparation for visualisation
#---------------------------------------------------------------------
covid_tests$Country <- covid_tests$Country %>% factor(levels = covid_tests$Country[order(covid_tests$cases_per_1k)])
#
#---------------------------------------------------------------------
# Prepare the data for reconstructing the visualisation
#---------------------------------------------------------------------
##
#covid_active <- covid_tests %>% mutate(active_cases_per_1k = (Active_cases / Population)*1000)
covid_gathered <- covid_tests %>% gather("Measure", "Number", 8:9)
covid_gathered$Measure <- factor(covid_gathered$Measure,
                                 levels = c("cases_per_1k",
                                            "new_tests_smoothed_per_thousand"),
                                 labels = c("Cumulative total cases per 1,000 people",
                                            "Average tests per 1,000 people per day"))
covid_gathered$`Income Group` <- factor(covid_gathered$`Income Group`,
                                        levels = c("High income", "Upper middle income",
                                                   "Lower middle income", "Low income"),
                                        ordered = TRUE)
#
# Add a ranking within each measure (1 = highest value; missing values stay unranked)
# Ranking per group avoids relying on the row order produced by gather()
covid_gathered <- covid_gathered %>%
  group_by(Measure) %>%
  mutate(Ranks = rank(-Number, na.last = "keep", ties.method = "min")) %>%
  ungroup()
#
# create a colour palette for the Income Groups
colourpal <- c("#7fc97f", "#beaed4", "#fdc086", "#ffff99")
#
#---------------------------------------------------------------------
# Reconstruct the visualisation
#---------------------------------------------------------------------
p1 <- ggplot(data = covid_gathered, aes(x = Number, y = Country, fill = `Income Group`))
p2 <- p1 + geom_col() +
  # Value labels just beyond the end of each bar
  geom_text(aes(label = round(Number, 1)), hjust = -0.1, size = 2.5) +
  # Rank labels just left of the zero line
  geom_text(aes(label = Ranks, x = 0), vjust = 0.25, hjust = 1, size = 2.5) +
  scale_x_continuous(expand = c(.08, .08)) +
  scale_fill_manual(values = colourpal) +
  theme_grey() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        plot.subtitle = element_text(size = 9),
        plot.caption = element_text(size = 8),
        axis.title.y = element_blank(),
        axis.text.y = element_text(size = 8, colour = "black"),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 7),
        strip.text.x = element_text(size = 8, margin = margin(.1, 0, .1, 0, "cm")),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 7),
        legend.key.size = unit(0.5, "cm")) +
  labs(title = "Covid-19 cases and tests for countries in Europe, the Middle East and North Africa",
       subtitle = "Lower-income countries tend to have lower test rates or don't report on testing, which may explain their lower case rates relative to higher-income countries \nData as at 3rd September, 2020",
       caption = paste(
         "Sources: JHU CSSE COVID-19 Data - https://github.com/CSSEGISandData/COVID-19,",
         "The World Bank (2020) - https://data.worldbank.org/indicator/SP.POP.TOTL?view=map,",
         "Our World in Data (2020) - https://github.com/owid/covid-19-data/blob/master/public/data",
         sep = "\n")) +
  # One panel per measure, each with its own x-axis scale
  facet_grid(. ~ Measure, scales = "free")
# Display the reconstructed plot
p2
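The region and population joins above match countries by name, so a spelling difference between sources leaves a country without a region and it then drops out at the filtering step. One way to check for this is an anti-join, sketched below on small hypothetical stand-in tables (the case numbers are made up). It assumes, without confirmation from the source, that the World Bank metadata sheet uses the same long-form names as the population sheet; only the "Russia" rename is applied, mirroring the code above.

```r
library(dplyr)

# Hypothetical stand-ins for the Covid-19 and World Bank tables
covid_country <- tibble(
  Country = c("Russia", "Slovakia", "Iran"),
  Total_cases = c(100, 50, 75)  # made-up values for illustration only
)
regions <- tibble(
  TableName = c("Russian Federation", "Slovak Republic", "Iran, Islamic Rep."),
  Region = c("Europe & Central Asia", "Europe & Central Asia",
             "Middle East & North Africa")
)

# Apply only the "Russia" rename, as is done for the regions table above
regions$TableName <- sub("Russian Federation", "Russia", regions$TableName, fixed = TRUE)

# Rows of covid_country that find no partner in regions
unmatched <- anti_join(covid_country, regions, by = c("Country" = "TableName"))
print(unmatched$Country)
```

Any country listed in `unmatched` would need a rename similar to those applied to the population data before it could appear in the plot.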
Data Reference
The following plot fixes the main issues in the original.