Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: Dorling, D. & McClure K. (2020).


Objective

Since 2020, the COVID-19 virus has spread all over the world, resulting in the loss of countless lives. Hundreds of visualisations of COVID-19 death statistics have been made. These visualisations mainly emphasise cumulative deaths up to the present date. A cumulative statistic can only rise or remain steady. Using average deaths per day instead gives the viewer a better idea of the effect of COVID-19 safety practices; for example, wearing masks and social distancing reduce the number of deaths per day. Unlike total deaths, average deaths per day can rise or fall from one date to the next, allowing viewers to realise the impact of their decisions. The target audience of the original visualisation is the citizens and government officials of each country, who need to realise how actions such as quarantining can help reduce the spread of the virus. Government officials can also analyse which countries are most at risk when deciding whether to close their borders.
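The difference between the two statistics can be illustrated with a minimal base R sketch. The counts below are made up for illustration and are not taken from any dataset; `cumulative_deaths`, `daily_deaths` and `is_monotone` are hypothetical names.

```r
# Hypothetical cumulative death counts for one country over six days
# (illustrative values only, not real data)
cumulative_deaths <- c(0, 2, 5, 11, 14, 15)

# Deaths recorded on each day = change in the cumulative total
daily_deaths <- diff(cumulative_deaths)

# The cumulative series never decreases ...
is_monotone <- all(diff(cumulative_deaths) >= 0)

# ... while the daily series can rise and then fall again
print(daily_deaths)
```

In this toy series the cumulative total keeps climbing even after the daily count has peaked, which is the effect a cumulative chart hides.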

The visualisation chosen had the following three main issues:

  • Deceptive Methods - Due to the large number of overlapping lines and the bombardment of annotations, it is hard to estimate which countries were affected the most up to the end of February 2020. By convention, the time variable (i.e. the independent variable) should be on the x-axis, which is not the case here. Thus, the viewer cannot determine how the observed variable (i.e. average deaths per day) changed in response to time. Furthermore, the original visualisation underestimated the average deaths per day of COVID-19 victims in China and does not show the date when the virus began affecting the other countries.

  • Issues with data integrity – The original visualisation might have copyright issues since it does not cite the source of the data in the image. As a result, we cannot verify the authenticity and quality of the data behind the original visualisation. The data source needs to be checked thoroughly to establish whether the deaths were due solely to COVID-19 or also to other diseases. Because the source cannot be checked, the data used in the original visualisation may be compromised and should be verified by creating a new visualisation from independently sourced data.

  • Perceptive/Color Bias - The original data visualisation cannot be easily understood by a color-blind audience, since it uses red and green to distinguish between Germany and the UK. The visualisation contains highly saturated colors that can cause eye strain. Moreover, it uses color irresponsibly: the brightly colored lines do not highlight any important piece of information. The bright spiraling lines are very distracting, which makes it hard to identify the trend in the data and to perceive the starting point of each line.
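One way to address the red/green problem is to pick a palette that is flagged as color-blind friendly. The sketch below assumes the RColorBrewer package is installed and uses its `brewer.pal.info` table; the variable names (`qualitative`, `cb_safe`, `candidates`, `palette_hex`, `n_countries`) are illustrative only.

```r
library(RColorBrewer)

# brewer.pal.info lists every ColorBrewer palette with its category,
# maximum number of colours and a `colorblind` flag
qualitative <- brewer.pal.info[brewer.pal.info$category == "qual", ]
cb_safe <- rownames(qualitative)[qualitative$colorblind]

# Seven colours are needed, one per country shown in the chart
n_countries <- 7
candidates <- cb_safe[qualitative[cb_safe, "maxcolors"] >= n_countries]

# Set2 is the palette used in the Code tab; draw seven colours from it
palette_hex <- brewer.pal(n = n_countries, name = "Set2")
print(candidates)
```

Filtering on the `colorblind` flag only narrows the choice; it is still worth viewing the finished plot with a color-blindness simulator.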

Reference

Code

The following code was used to fix the issues identified in the original.

#Loading the Libraries
library(ggplot2)
library(readr)
library(magrittr)
library(dplyr)
library(tidyverse)
library(xts)
library(lubridate)
library(tidyr)
library(stringr)  
library(RColorBrewer)


#Defining the function to calculate daily deaths for each date from the cumulative
#COVID-19 deaths dataset (each date column minus the previous date column)
calculate_daily_deaths <- function(df) {
  country_region <- df[1]
  df_without_country <- df[-1]
  df_without_country_with_lead_one <- df_without_country[-1]
  df_daily_deaths<- df_without_country_with_lead_one - df_without_country[1:ncol(df_without_country_with_lead_one)]
  df_daily_deaths<- cbind(country_region,df_daily_deaths)
  return(df_daily_deaths)
}



#Loading the dataset 
#The dataset contains the total deaths due to COVID-19 for each day
# The dataset contains dates,country,region,latitude,longitude.
# The unit of observations is the total deaths
setwd("C:/Users/G3AR/Desktop/Data_Visualisation/assignment_02")
df <- read_csv("time_series_covid19_deaths_global.csv")




#Standardizing the variables names
colnames(df)[1] <- "province"
colnames(df)[2] <- "country_region"

#Dropping unnecessary columns and selecting only the countries mentioned in the original
#visualisation
#Not including England and Wales separately since they are part of the UK
df <- subset(df, select = -c(Lat, Long, province))
df <- df %>% filter(country_region %in% c("US", "France", "Spain", "Italy",
                                          "United Kingdom", "Germany", "China"))




#Finding the cumulative sum for each country, since the total deaths for each date are divided
#across regions
df <- df %>% group_by(country_region) %>% summarise(across(everything(), sum))


#Calculating the average deaths per day for a particular country
df_daily_deaths<-df %>% calculate_daily_deaths()


#Converting the data from wide format to long format
df_daily_deaths<-df_daily_deaths[-2]
df_daily_deaths<- gather(df_daily_deaths,DATE, DAILY_DEATHS,2:ncol(df_daily_deaths), factor_key=TRUE)
#df_change_in_daily_deaths<- gather(df_change_in_daily_deaths,DATE, CHANGE_IN_DAILY_DEATHS,2:ncol(df_change_in_daily_deaths), factor_key=TRUE)
df_final <- df_daily_deaths

#Processing the DATE column by splitting it on the "/" separator
df_final<- df_final %>% separate(DATE, into = c("Month", "Day",'Year'), sep = "/")


#Converting the day,month and year into numeric values
df_final$Day <- as.numeric(df_final$Day)
df_final$Month <- as.numeric(df_final$Month)
df_final$Year <- as.numeric(df_final$Year)

#Standardizing the year variable since it contains only the last 2 digits of the year
df_final$Year <- df_final$Year +2000

#Generating the date variable by using day,month and year columns
df_final$DATE <- as.Date(with(df_final, paste(Year,Month,Day,sep="-")), "%Y-%m-%d")


#Selecting the 3 columns that we need in our dataset: country/region, date and
#the average deaths per day
df_final <- subset(df_final, select = c(country_region,DATE,DAILY_DEATHS))
df_final <-df_final[order(df_final$country_region, df_final$DATE),]
df_final<- df_final %>% mutate(country_region = as.character(country_region)) 



#Selecting only those rows that have dates prior to April 4th, 2020,
#since the original data visualisation used data until April 3rd, 2020
df_final <- filter(df_final, DATE<=ymd("2020-04-03"))

#Standardizing the country/region names for shortened forms in the legends
df_final$country_region[df_final$country_region=="United Kingdom"] <- "UK"
df_final$country_region[df_final$country_region=="US"] <- "USA"


#Setting the color-blind friendly color palette
cbPallete <- brewer.pal(n = 7, name = "Set2")


#Using an area chart for the time series since the number of categories in the country/region variable is 7,
#which is quite large. Therefore, a line chart is not suitable since the lines overlap a lot. Using
#different line types and changing the colour did not help since the number of lines is very large.
#Using a border in the area chart helps identify the boundaries of the areas easily
p<-ggplot(data = df_final, aes( x = DATE,y = DAILY_DEATHS,fill = country_region))
p <- p + geom_area(color="black",size = .2, alpha = .5) + scale_fill_manual(values=cbPallete) + scale_colour_manual(values=cbPallete) 
p<- p+  labs(title="Daily COVID-19 Deaths by Country/Region, April 3rd, 2020",
subtitle = "Plot of Average Deaths per Day vs Time",
y ="Average Deaths per Day", x = "Time (in 2020)",fill="Country/Region",caption = 
"Source of Data: COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE)
at Johns Hopkins University. (2020). Time Series Summary of Deaths due to COVID-19 [Data file].
Retrieved from https://github.com/CSSEGISandData/COVID-19")


#Setting the theme of the plot. Making the grid lines grey and the background white, so the
# eyes focus more on the visualisation and are not distracted by the background. Formatting text in the
# visualisation to make it less distracting
p<- p + theme(
  # Remove panel border
  panel.border = element_blank(),  
  # Changing Grid Lines
  panel.grid.major.y = element_line(size = 0.5, linetype = 'dashed',
                                colour = "#f0f0f0"), 
  panel.grid.minor.y = element_line(size = 0.5, linetype = 'dashed',
                                colour = "#f0f0f0"), 
  panel.grid.major.x = element_blank(),
  panel.grid.minor.x = element_blank(),
  # Remove panel background
  panel.background = element_blank(),

  #changing the text to gray in the plots so that the data is highlighted more
  axis.line = element_line(colour = "#5c5c5c"),
  axis.title.x = element_text(color="#5c5c5c"),
  axis.title.y = element_text(color="#5c5c5c"),
  plot.subtitle=element_text(hjust=0,face="italic", color="#5c5c5c"),
  plot.caption=element_text(hjust=0,face="italic", color="#929292"),
  legend.title = element_text(color="#5c5c5c"),
  legend.text = element_text(color="#5c5c5c")
  )
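As a cross-check on the differencing and reshaping steps above, here is a minimal base R sketch on a toy table. The values are made up and the names `toy`, `cumulative`, `daily` and `toy_long` are hypothetical. It assumes the same wide layout as the input file (one row per country, one m/d/yy column per date) and parses the column headers with `as.Date()` directly instead of splitting them into day, month and year.

```r
# Toy cumulative-death table in the same wide layout as the input file
# (values are invented for illustration only)
toy <- data.frame(
  country_region = c("A", "B"),
  `1/22/20` = c(0, 1),
  `1/23/20` = c(2, 1),
  `1/24/20` = c(5, 4),
  check.names = FALSE
)

# Daily deaths = each cumulative column minus the previous one
cumulative <- as.matrix(toy[-1])
daily <- cumulative[, -1, drop = FALSE] -
  cumulative[, -ncol(cumulative), drop = FALSE]

# Reshape to long format; as.vector() reads the matrix column by column,
# so countries cycle fastest and dates are repeated once per country
toy_long <- data.frame(
  country_region = rep(toy$country_region, times = ncol(daily)),
  DATE = as.Date(rep(colnames(daily), each = nrow(daily)), format = "%m/%d/%y"),
  DAILY_DEATHS = as.vector(daily)
)
toy_long <- toy_long[order(toy_long$country_region, toy_long$DATE), ]
print(toy_long)
```

Running the full pipeline on a small table like this makes it easy to confirm by hand that each daily value equals the difference between two neighbouring cumulative values.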

Data Reference

Reconstruction

The following plot fixes the main issues in the original.