Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: Our World in Data.


Objective

Explain the objective of the original data visualisation and the targeted audience.

The original visualisation covers 296 countries, coloured by 6 continents, and plots new tests per thousand against new confirmed cases per million over the year. This information is useful for people around the globe who need to know the daily status of COVID-19 cases worldwide. Colour is used to group the countries by continent.

The visualization chosen had the following three main issues:

  • COMPLEX DATA VISUALISATION: (i) The dataset is a time series covering 296 countries across 6 continents, so displaying all the useful information clearly is a big challenge, and the original visualisation fails to use the data efficiently. (ii) The visualisation contains so much data that the message to its audience becomes ambiguous. (iii) A monthly representation, rather than daily data, is sufficient to give a broad picture of the data.

  • COLOR BLINDNESS: (i) The colour code used to split the data by continent fails to give a clear picture of the situation. (ii) The line drawn for each country is hard to follow; if the colour coding is based on continents, there was no need to add countries. (iii) Neighbouring colours in the palette are quite similar, which confuses the viewer. (iv) Because the data is recorded daily, coloured regions overlap with neighbouring regions in the original plot and produce shades that do not appear in the plot legend. (v) Therefore, the data should be separated by continent for each month, so that the data points do not mix and remain clearly visible even when the data contains outliers.

  • POOR SCALING: (i) The scaling is improper and difficult for a general audience to understand, which makes the information vague. (ii) Although the data is recorded daily, the plot should be simplified by showing each year's data spread by month, for the ease of the viewer. (iii) Creating facets by month and comparing new tests done with new confirmed cases in 2020 and 2021 would give a clear picture of what is happening in the world over time.
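
The suggestion that monthly data is sufficient can be illustrated with a small aggregation step. The sketch below is only an illustration of that idea and is not the approach taken in the Code tab, which keeps the daily points and facets them by month. The data frame `daily` and its column names and values are hypothetical toy data, not taken from the Our World in Data file.

```r
# Toy daily data; column names and values are hypothetical.
daily <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Europe"),
  date = as.Date(c("2020-03-01", "2020-03-02", "2020-04-01",
                   "2020-03-01", "2020-03-02", "2020-04-01")),
  cases_per_million = c(1, 3, 5, 2, 4, 10)
)

# Derive a year-month key from the date.
daily$month <- format(daily$date, "%Y-%m")

# Average the daily values within each continent and month.
monthly <- aggregate(cases_per_million ~ continent + month,
                     data = daily, FUN = mean)
print(monthly)
```

Using the mean is one possible choice; a median or a monthly maximum would summarise the same data differently, and which is appropriate depends on the message the plot is meant to carry.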

Reference

Code

The following code was used to fix the issues identified in the original.

#install.packages("tidyverse")
#install.packages("ggplot2")
#install.packages("tidyr")
#install.packages("ggpubr")
library(ggplot2)
library(tidyverse)
library(tidyr)
library(ggpubr)
#Read dataset
cdata<-read.csv("D:/Rmit/Sem3/Data Visualisation/assignmnt2/covid-19-daily-tests-vs-daily-new-confirmed-cases-per-million.csv", header=TRUE)
head(cdata)
##        Entity     Code        Day new_tests_per_thousand_7day_smoothed
## 1    Abkhazia OWID_ABK 2020-01-21                                   NA
## 2 Afghanistan      AFG 2020-02-26                                   NA
## 3 Afghanistan      AFG 2020-02-27                                   NA
## 4 Afghanistan      AFG 2020-02-28                                   NA
## 5 Afghanistan      AFG 2020-02-29                                   NA
## 6 Afghanistan      AFG 2020-03-01                                   NA
##   X142753.annotations
## 1                    
## 2                    
## 3                    
## 4                    
## 5                    
## 6                    
##   Daily.new.confirmed.cases.of.COVID.19.per.million.people..rolling.7.day.average..right.aligned.
## 1                                                                                              NA
## 2                                                                                         0.00867
## 3                                                                                         0.00650
## 4                                                                                         0.00520
## 5                                                                                         0.00433
## 6                                                                                         0.00371
##   Year Continent
## 1 2015      Asia
## 2 2015      Asia
## 3 2015      Asia
## 4 2015      Asia
## 5 2015      Asia
## 6 2015      Asia
class(cdata)
## [1] "data.frame"
##Preprocessing
#remove unwanted columns
cdTs= cdata[-c(5,7)]
#Structure
str(cdTs)
## 'data.frame':    83452 obs. of  6 variables:
##  $ Entity                                                                                         : chr  "Abkhazia" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Code                                                                                           : chr  "OWID_ABK" "AFG" "AFG" "AFG" ...
##  $ Day                                                                                            : chr  "2020-01-21" "2020-02-26" "2020-02-27" "2020-02-28" ...
##  $ new_tests_per_thousand_7day_smoothed                                                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Daily.new.confirmed.cases.of.COVID.19.per.million.people..rolling.7.day.average..right.aligned.: num  NA 0.00867 0.0065 0.0052 0.00433 0.00371 0 0.00371 0.011 0.011 ...
##  $ Continent                                                                                      : chr  "Asia" "Asia" "Asia" "Asia" ...
colnames(cdTs)[1] <- "Country"
colnames(cdTs)[3] <- "Date"
colnames(cdTs)[4] <- "Daily_New_Tests/thousand"
colnames(cdTs)[5] <- "Daily_New_confirmed cases/million"

cdTs$Country<- as.factor(cdTs$Country)
cdTs$Code<- as.factor(cdTs$Code)
cdTs$Continent<- as.factor(cdTs$Continent)
cdTs$Date<- as.Date(cdTs$Date)

#names
names(cdTs)
## [1] "Country"                           "Code"                             
## [3] "Date"                              "Daily_New_Tests/thousand"         
## [5] "Daily_New_confirmed cases/million" "Continent"
#check for NA values
sum(is.na(cdTs))
## [1] 39771
#remove NA values
new_data<- na.omit(cdTs)
head(new_data)
##     Country Code       Date Daily_New_Tests/thousand
## 878 Albania  ALB 2020-03-11                        6
## 879 Albania  ALB 2020-03-12                       12
## 880 Albania  ALB 2020-03-13                       20
## 881 Albania  ALB 2020-03-14                       22
## 882 Albania  ALB 2020-03-15                       24
## 883 Albania  ALB 2020-03-16                       24
##     Daily_New_confirmed cases/million Continent
## 878                           1.39000    Europe
## 879                           1.99800    Europe
## 880                           2.29340    Europe
## 881                           2.20067    Europe
## 882                           2.08486    Europe
## 883                           2.43229    Europe
#Separate day month year
str(new_data)
## 'data.frame':    43775 obs. of  6 variables:
##  $ Country                          : Factor w/ 296 levels "Åland Islands",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Code                             : Factor w/ 287 levels "","ABW","AFG",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Date                             : Date, format: "2020-03-11" "2020-03-12" ...
##  $ Daily_New_Tests/thousand         : num  6 12 20 22 24 24 24 25 20 14 ...
##  $ Daily_New_confirmed cases/million: num  1.39 2 2.29 2.2 2.08 ...
##  $ Continent                        : Factor w/ 8 levels "","Africa","Antarctica",..: 5 5 5 5 5 5 5 5 5 5 ...
##  - attr(*, "na.action")= 'omit' Named int [1:39677] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:39677] "1" "2" "3" "4" ...
library(tidyr)
new_data1 <- separate(new_data, "Date", c("Year", "Month", "Day"), sep = "-")
#tail(new_data1)
#library(dplyr)
#2020
df2020<-filter(new_data1, Year=="2020")
tail(df2020)
##        Country Code Year Month Day Daily_New_Tests/thousand
## 31026 Zimbabwe  ZWE 2020    12  26                      106
## 31027 Zimbabwe  ZWE 2020    12  27                       95
## 31028 Zimbabwe  ZWE 2020    12  28                       93
## 31029 Zimbabwe  ZWE 2020    12  29                      102
## 31030 Zimbabwe  ZWE 2020    12  30                      111
## 31031 Zimbabwe  ZWE 2020    12  31                      109
##       Daily_New_confirmed cases/million Continent
## 31026                           7.80457    Africa
## 31027                           7.22786    Africa
## 31028                           6.97800    Africa
## 31029                           7.50671    Africa
## 31030                           9.31357    Africa
## 31031                          10.39000    Africa
#2021
df2021<-filter(new_data1, Year=="2021")
tail(df2021)
##        Country Code Year Month Day Daily_New_Tests/thousand
## 12739 Zimbabwe  ZWE 2021    04  19                      138
## 12740 Zimbabwe  ZWE 2021    04  20                      143
## 12741 Zimbabwe  ZWE 2021    04  21                      139
## 12742 Zimbabwe  ZWE 2021    04  22                      131
## 12743 Zimbabwe  ZWE 2021    04  23                      125
## 12744 Zimbabwe  ZWE 2021    04  24                      118
##       Daily_New_confirmed cases/million Continent
## 12739                           5.30557    Africa
## 12740                           5.23843    Africa
## 12741                           5.87286    Africa
## 12742                           5.72871    Africa
## 12743                           4.91171    Africa
## 12744                           3.50843    Africa
##Plotting
#jitter
#refer to columns by name inside aes() and facet_grid() rather than with df$column
p1<-ggplot(df2020, aes(x = Continent, y = `Daily_New_confirmed cases/million`, color= Continent)) +geom_jitter(width = .5)+facet_grid(~Month)
case20 <- p1+labs(
    title = "Monthly confirmed cases in 2020 (Continent vs cases)",
    x = "Continents wise cases", 
    y = "New confirmed cases/million"
  )
#case20

p2<-ggplot(df2021, aes(x = Continent, y = `Daily_New_confirmed cases/million`, color= Continent)) +geom_jitter(width = .5)+facet_grid(~Month)
case21 <- p2+labs(
  title = "Monthly confirmed cases in 2021 (Continent vs cases)",
  x = "Continents wise cases", 
  y = "New confirmed cases/million"
)
#case21

p3<-ggplot(df2020, aes(x = Continent, y = `Daily_New_Tests/thousand`, color= Continent)) +geom_jitter(width = .5)+facet_grid(~Month)
case22 <- p3+labs(
  title = "Monthly new tests done in 2020 (Continent vs tests)",
  x = "Continents wise tests", 
  y = "New tests done/thousand"
)
#case22

p4<-ggplot(df2021, aes(x = Continent, y = `Daily_New_Tests/thousand`, color= Continent)) +geom_jitter(width = .5)+facet_grid(~Month)
case23 <- p4+labs(
  title = "Monthly new tests done in 2021 (Continent vs tests)",
  x = "Continents wise tests", 
  y = "New tests done/thousand"
)
#case23

overview <- ggarrange(case20,case21,case22,case23,labels = c("p1","p2","p3","p4"),ncol=2,nrow = 2)
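
The plots above colour the points by Continent using ggplot2's default palette. Since colour confusion was one of the issues raised for the original, one option would be to supply a colour-blind-safe palette explicitly. The sketch below is a minimal example under stated assumptions: the six continent labels are assumed to match the Continent values left after cleaning, the hex codes are six colours from the widely used Okabe-Ito palette, and the `toy` data frame is hypothetical. The ggplot2 part runs only if the package is installed.

```r
# Six continent labels; assumed to match the Continent values kept after cleaning.
continents <- c("Africa", "Asia", "Europe",
                "North America", "Oceania", "South America")

# Okabe-Ito hex codes: orange, sky blue, bluish green, yellow, blue, vermillion.
okabe_ito <- c("#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00")

# Named vector so that each continent always maps to the same colour.
continent_colours <- setNames(okabe_ito, continents)
print(continent_colours)

# Build a small plot with the manual colour scale; toy data, hypothetical values.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  toy <- data.frame(Continent = continents, cases = c(5, 12, 30, 22, 3, 18))
  p <- ggplot2::ggplot(toy, ggplot2::aes(x = Continent, y = cases, color = Continent)) +
    ggplot2::geom_point(size = 3) +
    ggplot2::scale_color_manual(values = continent_colours)
}
```

The same `scale_color_manual(values = continent_colours)` layer could be added to each of `p1` to `p4`, which would also keep the continent colours consistent across the four panels.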

Data Reference

COVID-19: Daily tests vs. Daily new confirmed cases per million. Our World in Data. (2021). Retrieved 2 May 2021, from https://ourworldindata.org/grapher/covid-19-daily-tests-vs-daily-new-confirmed-cases-per-million?time=2020-02-17..2020-11-21.

Reconstruction

The following plot fixes the main issues in the original.