Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.
Objective
Explain the objective of the original data visualisation and the targeted audience.
The original visualisation uses world data covering 296 countries, coloured by 6 continents, and plots new tests done per thousand against new confirmed cases per million over the year. This information is useful for people around the globe who need to know the daily status of COVID cases in the world. The data is additionally grouped by continent, which determines the colour of each series.
The visualization chosen had the following three main issues:
COMPLEX DATA VISUALISATION: (i) The dataset is a time series covering 296 countries across 6 continents. It is a large dataset, so displaying all the useful information clearly is a big challenge, and the original visualization fails to use the data efficiently. (ii) The visualization shows so much data at once that the message to its audience becomes ambiguous. (iii) Monthly rather than daily data would be sufficient to give a broad overview.
COLOR BLINDNESS: (i) The color coding used to split the data by continent fails to give a clear picture of the situation. (ii) The lines drawn for each country are hard to follow. If the color coding is based on continents, there was no need to add individual countries. (iii) The colors in the palette are very similar to their neighbors, which confuses the viewer. (iv) Because the data is recorded daily, colored regions in the original overlap with neighboring regions, producing shades that do not appear in the plot palette. (v) The data should therefore be separated by continent for each month, so that data points do not mix and remain clearly visible even when the data contains outliers.
POOR SCALING: (i) The scaling is improper and difficult for a general audience to understand, which makes the information vague. (ii) Although the data is recorded daily, the plot should be simplified by showing each year's data spread by month, for the ease of the viewer. (iii) Creating facets by month and comparing new tests done with new confirmed cases in 2020 and 2021 would give a clear picture of what is happening in the world over time.
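The monthly summary suggested in the points above can be sketched in base R. This is only an illustration on a hypothetical toy data frame (`daily`, `cases_per_million` and the values are assumptions, not the OWID data): each daily record gets a "YYYY-MM" label and the daily values are averaged within each continent and month.

```r
# Hypothetical toy data standing in for the daily records
daily <- data.frame(
  Continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Europe"),
  Date = as.Date(c("2020-03-01", "2020-03-15", "2020-04-02",
                   "2020-03-05", "2020-04-10", "2020-04-20")),
  cases_per_million = c(1, 3, 10, 2, 4, 8)
)

# Derive a month label ("YYYY-MM") from each date
daily$Month <- format(daily$Date, "%Y-%m")

# Average the daily values within each continent and month
monthly <- aggregate(cases_per_million ~ Continent + Month,
                     data = daily, FUN = mean)
monthly
```

Whether the mean, the median or a monthly total is the right summary depends on the message; the mean is used here only as an example.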
Reference
The following code was used to fix the issues identified in the original.
#install.packages("tidyverse")
#install.packages("ggplot2")
#install.packages("tidyr")
#install.packages("ggpubr")
library(ggplot2)
library(tidyverse)
library(tidyr)
library(ggpubr)
#Read dataset
cdata<-read.csv("D:/Rmit/Sem3/Data Visualisation/assignmnt2/covid-19-daily-tests-vs-daily-new-confirmed-cases-per-million.csv", header=TRUE)
head(cdata)
## Entity Code Day new_tests_per_thousand_7day_smoothed
## 1 Abkhazia OWID_ABK 2020-01-21 NA
## 2 Afghanistan AFG 2020-02-26 NA
## 3 Afghanistan AFG 2020-02-27 NA
## 4 Afghanistan AFG 2020-02-28 NA
## 5 Afghanistan AFG 2020-02-29 NA
## 6 Afghanistan AFG 2020-03-01 NA
## X142753.annotations
## 1
## 2
## 3
## 4
## 5
## 6
## Daily.new.confirmed.cases.of.COVID.19.per.million.people..rolling.7.day.average..right.aligned.
## 1 NA
## 2 0.00867
## 3 0.00650
## 4 0.00520
## 5 0.00433
## 6 0.00371
## Year Continent
## 1 2015 Asia
## 2 2015 Asia
## 3 2015 Asia
## 4 2015 Asia
## 5 2015 Asia
## 6 2015 Asia
class(cdata)
## [1] "data.frame"
##Preprocessing
#remove unwanted columns
cdTs= cdata[-c(5,7)]
#Structure
str(cdTs)
## 'data.frame': 83452 obs. of 6 variables:
## $ Entity : chr "Abkhazia" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Code : chr "OWID_ABK" "AFG" "AFG" "AFG" ...
## $ Day : chr "2020-01-21" "2020-02-26" "2020-02-27" "2020-02-28" ...
## $ new_tests_per_thousand_7day_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ Daily.new.confirmed.cases.of.COVID.19.per.million.people..rolling.7.day.average..right.aligned.: num NA 0.00867 0.0065 0.0052 0.00433 0.00371 0 0.00371 0.011 0.011 ...
## $ Continent : chr "Asia" "Asia" "Asia" "Asia" ...
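Dropping columns by position, as in `cdata[-c(5,7)]`, depends on the column order of the CSV file. A minimal sketch of dropping by name instead, assuming a hypothetical one-row data frame whose shortened column names (`tests`, `annotations`, `cases`) stand in for the long names in the real file:

```r
# Hypothetical one-row stand-in for cdata, with shortened column names
toy <- data.frame(
  Entity = "Abkhazia", Code = "OWID_ABK", Day = "2020-01-21",
  tests = NA, annotations = "", cases = NA,
  Year = 2015, Continent = "Asia"
)

# Name the columns to discard instead of relying on their positions
drop_cols <- c("annotations", "Year")
kept <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Rename by matching the old name rather than using a numeric index
names(kept)[names(kept) == "Entity"] <- "Country"
names(kept)
```

If the source file later gains or reorders columns, the name-based version keeps selecting the same variables.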
colnames(cdTs)[1] <- "Country"
colnames(cdTs)[3] <- "Date"
colnames(cdTs)[4] <- "Daily_New_Tests/thousand"
colnames(cdTs)[5] <- "Daily_New_confirmed cases/million"
cdTs$Country<- as.factor(cdTs$Country)
cdTs$Code<- as.factor(cdTs$Code)
cdTs$Continent<- as.factor(cdTs$Continent)
cdTs$Date<- as.Date(cdTs$Date)
#names
names(cdTs)
## [1] "Country" "Code"
## [3] "Date" "Daily_New_Tests/thousand"
## [5] "Daily_New_confirmed cases/million" "Continent"
#check for NA values
sum(is.na(cdTs))
## [1] 39771
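The figure above counts missing cells across the whole data frame, not rows. A small sketch on hypothetical toy data (the `toy` frame and its values are assumptions) shows how the count can be broken down by column, and how many rows have no missing value at all:

```r
# Hypothetical toy data with gaps, loosely mimicking the structure of cdTs
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  tests = c(NA, 6, 12, NA),
  cases = c(NA, 1.39, 2.00, 2.29)
)

# Total number of missing cells (what sum(is.na()) reports)
total_na <- sum(is.na(toy))

# Missing cells broken down by column
na_by_column <- colSums(is.na(toy))

# Number of rows without any missing value
complete_rows <- sum(complete.cases(toy))
```

A per-column breakdown like this makes it easier to see which variable is responsible for most of the gaps before rows are discarded.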
#remove NA values
new_data<- na.omit(cdTs)
head(new_data)
## Country Code Date Daily_New_Tests/thousand
## 878 Albania ALB 2020-03-11 6
## 879 Albania ALB 2020-03-12 12
## 880 Albania ALB 2020-03-13 20
## 881 Albania ALB 2020-03-14 22
## 882 Albania ALB 2020-03-15 24
## 883 Albania ALB 2020-03-16 24
## Daily_New_confirmed cases/million Continent
## 878 1.39000 Europe
## 879 1.99800 Europe
## 880 2.29340 Europe
## 881 2.20067 Europe
## 882 2.08486 Europe
## 883 2.43229 Europe
#Separate day month year
str(new_data)
## 'data.frame': 43775 obs. of 6 variables:
## $ Country : Factor w/ 296 levels "Åland Islands",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Code : Factor w/ 287 levels "","ABW","AFG",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ Date : Date, format: "2020-03-11" "2020-03-12" ...
## $ Daily_New_Tests/thousand : num 6 12 20 22 24 24 24 25 20 14 ...
## $ Daily_New_confirmed cases/million: num 1.39 2 2.29 2.2 2.08 ...
## $ Continent : Factor w/ 8 levels "","Africa","Antarctica",..: 5 5 5 5 5 5 5 5 5 5 ...
## - attr(*, "na.action")= 'omit' Named int [1:39677] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr [1:39677] "1" "2" "3" "4" ...
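The structure above shows an empty-string level (`""`) in `Code` and `Continent`, and `Country` still lists all 296 levels after rows were removed. One possible clean-up, sketched on hypothetical toy data (the `toy` frame is an assumption), is to recode empty strings as missing and then drop factor levels that no longer occur:

```r
# Hypothetical toy data: one empty-string continent and one missing test value
toy <- data.frame(
  Country = factor(c("A", "B", "C")),
  Continent = factor(c("Asia", "", "Europe")),
  tests = c(5, 7, NA)
)

# Recode the empty string as a proper missing value
toy$Continent[toy$Continent == ""] <- NA

# Remove incomplete rows, then discard factor levels that are no longer used
clean <- droplevels(na.omit(toy))

levels(clean$Country)    # only countries still present
levels(clean$Continent)  # no "" level
```

Unused levels otherwise show up as empty categories on a discrete axis or in a legend, so dropping them before plotting can be worthwhile.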
library(tidyr)
new_data1 <- separate(new_data, "Date", c("Year", "Month", "Day"), sep = "-")
#tail(new_data1)
#library(dplyr)
#2020
df2020 <- filter(new_data1, Year == "2020")
tail(df2020)
## Country Code Year Month Day Daily_New_Tests/thousand
## 31026 Zimbabwe ZWE 2020 12 26 106
## 31027 Zimbabwe ZWE 2020 12 27 95
## 31028 Zimbabwe ZWE 2020 12 28 93
## 31029 Zimbabwe ZWE 2020 12 29 102
## 31030 Zimbabwe ZWE 2020 12 30 111
## 31031 Zimbabwe ZWE 2020 12 31 109
## Daily_New_confirmed cases/million Continent
## 31026 7.80457 Africa
## 31027 7.22786 Africa
## 31028 6.97800 Africa
## 31029 7.50671 Africa
## 31030 9.31357 Africa
## 31031 10.39000 Africa
#2021
df2021 <- filter(new_data1, Year == "2021")
tail(df2021)
## Country Code Year Month Day Daily_New_Tests/thousand
## 12739 Zimbabwe ZWE 2021 04 19 138
## 12740 Zimbabwe ZWE 2021 04 20 143
## 12741 Zimbabwe ZWE 2021 04 21 139
## 12742 Zimbabwe ZWE 2021 04 22 131
## 12743 Zimbabwe ZWE 2021 04 23 125
## 12744 Zimbabwe ZWE 2021 04 24 118
## Daily_New_confirmed cases/million Continent
## 12739 5.30557 Africa
## 12740 5.23843 Africa
## 12741 5.87286 Africa
## 12742 5.72871 Africa
## 12743 4.91171 Africa
## 12744 3.50843 Africa
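Splitting the date with `separate()` replaces the `Date` column with three character columns. As an alternative sketch, assuming a hypothetical toy data frame, the year and month can be derived with `format()` while keeping the original `Date` intact:

```r
# Hypothetical toy data with a proper Date column
toy <- data.frame(
  Date = as.Date(c("2020-03-11", "2020-12-31", "2021-04-24")),
  tests = c(6, 109, 118)
)

# Derive year and month as character columns; Date itself is kept
toy$Year <- format(toy$Date, "%Y")
toy$Month <- format(toy$Date, "%m")

# Month names as an ordered factor, so facets appear in calendar order
toy$MonthName <- factor(month.abb[as.integer(toy$Month)], levels = month.abb)

# Subset one year using base R indexing
toy_2021 <- toy[toy$Year == "2021", ]
```

Keeping `Date` available allows later plots to use a continuous time axis if needed.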
##Plotting
#jitter
# Refer to columns by name inside aes() and facet_grid() rather than with df$column
p1 <- ggplot(df2020, aes(x = Continent, y = `Daily_New_confirmed cases/million`, color = Continent)) +
  geom_jitter(width = .5) +
  facet_grid(~Month)
case20 <- p1+labs(
title = "Monthly confirmed cases in 2020 (Continent vs cases)",
x = "Continents wise cases",
y = "New confirmed cases/million"
)
#case20
p2 <- ggplot(df2021, aes(x = Continent, y = `Daily_New_confirmed cases/million`, color = Continent)) +
  geom_jitter(width = .5) +
  facet_grid(~Month)
case21 <- p2+labs(
title = "Monthly confirmed cases in 2021 (Continent vs cases)",
x = "Continents wise cases",
y = "New confirmed cases/million"
)
#case21
p3 <- ggplot(df2020, aes(x = Continent, y = `Daily_New_Tests/thousand`, color = Continent)) +
  geom_jitter(width = .5) +
  facet_grid(~Month)
case22 <- p3+labs(
title = "Monthly new tests done in 2020 (Continent vs tests)",
x = "Continent wise tests",
y = "New tests done/thousand"
)
#case22
p4 <- ggplot(df2021, aes(x = Continent, y = `Daily_New_Tests/thousand`, color = Continent)) +
  geom_jitter(width = .5) +
  facet_grid(~Month)
case23 <- p4+labs(
title = "Monthly new tests done in 2021 (Continent vs tests)",
x = "Continent wise tests",
y = "New tests done/thousand"
)
#case23
overview <- ggarrange(case20,case21,case22,case23,labels = c("p1","p2","p3","p4"),ncol=2,nrow = 2)
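The plots above rely on the default ggplot2 colours. Since colour confusion was one of the issues identified, a colour-blind friendly palette could be applied. The sketch below is an assumption about how that might look, not part of the original reconstruction: it uses the Okabe-Ito palette available through `palette.colors()` in base R (4.0.0 or later) on hypothetical toy data, and a narrower jitter width so points stay within their own continent column.

```r
library(ggplot2)

# Okabe-Ito palette from base R (grDevices, R >= 4.0.0);
# drop the first entry (black) to keep six colours, one per continent
okabe_ito <- unname(palette.colors(n = 7, palette = "Okabe-Ito"))[-1]

# Hypothetical toy data: six continents, two months
toy <- data.frame(
  Continent = rep(c("Africa", "Asia", "Europe",
                    "North America", "Oceania", "South America"), each = 2),
  Month = rep(c("03", "04"), times = 6),
  cases = c(1, 2, 3, 5, 8, 13, 2, 4, 1, 1, 6, 9)
)

p <- ggplot(toy, aes(x = Continent, y = cases, colour = Continent)) +
  geom_jitter(width = 0.2) +                    # narrower spread than 0.5
  facet_grid(~Month) +
  scale_colour_manual(values = okabe_ito) +     # colour-blind friendly colours
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
```

With `width = .5` the jitter spans the full width of each category, so neighbouring continents touch; a smaller value leaves a visible gap between them.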
Data Reference
COVID-19: Daily tests vs. Daily new confirmed cases per million. Our World in Data. (2021). Retrieved 2 May 2021, from https://ourworldindata.org/grapher/covid-19-daily-tests-vs-daily-new-confirmed-cases-per-million?time=2020-02-17..2020-11-21.
The following plot fixes the main issues in the original.