1. Introduction:

The project is to design a data visualisation to reveal the spread of COVID-19 pandemic in the world across various countries and continents. We will also be able to analyse which countries have been affected the most as compared to their population.

  1. Challenges to the dataset and its solutions:

Data Accuracy

The data is updated daily but on certain days there were calculation errors which were corrected the next day hence there are certain days with negative values. I tried to remove such days to avoid inconsistent problems. The same problem was also encountered with numbers reported for certain country.

Multiple Datasets

Different datasets are used to get population and continent information. Since country names are not standard, we had to manually change certain country names for them to match.

  1. Proposed DataViz (Sketch Version):

Sketch

  1. Data Characteristics & Data Wrangling:

The data consists of daily cases reported in every country and also it reports the cases that are recovered on that day and the deaths reported. It also provides the coordinates for the location. The data has multiple directories and thus they are combined to make one csv file.

Loading the libraries

library(tidyverse)
library(lubridate)
library(plotly)
library(gganimate)
library(gifski)
library(zoo)
library(echarts4r)
options(scipen = 999)

Read the csv and show header information

data <- read.csv('complete_data.csv')
head(data)
##       State        Country     Last_Update Lat Lon Confirmed Deaths Recovered
## 1     Anhui Mainland China 1/22/2020 17:00  NA  NA         1     NA        NA
## 2   Beijing Mainland China 1/22/2020 17:00  NA  NA        14     NA        NA
## 3 Chongqing Mainland China 1/22/2020 17:00  NA  NA         6     NA        NA
## 4    Fujian Mainland China 1/22/2020 17:00  NA  NA         1     NA        NA
## 5     Gansu Mainland China 1/22/2020 17:00  NA  NA        NA     NA        NA
## 6 Guangdong Mainland China 1/22/2020 17:00  NA  NA        26     NA        NA

Data is stored cumulatively, let’s separate it on daily basis

daily_data <- data %>%
  mutate(Last_Update = parse_date_time(Last_Update, c('mdYHM', 'YmdHMS', 'mdyHM')), 
         date = as.Date(Last_Update)) %>%
  group_by(date, Country) %>%
  summarise(across(Confirmed:Recovered, sum, na.rm = TRUE)) %>%
  group_by(Country) %>%
  mutate(across(Confirmed:Recovered, list(today = ~c(NA, diff(.)))))

  1. Visualization 1 - Number of active cases by continent:

continents <- read.csv('https://raw.githubusercontent.com/dbouquin/IS_608/master/NanosatDB_munging/Countries-Continents.csv')

p <- daily_data %>%
  mutate(Country = recode(Country,`Mainland China` = "China", 
                          `UK` = 'United Kingdom')) %>%
  left_join(continents %>%
              mutate(Country = recode(Country, `Korea, North` = 'North Korea', 
                                      `Korea, South` = 'South Korea', 
                                      `Russian Federation` = 'Russia')), by = 'Country') %>%
  filter(Confirmed_today > 0 & Confirmed_today < 100000, month(date) >= 4) %>%
  mutate(Active_cases = Confirmed - Recovered) %>%
  group_by(date, Continent) %>%
  summarise(Active_cases  = sum(Active_cases, na.rm = TRUE)) %>%
  na.omit %>%
  group_by(Continent) %>%
  filter(Active_cases >= lag(Active_cases)) %>% 
  ggplot() + aes(date, Active_cases, color = Continent) + geom_line() + 
  facet_wrap(.~Continent) + 
  theme(legend.position="none") +
  transition_reveal(date)
animate(p, renderer = gifski_renderer(loop = FALSE))

  1. Visualization 2 - Growth: Top 10 Countries

p <- daily_data %>%
  filter(month(date) >= 4) %>%
  filter(Confirmed_today > 0 & Confirmed_today < 100000) %>%
  arrange(Country, date) %>%
  group_by(Country) %>%
  mutate(Confirmed_cum = cumsum(Confirmed_today)) %>%
  group_by(date) %>%
  top_n(10, Confirmed_cum) %>%
  mutate(Country = factor(Country)) %>%
  mutate(rank = dense_rank(-Confirmed_cum)) %>%
  ungroup %>%
  ggplot() + aes(rank, group = Country, 
                  fill = Country, color = Country) +
  geom_tile(aes(y = Confirmed_cum/2,
                height = Confirmed_cum,
                width = 0.9), alpha = 0.8, color = NA) + 
  geom_text(aes(y = 0, label = paste(Country, " ")), vjust = 0.2, hjust = 1) +
  coord_flip(clip = "off", expand = FALSE) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_reverse() +
  guides(color = FALSE, fill = FALSE) +
  labs(title='{closest_state}', x = "", y = "Total cases") +
  theme(plot.title = element_text(hjust = 0, size = 22),
        axis.ticks.y = element_blank(),  
        axis.text.y  = element_blank(),
        plot.margin = margin(1,1,1,4, "cm")) +
  transition_states(date, wrap = FALSE, transition_length = 4)

animate(p,fps = 25, duration = 20, renderer = gifski_renderer(loop = FALSE))

  1. Visualization 3 - Country Wise Deaths / million on a weekly basis:

pop <- read.csv('population_by_country_2020.csv')[1:2]
names(pop) <- c('Country', 'population')

daily_data %>% 
  filter(Confirmed_today > 0 & Confirmed_today < 100000) %>%
  mutate(Country = recode(Country,`US`= "United States", `Mainland China` = "China")) %>%
  right_join(pop, by = 'Country') %>%
  mutate(pop_mil = population/1e6) %>%
  group_by(Country, week = week(date), pop_mil) %>%
  summarise(death = sum(Deaths_today, na.rm = TRUE)) %>%
  arrange(Country, week) %>%
  group_by(Country) %>%
  mutate(death_ratio = cumsum(death)/pop_mil) %>%
  na.omit %>%
  group_by(week = sprintf('%02d', week)) %>%
  e_chart(Country, timeline = TRUE) %>%
  e_map(death_ratio) %>%
  e_visual_map(min = 0, max = 600, type = 'piecewise') %>%
  e_title("Deaths per million on weekly basis", left = "center") %>%
  e_tooltip(
    trigger = "item",
    formatter = e_tooltip_choro_formatter())

  1. Visualization 4 - Confirmed Cases vs Recovery graph between US vs Rest of the world:

p <- daily_data %>%
  filter(Confirmed_today > 0 & Confirmed_today < 100000, month(date) >= 4, Recovered_today < 100000) %>%
  mutate(Country = ifelse(Country != 'US', 'Rest of the world', Country)) %>%
  group_by(date, Country) %>%
  summarise(across(c(Confirmed_today, Recovered_today), sum)) %>%
  pivot_longer(cols = -c(date, Country)) %>%
  ggplot() + aes(date, value, color = Country, group = Country) + geom_line() + 
  facet_grid(.~name)  +
  transition_reveal(date)

animate(p, renderer = gifski_renderer(loop = FALSE))

  1. Insights:

Continent wise confirmed cases growth

Top 10 countries over time

Density of cases

Daily Confirmed Cases

Cases of recovery in the US are happening at a much lower levels as compared to the daily count of confirmed cases as against a similar comparison of cases in the rest of the world

  1. References:

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports

https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv

https://www.kaggle.com/tanuprabhu/population-by-country-2020