The project is to design a data visualisation to reveal the spread of COVID-19 pandemic in the world across various countries and continents. We will also be able to analyse which countries have been affected the most as compared to their population.
The data is updated daily but on certain days there were calculation errors which were corrected the next day hence there are certain days with negative values. I tried to remove such days to avoid inconsistent problems. The same problem was also encountered with numbers reported for certain country.
Different datasets are used to get population and continent information. Since country names are not standard, we had to manually change certain country names for them to match.
Sketch
The data consists of daily cases reported in every country and also it reports the cases that are recovered on that day and the deaths reported. It also provides the coordinates for the location. The data has multiple directories and thus they are combined to make one csv file.
Loading the libraries
library(tidyverse)
library(lubridate)
library(plotly)
library(gganimate)
library(gifski)
library(zoo)
library(echarts4r)
options(scipen = 999)
Read the csv and show header information
data <- read.csv('complete_data.csv')
head(data)
## State Country Last_Update Lat Lon Confirmed Deaths Recovered
## 1 Anhui Mainland China 1/22/2020 17:00 NA NA 1 NA NA
## 2 Beijing Mainland China 1/22/2020 17:00 NA NA 14 NA NA
## 3 Chongqing Mainland China 1/22/2020 17:00 NA NA 6 NA NA
## 4 Fujian Mainland China 1/22/2020 17:00 NA NA 1 NA NA
## 5 Gansu Mainland China 1/22/2020 17:00 NA NA NA NA NA
## 6 Guangdong Mainland China 1/22/2020 17:00 NA NA 26 NA NA
Data is stored cumulatively, let’s separate it on daily basis
daily_data <- data %>%
mutate(Last_Update = parse_date_time(Last_Update, c('mdYHM', 'YmdHMS', 'mdyHM')),
date = as.Date(Last_Update)) %>%
group_by(date, Country) %>%
summarise(across(Confirmed:Recovered, sum, na.rm = TRUE)) %>%
group_by(Country) %>%
mutate(across(Confirmed:Recovered, list(today = ~c(NA, diff(.)))))
continents <- read.csv('https://raw.githubusercontent.com/dbouquin/IS_608/master/NanosatDB_munging/Countries-Continents.csv')
p <- daily_data %>%
mutate(Country = recode(Country,`Mainland China` = "China",
`UK` = 'United Kingdom')) %>%
left_join(continents %>%
mutate(Country = recode(Country, `Korea, North` = 'North Korea',
`Korea, South` = 'South Korea',
`Russian Federation` = 'Russia')), by = 'Country') %>%
filter(Confirmed_today > 0 & Confirmed_today < 100000, month(date) >= 4) %>%
mutate(Active_cases = Confirmed - Recovered) %>%
group_by(date, Continent) %>%
summarise(Active_cases = sum(Active_cases, na.rm = TRUE)) %>%
na.omit %>%
group_by(Continent) %>%
filter(Active_cases >= lag(Active_cases)) %>%
ggplot() + aes(date, Active_cases, color = Continent) + geom_line() +
facet_wrap(.~Continent) +
theme(legend.position="none") +
transition_reveal(date)
animate(p, renderer = gifski_renderer(loop = FALSE))
p <- daily_data %>%
filter(month(date) >= 4) %>%
filter(Confirmed_today > 0 & Confirmed_today < 100000) %>%
arrange(Country, date) %>%
group_by(Country) %>%
mutate(Confirmed_cum = cumsum(Confirmed_today)) %>%
group_by(date) %>%
top_n(10, Confirmed_cum) %>%
mutate(Country = factor(Country)) %>%
mutate(rank = dense_rank(-Confirmed_cum)) %>%
ungroup %>%
ggplot() + aes(rank, group = Country,
fill = Country, color = Country) +
geom_tile(aes(y = Confirmed_cum/2,
height = Confirmed_cum,
width = 0.9), alpha = 0.8, color = NA) +
geom_text(aes(y = 0, label = paste(Country, " ")), vjust = 0.2, hjust = 1) +
coord_flip(clip = "off", expand = FALSE) +
scale_y_continuous(labels = scales::comma) +
scale_x_reverse() +
guides(color = FALSE, fill = FALSE) +
labs(title='{closest_state}', x = "", y = "Total cases") +
theme(plot.title = element_text(hjust = 0, size = 22),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
plot.margin = margin(1,1,1,4, "cm")) +
transition_states(date, wrap = FALSE, transition_length = 4)
animate(p,fps = 25, duration = 20, renderer = gifski_renderer(loop = FALSE))
pop <- read.csv('population_by_country_2020.csv')[1:2]
names(pop) <- c('Country', 'population')
daily_data %>%
filter(Confirmed_today > 0 & Confirmed_today < 100000) %>%
mutate(Country = recode(Country,`US`= "United States", `Mainland China` = "China")) %>%
right_join(pop, by = 'Country') %>%
mutate(pop_mil = population/1e6) %>%
group_by(Country, week = week(date), pop_mil) %>%
summarise(death = sum(Deaths_today, na.rm = TRUE)) %>%
arrange(Country, week) %>%
group_by(Country) %>%
mutate(death_ratio = cumsum(death)/pop_mil) %>%
na.omit %>%
group_by(week = sprintf('%02d', week)) %>%
e_chart(Country, timeline = TRUE) %>%
e_map(death_ratio) %>%
e_visual_map(min = 0, max = 600, type = 'piecewise') %>%
e_title("Deaths per million on weekly basis", left = "center") %>%
e_tooltip(
trigger = "item",
formatter = e_tooltip_choro_formatter())
p <- daily_data %>%
filter(Confirmed_today > 0 & Confirmed_today < 100000, month(date) >= 4, Recovered_today < 100000) %>%
mutate(Country = ifelse(Country != 'US', 'Rest of the world', Country)) %>%
group_by(date, Country) %>%
summarise(across(c(Confirmed_today, Recovered_today), sum)) %>%
pivot_longer(cols = -c(date, Country)) %>%
ggplot() + aes(date, value, color = Country, group = Country) + geom_line() +
facet_grid(.~name) +
transition_reveal(date)
animate(p, renderer = gifski_renderer(loop = FALSE))
Cases of recovery in the US are happening at a much lower levels as compared to the daily count of confirmed cases as against a similar comparison of cases in the rest of the world
https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv
https://www.kaggle.com/tanuprabhu/population-by-country-2020