Assignment 5 / Ritesh Sampat / COVID 19 - Global Spread and Proliferation

Data Accuracy

The data is updated daily but on certain days there were calculation errors which were corrected the next day hence there are certain days with negative values. I tried to remove such days to avoid inconsistent problems. The same problem was also encountered with numbers reported for certain country.

Multiple Datasets

Different datasets are used to get population and continent information. Since country names are not standard, we had to manually change certain country names for them to match.

Proposed DataViz (Sketch Version):

Sketch

Data Characteristics & Data Wrangling:

The data consists of daily cases reported in every country and also it reports the cases that are recovered on that day and the deaths reported. It also provides the coordinates for the location. The data has multiple directories and thus they are combined to make one csv file.

Loading the libraries

library(tidyverse)
library(lubridate)
library(plotly)
library(gganimate)
library(gifski)
library(zoo)
library(echarts4r)
options(scipen = 999)

Read the csv and show header information

data <- read.csv('complete_data.csv')
head(data)

##       State        Country     Last_Update Lat Lon Confirmed Deaths Recovered
## 1     Anhui Mainland China 1/22/2020 17:00  NA  NA         1     NA        NA
## 2   Beijing Mainland China 1/22/2020 17:00  NA  NA        14     NA        NA
## 3 Chongqing Mainland China 1/22/2020 17:00  NA  NA         6     NA        NA
## 4    Fujian Mainland China 1/22/2020 17:00  NA  NA         1     NA        NA
## 5     Gansu Mainland China 1/22/2020 17:00  NA  NA        NA     NA        NA
## 6 Guangdong Mainland China 1/22/2020 17:00  NA  NA        26     NA        NA

Data is stored cumulatively, let’s separate it on daily basis

daily_data <- data %>%
  mutate(Last_Update = parse_date_time(Last_Update, c('mdYHM', 'YmdHMS', 'mdyHM')), 
         date = as.Date(Last_Update)) %>%
  group_by(date, Country) %>%
  summarise(across(Confirmed:Recovered, sum, na.rm = TRUE)) %>%
  group_by(Country) %>%
  mutate(across(Confirmed:Recovered, list(today = ~c(NA, diff(.)))))

Visualization 1 - Number of active cases by continent:

continents <- read.csv('https://raw.githubusercontent.com/dbouquin/IS_608/master/NanosatDB_munging/Countries-Continents.csv')

p <- daily_data %>%
  mutate(Country = recode(Country,`Mainland China` = "China", 
                          `UK` = 'United Kingdom')) %>%
  left_join(continents %>%
              mutate(Country = recode(Country, `Korea, North` = 'North Korea', 
                                      `Korea, South` = 'South Korea', 
                                      `Russian Federation` = 'Russia')), by = 'Country') %>%
  filter(Confirmed_today > 0 & Confirmed_today < 100000, month(date) >= 4) %>%
  mutate(Active_cases = Confirmed - Recovered) %>%
  group_by(date, Continent) %>%
  summarise(Active_cases  = sum(Active_cases, na.rm = TRUE)) %>%
  na.omit %>%
  group_by(Continent) %>%
  filter(Active_cases >= lag(Active_cases)) %>% 
  ggplot() + aes(date, Active_cases, color = Continent) + geom_line() + 
  facet_wrap(.~Continent) + 
  theme(legend.position="none") +
  transition_reveal(date)
animate(p, renderer = gifski_renderer(loop = FALSE))

Visualization 2 - Growth: Top 10 Countries

p <- daily_data %>%
  filter(month(date) >= 4) %>%
  filter(Confirmed_today > 0 & Confirmed_today < 100000) %>%
  arrange(Country, date) %>%
  group_by(Country) %>%
  mutate(Confirmed_cum = cumsum(Confirmed_today)) %>%
  group_by(date) %>%
  top_n(10, Confirmed_cum) %>%
  mutate(Country = factor(Country)) %>%
  mutate(rank = dense_rank(-Confirmed_cum)) %>%
  ungroup %>%
  ggplot() + aes(rank, group = Country, 
                  fill = Country, color = Country) +
  geom_tile(aes(y = Confirmed_cum/2,
                height = Confirmed_cum,
                width = 0.9), alpha = 0.8, color = NA) + 
  geom_text(aes(y = 0, label = paste(Country, " ")), vjust = 0.2, hjust = 1) +
  coord_flip(clip = "off", expand = FALSE) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_reverse() +
  guides(color = FALSE, fill = FALSE) +
  labs(title='{closest_state}', x = "", y = "Total cases") +
  theme(plot.title = element_text(hjust = 0, size = 22),
        axis.ticks.y = element_blank(),  
        axis.text.y  = element_blank(),
        plot.margin = margin(1,1,1,4, "cm")) +
  transition_states(date, wrap = FALSE, transition_length = 4)

animate(p,fps = 25, duration = 20, renderer = gifski_renderer(loop = FALSE))

Visualization 3 - Country Wise Deaths / million on a weekly basis:

pop <- read.csv('population_by_country_2020.csv')[1:2]
names(pop) <- c('Country', 'population')

daily_data %>% 
  filter(Confirmed_today > 0 & Confirmed_today < 100000) %>%
  mutate(Country = recode(Country,`US`= "United States", `Mainland China` = "China")) %>%
  right_join(pop, by = 'Country') %>%
  mutate(pop_mil = population/1e6) %>%
  group_by(Country, week = week(date), pop_mil) %>%
  summarise(death = sum(Deaths_today, na.rm = TRUE)) %>%
  arrange(Country, week) %>%
  group_by(Country) %>%
  mutate(death_ratio = cumsum(death)/pop_mil) %>%
  na.omit %>%
  group_by(week = sprintf('%02d', week)) %>%
  e_chart(Country, timeline = TRUE) %>%
  e_map(death_ratio) %>%
  e_visual_map(min = 0, max = 600, type = 'piecewise') %>%
  e_title("Deaths per million on weekly basis", left = "center") %>%
  e_tooltip(
    trigger = "item",
    formatter = e_tooltip_choro_formatter())

Visualization 4 - Confirmed Cases vs Recovery graph between US vs Rest of the world:

p <- daily_data %>%
  filter(Confirmed_today > 0 & Confirmed_today < 100000, month(date) >= 4, Recovered_today < 100000) %>%
  mutate(Country = ifelse(Country != 'US', 'Rest of the world', Country)) %>%
  group_by(date, Country) %>%
  summarise(across(c(Confirmed_today, Recovered_today), sum)) %>%
  pivot_longer(cols = -c(date, Country)) %>%
  ggplot() + aes(date, value, color = Country, group = Country) + geom_line() + 
  facet_grid(.~name)  +
  transition_reveal(date)

animate(p, renderer = gifski_renderer(loop = FALSE))

Insights:

Continent wise confirmed cases growth

North America has been growing voluminously and has much faster growth rate than any other continent.
Cases in Asia have also been steadily increasing
Though case count is lower, a late surge is witnessed in Africa
Cases in Europe, after initial phase of growth, have plateaued and number of new confirmed cases have more or less stopped
Oceania which primarily consists of Australia and New Zealand hasn’t had an outbreak as compared to other continents

Top 10 countries over time

US, shortly after the China and European countries has taken lead and has been always ahead with a considerable margin.
Cases in India & Brazil rising up the rank, while China slips below after momentarily holding top spots

Density of cases

Although India is number 3 in number of cases we can see that deaths per million has always been low in India mostly because of huge population.
While Latin American countries like Peru and Chile have had the highest density of 560 and 481 cases per million

Daily Confirmed Cases

Cases of recovery in the US are happening at a much lower levels as compared to the daily count of confirmed cases as against a similar comparison of cases in the rest of the world

References:

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports

https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv

https://www.kaggle.com/tanuprabhu/population-by-country-2020