From epidemic to pandemic

In December 2019, the coronavirus that causes COVID-19 was first identified in the Wuhan region of China. By March 11, 2020, the World Health Organization (WHO) had categorized the COVID-19 outbreak as a pandemic. A lot happened in the months in between, with major outbreaks in Iran, South Korea, and Italy.

We know that COVID-19 spreads through respiratory droplets, for example when an infected person coughs, sneezes, or speaks. But how quickly did the virus spread across the globe? And can we see any effect from country-wide policies, like shutdowns and quarantines?

Fortunately, organizations around the world have been collecting data so that governments can monitor and learn from this pandemic. Notably, the Johns Hopkins University Center for Systems Science and Engineering created a publicly available data repository to consolidate this data from sources like the WHO, the Centers for Disease Control and Prevention (CDC), and the Ministries of Health of multiple countries.

library(readr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.4
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Import the dataset: read confirmed_cases_worldwide.csv into confirmed_cases_worldwide.

Note: confirmed_cases_worldwide.csv comes from DataCamp.com.

confirmed_cases_worldwide <- read_csv("confirmed_cases_worldwide.csv")
## 
## -- Column specification --------------------------
## cols(
##   date = col_date(format = ""),
##   cum_cases = col_double()
## )
confirmed_cases_worldwide
## # A tibble: 56 x 2
##    date       cum_cases
##    <date>         <dbl>
##  1 2020-01-22       555
##  2 2020-01-23       653
##  3 2020-01-24       941
##  4 2020-01-25      1434
##  5 2020-01-26      2118
##  6 2020-01-27      2927
##  7 2020-01-28      5578
##  8 2020-01-29      6166
##  9 2020-01-30      8234
## 10 2020-01-31      9927
## # ... with 46 more rows

Confirmed cases throughout the world: draw a line plot from the dataset above.

ggplot(data=confirmed_cases_worldwide, aes(x=date, y=cum_cases)) +
  geom_line() +
  labs(y = "Cumulative confirmed cases")

The line plot shows the cumulative confirmed cases over time. From the start of the data, confirmed cases increase steadily until February 13, when the count jumps abruptly; after that, the total keeps growing faster and faster.
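Jumps like this are easier to spot in daily increases than in the cumulative curve. As a minimal sketch, the snippet below rebuilds the first ten rows of confirmed_cases_worldwide from the printout above and derives the day-over-day increase in base R. The names worldwide_head and new_cases are illustrative only and do not appear in the original analysis.

```r
# First ten rows of confirmed_cases_worldwide, copied from the printed tibble
worldwide_head <- data.frame(
  date = seq(as.Date("2020-01-22"), by = "day", length.out = 10),
  cum_cases = c(555, 653, 941, 1434, 2118, 2927, 5578, 6166, 8234, 9927)
)

# Daily increase = difference between consecutive cumulative totals.
# The first day has no previous value, so it is set to NA.
worldwide_head$new_cases <- c(NA, diff(worldwide_head$cum_cases))

print(worldwide_head)
```

Applying the same step to the full dataset should make the February 13 jump stand out as a single unusually large daily increase.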

# Read in confirmed_cases_china_vs_world.csv
confirmed_cases_china_vs_world <- read_csv("confirmed_cases_china_vs_world.csv")
## 
## -- Column specification --------------------------
## cols(
##   is_china = col_character(),
##   date = col_date(format = ""),
##   cases = col_double(),
##   cum_cases = col_double()
## )
# See the result
confirmed_cases_china_vs_world
## # A tibble: 112 x 4
##    is_china date       cases cum_cases
##    <chr>    <date>     <dbl>     <dbl>
##  1 China    2020-01-22   548       548
##  2 China    2020-01-23    95       643
##  3 China    2020-01-24   277       920
##  4 China    2020-01-25   486      1406
##  5 China    2020-01-26   669      2075
##  6 China    2020-01-27   802      2877
##  7 China    2020-01-28  2632      5509
##  8 China    2020-01-29   578      6087
##  9 China    2020-01-30  2054      8141
## 10 China    2020-01-31  1661      9802
## # ... with 102 more rows
# Explore the structure of dataset
glimpse(confirmed_cases_china_vs_world)
## Rows: 112
## Columns: 4
## $ is_china  <chr> "China", "China", "China", "China", "China", "China", "Ch...
## $ date      <date> 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020-01-...
## $ cases     <dbl> 548, 95, 277, 486, 669, 802, 2632, 578, 2054, 1661, 2089,...
## $ cum_cases <dbl> 548, 643, 920, 1406, 2075, 2877, 5509, 6087, 8141, 9802, ...
# Draw a line plot of cumulative cases vs. date, grouped and colored by is_china
# Define aesthetics within the line geom
plt_cum_confirmed_cases_china_vs_world <- ggplot(data = confirmed_cases_china_vs_world) +
  geom_line(aes(x=date, y=cum_cases, group=is_china, color=is_china)) +
  ylab("Cumulative confirmed cases")

# See the plot
plt_cum_confirmed_cases_china_vs_world

The two curves have clearly different shapes. In China, confirmed cases grow very quickly early on and then flatten out by March. Outside China, cases grow slowly at first, then jump around the end of February and grow extremely fast from that point on.

China compared to the rest of the world

who_events <- tribble(
  ~ date, ~ event,
  "2020-01-30", "Global health\nemergency declared",
  "2020-03-11", "Pandemic\ndeclared",
  "2020-02-13", "China reporting\nchange"
) %>%
  mutate(date = as.Date(date))

# Using who_events, add vertical dashed lines with an xintercept at date
# and text at date, labeled by event, and at 100000 on the y-axis
plt_cum_confirmed_cases_china_vs_world +
  geom_vline(data=who_events, aes(xintercept=date), linetype="dashed") +
  geom_text(data=who_events, aes(date, label=event), y = 1e5)

Adding a trend line to China

# Filter for China, from Feb 15
china_after_feb15 <- confirmed_cases_china_vs_world %>% 
    filter(is_china == "China", date >= '2020-02-15')

# Using china_after_feb15, draw a line plot cum_cases vs. date
# Add a smooth trend line using linear regression, no error bars
ggplot(data=china_after_feb15, aes(x=date, y=cum_cases)) +
  geom_line() +
  geom_smooth(method='lm', se=FALSE) +
  ylab("Cumulative confirmed cases")
## `geom_smooth()` using formula 'y ~ x'

Adding a trend line to the rest of the world

# Filter confirmed_cases_china_vs_world for not China
not_china <- confirmed_cases_china_vs_world %>%
    filter(is_china == 'Not China')

# Using not_china, draw a line plot cum_cases vs. date
# Add a smooth trend line using linear regression, no error bars
plt_not_china_trend_lin <- ggplot(data=not_china, aes(x=date, y=cum_cases)) +
  geom_line() +
  geom_smooth(method='lm', se=FALSE) +
  ylab("Cumulative confirmed cases")

# See the result
plt_not_china_trend_lin 
## `geom_smooth()` using formula 'y ~ x'

The plot for the rest of the world shows that the linear trend line fits the data poorly: the cumulative confirmed cases grow faster than a straight line. We therefore switch the y-axis to a logarithmic scale to see whether the fit improves.
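The poor fit can also be seen in the residuals of the straight line. The sketch below uses a synthetic series (100 cases on day 0, doubling every 4 days), not the real data, and all names in it are hypothetical. For a curve that bends upward, a least-squares line sits below the data at both ends and above it in the middle, which is the pattern visible in the plot.

```r
# Synthetic, illustrative data only: 100 cases on day 0, doubling every 4 days
days <- 0:29
cum_cases <- 100 * 2^(days / 4)

# Ordinary least-squares straight line on the raw (untransformed) scale
fit_lin <- lm(cum_cases ~ days)
res <- resid(fit_lin)

# Residual sign at the first, middle, and last day:
# positive, negative, positive indicates systematic curvature
sign_pattern <- sign(res[c(1, 15, 30)])
print(sign_pattern)

# Share of variance explained by the straight line, computed by hand
r_squared_lin <- 1 - sum(res^2) / sum((cum_cases - mean(cum_cases))^2)
print(r_squared_lin)
```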

plt_not_china_trend_lin + 
  scale_y_log10()
## `geom_smooth()` using formula 'y ~ x'

After adding a logarithmic scale, the trend line fits the data very well. A straight line on a logarithmic scale corresponds to exponential growth on the original scale. Unfortunately, from a public health point of view, that means that cases of COVID-19 in the rest of the world are growing at an exponential rate, which is terrible news.
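To make "exponential" concrete, the slope of a straight line fitted to log10(cum_cases) can be converted into a doubling time. As far as I understand ggplot2, scale transformations are applied before statistics are computed, so the trend line on the log-scaled plot is a linear regression of log10(cum_cases) on date. The sketch below repeats that idea on a synthetic series with a known doubling time of 4 days; the data and all variable names are assumptions for illustration, not values from the real dataset.

```r
# Synthetic, illustrative data only: 100 cases on day 0, doubling every 4 days
days <- 0:29
cum_cases <- 100 * 2^(days / 4)

# Linear regression on the log10 scale
fit_log <- lm(log10(cum_cases) ~ days)
slope <- unname(coef(fit_log)["days"])  # change in log10(cases) per day

# Cases double whenever log10(cases) rises by log10(2)
doubling_time <- log10(2) / slope

# Equivalent day-over-day growth rate
daily_growth <- 10^slope - 1

print(doubling_time)
print(daily_growth)
```

On real data the fit will not be exact, so the resulting doubling time is only an average over the fitted period.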

Not all countries are being affected by COVID-19 equally, and it would be helpful to know where in the world the problems are greatest. Let’s find the countries outside of China with the most confirmed cases in our dataset.

Which countries outside of China have been hit hardest?

# Run this to get the data for each country
confirmed_cases_by_country <- read_csv("confirmed_cases_by_country.csv")
## 
## -- Column specification --------------------------
## cols(
##   country = col_character(),
##   province = col_character(),
##   date = col_date(format = ""),
##   cases = col_double(),
##   cum_cases = col_double()
## )
glimpse(confirmed_cases_by_country)
## Rows: 13,272
## Columns: 5
## $ country   <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Antigua ...
## $ province  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ date      <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-...
## $ cases     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ cum_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
# Group by country, summarize to calculate total cases, find the top 7
top_countries_by_total_cases <- confirmed_cases_by_country %>%
  group_by(country) %>%
  summarize(total_cases = max(cum_cases)) %>%
  top_n(7, total_cases) %>%
  arrange(desc(total_cases))
## `summarise()` ungrouping output (override with `.groups` argument)
# See the result
top_countries_by_total_cases
## # A tibble: 7 x 2
##   country      total_cases
##   <chr>              <dbl>
## 1 Italy              31506
## 2 Iran               16169
## 3 Spain              11748
## 4 Germany             9257
## 5 Korea, South        8320
## 6 France              7699
## 7 US                  6421
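One thing worth checking in a summary like this: the data has a province column, and max(cum_cases) per country picks the largest single row. If a country were reported as several province rows per date (an assumption, not verified against this dataset), that would give the largest province rather than the national total. The base R sketch below, on made-up toy data with hypothetical names, shows the difference and one way to handle it by summing provinces per date first.

```r
# Toy data, invented for illustration: country A is split into two provinces
toy <- data.frame(
  country   = c("A", "A", "A", "A", "B", "B"),
  province  = c("North", "South", "North", "South", NA, NA),
  date      = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02",
                        "2020-03-01", "2020-03-02")),
  cum_cases = c(10, 5, 30, 20, 12, 40)
)

# Naive approach: maximum over all rows of a country (largest province for A)
naive <- aggregate(cum_cases ~ country, data = toy, FUN = max)

# Step 1: add up provinces so there is one row per country and date
per_day <- aggregate(cum_cases ~ country + date, data = toy, FUN = sum)

# Step 2: take the latest (largest) cumulative total per country, sorted descending
totals <- aggregate(cum_cases ~ country, data = per_day, FUN = max)
totals <- totals[order(-totals$cum_cases), ]

print(naive)
print(totals)
```

If every country has exactly one row per date, both approaches return the same numbers.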

Plotting the hardest-hit countries as of mid-March 2020

Even though the outbreak was first identified in China, there is only one country from East Asia (South Korea) in the above table. Four of the listed countries (France, Germany, Italy, and Spain) are in Europe and share borders. To get more context, we can plot these countries’ confirmed cases over time.

# Run this to get the data for the top 7 countries
confirmed_cases_top7_outside_china <- read_csv('confirmed_cases_top7_outside_china.csv')
## 
## -- Column specification --------------------------
## cols(
##   country = col_character(),
##   date = col_date(format = ""),
##   cum_cases = col_double()
## )
# Explore the structure of the dataset
glimpse(confirmed_cases_top7_outside_china)
## Rows: 2,030
## Columns: 3
## $ country   <chr> "Germany", "Iran", "Italy", "Korea, South", "Spain", "US"...
## $ date      <date> 2020-02-18, 2020-02-18, 2020-02-18, 2020-02-18, 2020-02-...
## $ cum_cases <dbl> 16, 0, 3, 31, 2, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, ...
# Using confirmed_cases_top7_outside_china, draw a line plot of
# cum_cases vs. date, grouped and colored by country
ggplot(data = confirmed_cases_top7_outside_china,
       aes(x = date, y = cum_cases, group = country, color = country)) +
  geom_line() +
  ylab("Cumulative confirmed cases")