Introduction

The other day my Mom asked me “Tim, how quickly is covid actually spreading”, and I realized this is actually a difficult question to answer. People are used to thinking about geometric growth, the staight lines of \(y = mx + b\). But the behavior of a virus in the early stage of an epidemic is close to exponential growth, \(y = e^{ax}\). This demonstration uses some simple graphing and fitting to teach readers about the growth of COVID-19 in the US over the past two months. Many other commentators have done a better job of this than I have, and I recommend this great introduction by Our World in Data.

I find that from march 8th to April 5th the growth in deaths from COVID-19 in the US closely resembled an exponential function \(n \propto e^{.24t}\), and had a doubling time of 2.9 days. I then argue that this exponential growth is already rapidly slowing and unlikely to continue through April as social distancing pratices ramp up, by applying the same fit to China’s record.

This dataset for this project can be downloaded from Our World in Data. OWD has cleaned data provided daily by the European CDC on identified cases and deaths from COVID-19 by country.

#loads the suite of packages I will use in the demo
library(tidyverse)

#downloads epidemiology data from the website "Our World in Data"
url = "https://covid.ourworldindata.org/data/ecdc/full_data.csv"
df <- read_csv(url)

Comparing infection growth between the US and other countries

Is COVID spreading unusually fast in the US? This first plot displays the number of cases by day in the US as well as several other countries. One challenge to comparing different countries situations is that they became infected at different times. China has had cases for several months now, Turkey for less than a single month. To better compare between countries, I find the date when each country first had five confirmed cases, then shift the x-axis for each plot to line them up around this day.

# First, we need to mark the first day with five confirmed cases on the tibble. Because the observations of the tibble are country-days, we add a new boolean variable is_day_5. The code uses a mutate to ifelse statement to return true for all days after total_cases. For china, the dataset does not go back far enough to include the actual day_of_5, so I select by hand the first day of the dataset.

df <- df %>%
  group_by(location) %>%
  mutate(is_day_5 = ifelse
         ((lag(total_cases)<5  & total_cases >=5) | (location=='China' & date==df$date[2441]), TRUE, FALSE)) %>%
  ungroup()

# Next we need a command that calculates the time difference between each observations day and the day_of_5 for that country. Unfortunately, it is difficult in tidyverse to specify alternate rows within a mutate command, meaning to tell tidyverse "subtract from this number the value found in the day_of_5 for this specific country". To get around this problem, I make a vector which is the R-equivalent of a Python dictionary that stores the day_of_5 for each country. Then I can use that dict to "look up" the day_of_5 for each country in the mutate command without having to specify some other row in the same tibble.

#this command makes a tibble with just the days of 5.
df_true <- df %>% filter(is_day_5 == TRUE)

#This command makes a list, assigns the dates from df_true, then names those dates by their countries.
X <- vector(mode="list", length=180)
X <- c(df_true$date)
names(X) <- c(df_true$location)

#Finally this command calculates the days_since_5 for each observation using the dictionary that I just created.
df <- df %>%
  group_by(location) %>%
  mutate(days_since_5 = date - X[location]) %>%
  ungroup()

#Later, I will want to color the countries by their continent. Therefore, I join the tibble with a country-to-continent csv found online. Thanks Udacity!
url2 = "https://github.com/pdelboca/udacity-data-analysis-with-r/raw/master/data/Countries-Continents.csv"

country_cont <- read_csv(url2)

df <- left_join(df, country_cont, by = c("location" = "Country"))

# For some reason south korea gets left out of this (maybe a Republic of Korea vs. South Korea naming problem). I simply add the continent label manually for Sotuh Korea
df[df$location=='South Korea', ]$Continent <- "Asia"

# Labeling lines on a plot can be tricky because the labels may cover the data. This code makes a dictionary called "maxes" that I will use to place my labels at furthest day since 5. It uses the same dictionary-making concept that I used earlier.
df_maxes <- df %>%
  group_by(location) %>%
  mutate(maxes=max(days_since_5)) %>%
  select(location,maxes)

df_maxes <- df_maxes[!duplicated(df_maxes), ]

maxes <- vector(mode="list", length=205)

maxes <- c(df_maxes$maxes)
names(maxes) <- c(df_maxes$location)

head(df)
## # A tibble: 6 x 9
##   date       location new_cases new_deaths total_cases total_deaths is_day_5
##   <date>     <chr>        <dbl>      <dbl>       <dbl>        <dbl> <lgl>   
## 1 2019-12-31 Afghani~         0          0           0            0 FALSE   
## 2 2020-01-01 Afghani~         0          0           0            0 FALSE   
## 3 2020-01-02 Afghani~         0          0           0            0 FALSE   
## 4 2020-01-03 Afghani~         0          0           0            0 FALSE   
## 5 2020-01-04 Afghani~         0          0           0            0 FALSE   
## 6 2020-01-05 Afghani~         0          0           0            0 FALSE   
## # ... with 2 more variables: days_since_5 <drtn>, Continent <chr>

Geometric plot of total confirmed cases

Here I use the redating variable days_since_5 to make a geometric plot of total_cases.

library(ggthemes)
library(scales)
library(lattice)
library(ggrepel)

#Makes a list of countries to include in the plot
countries = c("Norway","Spain","Germany", "United States", "China", "Turkey", "Lebanon", "Oman", "Saudi Arabic", "India", "Italy", "South Korea", "Hungary", "Romania", "Phillipines", "Canada", "Brazil", "Chile", "Australia", "Japan" )

p1 <- ggplot(data=filter(df, location %in% countries), mapping=aes(x=days_since_5, y=total_cases, color = Continent, group=location)) +
  geom_path() + 
  xlim(-1,100) + 
  ylim(8,600000)

p1 +theme_solarized() + 
  labs(title = "Geometric Plot of Total Cases In Selected Countries", subtitle = "Country labels in following plot") + 
  ylab('total cases') + 
  xlab('days since 5 confirmed cases')#+ geom_label_repel(data = subset(df, days_since_5 == maxes[location] & location %in% countries), aes(label=location, color=Continent))

### Why is a logarithmic plot more readable?

This plot linear plot gives the impression that the US has had a very unusual covid trajectory. But it is actually quite misleading due to the exponential growth of viral infections. When a virus starts spreading in a population, the number of newly infected persons is proportional to the number of currently infected persons.

Let us build a simple model of how covid spreads. Suppose that each infected person interacts with \(E\) people per day and they have a \(p\) probability to infect each person. For simplicity, assume no one recovers or dies for a more sophisticated model, click here. The number of newly infected people per day is therefore

\(\Delta N_{d} = N_{d-1} * E * p\)

Where \(\Delta N_{d}\) is the daily change in \(N_d\). So the next days infections are equal to the previous days infections multiplied the value \(E * p\). To find the function of \(N_d\) we need an equation whose rate of change is itself (multiplied by some constant). The only function which is its own derivative is \(e^x\). This is a rudimentary problem of differential equations, with the solution

\(N_t = N_{i} * e^{e * p * t}\).

The slope of an exponential function is constantly changing, making it difficult to read. The infection rates of countries appear to stay low (when \(N\) is small, then take off very suddenly as \(N\) becomes large. Furthermore, countries earlier in the infection trajectory become flattened to the bottom of the plot, so the early growth of the virus is hidden. In particular, the reader cannot tell if \(E * p\) is increasing, decreasing or staying constant because it is always curving. We resolve this problem by plotting the log of total cases on the y-axis.

This works because the log of an exponential function is a linear function. In mathematical terms, we would say

\(ln(N_{t}) = t * E * p * ln(N_{i})\)

As you can see, the exponent terms have popped out. \(N_{i}\) is a constant, which depends on the start time. In practical terms, this means that time periods where the reproduction rate stays constant, the trend follows a straight line. We need only take the slope of that line to find \(E * P\).

Geometric plot of total confirmed cases

library(ggthemes)
library(scales)
library(lattice)
library(ggrepel)

countries = c("Norway","Spain","Germany", "United States", "China", "Turkey", "Lebanon", "Oman", "Saudi Arabic", "India", "Italy", "South Korea", "Hungary", "Romania", "Phillipines", "Canada", "Brazil", "Chile", "Australia", "Japan" )

p1 <- ggplot(data=filter(df, location %in% countries), mapping=aes(x=days_since_5, y=total_cases, color = Continent, group=location)) + 
  geom_path() + xlim(-1,100) + 
  ylim(8,log(max(df$total_cases))) + 
  scale_y_log10(labels = trans_format("log10", math_format(10^.x))) + 
  labs(title = "Log Plot of Total Cases In Selected Countries") + 
  ylab('total cases') + 
  xlab('days since 5 confirmed cases')
p1 +theme_solarized() + 
  geom_label_repel(data = subset(df, days_since_5 == maxes[location] & location %in% countries), aes(label=location, color=Continent))

Exponential fitting to find the growth rate for the early days of the infection in the US.

To focus on the US, let us drop all other countries from the plot. The next plot shows the log plot and linear plot side-by-side. I am also now switching from total cases to total deaths. The reason for this is that the deaths caused by COVID-19 are almost always recorded, while confirmed cases can easily be missed if there is a shortage of testing materials or poor targeting. This does introduce a lag to the measurement, as people who die usually contract the virus one or two weeks earlier.

library(gridExtra)

p3_log <- ggplot(data=filter(df, location == "United States"), mapping=aes(x=days_since_5, y=total_deaths, group=location)) + 
  geom_path() + 
  xlim(30,75) + 
  ylim(8,log(max(df$total_cases))) + 
  scale_y_log10(labels = trans_format("log10", math_format(10^.x))) + 
  theme_solarized() + 
  geom_label_repel(data = subset(df, days_since_5 == maxes[location] & location == "United States"), aes(label=location)) + 
  labs(title= 'Logarithmic Plot of Total Deaths, US Only') + 
  ylab('total deaths') + 
  xlab('days since 5 confirmed cases')

p3_geo <- ggplot(data=filter(df, location == "United States"), mapping=aes(x=days_since_5, y=total_deaths, group=location)) + 
  geom_path() + 
  xlim(30,75) + 
  ylim(8,15000) + 
  theme_solarized() + 
  geom_label_repel(data = subset(df, days_since_5 == maxes[location] & location == "United States"), aes(label=location)) + 
  labs(title= 'Geometric Plot of Total Deaths, US Only')  + 
  ylab('total deaths') + 
  xlab('days since 5 confirmed cases')


grid.arrange(p3_geo,p3_log)

As you can see, the log plot is very nearly a straight line, so \(E*p\) was almost constant until around day 70.

Fitting the exponential model

To find \(E*p\), I apply a linear fit to the log of total deaths. Because exponential functions are straight lines on log plots, fitting a line to the log is a simple way to do an exponential fit.

library(precrec)

#Filters the tibble to contain only data from the US from a certain period. I selected days 40 to 70 because they are vizually the period that best resembled an exponential function, and therefore would demonstrate the point.
df4 <- df %>%
  filter(location == "United States" & days_since_5 > 40 & days_since_5 < 70)

#Makes the exponentional model by fitting a linear model to the log of total_deaths
lm_df4 <- lm(log(df4$total_deaths) ~ df4$days_since_5)

#produces a summary describing the fit
summary(lm_df4)
## 
## Call:
## lm(formula = log(df4$total_deaths) ~ df4$days_since_5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37957 -0.11625  0.04072  0.10915  0.29061 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -7.186720   0.203755  -35.27   <2e-16 ***
## df4$days_since_5  0.237300   0.003662   64.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.165 on 27 degrees of freedom
## Multiple R-squared:  0.9936, Adjusted R-squared:  0.9934 
## F-statistic:  4198 on 1 and 27 DF,  p-value: < 2.2e-16

The output of this fit (from the column labeled coefficiens: estimate) is an equation.

Because I chose the date arbitrarily, as the day with five confirmed cases, we can ignore the intercept term (7.19). Recall that the regression was applied on a log plot, so y value refers to \(log(N)\)

\(log(N) = .24t\).

Raise the natural log to both sides to find

\(N \propto e^{.24t}\)

, where the \(\propto\) symbol tells us that the equation is a proportionality (because I do not set an exact start date/initial condition). The \(E * p\) term is therefore about .24. This corresponds to a doubling time of 2.9 days. In simple english, this means that the average infected person interacted with some number E many people with some infection chance p such that the product of the two was .24. Note that because no one exits infection in my model, this strongly underestimates the \(E*p\) value. In the next graph, I plot this model against the observation. You can look for yourself and see how well the model describes the data.

Plotting the exponential model

# To display the prediction on the plot in ggplot, the easiest way is to use the model to calculate values and predict them back into the tibble.
predicted_df <- data.frame(total_deaths = exp(predict(lm_df4, df4)), days_since_5 = df4$days_since_5)

p4_geo <- p3_geo + geom_line(color='blue', linetype= 'dashed', size = 2, data = predicted_df, aes(x=days_since_5, y=total_deaths, group = NA)) + xlim(c(40,80)) + theme(legend.position = 'none') + labs(title = "Geometric Plot of Total Deaths with Exponential Model") + xlab('total deaths') + ylab('days since 5 confirmed cases')

p4_log <- p3_log +  geom_line(color='blue', linetype= 'dashed', size = 2, data = predicted_df, aes(x=days_since_5, y=total_deaths, group = NA)) + xlim(c(40,80)) + theme(legend.position = 'none')

# Lays the log and geo plot side by side
grid.arrange(p4_log,p4_geo)

Discussion / why you shouldn’t believe every model you see.

This is the exponential growth rate of coronavirus from the 8th of march to the fifth of June. That growth rate corresponds to a doubling size of roughly every three days, which is pretty fast. Extending this model in time is scary. According ot the equation, all of america would have died in ln(300,000,000/20,000,000) doubling times, or 30 days! Fortunately, this simple exponential model is only descriptive for the first period of infection, when infected/immune population is small relative to the total population and people have not adapted to the virus. Practices like social distancing will drastically change both \(E\) and \(p\). Also, the death rate among infected persons is below or near 1%. If you look closely at the above plot, you can already see the growth rate declining toward day 70. This suggests that people are getting a handle on the crisis (and remember that deaths are a lagging indicator, since those people became sick much earlier).

A much better model of the future of coronavirus has been made by the Institute for Health Monitoring and Evaluation, and can be accessed here here from the Institute for Health Metrics and Evaluation. Their much more believable projections find that toal deaths will level off by mid may, most likely at 60,0000 deaths. The outer bounds are 150,000 and 20,000. The reason their model performs better than mine is they fit a more sophisticated function which accounts for social hygiene practices and immunities. So let that be a lesson to you, that not all models are created equal and you should not believe every graph on the internet!

Appendix - why this model does not predict the future in graph form

An easy way to demonstrate why this simple exponential model does not predict the infection rates in the US for the next few months is to reproduce the fit on the data from China. Because China is at a much later stage of its epidemic, we can see the later behavior and check if the exponential model still applies.

library(precrec)

#Filters the tibble to contain only data from China from days 20 to 70. I started at 20 to have great continuity, and closed at 100 because 1000 new deaths are added to the roles shorly after day 100.
df5 <- df %>%
  filter(location == "China" & days_since_5 > 20 & days_since_5 < 100)

#Creates the model using the same method
lm_df5 <- lm(log(df5$total_deaths) ~ df5$days_since_5)

summary(lm_df5)
## 
## Call:
## lm(formula = log(df5$total_deaths) ~ df5$days_since_5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2869 -0.5466  0.1872  0.7327  0.9951 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.968003   0.285907   13.88   <2e-16 ***
## df5$days_since_5 0.052890   0.004454   11.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9028 on 77 degrees of freedom
## Multiple R-squared:  0.6468, Adjusted R-squared:  0.6422 
## F-statistic:   141 on 1 and 77 DF,  p-value: < 2.2e-16

You can already see from the summary values here that the R-squared values are moch lower (from ~.99 to ~.65), which tells us that the fit of this model is poor. Now let us look at the model graphically.

# To display the prediction on the plot in ggplot, the easiest way is to use the model to calculate values and predict them back into the tibble.
predicted_df5 <- data.frame(total_deaths = exp(predict(lm_df5, df5)), days_since_5 = df5$days_since_5)

p5_log <- ggplot(data=filter(df5, location == "China"), mapping=aes(x=days_since_5, y=total_deaths, group=location)) + 
  geom_path() + 
  xlim(20,100) + 
  ylim(8,log(max(5000))) + 
  scale_y_log10(labels = trans_format("log10", math_format(10^.x))) + 
  theme_solarized() +  
  geom_line(color='blue', linetype= 'dashed', size = 2, data = predicted_df5, aes(x=days_since_5, y=total_deaths, group = NA)) + 
  xlim(c(40,80)) + 
  labs(title = "Geometric Plot of Total Deaths in China with Exponential Model") +
  theme(legend.position = 'none')
  ylab('total deaths')
## $y
## [1] "total deaths"
## 
## attr(,"class")
## [1] "labels"
  xlab('days since 5 confirmed cases')
## $x
## [1] "days since 5 confirmed cases"
## 
## attr(,"class")
## [1] "labels"
p5_geo <- ggplot(data=filter(df5, location == "China"), mapping=aes(x=days_since_5, y=total_deaths, group=location)) +
  geom_path() +
  xlim(10,100) +
  ylim(8,5000) +
  theme_solarized() +
  geom_line(color='blue', linetype= 'dashed', size = 2, data = predicted_df5, aes(x=days_since_5, y=total_deaths, group = NA)) +
  xlim(c(40,80)) +
  theme(legend.position = 'none') +
  labs(title = "Geometric Plot of Total Deaths in China with Exponential Model") +
  ylab('total deaths') +
  xlab('days since 5 confirmed cases')


# Lays the log and geo plot side by side
grid.arrange(p5_log,p5_geo)

As the R-squared values suggested, we see major deviation from the data. This strongly suggests that the simple exponential model captures very little about what was happening in China as early as the start of February. This confirms that the simple exponential model is only descriptive in a limited set of cases, particularly when societies have not adapted to the epidemic.

Thanks for reading!