Gompertz Covid Forecast

Hugh Whelan


Updated on: 2020-08-10

Background

Good background on Gompertz curves can be found at https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0178691&type=printable. A quick Youtube introduction is at https://youtu.be/0ifT-7K68sk. R code used for an example application of tumor growth modelling is at https://rpubs.com/mathetal/gompertz.

A “Gompertz” equation can be written in two forms that are equivalent from a data fitting perspective. I look at two forms. Form 1: \[f(t) = a * exp(-b * exp(-c*t))\], the three parameters to be estimated are a, b and c. Term t represents time. Parameter a is the upper asymptote. In the context of Covid cumulative death data, one way parameter a can be conceptualized is as an estimate of the population with inherent characteristics that will result in their death from Covid.

Form 2: \[f(t) = a*exp(-exp(-b*(d-Ti)))\] In this form parameter a is the same (as we will see below), , d is the current day number, and parameter b is…

“…a growth-rate coefficient (which affects the slope), and Ti represents time at inflection. The Ti parameter shifts the growth curve horizontally without changing its shape and is therefore what is often termed a location parameter (whereas A and [b] are shape parameters)….” - Tjørve KMC, Tjørve E (2017) The use of Gompertz models in growth analyses, and new Gompertz-model approach: An addition to the Unified-Richards family. PLoS ONE 12(6): e0178691.

Form 2 is easier to interpret, because Ti is interpreted in a straightforward way as the date at which the growth rate begins to slow.

Code used here

My R code (included some additional analyses not shown here) is freely avaialble at Github

Obtain data from https://covidtracking.com/api

Note: Total deaths from covidtracking.com tends to lag those reported by J. Hopkins website.

Aggregate to US level and Fit Equation Forms 1 and 2:

u1<-stLevel[,lapply(.SD, sum,na.rm=TRUE),by=Date,.SDcols=c("death","positive","total","positiveIncrease","totalTestResultsIncrease","deathIncrease","hospitalized","hospitalizedCurrently")] 

u1<-u1[order(Date),]
# Omit fragmentary early data when cum deaths < 50
u11<-u1[death>50,]

data1<- data.frame(seq(1:nrow(u11)),u11$death)
colnames(data1)<-c("d","cd")

library(easynls)
#fit data to the Gompertz model form 1
mod10<-nlsfit(data1, model=10, start = c(a=150000,b=11,c=0.1))
mod10
## $Model
## [1] "y~a*exp(-b*exp(-c*x)"
## 
## $Parameters
##                                cd
## coefficient a        1.456911e+05
## coefficient b        4.810877e+00
## coefficient c        3.184994e-02
## p-value t.test for a 0.000000e+00
## p-value t.test for b 0.000000e+00
## p-value t.test for c 0.000000e+00
## r-squared            9.893000e-01
## adjusted r-squared   9.891000e-01
## AIC                  3.036147e+03
## BIC                  3.048242e+03
#fit data to Gompertz model form 2
library(minpack.lm)
modA<-nlsLM(cd~a*exp(-exp(-b*(d-Ti))),data=data1,start=list(a=150000,b=4,Ti=20)) 
summary(modA)
## 
## Formula: cd ~ a * exp(-exp(-b * (d - Ti)))
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## a  1.457e+05  1.497e+03   97.29   <2e-16 ***
## b  3.185e-02  9.294e-04   34.27   <2e-16 ***
## Ti 4.932e+01  5.815e-01   84.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5177 on 149 degrees of freedom
## 
## Number of iterations to convergence: 19 
## Achieved convergence tolerance: 1.49e-08
data2<- data.frame(cdc[,wkNum],cdc[,totalCovid])
colnames(data2)<-c("wkNum","totalCovid")
#fit CDC date of death weekly data
#omit last 2 weeks of data because of significant lags in reporting to the CDC
modCDC<-nlsLM(totalCovid~a*exp(-exp(-b*(wkNum-Ti))),data=data2[1:(nrow(data2)-2),],start=list(a=150000,b=0.1,Ti=8))

## Lowdon section
# # modCDC<-nlsLM(totalCovid~a*exp(-exp(-b*(wkNum-Ti))),data=data2[1:(nrow(data2)),],start=list(a=100000,b=0.1,Ti=8))
# summary(modCDC)

cdcMod<-unlist(summary(modCDC))

Plot Covid Tracking Cum Covid Death Data vs. Fitted Curve

gomp<-nlsplot(data=data1, model = 10, start = c(a = 150000, b = 11, c = .1), 
         xlab = "Days since Deaths>50, Gompertz Eq. " , ylab = "Cum Covid Deaths", position = 1)

Plot CDC Date of Death Cum Covid Death vs. Fitted Curve

CDC date of death data Gompertz parameters indicates a similar peak between 4/11/2020 and 4/18/2020, and an estimated total of 138,007 Covid deaths. The issue with CDC date of death data is the significant lag of reporting which gets revised upward over the next 5 - 7 weeks.

Plot Daily Deaths vs. Implied By Fitted Curve

Another parameterized equation form that some have proposed to model infections is a “Graph Equation” or “Network Equation” (see https://twitter.com/JasonC88766481/status/1269991092507652098). These equations are used to model something propagated through a network of cluster nodes. One proposed Graph Equation fit to daily deaths is \[K*(t)^q * exp(-(t)/z)\] where t is time, and K, q and z are estimated parameters. This equation is fit to daily deaths.

This can be compared to the first derivative of a Gompertz curve fit to cumulative deaths vs. time. The first derivative of Form 2 of the Gompertz equation is \[a*b*exp(-exp(-b*(t-To))-b*(t-To))\], where again t is time and parameters a, b, and To are estimated.

While these equations are different, they are both 3 parameter equations with declining growth over time. As shown on the graph above, they produce very similar fits to daily deaths. The total forecasted cumulative number of Covid deaths from the Gompertz equation is 145,691, while the forecasted cumulative from the Graph equation is 159,975. Since I am using covidtracking.com for fitting, these can be compared to their currently reported total deaths of 154,947 (this is lower than numbers reported by J. Hopkins).

How Stable Has The Cumulative Death Estimate Been?

#Look at estimates of parameter a over the last 30 days at 5 day intervals
priorDates<-data.frame(seq((nrow(data1)-40),nrow(data1),5))
priorDates$CumEstimate<-0
priorDates$AdjRSquared<-0
priorDates$ParamB<-0
priorDates$ParamC<-0
for (i in 1:nrow(priorDates)) {
  mod10<-nlsfit(data1[1:priorDates[i,1],], model=10, start = c(a=90000,b=8,c=0.05))
  priorDates$CumEstimate[i]<-unlist(mod10)[2]
  priorDates$AdjRSquared[i]<-unlist(mod10)[8]
  priorDates$ParamB[i]<-unlist(mod10)[3]
  priorDates$ParamC[i]<-unlist(mod10)[4]  
  
}
priorDates[,1]<-priorDates[,1]-max(priorDates[,1])
priorDates[,2]<-comma(round(as.numeric(priorDates[,2]),0),digits = 0)
priorDates[,3]<-percent(priorDates[,3])
priorDates[,4]<-comma(priorDates[,4],2)
priorDates[,5]<-comma(priorDates[,5],4)
colnames(priorDates) <- c("Days Ago","Cum Covid Death Est.","Adj. R-Squared","Parameter b", "Parameter c")
priorDates$pctChangeCumCovid<-percent(max(priorDates$`Cum Covid Death Est.`)/priorDates$`Cum Covid Death Est.`-1,digits = 1)
priorDates
##   Days Ago Cum Covid Death Est. Adj. R-Squared Parameter b Parameter c
## 1      -40              123,519         99.88%        6.90      0.0444
## 2      -35              124,965         99.86%        6.72      0.0434
## 3      -30              126,652         99.83%        6.52      0.0423
## 4      -25              128,580         99.78%        6.29      0.0411
## 5      -20              130,880         99.71%        6.03      0.0397
## 6      -15              133,693         99.58%        5.74      0.0380
## 7      -10              137,136         99.40%        5.43      0.0360
## 8       -5              141,125         99.18%        5.12      0.0340
## 9        0              145,691         98.93%        4.81      0.0319
##   pctChangeCumCovid
## 1             18.0%
## 2             16.6%
## 3             15.0%
## 4             13.3%
## 5             11.3%
## 6              9.0%
## 7              6.2%
## 8              3.2%
## 9              0.0%

We can see that cumulative Covid death estimate using this technique has changed by 15.0% over the last 30 days (since 2020-07-11) despite all fits having high adjusted R-squared values. The estimate does appear to be converging however with smaller changes as time passes. This seems consistent with the virus spreading more widely throughout the US, ultimately providing a more stable national estimate.

For comparison the widely cited IHME Univ. of Washington model (https://ihmecovid19storage.blob.core.windows.net/archive/2020-04-29/ihme-covid19.zip) forecasted 72,000 Covid deaths on April 29, and then revised their forecast to 134,000 on May 4th.

mda<-data.frame(seq((nrow(data1)-40),nrow(data1),5))
mda$CumEstimate<-0
mda$ParamB<-0
mda$ParamTo<-0
for (i in 1:nrow(mda)) {
  moda<-summary(nlsLM(cd~a*exp(-exp(-b*(d-To))),data=data1[1:mda[i,1],],start=list(a=90000,b=0.05,To=30)))
  mda$CumEstimate[i]<-moda$parameters[1]
  mda$ParamB[i]<-moda$parameters[2]
  mda$ParamTo[i]<-moda$parameters[3]  
}
mda[,1]<-mda[,1]-max(mda[,1])
mda[,2]<-comma(round(as.numeric(mda[,2]),0),digits = 0)
mda[,3]<-percent(mda[,3])
mda[,4]<-comma(mda[,4],1)

colnames(mda) <- c("Days Ago","Cum Covid Death Est.","Parameter b", "Parameter Ti")
mda$pctChangeCumCovid<-percent(max(mda$`Cum Covid Death Est.`)/mda$`Cum Covid Death Est.`-1,digits = 1)
mda
##   Days Ago Cum Covid Death Est. Parameter b Parameter Ti pctChangeCumCovid
## 1      -40              123,519       4.44%         43.5             18.0%
## 2      -35              124,965       4.34%         43.9             16.6%
## 3      -30              126,652       4.23%         44.3             15.0%
## 4      -25              128,580       4.11%         44.7             13.3%
## 5      -20              130,880       3.97%         45.3             11.3%
## 6      -15              133,693       3.80%         46.0              9.0%
## 7      -10              137,136       3.60%         46.9              6.2%
## 8       -5              141,125       3.40%         48.0              3.2%
## 9        0              145,691       3.18%         49.3              0.0%

Progression of Estimates For Final Total US Covid Deaths

plot(mda$`Days Ago`,mda$`Cum Covid Death Est.`,type='b',ylab="Final Total Covid Death Est.",xlab="Est. Made X Days Ago",col='red',ylim = c(50000,150000))

Comparison of Estimated Cumulative Covid Death Curves: Now vs. Est. 40 Days Ago.

a1<-nlsLM(cd~a*exp(-exp(-b*(d-To))),data=data1[1:(nrow(data1)-40),],start=list(a=90000,b=0.05,To=30))
a2<-nlsLM(cd~a*exp(-exp(-b*(d-To))),data=data1,start=list(a=90000,b=0.05,To=30))

plot(predict(a2,newdata = data1),type='l',col='blue',xlab = "Days Since Deaths>50",ylab="Cum. Covid Deaths", main="Blue line = Curr. Gompertz Est., Red Line = Est. Made 40 days ago")
lines(predict(a1,newdata=data1),col='red')

Interpretation of Change in Form 2 Equation Parameters Over Time

Form 2 model parameters have changed in the following ways:

  • Parameter a – the cumulative Covid death forecast has increased over time.
  • The death growth rate estimate, parameter b has declined.
  • Ti - The date of the growth rate inflection point was extended from 2020-04-24 to 2020-04-30.

It is difficult to relate these changes to increasing (and now decreasing) degrees of lockdown/social distancing. While the slowing growth rate estimate and the extension of the inflection date could be seen as compatible with increasing lockdown implementations, the growing estimate of cumulative deaths doesn’t fit a lockdown hypothesis.

I believe a better hypothesis is that the initial death estimates were dominated by the large number of deaths coming from dense, highly populated cities (e.g. in NY and NJ). The initial Gompertz estimates would have reflected the growth and “carrying capacity” in that sub-system. Since then Covid has spread to interior states/more rural & suburban areas. For example, on April 8th 49% of the US population aged 55 and older lived in counties with 5 or fewer reported Covid deaths. On June 2nd, that percentage had dropped to approximately 15%.

Thus from Covid’s perspective, it found a “larger container” as it expanded beyond larger cities. That larger container was inherently harder to spread in. Finding the larger container meant cumulative death estimates would rise, growth rates would decline, and the date of growth rate inflection was extended.

More generally, each state’s experience (as shown below) has been unique in terms of:

  1. The timing of infection
  2. Population density cluster(s) and how connected they are
  3. Proximity to adjacent states that may have high/low infection rates
  4. How well connected the state is to international and other domestic cities
  5. They may also be different along socio-economic/general health dimensions as well
## Warning: Removed 607 row(s) containing missing values (geom_path).

We can fit Gompertz curves to each state individually. The chart below shows the resulting dispersion of coefficients for Form 2 of the Gompertz equation. Term b, the growth coefficient estimates and Ti (days to growth inflection point) are the x-y coordinates. Term a, the asymptote, is used to calculate the “projDeathRate” which is term a divided by that state’s population aged 55 and older.

## `geom_smooth()` using formula 'y ~ x'

As suspected given what we know about the timing and distribution of Covid deaths in the US, curves for New York and New Jersey have a higher growth parameter than one estimated on all other states combined. Similarly the growth inflection date for New York and New Jersey is sooner than ones estimated for states with lower population densities

Finally, the total Covid death estimate is higher by combining all individual state Gompertz curves vs. the estimate from the single All-US model estimate (104,002,029 vs. 145,691).

I suspect the total forecast using 50 individual state models is less stable, but more accurate (as it captures slower growth curves in some states) than a single US model. Ten days ago the total Covid death estimate using individual state models was 17,213,572 (-83% different).

Appendix A: Sum of individual state models “forecast” of Daily Deaths

aa<-head(sumTab[Date>=Sys.Date(),.(Date,round(dailyDeath,0))],30)
colnames(aa)<-c("Date","Daily Covid Deaths")
kable(aa, table.attr = "style='width:50%;'")
Date Daily Covid Deaths
2020-08-10 686
2020-08-11 691
2020-08-12 697
2020-08-13 702
2020-08-14 708
2020-08-15 715
2020-08-16 721
2020-08-17 727
2020-08-18 734
2020-08-19 741
2020-08-20 748
2020-08-21 755
2020-08-22 763
2020-08-23 771
2020-08-24 778
2020-08-25 786
2020-08-26 795
2020-08-27 803
2020-08-28 812
2020-08-29 820
2020-08-30 829
2020-08-31 838
2020-09-01 848
2020-09-02 857
2020-09-03 867
2020-09-04 877
2020-09-05 887
2020-09-06 897
2020-09-07 908
2020-09-08 918