World population data

Population data derived from the World Population Clock website (World Population Clock, 2020).

Dates and times are in Melbourne time, which up until 4th April 2020, is Australian Eastern Daylight Savings Time (AEDT).

knitr::opts_chunk$set(echo = TRUE)
library(lubridate) # for time functions
library(tidyverse)
library(chemCal) # for 'inverse.predict' function
library(formattable) # provide 'comma' function
d <- data.frame(Time = ymd_hms("2020-03-05 04:04:00", tz = "Australia/Melbourne"), Population = 7768625956)
d <- d %>% add_row(Time = ymd_hms("2020-03-06 21:40:00", tz = "Australia/Melbourne"), Population = 7769011917)
d <- d %>% add_row(Time = ymd_hms("2020-03-10 02:21:00", tz = "Australia/Melbourne"), Population = 7769723385)
d <- d %>% add_row(Time = ymd_hms("2020-03-11 04:23:00", tz = "Australia/Melbourne"), Population = 7769964915)
d <- d %>% add_row(Time = ymd_hms("2020-03-19 00:00:00", tz = "Australia/Melbourne"), Population = 7771705611)
d <- d %>% add_row(Time = ymd_hms("2020-03-20 06:51:00", tz = "Australia/Melbourne"), Population = 7771991830)
d <- d %>% add_row(Time = ymd_hms("2020-03-21 02:32:00", tz = "Australia/Melbourne"), Population = 7772174460)

d2 <- data.frame(Time = ymd_hms("2020-03-23 17:30:00", tz = "Australia/Melbourne"), Population = 7772758654)
d2 <- d2 %>% add_row(Time = ymd_hms("2020-03-25 07:52:00", tz = "Australia/Melbourne"), Population = 7773114617)
# observations not used in model creation, but will be used for model testing

Population vs Time Plot

Population growth appears close to linear, as might be expected over a relatively short time period.

ggplot(d, aes (x = Time, y = Population)) +
  geom_point() + # the data used for modeling
  geom_point(data = d2, aes (colour = "Test")) + labs(colour = "Dataset") 

Normalizing and modifying the data

To help with model generation, a new variable nTime is calculated, where time ‘0’ is defined as the time of the first observation (2020-03-05 04:04:00 AEDT). The unit of nTime is seconds.

Population growth is often modeled as exponential. logPopulation is defined as the natural logarithm of the population.

d <- d %>%
  mutate(logPopulation = log(Population), # the log of the Population
         nTime = as.numeric(as.duration(Time - d$Time[[1]]))) 
# the first observation is time 'zero'

d2 <- d2 %>% # do the same for the test data
  mutate(logPopulation = log(Population), # the log of the Population
         nTime = as.numeric(as.duration(Time - d$Time[[1]]))) 
# the first observation is time 'zero'

Models

Model 1 - A simple linear model.

Although population growth is often modeled as exponential, when using data from a relatively short time period, the growth might be approximated as linear.

\[ \begin{align} Population &= \beta_0 + \beta_1 \times t\\ \end{align} \]

where \(t\) is the normalized time nTime.

model1 <- lm(Population ~ nTime, data = d)
summary(model1)
## 
## Call:
## lm(formula = Population ~ nTime, data = d)
## 
## Residuals:
##        1        2        3        4        5        6        7 
## -0.40316 -1.33054  3.99801 -1.47098  0.05119 -5.11852  4.27400 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 7.769e+09  2.270e+00 3.423e+09   <2e-16 ***
## nTime       2.577e+00  2.560e-06 1.007e+06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.593 on 5 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.014e+12 on 1 and 5 DF,  p-value: < 2.2e-16

Model 2 - A simple exponential growth model.

\[ \begin{aligned} Population & = \beta_0\times e^{\beta_1 \times t}\\ \implies log_e Population & = log_e \beta_0 + \beta_1 \times t \end{aligned} \]

model2 <- lm(logPopulation ~ nTime, data = d)
summary(model2)
## 
## Call:
## lm(formula = logPopulation ~ nTime, data = d)
## 
## Residuals:
##          1          2          3          4          5          6 
## -1.345e-08 -2.860e-09  1.109e-08  1.299e-08  3.334e-09 -3.754e-09 
##          7 
## -7.353e-09 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 2.277e+01  6.685e-09 3.406e+09   <2e-16 ***
## nTime       3.317e-10  7.540e-15 4.399e+04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.058e-08 on 5 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.935e+09 on 1 and 5 DF,  p-value: < 2.2e-16

Model 3 - A varying exponential growth model

Perhaps population growth is exponential, but the rate of exponential growth is changing with time e.g. declining due to demographic transition (de Vries, 2013). If, within the timeframe of the data, the rate of exponential growth change is locally linear, this can be modeled:

\[ \begin{aligned} Population & = \beta_0\times e^{(\beta_1 + \beta_2 t ) t}\\ \implies log_e Population & = log_e \beta_0 + (\beta_1 + \beta_2 t) t\\ \implies log_e Population & = log_e \beta_0 + \beta_1t + \beta_2 t^2 \end{aligned} \]

model3 <- lm(logPopulation ~ nTime + I(nTime^2), data = d)
summary(model3)
## 
## Call:
## lm(formula = logPopulation ~ nTime + I(nTime^2), data = d)
## 
## Residuals:
##          1          2          3          4          5          6 
## -8.545e-12 -1.620e-10  4.803e-10 -2.325e-10 -5.416e-12 -6.485e-10 
##          7 
##  5.767e-10 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  2.277e+01  4.383e-10  5.196e+10  < 2e-16 ***
## nTime        3.317e-10  1.783e-15  1.860e+05  < 2e-16 ***
## I(nTime^2)  -5.518e-20  1.204e-21 -4.584e+01 1.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.158e-10 on 4 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.073e+11 on 2 and 4 DF,  p-value: < 2.2e-16

Model performance

d2 # the test data
##                  Time Population logPopulation   nTime
## 1 2020-03-23 17:30:00 7772758654      22.77389 1603560
## 2 2020-03-25 07:52:00 7773114617      22.77394 1741680

Are the predictions of the three models for (2020-03-23 17:30:00 and 2020-03-25 07:52:00) close to the actual observed population of (7,772,758,654 and 7,773,114,617)?

# simple linear model
predict(model1, newdata = d2)
##          1          2 
## 7772758656 7773114619
# simple exponential growth model
exp(predict(model2, newdata = d2)) # need to raise e^logPopulation
##          1          2 
## 7772758866 7773114941
# varying exponential growth model
exp(predict(model3, newdata = d2)) # need to raise e^logPopulation
##          1          2 
## 7772758655 7773114618

The predictions of the simple linear model (model1) and varying exponential growth model (model3) are both superior to the simple exponential growth model (model3), and almost identical.

But the simple linear model (model1) is the simpler model, requiring fewer variables, and so could be considered superior.

Prediction for population 7,777,777,777

target <- 7777777777

Prediction plot

Population ‘target’ of 7,777,777,777 indicated with the horizontal dashed line.

time_range <- ymd_hms("2020-03-05 04:04:00", tz = "Australia/Melbourne"):
  ymd_hms("2020-04-30 00:00:00", tz = "Australia/Melbourne")
time_range <- time_range[seq(1, length(time_range), 60*60*24)] # only keep one 'x' value per day
time_range <- as_datetime(time_range) # convert back to times

pred1 <- data.frame(Time = time_range,
                    Population = predict(model1,
                                         data.frame(nTime = as.numeric(time_range - time_range[1]))))
pred2 <- data.frame(Time = time_range,
                    Population = exp(predict(model2,
                                             data.frame(nTime = as.numeric(time_range - time_range[1])))))
pred3 <- data.frame(Time = time_range,
                    Population = exp(predict(model3,
                                             data.frame(nTime = as.numeric(time_range - time_range[1])))))


plotdata <- bind_rows(d %>% mutate(Observation = "Sample", Model = NA),
                      d2 %>% mutate(Observation = "Test", Model = NA),
                      pred1 %>% mutate(Model = "Model1", Observation = NA),
                      pred2 %>% mutate(Model = "Model2", Observation = NA),
                      pred3 %>% mutate(Model = "Model3", Observation = NA))

ggplot(plotdata, aes(x = Time, y = Population)) +
  geom_point(aes(shape = Observation)) + # the data used for modeling and testing
  scale_shape_manual(values = c(25,24), breaks = c("Sample", "Test")) +
  geom_line(aes(group = Model, color = Model, linetype = Model)) +
  scale_color_manual(values = c("red", "green", "blue"),
                     breaks = c("Model1", "Model2", "Model3"),
                     guide = guide_legend(override.aes = list(shape = rep(NA, 3)))) +
  scale_linetype_manual(values = c("twodash", "dotted", "dashed"),
                        breaks = c("Model1", "Model2", "Model3")) +
  theme_light() +
  geom_hline(yintercept = target, color = "gold1", linetype = "dashed", size = 1) +
  expand_limits(x = ymd_hms("2020-04-30 00:00:00", tz = "Australia/Melbourne"))

All the model predictions for the time of world population achieving 7,777,777,777 are in mid-April. In early April, Melbourne time changes from Daylight Savings (AEDT) to Australian Eastern Standard Time (AEST).

Simple linear model ‘model1

x1 <- inverse.predict(model1, target)
prediction_time1 <- d$Time[1] + dseconds(x1$Prediction)
# dseconds 'duration' includes the change from Daylight Savings to Standard Time

x1
## $Prediction
## [1] 3551067
## 
## $`Standard Error`
## [1] 3.190958
## 
## $Confidence
## [1] 8.202617
## 
## $`Confidence Limits`
## [1] 3551059 3551075
prediction_time1
## [1] "2020-04-15 05:28:26 AEST"
with_tz(prediction_time1, tz = "GMT")
## [1] "2020-04-14 19:28:26 GMT"

Simple exponential growth model ‘model2

x2 <- inverse.predict(model2, log(target)) # need to 'log' the population
prediction_time2 <- d$Time[1] + dseconds(x2$Prediction)
# dseconds 'duration' includes the change from Daylight Savings to Standard Time

x2
## $Prediction
## [1] 3549790
## 
## $`Standard Error`
## [1] 73.00997
## 
## $Confidence
## [1] 187.6781
## 
## $`Confidence Limits`
## [1] 3549602 3549977
prediction_time2
## [1] "2020-04-15 05:07:09 AEST"
with_tz(prediction_time2, tz = "GMT")
## [1] "2020-04-14 19:07:09 GMT"

Varying exponential growth model ‘model3

# Can't use *inverse.predict*, because *inverse.predict* expects only one variable/predictor.

# Guess that the time, in seconds, is the same as the predicted by model1

x3 <- x1$Prediction

repeat {
  p <- round(exp(predict(model3, newdata = data.frame(nTime = x3))))
  if (p == target) {break} # reached the target!
  if (p < target) {x3 <- x3 + 1} # increment one second
  if (p > target) {x3 <- x3 - 1} # decrement one second
}

prediction_time3 <- d$Time[1] + dseconds(x3)
x3
## [1] 3551072
prediction_time3
## [1] "2020-04-15 05:28:31 AEST"
with_tz(prediction_time3, tz = "GMT")
## [1] "2020-04-14 19:28:31 GMT"

Results

The simple linear model prediction (model1 : GMT 2020-04-14 19:28:26) and varying exponential model prediction (model3 : GMT 2020-04-14 19:28:31) are similar.

The ‘actual’ result

15th April 2020, 05:28:32 AM

Commentary

Population growth is often modeled as exponential growth, but the current human population in the world is very heterogeneous with regard to population growth. Some areas have a very youthful age structure, others are much older. Some areas have population growth which could be described as exponential growth, others have much more static, or even declining, population. If there are balancing influences on population growth, then population growth could be logistic (de Vries, 2013), and currently in a linear growth phase.

Although the ‘simple linear (model 1)’ model performed just as well as the ‘varying exponential (model 3)’ on the test data, the varying exponential model intuitively has a somewhat better theoretical basis. As it turned out, the varying exponential model performed better on the target prediction.

Model limitations

All three growth models base current growth on the entirety of the world population. In reality, only some age strata contribute to population growth. The size of these strata, and the reproductive rate of these strata, vary widely across the societies of the world.

Over the next few weeks www.worldometers.info might change their model to deal with unexpected influences on population, e.g. COVID-19 (Roser et al., 2020)!

References

de Vries, B. (2013). Sustainability science (1st ed). Cambridge University Press.

Roser, M., Ritchie, H., & Ortiz-Ospina, E. (2020). Coronavirus Disease (COVID-19) Statistics and Research. Our World in Data.

World Population Clock: 7.8 Billion People (2020) - Worldometer. (2020). https://www.worldometers.info/world-population/.