In this notebook, we will be estimating Zipf’s law using a variety of open, ready-to-use data sets and illustrate the striking empirical regularity. We will be focusing on metropolitan areas, which are larger than cities, and made of multiple counties. The file MSA-population-2015.xls is a spreadsheet with metropolitan areas ranked by population. You’ll recognize some of the most well-known American metropolitan areas, e.g. the New York tristate metropolitan area, the Los Angeles metro area.
Zipf’s law states that when regressing the log(rank) on the log(size) (where size is population) of a metro area, the coefficient is approximately -1.
\[ \log(rank) = \alpha - \zeta \cdot \log(size) + \varepsilon \]
Let’s estimate this relationship with U.S. data.
rm(list=ls())
library(readxl)
us.dta <- readxl::read_excel(path = 'MSA-population-2015.xlsx')
us.dta <- transform(us.dta,
log.rank = log(1+Rank),
log.population.2015 = log(`2015 Estimate`))
plot(log.rank ~ log.population.2015, data = us.dta, main = "A test of Zipf's law for cities",
xlab = '2015 Metro area population', ylab = 'log(Rank)')
There is a remarkable almost linear relationship between the log(Rank) and the log(population). Let’s add the OLS prediction on this graph, and let’s report the regression result.
zipfreg <- lm(log.rank ~ log.population.2015, data = us.dta)
summary(zipfreg)
Call:
lm(formula = log.rank ~ log.population.2015, data = us.dta)
Residuals:
Min 1Q Median 3Q Max
-0.70848 -0.02489 0.01436 0.08204 0.18299
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.873248 0.074590 212.8 <2e-16 ***
log.population.2015 -0.860365 0.005865 -146.7 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1238 on 380 degrees of freedom
Multiple R-squared: 0.9826, Adjusted R-squared: 0.9826
F-statistic: 2.152e+04 on 1 and 380 DF, p-value: < 2.2e-16
plot(log.rank ~ log.population.2015, data = us.dta, main = "A test of Zipf's law for cities",
xlab = '2015 Metro area population', ylab = 'log(Rank)')
abline(zipfreg, lwd = 3, col='red')
Interestingly, while Gabaix (1999) and Krugman (1996) report a coefficient of 1, the coefficient we obtain here is about 0.9, close to 1 but not quite 1. What is striking however is how well the relationship predicts city size. A R Squared of 98% is hard to find in economics.
While the relationship performs well for the middle of the distribution (metro areas such as New Orleans and Salt Lake City), it performs less well for the largest metros (New York), but pretty badly at the bottom of the distribution (metro areas such as Rome, GA, or Fairbanks, AK). Gabaix (1999) provides an explanation based on the greater variance of industry-level shocks in those smaller metros.
Gabaix (1999) considered only 135 metros. How well does Zipf law perform with such a sample? Let’s try.
zipfreg.largest <- lm(log.rank ~ log.population.2015, data = us.dta[1:135,])
summary(zipfreg.largest)
Call:
lm(formula = log.rank ~ log.population.2015, data = us.dta[1:135,
])
Residuals:
Min 1Q Median 3Q Max
-0.42975 -0.04315 -0.00596 0.03829 0.25084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.99413 0.13929 129.2 <2e-16 ***
log.population.2015 -1.01134 0.01002 -100.9 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1004 on 133 degrees of freedom
Multiple R-squared: 0.9871, Adjusted R-squared: 0.987
F-statistic: 1.018e+04 on 1 and 133 DF, p-value: < 2.2e-16
We get almost the same result as Gabaix (1999) ! Good to see that results can be replicated. The constant is significantly higher than in his paper though (10.53 in the paper vs. 17.99 here) but the slope is not statistically different from 1 (-1.01 with a standard error of 0.01).
plot(log.rank ~ log.population.2015, data = us.dta[1:135,], main = "A test of Zipf's law for U.S. Metro Areas",
xlab = '2015 Metro area population', ylab = 'log(Rank)')
abline(zipfreg.largest, lwd = 3, col='red')
We’re going to look at the type of growth processes that can explain Zipf’s law. For that, we’re going to take the equations of Gabaix and simulate city sizes at different time horizons (a few years, a few decades) to see: (i) what growth processes (\(f(\gamma)\)) can explain the c.d.f. of city sizes and (ii) how many periods \(t\) are required to get a distribution that is not statistically different from Zipf.
N <- 200 # we start with 200 cities
T <- 1000 # and 100 years
get.next.distribution <- function(dist, mean, sd) {
for (k in 1:length(dist)) {
dist[k] <- max(dist[k] * rnorm(1, mean, sd), 1000)
}
dist
}
normalize.distribution <- function(dist) {
dist <- dist / sum(dist)
}
make.zipf.dta <- function(dist) {
dta <- data.frame(population = dist[order(dist, decreasing=TRUE)],
rank = 1:length(dist))
dta$log.pop <- log(dta$population)
dta$log.rank <- log(dta$rank)
dta
}
distributions <- c()
distributions[[1]] <- rnorm(N,1,10)*10000 + 10000
for (k in 2:T) {
distributions[[k]] <- get.next.distribution(distributions[[k-1]], 1, 0.1)
}
zipf.dta <- make.zipf.dta(distributions[[T]])
plot(log.rank ~ log.pop, data = zipf.dta, main = 'Monte Carlo Simulation -- Gibrat\'s law and Zipf\'s law',
xlab = 'log(Population)', ylab = 'log(Rank)')
zipfreg <- lm(log.rank ~ log.pop, data = zipf.dta)
abline(zipfreg, lwd = 3, col = 'red')
summary(zipfreg)
Call:
lm(formula = log.rank ~ log.pop, data = zipf.dta)
Residuals:
Min 1Q Median 3Q Max
-0.51958 -0.05271 -0.01006 0.06197 0.37649
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.056752 0.053610 206.2 <2e-16 ***
log.pop -0.849949 0.006694 -127.0 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1054 on 198 degrees of freedom
Multiple R-squared: 0.9879, Adjusted R-squared: 0.9878
F-statistic: 1.612e+04 on 1 and 198 DF, p-value: < 2.2e-16
europedta <- readxl::read_excel('European-metropolitan-areas.xlsx')
europedta <- transform(europedta,
log.rank = log(as.numeric(`Rank`)),
log.pop = log(as.numeric(population)))
plot(log.rank ~ log.pop, data = europedta)
europereg <- lm(log.rank ~ log.pop, data = europedta)
abline(europereg, lwd=3, col='red')
europereg <- lm(log.rank ~ log.pop, data = europedta)
summary(europereg)
Call:
lm(formula = log.rank ~ log.pop, data = europedta)
Residuals:
Min 1Q Median 3Q Max
-0.26718 -0.10875 -0.00365 0.09330 0.43020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.57302 0.10141 45.10 <2e-16 ***
log.pop -1.73948 0.07273 -23.92 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1634 on 22 degrees of freedom
Multiple R-squared: 0.963, Adjusted R-squared: 0.9613
F-statistic: 572 on 1 and 22 DF, p-value: < 2.2e-16