Estimating Zipf with Clauset et al. method

This markdown estimates the Zipf parameter and min x using the method described in Clauset, Shalizi, and Newman (2009) and implemented in poweRlaw package.

Let’s start by reproducing the example from the paper and package documentation. The following example looks at Zipf’s law in Moby Dick.

The data are a vector of word counts sorted from highest to lowest frequency.

# load data
data("moby")
head(moby)

## [1] 14086  6414  6260  4573  4484  4040

Fit a discrete power law

moby_power_law <- displ$new(moby)

Estimate xmin

xmin_est <- estimate_xmin(moby_power_law)
xmin_est$xmin

## [1] 7

In this case, xmin_est is 7.

Update power law with estimate xmin_value.

moby_power_law$setXmin(xmin_est)

Plot the data - the power law sure looks like a good fit to the data.

plot(moby_power_law)
lines(moby_power_law, col=2)

Let’s do inference - is this actually a power law?

bootstrapped_p_value <- bootstrap_p(moby_power_law)
bootstrapped_p_value$p

## [1] 0.71

Large p-values indicate that the data are likely to have come from the model distribution. In this case, the p-value is 0.71 suggesting the power law is a good fit for the data.

Our data

Let’s do the same as above except with a community from teh reddit data.

LOCAL_PATH <-  "/Volumes/wilbur_the_great/LANGSCALES_subreddit_sample/misc/all_word_counts.csv"

all_counts <- read_csv(LOCAL_PATH) %>%
  select(subreddit, total_counts)

reddit_test <- all_counts %>%
  filter(subreddit == "morbidquestions") %>%
  arrange(-total_counts) %>%
  pull(total_counts)

head(reddit_test)

## [1] 133295 108239  91544  89928  70850  66749

Again the data are a vector of counts sorted by frequency.

Fit a discrete power law

reddit_power_law <- displ$new(reddit_test)

Estimate xmin

xmin_est_reddit <- estimate_xmin(reddit_power_law)
xmin_est_reddit$xmin

## [1] 446

In this case, xmin_est is 446.

Update power law with estimate xmin_value.

reddit_power_law$setXmin(xmin_est_reddit)

Plot the data - the power law sure looks like a good fit to the data.

plot(reddit_power_law)
lines(reddit_power_law, col=2)

bootstrapped_p_value_reddit <- bootstrap_p(reddit_power_law, threads = 5, xmax = 1000000)
bootstrapped_p_value_reddit$p

## [1] 0.37

Estimating Zipf with Clauset et al. method

Molly Lewis

2020-02-07

Our data