This markdown estimates the Zipf parameter and min x using the method described in Clauset, Shalizi, and Newman (2009) and implemented in poweRlaw package.
Let’s start by reproducing the example from the paper and package documentation. The following example looks at Zipf’s law in Moby Dick.
The data are a vector of word counts sorted from highest to lowest frequency.
# load data
data("moby")
head(moby)
## [1] 14086 6414 6260 4573 4484 4040
Fit a discrete power law
moby_power_law <- displ$new(moby)
Estimate xmin
xmin_est <- estimate_xmin(moby_power_law)
xmin_est$xmin
## [1] 7
In this case, xmin_est is 7.
Update power law with estimate xmin_value.
moby_power_law$setXmin(xmin_est)
Plot the data - the power law sure looks like a good fit to the data.
plot(moby_power_law)
lines(moby_power_law, col=2)
Let’s do inference - is this actually a power law?
bootstrapped_p_value <- bootstrap_p(moby_power_law)
bootstrapped_p_value$p
## [1] 0.71
Large p-values indicate that the data are likely to have come from the model distribution. In this case, the p-value is 0.71 suggesting the power law is a good fit for the data.
Let’s do the same as above except with a community from teh reddit data.
LOCAL_PATH <- "/Volumes/wilbur_the_great/LANGSCALES_subreddit_sample/misc/all_word_counts.csv"
all_counts <- read_csv(LOCAL_PATH) %>%
select(subreddit, total_counts)
reddit_test <- all_counts %>%
filter(subreddit == "morbidquestions") %>%
arrange(-total_counts) %>%
pull(total_counts)
head(reddit_test)
## [1] 133295 108239 91544 89928 70850 66749
Again the data are a vector of counts sorted by frequency.
Fit a discrete power law
reddit_power_law <- displ$new(reddit_test)
Estimate xmin
xmin_est_reddit <- estimate_xmin(reddit_power_law)
xmin_est_reddit$xmin
## [1] 446
In this case, xmin_est is 446.
Update power law with estimate xmin_value.
reddit_power_law$setXmin(xmin_est_reddit)
Plot the data - the power law sure looks like a good fit to the data.
plot(reddit_power_law)
lines(reddit_power_law, col=2)
bootstrapped_p_value_reddit <- bootstrap_p(reddit_power_law, threads = 5, xmax = 1000000)
bootstrapped_p_value_reddit$p
## [1] 0.37