library(dplyr, warn.conflicts = FALSE)
library(actuar, warn.conflicts = FALSE)
library(randomNames)
library(ggplot2)
We are going to make some assumptions. In particular we are going to assume that the number of votes each candidate gets follows a power-law (Pareto) distribution. That is, a few candidates will get a lot of votes, more candidates will get a middling number of votes, and many candidates will get few votes. But we will also assume that the average ranking each leader gets is a normal distribution; that is, that any given vote is evenly distributed around the person’s true merit and worth.
n_candidates <- 100
set.seed(12204)
With those assumptions, lets create a fake group of leaders and voting data results.
leaders <- data_frame(
name = randomNames(n_candidates, which = "first"),
n_votes = as.integer(rpareto(n_candidates, 500, 1e6)),
avg_vote = runif(n_candidates, 0.5, 5.0)) %>%
mutate(prop_vote = round(n_votes / sum(n_votes), 3),
stars = round(avg_vote / 0.5 , 0) * 0.5)
head(leaders)
## Source: local data frame [6 x 5]
##
## name n_votes avg_vote prop_vote stars
## 1 Sunny 5122 1.540 0.024 1.5
## 2 You-One 3728 3.877 0.017 4.0
## 3 Alexandra 983 3.327 0.005 3.5
## 4 Drew 2118 1.938 0.010 2.0
## 5 John 8098 1.453 0.037 1.5
## 6 Cheng 3020 1.088 0.014 1.0
We could rank the leaders by the number of votes that they received.
leaders %>%
mutate(rank = min_rank(-n_votes)) %>%
arrange(rank) %>%
head(10)
## Source: local data frame [10 x 6]
##
## name n_votes avg_vote prop_vote stars rank
## 1 Sachelle 9288 4.826 0.043 5.0 1
## 2 Anthony 9273 3.258 0.043 3.5 2
## 3 John 8098 1.453 0.037 1.5 3
## 4 Desiree 7144 1.588 0.033 1.5 4
## 5 Joshua 6964 4.405 0.032 4.5 5
## 6 Chenell 6284 3.902 0.029 4.0 6
## 7 Reyel 5142 1.626 0.024 1.5 7
## 8 Sunny 5122 1.540 0.024 1.5 8
## 9 Pacomio 5013 3.123 0.023 3.0 9
## 10 Jessica 4973 2.807 0.023 3.0 10
This solution is insufficient because some of the people with the most votes are by all accounts terrible people with low average votes.
We could instead rank people by their average votes.
leaders %>%
mutate(rank = min_rank(-avg_vote)) %>%
arrange(rank) %>%
head(10)
## Source: local data frame [10 x 6]
##
## name n_votes avg_vote prop_vote stars rank
## 1 Yichen 3125 4.849 0.014 5.0 1
## 2 Sachelle 9288 4.826 0.043 5.0 2
## 3 Jada 267 4.799 0.001 5.0 3
## 4 Sang Hyun 4791 4.773 0.022 5.0 4
## 5 Michael 936 4.769 0.004 5.0 5
## 6 Kayla 1146 4.695 0.005 4.5 6
## 7 Chandler 1549 4.690 0.007 4.5 7
## 8 Scout 76 4.559 0.000 4.5 8
## 9 Jesse 3849 4.521 0.018 4.5 9
## 10 Noah 8 4.506 0.000 4.5 10
This is inadequate because most of the top ranked people have a low percentage of the vote. (The person with the most votes got 0.043 of the vote.) This problem shows up even though the data was randomly generated and there is no relationship between the number of votes and popularity. We might guess that in the actual voting people vote more often for the people they like better.
In Web Reputation Systems and the Real World, Randy Farmer suggests a liquidity ranked mean. This kind of mean takes into account not just the average number of votes, but the number of votes. Below a certain minimum threshold (the floor) votes are penalized, between the floor and the maximum threshold (the ceiling) votes get a boost for each additional person that votes for them, and above for each vote about the ceiling no additional boost is given. This can be expressed as a function.
liquidity_mean <- function(avg_vote, n_votes,
f = 100, c = 1000,
adjustment = 0.2) {
weight <- min(max((n_votes - f) / c, 0), 1) * 2
return(avg_vote - adjustment + weight * adjustment)
}
We can see what this function will do for us. Assume that you have an average ranking of 2.5. What will the weighted ranking be for all votes from 1 to 10,000?
data_frame(n_votes = 1:1e4,
avg_vote = rep(2.5, 1e4)) %>%
rowwise() %>%
mutate(liquid = liquidity_mean(avg_vote, n_votes)) %>%
ungroup() %>%
ggplot(aes(x = n_votes, y = liquid)) + geom_line() + ylim(0, 5)
Now we can calculate the liquidity mean for each person and rank them.
leaders %>%
rowwise() %>%
mutate(liquid = liquidity_mean(avg_vote, n_votes)) %>%
ungroup() %>%
mutate(rank = min_rank(-liquid),
bump = liquid - avg_vote) %>%
arrange(rank) %>%
head(10)
## name n_votes avg_vote prop_vote stars liquid rank bump
## 1 Yichen 3125 4.849 0.014 5.0 5.049 1 0.2000
## 2 Sachelle 9288 4.826 0.043 5.0 5.026 2 0.2000
## 3 Sang Hyun 4791 4.773 0.022 5.0 4.973 3 0.2000
## 4 Michael 936 4.769 0.004 5.0 4.903 4 0.1344
## 5 Kayla 1146 4.695 0.005 4.5 4.895 5 0.2000
## 6 Chandler 1549 4.690 0.007 4.5 4.890 6 0.2000
## 7 Jesse 3849 4.521 0.018 4.5 4.721 7 0.2000
## 8 Eric 3309 4.490 0.015 4.5 4.690 8 0.2000
## 9 Jada 267 4.799 0.001 5.0 4.666 9 -0.1332
## 10 Joshua 6964 4.405 0.032 4.5 4.605 10 0.2000
Notice that with our default parameters for the liquidity_mean() function, Yichen still bumps out Sachelle for the lead because they both have gotten the maximum bump for having more than a thousand votes. But if we change the ceiling (after which you don’t get an additional bump for having more votes) to 10,000 we get different results.
leaders %>%
rowwise() %>%
mutate(liquid = liquidity_mean(avg_vote, n_votes, c = 1e4)) %>%
ungroup() %>%
mutate(rank = min_rank(-liquid),
bump = liquid - avg_vote) %>%
arrange(rank) %>%
head(10)
## name n_votes avg_vote prop_vote stars liquid rank bump
## 1 Sachelle 9288 4.826 0.043 5.0 4.994 1 0.16752
## 2 Yichen 3125 4.849 0.014 5.0 4.770 2 -0.07900
## 3 Sang Hyun 4791 4.773 0.022 5.0 4.760 3 -0.01236
## 4 Jada 267 4.799 0.001 5.0 4.606 4 -0.19332
## 5 Michael 936 4.769 0.004 5.0 4.602 5 -0.16656
## 6 Chandler 1549 4.690 0.007 4.5 4.548 6 -0.14204
## 7 Kayla 1146 4.695 0.005 4.5 4.537 7 -0.15816
## 8 Joshua 6964 4.405 0.032 4.5 4.479 8 0.07456
## 9 Jesse 3849 4.521 0.018 4.5 4.471 9 -0.05004
## 10 Eric 3309 4.490 0.015 4.5 4.418 10 -0.07164
Now Sachelle beats Yichen. Notice that the stars column contains the average vote rounded to the nearest 0.5. Users will not noticed any difference between the top five people, who all have a 5.0 ranking, but now the people with the most votes win.