Crowd Ranking

library(dplyr, warn.conflicts = FALSE)
library(actuar, warn.conflicts = FALSE)
library(randomNames)
library(ggplot2)

We are going to make some assumptions. In particular we are going to assume that the number of votes each candidate gets follows a power-law (Pareto) distribution. That is, a few candidates will get a lot of votes, more candidates will get a middling number of votes, and many candidates will get few votes. But we will also assume that the average ranking each leader gets is a normal distribution; that is, that any given vote is evenly distributed around the person’s true merit and worth.

n_candidates <- 100
set.seed(12204)

With those assumptions, lets create a fake group of leaders and voting data results.

leaders <- data_frame(
  name = randomNames(n_candidates, which = "first"),
  n_votes = as.integer(rpareto(n_candidates, 500, 1e6)),
  avg_vote = runif(n_candidates, 0.5, 5.0)) %>%
  mutate(prop_vote = round(n_votes / sum(n_votes), 3),
         stars = round(avg_vote / 0.5 , 0) * 0.5)
head(leaders)

## Source: local data frame [6 x 5]
## 
##        name n_votes avg_vote prop_vote stars
## 1     Sunny    5122    1.540     0.024   1.5
## 2   You-One    3728    3.877     0.017   4.0
## 3 Alexandra     983    3.327     0.005   3.5
## 4      Drew    2118    1.938     0.010   2.0
## 5      John    8098    1.453     0.037   1.5
## 6     Cheng    3020    1.088     0.014   1.0

We could rank the leaders by the number of votes that they received.

leaders %>%
  mutate(rank = min_rank(-n_votes)) %>%
  arrange(rank) %>%
  head(10)

## Source: local data frame [10 x 6]
## 
##        name n_votes avg_vote prop_vote stars rank
## 1  Sachelle    9288    4.826     0.043   5.0    1
## 2   Anthony    9273    3.258     0.043   3.5    2
## 3      John    8098    1.453     0.037   1.5    3
## 4   Desiree    7144    1.588     0.033   1.5    4
## 5    Joshua    6964    4.405     0.032   4.5    5
## 6   Chenell    6284    3.902     0.029   4.0    6
## 7     Reyel    5142    1.626     0.024   1.5    7
## 8     Sunny    5122    1.540     0.024   1.5    8
## 9   Pacomio    5013    3.123     0.023   3.0    9
## 10  Jessica    4973    2.807     0.023   3.0   10

This solution is insufficient because some of the people with the most votes are by all accounts terrible people with low average votes.

We could instead rank people by their average votes.

leaders %>%
  mutate(rank = min_rank(-avg_vote)) %>%
  arrange(rank) %>%
  head(10)

## Source: local data frame [10 x 6]
## 
##         name n_votes avg_vote prop_vote stars rank
## 1     Yichen    3125    4.849     0.014   5.0    1
## 2   Sachelle    9288    4.826     0.043   5.0    2
## 3       Jada     267    4.799     0.001   5.0    3
## 4  Sang Hyun    4791    4.773     0.022   5.0    4
## 5    Michael     936    4.769     0.004   5.0    5
## 6      Kayla    1146    4.695     0.005   4.5    6
## 7   Chandler    1549    4.690     0.007   4.5    7
## 8      Scout      76    4.559     0.000   4.5    8
## 9      Jesse    3849    4.521     0.018   4.5    9
## 10      Noah       8    4.506     0.000   4.5   10

This is inadequate because most of the top ranked people have a low percentage of the vote. (The person with the most votes got 0.043 of the vote.) This problem shows up even though the data was randomly generated and there is no relationship between the number of votes and popularity. We might guess that in the actual voting people vote more often for the people they like better.

In Web Reputation Systems and the Real World, Randy Farmer suggests a liquidity ranked mean. This kind of mean takes into account not just the average number of votes, but the number of votes. Below a certain minimum threshold (the floor) votes are penalized, between the floor and the maximum threshold (the ceiling) votes get a boost for each additional person that votes for them, and above for each vote about the ceiling no additional boost is given. This can be expressed as a function.

liquidity_mean <- function(avg_vote, n_votes, 
                           f = 100, c = 1000,
                           adjustment = 0.2) {
  weight <- min(max((n_votes - f) / c, 0), 1) * 2
  return(avg_vote - adjustment + weight * adjustment)
  }

We can see what this function will do for us. Assume that you have an average ranking of 2.5. What will the weighted ranking be for all votes from 1 to 10,000?

data_frame(n_votes = 1:1e4,
           avg_vote = rep(2.5, 1e4)) %>%
  rowwise() %>%
  mutate(liquid = liquidity_mean(avg_vote, n_votes)) %>%
  ungroup() %>%
  ggplot(aes(x = n_votes, y = liquid)) + geom_line() + ylim(0, 5)

plot of chunk unnamed-chunk-7

Now we can calculate the liquidity mean for each person and rank them.

leaders %>%
  rowwise() %>%
  mutate(liquid = liquidity_mean(avg_vote, n_votes)) %>%
  ungroup() %>%
  mutate(rank = min_rank(-liquid),
         bump = liquid - avg_vote) %>%
  arrange(rank) %>%
  head(10)

##         name n_votes avg_vote prop_vote stars liquid rank    bump
## 1     Yichen    3125    4.849     0.014   5.0  5.049    1  0.2000
## 2   Sachelle    9288    4.826     0.043   5.0  5.026    2  0.2000
## 3  Sang Hyun    4791    4.773     0.022   5.0  4.973    3  0.2000
## 4    Michael     936    4.769     0.004   5.0  4.903    4  0.1344
## 5      Kayla    1146    4.695     0.005   4.5  4.895    5  0.2000
## 6   Chandler    1549    4.690     0.007   4.5  4.890    6  0.2000
## 7      Jesse    3849    4.521     0.018   4.5  4.721    7  0.2000
## 8       Eric    3309    4.490     0.015   4.5  4.690    8  0.2000
## 9       Jada     267    4.799     0.001   5.0  4.666    9 -0.1332
## 10    Joshua    6964    4.405     0.032   4.5  4.605   10  0.2000

Notice that with our default parameters for the liquidity_mean() function, Yichen still bumps out Sachelle for the lead because they both have gotten the maximum bump for having more than a thousand votes. But if we change the ceiling (after which you don’t get an additional bump for having more votes) to 10,000 we get different results.

leaders %>%
  rowwise() %>%
  mutate(liquid = liquidity_mean(avg_vote, n_votes, c = 1e4)) %>%
  ungroup() %>%
  mutate(rank = min_rank(-liquid),
         bump = liquid - avg_vote) %>%
  arrange(rank) %>%
  head(10)

##         name n_votes avg_vote prop_vote stars liquid rank     bump
## 1   Sachelle    9288    4.826     0.043   5.0  4.994    1  0.16752
## 2     Yichen    3125    4.849     0.014   5.0  4.770    2 -0.07900
## 3  Sang Hyun    4791    4.773     0.022   5.0  4.760    3 -0.01236
## 4       Jada     267    4.799     0.001   5.0  4.606    4 -0.19332
## 5    Michael     936    4.769     0.004   5.0  4.602    5 -0.16656
## 6   Chandler    1549    4.690     0.007   4.5  4.548    6 -0.14204
## 7      Kayla    1146    4.695     0.005   4.5  4.537    7 -0.15816
## 8     Joshua    6964    4.405     0.032   4.5  4.479    8  0.07456
## 9      Jesse    3849    4.521     0.018   4.5  4.471    9 -0.05004
## 10      Eric    3309    4.490     0.015   4.5  4.418   10 -0.07164

Now Sachelle beats Yichen. Notice that the stars column contains the average vote rounded to the nearest 0.5. Users will not noticed any difference between the top five people, who all have a 5.0 ranking, but now the people with the most votes win.