Five-Star Ratings

A friend sent me this question and data:

So we have a rating of 4.5 out of \(n = 363\) reviews.

n <- 363

Further, we have the following ratings distribution represented using the probability vector \(p\):

p <- c(0.06, 0.01, 0.02, 0.12, 0.79) # percent scaled to [0,1]
names(p) <- 1:5
p

##    1    2    3    4    5 
## 0.06 0.01 0.02 0.12 0.79

sum(p)  # should sum to 1

## [1] 1

Furthermore, assume that the average of 4.5 is unweighted.

We can get the counts using \(n\) and \(p\) to get

counts <- round(n * p)
counts

##   1   2   3   4   5 
##  22   4   7  44 287

Then ‘uncount’ to get a vector of ratings of length 363 (approx)

ratings <- rep(1:5, counts)
str(ratings)

##  int [1:364] 1 1 1 1 1 1 1 1 1 1 ...

table(ratings)

## ratings
##   1   2   3   4   5 
##  22   4   7  44 287

with sum 1662 and mean 4.5659341, which rounded to 1 decimal place, gives 4.6. Given that the image shows a rounded rating of 4.5, the ratings vector we’ve generated doesn’t seem correct.

Call this sum of ratings \(y\).

y <- sum(ratings)

Also, n differs from the length of the ratings vector by 1

## [1] 363

length(ratings)

## [1] 364

so we can reassign n for the moment

n <- length(ratings)
n

## [1] 364

We want to know how many five-star ratings need to be added to bring the average rating to 4.6.

Brute force solution

We can brute force this easily:

iter <- 0
current_mean <- mean(ratings)

# keep adding 5 to the ratings vector until the mean is >= 4.6
while (current_mean < 4.6) {
  iter <- iter + 1
  current_mean <- mean(c(ratings, rep(5, iter)))
}

iter

## [1] 31

current_mean

## [1] 4.6

So we need about 31 additional five-star ratings.

Algebraic solution

Assume that only five-star ratings can be added.

Solve for \(n_1\) in the following equation:

\[\frac{y + (n_1 \times 5)}{n + n_1} = 4.6\]

which can be rewritten as

\[5n_1 = 4.6 \times (n + n_1) - y\] and simplified to yield

\[n_1 = \frac{(4.6 \times n) - y}{0.4}\]

Plugging in values of \(y\) and \(n\), we get

n1 <- ((4.6 * n) - y) / 0.4
n1

## [1] 31

which is the same as the brute force solution above.

Problem

The ratings vector isn’t really the correct one as it has length 364 instead of 363 and a mean of 4.5659341 which is above \(\gt 4.5\) but actually should be \(\leq 4.5\).

Resolution

We can work around this issue in two steps:

Generate samples with replacement from the “incorrect” ratings vector above.
- A generated sample is retained if it matches all the statistics we see in the image, i.e.,
  - the mean of the sample rounded to 1 decimal place equals 4.5
  - the category percentages are equal to the ones in the image rounded to 2 decimal places
  - all 1-5 categories are present in the sample
For each of the retained samples, we can plug-in the sum of the sample into the formula for \(n_1\) derived above to get the distribution of the number of five-star ratings needed.

# set up multicore cluster
# code taken from: https://www.blasbenito.com/post/02_parallelizing_loops_with_r/
library(foreach)

## Warning: package 'foreach' was built under R version 4.0.5

ncores <- parallel::detectCores() - 1

my_cluster <- parallel::makeCluster(ncores, type = "PSOCK")

my_cluster

## socket cluster with 15 nodes on host 'localhost'

doParallel::registerDoParallel(cl = my_cluster)

foreach::getDoParRegistered()

## [1] TRUE

foreach::getDoParWorkers()

## [1] 15

n_samples <- 1.5e6 # running 100k on 15 cores

size <- 363

samples <- foreach (i = 1:n_samples) %dopar% {
  shuffle <- sample(ratings, size = size, replace = TRUE)
  p_i <- round(table(shuffle) / size, 2)
  if ((round(mean(shuffle), 1) == 4.5) && (length(p_i) == 5) && (all(p_i == p))) {
    #print(i)
    #samples[[iter]] <- shuffle
    #iter <- iter + 1
    shuffle
  }
}

# remove all the NULL elements
samples <- purrr::compact(samples)

parallel::stopCluster(cl = my_cluster)

This gives us a total of 530 samples that are consistent with the statistics in the image.

Now for each of these samples, we can get the distribution of the sums of these random vectors \(y_i\) and use the formula derived above to calculate the number of five-star ratings \(n_1\).

n1_dist <- unlist(lapply(samples, function(x) {return(ceiling(((4.6 * size) - sum(x)) / 0.4))}))
table(n1_dist)

## n1_dist
##  47  50 
## 364 166

So between 47 and 50 additional 5 star ratings are needed.

Based on values of the n1_dist, there seem to be only two vectors that are compatible with the statistics.

First one:

possibility_1 <- samples[[which.min(unlist(lapply(samples, sum)))]]
table(possibility_1)

## possibility_1
##   1   2   3   4   5 
##  23   5   8  42 285

((4.6 * size) - sum(possibility_1)) / 0.4

## [1] 49.5

Second one:

possibility_2 <- samples[[which.max(unlist(lapply(samples, sum)))]]
table(possibility_2)

## possibility_2
##   1   2   3   4   5 
##  23   5   7  43 285

((4.6 * size) - sum(possibility_2)) / 0.4

## [1] 47

So based on these simulations, the chance that it came from possibility 2 is:

100 * round(unname((table(n1_dist) / sum(table(n1_dist)))[1]), 2)

## [1] 69

Five-Star Ratings

Akshat Dwivedi

2021-11-14

Brute force solution

Algebraic solution

Problem

Resolution