A friend sent me this question and data:
So we have a rating of 4.5 out of \(n = 363\) reviews.
n <- 363
Further, we have the following ratings distribution represented using the probability vector \(p\):
p <- c(0.06, 0.01, 0.02, 0.12, 0.79) # percent scaled to [0,1]
names(p) <- 1:5
p
## 1 2 3 4 5
## 0.06 0.01 0.02 0.12 0.79
sum(p) # should sum to 1
## [1] 1
Furthermore, assume that the average of 4.5 is unweighted.
We can get the counts using \(n\) and \(p\) to get
counts <- round(n * p)
counts
## 1 2 3 4 5
## 22 4 7 44 287
Then ‘uncount’ to get a vector of ratings of length 363 (approx)
ratings <- rep(1:5, counts)
str(ratings)
## int [1:364] 1 1 1 1 1 1 1 1 1 1 ...
table(ratings)
## ratings
## 1 2 3 4 5
## 22 4 7 44 287
with sum 1662 and mean 4.5659341, which rounded to 1 decimal place, gives 4.6. Given that the image shows a rounded rating of 4.5, the ratings vector we’ve generated doesn’t seem correct.
Call this sum of ratings \(y\).
y <- sum(ratings)
Also, n differs from the length of the ratings vector by 1
n
## [1] 363
length(ratings)
## [1] 364
so we can reassign n for the moment
n <- length(ratings)
n
## [1] 364
We want to know how many five-star ratings need to be added to bring the average rating to 4.6.
We can brute force this easily:
iter <- 0
current_mean <- mean(ratings)
# keep adding 5 to the ratings vector until the mean is >= 4.6
while (current_mean < 4.6) {
iter <- iter + 1
current_mean <- mean(c(ratings, rep(5, iter)))
}
iter
## [1] 31
current_mean
## [1] 4.6
So we need about 31 additional five-star ratings.
Assume that only five-star ratings can be added.
Solve for \(n_1\) in the following equation:
\[\frac{y + (n_1 \times 5)}{n + n_1} = 4.6\]
which can be rewritten as
\[5n_1 = 4.6 \times (n + n_1) - y\] and simplified to yield
\[n_1 = \frac{(4.6 \times n) - y}{0.4}\]
Plugging in values of \(y\) and \(n\), we get
n1 <- ((4.6 * n) - y) / 0.4
n1
## [1] 31
which is the same as the brute force solution above.
The ratings vector isn’t really the correct one as it has length 364 instead of 363 and a mean of 4.5659341 which is above \(\gt 4.5\) but actually should be \(\leq 4.5\).
We can work around this issue in two steps:
# set up multicore cluster
# code taken from: https://www.blasbenito.com/post/02_parallelizing_loops_with_r/
library(foreach)
## Warning: package 'foreach' was built under R version 4.0.5
ncores <- parallel::detectCores() - 1
my_cluster <- parallel::makeCluster(ncores, type = "PSOCK")
my_cluster
## socket cluster with 15 nodes on host 'localhost'
doParallel::registerDoParallel(cl = my_cluster)
foreach::getDoParRegistered()
## [1] TRUE
foreach::getDoParWorkers()
## [1] 15
n_samples <- 1.5e6 # running 100k on 15 cores
size <- 363
samples <- foreach (i = 1:n_samples) %dopar% {
shuffle <- sample(ratings, size = size, replace = TRUE)
p_i <- round(table(shuffle) / size, 2)
if ((round(mean(shuffle), 1) == 4.5) && (length(p_i) == 5) && (all(p_i == p))) {
#print(i)
#samples[[iter]] <- shuffle
#iter <- iter + 1
shuffle
}
}
# remove all the NULL elements
samples <- purrr::compact(samples)
parallel::stopCluster(cl = my_cluster)
This gives us a total of 530 samples that are consistent with the statistics in the image.
Now for each of these samples, we can get the distribution of the sums of these random vectors \(y_i\) and use the formula derived above to calculate the number of five-star ratings \(n_1\).
n1_dist <- unlist(lapply(samples, function(x) {return(ceiling(((4.6 * size) - sum(x)) / 0.4))}))
table(n1_dist)
## n1_dist
## 47 50
## 364 166
So between 47 and 50 additional 5 star ratings are needed.
Based on values of the n1_dist, there seem to be only two vectors that are compatible with the statistics.
First one:
possibility_1 <- samples[[which.min(unlist(lapply(samples, sum)))]]
table(possibility_1)
## possibility_1
## 1 2 3 4 5
## 23 5 8 42 285
((4.6 * size) - sum(possibility_1)) / 0.4
## [1] 49.5
Second one:
possibility_2 <- samples[[which.max(unlist(lapply(samples, sum)))]]
table(possibility_2)
## possibility_2
## 1 2 3 4 5
## 23 5 7 43 285
((4.6 * size) - sum(possibility_2)) / 0.4
## [1] 47
So based on these simulations, the chance that it came from possibility 2 is:
100 * round(unname((table(n1_dist) / sum(table(n1_dist)))[1]), 2)
## [1] 69