Arknights’ Non-6★ Headhunting Streak EDA

Introduction

This EDA (Exploratory Data Analysis) is in collaboration with @SurfChu85, @satominyaa, @Eyenine_i9 and yours truly, @KaidenFrizu. I would like to thank them by making this small project happen. In this file, R language was used for the simulation. For Python Codes, check it out through this link.

Originally, this came from a probability question that I posted on our Discord server. This problem is related to Arknights’ Pity System at 50th non-6★ pull.

A biased coin has a probability of getting a head on the first attempt is 2%, if its outcome is a tail, the probability of landing a head increases by 2% on the next attempt. The experiment continues until the outcome is a head. What is the expected number of attempts in getting a head?

Eventually, we tried implementing a simulation on Arknights’ gacha system.

Although the simulation and analysis is referenced from this thread made by u/1234gary, there will be some additional info regarding the 6★ Headhunting rate in Arknights. In addition, we will present some graphs to further visualize on what is going on.

Based from the said thread, the simulation draws a 6★ operator every 34.5 draws on average. It is important to point out that the 34.5 does not mean that you will get the operator within its range. It means that when you count each pull streak before you got your 6★, the average of these counts would be close to 34.5. To describe further, take this one scenario below.

I pulled my first five 6★ operators on their following pull streaks: 20, 39, 42, 17, and 65. This means that I pulled my first 6★ at 20th pull, my second 6★ at 39th pull, and so on. Its average is statistically close to 34.5.

Note: Your pull streak resets once you got your 6★ operator.

This means that it is still likely that you will experience saltiness where your streak is 60 or higher within your first few 6★ pull streak, so don’t get the wrong idea about the average.

Methodology

At first, I was expecting that there is a theoretical method in finding the number of trials. The closest thing that we think of is the binomial distribution however, one of the variables required is constantly changing, thus being unable to mathematically solve the problem We thought that crunching Math would be super hard, therefore we just went for a simulation. If there are people who can explain the theoretical part of the gacha rate, let us know in the comments.

Anyways, here are the corresponding versions when this file was last rendered. Also, ggplot2 library was used for this analysis.

print(R.version$version.string)

## [1] "R version 4.0.2 (2020-06-22)"

library(ggplot2);
packageVersion("ggplot2")

## [1] '3.3.2'

To simulate Arknights’ 6★ pull streak count, the following function below was used. This function is patterned from the Math problem I showed earlier, but it is similar to the code that u/1234gary showed in his/her code. Unlike his/her code, its output is a matrix given the number of repetitions (number of 6★ pull streaks) and number of iterations (number of simulated accounts). Each item contains the number of pulls (described in attempt) before getting a 6★ operator (described in result).

Furthermore, this function also considers Arknights’ Pity System where the 6★ rate increases on 51st pull.

Keep in mind that the first index in an array/vector in R starts not with 0, but with 1

expected_attempt <- function(reps, iteration, seed = NULL) {
    if(!is.null(seed)) set.seed(seed)
    dist_matrix <- matrix(nrow = reps, ncol = iteration)
    for (j in 1:iteration) {
        for(i in 1:reps) {
            attempt <- 0
            head_rate <- 0.02
            hit <- FALSE
            while(hit != TRUE) {
                attempt = attempt + 1
                head_rate <- 0.02 * max(1, attempt-49)
                result <- sample(c(0,1)
                                 ,size = 1
                                 ,prob = c(1-head_rate, head_rate)
                                 ,replace = TRUE)
                if(result == 1) hit <- TRUE
            }
            dist_matrix[i,j] <- attempt
        }
    }
    return(dist_matrix)
}

I’m not sure about the attempt-49 part because on the 50th pull, the 6★ rate is still at 2% but it will turn into 4% on the 51st pull. In u/1234gary’s code, it’s attempt-50 which means the rate at 51st attempt is still 2%. Here are the Headhunting details for reference.

From Arknights Fandom Wiki:

If a 6★ Operator does not appear after 50 pulls, each subsequent pull will increase the 6★ Operators’ rate by 2%, up to 100%.

From Arknights Headhunting Details:

If you have not obtained any 6★ operator after 50 rolls, the odds of getting a 6★ Operators will be increased from 2% to 4% in the next roll; And if you still have not obtained any 6★ Operator in the 51st roll, the odds will be increased from 4% to 6% in the next roll. The odds will keep rising by 2% each time until it reaches 100% which means you will definitely get a 6★ Operator… If you have obtained any one 6★ Operator in (Standard) Headhunting, the odds of getting 6★ operators in (Standard) Headhunting will be reduced to 2%.

However, @satominyaa said that the RNG algorithms used in a programming language could be a significant factor in the results. Languages like Python and R uses Mersenne Twister RNG unlike Arknights’ RNG algorithm (presumably Unity/C#) which is Subtractive PRNG.

@satominyaa wrote:

…you’re using the Mersenne Twister RNG, which is not the same as Arknight’s RNG algorithm, which I assume is Unity which in turn uses Mono C# C# in the specifications implements the Subtractive PRNG.

Therefore, this analysis contain two experiments. The first experiment uses Mersenne Twister RNG through the R function set.seed() while the second experiment uses Subtractive PRNG where the generator function is made by @SurfChu85 which is shown below. This code is based from the Rosettacode website.

subtractive_gen <- function(seed = NULL) {
    if(!is.null(seed)) {
        temp <- c()
        reorder_temp <- c()
        
        temp[1] <- seed
        temp[2] <- 1
        
        for(i in 3:55) temp[i] <- (temp[i-2] - temp[i-1]) %% (10^9)
        for(i in 1:55) reorder_temp[i] <- temp[((34*(i)) %% 55)+1]
        for(i in 56:220) (reorder_temp[i] <- (reorder_temp[i-55] - reorder_temp[i-24]) %% (10^9))
        
        temp55 <- tail(reorder_temp, n = 55)
    }
    
    temp55 <- append(temp55, (temp55[56-55] - temp55[56-24]) %% (10^9))
    temp55 <<- tail(temp55, n = 55)
    return(temp55[56])
}

I copied a part of my simulation code to implement the RNG function and assigned it to a new function shown below.

s_expected_attempt <- function(reps, iteration, seed = NULL) {
    if(is.null(seed)) subtractive_gen(abs(mean(rnorm(1, sd = 1000))))
    else subtractive_gen(seed)
    dist_matrix <- matrix(nrow = reps, ncol = iteration)
    for (j in 1:iteration) {
        for(i in 1:reps) {
            attempt <- 0
            head_rate <- 0.02
            hit <- FALSE
            result <- 1
            while(result > head_rate) {
                attempt = attempt + 1
                head_rate <- 0.02 * max(1, attempt-49)
                result <- subtractive_gen() / (10^9)
            }
            dist_matrix[i,j] <- attempt
        }
    }
    return(dist_matrix)
}

Now that the functions are ready, it’s time to run the two simulations. This would be the scenario:

There are 50000 accounts that contain a record of their pull streaks from 5 of their 6★ Operators. The pull streaks follow the theoretical 6★ Headhunting rate (including the Pity System) but their outcome varies by their RNG. Overall, there are 250000 pull streaks.

experiment1 <- expected_attempt(5, 50000, 26492)
experiment2 <- s_expected_attempt(5, 50000, 26492)

The seed number can be any number from \(1\) to \(10^9-1\) (from the Subtractive PRNG restrictions). I just chose 26492 because it deciphers to the word AMIYA based from the old 12-button keypad.

Results

Some of you might have imagined that the probability distribution function is a bell-shaped curve centered at 34.5. This is partially true, but we will explain it how is it different from the actual distribution.

First, we would plot the sampling distribution and get the mean of it. First, we get the average pulls per account and plot them in the histogram. Next, we find the average of all average pulls from all accounts (mean of means).

sampling1 <- ggplot(mapping = aes(x = colMeans(experiment1))) +
    geom_histogram(binwidth = 1) +
    ggtitle("Sampling Distribution of Arknights' 6-star Headhunting Rate"
            ,subtitle = "Mersenne Twister RNG") +
    xlab("Number of Pulls in a Streak") +
    ylab("Frequency")

sampling2 <- ggplot(mapping = aes(x = colMeans(experiment2))) +
    geom_histogram(binwidth = 1) +
    ggtitle("Sampling Distribution of Arknights' 6-star Headhunting Rate"
            ,subtitle = "Subtractive PRNG") +
    xlab("Number of Pulls in a Streak") +
    ylab("Frequency")

print(sampling1)

print(sampling2)

mean(colMeans(experiment1))

## [1] 34.54388

mean(colMeans(experiment2))

## [1] 34.50551

To interpret these graphs, the mean of the accounts’ pull average is 34.544 for Mersenne Twister RNG, and 34.506 for Subtractive PRNG. The results do not differ much from the 34.5 as calculated before. The following graph is a histogram of the average pull in the pull streak.

Question: Is there any difference in their mean between finding the mean of means, and finding the mean of the whole matrix?

When the number of samples given is large, there is no significant difference in their means between those two methods. This is the fundamental definition of the Central Limit Theorem (CLT).

Some might think that: What if we plot the actual pulls in each pull streak? This is possible and once you plot those in the histogram, it would be called sample distribution. It is a distribution of the samples, not the means of the samples as previously illustrated. The following code were used for sample distribution.

sample1 <- ggplot(mapping = aes(x = experiment1)) +
    geom_histogram(binwidth = 1) +
    ggtitle("Sample Distribution of Arknights' 6-star Headhunting Rate"
            ,subtitle = "Mersenne Twister RNG") +
    xlab("Number of Pulls in a Streak") +
    ylab("Frequency")

sample2 <- ggplot(mapping = aes(x = experiment2)) +
    geom_histogram(binwidth = 1) +
    ggtitle("Sample Distribution of Arknights' 6-star Headhunting Rate"
            ,subtitle = "Subtractive PRNG") +
    xlab("Number of Pulls in a Streak") +
    ylab("Frequency")

print(sample1)

print(sample2)

mean(experiment1)

## [1] 34.54388

mean(experiment2)

## [1] 34.50551

The distribution is beautiful, isn’t it? That is the sample distribution and this is close to the probability distribution function (PDF) for the gacha rate in Arknights. Although, we can’t use this for predicting the rates, but it is notable that the distribution spikes starting at 51. However, these simulations behave closely to the theoretical PDF especially at large samples so for now, it is the best possible distribution.

If there’s someone who can find the theoretical PDF by crunching Math, let us know in the comments. Theoretical PDF is the most ideal way to predict the number of pulls in a pull streak for 6★ Headhunting rate.

To give you more details, here is the tally for the number of pulls before getting a 6★ Operator from each experiment.

experiment1_t <- table(experiment1)
print(experiment1_t)

## experiment1
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 4875 4982 4657 4757 4576 4439 4451 4409 4302 4216 4120 4061 3925 3897 3895 3736 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
## 3626 3550 3520 3478 3272 3286 3261 3165 3091 3014 2960 3003 2833 2751 2706 2649 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
## 2650 2521 2499 2394 2432 2355 2286 2258 2192 2233 2237 2067 1973 2050 1986 1890 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
## 1932 1834 3632 5270 6449 7588 8069 8368 8211 7730 7064 6284 5340 4376 3410 2679 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79 
## 1975 1427 1036  702  450  304  184   88   51   33   15    7    2    3    1

experiment2_t <- table(experiment2)
print(experiment2_t)

## experiment2
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 5020 4922 4812 4797 4630 4569 4504 4371 4309 4235 3987 4068 4035 3982 3763 3716 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
## 3570 3545 3394 3406 3374 3224 3227 3183 3127 2988 2929 2895 2848 2746 2760 2614 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
## 2557 2584 2597 2498 2480 2384 2276 2140 2295 2197 2054 2122 2079 2030 1926 1936 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
## 1970 1784 3685 5154 6492 7418 7540 8238 8258 8026 7011 6369 5293 4365 3541 2673 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   81 
## 2014 1565 1046  689  470  288  186  108   55   24   16   10    4    1    2

And here are their tally ratio from each experiment.

10/12 Update: It is no longer in % anymore, so you might have to multiply it by 100 so you would get the percentage

experiment1_p <- as.data.frame(experiment1_t / length(experiment1))
names(experiment1_p) <- c("Number.of.Pulls","Prob")
print(experiment1_t / length(experiment1))

## experiment1
##        1        2        3        4        5        6        7        8 
## 0.019500 0.019928 0.018628 0.019028 0.018304 0.017756 0.017804 0.017636 
##        9       10       11       12       13       14       15       16 
## 0.017208 0.016864 0.016480 0.016244 0.015700 0.015588 0.015580 0.014944 
##       17       18       19       20       21       22       23       24 
## 0.014504 0.014200 0.014080 0.013912 0.013088 0.013144 0.013044 0.012660 
##       25       26       27       28       29       30       31       32 
## 0.012364 0.012056 0.011840 0.012012 0.011332 0.011004 0.010824 0.010596 
##       33       34       35       36       37       38       39       40 
## 0.010600 0.010084 0.009996 0.009576 0.009728 0.009420 0.009144 0.009032 
##       41       42       43       44       45       46       47       48 
## 0.008768 0.008932 0.008948 0.008268 0.007892 0.008200 0.007944 0.007560 
##       49       50       51       52       53       54       55       56 
## 0.007728 0.007336 0.014528 0.021080 0.025796 0.030352 0.032276 0.033472 
##       57       58       59       60       61       62       63       64 
## 0.032844 0.030920 0.028256 0.025136 0.021360 0.017504 0.013640 0.010716 
##       65       66       67       68       69       70       71       72 
## 0.007900 0.005708 0.004144 0.002808 0.001800 0.001216 0.000736 0.000352 
##       73       74       75       76       77       78       79 
## 0.000204 0.000132 0.000060 0.000028 0.000008 0.000012 0.000004

experiment2_p <- as.data.frame(experiment2_t / length(experiment2))
names(experiment2_p) <- c("Number.of.Pulls","Prob")
print(experiment2_t / length(experiment2))

## experiment2
##        1        2        3        4        5        6        7        8 
## 0.020080 0.019688 0.019248 0.019188 0.018520 0.018276 0.018016 0.017484 
##        9       10       11       12       13       14       15       16 
## 0.017236 0.016940 0.015948 0.016272 0.016140 0.015928 0.015052 0.014864 
##       17       18       19       20       21       22       23       24 
## 0.014280 0.014180 0.013576 0.013624 0.013496 0.012896 0.012908 0.012732 
##       25       26       27       28       29       30       31       32 
## 0.012508 0.011952 0.011716 0.011580 0.011392 0.010984 0.011040 0.010456 
##       33       34       35       36       37       38       39       40 
## 0.010228 0.010336 0.010388 0.009992 0.009920 0.009536 0.009104 0.008560 
##       41       42       43       44       45       46       47       48 
## 0.009180 0.008788 0.008216 0.008488 0.008316 0.008120 0.007704 0.007744 
##       49       50       51       52       53       54       55       56 
## 0.007880 0.007136 0.014740 0.020616 0.025968 0.029672 0.030160 0.032952 
##       57       58       59       60       61       62       63       64 
## 0.033032 0.032104 0.028044 0.025476 0.021172 0.017460 0.014164 0.010692 
##       65       66       67       68       69       70       71       72 
## 0.008056 0.006260 0.004184 0.002756 0.001880 0.001152 0.000744 0.000432 
##       73       74       75       76       77       78       81 
## 0.000220 0.000096 0.000064 0.000040 0.000016 0.000004 0.000008

This means that 1.95% of the pull streaks obtained a 6★ Operator in the 1st attempt (using Mersenne Twister RNG) while 1.9928% on the second attempt.

If these values is the values from the theoretical PDF, then these percentages would be the predictive percentages in getting a 6★ Operator in Arknights. However, @SurfChu85 mentioned that it is not allowed to use these percentages as a baseline for predictive probabilities. There are certain percentages that differs from the true percentage as described in the theoretical PDF. The theoretical PDF is the best way to predict these probabilities within pull streaks. No matter how large the sample size is, it would never result to the theoretical PDF, but it could be an estimate of it.

The problem is, we were unable to find the theoretical PDF due to the insane Math needed to solve. When we tried solving it, we found out that the binomial distribution at the start of the Pity System varies as pull streak increases, thus adding more complexity. We also don’t know if there’s a distribution that is applied to the Pity System, not to mention the constant rate up to pull 50. Our best assumption to this theoretical PDF is that it’s a piecewise function that changes at 50. If there is someone who can explain or describe the theoretical PDF of the gacha rate, leave a comment below.

With the given limitations, the best estimate of these probabilities (for now) would be the tally ratio presented above. Keep in mind that these probabilities are not the true percentages as you would expect and there will be some inaccuracies.

To interpret the tally ratio: At the start of your pull streak, there is a 3.2952% chance that you will pull your 6★ Operator in the 56th pull using Subtractive PRNG, while 1.694% on the 10th pull. This goes to the rest of the values and in the Mersenne Twister RNG tally ratio.

The probability distribution can be seen using the code below. The shape is similar to the sample distribution, but the values are different. From the start of your pull streak, here are the probabilities that you will get a 6★ from the number of pulls.

sample1_p <- ggplot(data = experiment1_p
                    ,mapping = aes(x = Number.of.Pulls, y = Prob, width = 0.75)) +
    geom_bar(stat = "identity") +
    scale_x_discrete(breaks = seq(5, length(experiment1_p$Prob), 5)) +
    ggtitle("Probability Distribution of Arknights' 6-star Headhunting Rate"
            ,subtitle = "Mersenne Twister RNG") +
    xlab("Number of Pulls in a Streak") +
    ylab("Probability")

sample2_p <- ggplot(data = experiment2_p
                    ,mapping = aes(x = Number.of.Pulls, y = Prob, width = 0.75)) +
    geom_bar(stat = "identity") +
    scale_x_discrete(breaks = seq(5, length(experiment2_p$Prob), 5)) +
    ggtitle("Probability Distribution of Arknights' 6-star Headhunting Rate"
            ,subtitle = "Subtractive PRNG") +
    xlab("Number of Pulls in a Streak") +
    ylab("Probability")

print(sample1_p)

print(sample2_p)

It is notable that these probabilities shrink at pull 70 because the probability of proceeding to another pull in your pull streak decreases as your chances of having a 6★ Operator increases. Therefore, it is nearly impossible to get a 99 pull streak because it only has a 2% chance. Its probability decreases exponentially when considering the pulls before that happens.

It is quite reasonable that the results stops around pull 80 because by that time, you have a 60% chance of getting a 6★ Operator.

Considering those type of scenarios, we would illustrate the actual probabilities through cumulative distribution. To further interpret the probabilities, we would turn this simulated probability distribution to a cumulative distribution. The results would mean the probability of getting a 6★ in the nth pull or sooner.

It is important to highlight that the following distribution is as inaccurate as the simulated probability distribution. Theoretically, the PDF is the derivative of the CDF (which means that the CDF is the integral of the PDF) but we still have no idea or information about that. To retrieve the cumulative values, we add the probabilities from the sum of the previous probabilities, until the last row has a cumulative value of \(1\). This is my pseudocode in computing the cumulative value.

cdf <- function(column) {
    result <- c()
    for(i in 1:length(column)) {
        if(i == 1) {
            result <- c(result, column[1])
            next
        }
        result <- c(result, column[i]+result[i-1])
    }
    return(result)
}

And here’s how I implemented the function.

experiment1_p$CuProb <- cdf(experiment1_p$Prob)
tail(experiment1_p, n = 10)

##    Number.of.Pulls     Prob   CuProb
## 70              70 0.001216 0.998464
## 71              71 0.000736 0.999200
## 72              72 0.000352 0.999552
## 73              73 0.000204 0.999756
## 74              74 0.000132 0.999888
## 75              75 0.000060 0.999948
## 76              76 0.000028 0.999976
## 77              77 0.000008 0.999984
## 78              78 0.000012 0.999996
## 79              79 0.000004 1.000000

experiment2_p$CuProb <- cdf(experiment2_p$Prob)
tail(experiment2_p, n = 10)

##    Number.of.Pulls     Prob   CuProb
## 70              70 0.001152 0.998376
## 71              71 0.000744 0.999120
## 72              72 0.000432 0.999552
## 73              73 0.000220 0.999772
## 74              74 0.000096 0.999868
## 75              75 0.000064 0.999932
## 76              76 0.000040 0.999972
## 77              77 0.000016 0.999988
## 78              78 0.000004 0.999992
## 79              81 0.000008 1.000000

sample1_c <- ggplot(data = experiment1_p
                    ,mapping = aes(x = Number.of.Pulls, y = CuProb, width = 0.75)) +
    geom_bar(stat = "identity") +
    scale_x_discrete(breaks = seq(5, length(experiment1_p$Prob), 5)) +
    ggtitle("Cumulative Distribution of Arknights' 6-star Headhunting Rate"
            ,subtitle = "Mersenne Twister RNG") +
    xlab("Number of Pulls in a Streak") +
    ylab("Probability")

sample2_c <- ggplot(data = experiment2_p
                    ,mapping = aes(x = Number.of.Pulls, y = CuProb, width = 0.75)) +
    geom_bar(stat = "identity") +
    scale_x_discrete(breaks = seq(5, length(experiment2_p$Prob), 5)) +
    ggtitle("Cumulative Distribution of Arknights' 6-star Headhunting Rate"
            ,subtitle = "Subtractive PRNG") +
    xlab("Number of Pulls in a Streak") +
    ylab("Probability")

print(sample1_c)

print(sample2_c)

You might have noticed that the cumulative probability of 0.5 reaches between 34 and 35 from both of the RNGs. This result corresponds to the mean computed earlier which falls into that range.

Anyways, to interpret the given graph: At the start of your streak, there is a 0.508532% chance that you will pull a 6★ Operator at pull 35 or earlier. Of course, the probability is almost certain in about 80 pulls within your pull streak.

These probabilities cannot be used if you’re in the nth pull streak because that is for conditional probability. I might update this file in the future for that.

Other Stuffs

One thing to add is that: While Subtractive PRNG is used to simulate the RNG of Arknights, it is not known if this RNG implementation were done on the server side. In fact, they might have used another language for it. The assumption is that the developers used the C#/Unity RNG for the Headhunting rates.

In addition, @satominyaa pointed out that there would be a unique seed RNG for each account, but I did not use that in this analysis for the sake of simplicity and reproducibility. In his work, he implemented a random seed for each account created. For more details about his work, visit this article.

For statistical purposes, we would test if there is a significant difference between the results from different RNGs. Though, it is evident by observation that there is no differences between them.

t.test(experiment1, experiment2)

## 
##  Welch Two Sample t-test
## 
## data:  experiment1 and experiment2
## t = 0.64769, df = 5e+05, p-value = 0.5172
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07773725  0.15447325
## sample estimates:
## mean of x mean of y 
##  34.54388  34.50551

For any questions or concerns, leave a comment in the subreddit or in our Twitter.

Analysis Updates

October 12, 2020

Tally percentage is no longer in % anymore (it now becomes Tally Ratio), so you might have to multiply it by 100 so you would get the percentage.
Updated formatting changes.
Added Cumulative Distribution with description