1 Storing data frames from Reconciliation Analysis I

We will first restore the necessary data frames from Reconciliation Analysis I [1] to continue the study. The invoked packages are hidden.
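Based on the functions called later in this report, the hidden setup chunk presumably attaches something like the packages below; this list is our own inference and is not shown in the original source.

# presumed setup (assumption): packages inferred from the functions used below
library(ggplot2)   # ggplot(), geom_density(), geom_segment(), ...
library(cowplot)   # plot_grid()
library(dplyr)     # mutate(), filter(), case_when(), full_join(), ...
library(tidyr)     # spread()
# reshape2 is accessed via reshape2::melt() and only needs to be installed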

PDF_daily_impute_char_cumulated <- 
  read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Reconciliation Analysis I Data/PDF_daily_impute_char_cumulated.csv")

PDF_indiv_reconciliation_non <-
  read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Reconciliation Analysis I Data/PDF_indiv_reconciliation_non.csv")

PDF_indiv_reconciliation_non_dynamic <-
  read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Reconciliation Analysis I Data/PDF_indiv_reconciliation_non_dynamic.csv")

df_redemptions <-
  read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Converted Raw Data/df_redemptions.csv")

df_revenue_view <-
  read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Converted Raw Data/df_revenue_view.csv")

df_users_thru_jan23 <- 
  read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Converted Raw Data/df_users_thru_jan23.csv")

2 Local peaks of the discrepancy amount

In Reconciliation Analysis I [1], we displayed the distribution of the discrepancy amount in histograms. We found that many observations cluster at small discrepancy amounts, which can reasonably be interpreted as those accounts being offered a “suspicious” promotion by the project or the application. To help identify these local peaks, we generate a panel of density plots of discrep_summary_non below, at the full discrepancy amount scale, the 5K scale, the 1K scale, and the 0.3K scale, respectively.

densplot1 <- 
  ggplot(PDF_indiv_reconciliation_non, aes(x = discrep_summary_non)) +
  geom_density(fill = "#FF6666", alpha = 0.5) +
  labs(x = "discrepancy amount ($)", y = "density") +
  theme_minimal()
densplot2 <- densplot1 + xlim(c(0, 5000))

densplot3 <- densplot1 + xlim(c(0, 1000))

densplot4 <- densplot1 + xlim(c(0, 300))

plot_grid(densplot1, densplot2, densplot3, densplot4, labels = "AUTO")
## Warning: Removed 18 rows containing non-finite values (`stat_density()`).
## Warning: Removed 300 rows containing non-finite values (`stat_density()`).
## Warning: Removed 2049 rows containing non-finite values (`stat_density()`).
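The warnings above are expected: xlim() drops out-of-range observations before the density is computed, so the zoomed panels re-estimate the density on the remaining data. If we only wanted to zoom the view while keeping the original estimate, coord_cartesian() would be an alternative; a minimal sketch:

# zoom without dropping observations: the curve stays identical to densplot1
# and no removal warnings are raised
densplot2_zoom <- densplot1 + coord_cartesian(xlim = c(0, 5000))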

It appears that figure \(D\) above has several pronounced peaks in density. A local peak occurs where the first difference of the density changes sign from positive to negative; in R, this can be detected with the diff(sign(diff())) idiom. We will annotate the discrepancy amount, rounded to 2 decimals, right above each local peak on the diagram below.
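As a toy illustration of the idiom (the numbers are made up), consider a vector that rises and then falls:

y <- c(1, 3, 5, 4, 2)

diff(y)              # 2  2 -1 -2
sign(diff(y))        # 1  1 -1 -1
diff(sign(diff(y)))  # 0 -2  0

which(diff(sign(diff(y))) < 0)  # 2, one position before the maximum y[3]

Note that which() flags the index one position before the maximum itself; on density()'s default 512-point grid, this offset is at most one grid step.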

PDF_indiv_reconciliation_non_seg1 <- 
  PDF_indiv_reconciliation_non[PDF_indiv_reconciliation_non$discrep_summary_non <= 300, ]

density_non_seg1 <- density(PDF_indiv_reconciliation_non_seg1$discrep_summary_non)

# positions where the sign of the first difference turns negative, i.e. local peaks
peak_idx_seg1 <- which(diff(sign(diff(density_non_seg1$y))) < 0)

peaks_x_seg1 <- density_non_seg1$x[peak_idx_seg1]

peaks_y_seg1 <- density_non_seg1$y[peak_idx_seg1]

peaks_pdf_seg1 <- data.frame(x = peaks_x_seg1, y = peaks_y_seg1)

ggplot(PDF_indiv_reconciliation_non_seg1, aes(x = discrep_summary_non)) +
  geom_density(fill = "#FF6666", alpha = 0.5) +
  geom_text(data = peaks_pdf_seg1, aes(x = x, y = y, label = round(x, 2)), 
            vjust = -1, color = "black", size = 3) +
  xlim(c(0, 300)) +
  labs(x = "Discrepancy Amount ($)", y = "Density",
       title = "Density Plot of Discrepancy Amount (<= $300)") +
  theme_minimal() 

The result aligns closely with our earlier conjecture, and the peaks follow a striking counting pattern: each one sits slightly below its nearest multiple of 25 (i.e., 24.32 < 25, 48.89 < 50, 74.1 < 75, etc.). Two explanations seem plausible. First, the firm may have entered inaccurate numbers by dropping decimal digits. Second, and in my view more likely, customers may pay extra taxes when they redeem, which shrinks their discrepancy amount slightly (e.g., suppose you pay 200 dollars to purchase a credit of 250 dollars; the discrepancy should ideally be 50, but the additional tax charge may drag the value down to, say, 48.89). Whatever the cause, we can label these peaks’ nearest neighbors among the multiples of 25 as marking “suspicious” promoted accounts under certain projects.
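A hypothetical back-of-the-envelope version of the tax explanation (the fee rate below is invented purely for illustration, not estimated from the data):

credit   <- 250      # credit received upon redemption
payment  <- 200      # nominal payment
fee_rate <- 0.00555  # hypothetical tax/fee rate

credit - payment                   # ideal discrepancy: 50
credit - payment * (1 + fee_rate)  # shrinks to ~48.89 once fees are added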

3 Categorization of the “suspicious” promotion variable

Instead of simply attaching a binary “yes” or “no” label to this potential observation (per user_id per project_id), the height of each peak can also serve as a standard for grading the extent of being promoted. We will manually create “confidence intervals” around each of the rounded peak locations labeled in the plot above. We do not strictly follow the definition of a confidence interval here, because a literal interval would be too narrow (we could adjust the \(z\) score to allow a wider cover, but for simplicity we directly set the minimum width to 3). Each interval should cover both the peak indicated above and its nearest multiple of 25. Hence, we implement the algorithm below to compute the element-wise midpoints between the two vectors; from the midpoints, we can easily derive upper and lower “CI” bounds that cover both the listed peaks and the multiples of 25.

peak_list <- c(24.32, 48.89, 74.1, 99.3, 123.25, 149.08, 
               174.28, 198.86, 222.8, 249.27, 273.21, 298.41)

multiples_of_25 <- c(25, 50, 75, 100, 125, 150, 
                     175, 200, 225, 250, 275, 300)

midpoints <- (peak_list + multiples_of_25) / 2

min_interval_width <- 3 # may vary based on personal judgment

interval_list <- list()

for(i in seq_along(midpoints)) {
  
  # half-width: at least min_interval_width / 2, and wide enough to reach
  # both the peak and the multiple of 25 from the midpoint
  dist_to_peak <- abs(midpoints[i] - peak_list[i])
  
  dist_to_multiple <- abs(midpoints[i] - multiples_of_25[i])
  
  half_width <- max(dist_to_peak, dist_to_multiple, min_interval_width / 2)
  
  lower_bound <- midpoints[i] - half_width
  
  upper_bound <- midpoints[i] + half_width
  
  interval_list[[i]] <- 
    c("mid" = midpoints[i], "lowerCI" = lower_bound, "upperCI" = upper_bound)
  
}

interval_df <- do.call(rbind, interval_list)

interval_df <- as.data.frame(interval_df)

interval_df$peak_list <- peak_list

interval_df$multiples_of_25 <- multiples_of_25

interval_df <- interval_df[, c("peak_list", "multiples_of_25", 
                               "mid", "lowerCI", "upperCI")]
interval_df
##    peak_list multiples_of_25     mid lowerCI upperCI
## 1      24.32              25  24.660  23.160  26.160
## 2      48.89              50  49.445  47.945  50.945
## 3      74.10              75  74.550  73.050  76.050
## 4      99.30             100  99.650  98.150 101.150
## 5     123.25             125 124.125 122.625 125.625
## 6     149.08             150 149.540 148.040 151.040
## 7     174.28             175 174.640 173.140 176.140
## 8     198.86             200 199.430 197.930 200.930
## 9     222.80             225 223.900 222.400 225.400
## 10    249.27             250 249.635 248.135 251.135
## 11    273.21             275 274.105 272.605 275.605
## 12    298.41             300 299.205 297.705 300.705
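As a quick sanity check (a sketch), every interval should contain both its peak and its multiple of 25, which the following line verifies against the table above:

# should return TRUE for the table above
with(interval_df, all(lowerCI <= peak_list & peak_list <= upperCI &
                        lowerCI <= multiples_of_25 & multiples_of_25 <= upperCI))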

Next, we will subjectively assess the likelihood of a promotion under each interval of the discrepancy amount. The levels we will encode in the mutated column suspicious_promo are 0 (very unlikely to be offered promotions), 1 (unlikely), 2 (likely), and 3 (very likely). The judgement here is not objective but reflects our intuitive classification. Since the peak around 48.89 is so striking, I classify it as 3. The peaks around 24.32, 74.1, and 99.3 are also relatively prominent, so I classify them as 2. The remaining peaks on the plot above are less obvious, though they still exhibit the pattern discovered above, so it is reasonable to label them 1. Any amount outside these ranges is labeled 0.

density_estimate <- 
  density(PDF_indiv_reconciliation_non_seg1$discrep_summary_non)

peak_density <- 
  approx(x = density_estimate$x, y = density_estimate$y, xout = peak_list)$y

interval_df$peak_density <- peak_density

compute_density_values <- function(lower, upper) {
  
  x_values <- seq(lower, upper, length.out = 100)
  
  y_values <- approx(x = density_estimate$x, y = density_estimate$y, xout = x_values)$y
  
  data.frame(x = x_values, y = y_values)
  
}

interval_densities <- lapply(1:nrow(interval_df), function(i) {
  
  compute_density_values(interval_df$lowerCI[i], interval_df$upperCI[i])
  
  })

interval_densities_df <- do.call(rbind, interval_densities)

interval_densities_df$group <- rep(1:nrow(interval_df), each = 100)

ggplot(PDF_indiv_reconciliation_non_seg1, aes(x = discrep_summary_non)) +
  geom_density(fill = "#FF6666", alpha = 0.5) +
  geom_text(data = peaks_pdf_seg1, aes(x = x, y = y, label = round(x, 2)), 
            vjust = -1, color = "black", size = 3) +
  geom_segment(data = interval_df, 
               aes(x = lowerCI, xend = lowerCI, y = 0, yend = peak_density), 
               color = "black", linetype = "dashed", linewidth = 0.7) +
  geom_segment(data = interval_df, 
               aes(x = upperCI, xend = upperCI, y = 0, yend = peak_density), 
               color = "black", linetype = "dashed", linewidth = 0.7) +
  geom_ribbon(data = interval_densities_df, aes(x = x, ymin = 0, ymax = y, fill = factor(group)), 
              alpha = 0.9) +
  xlim(c(0, 305)) +
  labs(x = "Discrepancy Amount ($)", y = "Density",
       title = "Density Plot of Discrepancy Amount (<= $305)",
       fill = "Group") +
  scale_fill_manual(values = c("1" = "orchid", "2" = "darkorchid4", "3" = "orchid", "4" = "orchid",
                               "5" = "plum", "6" = "plum", "7" = "plum", "8" = "plum",
                               "9" = "plum", "10" = "plum", "11" = "plum", "12" = "plum"),
                    labels = c("1" = "[23.160, 26.160]", "2" = "[47.945, 50.945]", "3" = "[73.050, 76.050]", 
                               "4" = "[98.150, 101.150]", "5" = "[122.625, 125.625]", "6" = "[148.040, 151.040]", 
                               "7" = "[173.140, 176.140]", "8" = "[197.930, 200.930]", "9" = "[222.400, 225.400]", 
                               "10" = "[248.135, 251.135]", "11" = "[272.605, 275.605]", "12" = "[297.705, 300.705]")) +   theme_minimal()

The visualization above strictly applies this categorization: 3 is colored darkorchid4, 2 orchid, 1 plum, and 0 #FF6666, the background fill of the density plot. As a mnemonic, the darker the color, the higher the likelihood of being offered promotions.

4 Mutation of the “suspicious” promotion variable

With the theoretical foundations built above, we can implement the following two tasks.

4.1 Mutation of the “suspicious” promotion variable in interval_df

Add a new column to interval_df called suspicious_promo, assigning the factor 3 to the second row, 2 to the 1st, 3rd, and 4th rows, and 1 to the rest of the rows.

interval_df <- interval_df |> 
  mutate(suspicious_promo = case_when(row_number() == 2 ~ 3, 
                                      row_number() %in% c(1, 3:4) ~ 2, 
                                      TRUE ~ 1))

interval_df
##    peak_list multiples_of_25     mid lowerCI upperCI peak_density
## 1      24.32              25  24.660  23.160  26.160 0.0204172075
## 2      48.89              50  49.445  47.945  50.945 0.0294787060
## 3      74.10              75  74.550  73.050  76.050 0.0125878868
## 4      99.30             100  99.650  98.150 101.150 0.0061677817
## 5     123.25             125 124.125 122.625 125.625 0.0018325601
## 6     149.08             150 149.540 148.040 151.040 0.0013604500
## 7     174.28             175 174.640 173.140 176.140 0.0006942789
## 8     198.86             200 199.430 197.930 200.930 0.0010070916
## 9     222.80             225 223.900 222.400 225.400 0.0004112768
## 10    249.27             250 249.635 248.135 251.135 0.0006718587
## 11    273.21             275 274.105 272.605 275.605 0.0003076106
## 12    298.41             300 299.205 297.705 300.705 0.0004244653
##    suspicious_promo
## 1                 2
## 2                 3
## 3                 2
## 4                 2
## 5                 1
## 6                 1
## 7                 1
## 8                 1
## 9                 1
## 10                1
## 11                1
## 12                1

4.2 Mutation of the “suspicious” promotion variable in PDF_indiv_reconciliation_non

Add a new column to PDF_indiv_reconciliation_non that is also called suspicious_promo. For each row, if discrep_summary_non falls within any of the intervals defined by lowerCI and upperCI in interval_df, suspicious_promo is labeled 1, 2, or 3 accordingly; if it does not fall within any interval, it is labeled 0. To do so, we define a function called find_interval that returns the suspicious_promo value (1, 2, or 3) of the interval a given value falls in, or 0 if it falls in none. We then apply this function to each discrep_summary_non value to create the suspicious_promo column in PDF_indiv_reconciliation_non.

find_interval <- function(x, intervals) {
  # return the suspicious_promo level of the first interval containing x
  for(i in seq_len(nrow(intervals))) {
    if(x >= intervals$lowerCI[i] & x <= intervals$upperCI[i]) {
      return(intervals$suspicious_promo[i])
    }
  }
  return(0)
}

PDF_indiv_reconciliation_non <- 
  PDF_indiv_reconciliation_non |> 
  mutate(suspicious_promo = 
           sapply(discrep_summary_non, find_interval, intervals = interval_df))

head(PDF_indiv_reconciliation_non[, c(1:3, 14:15)], 10)
##    user_id project_id created_at discrep_summary_non suspicious_promo
## 1   100002        676 2022-04-23              125.00                1
## 2   100005        692 2023-01-09              100.00                2
## 3   100012        702 2022-04-23              120.00                0
## 4   100017        713 2022-11-03              148.17                1
## 5   100019        698 2022-07-30             1660.21                0
## 6   100027        676 2022-04-29               75.00                2
## 7   100035        676 2022-04-23               25.00                2
## 8   100043        692 2022-04-23               30.00                0
## 9   100047        676 2022-05-01               50.00                3
## 10  100059        676 2023-01-05                0.68                0
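Since the intervals are disjoint, the same labeling could also be computed without an explicit loop. The following vectorized sketch (an alternative, not the method used above) builds an indicator matrix and recovers each label with a matrix product; suspicious_promo_vec is our own name:

x <- PDF_indiv_reconciliation_non$discrep_summary_non

# in_int[i, j] is TRUE exactly when observation i falls inside interval j;
# disjoint intervals guarantee at most one TRUE per row, so the matrix
# product returns that interval's suspicious_promo (or 0 outside all intervals)
in_int <- outer(x, interval_df$lowerCI, `>=`) & outer(x, interval_df$upperCI, `<=`)

suspicious_promo_vec <- as.numeric(in_int %*% interval_df$suspicious_promo)

# all(suspicious_promo_vec == PDF_indiv_reconciliation_non$suspicious_promo)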

5 Refinement on cumulative discrepancy amount data frame

Our previous analysis in Reconciliation Analysis I [1] constructed discrep_df with tolerance adjustments on PDF_indiv_reconciliation_non_tol. However, the tolerance adjustment does not appear very effective, as it removes only 49 observations. Hence, a new version of discrep_df is presented below using PDF_indiv_reconciliation_non, and we will refine this data frame with a closer look. Notice that the frequency and cumulative_frequency columns are renamed to observation_count and cumulative_observation_count to avoid misunderstanding, and that each label such as <= $100 denotes the upper bound of its bin (here, amounts in (50, 100]).

new_break_points <- c(-Inf, 50, 100, 200, 300, 500, 1000, 10000, 50000, Inf)

new_labels <- c("<= $50", "<= $100", "<= $200", "<= $300", "<= $500", 
                "<= $1000", "<= $10000", "<= $50000", "> $50000")

PDF_indiv_reconciliation_non$discrep_category <- 
  cut(PDF_indiv_reconciliation_non$discrep_summary_non, 
      breaks = new_break_points, labels = new_labels, include.lowest = TRUE)

new_discrep_counts <- table(PDF_indiv_reconciliation_non$discrep_category)

new_discrep_cumulative_counts <- cumsum(new_discrep_counts)

new_discrep_counts_df <- as.data.frame(new_discrep_counts)

new_discrep_cumulative_counts_df <- as.data.frame(new_discrep_cumulative_counts)

colnames(new_discrep_counts_df) <- 
  c("discrepancy_amount", "frequency")

new_discrep_cumulative_counts_df <- 
  cbind(discrepancy_amount = rownames(new_discrep_cumulative_counts_df), 
        new_discrep_cumulative_counts_df)

rownames(new_discrep_cumulative_counts_df) <- NULL

colnames(new_discrep_cumulative_counts_df) <- 
  c("discrepancy_amount", "cumulative_frequency") 

new_discrep_df <- full_join(new_discrep_counts_df, new_discrep_cumulative_counts_df, 
                    by = "discrepancy_amount")

names(new_discrep_df) <- c("discrepancy_amount", 
                           "observation_count", "cumulative_observation_count")

new_discrep_df
##   discrepancy_amount observation_count cumulative_observation_count
## 1             <= $50             26718                        26718
## 2            <= $100             11491                        38209
## 3            <= $200              4830                        43039
## 4            <= $300              1612                        44651
## 5            <= $500              1103                        45754
## 6           <= $1000               646                        46400
## 7          <= $10000               294                        46694
## 8          <= $50000                 6                        46700
## 9           > $50000                 0                        46700

5.1 Mutation of new variables to reflect unique counts of customers and projects

In contrast to simply counting the observations of each user_id under a unique project_id, we can also count the number of unique user_ids and project_ids within each discrepancy_amount range. Their cumulative counts are supplied below as well.

user_counts <- sapply(split(PDF_indiv_reconciliation_non, 
                            PDF_indiv_reconciliation_non$discrep_category), 
                      function(x) length(unique(x$user_id)))

cumulative_user_counts <- cumsum(user_counts)

user_counts_df <- as.data.frame(user_counts)

user_counts_df <- 
  cbind(discrepancy_amount = rownames(user_counts_df), 
        user_counts_df)

rownames(user_counts_df) <- NULL

colnames(user_counts_df) <- 
  c("discrepancy_amount", "user_count") 

cumulative_user_counts_df <- as.data.frame(cumulative_user_counts)

cumulative_user_counts_df <- 
  cbind(discrepancy_amount = rownames(cumulative_user_counts_df), 
        cumulative_user_counts_df)

rownames(cumulative_user_counts_df) <- NULL

colnames(cumulative_user_counts_df) <- 
  c("discrepancy_amount", "cumulative_user_count") 

user_df <- full_join(user_counts_df, cumulative_user_counts_df, by = "discrepancy_amount")

project_counts <- sapply(split(PDF_indiv_reconciliation_non, 
                               PDF_indiv_reconciliation_non$discrep_category), 
                         function(x) length(unique(x$project_id)))

cumulative_project_counts <- cumsum(project_counts)

project_counts_df <- as.data.frame(project_counts)

project_counts_df <- 
  cbind(discrepancy_amount = rownames(project_counts_df), 
        project_counts_df)

rownames(project_counts_df) <- NULL

colnames(project_counts_df) <- 
  c("discrepancy_amount", "project_count") 

cumulative_project_counts_df <- as.data.frame(cumulative_project_counts)

cumulative_project_counts_df <- 
  cbind(discrepancy_amount = rownames(cumulative_project_counts_df), 
        cumulative_project_counts_df)

rownames(cumulative_project_counts_df) <- NULL

colnames(cumulative_project_counts_df) <- 
  c("discrepancy_amount", "cumulative_project_count") 

project_df <- full_join(project_counts_df, cumulative_project_counts_df, by = "discrepancy_amount")

new_discrep_df <- full_join(new_discrep_df, user_df, by = "discrepancy_amount")

new_discrep_df <- full_join(new_discrep_df, project_df, by = "discrepancy_amount")

new_discrep_df
##   discrepancy_amount observation_count cumulative_observation_count user_count
## 1             <= $50             26718                        26718      25404
## 2            <= $100             11491                        38209      11243
## 3            <= $200              4830                        43039       4680
## 4            <= $300              1612                        44651       1570
## 5            <= $500              1103                        45754       1084
## 6           <= $1000               646                        46400        627
## 7          <= $10000               294                        46694        279
## 8          <= $50000                 6                        46700          5
## 9           > $50000                 0                        46700          0
##   cumulative_user_count project_count cumulative_project_count
## 1                 25404           143                      143
## 2                 36647           138                      281
## 3                 41327           135                      416
## 4                 42897           104                      520
## 5                 43981            89                      609
## 6                 44608            74                      683
## 7                 44887            37                      720
## 8                 44892             4                      724
## 9                 44892             0                      724
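For reference, the step-by-step construction above could be condensed with dplyr into a single pipeline; the sketch below should reproduce the same counts (note .drop = FALSE, which keeps the empty > $50000 bin, and new_discrep_df_alt is our own name):

new_discrep_df_alt <- PDF_indiv_reconciliation_non |> 
  group_by(discrepancy_amount = discrep_category, .drop = FALSE) |> 
  summarise(observation_count = n(),
            user_count = n_distinct(user_id),
            project_count = n_distinct(project_id),
            .groups = "drop") |> 
  mutate(cumulative_observation_count = cumsum(observation_count),
         cumulative_user_count = cumsum(user_count),
         cumulative_project_count = cumsum(project_count))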

5.2 Visualizing the trend (non-elaborated)

We can also visualize the table above as the following trend plot. Since the number of discrepancy_amount intervals is relatively small, the trend plot is “non-elaborated”; we could elaborate it with the full PDF_indiv_reconciliation_non in future studies.

new_discrep_df$discrepancy_amount <- 
  factor(new_discrep_df$discrepancy_amount, levels = new_discrep_df$discrepancy_amount)

new_discrep_df_long <- 
  reshape2::melt(new_discrep_df, id.vars = "discrepancy_amount",
                 measure.vars = c("cumulative_observation_count", 
                                  "cumulative_user_count", 
                                  "cumulative_project_count"), 
                 variable.name = "type", value.name = "cumulative_count")

new_discrep_df_long
##    discrepancy_amount                         type cumulative_count
## 1              <= $50 cumulative_observation_count            26718
## 2             <= $100 cumulative_observation_count            38209
## 3             <= $200 cumulative_observation_count            43039
## 4             <= $300 cumulative_observation_count            44651
## 5             <= $500 cumulative_observation_count            45754
## 6            <= $1000 cumulative_observation_count            46400
## 7           <= $10000 cumulative_observation_count            46694
## 8           <= $50000 cumulative_observation_count            46700
## 9            > $50000 cumulative_observation_count            46700
## 10             <= $50        cumulative_user_count            25404
## 11            <= $100        cumulative_user_count            36647
## 12            <= $200        cumulative_user_count            41327
## 13            <= $300        cumulative_user_count            42897
## 14            <= $500        cumulative_user_count            43981
## 15           <= $1000        cumulative_user_count            44608
## 16          <= $10000        cumulative_user_count            44887
## 17          <= $50000        cumulative_user_count            44892
## 18           > $50000        cumulative_user_count            44892
## 19             <= $50     cumulative_project_count              143
## 20            <= $100     cumulative_project_count              281
## 21            <= $200     cumulative_project_count              416
## 22            <= $300     cumulative_project_count              520
## 23            <= $500     cumulative_project_count              609
## 24           <= $1000     cumulative_project_count              683
## 25          <= $10000     cumulative_project_count              720
## 26          <= $50000     cumulative_project_count              724
## 27           > $50000     cumulative_project_count              724
ggplot(new_discrep_df_long, 
       aes(x = discrepancy_amount, y = cumulative_count, color = type)) +
  geom_line(aes(group = type), linewidth = 1.2) + 
  labs(x = "discrepancy_amount", y = "cumulative_count", color = "type") +
  theme_minimal()

We can roughly see from the plot that the gap between cumulative_observation_count and cumulative_user_count stays approximately constant for discrepancy_amount > $200. This indicates that customers with discrepancy_amount > $200 typically do not purchase from more than one project, suggesting that these high discrepancy amounts come from sporadic customers in sporadic projects.
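This reading can be checked numerically; a quick sketch:

# gap between observations and unique users per cumulative bin; a near-constant
# gap beyond the <= $200 row supports the single-project interpretation
with(new_discrep_df, 
     data.frame(discrepancy_amount, 
                gap = cumulative_observation_count - cumulative_user_count))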

5.3 Storing the list of user_ids with a large discrepancy amount

We are especially interested in knowing the user_ids of those customers whose discrepancy_amount > $1000. Unfortunately, there are more of them than we would expect: 279 + 5 = 284 user_ids satisfy this restriction, according to the table above. We store them in the following vector, user_id_large_discrep.

user_id_large_discrep <- 
  unique(PDF_indiv_reconciliation_non$user_id
         [PDF_indiv_reconciliation_non$discrep_summary_non > 1000])

length(user_id_large_discrep)
## [1] 284

6 Emblematic non-reconcilable dynamics

Our previous study in Reconciliation Analysis I [1] showed that the 8 emblematic non-reconcilable dynamics are not what we want to extract. Instead, we become interested in customers whose continuous_period, defined before as the consecutive period during which a single customer has made actions (either purchasing or redeeming), is between 3 and 8. There are undoubtedly many such examples, so we will stochastically pick a few representatives for each value from 3 to 8. The following code chunk is adapted from Reconciliation Analysis I [1].

user_counts <- table(PDF_indiv_reconciliation_non_dynamic$user_id)

users_to_keep <- names(user_counts) # no threshold applied here, so all users are kept

PDF_indiv_reconciliation_non_dynamic_fil <- 
  PDF_indiv_reconciliation_non_dynamic |> filter(user_id %in% users_to_keep)

count_tbl <- 
  PDF_indiv_reconciliation_non_dynamic_fil |>
  group_by(user_id, project_id) |> summarise(continuous_period = n())
## `summarise()` has grouped output by 'user_id'. You can override using the
## `.groups` argument.
head(count_tbl, 10)
## # A tibble: 10 × 3
## # Groups:   user_id [10]
##    user_id project_id continuous_period
##      <int>      <int>             <int>
##  1   10048        200                 3
##  2   10050        200                 1
##  3   10074        185                18
##  4   10084        184                10
##  5   10085        184                 2
##  6   10105        184                10
##  7   10108        184                 1
##  8   10117        200                 1
##  9   10119        208                 1
## 10   10137        184                11

6.1 Overview of observation and unique user counts under each condition

Now let’s count how many observations and how many unique user_ids there are under each condition, for continuous_period ranging from 3 to 8. The two columns are stored as L_observation and L_user_id.

CP_list <- lapply(3:8, function(k) filter(count_tbl, continuous_period == k))

# number of (user_id, project_id) observations per condition
L_observation <- sapply(CP_list, nrow)

# number of unique user_ids per condition
L_user_id <- sapply(CP_list, function(cp) length(unique(cp$user_id)))

continuous_period <- 3:8

count3to8 <- data.frame(continuous_period, L_observation, L_user_id)

count3to8
##   continuous_period L_observation L_user_id
## 1                 3          2999      2963
## 2                 4          1787      1772
## 3                 5           893       887
## 4                 6           565       560
## 5                 7           408       405
## 6                 8           317       315

6.2 Counting the duplicated user_ids under each condition

We’ve noticed that L_observation \(\neq\) L_user_id for every continuous_period, which indicates that there must be some duplicated user_ids within each continuous_period. Let’s count them and mutate several columns onto the count3to8 data frame above. This will serve as a standard for us to “pseudo-stochastically” pick the representatives.

result_list <- list()

for(i in seq_along(CP_list)) {
  
  duplicate_count <- CP_list[[i]] |> count(user_id) |> group_by(n) |>
    summarise(count = n(), .groups = "drop") |>
    spread(key = n, value = count, fill = 0)
  
  names(duplicate_count) <- paste0("duplicate_", names(duplicate_count))
  
  result_list[[i]] <- duplicate_count
  
}

result_list
## [[1]]
## # A tibble: 1 × 6
##   duplicate_1 duplicate_2 duplicate_3 duplicate_4 duplicate_5 duplicate_8
##         <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
## 1        2940          18           2           1           1           1
## 
## [[2]]
## # A tibble: 1 × 4
##   duplicate_1 duplicate_2 duplicate_4 duplicate_6
##         <dbl>       <dbl>       <dbl>       <dbl>
## 1        1763           7           1           1
## 
## [[3]]
## # A tibble: 1 × 3
##   duplicate_1 duplicate_2 duplicate_3
##         <dbl>       <dbl>       <dbl>
## 1         882           4           1
## 
## [[4]]
## # A tibble: 1 × 2
##   duplicate_1 duplicate_2
##         <dbl>       <dbl>
## 1         555           5
## 
## [[5]]
## # A tibble: 1 × 3
##   duplicate_1 duplicate_2 duplicate_3
##         <dbl>       <dbl>       <dbl>
## 1         403           1           1
## 
## [[6]]
## # A tibble: 1 × 2
##   duplicate_1 duplicate_2
##         <dbl>       <dbl>
## 1         313           2
duplicated_cols <- c("duplicate_1", "duplicate_2", "duplicate_3", "duplicate_4", 
                     "duplicate_5", "duplicate_6", "duplicate_8")

for (name in duplicated_cols) {
  
  count3to8[[name]] <- 0
  
}

for (i in 1:nrow(count3to8)) {

  result <- result_list[[i]]
  
  for (name in duplicated_cols) {

    if (name %in% names(result)) {

      count3to8[i, name] <- result[1, name]
      
    }
    
  }
  
}

count3to8
##   continuous_period L_observation L_user_id duplicate_1 duplicate_2 duplicate_3
## 1                 3          2999      2963        2940          18           2
## 2                 4          1787      1772        1763           7           0
## 3                 5           893       887         882           4           1
## 4                 6           565       560         555           5           0
## 5                 7           408       405         403           1           1
## 6                 8           317       315         313           2           0
##   duplicate_4 duplicate_5 duplicate_6 duplicate_8
## 1           1           1           0           1
## 2           1           0           1           0
## 3           0           0           0           0
## 4           0           0           0           0
## 5           0           0           0           0
## 6           0           0           0           0
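As a cross-check, the duplicate_* tallies of any row can be reproduced with nested table() calls in base R; for continuous_period == 3, for instance:

# inner table(): observations per user_id; outer table(): how many user_ids
# share each multiplicity -- compare with the first row of count3to8
table(table(CP_list[[1]]$user_id))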

7 Pseudo-stochastic selection of user_ids under each condition

We can then stochastically select user_ids based on the table above. For example, for continuous_period == 3 (the first row), six duplicate_* columns (duplicate_1, duplicate_2, duplicate_3, duplicate_4, duplicate_5, and duplicate_8) are nonzero, so we randomly draw one user_id for each nonzero column. The same rule applies to every row, so the number of draws varies by row. The eventual result, stoch_list_user_id, is a nested list of length 6 (one vector per continuous_period).

# note: no seed is set, so the draws below change between runs
stoch_list_user_id <- list()

for (name in grep("duplicate_", names(count3to8), value = TRUE)) {

  for (i in 1:nrow(count3to8)) {

    if (count3to8[i, name] > 0) {
  
      # one draw from the full user_id pool of this continuous_period bucket
      stoch_list_user_id[[paste0("CP", i + 2, "_user_id")]] <- 
        c(stoch_list_user_id[[paste0("CP", i + 2, "_user_id")]], 
          sample(CP_list[[i]]$user_id, 1))
    }
    
  }
  
}

stoch_list_user_id
## $CP3_user_id
## [1] 144393 180636 184746  60935  35763 134973
## 
## $CP4_user_id
## [1] 131511 193410 119306 134294
## 
## $CP5_user_id
## [1] 110945  84443 119249
## 
## $CP6_user_id
## [1] 53879 53879
## 
## $CP7_user_id
## [1]  60162  58012 140802
## 
## $CP8_user_id
## [1] 70285 34775
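Since no seed was set, the draws above are not reproducible, and the same user_id can even be drawn twice for one row (as happened for CP6_user_id). A reproducible variant without repeats might look like the following sketch; the seed value is arbitrary and stoch_list_user_id2 is our own name:

set.seed(2023)  # arbitrary seed, for reproducibility only

stoch_list_user_id2 <- list()

for (i in 1:nrow(count3to8)) {
  
  # one draw per nonzero duplicate_* column, taken without replacement
  n_draws <- sum(count3to8[i, grep("duplicate_", names(count3to8))] > 0)
  
  stoch_list_user_id2[[paste0("CP", i + 2, "_user_id")]] <- 
    sample(unique(CP_list[[i]]$user_id), n_draws)
  
}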

8 Reference

[1] Z. Jiang, “RPubs - Reconciliation Analysis I - Modeling and Extracting Emblematic Non-reconcilable Dynamics,” rpubs.com, Jul. 13, 2023. https://rpubs.com/jiangzm/1063534 (accessed Jul. 20, 2023).