We first restore the necessary data frames from Reconciliation Analysis I [1] to continue the study. The invoked packages are hidden.
PDF_daily_impute_char_cumulated <-
read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Reconciliation Analysis I Data/PDF_daily_impute_char_cumulated.csv")
PDF_indiv_reconciliation_non <-
read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Reconciliation Analysis I Data/PDF_indiv_reconciliation_non.csv")
PDF_indiv_reconciliation_non_dynamic <-
read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Reconciliation Analysis I Data/PDF_indiv_reconciliation_non_dynamic.csv")
df_redemptions <-
read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Converted Raw Data/df_redemptions.csv")
df_revenue_view <-
read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Converted Raw Data/df_revenue_view.csv")
df_users_thru_jan23 <-
read.csv("/Users/apple/Quantitative Marketing Research/Reconciliation Analysis I/Converted Raw Data/df_users_thru_jan23.csv")
In Reconciliation Analysis I [1], we displayed the distribution
of the discrepancy amount in histograms. We found that many
observations cluster at small discrepancy amounts, which can
reasonably be read as those accounts having been offered a “suspicious” promotion
by the project or the application. To further help us identify these
local peaks, we generate density plots of
discrep_summary_non below, arranged with cowplot::plot_grid,
at the full discrepancy-amount scale and at the 5K, 1K, and 0.3K scales, respectively.
densplot1 <-
ggplot(PDF_indiv_reconciliation_non, aes(x = discrep_summary_non)) +
geom_density(fill = "#FF6666", alpha = 0.5) +
labs(x = "discrepancy amount ($)", y = "density") +
theme_minimal()
densplot2 <- densplot1 + xlim(c(0, 5000))
densplot3 <- densplot1 + xlim(c(0, 1000))
densplot4 <- densplot1 + xlim(c(0, 300))
plot_grid(densplot1, densplot2, densplot3, densplot4, labels = "AUTO")
## Warning: Removed 18 rows containing non-finite values (`stat_density()`).
## Warning: Removed 300 rows containing non-finite values (`stat_density()`).
## Warning: Removed 2049 rows containing non-finite values (`stat_density()`).
It appears that panel \(D\) above
has several outstanding density peaks. A local peak occurs where the
first difference of the density changes sign from positive to negative,
so we detect peaks by checking the differenced signs for a sign change.
In R, this discrete test is the diff(sign(diff()))
idiom. We will annotate the discrepancy amount, rounded
to 2 decimals, right above each local peak on the diagram below.
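To see how this test behaves, here is a minimal sketch on a toy vector (values are hypothetical). Note that which() on the double-differenced vector is offset by one position; the +1 below corrects it, and on the fine 512-point grid that density() returns the offset shifts a reported peak by only a single grid step.
y <- c(1, 3, 5, 4, 2, 3, 6, 5)      # toy series with local peaks at positions 3 and 7
diff(y)                             # first differences: 2 2 -1 -2 1 3 -1
sign(diff(y))                       # 1 1 -1 -1 1 1 -1
diff(sign(diff(y)))                 # 0 -2 0 2 0 -2: negative entries mark peaks
which(diff(sign(diff(y))) < 0) + 1  # 3 7, i.e. y[3] = 5 and y[7] = 6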
PDF_indiv_reconciliation_non_seg1 <-
PDF_indiv_reconciliation_non[PDF_indiv_reconciliation_non$discrep_summary_non <= 300, ]
# kernel density estimate on the <= $300 segment
density_non_seg1 <- density(PDF_indiv_reconciliation_non_seg1$discrep_summary_non)
# coordinates where the first difference of the density flips from + to -
peaks_x_seg1 <- density_non_seg1$x[which(diff(sign(diff(density_non_seg1$y))) < 0)]
peaks_y_seg1 <- density_non_seg1$y[which(diff(sign(diff(density_non_seg1$y))) < 0)]
peaks_pdf_seg1 <- data.frame(x = peaks_x_seg1, y = peaks_y_seg1)
ggplot(PDF_indiv_reconciliation_non_seg1, aes(x = discrep_summary_non)) +
geom_density(fill = "#FF6666", alpha = 0.5) +
geom_text(data = peaks_pdf_seg1, aes(x = x, y = y, label = round(x, 2)),
vjust = -1, color = "black", size = 3) +
xlim(c(0, 300)) +
labs(x = "Discrepancy Amount ($)", y = "Density",
title = "Density Plot of Discrepancy Amount (<= $300)") +
theme_minimal()
The result largely confirms our earlier conjecture. These peaks are striking because of a counting pattern: each sits slightly below its nearest multiple of 25 (i.e. 24.32 < 25, 48.89 < 50, 74.1 < 75, etc.). Two explanations are plausible. First, the firm may have entered inaccurate numbers by dropping decimals. Second, and in my view more likely, customers may pay extra taxes when they redeem, shrinking their discrepancy amount a little (i.e. suppose you pay 200 dollars to purchase a credit of 250 dollars; the discrepancy should ideally be 50, but taxes may drag the recorded value down to, say, 48.89). Whatever the cause, we can label these peaks’ nearest integer neighbors (multiples of 25) as “suspicious” promoted accounts under certain projects.
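As a back-of-the-envelope illustration of the second explanation (the tax rate below is purely hypothetical):
purchase <- 200; credit <- 250
credit - purchase                   # ideal discrepancy: 50
tax_rate <- 0.0055                  # hypothetical rate, chosen for illustration only
credit - purchase * (1 + tax_rate)  # 48.9, close to the observed 48.89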
Instead of simply attaching a binary “yes” or “no” label to each
potential observation (per user_id per
project_id), the degree of peak density
may also serve as a standard for determining the extent of being
promoted. What we do here is manually create “confidence
intervals” around each of the rounded decimals labeled in the plot above.
We do not strictly follow the statistical definition of a confidence interval,
though: a literal interval would be too narrow (we could adjust the
\(z\) score to allow a wider cover, but for simplicity
we directly set the minimum width to 3, for example). Each interval should
cover both the peak indicated above and the corresponding multiple of 25.
Hence, we implement the algorithm below to
compute the midpoint between the two vectors. From each midpoint,
the upper and lower “CI” bounds are then easily chosen to cover
both the listed peak and the multiple of 25.
peak_list <- c(24.32, 48.89, 74.1, 99.3, 123.25, 149.08,
174.28, 198.86, 222.8, 249.27, 273.21, 298.41)
multiples_of_25 <- c(25, 50, 75, 100, 125, 150,
175, 200, 225, 250, 275, 300)
midpoints <- (peak_list + multiples_of_25) / 2
min_interval_width <- 3 # may vary based on personal judgment
interval_list <- list()
for(i in seq_along(midpoints)) {
dist_to_peak <- abs(midpoints[i] - peak_list[i])
dist_to_multiple <- abs(midpoints[i] - multiples_of_25[i])
half_width <- max(dist_to_peak, dist_to_multiple, min_interval_width / 2)
lower_bound <- midpoints[i] - half_width
upper_bound <- midpoints[i] + half_width
interval_list[[i]] <-
c("mid" = midpoints[i], "lowerCI" = lower_bound, "upperCI" = upper_bound)
}
interval_df <- do.call(rbind, interval_list)
interval_df <- as.data.frame(interval_df)
interval_df$peak_list <- peak_list
interval_df$multiples_of_25 <- multiples_of_25
interval_df <- interval_df[, c("peak_list", "multiples_of_25",
"mid", "lowerCI", "upperCI")]
interval_df
## peak_list multiples_of_25 mid lowerCI upperCI
## 1 24.32 25 24.660 23.160 26.160
## 2 48.89 50 49.445 47.945 50.945
## 3 74.10 75 74.550 73.050 76.050
## 4 99.30 100 99.650 98.150 101.150
## 5 123.25 125 124.125 122.625 125.625
## 6 149.08 150 149.540 148.040 151.040
## 7 174.28 175 174.640 173.140 176.140
## 8 198.86 200 199.430 197.930 200.930
## 9 222.80 225 223.900 222.400 225.400
## 10 249.27 250 249.635 248.135 251.135
## 11 273.21 275 274.105 272.605 275.605
## 12 298.41 300 299.205 297.705 300.705
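For reference, the loop above can also be vectorized; pmax takes the element-wise maximum, so the same bounds come out in a few lines:
# vectorized alternative to the interval loop above (same bounds)
half_width <- pmax(abs(midpoints - peak_list),
                   abs(midpoints - multiples_of_25),
                   min_interval_width / 2)
interval_df_alt <- data.frame(peak_list, multiples_of_25, mid = midpoints,
                              lowerCI = midpoints - half_width,
                              upperCI = midpoints + half_width)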
Then, we subjectively rate the likelihood of a promotion being
offered within each interval of the discrepancy amount. The factor
levels we will include in the mutated column
suspicious_promo are 0 (very unlikely
to have been offered a promotion), 1 (unlikely
to have been offered a promotion), 2 (likely to
have been offered a promotion), and 3 (very
likely to have been offered a promotion). The judgment here is not
objective but based on our intuitive classification. Since the peak
around 48.89 is so striking, I would classify it as 3. The
peaks around 24.32, 74.1, and 99.3 are also relatively outstanding, so I
would classify them as the group of 2. The remaining peaks on
the plot above are less obvious, though they still reveal the pattern
we discovered above; under this consideration, it is reasonable
to label them 1. Any other amount not within these
ranges is labeled 0.
density_estimate <-
density(PDF_indiv_reconciliation_non_seg1$discrep_summary_non)
peak_density <-
approx(x = density_estimate$x, y = density_estimate$y, xout = peak_list)$y
interval_df$peak_density <- peak_density
compute_density_values <- function(lower, upper) {
x_values <- seq(lower, upper, length.out = 100)
y_values <- approx(x = density_estimate$x, y = density_estimate$y, xout = x_values)$y
data.frame(x = x_values, y = y_values)
}
interval_densities <- lapply(1:nrow(interval_df), function(i) {
compute_density_values(interval_df$lowerCI[i], interval_df$upperCI[i])
})
interval_densities_df <- do.call(rbind, interval_densities)
interval_densities_df$group <- rep(1:nrow(interval_df), each = 100)
ggplot(PDF_indiv_reconciliation_non_seg1, aes(x = discrep_summary_non)) +
geom_density(fill = "#FF6666", alpha = 0.5) +
geom_text(data = peaks_pdf_seg1, aes(x = x, y = y, label = round(x, 2)),
vjust = -1, color = "black", size = 3) +
geom_segment(data = interval_df,
aes(x = lowerCI, xend = lowerCI, y = 0, yend = peak_density),
color = "black", linetype = "dashed", linewidth = 0.7) +
geom_segment(data = interval_df,
aes(x = upperCI, xend = upperCI, y = 0, yend = peak_density),
color = "black", linetype = "dashed", linewidth = 0.7) +
geom_ribbon(data = interval_densities_df, aes(x = x, ymin = 0, ymax = y, fill = factor(group)),
alpha = 0.9) +
xlim(c(0, 305)) +
labs(x = "Discrepancy Amount ($)", y = "Density",
title = "Density Plot of Discrepancy Amount (<= $305)",
fill = "Group") +
scale_fill_manual(values = c("1" = "orchid", "2" = "darkorchid4", "3" = "orchid", "4" = "orchid",
"5" = "plum", "6" = "plum", "7" = "plum", "8" = "plum",
"9" = "plum", "10" = "plum", "11" = "plum", "12" = "plum"),
labels = c("1" = "[23.160, 26.160]", "2" = "[47.945, 50.945]", "3" = "[73.050, 76.050]",
"4" = "[98.150, 101.150]", "5" = "[122.625, 125.625]", "6" = "[148.040, 151.040]",
"7" = "[173.140, 176.140]", "8" = "[197.930, 200.930]", "9" = "[222.400, 225.400]",
"10" = "[248.135, 251.135]", "11" = "[272.605, 275.605]", "12" = "[297.705, 300.705]")) + theme_minimal()
The visualization above strictly applies this categorization:
3 takes the color darkorchid4, 2
takes orchid, 1 takes
plum, and 0 takes
#FF6666, the background fill of the
density plot. As a mnemonic, the darker the color, the higher the
probability of being offered promotions.
With the theoretical foundations built above, we can implement the following two tasks.
interval_df
Add a new column to interval_df called
suspicious_promo and assign it the factor
3 for the second row, 2 for the first, third, and
fourth rows, and 1 for the rest of the rows.
interval_df <- interval_df |>
mutate(suspicious_promo = case_when(row_number() == 2 ~ 3,
row_number() %in% c(1, 3:4) ~ 2,
TRUE ~ 1))
interval_df
## peak_list multiples_of_25 mid lowerCI upperCI peak_density
## 1 24.32 25 24.660 23.160 26.160 0.0204172075
## 2 48.89 50 49.445 47.945 50.945 0.0294787060
## 3 74.10 75 74.550 73.050 76.050 0.0125878868
## 4 99.30 100 99.650 98.150 101.150 0.0061677817
## 5 123.25 125 124.125 122.625 125.625 0.0018325601
## 6 149.08 150 149.540 148.040 151.040 0.0013604500
## 7 174.28 175 174.640 173.140 176.140 0.0006942789
## 8 198.86 200 199.430 197.930 200.930 0.0010070916
## 9 222.80 225 223.900 222.400 225.400 0.0004112768
## 10 249.27 250 249.635 248.135 251.135 0.0006718587
## 11 273.21 275 274.105 272.605 275.605 0.0003076106
## 12 298.41 300 299.205 297.705 300.705 0.0004244653
## suspicious_promo
## 1 2
## 2 3
## 3 2
## 4 2
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## 11 1
## 12 1
PDF_indiv_reconciliation_non
Add a new column to PDF_indiv_reconciliation_non that is
also called suspicious_promo. For each row, if
discrep_summary_non falls within any of the intervals
defined by lowerCI and upperCI in
interval_df, suspicious_promo should be
labeled 1, 2, or 3
accordingly; if discrep_summary_non falls within
no interval, it should be labeled 0. To do this, we
define a function called find_interval that returns the
suspicious_promo value (1,
2, or 3) of the interval a given value falls in, or
0 if it falls in none. We then apply this
function to each discrep_summary_non value to create the
suspicious_promo column in
PDF_indiv_reconciliation_non.
find_interval <- function(x, intervals) {
# scan the intervals top to bottom; return the first matching label
for(i in seq_len(nrow(intervals))) {
if(x >= intervals$lowerCI[i] & x <= intervals$upperCI[i]) {
return(intervals$suspicious_promo[i])
}
}
return(0)  # x falls within no interval
}
PDF_indiv_reconciliation_non <-
PDF_indiv_reconciliation_non |>
mutate(suspicious_promo =
sapply(discrep_summary_non, find_interval, intervals = interval_df))
head(PDF_indiv_reconciliation_non[, c(1:3, 14:15)], 10)
## user_id project_id created_at discrep_summary_non suspicious_promo
## 1 100002 676 2022-04-23 125.00 1
## 2 100005 692 2023-01-09 100.00 2
## 3 100012 702 2022-04-23 120.00 0
## 4 100017 713 2022-11-03 148.17 1
## 5 100019 698 2022-07-30 1660.21 0
## 6 100027 676 2022-04-29 75.00 2
## 7 100035 676 2022-04-23 25.00 2
## 8 100043 692 2022-04-23 30.00 0
## 9 100047 676 2022-05-01 50.00 3
## 10 100059 676 2023-01-05 0.68 0
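A few spot checks against interval_df and the rows printed above confirm the labeling (inputs chosen for illustration):
find_interval(50.00, interval_df)  # inside [47.945, 50.945] -> 3
find_interval(25.00, interval_df)  # inside [23.160, 26.160] -> 2
find_interval(30.00, interval_df)  # within no interval -> 0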
Our previous analysis in Reconciliation Analysis I [1] constructed
discrep_df from the tolerance-adjusted
PDF_indiv_reconciliation_non_tol. However, the tolerance
adjustment does not seem very effective, as it removes only 49
observations. Hence, a new version of discrep_df is
presented below, built from
PDF_indiv_reconciliation_non, and we will refine this data
frame with a closer look. Notice that the frequency and
cumulative_frequency columns are renamed to
observation_count and cumulative_observation_count to
avoid misunderstanding.
new_break_points <- c(-Inf, 50, 100, 200, 300, 500, 1000, 10000, 50000, Inf)
new_labels <- c("<= $50", "<= $100", "<= $200", "<= $300", "<= $500",
"<= $1000", "<= $10000", "<= $50000", "> $50000")
PDF_indiv_reconciliation_non$discrep_category <-
cut(PDF_indiv_reconciliation_non$discrep_summary_non,
breaks = new_break_points, labels = new_labels, include.lowest = TRUE)
new_discrep_counts <- table(PDF_indiv_reconciliation_non$discrep_category)
new_discrep_cumulative_counts <- cumsum(new_discrep_counts)
new_discrep_counts_df <- as.data.frame(new_discrep_counts)
new_discrep_cumulative_counts_df <- as.data.frame(new_discrep_cumulative_counts)
colnames(new_discrep_counts_df) <-
c("discrepancy_amount", "frequency")
new_discrep_cumulative_counts_df <-
cbind(discrepancy_amount = rownames(new_discrep_cumulative_counts_df),
new_discrep_cumulative_counts_df)
rownames(new_discrep_cumulative_counts_df) <- NULL
colnames(new_discrep_cumulative_counts_df) <-
c("discrepancy_amount", "cumulative_frequency")
new_discrep_df <- full_join(new_discrep_counts_df, new_discrep_cumulative_counts_df,
by = "discrepancy_amount")
names(new_discrep_df) <- c("discrepancy_amount",
"observation_count", "cumulative_observation_count")
new_discrep_df
## discrepancy_amount observation_count cumulative_observation_count
## 1 <= $50 26718 26718
## 2 <= $100 11491 38209
## 3 <= $200 4830 43039
## 4 <= $300 1612 44651
## 5 <= $500 1103 45754
## 6 <= $1000 646 46400
## 7 <= $10000 294 46694
## 8 <= $50000 6 46700
## 9 > $50000 0 46700
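For reference, the frequency half of this table can be rebuilt much more compactly with dplyr (a sketch, assuming dplyr is attached; .drop = FALSE keeps the empty > $50000 level):
# compact alternative to the count-and-join construction above
new_discrep_df_alt <- PDF_indiv_reconciliation_non |>
  count(discrepancy_amount = discrep_category,
        name = "observation_count", .drop = FALSE) |>
  mutate(cumulative_observation_count = cumsum(observation_count))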
In contrast to simply counting observations (one per
user_id under each unique project_id), we can also
count the number of unique user_ids and
project_ids within each discrepancy_amount
range. Their cumulative counts are also supplied below.
user_counts <- sapply(split(PDF_indiv_reconciliation_non,
PDF_indiv_reconciliation_non$discrep_category),
function(x) length(unique(x$user_id)))
cumulative_user_counts <- cumsum(user_counts)
user_counts_df <- as.data.frame(user_counts)
user_counts_df <-
cbind(discrepancy_amount = rownames(user_counts_df),
user_counts_df)
rownames(user_counts_df) <- NULL
colnames(user_counts_df) <-
c("discrepancy_amount", "user_count")
cumulative_user_counts_df <- as.data.frame(cumulative_user_counts)
cumulative_user_counts_df <-
cbind(discrepancy_amount = rownames(cumulative_user_counts_df),
cumulative_user_counts_df)
rownames(cumulative_user_counts_df) <- NULL
colnames(cumulative_user_counts_df) <-
c("discrepancy_amount", "cumulative_user_count")
user_df <- full_join(user_counts_df, cumulative_user_counts_df, by = "discrepancy_amount")
project_counts <- sapply(split(PDF_indiv_reconciliation_non,
PDF_indiv_reconciliation_non$discrep_category),
function(x) length(unique(x$project_id)))
cumulative_project_counts <- cumsum(project_counts)
project_counts_df <- as.data.frame(project_counts)
project_counts_df <-
cbind(discrepancy_amount = rownames(project_counts_df),
project_counts_df)
rownames(project_counts_df) <- NULL
colnames(project_counts_df) <-
c("discrepancy_amount", "project_count")
cumulative_project_counts_df <- as.data.frame(cumulative_project_counts)
cumulative_project_counts_df <-
cbind(discrepancy_amount = rownames(cumulative_project_counts_df),
cumulative_project_counts_df)
rownames(cumulative_project_counts_df) <- NULL
colnames(cumulative_project_counts_df) <-
c("discrepancy_amount", "cumulative_project_count")
project_df <- full_join(project_counts_df, cumulative_project_counts_df, by = "discrepancy_amount")
new_discrep_df <- full_join(new_discrep_df, user_df, by = "discrepancy_amount")
new_discrep_df <- full_join(new_discrep_df, project_df, by = "discrepancy_amount")
new_discrep_df
## discrepancy_amount observation_count cumulative_observation_count user_count
## 1 <= $50 26718 26718 25404
## 2 <= $100 11491 38209 11243
## 3 <= $200 4830 43039 4680
## 4 <= $300 1612 44651 1570
## 5 <= $500 1103 45754 1084
## 6 <= $1000 646 46400 627
## 7 <= $10000 294 46694 279
## 8 <= $50000 6 46700 5
## 9 > $50000 0 46700 0
## cumulative_user_count project_count cumulative_project_count
## 1 25404 143 143
## 2 36647 138 281
## 3 41327 135 416
## 4 42897 104 520
## 5 43981 89 609
## 6 44608 74 683
## 7 44887 37 720
## 8 44892 4 724
## 9 44892 0 724
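For reference, the whole table can be reproduced far more compactly with a grouped summary (a sketch, assuming dplyr is attached; the column order differs from the loop-built version above):
# compact alternative: one grouped summary instead of three split/join passes
summary_alt <- PDF_indiv_reconciliation_non |>
  group_by(discrepancy_amount = discrep_category, .drop = FALSE) |>
  summarise(observation_count = n(),
            user_count = n_distinct(user_id),
            project_count = n_distinct(project_id),
            .groups = "drop") |>
  mutate(across(ends_with("_count"), cumsum, .names = "cumulative_{.col}"))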
We can also visualize the table above as the following trend plot.
Notice that the number of discrepancy_amount intervals is
relatively small, so the trend plot is coarse. We could
refine such a plot with the use of
PDF_indiv_reconciliation_non in future studies.
new_discrep_df$discrepancy_amount <-
factor(new_discrep_df$discrepancy_amount, levels = new_discrep_df$discrepancy_amount)
new_discrep_df_long <-
reshape2::melt(new_discrep_df, id.vars = "discrepancy_amount",
measure.vars = c("cumulative_observation_count",
"cumulative_user_count",
"cumulative_project_count"),
variable.name = "type", value.name = "cumulative_count")
new_discrep_df_long
## discrepancy_amount type cumulative_count
## 1 <= $50 cumulative_observation_count 26718
## 2 <= $100 cumulative_observation_count 38209
## 3 <= $200 cumulative_observation_count 43039
## 4 <= $300 cumulative_observation_count 44651
## 5 <= $500 cumulative_observation_count 45754
## 6 <= $1000 cumulative_observation_count 46400
## 7 <= $10000 cumulative_observation_count 46694
## 8 <= $50000 cumulative_observation_count 46700
## 9 > $50000 cumulative_observation_count 46700
## 10 <= $50 cumulative_user_count 25404
## 11 <= $100 cumulative_user_count 36647
## 12 <= $200 cumulative_user_count 41327
## 13 <= $300 cumulative_user_count 42897
## 14 <= $500 cumulative_user_count 43981
## 15 <= $1000 cumulative_user_count 44608
## 16 <= $10000 cumulative_user_count 44887
## 17 <= $50000 cumulative_user_count 44892
## 18 > $50000 cumulative_user_count 44892
## 19 <= $50 cumulative_project_count 143
## 20 <= $100 cumulative_project_count 281
## 21 <= $200 cumulative_project_count 416
## 22 <= $300 cumulative_project_count 520
## 23 <= $500 cumulative_project_count 609
## 24 <= $1000 cumulative_project_count 683
## 25 <= $10000 cumulative_project_count 720
## 26 <= $50000 cumulative_project_count 724
## 27 > $50000 cumulative_project_count 724
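reshape2::melt works here, but the same long format can be produced with tidyr, which the tidyverse now recommends (a sketch, assuming tidyr is installed):
# tidyr equivalent of the reshape2::melt call above
new_discrep_df_long2 <- new_discrep_df |>
  tidyr::pivot_longer(cols = c(cumulative_observation_count,
                               cumulative_user_count,
                               cumulative_project_count),
                      names_to = "type", values_to = "cumulative_count")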
ggplot(new_discrep_df_long,
aes(x = discrepancy_amount, y = cumulative_count, color = type)) +
geom_line(aes(group = type), linewidth = 1.2) +
labs(x = "discrepancy_amount", y = "cumulative_count", color = "type") +
theme_minimal()
We can see roughly from the plot that the gap between
cumulative_observation_count and
cumulative_user_count stays approximately constant for
discrepancy_amount > $200. This indicates that customers
with discrepancy_amount > $200 typically do not
purchase in more than one project, suggesting that those high
discrepancy amounts come from sporadic customers in
sporadic projects.
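As a direct check of this claim, one could compute the share of multi-project customers among the large discrepancies (a sketch, assuming dplyr is attached; it relies on each row of PDF_indiv_reconciliation_non being one user_id-project_id pair):
# share of customers above $200 who appear in more than one project
PDF_indiv_reconciliation_non |>
  filter(discrep_summary_non > 200) |>
  count(user_id, name = "n_projects") |>
  summarise(share_multi_project = mean(n_projects > 1))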
user_ids with a large discrepancy amount
We are especially interested in knowing the user_ids of
customers with discrepancy_amount > $1000. There
are, unfortunately, more than we would expect: 279 and 5
user_ids appear in the <= $10000 and <= $50000 rows of the table
above, 284 in total. We store them in the following vector,
user_id_large_discrep.
user_id_large_discrep <-
unique(PDF_indiv_reconciliation_non$user_id
[PDF_indiv_reconciliation_non$discrep_summary_non > 1000])
length(user_id_large_discrep)
## [1] 284
Our previous study in Reconciliation Analysis I [1] showed
that the 8 emblematic non-reconcilable dynamics are not what we
would like to extract. Instead, we are interested in
customers whose continuous_period, defined before as
the consecutive period during which a single customer has made
actions (either purchasing or redeeming), is between 3 and 8.
Such examples are undoubtedly numerous, so before proceeding
we will stochastically pick a few representatives for each value
from 3 to 8. The following code chunk is adapted from Reconciliation
Analysis I [1].
user_counts <- table(PDF_indiv_reconciliation_non_dynamic$user_id)
users_to_keep <- names(user_counts)  # every user_id, so the filter below keeps all rows; retained for consistency with [1]
PDF_indiv_reconciliation_non_dynamic_fil <-
PDF_indiv_reconciliation_non_dynamic |> filter(user_id %in% users_to_keep)
count_tbl <-
PDF_indiv_reconciliation_non_dynamic_fil |>
group_by(user_id, project_id) |> summarise(continuous_period = n())
## `summarise()` has grouped output by 'user_id'. You can override using the
## `.groups` argument.
head(count_tbl, 10)
## # A tibble: 10 × 3
## # Groups: user_id [10]
## user_id project_id continuous_period
## <int> <int> <int>
## 1 10048 200 3
## 2 10050 200 1
## 3 10074 185 18
## 4 10084 184 10
## 5 10085 184 2
## 6 10105 184 10
## 7 10108 184 1
## 8 10117 200 1
## 9 10119 208 1
## 10 10137 184 11
observation_count and user_count under each condition
Now let's count how many observations and how many unique
user_ids there are under each condition, ranging from 3 to 8
continuous_period. The two columns are stored as
L_observation and L_user_id.
CP3 <- filter(count_tbl, continuous_period == 3)
CP4 <- filter(count_tbl, continuous_period == 4)
CP5 <- filter(count_tbl, continuous_period == 5)
CP6 <- filter(count_tbl, continuous_period == 6)
CP7 <- filter(count_tbl, continuous_period == 7)
CP8 <- filter(count_tbl, continuous_period == 8)
CP_list <- list(CP3, CP4, CP5, CP6, CP7, CP8)
L_observation <- sapply(CP_list, nrow)                                 # observations per condition
L_user_id <- sapply(CP_list, function(cp) length(unique(cp$user_id)))  # unique user_ids per condition
continuous_period <- c(3, 4, 5, 6, 7, 8)
count3to8 <- data.frame(continuous_period, L_observation, L_user_id)
count3to8
## continuous_period L_observation L_user_id
## 1 3 2999 2963
## 2 4 1787 1772
## 3 5 893 887
## 4 6 565 560
## 5 7 408 405
## 6 8 317 315
user_ids under each conditionWe’ve noticed that for any continuous_period,
L_observation \(\neq\)
L_user_id, so this indicates that must be some duplicated
user_ids among each continuous_period. Let’s
count them and mutate several columns on our count3to8 data
frame above. This will serve us a standard for us to
“pseudo-stochastically” pick the representations.
result_list <- list()
for(i in seq_along(CP_list)) {
duplicate_count <- CP_list[[i]] |> count(user_id) |> group_by(n) |>
summarise(count = n(), .groups = "drop") |>
spread(key = n, value = count, fill = 0)
names(duplicate_count) <- paste0("duplicate_", names(duplicate_count))
result_list[[i]] <- duplicate_count
}
result_list
## [[1]]
## # A tibble: 1 × 6
## duplicate_1 duplicate_2 duplicate_3 duplicate_4 duplicate_5 duplicate_8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2940 18 2 1 1 1
##
## [[2]]
## # A tibble: 1 × 4
## duplicate_1 duplicate_2 duplicate_4 duplicate_6
## <dbl> <dbl> <dbl> <dbl>
## 1 1763 7 1 1
##
## [[3]]
## # A tibble: 1 × 3
## duplicate_1 duplicate_2 duplicate_3
## <dbl> <dbl> <dbl>
## 1 882 4 1
##
## [[4]]
## # A tibble: 1 × 2
## duplicate_1 duplicate_2
## <dbl> <dbl>
## 1 555 5
##
## [[5]]
## # A tibble: 1 × 3
## duplicate_1 duplicate_2 duplicate_3
## <dbl> <dbl> <dbl>
## 1 403 1 1
##
## [[6]]
## # A tibble: 1 × 2
## duplicate_1 duplicate_2
## <dbl> <dbl>
## 1 313 2
duplicated_cols <- c("duplicate_1", "duplicate_2", "duplicate_3", "duplicate_4",
"duplicate_5", "duplicate_6", "duplicate_8")
for (name in duplicated_cols) {
count3to8[[name]] <- 0
}
for (i in 1:nrow(count3to8)) {
result <- result_list[[i]]
for (name in duplicated_cols) {
if (name %in% names(result)) {
count3to8[i, name] <- result[1, name]
}
}
}
count3to8
## continuous_period L_observation L_user_id duplicate_1 duplicate_2 duplicate_3
## 1 3 2999 2963 2940 18 2
## 2 4 1787 1772 1763 7 0
## 3 5 893 887 882 4 1
## 4 6 565 560 555 5 0
## 5 7 408 405 403 1 1
## 6 8 317 315 313 2 0
## duplicate_4 duplicate_5 duplicate_6 duplicate_8
## 1 1 1 0 1
## 2 1 0 1 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
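The column-filling loop above can also be written with dplyr::bind_rows, which stacks the per-condition tibbles and pads missing columns with NA (a sketch; the duplicate_ column order may differ from the table above):
# compact alternative: bind the per-CP duplicate tables, then zero the gaps
dup_wide <- dplyr::bind_rows(result_list)  # missing cells become NA
dup_wide[is.na(dup_wide)] <- 0             # pad absent duplicate counts with 0
count3to8_alt <- cbind(count3to8[, 1:3], dup_wide)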
user_ids under each condition
We can then stochastically select user_ids from the
table above. For example, for continuous_period == 3
(the first row), six duplicate_ columns are nonzero
(duplicate_1, duplicate_2,
duplicate_3, duplicate_4,
duplicate_5, and duplicate_8), so we
randomly draw one user_id for each of them and compose a list.
We repeat this for every nonzero duplicate_ column in each row.
The eventual list, stoch_list_user_id, has length 6, one
element per continuous_period.
stoch_list_user_id <- list()
# note: no set.seed() is used, so the draws below are not reproducible
for (name in grep("duplicate_", names(count3to8), value = TRUE)) {
for (i in 1:nrow(count3to8)) {
if (count3to8[i, name] > 0) {
stoch_list_user_id[[paste0("CP", i + 2, "_user_id")]] <-
c(stoch_list_user_id[[paste0("CP", i + 2, "_user_id")]],
sample(CP_list[[i]]$user_id, 1))  # one draw from all user_ids in this condition
}
}
}
stoch_list_user_id
stoch_list_user_id
## $CP3_user_id
## [1] 144393 180636 184746 60935 35763 134973
##
## $CP4_user_id
## [1] 131511 193410 119306 134294
##
## $CP5_user_id
## [1] 110945 84443 119249
##
## $CP6_user_id
## [1] 53879 53879
##
## $CP7_user_id
## [1] 60162 58012 140802
##
## $CP8_user_id
## [1] 70285 34775
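Note that CP6_user_id above contains 53879 twice: each iteration samples independently from all user_ids in the condition, and no seed is set. A reproducible variant that draws only among the duplicated user_ids might look like this (a sketch; the seed and the cap of 3 representatives per condition are arbitrary choices):
set.seed(2023)  # arbitrary seed, for reproducibility only
stoch_list_user_id2 <- lapply(seq_along(CP_list), function(i) {
  dup_ids <- CP_list[[i]] |> count(user_id) |> filter(n > 1) |> pull(user_id)
  # index-based sampling avoids sample()'s scalar pitfall when length(dup_ids) == 1
  dup_ids[sample(seq_along(dup_ids), min(3, length(dup_ids)))]
})
names(stoch_list_user_id2) <- paste0("CP", 3:8, "_user_id")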
[1] Z. Jiang, “RPubs - Reconciliation Analysis I - Modeling and Extracting Emblematic Non-reconcilable Dynamics,” rpubs.com, Jul. 13, 2023. https://rpubs.com/jiangzm/1063534 (accessed Jul. 20, 2023).