Prompts / suggestions from the May 9 meeting:
Histogram: x-axis = number of transactions, y-axis = number of users with that transaction count (build a frequency table; use group_by). Also look at average spend, and at individual spend vs. number of transactions (~20/21 repeated observations? this could impact the model, e.g. an HMM).
Given a user, what is the distribution of unique project ids across customers? Summary statistics (1, 2, or 3 projects, ...); more interesting conditional on more than one purchase, since a single purchase can only come from one project.
Repeat the above separately for UIDs ending with g (restaurant) and starting with i (universal).
Reorganize individual transactions into a customer-centric view.
Given gift cards, predict what happens to each customer.
Visualize frequency and time since last purchase (aggregate, not per customer), i.e. recency/frequency (RFM); a sketch follows below.
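As a starting point for that last item, here is a minimal recency/frequency sketch. It assumes the df_revenue_view data frame built in the code below, that created_at parses with as.Date(), and an arbitrary reference date (2023-02-01, just after the data window).
library(dplyr)
# Frequency = number of transactions per user; recency = days between the
# user's last transaction and the reference date
rf_table <- df_revenue_view |>
  group_by(user_id) |>
  summarize(frequency = n(),
            last_purchase = max(as.Date(created_at)),
            .groups = "drop") |>
  mutate(recency_days = as.numeric(as.Date("2023-02-01") - last_purchase))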
Code from EDA I
library(rjson)    # fromJSON(file = ...) below uses rjson's interface
library(dplyr)
library(ggplot2)
revenue_view_2021 <-
fromJSON(file = "/Users/apple/Desktop/2023 Feb Transfer/revenue_view_2021.json")
revenue_view_2022 <-
fromJSON(file = "/Users/apple/Desktop/2023 Feb Transfer/revenue_view_2022.json")
revenue_view_2023_Jan <-
fromJSON(file = "/Users/apple/Desktop/2023 Feb Transfer/revenue_view_2023_Jan.json")
# Replace NULL fields with NA, flatten each record to a named vector, and
# bind the records column-wise before transposing into one row per record
flatten_records <- function(records) {
  records <- lapply(records, function(x) {
    x[sapply(x, is.null)] <- NA
    unlist(x)
  })
  as.data.frame(do.call("cbind", records)) |> t()
}
df_revenue_view_2021 <- flatten_records(revenue_view_2021)
df_revenue_view_2022 <- flatten_records(revenue_view_2022)
df_revenue_view_2023_Jan <- flatten_records(revenue_view_2023_Jan)
# Stack the three periods into a single data frame
df_revenue_view <-
rbind(df_revenue_view_2021, df_revenue_view_2022, df_revenue_view_2023_Jan)
df_revenue_view <- data.frame(df_revenue_view)
# Purchases per customer, and the distinct purchase counts observed
user_purchase <- group_by(df_revenue_view, user_id) |> count()
user_purchase
number_purchases <- unique(user_purchase$n) |> sort()
number_purchases
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## [20] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 40 44 45
## [39] 48 53 54 57 63 64 68 71 77 78 87 96 129 139 169 722
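The distinct-customer total quoted in the next paragraph can be verified directly (a one-line check; output omitted here):
n_distinct(df_revenue_view$user_id)  # total number of distinct customers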
We have a total of 95,418 customers, as shown by the number of distinct
user_id values. These users can be split into groups based on their
number of purchases across projects. Primarily, we are interested in
separating one-time purchasers from multiple-time purchasers, i.e. those
with \(n \geq 2\). It is interesting to note that some customers are very
loyal to the inKind app: at least one customer made 722 purchases! In the
code below, we build a histogram-style frequency table with the number of
purchases on the x-axis and the number of users with that purchase count
on the y-axis. Later, we will also study the multiple-time purchasers: on
which projects are they most likely to spend their money?
# For each distinct purchase count, tally how many customers have it
# (vectorized with sapply instead of growing a vector in a loop)
freq_vec <- sapply(number_purchases, function(i) sum(user_purchase$n == i))
freq_vec
## [1] 67294 19942 4439 1798 813 389 217 152 84 60 43 26
## [13] 26 21 16 9 3 10 5 7 4 4 8 1
## [25] 1 7 3 2 3 1 2 1 1 1 1 1
## [37] 5 2 1 1 1 1 1 1 1 1 1 1
## [49] 1 1 1 1 1 1
freq_table <- data.frame(number_purchases, freq_vec)
freq_table
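As an aside, dplyr can build the same frequency table in one step by counting the counts; a sketch (the name = argument labels the output column):
freq_table2 <- user_purchase |>
  ungroup() |>
  count(n, name = "freq_vec") |>   # count of customers at each purchase count
  rename(number_purchases = n)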
ggplot(freq_table) + geom_col(aes(x = number_purchases, y = freq_vec),
    color = "black", fill = "#FFCC99") +
  xlab("Number of transactions") + ylab("Frequency")
The plot above displays the number of transactions per customer against the count of customers with that many transactions. When the number of transactions is 1, the count is extremely high. This is expected, since we already know there are many one-time purchasers. We will exclude this column to show the distribution of multiple-time purchasers only.
freq_mult_table <- filter(freq_table, number_purchases != 1)
ggplot(freq_mult_table) + geom_col(aes(x = number_purchases, y = freq_vec),
    color = "black", fill = "#FFCC99") +
  xlab("Number of transactions") + ylab("Frequency")
The distribution is still heavily skewed, roughly resembling the shape of a Poisson distribution with \(\lambda = 1\). What we have generated above are bar plots, however; we will convert them into histograms, which bin consecutive transaction counts together instead of treating each count as a separate discrete category.
# Expand the frequency table back into one observation per customer, so
# hist() can bin the raw transaction counts
freq_table_dist <- rep(freq_table$number_purchases, freq_table$freq_vec)
hist(freq_table_dist, breaks = 300, col = 'skyblue3',
     xlab = "Number of transactions", main = "Histogram of Number of Transactions")
freq_mult_table_dist <- rep(freq_mult_table$number_purchases, freq_mult_table$freq_vec)
hist(freq_mult_table_dist, breaks = 250, col = 'skyblue3',
     xlab = "Number of transactions",
     main = "Histogram of Number of Transactions excluding one-time purchasers")
Even after excluding one-time purchasers, we observe that a large number of customers make fewer than 10 (more precisely, fewer than 5) transactions. Let's take a closer look below at customers with at most 10 transactions, one-time purchasers included. The frequency is displayed above each bin.
h <- hist(freq_table_dist, breaks = 1000, col = 'skyblue3',
xlab = "Number of transactions", xlim = c(1, 10), ylim = c(0, 90000),
main = "Histogram of Number of Transactions between 1 and 10")
text(h$mids, h$counts, labels = h$counts, adj = c(0.5, -0.5))
freq_table
The histogram above displays the numbers of transactions between 1 and
10 with their respective counts. Notice that the first bin is in fact a
combination of one-time and two-time purchasers, since
67,294 + 19,942 = 87,236. The proportion of customers at each transaction
count is added as an additional column to freq_table below.
freq_table_prop <- mutate(freq_table,
  proportion = paste0(round(100 * freq_vec / sum(freq_vec), 2), "%"))
freq_table_prop |> arrange(desc(freq_vec))
Hence, among all customers, approximately 70% are one-time purchasers, followed by approximately 20% two-time purchasers; the remaining multiple-time purchasers together account for less than 10%.
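A cumulative-share column makes the same point at a glance; a small sketch:
freq_table |>
  mutate(cum_share = cumsum(freq_vec) / sum(freq_vec))  # running share of customers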
We are also interested in the average number of transactions per customer. We will compare the mean across all customers with the mean excluding one-time purchasers.
freq_table_dist |> mean()
## [1] 1.507116
freq_mult_table_dist |> mean()
## [1] 2.720523
The average number of transactions across all customers is approximately 1.51, and approximately 2.72 among multiple-time purchasers. In other words, even customers who make more than one purchase typically make only two or three. This makes our analysis harder, since the majority of customers do not have lasting purchasing behavior. Fortunately, with a relatively large sample (more than 90,000 customers), we can still find a fair number of customers who transact frequently through inKind (say, above 10 purchases, under our naive definition so far); a quick count is sketched below.
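For reference, counting those frequent customers is a one-liner (output omitted here):
filter(user_purchase, n > 10) |> nrow()  # customers with more than 10 purchases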
Now we have another question of interest: what is the average amount charged, in dollars, across all customers? What about one-time purchasers? And multiple-time purchasers? We answer these questions with the code below.
user_one_purchase <- filter(user_purchase, n == 1)
one_purchase_list <- user_one_purchase$user_id
# Restrict the transaction table to customers with exactly one purchase
df_revenue_view_one_tp <- filter(df_revenue_view, user_id %in% one_purchase_list)
df_revenue_view_one_tp
nrow(df_revenue_view_one_tp)
## [1] 67294
When the number of transactions is exactly 1, we have 67,294
customers. Their average spend in USD is easy to determine: we can
simply apply mean() (or summary statistics in general) to these
67,294 single-transaction amounts.
# amount_charged_in_usd is stored as character, so coerce to numeric first
summary(as.numeric(df_revenue_view_one_tp$amount_charged_in_usd))
The summary above gives the distribution of dollars charged for one-time purchasers. The minimum spend is only $0.01 (possibly a card-verification charge?), whereas the maximum is $100,000.00 (unusual; we will investigate these apparent outliers further). A boxplot for one-time purchasers is made below.
ggplot(df_revenue_view_one_tp) + geom_boxplot(aes(as.numeric(amount_charged_in_usd)),
outlier.color="red", outlier.shape=8, outlier.size=4)
outliers_one_tp <-
boxplot.stats(as.numeric(df_revenue_view_one_tp$amount_charged_in_usd))$out
length(outliers_one_tp)
## [1] 5749
paste0(round(100 * length(outliers_one_tp) / nrow(df_revenue_view_one_tp), 2), "%")
## [1] "8.54%"
We find that there are 5,749 outliers! That is already about 8.5% of all one-time purchasers, and the boxplot has a heavy right tail. To see the bulk of the distribution clearly, we temporarily ignore these 5,749 outliers (further diagnostic and robustness checks are needed to make sure we eliminate the "correct" outliers).
# Coerce to numeric before the %in% test; comparing the character column
# directly against numeric outliers would silently match almost nothing
df_revenue_view_one_tp_no_outlier <- filter(df_revenue_view_one_tp,
  ! as.numeric(amount_charged_in_usd) %in% outliers_one_tp)
ggplot(df_revenue_view_one_tp, aes(as.numeric(amount_charged_in_usd))) +
geom_boxplot(outlier.shape = NA) + xlim(0, 255) + geom_point(aes(x =
max(as.numeric(df_revenue_view_one_tp_no_outlier$amount_charged_in_usd)), y = 0))
## Warning: Removed 5742 rows containing non-finite values (`stat_boxplot()`).
The boxplot excluding outliers is shown above; the median appears to sit well below the mean. Still, we will treat the original summary statistics as the reference for one-time purchasers: a mean of 144.49 and a median of 58.74.
Continuing with our question about the average dollars spent at each number of transactions from 1 to 722: when the number of transactions is greater than 1, things get a bit trickier, so let's write a function and iterate.
df_revenue_view_mult_tp <- filter(df_revenue_view,
  ! user_id %in% one_purchase_list)
# All transactions belonging to customers with exactly i purchases
number_transaction <- function(i) {
  i_purchase_list <- filter(user_purchase, n == i)$user_id
  filter(df_revenue_view, user_id %in% i_purchase_list)
}
# Mean transaction amount among those customers
avg_transaction <- function(i) {
  mean(as.numeric(number_transaction(i)$amount_charged_in_usd))
}
# Average charge at each distinct number of purchases
vec_avg_transaction <- sapply(number_purchases, avg_transaction)
df_avg_transaction <- data.frame(number_purchases, vec_avg_transaction)
df_avg_transaction |> arrange(desc(vec_avg_transaction))
ggplot(df_avg_transaction, aes(x = number_purchases, y = vec_avg_transaction)) +
  geom_point(color = "black", fill = "#FFCC99") +
  stat_smooth(method = "lm", formula = y ~ x, geom = "smooth", se = FALSE) +
  xlab("Number of transactions") + ylab("Average amount of dollars charged")
From the table output, the average charge is highest (approximately $2,415.82) when the number of transactions is 169, and lowest (approximately $42.85) when it is 30. The scatter does not follow any particular pattern and fans out as the number of transactions increases, i.e. the variance grows, so a plain OLS model is not favored here. We may try a GLM later (a logit-GLM was suggested). We can also compute a rougher average charge for one-time versus multiple-time purchasers below.
mean(as.numeric(df_revenue_view_one_tp$amount_charged_in_usd))
## [1] 144.49
mean(as.numeric(df_revenue_view_mult_tp$amount_charged_in_usd))
## [1] 211.8939
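Regarding the GLM remark above, here is a minimal sketch of one option for the fanning-out variance: a Gamma GLM with a log link. This is an alternative illustration, not necessarily the logit model mentioned, since the response here is a positive dollar amount.
glm_avg <- glm(vec_avg_transaction ~ number_purchases,
               family = Gamma(link = "log"),  # variance grows with the mean
               data = df_avg_transaction)
summary(glm_avg)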
We are sad to conclude that there is not enough information to identify a positive correlation between the number of transactions and the average amount charged. However, if we split the data further into individual customers with different numbers of transactions, we may see personalized patterns. Drawing on Bayesian synthetic control and the general logic of randomized experiments, we could then infer purchasing patterns for other groups of customers whose features resemble those of the customers we study.
Our next question: given a customer, what is the distribution of unique project ids for him/her? We could study this at the individual level, or conditional on the number of transactions (1 through 722). Let us focus on the latter first, since there are far fewer transaction-count groups than individuals. Even so, bar plots for every group would be unwieldy, so we present summary statistics instead. The plan: work through a simple example for customers with exactly one transaction, then write a generic function that takes a number (2 through 722) and returns the corresponding distribution of project ids.
group_by(df_revenue_view_one_tp, project_id) |> count() |> arrange(desc(n))
From the table output, there are 147 project ids among one-time
purchasers, with their purchase counts listed in descending order in the
second column. A project id of -999 has the most visits. Let us
explore what this project is. We will also check project id
692.
filter(df_revenue_view, project_id == "-999")
filter(df_revenue_view, project_id == "692")
Project id -999 is actually inKind
Pass, not a physical restaurant. That clears up the otherwise
confusing minus sign in the project id: it is apparently a placeholder
rather than a real restaurant id. Project id 692 is an ordinary
physical restaurant that is very popular.
# Distribution of project ids among customers with exactly i transactions
proj_dist <- function(i) {
  group_by(number_transaction(i), project_id) |> count() |>
    arrange(desc(n))
}
proj_dist(2)
proj_dist(10)
Rather than printing hundreds of data frames in a for loop, we expose
the number of transactions i as the argument of our
pre-defined function proj_dist() and call it on demand.
Still, we may wonder which project id is the most popular at each
transaction count i. We therefore define another step that
returns the most frequent project id at each number of transactions
i (using lapply, which avoids growing a list
inside a loop).
# The single most frequent project id at each number of transactions
proj_list <- lapply(number_purchases,
                    function(i) head(proj_dist(i), 1)$project_id)
proj_vec <- unlist(proj_list)
project_dist <- data.frame(number_purchases, proj_vec)
project_dist
filter(project_dist, proj_vec != "-999") |> group_by(proj_vec) |>
count() |> arrange(desc(n))
We find that, across all transaction counts, after excluding the most
popular project id -999 (inKind Pass), the
next most popular project id is 277, which tops the list
for 6 different transaction counts.
filter(df_revenue_view, project_id == "277")
Hence, the second most popular project across transaction counts is
The Market District (project id 277).
Now we will focus on each individual customer: what does each customer's project-id distribution look like? The methodology follows as above, but since we have more than 90,000 individual customers, the iteration takes significantly more time. Let us first group each customer's transactions together. A dictionary keyed by customer is the natural structure here, so we will organize the data frame into per-customer records.
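As a baseline, base R already provides a dictionary-like structure: split() returns a list keyed by user_id, one data frame of transactions per customer (the key below is hypothetical):
transactions_by_user <- split(df_revenue_view, df_revenue_view$user_id)
transactions_by_user[["u123"]]  # hypothetical user_id; looks up one customer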
library(dictionaRy)
## Loading required package: jsonlite
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:shiny':
##
## validate
## The following object is masked from 'package:purrr':
##
## flatten
## The following objects are masked from 'package:rjson':
##
## fromJSON, toJSON
library(datadictionary)
df_revenue_view_by_user <- group_by(df_revenue_view, user_id) |>
summarize(UID, name, project_id, gcp_id, ict_id, created_at,
amount_charged_in_usd, utm_campaign, utm_medium, utm_content,
utm_source, is_app_purchase, user_id, transaction_type, credit_type,
credit_given_in_usd, is_excess, stripe_brand, option_id)
## `summarise()` has grouped output by 'user_id'. You can override using the
## `.groups` argument.
df_revenue_view_by_user <- transform(df_revenue_view_by_user,
is_app_purchase = as.factor(is_app_purchase),
transaction_type = as.factor(transaction_type),
credit_type = as.factor(credit_type),
is_excess = as.factor(is_excess),
stripe_brand = as.factor(stripe_brand))
df_revenue_view_by_user
user_dict <- create_dictionary(df_revenue_view_by_user, id_var = "user_id")
user_dict
We find that grouping by user_id yields only 152 unique
values of project_id; the project ids are listed below, along
with the corresponding project names.
unique(df_revenue_view_by_user$project_id)
## [1] "695" "676" "185" "692" "720" "702" "697" "706" "715" "713"
## [11] "-999" "708" "732" "698" "302" "699" "726" "277" "681" "727"
## [21] "716" "224" "686" "255" "700" "712" "687" "784" "767" "722"
## [31] "710" "711" "723" "748" "317" "707" "184" "811" "693" "709"
## [41] "790" "684" "311" "718" "703" "719" "696" "310" "729" "749"
## [51] "741" "735" "208" "694" "773" "734" "754" "756" "793" "249"
## [61] "701" "11" "682" "740" "721" "737" "736" "747" "750" "745"
## [71] "786" "315" "753" "731" "739" "746" "752" "234" "738" "200"
## [81] "771" "264" "766" "705" "751" "763" "758" "730" "744" "757"
## [91] "788" "42" "776" "772" "781" "742" "813" "768" "312" "764"
## [101] "760" "770" "724" "775" "180" "765" "774" "762" "769" "789"
## [111] "796" "777" "787" "252" "780" "269" "778" "809" "306" "785"
## [121] "759" "795" "803" "761" "791" "807" "265" "170" "801" "805"
## [131] "195" "683" "802" "299" "800" "798" "797" "318" "794" "307"
## [141] "804" "319" "314" "371" "690" "326" "658" "105" "352" "226"
## [151] "717" "218"
unique(df_revenue_view_by_user$name)
## [1] "Matthew Kenney"
## [2] "50 Eggs Hospitality Group"
## [3] "The Ravenous Pig"
## [4] "Thunderdome Restaurant Group"
## [5] "The Electric Jane"
## [6] "Apero"
## [7] "Parker Hospitality"
## [8] "Parson's"
## [9] "The Bungalow"
## [10] "Silk Road Hospitality"
## [11] "inKind Pass"
## [12] "Jose Andres Restaurants"
## [13] "Street Guys Hospitality"
## [14] "Mina Group"
## [15] "Maketto"
## [16] "Bluestone Lane"
## [17] "City Winery"
## [18] "The Market District"
## [19] "Salvation Pizza"
## [20] "JINYA (DMV)"
## [21] "Union Square Hospitality Group"
## [22] "Juliet Italian Kitchen"
## [23] "Elcielo"
## [24] "Paraiso"
## [25] "Stout's"
## [26] "Milkshake Concepts"
## [27] "Two Hands"
## [28] "Delicious Hospitality Group"
## [29] "Cha Cha Matcha"
## [30] "Citizens Manhattan West"
## [31] "AMPD Group"
## [32] "Higher Ground"
## [33] "Disruptive Group"
## [34] "Nighthawk Brewery & Poppyseed Rye"
## [35] "40 North"
## [36] "Philotimo"
## [37] "Penny Ann's Cafe"
## [38] "Rosebud Restaurants"
## [39] "Destination Unknown Restaurants"
## [40] "Zandra's"
## [41] "Horn Barbecue"
## [42] "Superette"
## [43] "Tiger, Grateful Bread & Solomon's"
## [44] "Eighteen36"
## [45] "Goosecup"
## [46] "Catalogue"
## [47] "Black Rock Social House"
## [48] "Zola"
## [49] "Mirame"
## [50] "Omakase Restaurant Group"
## [51] "Black Seed Bagels + Pebble Bar"
## [52] "Service Bar + Causa"
## [53] "St Arnold's + Tyber Bierhaus"
## [54] "138° by Matt Meyer"
## [55] "Musket Room"
## [56] "Fiorella"
## [57] "Veggie Grill"
## [58] "Valentina's Tex Mex BBQ"
## [59] "Quarter Acre"
## [60] "Sheesh"
## [61] "High Road Cycling"
## [62] "Pluma by Bluebird Bakery"
## [63] "Ozumo"
## [64] "Little Fatty"
## [65] "The Concourse Project"
## [66] "Filé Gumbo Bar"
## [67] "Le Pigeon"
## [68] "Gregory Gourdet Restaurants"
## [69] "In Hospitality LLC"
## [70] "Parched Hospitality Group"
## [71] "The Butcher's Daughter"
## [72] "Georgetown Butcher"
## [73] "Back of the House"
## [74] "Strangebird Hospitality"
## [75] "Tacolicious"
## [76] "Livanos Restaurant Group"
## [77] "Momoya"
## [78] "Nomad Donuts"
## [79] "Ariete Hospitality"
## [80] "TLC"
## [81] "HopMonk"
## [82] "Thamee"
## [83] "Ethan Stowell"
## [84] "CALA"
## [85] "Cassava SF"
## [86] "1799 Prime"
## [87] "L'antica Pizzeria da Michele"
## [88] "Figaro Cafe"
## [89] "Juniper Cafe"
## [90] "Hoffbrau Steak & Grill House"
## [91] "Grand Fir Brewing"
## [92] "Prequel DC"
## [93] "Garrett Hospitality"
## [94] "Gotham"
## [95] "Tacos 1986"
## [96] "Negroni"
## [97] "Rosa Mexicano"
## [98] "Todos Cantina + Cocina"
## [99] "Rosen's Bagels"
## [100] "BarTucci / Gino and Marty's"
## [101] "Justin Queso's"
## [102] "Kitchen + Kocktails by Kevin Kelley"
## [103] "The Monkey King "
## [104] "Whitmans Restaurant Group"
## [105] "Salt & Time"
## [106] "Le Bon Nosh"
## [107] "Hook & Master"
## [108] "Les Trois Chevaux"
## [109] "VHCLE"
## [110] "Underdogs Cantina"
## [111] "Great White"
## [112] "Reem's"
## [113] "American Cut Steakhouse"
## [114] "Urge + The Barrel Room + Mason Ale Works"
## [115] "Boujis Group"
## [116] "Bar Spero"
## [117] "Mia Market"
## [118] "Avli"
## [119] "Reina + Hey ReyRey"
## [120] "Major Food Group"
## [121] "Jrk!"
## [122] "Wexler's Deli"
## [123] "Broad Street Oyster Co."
## [124] "Tulum Tacos & Tequila"
## [125] "101 Hospitality"
## [126] "Santé"
## [127] "Gozu"
## [128] "Bliss Cafe"
## [129] "JoJo's Shake Bar"
## [130] "RedFarm"
## [131] "Hamrock's Restaurant"
## [132] "The Farmers Butcher"
## [133] "Winston House / The Waterfront"
## [134] "Magna"
## [135] "Blackfoot Hospitality"
## [136] "Saraghina Group"
## [137] "Lamia's Fish Market"
## [138] "Capitol Cider House"
## [139] "Paperboy"
## [140] "The Farmer’s Cow Calfé & Creamery"
## [141] "Bronze"
## [142] "Reunion 19"
## [143] "Bhuna Restaurant"
## [144] "Perle"
## [145] "Austin's Best Restaurants"
## [146] "Happiest Hour"
## [147] "Crack Shack Little Italy"
## [148] "The Cave"
## [149] "Flight"
## [150] "Mandala Kitchen & Bar"
## [151] "Chuck Lager"
## [152] "JINYA Ramen Bar"
Now one may be interested in splitting the UID column
into two parts: UIDs ending with g imply a restaurant project
type, and UIDs starting with i imply a universal project type.
To achieve this, we first mutate the table with new columns
extracting the last and first characters of UID.
df_revenue_view_by_uid <- mutate(df_revenue_view_by_user,
  last_char = substr(UID, nchar(UID), nchar(UID)),
  first_char = substr(UID, 1, 1))
df_revenue_view_uid_g <- filter(df_revenue_view_by_uid, last_char == "g")
df_revenue_view_uid_i <- filter(df_revenue_view_by_uid, first_char == "i")
# Convert the low-cardinality character columns to factors before building
# the dictionary, so create_dictionary() does not warn and the dictionary
# object is not accidentally overwritten afterwards
df_revenue_view_uid_g <- transform(df_revenue_view_uid_g,
  last_char = as.factor(last_char),
  first_char = as.factor(first_char))
user_dict_uid_g <- create_dictionary(df_revenue_view_uid_g, id_var = "user_id")
user_dict_uid_g
df_revenue_view_uid_i <- transform(df_revenue_view_uid_i,
  name = as.factor(name),
  project_id = as.factor(project_id),
  utm_medium = as.factor(utm_medium),
  first_char = as.factor(first_char))
user_dict_uid_i <- create_dictionary(df_revenue_view_uid_i, id_var = "user_id")
user_dict_uid_i
unique(df_revenue_view_uid_g$name) |> length()
## [1] 151
unique(df_revenue_view_uid_i$name)
## [1] inKind Pass
## Levels: inKind Pass
By splitting the data into the two categories g and
i, we find that the i category contains only
inKind Pass projects, all sharing a single unique
project_id. The remaining 151 projects fall into the category of
UIDs ending with g.
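As a quick consistency check tying this back to the earlier -999 finding, every transaction in the i category should map to the inKind Pass project id (a sketch; it should return TRUE if so):
all(df_revenue_view_uid_i$project_id == "-999")  # TRUE if i-UIDs are all inKind Pass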