Often times, we have to deal with a lot of unclean, missing categorical data, and our goal is to extract key insights, features from various attributes to come up with some sort of customer profile. For example, imagine you have a data set that has only five variables, i.e. var_a is subject Id, var_b is gender, var_c is education, var_d is habit, and var_e is customer_yn. What we want is to extract key features from each attribute (gender, education, habit) that are important to our customers (customer_yn == ‘Y’). For example, women who have a bachelor degree or above, who exercise weekly are more likely to be our customers.
It’s noteworthy to point out that not all data is filled by our customers. Customer A is missing gender or education, whereas customer B has an extensive list of habits (30 different habits filled in the var_d field).
In this case, we are going to do a simple and quick solution using sql and odds ratio to extract key features (or values) from various attributes. For this blog post, we are using a toy data set that has 13 variables, in particular, we are interested in the “vip” column (0 = “No”, 1 = “Yes”) and we want to figure out what attrbibutes, values that distinguish our VIP customers. These are all categorical data and many data are missing.
library(tidyverse, quietly = TRUE)
## Warning: package 'tidyverse' was built under R version 3.6.2
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'stringr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
library(sqldf, quietly = TRUE)
## Warning: package 'sqldf' was built under R version 3.6.2
## Warning: package 'gsubfn' was built under R version 3.6.2
## Warning: package 'proto' was built under R version 3.6.2
## Warning: package 'RSQLite' was built under R version 3.6.2
library(kableExtra, quietly = TRUE)
## Warning: package 'kableExtra' was built under R version 3.6.2
library(Hmisc, quietly = TRUE)
## Warning: package 'Hmisc' was built under R version 3.6.2
## Warning: package 'survival' was built under R version 3.6.2
df <- read.csv("demoDf.csv")
dim(df)
## [1] 224985 13
head(df, 20) %>% kable()
| id | vip | country | product | credit_card_brand | credit_card_funding | promotion_type | industry | signed_up_enabled | registered_previously | price | status | loyalty_program |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IZb9bS00100101 | 0 | United States | Product K | Visa | credit | Type 2 | Healthcare/Pharma | FALSE | TRUE | full_price | New | FALSE |
| IZb9bS00100104 | 0 | United States | Product K | Visa | credit | Type 2 | Technology/Telecommunications | FALSE | FALSE | full_price | New | FALSE |
| IZb9bS00100105 | 0 | United States | Product K | MasterCard | credit | Type 2 | Other | FALSE | TRUE | full_price | New | FALSE |
| IZb9bS00100106 | 0 | United States | Product Z | Type 3 | Financial Services | FALSE | FALSE | full_price | New | FALSE | ||
| IZb9bS00100108 | 0 | Singapore | Product K | Visa | credit | Type 2 | Financial Services | FALSE | FALSE | full_price | New | FALSE |
| IZb9bS00100112 | 0 | Hong Kong | Product K | Type 2 | Financial Services | FALSE | FALSE | full_price | New | FALSE | ||
| IZb9bS00100115 | 0 | Saudi Arabia | Product K | Type 2 | Financial Services | FALSE | FALSE | full_price | New | FALSE | ||
| IZb9bS00100117 | 1 | Hong Kong | Product Z | Type 2 | Manufacturing/Engineering | TRUE | TRUE | full_price | New | FALSE | ||
| IZb9bS00100118 | 0 | Hong Kong | Product Z | Type 2 | FALSE | TRUE | full_price | New | FALSE | |||
| IZb9bS00100119 | 0 | United States | Product Z | Visa | debit | Type 3 | FALSE | FALSE | full_price | New | FALSE | |
| IZb9bS00100121 | 0 | Russian Federation | Product K | Type 2 | FALSE | TRUE | full_price | New | FALSE | |||
| IZb9bS00100122 | 0 | Germany | Product Z | Type 2 | FALSE | TRUE | full_price | NA | ||||
| IZb9bS00100123 | 0 | Russian Federation | Product K | Visa | debit | Type 2 | Financial Services | FALSE | TRUE | full_price | New | FALSE |
| IZb9bS00100124 | 0 | Germany | Product Z | Type 2 | FALSE | FALSE | full_price | New | FALSE | |||
| IZb9bS00100126 | 0 | United States | Product Z | American Express | credit | Type 2 | FALSE | FALSE | full_price | New | FALSE | |
| IZb9bS00100128 | 0 | Japan | Product Z | Visa | credit | Type 2 | Real Estate | FALSE | FALSE | full_price | New | FALSE |
| IZb9bS00100129 | 0 | United Kingdom | Product Z | Visa | debit | Type 2 | Financial Services | TRUE | FALSE | full_price | NA | |
| IZb9bS00100130 | 0 | Netherlands | Product Z | MasterCard | credit | Type 2 | Technology/Telecommunications | FALSE | TRUE | full_price | New | FALSE |
| IZb9bS00100132 | 0 | United States | Product Z | American Express | credit | Type 3 | Financial Services | FALSE | FALSE | full_price | New | FALSE |
| IZb9bS00100135 | 0 | Italy | Product K | Type 2 | Financial Services | FALSE | FALSE | full_price | New | FALSE |
# unique values
dfUV <- lapply(df, function(x) length(unique(x))) %>%
unlist %>%
as.data.frame() %>%
rownames_to_column() %>%
dplyr::select(columns = rowname, unique_values = ".")
# missing
dfMissing <- colSums(is.na(df)) %>%
as.data.frame() %>%
rownames_to_column() %>%
dplyr::select(columns = rowname, missing = ".")
# data type
dfClass <- lapply(df, function(x) class(x)) %>%
unlist %>%
as.data.frame() %>%
rownames_to_column() %>%
dplyr::select(columns = rowname, class = ".")
# metadata
dfMeta <- dplyr::inner_join(dfUV, dfMissing,
by = c("columns", "columns")) %>%
dplyr::inner_join(dfClass, by = c("columns", "columns"))
dfMeta %>% kable()
| columns | unique_values | missing | class |
|---|---|---|---|
| id | 224543 | 0 | factor |
| vip | 2 | 0 | integer |
| country | 210 | 0 | factor |
| product | 3 | 0 | factor |
| credit_card_brand | 8 | 0 | factor |
| credit_card_funding | 5 | 0 | factor |
| promotion_type | 7 | 0 | factor |
| industry | 74 | 0 | factor |
| signed_up_enabled | 3 | 1436 | logical |
| registered_previously | 3 | 1436 | logical |
| price | 2 | 0 | factor |
| status | 4 | 0 | factor |
| loyalty_program | 3 | 22220 | logical |
# replace NA with "UNKNOWN" (optional)
dfClean <- lapply(1:length(names(df)), function(x) df %>%
dplyr::select(names(df)[x]) %>%
unlist %>%
as.vector %>%
replace_na("UNKNOWN")) %>%
bind_cols()
names(dfClean) <- names(df)
# gather data into appropriate structure
dfOR <- dfClean %>%
dplyr::select(-id) %>%
tidyr::gather(attribute, value, -vip)
head(dfOR, 20) %>% kable()
| vip | attribute | value |
|---|---|---|
| 0 | country | United States |
| 0 | country | United States |
| 0 | country | United States |
| 0 | country | United States |
| 0 | country | Singapore |
| 0 | country | Hong Kong |
| 0 | country | Saudi Arabia |
| 1 | country | Hong Kong |
| 0 | country | Hong Kong |
| 0 | country | United States |
| 0 | country | Russian Federation |
| 0 | country | Germany |
| 0 | country | Russian Federation |
| 0 | country | Germany |
| 0 | country | United States |
| 0 | country | Japan |
| 0 | country | United Kingdom |
| 0 | country | Netherlands |
| 0 | country | United States |
| 0 | country | Italy |
# a quick hack in sql
OR <- sqldf::sqldf("
select attribute
, value
, interest_group_yes
, control_group_yes
, interest_group_no
, control_group_no
from (
select x.attribute
, x.value
, x.interest_group_yes
, x.control_group_yes
, y.interest_group_total - x.interest_group_yes as interest_group_no
, y.control_group_total - x.control_group_yes as control_group_no
from (
select attribute
, value
, sum(case when vip = 1 then 1 else 0 end) as interest_group_yes
, sum(case when vip = 0 then 1 else 0 end) as control_group_yes
from dfOR
group by 1, 2
) x
join (
select attribute
, sum(case when vip = 1 then 1 else 0 end) as interest_group_total
, sum(case when vip = 0 then 1 else 0 end) as control_group_total
from dfOR
group by 1
) y on x.attribute = y.attribute
) z
where interest_group_yes >= 25
and interest_group_no >= 25
and control_group_yes >= 25
and control_group_no >= 25
"
)
Here’s a short list of key features that we extract from the categorical variables (with a lot of missing). We compare the VIP with non-VIP and we distill the following differences. For example, VIP customers are 3 times more likely to associate with the “discount” value from the “price” attribute (with 95% CI between 2.99 and 3.11 times as likely). They are also 2.7 times more likely to associate with the “Type 5” promotion, in addition 2.7 times more likely to be associated with “Product Z”.
Looking down the list, we continue to see more interesting side of our VIP customers, such as they are less likely to have registered previously (so that they are more likely to be new customers to us); they are more likely to have signed up with our loyalty program; they are more likely come from Eastern European countries, and etc. Interesting fact, industry data (or their profession) does not seem to have any difference in VIP status.
Above is a quick sql hack to get key insights from missing, messy data. Above code can be rewritten completely within R tidyverse, but I am just more customed to do it in sql (and it’s easier to see and debug everything in one sql query instead of multiple steps in R).
# Odds Ratio
ORplusCI <- OR %>%
dplyr::mutate(OR = round((interest_group_yes / interest_group_no) / (control_group_yes / control_group_no), 4))
# Upper 95% CI
Upper_CI = with(ORplusCI,
exp(1) ^(log(OR) +
(1.96 * sqrt(1/interest_group_yes +
1/interest_group_no +
1/control_group_yes +
1/control_group_no))))
# Lower 95% CI
Lower_CI = with(ORplusCI,
exp(1) ^(log(OR) -
(1.96 * sqrt(1/interest_group_yes +
1/interest_group_no +
1/control_group_yes +
1/control_group_no))))
# combined
ORplusCI$Upper_CI <- round(Upper_CI, 4)
ORplusCI$Lower_CI <- round(Lower_CI, 4)
# display only significant results and sort by OR
ORplusCI <- ORplusCI %>%
dplyr::filter(Upper_CI >1 & Lower_CI >1) %>%
dplyr::mutate(OR = round(OR, 2),
Upper_CI = round(Upper_CI, 2),
Lower_CI = round(Lower_CI, 2)) %>%
arrange(desc(OR))
ORplusCI %>% kable()
| attribute | value | interest_group_yes | control_group_yes | interest_group_no | control_group_no | OR | Upper_CI | Lower_CI |
|---|---|---|---|---|---|---|---|---|
| price | discount | 31363 | 40527 | 30961 | 122134 | 3.05 | 3.11 | 2.99 |
| promotion_type | Type 5 | 26 | 25 | 62298 | 162636 | 2.71 | 4.70 | 1.57 |
| product | Product Z | 52240 | 107057 | 10084 | 55604 | 2.69 | 2.76 | 2.63 |
| promotion_type | Type 4 | 51 | 61 | 62273 | 162600 | 2.18 | 3.17 | 1.50 |
| loyalty_program | TRUE | 136 | 179 | 62188 | 162482 | 1.99 | 2.48 | 1.59 |
| product | Product L | 116 | 160 | 62208 | 162501 | 1.89 | 2.41 | 1.49 |
| promotion_type | Type 7 | 241 | 337 | 62083 | 162324 | 1.87 | 2.21 | 1.58 |
| registered_previously | FALSE | 41838 | 93579 | 20486 | 69082 | 1.51 | 1.54 | 1.48 |
| country | Lithuania | 97 | 175 | 62227 | 162486 | 1.45 | 1.86 | 1.13 |
| country | Czech Republic | 131 | 243 | 62193 | 162418 | 1.41 | 1.74 | 1.14 |
| promotion_type | Type 3 | 4226 | 8387 | 58098 | 154274 | 1.34 | 1.39 | 1.29 |
| country | Bulgaria | 87 | 171 | 62237 | 162490 | 1.33 | 1.72 | 1.03 |
| country | Poland | 265 | 520 | 62059 | 162141 | 1.33 | 1.54 | 1.15 |
| country | Ukraine | 140 | 277 | 62184 | 162384 | 1.32 | 1.62 | 1.08 |
| country | Austria | 161 | 320 | 62163 | 162341 | 1.31 | 1.59 | 1.09 |
| country | Finland | 216 | 431 | 62108 | 162230 | 1.31 | 1.54 | 1.11 |
| country | Hungary | 109 | 224 | 62215 | 162437 | 1.27 | 1.60 | 1.01 |
| country | Nigeria | 117 | 243 | 62207 | 162418 | 1.26 | 1.57 | 1.01 |
| signed_up_enabled | TRUE | 9005 | 19191 | 53319 | 143470 | 1.26 | 1.30 | 1.23 |
| country | Italy | 536 | 1125 | 61788 | 161536 | 1.25 | 1.38 | 1.12 |
| country | Norway | 258 | 541 | 62066 | 162120 | 1.25 | 1.45 | 1.07 |
| country | Viet Nam | 212 | 442 | 62112 | 162219 | 1.25 | 1.48 | 1.06 |
| country | Germany | 920 | 2005 | 61404 | 160656 | 1.20 | 1.30 | 1.11 |
| country | Sweden | 308 | 677 | 62016 | 161984 | 1.19 | 1.36 | 1.04 |
| promotion_type | Type 1 | 222 | 487 | 62102 | 162174 | 1.19 | 1.40 | 1.02 |
| country | India | 725 | 1679 | 61599 | 160982 | 1.13 | 1.23 | 1.03 |
| loyalty_program | UNKNOWN | 6558 | 15662 | 55766 | 146999 | 1.10 | 1.14 | 1.07 |
| country | Singapore | 2208 | 5412 | 60116 | 157249 | 1.07 | 1.12 | 1.01 |
| country | United Kingdom | 4392 | 10888 | 57932 | 151773 | 1.06 | 1.10 | 1.02 |