Odds Ratio

Often times, we have to deal with a lot of unclean, missing categorical data, and our goal is to extract key insights, features from various attributes to come up with some sort of customer profile. For example, imagine you have a data set that has only five variables, i.e. var_a is subject Id, var_b is gender, var_c is education, var_d is habit, and var_e is customer_yn. What we want is to extract key features from each attribute (gender, education, habit) that are important to our customers (customer_yn == ‘Y’). For example, women who have a bachelor degree or above, who exercise weekly are more likely to be our customers.

It’s noteworthy to point out that not all data is filled by our customers. Customer A is missing gender or education, whereas customer B has an extensive list of habits (30 different habits filled in the var_d field).

In this case, we are going to do a simple and quick solution using sql and odds ratio to extract key features (or values) from various attributes. For this blog post, we are using a toy data set that has 13 variables, in particular, we are interested in the “vip” column (0 = “No”, 1 = “Yes”) and we want to figure out what attrbibutes, values that distinguish our VIP customers. These are all categorical data and many data are missing.

library(tidyverse, quietly = TRUE)
## Warning: package 'tidyverse' was built under R version 3.6.2
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'stringr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
library(sqldf, quietly = TRUE)
## Warning: package 'sqldf' was built under R version 3.6.2
## Warning: package 'gsubfn' was built under R version 3.6.2
## Warning: package 'proto' was built under R version 3.6.2
## Warning: package 'RSQLite' was built under R version 3.6.2
library(kableExtra, quietly = TRUE)
## Warning: package 'kableExtra' was built under R version 3.6.2
library(Hmisc, quietly = TRUE)
## Warning: package 'Hmisc' was built under R version 3.6.2
## Warning: package 'survival' was built under R version 3.6.2
df <- read.csv("demoDf.csv")
dim(df)
## [1] 224985     13
head(df, 20) %>% kable()        
id vip country product credit_card_brand credit_card_funding promotion_type industry signed_up_enabled registered_previously price status loyalty_program
IZb9bS00100101 0 United States Product K Visa credit Type 2 Healthcare/Pharma FALSE TRUE full_price New FALSE
IZb9bS00100104 0 United States Product K Visa credit Type 2 Technology/Telecommunications FALSE FALSE full_price New FALSE
IZb9bS00100105 0 United States Product K MasterCard credit Type 2 Other FALSE TRUE full_price New FALSE
IZb9bS00100106 0 United States Product Z Type 3 Financial Services FALSE FALSE full_price New FALSE
IZb9bS00100108 0 Singapore Product K Visa credit Type 2 Financial Services FALSE FALSE full_price New FALSE
IZb9bS00100112 0 Hong Kong Product K Type 2 Financial Services FALSE FALSE full_price New FALSE
IZb9bS00100115 0 Saudi Arabia Product K Type 2 Financial Services FALSE FALSE full_price New FALSE
IZb9bS00100117 1 Hong Kong Product Z Type 2 Manufacturing/Engineering TRUE TRUE full_price New FALSE
IZb9bS00100118 0 Hong Kong Product Z Type 2 FALSE TRUE full_price New FALSE
IZb9bS00100119 0 United States Product Z Visa debit Type 3 FALSE FALSE full_price New FALSE
IZb9bS00100121 0 Russian Federation Product K Type 2 FALSE TRUE full_price New FALSE
IZb9bS00100122 0 Germany Product Z Type 2 FALSE TRUE full_price NA
IZb9bS00100123 0 Russian Federation Product K Visa debit Type 2 Financial Services FALSE TRUE full_price New FALSE
IZb9bS00100124 0 Germany Product Z Type 2 FALSE FALSE full_price New FALSE
IZb9bS00100126 0 United States Product Z American Express credit Type 2 FALSE FALSE full_price New FALSE
IZb9bS00100128 0 Japan Product Z Visa credit Type 2 Real Estate FALSE FALSE full_price New FALSE
IZb9bS00100129 0 United Kingdom Product Z Visa debit Type 2 Financial Services TRUE FALSE full_price NA
IZb9bS00100130 0 Netherlands Product Z MasterCard credit Type 2 Technology/Telecommunications FALSE TRUE full_price New FALSE
IZb9bS00100132 0 United States Product Z American Express credit Type 3 Financial Services FALSE FALSE full_price New FALSE
IZb9bS00100135 0 Italy Product K Type 2 Financial Services FALSE FALSE full_price New FALSE
# unique values
dfUV <- lapply(df, function(x) length(unique(x))) %>% 
        unlist %>% 
        as.data.frame() %>%
        rownames_to_column() %>%
        dplyr::select(columns = rowname, unique_values = ".")

# missing
dfMissing <- colSums(is.na(df)) %>% 
        as.data.frame() %>%
        rownames_to_column() %>%
        dplyr::select(columns = rowname, missing = ".")

# data type
dfClass <- lapply(df, function(x) class(x)) %>%
        unlist %>%
        as.data.frame() %>%
        rownames_to_column() %>%
        dplyr::select(columns = rowname, class = ".")

# metadata
dfMeta <- dplyr::inner_join(dfUV, dfMissing, 
                            by = c("columns", "columns")) %>%
        dplyr::inner_join(dfClass, by = c("columns", "columns"))

dfMeta %>% kable()
columns unique_values missing class
id 224543 0 factor
vip 2 0 integer
country 210 0 factor
product 3 0 factor
credit_card_brand 8 0 factor
credit_card_funding 5 0 factor
promotion_type 7 0 factor
industry 74 0 factor
signed_up_enabled 3 1436 logical
registered_previously 3 1436 logical
price 2 0 factor
status 4 0 factor
loyalty_program 3 22220 logical
# replace NA with "UNKNOWN" (optional)
dfClean <- lapply(1:length(names(df)), function(x) df %>% 
                              dplyr::select(names(df)[x]) %>%
                              unlist %>%
                              as.vector %>%
                              replace_na("UNKNOWN")) %>% 
        bind_cols()

names(dfClean) <- names(df)

# gather data into appropriate structure
dfOR <- dfClean %>%
        dplyr::select(-id) %>%
        tidyr::gather(attribute, value, -vip) 

head(dfOR, 20) %>% kable()
vip attribute value
0 country United States
0 country United States
0 country United States
0 country United States
0 country Singapore
0 country Hong Kong
0 country Saudi Arabia
1 country Hong Kong
0 country Hong Kong
0 country United States
0 country Russian Federation
0 country Germany
0 country Russian Federation
0 country Germany
0 country United States
0 country Japan
0 country United Kingdom
0 country Netherlands
0 country United States
0 country Italy
# a quick hack in sql
OR <- sqldf::sqldf("
        select attribute 
        , value 
        , interest_group_yes
        , control_group_yes
        , interest_group_no
        , control_group_no
        from (
                
                select x.attribute 
                , x.value
                , x.interest_group_yes
                , x.control_group_yes
                , y.interest_group_total - x.interest_group_yes as interest_group_no
                , y.control_group_total - x.control_group_yes as control_group_no
                from (
                        select attribute
                        , value
                        , sum(case when vip = 1 then 1 else 0 end) as interest_group_yes
                        , sum(case when vip = 0 then 1 else 0 end) as control_group_yes
                        from dfOR
                        group by 1, 2
                ) x
                join (
                        select attribute
                        , sum(case when vip = 1 then 1 else 0 end) as interest_group_total
                        , sum(case when vip = 0 then 1 else 0 end) as control_group_total
                        from dfOR
                        group by 1
                ) y on x.attribute = y.attribute
                
        ) z
        where interest_group_yes >= 25
        and interest_group_no >= 25
        and control_group_yes >= 25
        and control_group_no >= 25
        "
)

Odds Ratio, Confidence Interval

Here’s a short list of key features that we extract from the categorical variables (with a lot of missing). We compare the VIP with non-VIP and we distill the following differences. For example, VIP customers are 3 times more likely to associate with the “discount” value from the “price” attribute (with 95% CI between 2.99 and 3.11 times as likely). They are also 2.7 times more likely to associate with the “Type 5” promotion, in addition 2.7 times more likely to be associated with “Product Z”.

Looking down the list, we continue to see more interesting side of our VIP customers, such as they are less likely to have registered previously (so that they are more likely to be new customers to us); they are more likely to have signed up with our loyalty program; they are more likely come from Eastern European countries, and etc. Interesting fact, industry data (or their profession) does not seem to have any difference in VIP status.

Above is a quick sql hack to get key insights from missing, messy data. Above code can be rewritten completely within R tidyverse, but I am just more customed to do it in sql (and it’s easier to see and debug everything in one sql query instead of multiple steps in R).

# Odds Ratio
ORplusCI <- OR %>%
        dplyr::mutate(OR = round((interest_group_yes / interest_group_no) / (control_group_yes / control_group_no), 4))

# Upper 95% CI
Upper_CI = with(ORplusCI, 
                exp(1) ^(log(OR) + 
                                 (1.96 * sqrt(1/interest_group_yes + 
                                                        1/interest_group_no + 
                                                        1/control_group_yes + 
                                                        1/control_group_no))))

# Lower 95% CI
Lower_CI = with(ORplusCI, 
                exp(1) ^(log(OR) - 
                                 (1.96 * sqrt(1/interest_group_yes + 
                                                      1/interest_group_no + 
                                                      1/control_group_yes + 
                                                      1/control_group_no))))

# combined
ORplusCI$Upper_CI <- round(Upper_CI, 4)
ORplusCI$Lower_CI <- round(Lower_CI, 4)
# display only significant results and sort by OR
ORplusCI <- ORplusCI %>%
        dplyr::filter(Upper_CI >1 & Lower_CI >1) %>%
        dplyr::mutate(OR = round(OR, 2),
                      Upper_CI = round(Upper_CI, 2),
                      Lower_CI = round(Lower_CI, 2)) %>%
        arrange(desc(OR))

ORplusCI %>% kable()
attribute value interest_group_yes control_group_yes interest_group_no control_group_no OR Upper_CI Lower_CI
price discount 31363 40527 30961 122134 3.05 3.11 2.99
promotion_type Type 5 26 25 62298 162636 2.71 4.70 1.57
product Product Z 52240 107057 10084 55604 2.69 2.76 2.63
promotion_type Type 4 51 61 62273 162600 2.18 3.17 1.50
loyalty_program TRUE 136 179 62188 162482 1.99 2.48 1.59
product Product L 116 160 62208 162501 1.89 2.41 1.49
promotion_type Type 7 241 337 62083 162324 1.87 2.21 1.58
registered_previously FALSE 41838 93579 20486 69082 1.51 1.54 1.48
country Lithuania 97 175 62227 162486 1.45 1.86 1.13
country Czech Republic 131 243 62193 162418 1.41 1.74 1.14
promotion_type Type 3 4226 8387 58098 154274 1.34 1.39 1.29
country Bulgaria 87 171 62237 162490 1.33 1.72 1.03
country Poland 265 520 62059 162141 1.33 1.54 1.15
country Ukraine 140 277 62184 162384 1.32 1.62 1.08
country Austria 161 320 62163 162341 1.31 1.59 1.09
country Finland 216 431 62108 162230 1.31 1.54 1.11
country Hungary 109 224 62215 162437 1.27 1.60 1.01
country Nigeria 117 243 62207 162418 1.26 1.57 1.01
signed_up_enabled TRUE 9005 19191 53319 143470 1.26 1.30 1.23
country Italy 536 1125 61788 161536 1.25 1.38 1.12
country Norway 258 541 62066 162120 1.25 1.45 1.07
country Viet Nam 212 442 62112 162219 1.25 1.48 1.06
country Germany 920 2005 61404 160656 1.20 1.30 1.11
country Sweden 308 677 62016 161984 1.19 1.36 1.04
promotion_type Type 1 222 487 62102 162174 1.19 1.40 1.02
country India 725 1679 61599 160982 1.13 1.23 1.03
loyalty_program UNKNOWN 6558 15662 55766 146999 1.10 1.14 1.07
country Singapore 2208 5412 60116 157249 1.07 1.12 1.01
country United Kingdom 4392 10888 57932 151773 1.06 1.10 1.02