data_621_blog

Odds Ratio

Often times, we have to deal with a lot of unclean, missing categorical data, and our goal is to extract key insights, features from various attributes to come up with some sort of customer profile. For example, imagine you have a data set that has only five variables, i.e. var_a is subject Id, var_b is gender, var_c is education, var_d is habit, and var_e is customer_yn. What we want is to extract key features from each attribute (gender, education, habit) that are important to our customers (customer_yn == ‘Y’). For example, women who have a bachelor degree or above, who exercise weekly are more likely to be our customers.

It’s noteworthy to point out that not all data is filled by our customers. Customer A is missing gender or education, whereas customer B has an extensive list of habits (30 different habits filled in the var_d field).

In this case, we are going to do a simple and quick solution using sql and odds ratio to extract key features (or values) from various attributes. For this blog post, we are using a toy data set that has 13 variables, in particular, we are interested in the “vip” column (0 = “No”, 1 = “Yes”) and we want to figure out what attrbibutes, values that distinguish our VIP customers. These are all categorical data and many data are missing.

library(tidyverse, quietly = TRUE)
## Warning: package 'tidyverse' was built under R version 3.6.2
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'stringr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
library(sqldf, quietly = TRUE)
## Warning: package 'sqldf' was built under R version 3.6.2
## Warning: package 'gsubfn' was built under R version 3.6.2
## Warning: package 'proto' was built under R version 3.6.2
## Warning: package 'RSQLite' was built under R version 3.6.2
library(kableExtra, quietly = TRUE)
## Warning: package 'kableExtra' was built under R version 3.6.2
library(Hmisc, quietly = TRUE)
## Warning: package 'Hmisc' was built under R version 3.6.2
## Warning: package 'survival' was built under R version 3.6.2

df <- read.csv("demoDf.csv")
dim(df)
## [1] 224985     13
head(df, 20) %>% kable()

id	vip	country	product	credit_card_brand	credit_card_funding	promotion_type	industry	signed_up_enabled	registered_previously	price	status	loyalty_program
IZb9bS00100101	0	United States	Product K	Visa	credit	Type 2	Healthcare/Pharma	FALSE	TRUE	full_price	New	FALSE
IZb9bS00100104	0	United States	Product K	Visa	credit	Type 2	Technology/Telecommunications	FALSE	FALSE	full_price	New	FALSE
IZb9bS00100105	0	United States	Product K	MasterCard	credit	Type 2	Other	FALSE	TRUE	full_price	New	FALSE
IZb9bS00100106	0	United States	Product Z			Type 3	Financial Services	FALSE	FALSE	full_price	New	FALSE
IZb9bS00100108	0	Singapore	Product K	Visa	credit	Type 2	Financial Services	FALSE	FALSE	full_price	New	FALSE
IZb9bS00100112	0	Hong Kong	Product K			Type 2	Financial Services	FALSE	FALSE	full_price	New	FALSE
IZb9bS00100115	0	Saudi Arabia	Product K			Type 2	Financial Services	FALSE	FALSE	full_price	New	FALSE
IZb9bS00100117	1	Hong Kong	Product Z			Type 2	Manufacturing/Engineering	TRUE	TRUE	full_price	New	FALSE
IZb9bS00100118	0	Hong Kong	Product Z			Type 2		FALSE	TRUE	full_price	New	FALSE
IZb9bS00100119	0	United States	Product Z	Visa	debit	Type 3		FALSE	FALSE	full_price	New	FALSE
IZb9bS00100121	0	Russian Federation	Product K			Type 2		FALSE	TRUE	full_price	New	FALSE
IZb9bS00100122	0	Germany	Product Z			Type 2		FALSE	TRUE	full_price		NA
IZb9bS00100123	0	Russian Federation	Product K	Visa	debit	Type 2	Financial Services	FALSE	TRUE	full_price	New	FALSE
IZb9bS00100124	0	Germany	Product Z			Type 2		FALSE	FALSE	full_price	New	FALSE
IZb9bS00100126	0	United States	Product Z	American Express	credit	Type 2		FALSE	FALSE	full_price	New	FALSE
IZb9bS00100128	0	Japan	Product Z	Visa	credit	Type 2	Real Estate	FALSE	FALSE	full_price	New	FALSE
IZb9bS00100129	0	United Kingdom	Product Z	Visa	debit	Type 2	Financial Services	TRUE	FALSE	full_price		NA
IZb9bS00100130	0	Netherlands	Product Z	MasterCard	credit	Type 2	Technology/Telecommunications	FALSE	TRUE	full_price	New	FALSE
IZb9bS00100132	0	United States	Product Z	American Express	credit	Type 3	Financial Services	FALSE	FALSE	full_price	New	FALSE
IZb9bS00100135	0	Italy	Product K			Type 2	Financial Services	FALSE	FALSE	full_price	New	FALSE

# unique values
dfUV <- lapply(df, function(x) length(unique(x))) %>% 
        unlist %>% 
        as.data.frame() %>%
        rownames_to_column() %>%
        dplyr::select(columns = rowname, unique_values = ".")

# missing
dfMissing <- colSums(is.na(df)) %>% 
        as.data.frame() %>%
        rownames_to_column() %>%
        dplyr::select(columns = rowname, missing = ".")

# data type
dfClass <- lapply(df, function(x) class(x)) %>%
        unlist %>%
        as.data.frame() %>%
        rownames_to_column() %>%
        dplyr::select(columns = rowname, class = ".")

# metadata
dfMeta <- dplyr::inner_join(dfUV, dfMissing, 
                            by = c("columns", "columns")) %>%
        dplyr::inner_join(dfClass, by = c("columns", "columns"))

dfMeta %>% kable()

columns	unique_values	missing	class
id	224543	0	factor
vip	2	0	integer
country	210	0	factor
product	3	0	factor
credit_card_brand	8	0	factor
credit_card_funding	5	0	factor
promotion_type	7	0	factor
industry	74	0	factor
signed_up_enabled	3	1436	logical
registered_previously	3	1436	logical
price	2	0	factor
status	4	0	factor
loyalty_program	3	22220	logical

# replace NA with "UNKNOWN" (optional)
dfClean <- lapply(1:length(names(df)), function(x) df %>% 
                              dplyr::select(names(df)[x]) %>%
                              unlist %>%
                              as.vector %>%
                              replace_na("UNKNOWN")) %>% 
        bind_cols()

names(dfClean) <- names(df)

# gather data into appropriate structure
dfOR <- dfClean %>%
        dplyr::select(-id) %>%
        tidyr::gather(attribute, value, -vip) 

head(dfOR, 20) %>% kable()

vip	attribute	value
0	country	United States
0	country	United States
0	country	United States
0	country	United States
0	country	Singapore
0	country	Hong Kong
0	country	Saudi Arabia
1	country	Hong Kong
0	country	Hong Kong
0	country	United States
0	country	Russian Federation
0	country	Germany
0	country	Russian Federation
0	country	Germany
0	country	United States
0	country	Japan
0	country	United Kingdom
0	country	Netherlands
0	country	United States
0	country	Italy

# a quick hack in sql
OR <- sqldf::sqldf("
        select attribute 
        , value 
        , interest_group_yes
        , control_group_yes
        , interest_group_no
        , control_group_no
        from (
                
                select x.attribute 
                , x.value
                , x.interest_group_yes
                , x.control_group_yes
                , y.interest_group_total - x.interest_group_yes as interest_group_no
                , y.control_group_total - x.control_group_yes as control_group_no
                from (
                        select attribute
                        , value
                        , sum(case when vip = 1 then 1 else 0 end) as interest_group_yes
                        , sum(case when vip = 0 then 1 else 0 end) as control_group_yes
                        from dfOR
                        group by 1, 2
                ) x
                join (
                        select attribute
                        , sum(case when vip = 1 then 1 else 0 end) as interest_group_total
                        , sum(case when vip = 0 then 1 else 0 end) as control_group_total
                        from dfOR
                        group by 1
                ) y on x.attribute = y.attribute
                
        ) z
        where interest_group_yes >= 25
        and interest_group_no >= 25
        and control_group_yes >= 25
        and control_group_no >= 25
        "
)

Odds Ratio, Confidence Interval

Here’s a short list of key features that we extract from the categorical variables (with a lot of missing). We compare the VIP with non-VIP and we distill the following differences. For example, VIP customers are 3 times more likely to associate with the “discount” value from the “price” attribute (with 95% CI between 2.99 and 3.11 times as likely). They are also 2.7 times more likely to associate with the “Type 5” promotion, in addition 2.7 times more likely to be associated with “Product Z”.

Looking down the list, we continue to see more interesting side of our VIP customers, such as they are less likely to have registered previously (so that they are more likely to be new customers to us); they are more likely to have signed up with our loyalty program; they are more likely come from Eastern European countries, and etc. Interesting fact, industry data (or their profession) does not seem to have any difference in VIP status.

Above is a quick sql hack to get key insights from missing, messy data. Above code can be rewritten completely within R tidyverse, but I am just more customed to do it in sql (and it’s easier to see and debug everything in one sql query instead of multiple steps in R).

# Odds Ratio
ORplusCI <- OR %>%
        dplyr::mutate(OR = round((interest_group_yes / interest_group_no) / (control_group_yes / control_group_no), 4))

# Upper 95% CI
Upper_CI = with(ORplusCI, 
                exp(1) ^(log(OR) + 
                                 (1.96 * sqrt(1/interest_group_yes + 
                                                        1/interest_group_no + 
                                                        1/control_group_yes + 
                                                        1/control_group_no))))

# Lower 95% CI
Lower_CI = with(ORplusCI, 
                exp(1) ^(log(OR) - 
                                 (1.96 * sqrt(1/interest_group_yes + 
                                                      1/interest_group_no + 
                                                      1/control_group_yes + 
                                                      1/control_group_no))))

# combined
ORplusCI$Upper_CI <- round(Upper_CI, 4)
ORplusCI$Lower_CI <- round(Lower_CI, 4)

# display only significant results and sort by OR
ORplusCI <- ORplusCI %>%
        dplyr::filter(Upper_CI >1 & Lower_CI >1) %>%
        dplyr::mutate(OR = round(OR, 2),
                      Upper_CI = round(Upper_CI, 2),
                      Lower_CI = round(Lower_CI, 2)) %>%
        arrange(desc(OR))

ORplusCI %>% kable()

attribute	value	interest_group_yes	control_group_yes	interest_group_no	control_group_no	OR	Upper_CI	Lower_CI
price	discount	31363	40527	30961	122134	3.05	3.11	2.99
promotion_type	Type 5	26	25	62298	162636	2.71	4.70	1.57
product	Product Z	52240	107057	10084	55604	2.69	2.76	2.63
promotion_type	Type 4	51	61	62273	162600	2.18	3.17	1.50
loyalty_program	TRUE	136	179	62188	162482	1.99	2.48	1.59
product	Product L	116	160	62208	162501	1.89	2.41	1.49
promotion_type	Type 7	241	337	62083	162324	1.87	2.21	1.58
registered_previously	FALSE	41838	93579	20486	69082	1.51	1.54	1.48
country	Lithuania	97	175	62227	162486	1.45	1.86	1.13
country	Czech Republic	131	243	62193	162418	1.41	1.74	1.14
promotion_type	Type 3	4226	8387	58098	154274	1.34	1.39	1.29
country	Bulgaria	87	171	62237	162490	1.33	1.72	1.03
country	Poland	265	520	62059	162141	1.33	1.54	1.15
country	Ukraine	140	277	62184	162384	1.32	1.62	1.08
country	Austria	161	320	62163	162341	1.31	1.59	1.09
country	Finland	216	431	62108	162230	1.31	1.54	1.11
country	Hungary	109	224	62215	162437	1.27	1.60	1.01
country	Nigeria	117	243	62207	162418	1.26	1.57	1.01
signed_up_enabled	TRUE	9005	19191	53319	143470	1.26	1.30	1.23
country	Italy	536	1125	61788	161536	1.25	1.38	1.12
country	Norway	258	541	62066	162120	1.25	1.45	1.07
country	Viet Nam	212	442	62112	162219	1.25	1.48	1.06
country	Germany	920	2005	61404	160656	1.20	1.30	1.11
country	Sweden	308	677	62016	161984	1.19	1.36	1.04
promotion_type	Type 1	222	487	62102	162174	1.19	1.40	1.02
country	India	725	1679	61599	160982	1.13	1.23	1.03
loyalty_program	UNKNOWN	6558	15662	55766	146999	1.10	1.14	1.07
country	Singapore	2208	5412	60116	157249	1.07	1.12	1.01
country	United Kingdom	4392	10888	57932	151773	1.06	1.10	1.02

data_621_blog_3

Jimmy Ng

4/27/2020

Odds Ratio

Odds Ratio, Confidence Interval