We were provided with retailer’s data to analyze and uncover any trends or answer business objectives. The dataset used in this project contains household level transactions over two years from a group of 5000 households who are shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories.
Objective
We want to be able to answer the following research questions:
Approach
We analyze purchase behaviour and patterns w.r.t demographics, products and loyalty program, in terms of frequency of purchases, total products purchased and their types and amount spent.
As a part of our final submission we have divided our report into the following section:
Exploratory Data Analysis has been further divided into 3 sub-sections:
Analysis’ use
Our analysis will help in understanding who are our customer and what are the top products. It might also be helpful in converting non-loyalty customers into loyalty ones by identifying the purchase behaviour and the gaps between two purchase journeys.
Following packages need to be loaded :
library(tidyverse)
## Warning: package 'dplyr' was built under R version 3.5.2
library(stringr)
library(knitr)
library(rmarkdown)
library(lubridate)
library(DT)
library(ggpubr)
library(gridExtra)
Data Source 8451° Link to Original dataset
The data was provided by the retailer which they had collected by recording the purchases made by the customer.
transaction <- read_csv("data/5000_transactions.csv")
household <- read_csv("data/5000_households.csv")
product <- read_csv("data/5000_products.csv")
# renaming columns to more meaningful names
colnames(transaction) <- c("basket_id","household_id","date","product_id","spend_amt",
"total_units","store_region","week_num","year")
colnames(product) <- c("product_id","department","commodity","brand_type","is_natural")
colnames(household) <- c ('household_id', 'loyalty_card', 'age_range','marital_status',
'income_range','homeowner','household_composition','household_size','children')
5000 households’ purchases were tracked in the data, inluding the transactions made by the them, the product purchased. Resultantly, the dataset consists of three tables (products, transactions and households)
Description of each table:
Variables from Products Table:
Variables from Transaction Table:
Variables from Household Table:
str(transaction)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1048575 obs. of 9 variables:
## $ basket_id : int 24 24 34 60 60 168 199 252 355 366 ...
## $ household_id: int 1809 1809 1253 1595 1595 3393 4558 3491 4733 4369 ...
## $ date : chr "3-Jan-16" "3-Jan-16" "3-Jan-16" "3-Jan-16" ...
## $ product_id : int 5817389 5829886 539501 5260099 4535660 5602916 5683532 980450 735993 138157 ...
## $ spend_amt : num -1.5 -1.5 2.19 0.99 2.5 4.5 3.49 2.79 1 9.98 ...
## $ total_units : int -1 -1 1 1 2 1 1 1 1 2 ...
## $ store_region: chr "SOUTH" "SOUTH" "EAST" "WEST" ...
## $ week_num : int 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 9
## .. ..$ BASKET_NUM : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ HSHD_NUM : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ PURCHASE_ : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ PRODUCT_NUM: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ SPEND : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ UNITS : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ STORE_R : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ WEEK_NUM : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ YEAR : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
str(product)
## Classes 'tbl_df', 'tbl' and 'data.frame': 151141 obs. of 5 variables:
## $ product_id: int 92993 93924 94272 94299 94594 94606 94613 95625 96152 96153 ...
## $ department: chr "NON-FOOD" "NON-FOOD" "NON-FOOD" "NON-FOOD" ...
## $ commodity : chr "PET" "PET" "PET" "PET" ...
## $ brand_type: chr "PRIVATE" "PRIVATE" "PRIVATE" "PRIVATE" ...
## $ is_natural: chr "N" "N" "N" "N" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 5
## .. ..$ PRODUCT_NUM: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ DEPARTMENT : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ COMMODITY : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BRAND_TY : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ X5 : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
str(household)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of 9 variables:
## $ household_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ loyalty_card : chr "Y" "Y" "Y" "Y" ...
## $ age_range : chr "55-64" "45-54" "55-64" "35-44" ...
## $ marital_status : chr "Married" "Married" "Married" "Single" ...
## $ income_range : chr "150K+" "150K+" "50-74K" "100-150K" ...
## $ homeowner : chr "Homeowner" "Homeowner" "Homeowner" "Homeowner" ...
## $ household_composition: chr "2 Adults and Kids" "2 Adults and Kids" "2 Adults" "1 Adult and Kids" ...
## $ household_size : chr "3" "5+" "2" "3" ...
## $ children : chr "1" "3+" "NOT AVAILABLE" "2" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 9
## .. ..$ HSHD_NUM : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ L : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ AGE_RANGE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ MARITAL : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ INCOME_RANGE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ HOMEOWNER : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ HSHD_COMPOSITION: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ HH_SIZE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ CHILDREN : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
Data cleaning was performed in order to remove any entry/disrepancy which might hinder our analysis. Each table was checked for the following:
Each table was cleaned as a whole before moving to the next one.
1. TRANSACTION TABLE:
Cleaning the transaction table to ensure the above mentioned points:
transaction$date <- as.character(transaction$date)
transaction$date <- dmy(transaction$date)
transaction$store_region <- as.factor(transaction$store_region)
# checking duplicate observations and removing them: no duplicates found
unique(transaction)
# completeness check for each variable for each table: transaction does not have NA
colSums(is.na(transaction))/nrow(transaction)
## basket_id household_id date product_id spend_amt
## 0 0 0 0 0
## total_units store_region week_num year
## 0 0 0 0
print(paste(sum(complete.cases(transaction)),"- total number of complete cases"))
## [1] "1048575 - total number of complete cases"
The cleaned table consists of all variables having correct datatypes, and no duplicate values in the data. Also, there are no missing values or incomplete entries in the table now.
2. HOUSEHOLD TABLE The column names in the household tables need to be fixed as the variable names are padded with empty spaces on with sides. Also, there are string values in place of NA which can be source of issue in our analysis in the future. Lastly, since read_csv does not create factor values by default we will convert our respective variables to factors. Therefore, actions performed are:
# removing spaces around values: using string trim
str_trim(household$household_size)
str_trim(household$income_range)
str_trim(household$age_range)
str_trim(household$household_composition)
# detecting "null" "Unavailable" any other such values and setting them to NA
household$age_range[household$age_range=="NOT AVAILABLE"|household$age_range=="null"] <- NA
household$marital_status[household$marital_status=="Unknown"|household$marital_status=="null"]<- NA
household$income_range[household$income_range=="null"] <- NA
household$homeowner[household$homeowner=="Unknown"|household$homeowner=="null"] <- NA
household$household_size[household$household_size=="null"] <- NA
household$children[household$children=="null"|household$children=="NOT AVAILABLE"] <- NA
# converting to appropriate datatype
#converting characters to factors for categorical variables
household$loyalty_card <- as.factor(household$loyalty_card)
household$age_range <- as.factor(household$age_range)
household$marital_status <- as.factor(household$marital_status)
household$income_range <- as.factor(household$income_range)
household$homeowner <- as.factor(household$homeowner)
household$household_size <- as.factor(household$household_size)
household$children <- as.factor(household$children)
household$household_composition <- as.factor(household$household_composition)
# checking duplicate observations and removing them
unique(household)
# checking proportion of missing values in each column
colSums(is.na(household))/nrow(household)
household <- na.omit(household)
# cleaned data has 1684 observations
dim(household)
# completeness check for table:
print(paste(sum(complete.cases(household)),"- total number of complete cases"))
The final cleaned data has 9 variables with 1684 observations.
3. PRODUCT TABLE
Analyzing the Product table we found issue of missing values and incorrect datatypes. Post fixing both the issues we have a cleaned table
# converting to appropriate datatype
#converting characters to factors for categorical variables
product$department <- as.factor(product$department)
product$commodity <- as.factor(product$commodity)
product$brand_type <- as.factor(product$brand_type)
product$is_natural <- as.factor(product$is_natural)
# checking duplicate observations and removing them
unique(product)
# checking proportion of missing values in each column
colSums(is.na(product))/nrow(product)
# completeness check for each variable for each table: transaction does not have NA
print(paste(sum(complete.cases(product)),"- total number of complete cases"))
Finally, we join all he cleaned tables into a single table namely df.
## combining all tables to a cleaned base table to hereafter work upon: cleaned_data
joined1 = inner_join(transaction,household)
df <- inner_join(joined1,product)
The final cleaned data consists of 368,106 observations and 20 variables. The following things were observed in Data Cleaning:
datatable(head(df,10))
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 368106 obs. of 21 variables:
## $ basket_id : int 34 199 252 581 728 834 1094 1275 1494 1503 ...
## $ household_id : int 1253 4558 3491 4664 38 907 1279 2801 1944 3843 ...
## $ date : Date, format: "2016-01-03" "2016-01-03" ...
## $ product_id : int 539501 5683532 980450 5945674 5189959 5423151 5031239 5802735 4667776 2259171 ...
## $ spend_amt : num 2.19 3.49 2.79 1 3.99 1.88 0.49 2.49 1.99 5.5 ...
## $ total_units : int 1 1 1 1 1 1 1 1 1 1 ...
## $ store_region : Factor w/ 4 levels "CENTRAL","EAST",..: 2 3 3 3 2 1 2 4 1 2 ...
## $ week_num : int 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ loyalty_card : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 2 2 2 ...
## $ age_range : Factor w/ 7 levels "19-24","25-34",..: 4 3 5 7 3 4 3 5 3 4 ...
## $ marital_status : Factor w/ 2 levels "Married","Single": 1 1 1 2 1 2 1 1 1 1 ...
## $ income_range : Factor w/ 6 levels "100-150K","150K+",..: 3 3 5 6 1 1 4 2 1 5 ...
## $ homeowner : Factor w/ 2 levels "Homeowner","Renter": 1 1 1 2 1 1 1 1 1 1 ...
## $ household_composition: Factor w/ 7 levels "1 Adult","1 Adult and Kids",..: 4 4 4 2 4 2 4 4 4 4 ...
## $ household_size : Factor w/ 5 levels "1","2","3","4",..: 4 5 4 2 5 5 5 4 3 5 ...
## $ children : Factor w/ 3 levels "1","2","3+": 2 3 2 1 3 3 3 2 1 3 ...
## $ department : Factor w/ 3 levels "FOOD","NON-FOOD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ commodity : Factor w/ 42 levels "ACTIVITY","ALCOHOL",..: 9 13 5 12 6 19 19 19 5 40 ...
## $ brand_type : Factor w/ 2 levels "NATIONAL","PRIVATE": 1 1 1 1 1 1 1 1 1 1 ...
## $ is_natural : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
summary(df)
## basket_id household_id date product_id
## Min. : 8 Min. : 1 Min. :2016-01-03 Min. : 7
## 1st Qu.:218783 1st Qu.:1188 1st Qu.:2016-07-01 1st Qu.: 102176
## Median :478534 Median :2207 Median :2016-12-30 Median : 926872
## Mean :483520 Mean :2281 Mean :2016-12-30 Mean :2267127
## 3rd Qu.:739662 3rd Qu.:3406 3rd Qu.:2017-07-01 3rd Qu.:4914910
## Max. :999985 Max. :4979 Max. :2017-12-30 Max. :6791142
##
## spend_amt total_units store_region week_num
## Min. :-14.990 Min. :-10.000 CENTRAL: 84277 Min. : 1.0
## 1st Qu.: 1.670 1st Qu.: 1.000 EAST :135550 1st Qu.: 26.0
## Median : 2.590 Median : 1.000 SOUTH : 72144 Median : 52.0
## Mean : 3.609 Mean : 1.291 WEST : 76135 Mean : 52.4
## 3rd Qu.: 4.040 3rd Qu.: 1.000 3rd Qu.: 78.0
## Max. :196.350 Max. :246.000 Max. :104.0
##
## year loyalty_card age_range marital_status
## Min. :2016 N: 27345 19-24: 2097 Married:292092
## 1st Qu.:2016 Y:340761 25-34: 21088 Single : 76014
## Median :2016 35-44: 85445
## Mean :2016 45-54:129946
## 3rd Qu.:2017 55-64: 85336
## Max. :2017 65-74: 32943
## 75+ : 11251
## income_range homeowner household_composition
## 100-150K :62129 Homeowner:331520 1 Adult : 0
## 150K+ :40385 Renter : 36586 1 Adult and Kids : 76014
## 35-49K :75569 2 Adults : 0
## 50-74K :78874 2 Adults and Kids:291736
## 75-99K :58402 null : 356
## UNDER 35K:52747 Single Female : 0
## Single Male : 0
## household_size children department
## 1 : 0 1 :184645 FOOD :321835
## 2 : 42640 2 : 73540 NON-FOOD: 43183
## 3 :160142 3+:109921 PHARMA : 3088
## 4 : 63434
## 5+:101890
##
##
## commodity brand_type is_natural
## GROCERY STAPLE : 96881 NATIONAL:255731 N:353982
## PRODUCE : 48462 PRIVATE :112375 Y: 14124
## DAIRY : 32576
## FROZEN FOOD : 25337
## BEVERAGE - NON WATER: 20044
## BAKERY : 18987
## (Other) :125819
We perform analysis based on demographics to answer our research question : Do Demographic factors (e.g. household size, presence of children, income) affect customer spend? Engagement with certain categories?
1684 shortlisted households have spent 1.32MM in 2016 & 2017. Here, we look at the number of items purchased by each household in these 2 years. We see that most households have purchased around 250 items in the 2 years.
df %>%
group_by(household_id) %>%
summarise(number_of_items=sum(total_units)) %>%
ggplot(aes(x=number_of_items))+
geom_histogram(fill="darkorchid3",bins=200)+
labs(title="Total items purchased by each household in 2 years",
x="Number of items purchased", y="Count of households")+
theme(text = element_text(size=9),
axis.text.x = element_text(hjust=1))
Al look at household size and total children shows that most households shopping at kroger have 3 members and 1 children (this trend was found by comparing both the plots). Also proportionally, the households with 3 members make up for major share of the sales both the year. Also food department holds the major share in each type household.
g1 <- ggplot(df,aes(x = df$household_size))+
geom_bar(fill = "indianred4")+
labs(title="Who are our customers?
(household size))",
x="Household size", y="Transaction count of households")+
theme(text = element_text(size=9),
axis.text.x = element_text(hjust=1))
g2 <- ggplot(df,aes(x=df$children))+
geom_bar(fill = "indianred4")+
labs(title="Who are our customers?
(total children))",
x="Number of children", y="Transaction count by number of children")+
theme(text = element_text(size=9),
axis.text.x = element_text(hjust=1))
ggarrange(g1, g2,
ncol = 2,
nrow = 1)
df %>%
group_by(household_size,year) %>%
ggplot(aes(household_size,spend_amt))+
geom_bar(stat="summary", fun.y="sum",fill = "indianred4")+
facet_wrap(~year)+
scale_y_continuous(NULL, labels = scales::dollar)+
labs(title="Spend by Household Size: a comparison of the two years",
x="Household Size", y="Total amount spent")+
theme(text = element_text(size=9),
axis.text.x = element_text(hjust=1))
df %>%
group_by(household_size,year) %>%
ggplot(aes(household_size,spend_amt,fill = department))+
geom_bar(stat="summary", fun.y="sum")+
scale_y_continuous(NULL, labels = scales::dollar)+
labs(title="Spend by Household Size : top performing departments",
x="Household Size", y="Total amount spent")+
theme(text = element_text(size=9),
axis.text.x = element_text(hjust=1))
A look at income range of our customers and property own status shows
g3 <- ggplot(df,aes(x = df$homeowner))+
geom_bar(fill = "green4")+
labs(title="Who are our customers?
(homeowners)",
x="Homeowner Type", y="Transaction count by homeowner")+
theme(text = element_text(size=9),
axis.text.x = element_text(angle=45, hjust=1))
g4 <- ggplot(df,aes(x = df$income_range))+
geom_bar(fill = "green4")+
labs(title="Who are our customers?
(income range)")+
labs(title="Who are our customers?
(Income range)",
x="Income Range", y="Transaction count by income range")+
theme(text = element_text(size=9),
axis.text.x = element_text(angle=45, hjust=1))
ggarrange(g3, g4,
ncol = 2,
nrow = 1)
g5 <- df %>%
group_by(homeowner,year) %>%
ggplot(aes(homeowner,spend_amt))+
geom_bar(stat="summary", fun.y="sum",fill = "green4")+
facet_wrap(~year)+
scale_y_continuous(NULL, labels = scales::dollar)+
labs(title="Spend by Homeowner
(homeowner/rental)",
x="Homeowner Type", y="Total amount spent")+
theme(text = element_text(size=9),
axis.text.x = element_text(angle=45, hjust=1))
g6 <- df %>%
group_by(income_range,year) %>%
ggplot(aes(income_range,spend_amt))+
geom_bar(stat="summary", fun.y="sum",fill = "green4")+
facet_wrap(~year)+
scale_y_continuous(NULL, labels = scales::dollar)+
labs(title="Spend by Income Range ",x="Income Range")+
theme(text = element_text(size=9),
axis.text.x = element_text(angle=45, hjust=1))
ggarrange(g5, g6,
ncol = 2,
nrow = 1)
df %>%
group_by(income_range,year) %>%
ggplot(aes(income_range,spend_amt, fill=department))+
geom_bar(stat="summary", fun.y="sum")+
scale_y_continuous(NULL, labels = scales::dollar)+
labs(title="Spend by Income Range ",x="Income Range")+
theme(text = element_text(size=9),
axis.text.x = element_text(angle=45, hjust=1))
As seen below, sales by loyalty customers made up for around 92.5% of the total sales. This trend was consistent in both the years.
g7 <- ggplot(df,aes(x = df$loyalty_card,fill=loyalty_card))+
geom_bar(position = "stack")+
labs(title="Who are our customers?
(loyalty program))",
x="Loyalty card", y="Transaction countby Loyalty Card")+
theme(text = element_text(size=9),
axis.text.x = element_text(hjust=1))
g8 <- df %>%
group_by(household_size,year) %>%
ggplot(aes(year,spend_amt, fill=loyalty_card))+
geom_bar(stat="summary", fun.y="sum")+
scale_y_continuous(NULL, labels = scales::dollar)+
labs(title="Spend by loyalty card:
a comparison of the two years",
x="Loyalty card", y="Total amount spent")+
theme(text = element_text(size=9),
axis.text.x = element_text(hjust=1))
ggarrange(g7, g8,
ncol = 2,
nrow = 1)
Here, we are analysing the spend and the purchase pattern of the loyalty and the non-loyalty customers over two years of data to answer our research question Are customer in both the groups (loyalty and non-loyalty) different in terms of spend & purchase behaviour?
Customers enrolled in loyalty program are loyalty customers.
To analyze this we have formed two groups based on whether the customer is enrolled in loyalty program or not.Each group of customer is grouped by month of purchase, and the total spend for each month is analyzed over the year.
We have used the following color coding: purple = loyalty , light purple/black = non-loyalty
# Loyalty_spend2016 mothwise
L_2016 <- df %>%
filter(year==2016 & loyalty_card=='Y') %>%
group_by(household_id, purchase_month) %>%
summarize(Total_spend=sum(spend_amt))
# Non-Loyalty_spend2016 monthwise
NL_2016 <- df %>%
filter(year==2016 & loyalty_card=='N') %>%
group_by(household_id, purchase_month) %>%
summarize(Total_spend=sum(spend_amt))
# plot for Monthwise Spend by Loyalty & Non-Loyalty customers in 2016
grid.arrange(L_2016 %>%
ggplot(aes(purchase_month, Total_spend))+
geom_boxplot(fill="purple")+
labs(title="Monthwise Spend by Loyalty customers in 2016",
x="Month", y="Total spend ")+
scale_y_continuous(breaks = seq(0, 300, by=100), limits=c(0,300),labels = scales::dollar), NL_2016 %>%
ggplot(aes(purchase_month, Total_spend))+
geom_boxplot(fill="thistle")+
labs(title="Monthwise Spend by Non-Loyalty customers in 2016",
x="Month", y="Total spend ")+
scale_y_continuous(breaks = seq(0, 300, by=100), limits=c(0,300),labels = scales::dollar)
, nrow =2)
# plot for Monthwise Spend by Loyalty & Non-Loyalty customers in 2017
Here,each group of customer is grouped by month of purchase and the total units of each product type that they purchased for that particular month.This is analyzed for each year.
############ Loyalty_purchase2016 monthwise ##########
LP_2016 <- df %>%
filter(year==2016 & loyalty_card=='Y') %>%
group_by(product_id,household_id, purchase_month) %>%
summarize(Total_units=sum(total_units))
############ Non-Loyalty_purchase2016 monthwise ##########
NLP_2016 <- df %>%
filter(year==2016 & loyalty_card=='N') %>%
group_by(product_id,household_id, purchase_month) %>%
summarize(Total_units=sum(total_units))
# summary(LP_2016)
# summary(NLP_2016)
# plot for Month-wise product units purchased by Loyalty & Non-Loyalty customers in 2017
We had started our project with some research questions in mind. Upon our analysis we were able to find the answers to them. We also uncovered some trends and insights during this process.
Which categories of Kroger have highest share? Is it somehow related to loyalty?
We found that Food department has the major share due to categories such as Grocery Staple, Produce, Dairy and Frozen food which have highest sales. We also found that two Non-food product categories (Household and Personal Care) are in top 10 categories too.
We found that loyalty customers share improved slightly in 2017.
Do Demographic factors (e.g. household size, presence of children, income) affect customer spend?
Household with 3 members (and 1 child) have the biggest share followed by households with 5+members. Also, customers with income range 34-74K make up for biggest share at the retailer.
Are customer in both the groups (loyalty and non-loyalty) different in terms of purchase behaviour?
Yes,customer in both the groups (loyalty and non-loyalty) are different in terms of purchase behaviour.The count of units purchased monthly by loyalty customers is higher than that purchased by the non-loyalty ones. The count of units of a products purchased by the loyalty group has risen in 2017, this can be researched further.
How many customers are spending more over time? Less over time? Understand these customers. Of those customers who are spending more over time, which categories are growing at a faster rate? Calculate some metrics.
The next thing that can be researched is that why non-loyalty customers are not willing to enroll in the loyalty program? The retailer can design some surveys around this and encourage people to participate in these surveys, so that the retailer can understand the reasons of non-loyalty customers not enrolling in the loyalty program.