Data Wrangling Project: Understanding our customers

Introduction

We were provided with retailer’s data to analyze and uncover any trends or answer business objectives. The dataset used in this project contains household level transactions over two years from a group of 5000 households who are shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories.

Objective

We want to be able to answer the following research questions:

Which categories of Kroger have highest? Is it somehow related to loyalty?
Do Demographic factors (e.g. household size, presence of children, income) affect customer spend?
Are customer in both the groups (loyalty and non-loyalty) different in terms of purchase behaviour? Do Demographic factors (e.g. household size, presence of children, income) affect being enrolled in the loyalty program?
How many customers are spending more over time? Less over time? Understand these customers. Of those customers who are spending more over time, which categories are growing at a faster rate? Calculate some metrics.

Approach

We analyze purchase behaviour and patterns w.r.t demographics, products and loyalty program, in terms of frequency of purchases, total products purchased and their types and amount spent.

As a part of our final submission we have divided our report into the following section:

Introduction
Packages Required
Data Preparation
Exploratory Data Analysis
Summary

Exploratory Data Analysis has been further divided into 3 sub-sections:

Analysis by demographics
Analysis by products
Analysis by loyalty program

Analysis’ use

Our analysis will help in understanding who are our customer and what are the top products. It might also be helpful in converting non-loyalty customers into loyalty ones by identifying the purchase behaviour and the gaps between two purchase journeys.

Packages Required

Following packages need to be loaded :

library(tidyverse) : to install packages readr,dplyr,tidyr,ggplot2,stringr
library(knitr) : to perform dynamic report generation
library(rmarkdown) : to create R Markdown documents
library(lubridate) : to manipulate dates related data
library(DT) : to display tables
library(ggpubr) : to manipulate plots
library(gridExtra) : to manipulate plots

library(tidyverse)

## Warning: package 'dplyr' was built under R version 3.5.2

library(stringr)
library(knitr)
library(rmarkdown)
library(lubridate)
library(DT)
library(ggpubr)
library(gridExtra)

Data Preparation

1. Data Import

Data Source 8451° Link to Original dataset

The data was provided by the retailer which they had collected by recording the purchases made by the customer.

transaction <- read_csv("data/5000_transactions.csv")
household <- read_csv("data/5000_households.csv")
product <- read_csv("data/5000_products.csv")

# renaming columns to more meaningful names
colnames(transaction) <- c("basket_id","household_id","date","product_id","spend_amt",
                           "total_units","store_region","week_num","year")
colnames(product) <- c("product_id","department","commodity","brand_type","is_natural")
colnames(household) <- c ('household_id', 'loyalty_card', 'age_range','marital_status', 
                          'income_range','homeowner','household_composition','household_size','children')

2. Data Description

5000 households’ purchases were tracked in the data, inluding the transactions made by the them, the product purchased. Resultantly, the dataset consists of three tables (products, transactions and households)

Description of each table:

Variables from Products Table:

product_id : Unique identifier for each product
department : Department to which the product belongs
commodity : Commodity type of the product
brand_type : Whether the brand is national or private
is_natural : Whether the product is natural or not

Variables from Transaction Table:

basket_id : Unique identifier for each basket (a basket contains all purchases made in 1 shopping journey)
household_id : Unique identifier for each household
date : Date of purchase
product_id : Unique identifier for each product
spend_amt : Amount (in dollars) spent for a particular purchase
total_units : Total units of items purchased
store_region : There are four regions (central, east, west south) where the stores are located
week_num : Week number for when the purchase was made (both years combined)
Year : Year when the purchase was made

Variables from Household Table:

household_id : Unique identifier for each household
loyalty_card : Whether the shopper has a store loyalty card or not
age_range : There are 7 levels of age range of shoppers (19-24, 24-53 and so on)
marital_status : Whether the shopper is married, single or the status is unknown
income_range : Income range of the shopper
homeowner: Whether the shopper is a homeowner or a renter
household_composition : Describes the composition of household; adults and kids are in the house
household_size : Total number of people in the household
children : Total number of children in the household

str(transaction)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1048575 obs. of  9 variables:
##  $ basket_id   : int  24 24 34 60 60 168 199 252 355 366 ...
##  $ household_id: int  1809 1809 1253 1595 1595 3393 4558 3491 4733 4369 ...
##  $ date        : chr  "3-Jan-16" "3-Jan-16" "3-Jan-16" "3-Jan-16" ...
##  $ product_id  : int  5817389 5829886 539501 5260099 4535660 5602916 5683532 980450 735993 138157 ...
##  $ spend_amt   : num  -1.5 -1.5 2.19 0.99 2.5 4.5 3.49 2.79 1 9.98 ...
##  $ total_units : int  -1 -1 1 1 2 1 1 1 1 2 ...
##  $ store_region: chr  "SOUTH" "SOUTH" "EAST" "WEST" ...
##  $ week_num    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ year        : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 9
##   .. ..$ BASKET_NUM : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ HSHD_NUM   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ PURCHASE_  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ PRODUCT_NUM: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ SPEND      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ UNITS      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ STORE_R    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ WEEK_NUM   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ YEAR       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

str(product)

## Classes 'tbl_df', 'tbl' and 'data.frame':    151141 obs. of  5 variables:
##  $ product_id: int  92993 93924 94272 94299 94594 94606 94613 95625 96152 96153 ...
##  $ department: chr  "NON-FOOD" "NON-FOOD" "NON-FOOD" "NON-FOOD" ...
##  $ commodity : chr  "PET" "PET" "PET" "PET" ...
##  $ brand_type: chr  "PRIVATE" "PRIVATE" "PRIVATE" "PRIVATE" ...
##  $ is_natural: chr  "N" "N" "N" "N" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 5
##   .. ..$ PRODUCT_NUM: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ DEPARTMENT : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ COMMODITY  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BRAND_TY   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ X5         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

str(household)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5000 obs. of  9 variables:
##  $ household_id         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ loyalty_card         : chr  "Y" "Y" "Y" "Y" ...
##  $ age_range            : chr  "55-64" "45-54" "55-64" "35-44" ...
##  $ marital_status       : chr  "Married" "Married" "Married" "Single" ...
##  $ income_range         : chr  "150K+" "150K+" "50-74K" "100-150K" ...
##  $ homeowner            : chr  "Homeowner" "Homeowner" "Homeowner" "Homeowner" ...
##  $ household_composition: chr  "2 Adults and Kids" "2 Adults and Kids" "2 Adults" "1 Adult and Kids" ...
##  $ household_size       : chr  "3" "5+" "2" "3" ...
##  $ children             : chr  "1" "3+" "NOT AVAILABLE" "2" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 9
##   .. ..$ HSHD_NUM        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ L               : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ AGE_RANGE       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ MARITAL         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ INCOME_RANGE    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ HOMEOWNER       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ HSHD_COMPOSITION: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ HH_SIZE         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ CHILDREN        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

3. Data Cleaning

Data cleaning was performed in order to remove any entry/disrepancy which might hinder our analysis. Each table was checked for the following:

Each variable has correct data type
No Duplicate values are present
No Missing values are present - completeness check
No White spaces around variables or values
If any column needs to be spread or gathered, or separated or united
Drop any redundant data

Each table was cleaned as a whole before moving to the next one.

1. TRANSACTION TABLE:

Cleaning the transaction table to ensure the above mentioned points:

Converting to appropriate datatype: ‘date’ to date format
Checking duplicate observations and removing them: no duplicates found

transaction$date <- as.character(transaction$date)
transaction$date <- dmy(transaction$date)
transaction$store_region <- as.factor(transaction$store_region)

# checking duplicate observations and removing them: no duplicates found
unique(transaction)

# completeness check for each variable for each table: transaction does not have NA
colSums(is.na(transaction))/nrow(transaction)

##    basket_id household_id         date   product_id    spend_amt 
##            0            0            0            0            0 
##  total_units store_region     week_num         year 
##            0            0            0            0

print(paste(sum(complete.cases(transaction)),"- total number of complete cases"))

## [1] "1048575 - total number of complete cases"

The cleaned table consists of all variables having correct datatypes, and no duplicate values in the data. Also, there are no missing values or incomplete entries in the table now.

2. HOUSEHOLD TABLE The column names in the household tables need to be fixed as the variable names are padded with empty spaces on with sides. Also, there are string values in place of NA which can be source of issue in our analysis in the future. Lastly, since read_csv does not create factor values by default we will convert our respective variables to factors. Therefore, actions performed are:

Removing spaces around values: using string trim
Detecting “null” “Unavailable” any other such values and setting them to NA
Converting to appropriate datatype
checking duplicate observations and removing them
completeness check for each variable for each table: there were missing values in the table

# removing spaces around values: using string trim
str_trim(household$household_size)
str_trim(household$income_range)
str_trim(household$age_range)
str_trim(household$household_composition)

# detecting "null" "Unavailable" any other such values and setting them to NA
household$age_range[household$age_range=="NOT AVAILABLE"|household$age_range=="null"] <- NA
household$marital_status[household$marital_status=="Unknown"|household$marital_status=="null"]<- NA
household$income_range[household$income_range=="null"] <- NA
household$homeowner[household$homeowner=="Unknown"|household$homeowner=="null"] <- NA
household$household_size[household$household_size=="null"] <- NA
household$children[household$children=="null"|household$children=="NOT AVAILABLE"] <- NA

# converting to appropriate datatype
#converting characters to factors for categorical variables
household$loyalty_card  <- as.factor(household$loyalty_card)
household$age_range  <- as.factor(household$age_range)
household$marital_status  <- as.factor(household$marital_status)
household$income_range  <- as.factor(household$income_range)
household$homeowner  <- as.factor(household$homeowner)
household$household_size  <- as.factor(household$household_size)
household$children  <- as.factor(household$children)
household$household_composition  <- as.factor(household$household_composition)

# checking duplicate observations and removing them
unique(household)

# checking proportion of missing values in each column
colSums(is.na(household))/nrow(household)
household <- na.omit(household)

# cleaned data has 1684 observations
dim(household)

# completeness check for table: 
print(paste(sum(complete.cases(household)),"- total number of complete cases"))

The final cleaned data has 9 variables with 1684 observations.

3. PRODUCT TABLE

Analyzing the Product table we found issue of missing values and incorrect datatypes. Post fixing both the issues we have a cleaned table

# converting to appropriate datatype
#converting characters to factors for categorical variables
product$department <- as.factor(product$department)
product$commodity <- as.factor(product$commodity)
product$brand_type <- as.factor(product$brand_type)
product$is_natural <- as.factor(product$is_natural)

# checking duplicate observations and removing them
unique(product)

# checking proportion of missing values in each column
colSums(is.na(product))/nrow(product)

# completeness check for each variable for each table: transaction does not have NA
print(paste(sum(complete.cases(product)),"- total number of complete cases"))

Finally, we join all he cleaned tables into a single table namely df.

## combining all tables to a cleaned base table to hereafter work upon: cleaned_data
joined1 = inner_join(transaction,household)
df <- inner_join(joined1,product)

4. Data Preview

The final cleaned data consists of 368,106 observations and 20 variables. The following things were observed in Data Cleaning:

The product and transactions table did not have any null, unknown values. But the household table had a significant amount of null, unknown & not available values. Out of the 5000 observations in the household table, after data cleaning we had 1684 observations.
There were no outliers observed as most of the variables are categorical.
However, the ‘spend’ variable had negative values and we need to investigate further on what it implies. (Our analysis says that it could imply a return or shopping made solely from loyalty points, but this has to be confirmed further)
We have to decide on how many observations we will use for our exploratory data analysis. We could do a random sampling that is a good representation of the population.

datatable(head(df,10))

str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    368106 obs. of  21 variables:
##  $ basket_id            : int  34 199 252 581 728 834 1094 1275 1494 1503 ...
##  $ household_id         : int  1253 4558 3491 4664 38 907 1279 2801 1944 3843 ...
##  $ date                 : Date, format: "2016-01-03" "2016-01-03" ...
##  $ product_id           : int  539501 5683532 980450 5945674 5189959 5423151 5031239 5802735 4667776 2259171 ...
##  $ spend_amt            : num  2.19 3.49 2.79 1 3.99 1.88 0.49 2.49 1.99 5.5 ...
##  $ total_units          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ store_region         : Factor w/ 4 levels "CENTRAL","EAST",..: 2 3 3 3 2 1 2 4 1 2 ...
##  $ week_num             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ year                 : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ loyalty_card         : Factor w/ 2 levels "N","Y": 2 2 1 2 2 2 2 2 2 2 ...
##  $ age_range            : Factor w/ 7 levels "19-24","25-34",..: 4 3 5 7 3 4 3 5 3 4 ...
##  $ marital_status       : Factor w/ 2 levels "Married","Single": 1 1 1 2 1 2 1 1 1 1 ...
##  $ income_range         : Factor w/ 6 levels "100-150K","150K+",..: 3 3 5 6 1 1 4 2 1 5 ...
##  $ homeowner            : Factor w/ 2 levels "Homeowner","Renter": 1 1 1 2 1 1 1 1 1 1 ...
##  $ household_composition: Factor w/ 7 levels "1 Adult","1 Adult and Kids",..: 4 4 4 2 4 2 4 4 4 4 ...
##  $ household_size       : Factor w/ 5 levels "1","2","3","4",..: 4 5 4 2 5 5 5 4 3 5 ...
##  $ children             : Factor w/ 3 levels "1","2","3+": 2 3 2 1 3 3 3 2 1 3 ...
##  $ department           : Factor w/ 3 levels "FOOD","NON-FOOD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ commodity            : Factor w/ 42 levels "ACTIVITY","ALCOHOL",..: 9 13 5 12 6 19 19 19 5 40 ...
##  $ brand_type           : Factor w/ 2 levels "NATIONAL","PRIVATE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ is_natural           : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

summary(df)

##    basket_id       household_id       date              product_id     
##  Min.   :     8   Min.   :   1   Min.   :2016-01-03   Min.   :      7  
##  1st Qu.:218783   1st Qu.:1188   1st Qu.:2016-07-01   1st Qu.: 102176  
##  Median :478534   Median :2207   Median :2016-12-30   Median : 926872  
##  Mean   :483520   Mean   :2281   Mean   :2016-12-30   Mean   :2267127  
##  3rd Qu.:739662   3rd Qu.:3406   3rd Qu.:2017-07-01   3rd Qu.:4914910  
##  Max.   :999985   Max.   :4979   Max.   :2017-12-30   Max.   :6791142  
##                                                                        
##    spend_amt        total_units       store_region       week_num    
##  Min.   :-14.990   Min.   :-10.000   CENTRAL: 84277   Min.   :  1.0  
##  1st Qu.:  1.670   1st Qu.:  1.000   EAST   :135550   1st Qu.: 26.0  
##  Median :  2.590   Median :  1.000   SOUTH  : 72144   Median : 52.0  
##  Mean   :  3.609   Mean   :  1.291   WEST   : 76135   Mean   : 52.4  
##  3rd Qu.:  4.040   3rd Qu.:  1.000                    3rd Qu.: 78.0  
##  Max.   :196.350   Max.   :246.000                    Max.   :104.0  
##                                                                      
##       year      loyalty_card age_range      marital_status  
##  Min.   :2016   N: 27345     19-24:  2097   Married:292092  
##  1st Qu.:2016   Y:340761     25-34: 21088   Single : 76014  
##  Median :2016                35-44: 85445                   
##  Mean   :2016                45-54:129946                   
##  3rd Qu.:2017                55-64: 85336                   
##  Max.   :2017                65-74: 32943                   
##                              75+  : 11251                   
##     income_range       homeowner            household_composition
##  100-150K :62129   Homeowner:331520   1 Adult          :     0   
##  150K+    :40385   Renter   : 36586   1 Adult and Kids : 76014   
##  35-49K   :75569                      2 Adults         :     0   
##  50-74K   :78874                      2 Adults and Kids:291736   
##  75-99K   :58402                      null             :   356   
##  UNDER 35K:52747                      Single Female    :     0   
##                                       Single Male      :     0   
##  household_size children       department    
##  1 :     0      1 :184645   FOOD    :321835  
##  2 : 42640      2 : 73540   NON-FOOD: 43183  
##  3 :160142      3+:109921   PHARMA  :  3088  
##  4 : 63434                                   
##  5+:101890                                   
##                                              
##                                              
##                 commodity         brand_type     is_natural
##  GROCERY STAPLE      : 96881   NATIONAL:255731   N:353982  
##  PRODUCE             : 48462   PRIVATE :112375   Y: 14124  
##  DAIRY               : 32576                               
##  FROZEN FOOD         : 25337                               
##  BEVERAGE - NON WATER: 20044                               
##  BAKERY              : 18987                               
##  (Other)             :125819

Exploratory Data Analysis

Analysis by Demographics

We perform analysis based on demographics to answer our research question : Do Demographic factors (e.g. household size, presence of children, income) affect customer spend? Engagement with certain categories?

1684 shortlisted households have spent 1.32MM in 2016 & 2017. Here, we look at the number of items purchased by each household in these 2 years. We see that most households have purchased around 250 items in the 2 years.

df %>%
  group_by(household_id) %>%
  summarise(number_of_items=sum(total_units)) %>%
  ggplot(aes(x=number_of_items))+
  geom_histogram(fill="darkorchid3",bins=200)+
  labs(title="Total items purchased by each household in 2 years",
       x="Number of items purchased", y="Count of households")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(hjust=1))

Al look at household size and total children shows that most households shopping at kroger have 3 members and 1 children (this trend was found by comparing both the plots). Also proportionally, the households with 3 members make up for major share of the sales both the year. Also food department holds the major share in each type household.

g1 <- ggplot(df,aes(x = df$household_size))+
  geom_bar(fill = "indianred4")+
  labs(title="Who are our customers?
       (household size))",
       x="Household size", y="Transaction count of households")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(hjust=1))

g2 <- ggplot(df,aes(x=df$children))+
  geom_bar(fill = "indianred4")+
  labs(title="Who are our customers? 
       (total children))",
       x="Number of children", y="Transaction count by number of children")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(hjust=1))

ggarrange(g1, g2,
          ncol = 2, 
          nrow = 1)

df %>% 
  group_by(household_size,year) %>% 
  ggplot(aes(household_size,spend_amt))+
  geom_bar(stat="summary", fun.y="sum",fill = "indianred4")+
  facet_wrap(~year)+
  scale_y_continuous(NULL, labels = scales::dollar)+
  labs(title="Spend by Household Size: a comparison of the two years",
       x="Household Size", y="Total amount spent")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(hjust=1))

df %>% 
  group_by(household_size,year) %>% 
  ggplot(aes(household_size,spend_amt,fill = department))+
  geom_bar(stat="summary", fun.y="sum")+
  scale_y_continuous(NULL, labels = scales::dollar)+
  labs(title="Spend by Household Size : top performing departments",
       x="Household Size", y="Total amount spent")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(hjust=1))

A look at income range of our customers and property own status shows

g3 <- ggplot(df,aes(x = df$homeowner))+
  geom_bar(fill = "green4")+
  labs(title="Who are our customers?
       (homeowners)",
       x="Homeowner Type", y="Transaction count by homeowner")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(angle=45, hjust=1))


g4 <- ggplot(df,aes(x = df$income_range))+
  geom_bar(fill = "green4")+
  labs(title="Who are our customers?
       (income range)")+
    labs(title="Who are our customers?
       (Income range)",
       x="Income Range", y="Transaction count by income range")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(angle=45, hjust=1))


ggarrange(g3, g4,
          ncol = 2, 
          nrow = 1)

g5 <- df %>% 
  group_by(homeowner,year) %>% 
  ggplot(aes(homeowner,spend_amt))+
  geom_bar(stat="summary", fun.y="sum",fill = "green4")+
  facet_wrap(~year)+
  scale_y_continuous(NULL, labels = scales::dollar)+
  labs(title="Spend by Homeowner 
       (homeowner/rental)",
       x="Homeowner Type", y="Total amount spent")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(angle=45, hjust=1))

g6 <- df %>% 
  group_by(income_range,year) %>% 
  ggplot(aes(income_range,spend_amt))+
  geom_bar(stat="summary", fun.y="sum",fill = "green4")+
  facet_wrap(~year)+
  scale_y_continuous(NULL, labels = scales::dollar)+
  labs(title="Spend by Income Range ",x="Income Range")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(angle=45, hjust=1))

ggarrange(g5, g6,
          ncol = 2, 
          nrow = 1)

df %>% 
  group_by(income_range,year) %>% 
  ggplot(aes(income_range,spend_amt, fill=department))+
  geom_bar(stat="summary", fun.y="sum")+
  scale_y_continuous(NULL, labels = scales::dollar)+
  labs(title="Spend by Income Range ",x="Income Range")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(angle=45, hjust=1))

As seen below, sales by loyalty customers made up for around 92.5% of the total sales. This trend was consistent in both the years.

g7 <- ggplot(df,aes(x = df$loyalty_card,fill=loyalty_card))+
  geom_bar(position = "stack")+
  labs(title="Who are our customers?
       (loyalty program))",
       x="Loyalty card", y="Transaction countby Loyalty Card")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(hjust=1))


g8 <- df %>% 
  group_by(household_size,year) %>% 
  ggplot(aes(year,spend_amt, fill=loyalty_card))+
  geom_bar(stat="summary", fun.y="sum")+
  scale_y_continuous(NULL, labels = scales::dollar)+
  labs(title="Spend by loyalty card: 
       a comparison of the two years",
       x="Loyalty card", y="Total amount spent")+
  theme(text = element_text(size=9),
        axis.text.x = element_text(hjust=1))

ggarrange(g7, g8,
          ncol = 2, 
          nrow = 1)

Analysis by products

Purpose: analysing the key products and departments of sales for kroger.

Insights from the plot:

As we can from the plot above, Food department is the clear winner at kroger, making up for for almost 82% of all sale.
Within Food department, the top products are Grocery Staple, Produce, Dairy and Frozen food.
2 categories from the Non-food department also fall in the top 10, they are Household items and Personal Care items.
Share of loyalty sales improved in 2017 across all departments.

Analysis by Loyalty program

Purpose: Loyalty versus non-loyalty spend analysis

Here, we are analysing the spend and the purchase pattern of the loyalty and the non-loyalty customers over two years of data to answer our research question Are customer in both the groups (loyalty and non-loyalty) different in terms of spend & purchase behaviour?

Customers enrolled in loyalty program are loyalty customers.

To analyze this we have formed two groups based on whether the customer is enrolled in loyalty program or not.Each group of customer is grouped by month of purchase, and the total spend for each month is analyzed over the year.

We have used the following color coding: purple = loyalty , light purple/black = non-loyalty

# Loyalty_spend2016 mothwise 

L_2016 <- df %>%
  filter(year==2016 & loyalty_card=='Y') %>%
  group_by(household_id, purchase_month) %>%
  summarize(Total_spend=sum(spend_amt))

# Non-Loyalty_spend2016 monthwise 
NL_2016 <- df %>%
  filter(year==2016 & loyalty_card=='N') %>%
  group_by(household_id, purchase_month) %>%
  summarize(Total_spend=sum(spend_amt))

# plot for Monthwise Spend by Loyalty & Non-Loyalty customers in 2016
grid.arrange(L_2016 %>% 
 ggplot(aes(purchase_month, Total_spend))+ 
 geom_boxplot(fill="purple")+
 labs(title="Monthwise Spend by Loyalty customers in 2016",
 x="Month", y="Total spend ")+
 scale_y_continuous(breaks = seq(0, 300, by=100), limits=c(0,300),labels = scales::dollar), NL_2016   %>% 
  ggplot(aes(purchase_month, Total_spend))+ 
  geom_boxplot(fill="thistle")+
  labs(title="Monthwise Spend by Non-Loyalty customers in 2016",
       x="Month", y="Total spend ")+
  scale_y_continuous(breaks = seq(0, 300, by=100), limits=c(0,300),labels = scales::dollar)
, nrow =2)

# plot for Monthwise Spend by Loyalty & Non-Loyalty customers in 2017

Insights from the plot:

The median monthly spend of customers in the Loyalty program is almost 60% times more than the ones who are not enrolled in the loyalty program (this is evident from the box plot & is supported by the summary statistics).
There are much more customers in the loyalty program who spend way more than the usual spend. However, this spend behaviour is observed relatively less in non-loyalty customers.(Measure : Evident from the outliers in the plot).
It is very evident that the monthly spend among loyalty customers is way more than that in the other group(non-loyalty).
The median monthly spend among both the groups is almost consistent.
There is an increase in the maximum monthly spent in both the groups from 2016 to 2017 (good news for the retailer)(source: summary statistics).

Purpose: Loyalty versus non-loyalty purchase analysis:

Here,each group of customer is grouped by month of purchase and the total units of each product type that they purchased for that particular month.This is analyzed for each year.

############ Loyalty_purchase2016 monthwise ##########

LP_2016 <- df %>%
  filter(year==2016 & loyalty_card=='Y') %>%
  group_by(product_id,household_id, purchase_month) %>%
  summarize(Total_units=sum(total_units))

############ Non-Loyalty_purchase2016 monthwise ##########

NLP_2016 <- df %>%
  filter(year==2016 & loyalty_card=='N') %>%
  group_by(product_id,household_id, purchase_month) %>%
  summarize(Total_units=sum(total_units))

# summary(LP_2016)
# summary(NLP_2016)

# plot for Month-wise product units purchased by Loyalty & Non-Loyalty customers in 2017

Insights from the plot:

The count of units purchased by loyalty customers is higher than that purchased by the other group (evident from summary statistics and plot).
It is observed that the count of units of a products purchased by the loyalty group has risen in 2017 (clear from the plot).

Summary

We had started our project with some research questions in mind. Upon our analysis we were able to find the answers to them. We also uncovered some trends and insights during this process.

Which categories of Kroger have highest share? Is it somehow related to loyalty?

We found that Food department has the major share due to categories such as Grocery Staple, Produce, Dairy and Frozen food which have highest sales. We also found that two Non-food product categories (Household and Personal Care) are in top 10 categories too.
We found that loyalty customers share improved slightly in 2017.

Do Demographic factors (e.g. household size, presence of children, income) affect customer spend?

Household with 3 members (and 1 child) have the biggest share followed by households with 5+members. Also, customers with income range 34-74K make up for biggest share at the retailer.

Are customer in both the groups (loyalty and non-loyalty) different in terms of purchase behaviour?

Yes,customer in both the groups (loyalty and non-loyalty) are different in terms of purchase behaviour.The count of units purchased monthly by loyalty customers is higher than that purchased by the non-loyalty ones. The count of units of a products purchased by the loyalty group has risen in 2017, this can be researched further.

How many customers are spending more over time? Less over time? Understand these customers. Of those customers who are spending more over time, which categories are growing at a faster rate? Calculate some metrics.

The median monthly spend of customers in the Loyalty program is higher than the ones who are not enrolled in the loyalty program.
There are much more customers in the loyalty program who spend way more than the usual spend.
There is an increase in the maximum monthly spent in both the groups from 2016 to 2017, this can be researched further.

The next thing that can be researched is that why non-loyalty customers are not willing to enroll in the loyalty program? The retailer can design some surveys around this and encourage people to participate in these surveys, so that the retailer can understand the reasons of non-loyalty customers not enrolling in the loyalty program.

Data Wrangling Project: Understanding our customers

Shagun Narwal and Apoorva Rautela

December 09, 2018

Introduction

Packages Required

Data Preparation

1. Data Import

2. Data Description

3. Data Cleaning

4. Data Preview

Exploratory Data Analysis

Analysis by Demographics

Analysis by products

Purpose: analysing the key products and departments of sales for kroger.

Insights from the plot:

Analysis by Loyalty program

Purpose: Loyalty versus non-loyalty spend analysis

Insights from the plot:

Purpose: Loyalty versus non-loyalty purchase analysis:

Insights from the plot:

Summary