\(\color{#29A887}{\text{Introduction}}\)

The core of a business is to learn about its customers.

A company can never grow larger nor further if the management levels know nothing about its customer or competitor. Especially when their competitor stay close to the trend of society and always keep an eye on the customers. A company needs to learn from the behavior of the current customers, which might be the solid and easiest information they can collect, in order to keep the current customers around, attract more new customers and improve itself. So does a retailer.

To help the retailer have better understanding of its customer, this project mainly provided an analysis on the spending trends of the customer. Households with different levels of incoming will have their distinct needs. Meanwhile, they have no competition against each other. The retailer is supposed to react with groups with different ranges of income respectively. Based on the range of income of the 2500 households, we dvided them into three customer groups:

  • High-end Customer Group: income range above 100K;
  • Middle-end Customer Group: income range between 50K and 99K;
  • Low-end Customer Group: income range below 49K.

Each question been asked and explained in this project will be explicitly spread in these three customer groups.

\(\color{#5097A4}{\text{Problem Statement}}\)

To obtain the specific details which will help the retailer learn more about its customers and improve themselves to perform better in the future, several questions have been raised up with and will be explained and solved in this project are listed as follows:

  • According to the spending trends of customers:
    • How many customers are spending more over time?
    • How many customers are spending less over time?
  • What trend can be observed in the following factors:
    • Loyalty
    • Brand Type Selection
    • Store Selection
      • Customer Feature in store selection_Household Composition
      • Customer Feature in store selection_Loyalty
  • What kind of suggestion or inspiration we may generated from this analysis and provided to our client as well as audiences:
    • What are the trends we observed and what are the meaning behind these
    • Any interesting findings in this analysis
    • Future aciton or promotion on its way

\(\color{#5097A4}{\text{Data Source and Methodology}}\)

Data Source

This dataset was released by 84.51º for academic research and case study, and was provided by our professor Bradley Boehmke for the Data Wrangling with R course at University of Cincinnati. The data source can be find at course page. The whole dataset contains a household group of 2500 households and thier household level transactions over two years. All the households are frequent shoppers at a retailer. This dataset also contains the demographic information of the 2500 households in order to help the retailer learn more about its customers. This data provided an opportunity to explore 84.51°/Kroger data and an opportunity to apply skills learned to investigate different business questions.

Methodology

  • Dividend has been used to divide the whole large dataset into three groups of customers based on their income ranges.

  • Time Series Analysis has been applied in this project to generate the spending trend of customers.

      a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
      b <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
      
      trend <- function(x){trend.test(x)}
      trend(a)
## 
##  Spearman's rank correlation rho
## 
## data:  x and time(x)
## S = 3.6637e-14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho 
##   1
      trend(a)$estimate
## rho 
##   1
      trend(b)
## 
##  Spearman's rank correlation rho
## 
## data:  x and time(x)
## S = 330, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho 
##  -1
      trend(b)$estimate
## rho 
##  -1

Test.trend: Test if the series has an increasing or decreasing trend, using a non-parametric Spearman test between the observations and time.

Many thanks to our reference

  • Plots have been drawed to see the distribution of and comparision between the increasing and decreasing spending trend of customers in the related demographic factors.

\(\color{#5097A4}{\text{Mission}}\)

The mission of this analysis project is to help the retailer have a better understanding of its customers. Understand the data they collected from their customers can help it understand the market and have win-win relationship with their respectful customers. There is no doubt that the final goal of each business is to earn profit.

The spending trends of the customers in two years (2016 and 2017) will provide the retailer with a detailed picture of its performance by pointing out how many customers spending less as well as how many customers tend to spending more. The distinct factors such as store selection, brand selection, and product selection by customers will help the retailer improving itself in order to attract more customers. Other demographic factors, for example, age range and household composition of customers will help the retailer look into more specific details so it can provide its customers with more proper promotion, which will help the retailer gain profits in return.

The whole project can not only help the retailer with its performance, but also can help each reader with their understanding of customer behaviors, customer spending trends and other aspects of business. Will provide you with new ideas about business while you are looking through the whole project we did for sure!

\(\color{#29A887}{\text{Packages Required}}\)

Packages being used in this project. If you want to run the code individually, please install these packages first.

library(tidyr)      # used for tidying up data
library(dplyr)      # used for data manipulation
library(lubridate)  # used for transforming date
library(knitr)      # used for viewing data
library(printr)     # used for viewing data and tables
library(leaflet)    # used for viewing data
library(readr)      # used for importing and reading .csv files
library(readxl)     # used for reading .xlsx files
library(ggplot2)    # used for data visualization
library(pastecs)    # used for analysis of space-time ecological series

\(\color{#29A887}{\text{Data Preparation}}\)

In this section, detailed description of data, the main procedures of importing and cleaning data, and a brief final data preview will be provided.

\(\color{#5097A4}{\text{Data Description}}\)

Original Data Variables

  • Households
    • HSHD_NUM: each household(customer) has been assigned a unique ID
    • L: whether the customer is a loyalty member or not
    • AGE_RANGE: age range of the customer
    • MARITAL: marital status of the customer
    • INCOME_RANGE: income range of the customer(what we used to divide our investigated customer groups in this project)
    • HOMEOWNER: whether the customer is a homeowner or not
    • HSHD_COMPOSITION: the composition of the customer’s family
    • HH_SIZE: the family size of the customer
    • CHILDREN: the number of child the customer has
  • Products
    • PRODUCT_NUM: each product has been assigned a unique ID
    • DEPARTMENT: the department each product belongs to
    • COMMODITY: product classification such as pet, meditation or cosmetics
    • BRAND_TY: the brand type of each product - private or national
    • X5: natural or organic flag
  • Transactions
    • Basket_NUM: each transaction has been assigned a unique ID
    • PURCHASE_: the exact purchase date of the transaction
    • WEEK_NUM: the week number when the transaction happened
    • YEAR: which year the transaction happened
    • HSHD_NUM: foreign key from Households dataset
    • PRODUCT_NUM: foreign key from Products dataset
    • SPEND: the money the customer spend(get refund) in the transaction
    • UNITS: the number of products the customer purchased(returned) in the transaction
    • STORE_R: the region of the store the transaction happened

Variables Created in this project

  • trans_clean: clean dataset without duplicates
  • prod_trans: the two datasets(products and transactions) merged together
  • data: the final clean dataset
  • high_end: customer group with income range above 100K
  • middle-end: customer Group with income range between 50K and 99K
  • low-end: customer group with income range below 49K

\(\color{#5097A4}{\text{Import and Clean Data}}\)

  • Import Data
    • Imported Household data using for loop function we created(instead of trying to use read_csv() function to import data everytime for all three data files), which was a csv file with 5000 observations and 9 variables. Named the dataset “households.”
    • Imported Products data using for loop function we created(instead of trying to use read_csv() function to import data everytime for all three data files), which was a csv file with 151,141 observations and 5 variables. Named the dataset “products.”
    • Imported Transactions data for loop function we created(instead of trying to use read_csv() function to import data everytime for all three data files), which was a csv file with 10,625,553 observations and 9 variables. Named the dataset “transactions.”
files <- c("5000_households", "5000_products", "5000_transactions")
names <- c("households","products","transactions")

for (i in seq_along(files)) {
  #create file name
    full_path <- paste0("data/", files[i], ".csv")
  
  #import data
  if (file.exists(full_path)) {
    df <- read_csv(full_path)
    assign(names[i],df)
    rm(df)
  } else {
    print("No such file exists")
  }
}
  • Missing Values of Data Sources

There are 68 missing values in Households dataset, and they are located at Household Composition variable. There is 0 missing value in Products dataset. There is 0 missing value in Transactions dataset.

The missing values just indicated that some of the household composition information were missing from the customer. The analysis will be influenced if we simply remove all the missing values from the original data source.

  • Clean Data Duplicates will damage our analysis and give us wrong results, so we need to

  • Check the duplicates data in each dataframe, and

names <- c("households","products","transactions")

for (i in seq_along(names)) {
  #create the function to check the number of duplicates
    duplicates <- function(x){sum(duplicates(x))}
  
  #import data
  if (duplicates(x) != 0) {
    summary(duplicates(x))
  } else {
    print("No duplicates exists")
  }
}
  • Clean all the duplicates and get the clean data set
trans_clean <- transactions %>% distinct()
  • Merge Data
    • To obtain a more direct observation of the household level of transactions, the next step is to merge the three separate dataset into one combined dataset, which was named as “data.”
prod_trans <- merge(trans_clean, products)
data <- merge(prod_trans, households)

\(\color{#5097A4}{\text{Final Data Preview}}\)

head() function and summary() function were used to generate a brief preview of the combined clean dataset.

head(data, 6)
HSHD_NUM PRODUCT_NUM BASKET_NUM PURCHASE_ SPEND UNITS STORE_R WEEK_NUM YEAR DEPARTMENT COMMODITY BRAND_TY X5 L AGE_RANGE MARITAL INCOME_RANGE HOMEOWNER HSHD_COMPOSITION HH_SIZE CHILDREN
0001 1345638 559252 29-JAN-17 4.01 1 EAST 57 2017 FOOD MEAT - BEEF NATIONAL N Y 55-64 Married 150K+ Homeowner 2 Adults and Kids 3 1
0001 0324896 713487 18-MAY-17 3.79 1 EAST 72 2017 FOOD INTERNATIONAL FOOD NATIONAL N Y 55-64 Married 150K+ Homeowner 2 Adults and Kids 3 1
0001 1029355 203002 25-MAY-16 1.63 1 EAST 21 2016 FOOD PRODUCE NATIONAL N Y 55-64 Married 150K+ Homeowner 2 Adults and Kids 3 1
0001 0654440 741608 07-JUN-17 1.99 1 EAST 75 2017 FOOD INTERNATIONAL FOOD NATIONAL N Y 55-64 Married 150K+ Homeowner 2 Adults and Kids 3 1
0001 4174205 588333 19-FEB-17 2.50 2 EAST 60 2017 FOOD PRODUCE NATIONAL N Y 55-64 Married 150K+ Homeowner 2 Adults and Kids 3 1
0001 2933903 699493 09-MAY-17 3.99 1 EAST 71 2017 FOOD GROCERY STAPLE NATIONAL N Y 55-64 Married 150K+ Homeowner 2 Adults and Kids 3 1
summary(data)
HSHD_NUM PRODUCT_NUM BASKET_NUM PURCHASE_ SPEND UNITS STORE_R WEEK_NUM YEAR DEPARTMENT COMMODITY BRAND_TY X5 L AGE_RANGE MARITAL INCOME_RANGE HOMEOWNER HSHD_COMPOSITION HH_SIZE CHILDREN
Length:10625131 Length:10625131 Length:10625131 Length:10625131 Min. :-14.990 Min. :-26.000 Length:10625131 Min. : 1.00 Min. :2016 Length:10625131 Length:10625131 Length:10625131 Length:10625131 Length:10625131 Length:10625131 Length:10625131 Length:10625131 Length:10625131 Length:10625131 Length:10625131 Length:10625131
Class :character Class :character Class :character Class :character 1st Qu.: 1.670 1st Qu.: 1.000 Class :character 1st Qu.: 27.00 1st Qu.:2016 Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Median : 2.580 Median : 1.000 Mode :character Median : 52.00 Median :2016 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
NA NA NA NA Mean : 3.598 Mean : 1.296 NA Mean : 52.48 Mean :2016 NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA 3rd Qu.: 4.000 3rd Qu.: 1.000 NA 3rd Qu.: 78.00 3rd Qu.:2017 NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA Max. :299.990 Max. :246.000 NA Max. :104.00 Max. :2017 NA NA NA NA NA NA NA NA NA NA NA NA

\(\color{#29A887}{\text{Proposed Exploratory Data Analysis}}\)

Time Series Analysis has been applied in this project to generate the spending trend of customers. Plots have been drawed to see the distribution of and comparision between the increasing and decreasing spending trend of customers in the related demographic factors.

\(\color{#5097A4}{\text{High-End Customer Group}}\)

High-end Customer Group: income range above 100K

high_end <- filter(data, INCOME_RANGE == "150K+" | INCOME_RANGE == "100-150K")

To generate the spending trend of the high-end customer group

#finding spending trends in high-end

high_end$PURCHASE_ <- strptime(high_end$PURCHASE_,format = "%d-%b-%Y")

high.end <- high_end %>% separate(PURCHASE_, c("year", "month", "date"), sep = '-') %>%
  unite(ID, HSHD_NUM, year, month, date, sep = "-")

spend.id.high <-  data.frame(high.end$ID, high.end$SPEND) 

spend.day.high <- aggregate(high.end.SPEND ~ high.end.ID, data = spend.id.high, FUN = sum)

spend.high <- spend.day.high %>% 
  separate(high.end.ID, c("ID", "year","month", "date"), sep = '-') %>%
  unite(date, year, month, date, sep = "-") %>%
  group_by_at(vars(-high.end.SPEND)) %>%   
  mutate(row_id = 1:n()) %>% ungroup() %>%  
  spread(key = ID, value = high.end.SPEND) %>%    
  select(-row_id) 

value.high <- spend.high[, colSums(is.na(spend.high)) < 720]

In our analysis, we only took customers who had over 6 transactions within two years (the range of the original dataset), to be exact, 726 days intoconsideration. The spending trends of these customers will be generated and performed a time series where the rows indicate every single day from 2016/01/03 to 2017/12/30.

dim(value.high)
## [1] 726 669

There were 668 customers in total who have been taken into consideration when generating the spending trend in high-end market.

#obtain trends in high-end market

high.test <- value.high[,c(2:669)]
for (x in high.test) {
  if (trend.test(x)$estimate > 0) {
    print("inc") 
  }
  else{
    print("dec")
  }
  
}

trend.high <- read_excel("data/trends_high_end.xlsx") %>% 
  separate(trend, c("n", "trend"), sep = " ")
unique(trend.high$trend)
## [1] "\"inc\"" "\"dec\""
#number of increasing and decreasing in high_end market

high.test <- value.high[,c(2:669)]

high.increase <- 0
high.decrease <- 0
for (x in high.test) {
  if (trend.test(x)$estimate > 0) {
    high.increase = high.increase + 1 
  }
  else{
    high.decrease = high.decrease + 1 
  }
  
}
high.increase
## [1] 340
high.decrease
## [1] 328

There are 340 customers have increasing spending trend among past two years, and 328 customers have decreasing spending trend among past two years.

#add the tend column to high_end table based on HSHD_ID, call new table high

high_ <- value.high %>% gather(ID, spend, 2:669)
high.ID <-  as.vector(unique(high_$ID))
high.trend <- as.vector(trend.high$trend)
high.comb <- cbind.data.frame(high.ID, high.trend)

high.inc <- high.comb[high.comb$high.trend == "\"inc\"", 1]
high.dec <- high.comb[high.comb$high.trend == "\"dec\"", 1]

high.part1 <- high_end[which(high_end$HSHD_NUM == high.inc), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION","X5")]
high.part2 <- high_end[which(high_end$HSHD_NUM == high.dec), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "X5")]

high.part1["trends"] <- rep("inc")
high.part2["trends"] <- rep("dec")
high <- rbind.data.frame(high.part1, high.part2)

\(\color{#003151}{\text{Loyalty}}\)

high.trend.L <- ggplot(high, aes(trends)) + scale_fill_brewer(palette = "Pastel2")
high.trend.L + geom_histogram(aes(fill = factor(L)),
                            stat = "count",
                                  bins = 4,
                                  binwidth = 2,
                                  col = "white", 
                                  size = 0.1) +   
  labs(title = "increasing and decreasing trends shows in high-end customer loyalty")

Observation:

  • Most high-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.

  • Among the small portion of the high-end customers who were not loyalty members of the retailer, more customers decreased their spend over time.

  • Loyalty customers tend to spend more over time. The retailer needs to take action to reward their loyalty members and keep them in basket. Meanwhile, they could hand out some coupons or other promotions to invite more customers join the loyalty membership and increase their spend in the long run.

\(\color{#003151}{\text{Brand Type Selection}}\)

high.trend.stores <- ggplot(high, aes(trends)) + scale_fill_brewer(palette = 
                                                                     "Pastel2")
high.trend.stores + geom_histogram(aes(fill = factor(BRAND_TY)),
                                   stat = "count",
                                   bins = 4,
                                   binwidth = 2,
                                   col = "white", 
                                   size = 0.1) +   
  labs(title = "increasing and decreasing trends of high end markect shows in 
       brand type selection")

Observation:

  • Majority of high-end customers turned to national brands. More customers increased their spend over time.

  • The very low portion of high-end customers especially compared to national-brand-customer turned to private brands. Among the private-brand-customer, slightly more than half of them increased their spend over time.

  • Even though private-brand-buyer was only a small part of the high-end customer group, many of them tended to buy more over time. The retailer can promote more on the specific rand and product which gained the best sell in the past and use the product to keep customers.

\(\color{#003151}{\text{Store Selection}}\)

high.trend.stores <- ggplot(high, aes(trends)) + scale_fill_brewer(palette = 
                                                                     "Pastel2")
high.trend.stores + geom_histogram(aes(fill = factor(STORE_R)),
                                   stat = "count",
                                   bins = 4,
                                   binwidth = 2,
                                   col = "white", 
                                   size = 0.1) +   
  labs(title = "increasing and decreasing trends of high-end market shows in four 
       stores")

Observation: * Eastern store had the majority of high-end customers in all four stores, and over half of its customers increased their spend. More customers increased their spend than the number of customers who decreased their spend over time.

  • Central store had the minority of high-end customers in all four stores, and slightly more customers decreased their spend over time.

  • Southern store and western store were similar at the number of high-end customers, however, more than half of customers of southern store decreased their spending over time, while more than half of customers of western store increased their spend over time.

  • The other three stores need to learn from eastern store, and not only try to attract more customers in the future but also try to promote the customers to buy more.

\(\color{#003151}{\text{Customer Features_Household Composition}}\)
high.store.trend <- ggplot(high, aes(STORE_R)) + scale_fill_brewer(palette = 
                                                                     "Pastel2")
high.store.trend + geom_histogram(aes(fill = factor(HSHD_COMPOSITION)),
                           stat = "count",
                           bins = 4,
                           binwidth = 2,
                           col = "white", 
                           size = 0.1) +   
  labs(title = "store selection by high-end customers with different 
       household composition") 

Observation:

  • Overall speaking, the largest group of customers had family with 2 adults and kids.

  • Eastern store had the majority of high-end customers in all four stores. The number of customers who had family with 1 adult and kids was larger than the other three stores.

  • Central store had the minority of high-end customers in all four stores. The number of customers who were single adult at home was the lowest among all stores.

  • Southern store and western store were similar at the number of high-end customers, and both the two stores were higher in the number of customers who were single female or single male by themselves.

\(\color{#003151}{\text{Customer Features_loyalty}}\)
high.store.age <- ggplot(high, aes(STORE_R)) + scale_fill_brewer(palette = 
                                                                   "Pastel2")
high.store.age + geom_histogram(aes(fill = factor(L)),
                            stat = "count",
                            bins = 4,
                            binwidth = 2,
                            col = "white", 
                            size = 0.1) +   
  labs(title = "store selection by high-end customers with customer loyalty")  

Observation:

  • Overall speaking, majority of the customers of the four stores were loyalty members.

  • Eastern store had the majority of high-end customers in all four stores. The number of customers who were not loyalty members tended to be the lowest among all four stores.

  • Oppositely, central store had the minority of high-end customers in all four stores. The number of customers who were not loyalty members tended to be the highest among all four stores.

\(\color{#5097A4}{\text{Middle-End Customer Group}}\)

Middle-end Customer Group: income range between 50K and 99K

middle_end <- filter(data, INCOME_RANGE == "75-99K" | INCOME_RANGE == "50-74K")

To generate the spending trend of the middle-end customer group

#finding spending trends in middle-end

middle_end$PURCHASE_ <- strptime(middle_end$PURCHASE_,format = "%d-%b-%Y")

mid.end <- middle_end %>% separate(PURCHASE_, c("year", "month", "date"), sep = '-') %>%
  unite(ID, HSHD_NUM, year, month, date, sep = "-")

spend.id.mid <-  data.frame(mid.end$ID, mid.end$SPEND) 

spend.day.mid <- aggregate(mid.end.SPEND ~ mid.end.ID, data = spend.id.mid, FUN = sum)

spend.mid <- spend.day.mid %>% 
  separate(mid.end.ID, c("ID", "year","month", "date"), sep = '-') %>%
  unite(date, year, month, date, sep = "-") %>%
  group_by_at(vars(-mid.end.SPEND)) %>%   
  mutate(row_id = 1:n()) %>% ungroup() %>%  
  spread(key = ID, value = mid.end.SPEND) %>%    
  select(-row_id) 

value.mid <- spend.mid[, colSums(is.na(spend.mid)) < 720]

In our analysis, we only took customers who had over 6 transactions within two years (the range of the original dataset), to be exact, 726 days intoconsideration. The spending trends of these customers will be generated and performed a time series where the rows indicate every single day from 2016/01/03 to 2017/12/30.

dim(value.mid)
## [1]  726 1234

There were totally 1233 customers who have been taken into consideration when generating the spending trend in middle-end market.

#obtain trends in middle-end market

mid.test <- value.mid[,c(2:1234)]
for (x in mid.test) {
  if (trend.test(x)$estimate > 0) {
    print("inc") 
  }
  else{
    print("dec")
  }
  
}

trend.mid <- read_excel("data/trends_middle_end.xlsx") %>% 
  separate(trend, c("n", "trend"), sep = " ")
unique(trend.mid$trend)
## [1] "\"dec\"" "\"inc\""
#number of increasing and decreasing in middle-end market

mid.increase <- 0
mid.decrease <- 0

for (x in mid.test) {
  if (trend.test(x)$estimate > 0) {
    mid.increase = mid.increase + 1 
  }
  else{
    mid.decrease = mid.decrease + 1 
  }
  
}
mid.increase
## [1] 645
mid.decrease
## [1] 588

There are 645 customers have increasing spending trend among past two years, and 588 customers have decreasing spending trend among past two years.

#add the trends column to high_end table based on HSHD_ID, call new table middle

mid_ <- value.mid %>% gather(ID, spend, 2:1234)
mid.ID <-  as.vector(unique(mid_$ID))
mid.trend <- as.vector(trend.mid$trend)

mid.comb <- cbind.data.frame(mid.ID, mid.trend)

mid.inc <- mid.comb[mid.comb$mid.trend == "\"inc\"", 1]
mid.dec <- mid.comb[mid.comb$mid.trend == "\"dec\"", 1]

mid.part1 <- middle_end[which(middle_end$HSHD_NUM == mid.inc), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION","X5")]
mid.part2 <- middle_end[which(middle_end$HSHD_NUM == mid.dec), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "X5")]

mid.part1["trends"] <- rep("inc")
mid.part2["trends"] <- rep("dec")
mid <- rbind.data.frame(mid.part1, mid.part2)

\(\color{#003151}{\text{Loyalty}}\)

mid.trend.L <- ggplot(mid, aes(trends)) + scale_fill_brewer(palette = "Pastel1")
mid.trend.L + geom_histogram(aes(fill = factor(L)),
                              stat = "count",
                              bins = 4,
                              binwidth = 2,
                              col = "white", 
                              size = 0.1) +   
  labs(title = "increasing and decreasing trends shows in middle-end customer loyalty")

Observation:

  • Most middle-end customers were the loyalty members of the retailer. More loyalty customers increased their spend over time.

  • Among the small portion of the middle-end customers who were not loyalty members of the retailer, more customers tended to decrease their spend over time.

\(\color{#003151}{\text{Brand Type}}\)

mid.trend.stores <- ggplot(mid, aes(trends)) + scale_fill_brewer(palette = "Pastel1")
mid.trend.stores + geom_histogram(aes(fill = factor(BRAND_TY)),
                                   stat = "count",
                                   bins = 4,
                                   binwidth = 2,
                                   col = "white", 
                                   size = 0.1) +   
  labs(title = "increasing and decreasing trends of middle end markect shows in brand type selection")

Observation:

  • Majority of middle-end customers turned to national brands. There were more customers increased their spend over time while the rest decreased their spend over time.

  • The smaller portion of middle-end customers turned to private brands. Among the private-brand-customer, more than half of their increased their spend over time.

  • There was no big division between midd-end customers who increased spend over time or decreased their spend over time on either national brand or private brand.

\(\color{#003151}{\text{Store Collection}}\)

mid.trend.stores <- ggplot(mid, aes(trends)) + scale_fill_brewer(palette = "Pastel1")
mid.trend.stores + geom_histogram(aes(fill = factor(STORE_R)),
                                   stat = "count",
                                   bins = 4,
                                   binwidth = 2,
                                   col = "white", 
                                   size = 0.1) +   
  labs(title = "increasing and decreasing trends of middle-end market shows in four stores")

Observation:

  • Eastern store had the majority of middle-end customers in all four stores, however, over half of its customers decreased their spend over time.

  • Central store had the lowest portion of middle-end customers in all four stores, and slightly more than half of its customers decreased their spend over time.

  • Southern store and western store were similar at the number of middle-end customers, however, more than half of customers of southern store increased their spending over time, while more than half of customers of western store increased their spend over time as well.

  • The central store need to take action to persuasive their customers to spend more, and the other stores, even though didn’t have as many as customers as the central store, did a fair job on promote their customers spend more over time.

\(\color{#003151}{\text{Customer Features_Household Composition}}\)
mid.store.trend <- ggplot(mid, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel1")
mid.store.trend + geom_histogram(aes(fill = factor(HSHD_COMPOSITION)),
                                  stat = "count",
                                  bins = 4,
                                  binwidth = 2,
                                  col = "white", 
                                  size = 0.1) +   
  labs(title = "store selection by mid-end customers with different household composition") 

Observation:

  • Overall speaking, majority of the middle-end customers had a family with two adults and kids.

  • Eastern store had the majority of middle-end customers in all four stores. The number of customers who were single females was lower than the other three stores.

  • Central store had the minority of middle-end customers in all four stores. The number of customers who were single adult at home was the lowest among all stores.

  • Southern store and western store were similar at the number of high-end customers, and both the two stores had more customers who were single female or single male by themselves.

\(\color{#003151}{\text{Customer Features_loyalty}}\)
mid.store.age <- ggplot(mid, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel1")
mid.store.age + geom_histogram(aes(fill = factor(L)),
                                stat = "count",
                                bins = 4,
                                binwidth = 2,
                                col = "white", 
                                size = 0.1) +   
  labs(title = "store selection by mid-end customers with customer loyalty")  

Observation:

  • Overall speaking, majority of the customers of the four stores were loyalty members.

  • Eastern store had the majority of mid-end customers in all four stores. The number of customers who were not loyalty members tended to be the lowest among all four stores.

  • Oppositely, central store had the minority of high-end customers in all four stores. The number of non-loyal customers tended to be the highest among all stores.

\(\color{#5097A4}{\text{Low-End Customer Group}}\)

Low-end Customer Group: income range below 49K.

low_end <- filter(data, INCOME_RANGE == "35-49K"  | INCOME_RANGE == "UNDER 35K")

To generate the spending trend of the low-end customer group

#finding spending trends in low-end 

low_end$PURCHASE_ <- strptime(low_end$PURCHASE_,format = "%d-%b-%Y")

low.end <- low_end %>% separate(PURCHASE_, c("year", "month", "date"), sep = '-') %>%
  unite(ID, HSHD_NUM, year, month, date, sep = "-")

spend.id.low <-  data.frame(low.end$ID, low.end$SPEND) 

spend.day.low <- aggregate(low.end.SPEND ~ low.end.ID, data = spend.id.low, FUN = sum)
  spend.low <- spend.day.low %>% 
  separate(low.end.ID, c("ID", "year","month", "date"), sep = '-') %>%
  unite(date, year, month, date, sep = "-") %>%
  group_by_at(vars(-low.end.SPEND)) %>%   
  mutate(row_id = 1:n()) %>% ungroup() %>%  
  spread(key = ID, value = low.end.SPEND) %>%    
  select(-row_id) 

  
value.low <- spend.low[, colSums(is.na(spend.low)) < 720]

In our analysis, we only took customers who had over 6 transactions within two years (the range of the original dataset), to be exact, 726 days intoconsideration. The spending trends of these customers will be generated and performed a time series where the rows indicate every single day from 2016/01/03 to 2017/12/30.

dim(value.low)
## [1]  726 1415

There were 1414 customers in total be taken into consideration for generating the spending trend in low-end market.

##obtain trends in low-end market

low.test <- value.low[,c(2:1415)]

for (x in low.test) {
  if (trend.test(x)$estimate > 0) {
    print("inc") 
  }
  else{
    print("dec")
  }
  
}


trend.low <- read_excel("data/trends_low_end.xlsx") %>% 
  separate(trend, c("n", "trend"), sep = " ")
unique(trend.low$trend)
## [1] "\"inc\"" "\"dec\""
#number of increasing and decreasing in low-end market

low.test <- value.low[,c(2:1415)]

low.increase <- 0
low.decrease <- 0
for (x in low.test) {
  if (trend.test(x)$estimate > 0) {
    low.increase = low.increase + 1 
  }
  else{
    low.decrease = low.decrease + 1 
  }
  
}
low.increase
## [1] 748
low.decrease
## [1] 666

There are 748 customers have increasing spending trend among past two years, and 666 customers have decreasing spending trend among past two years.

#add the trends column to low_end table based on HSHD_ID, call new table low

low_ <- value.low %>% gather(ID, spend, 2:1415)
low.ID <-  as.vector(unique(low_$ID))
low.trend <- as.vector(trend.low$trend)
low.comb <- cbind.data.frame(low.ID, low.trend)

low.inc <- low.comb[low.comb$low.trend == "\"inc\"", 1]
low.dec <- low.comb[low.comb$low.trend == "\"dec\"", 1]

low.part1 <- low_end[which(low_end$HSHD_NUM == low.inc), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER","X5")]
low.part2 <- low_end[which(low_end$HSHD_NUM == low.dec), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER", "X5")]

low.part1["trends"] <- rep("inc")
low.part2["trends"] <- rep("dec")
low <- rbind.data.frame(low.part1, low.part2)

\(\color{#003151}{\text{Loyalty}}\)

low.trend.L <- ggplot(low, aes(trends)) + scale_fill_brewer(palette = "Pastel2")
low.trend.L + geom_histogram(aes(fill = factor(L)),
                              stat = "count",
                              bins = 4,
                              binwidth = 2,
                              col = "white", 
                              size = 0.1) +   
  labs(title = "increasing and decreasing trends shows in low-end customer loyalty")

Observation:

  • Most low-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.

  • Among the small portion of the low-end customers who were not loyalty members of the retailer, and the portion of customers who either increased or decreased their spend over time was almost 50-50.

\(\color{#003151}{\text{Brand Type}}\)

low.trend.stores <- ggplot(low, aes(trends)) + scale_fill_brewer(palette = "Pastel2")
low.trend.stores + geom_histogram(aes(fill = factor(BRAND_TY)),
                                   stat = "count",
                                   bins = 4,
                                   binwidth = 2,
                                   col = "white", 
                                   size = 0.1) +   
  labs(title = "increasing and decreasing trends of low end markect shows in brand 
       type selection")

Observation:

  • Majority of low-end customers turned to national brands. More customers increased their spend over time while the rest decreased their spend over time.

  • A lower portion of low-end customers turned to private brands. The number of customers who either decreased their spend or increased their spend seemed to be even numbers.

  • For low-end market, customers didn’t have a strong preference of brand type compared with middle-end market and high-end market. The retailer did a fair job to push its customers to increase their spend over time.

\(\color{#003151}{\text{Store Collection}}\)

low.trend.stores <- ggplot(low, aes(trends)) + scale_fill_brewer(palette = "Pastel2")
low.trend.stores + geom_histogram(aes(fill = factor(STORE_R)),
                                   stat = "count",
                                   bins = 4,
                                   binwidth = 2,
                                   col = "white", 
                                   size = 0.1) +   
  labs(title = "increasing and decreasing trends of low-end market shows in four stores")

Observation:

  • Eastern store had the majority of low-end customers in all four stores, and over half of its customers increased their spend over time.

  • Southern store had the lowest portion of low-end customers in all four stores, and the portion of customers who either increased or decreased their spend over time was almost 50-50.

  • Central store and western store were similar at the number of low-end customers. Though they had lower portion of low-end customers compared with the other two stores, more customers increased their spend over time.

  • The four store did a fair job of promoting their customers to spend more.

\(\color{#003151}{\text{Customer Features_Household Composition}}\)
low.store.trend <- ggplot(low, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel2")
low.store.trend + geom_histogram(aes(fill = factor(HSHD_COMPOSITION)),
                                  stat = "count",
                                  bins = 4,
                                  binwidth = 2,
                                  col = "white", 
                                  size = 0.1) +   
  labs(title = "store selection by low-end customers with different household composition") 

Observation:

  • Majority of the low-end customers had a family with two adults. The second largest portion of the low-end customers was customers with a family with two adults and kids.

  • Eastern store had the majority of low-end customers in all four stores. The number of customers who were single males was higher than the other three stores.

  • Southern store had the minority of low-end customers in all four stores. The number of customers who were single adult whether male or female seemed to be even.

\(\color{#003151}{\text{Customer Features_loyalty}}\)
low.store.age <- ggplot(low, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel2")
low.store.age + geom_histogram(aes(fill = factor(L)),
                                stat = "count",
                                bins = 4,
                                binwidth = 2,
                                col = "white", 
                                size = 0.1) +   
  labs(title = "store selection by low-end customers with customer loyalty")   

Observation:

  • Overall speaking, majority of the low-end customers of the four stores were loyalty members.

  • Eastern store had the majority of low-end customers in all four stores. The number of customers who were not loyalty members tended to be lower than the others.

\(\color{#29A887}{\text{Summary}}\)

** Summary on our problem statement and how we addressed them **

The main goal of this analysis is to find out the spending trends of customers on whether they increased their spend over time or they decreased their spend according to our records. Other than that, we also look into details in the trends showed in loyalty, brand type of product purchased, store selections and customer features of loyalty and household composition in store selection. Data of the transactions taken place in two years provided by our professor has been used in this analysis. Our methods being used in this project include but not limited to dividend, time series analysis and plots. Finally we generate inspirations from the analysis and show in this report.

** Some interesting finds we generating from our analysis: **

  • There was not a huge difference betwwen the number of customers who whether increased or decreased their spend over time

  • Eastern store tended to have the best performance overall.

  • Loyalty members would like to revisit stores from time to time and increase their spend in the long run.

  • Customers were much more interested in national brands compared with private brands.

  • The compositions of customers in each store selection were quite similar from our analysis and the plots we generated. Family with kids tended to be the major term.

** We might have some ideas and suggestions for the retailer: **

  • Look carefully at Eastern store and learn more about the secret of the success in this particular store. Whether it is the location, the assistants in the store, or just the right promotions. Maybe reorganize other stores a little bit based on further detailed analysis.

  • Making sure your customers are loyal is critical. Based on our analysis, loyal customers tended to buy more over time. Non-loyalty members might decrease their spend or even quit. Keep you loyalty members while invite new customers to join your loyalty family!

  • Some stores such as Central Store tended to have more single customers besides the major group of customers who have adults-with-kids. These stores may have more products target on the single person while remain the abundance of househols products.

  • Even though customers purchased mainly and hugely on national products, there was an increasing trend in private brands ass well. It will be helpful to find out several particular brands customers crave for and increase sells potentially.

  • Customers are definitely in the first place for a retailer. There will not be an end in learning more about customers!

** Further analysis **

  • The trend analysis is not perfect for the reason that further hypothesis tests were not generated to prove the existence of linear regression we obtained in the previous steps. To be exact, some of the trends of customer-spending may not have linear correlation. We would kindly and sincerely suggest to add hypothesis test to develop model and provide more evidence in further trend analysis in order to improve the accurancy of the result.

  • Further analysis is able to perform if more detailed information on customers are provided. Then more detailed results and suggestions can be made to provide to the retailer on what kind of promotions or improvement they may have and how to take those in action.