The core of a business is to learn about its customers.
A company can never grow larger nor further if the management levels know nothing about its customer or competitor. Especially when their competitor stay close to the trend of society and always keep an eye on the customers. A company needs to learn from the behavior of the current customers, which might be the solid and easiest information they can collect, in order to keep the current customers around, attract more new customers and improve itself. So does a retailer.
To help the retailer have better understanding of its customer, this project mainly provided an analysis on the spending trends of the customer. Households with different levels of incoming will have their distinct needs. Meanwhile, they have no competition against each other. The retailer is supposed to react with groups with different ranges of income respectively. Based on the range of income of the 2500 households, we dvided them into three customer groups:
Each question been asked and explained in this project will be explicitly spread in these three customer groups.
To obtain the specific details which will help the retailer learn more about its customers and improve themselves to perform better in the future, several questions have been raised up with and will be explained and solved in this project are listed as follows:
Data Source
This dataset was released by 84.51º for academic research and case study, and was provided by our professor Bradley Boehmke for the Data Wrangling with R course at University of Cincinnati. The data source can be find at course page. The whole dataset contains a household group of 2500 households and thier household level transactions over two years. All the households are frequent shoppers at a retailer. This dataset also contains the demographic information of the 2500 households in order to help the retailer learn more about its customers. This data provided an opportunity to explore 84.51°/Kroger data and an opportunity to apply skills learned to investigate different business questions.
Methodology
Dividend has been used to divide the whole large dataset into three groups of customers based on their income ranges.
Time Series Analysis has been applied in this project to generate the spending trend of customers.
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
b <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
trend <- function(x){trend.test(x)}
trend(a)
##
## Spearman's rank correlation rho
##
## data: x and time(x)
## S = 3.6637e-14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 1
trend(a)$estimate
## rho
## 1
trend(b)
##
## Spearman's rank correlation rho
##
## data: x and time(x)
## S = 330, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -1
trend(b)$estimate
## rho
## -1
Test.trend: Test if the series has an increasing or decreasing trend, using a non-parametric Spearman test between the observations and time.
Many thanks to our reference
The mission of this analysis project is to help the retailer have a better understanding of its customers. Understand the data they collected from their customers can help it understand the market and have win-win relationship with their respectful customers. There is no doubt that the final goal of each business is to earn profit.
The spending trends of the customers in two years (2016 and 2017) will provide the retailer with a detailed picture of its performance by pointing out how many customers spending less as well as how many customers tend to spending more. The distinct factors such as store selection, brand selection, and product selection by customers will help the retailer improving itself in order to attract more customers. Other demographic factors, for example, age range and household composition of customers will help the retailer look into more specific details so it can provide its customers with more proper promotion, which will help the retailer gain profits in return.
The whole project can not only help the retailer with its performance, but also can help each reader with their understanding of customer behaviors, customer spending trends and other aspects of business. Will provide you with new ideas about business while you are looking through the whole project we did for sure!
Packages being used in this project. If you want to run the code individually, please install these packages first.
library(tidyr) # used for tidying up data
library(dplyr) # used for data manipulation
library(lubridate) # used for transforming date
library(knitr) # used for viewing data
library(printr) # used for viewing data and tables
library(leaflet) # used for viewing data
library(readr) # used for importing and reading .csv files
library(readxl) # used for reading .xlsx files
library(ggplot2) # used for data visualization
library(pastecs) # used for analysis of space-time ecological series
In this section, detailed description of data, the main procedures of importing and cleaning data, and a brief final data preview will be provided.
Original Data Variables
Variables Created in this project
for loop function we created(instead of trying to use read_csv() function to import data everytime for all three data files), which was a csv file with 5000 observations and 9 variables. Named the dataset “households.”for loop function we created(instead of trying to use read_csv() function to import data everytime for all three data files), which was a csv file with 151,141 observations and 5 variables. Named the dataset “products.”for loop function we created(instead of trying to use read_csv() function to import data everytime for all three data files), which was a csv file with 10,625,553 observations and 9 variables. Named the dataset “transactions.”files <- c("5000_households", "5000_products", "5000_transactions")
names <- c("households","products","transactions")
for (i in seq_along(files)) {
#create file name
full_path <- paste0("data/", files[i], ".csv")
#import data
if (file.exists(full_path)) {
df <- read_csv(full_path)
assign(names[i],df)
rm(df)
} else {
print("No such file exists")
}
}
There are 68 missing values in Households dataset, and they are located at Household Composition variable. There is 0 missing value in Products dataset. There is 0 missing value in Transactions dataset.
The missing values just indicated that some of the household composition information were missing from the customer. The analysis will be influenced if we simply remove all the missing values from the original data source.
Clean Data Duplicates will damage our analysis and give us wrong results, so we need to
Check the duplicates data in each dataframe, and
names <- c("households","products","transactions")
for (i in seq_along(names)) {
#create the function to check the number of duplicates
duplicates <- function(x){sum(duplicates(x))}
#import data
if (duplicates(x) != 0) {
summary(duplicates(x))
} else {
print("No duplicates exists")
}
}
trans_clean <- transactions %>% distinct()
prod_trans <- merge(trans_clean, products)
data <- merge(prod_trans, households)
head()function andsummary()function were used to generate a brief preview of the combined clean dataset.
head(data, 6)
| HSHD_NUM | PRODUCT_NUM | BASKET_NUM | PURCHASE_ | SPEND | UNITS | STORE_R | WEEK_NUM | YEAR | DEPARTMENT | COMMODITY | BRAND_TY | X5 | L | AGE_RANGE | MARITAL | INCOME_RANGE | HOMEOWNER | HSHD_COMPOSITION | HH_SIZE | CHILDREN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0001 | 1345638 | 559252 | 29-JAN-17 | 4.01 | 1 | EAST | 57 | 2017 | FOOD | MEAT - BEEF | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 0324896 | 713487 | 18-MAY-17 | 3.79 | 1 | EAST | 72 | 2017 | FOOD | INTERNATIONAL FOOD | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 1029355 | 203002 | 25-MAY-16 | 1.63 | 1 | EAST | 21 | 2016 | FOOD | PRODUCE | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 0654440 | 741608 | 07-JUN-17 | 1.99 | 1 | EAST | 75 | 2017 | FOOD | INTERNATIONAL FOOD | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 4174205 | 588333 | 19-FEB-17 | 2.50 | 2 | EAST | 60 | 2017 | FOOD | PRODUCE | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 2933903 | 699493 | 09-MAY-17 | 3.99 | 1 | EAST | 71 | 2017 | FOOD | GROCERY STAPLE | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
summary(data)
| HSHD_NUM | PRODUCT_NUM | BASKET_NUM | PURCHASE_ | SPEND | UNITS | STORE_R | WEEK_NUM | YEAR | DEPARTMENT | COMMODITY | BRAND_TY | X5 | L | AGE_RANGE | MARITAL | INCOME_RANGE | HOMEOWNER | HSHD_COMPOSITION | HH_SIZE | CHILDREN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Min. :-14.990 | Min. :-26.000 | Length:10625131 | Min. : 1.00 | Min. :2016 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | |
| Class :character | Class :character | Class :character | Class :character | 1st Qu.: 1.670 | 1st Qu.: 1.000 | Class :character | 1st Qu.: 27.00 | 1st Qu.:2016 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | |
| Mode :character | Mode :character | Mode :character | Mode :character | Median : 2.580 | Median : 1.000 | Mode :character | Median : 52.00 | Median :2016 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | |
| NA | NA | NA | NA | Mean : 3.598 | Mean : 1.296 | NA | Mean : 52.48 | Mean :2016 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| NA | NA | NA | NA | 3rd Qu.: 4.000 | 3rd Qu.: 1.000 | NA | 3rd Qu.: 78.00 | 3rd Qu.:2017 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| NA | NA | NA | NA | Max. :299.990 | Max. :246.000 | NA | Max. :104.00 | Max. :2017 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Time Series Analysis has been applied in this project to generate the spending trend of customers. Plots have been drawed to see the distribution of and comparision between the increasing and decreasing spending trend of customers in the related demographic factors.
High-end Customer Group: income range above 100K
high_end <- filter(data, INCOME_RANGE == "150K+" | INCOME_RANGE == "100-150K")
To generate the spending trend of the high-end customer group
#finding spending trends in high-end
high_end$PURCHASE_ <- strptime(high_end$PURCHASE_,format = "%d-%b-%Y")
high.end <- high_end %>% separate(PURCHASE_, c("year", "month", "date"), sep = '-') %>%
unite(ID, HSHD_NUM, year, month, date, sep = "-")
spend.id.high <- data.frame(high.end$ID, high.end$SPEND)
spend.day.high <- aggregate(high.end.SPEND ~ high.end.ID, data = spend.id.high, FUN = sum)
spend.high <- spend.day.high %>%
separate(high.end.ID, c("ID", "year","month", "date"), sep = '-') %>%
unite(date, year, month, date, sep = "-") %>%
group_by_at(vars(-high.end.SPEND)) %>%
mutate(row_id = 1:n()) %>% ungroup() %>%
spread(key = ID, value = high.end.SPEND) %>%
select(-row_id)
value.high <- spend.high[, colSums(is.na(spend.high)) < 720]
In our analysis, we only took customers who had over 6 transactions within two years (the range of the original dataset), to be exact, 726 days intoconsideration. The spending trends of these customers will be generated and performed a time series where the rows indicate every single day from 2016/01/03 to 2017/12/30.
dim(value.high)
## [1] 726 669
There were 668 customers in total who have been taken into consideration when generating the spending trend in high-end market.
#obtain trends in high-end market
high.test <- value.high[,c(2:669)]
for (x in high.test) {
if (trend.test(x)$estimate > 0) {
print("inc")
}
else{
print("dec")
}
}
trend.high <- read_excel("data/trends_high_end.xlsx") %>%
separate(trend, c("n", "trend"), sep = " ")
unique(trend.high$trend)
## [1] "\"inc\"" "\"dec\""
#number of increasing and decreasing in high_end market
high.test <- value.high[,c(2:669)]
high.increase <- 0
high.decrease <- 0
for (x in high.test) {
if (trend.test(x)$estimate > 0) {
high.increase = high.increase + 1
}
else{
high.decrease = high.decrease + 1
}
}
high.increase
## [1] 340
high.decrease
## [1] 328
There are 340 customers have increasing spending trend among past two years, and 328 customers have decreasing spending trend among past two years.
#add the tend column to high_end table based on HSHD_ID, call new table high
high_ <- value.high %>% gather(ID, spend, 2:669)
high.ID <- as.vector(unique(high_$ID))
high.trend <- as.vector(trend.high$trend)
high.comb <- cbind.data.frame(high.ID, high.trend)
high.inc <- high.comb[high.comb$high.trend == "\"inc\"", 1]
high.dec <- high.comb[high.comb$high.trend == "\"dec\"", 1]
high.part1 <- high_end[which(high_end$HSHD_NUM == high.inc), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION","X5")]
high.part2 <- high_end[which(high_end$HSHD_NUM == high.dec), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "X5")]
high.part1["trends"] <- rep("inc")
high.part2["trends"] <- rep("dec")
high <- rbind.data.frame(high.part1, high.part2)
high.trend.L <- ggplot(high, aes(trends)) + scale_fill_brewer(palette = "Pastel2")
high.trend.L + geom_histogram(aes(fill = factor(L)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "increasing and decreasing trends shows in high-end customer loyalty")
Observation:
Most high-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.
Among the small portion of the high-end customers who were not loyalty members of the retailer, more customers decreased their spend over time.
Loyalty customers tend to spend more over time. The retailer needs to take action to reward their loyalty members and keep them in basket. Meanwhile, they could hand out some coupons or other promotions to invite more customers join the loyalty membership and increase their spend in the long run.
high.trend.stores <- ggplot(high, aes(trends)) + scale_fill_brewer(palette =
"Pastel2")
high.trend.stores + geom_histogram(aes(fill = factor(BRAND_TY)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "increasing and decreasing trends of high end markect shows in
brand type selection")
Observation:
Majority of high-end customers turned to national brands. More customers increased their spend over time.
The very low portion of high-end customers especially compared to national-brand-customer turned to private brands. Among the private-brand-customer, slightly more than half of them increased their spend over time.
Even though private-brand-buyer was only a small part of the high-end customer group, many of them tended to buy more over time. The retailer can promote more on the specific rand and product which gained the best sell in the past and use the product to keep customers.
high.trend.stores <- ggplot(high, aes(trends)) + scale_fill_brewer(palette =
"Pastel2")
high.trend.stores + geom_histogram(aes(fill = factor(STORE_R)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "increasing and decreasing trends of high-end market shows in four
stores")
Observation: * Eastern store had the majority of high-end customers in all four stores, and over half of its customers increased their spend. More customers increased their spend than the number of customers who decreased their spend over time.
Central store had the minority of high-end customers in all four stores, and slightly more customers decreased their spend over time.
Southern store and western store were similar at the number of high-end customers, however, more than half of customers of southern store decreased their spending over time, while more than half of customers of western store increased their spend over time.
The other three stores need to learn from eastern store, and not only try to attract more customers in the future but also try to promote the customers to buy more.
high.store.trend <- ggplot(high, aes(STORE_R)) + scale_fill_brewer(palette =
"Pastel2")
high.store.trend + geom_histogram(aes(fill = factor(HSHD_COMPOSITION)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "store selection by high-end customers with different
household composition")
Observation:
Overall speaking, the largest group of customers had family with 2 adults and kids.
Eastern store had the majority of high-end customers in all four stores. The number of customers who had family with 1 adult and kids was larger than the other three stores.
Central store had the minority of high-end customers in all four stores. The number of customers who were single adult at home was the lowest among all stores.
Southern store and western store were similar at the number of high-end customers, and both the two stores were higher in the number of customers who were single female or single male by themselves.
high.store.age <- ggplot(high, aes(STORE_R)) + scale_fill_brewer(palette =
"Pastel2")
high.store.age + geom_histogram(aes(fill = factor(L)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "store selection by high-end customers with customer loyalty")
Observation:
Overall speaking, majority of the customers of the four stores were loyalty members.
Eastern store had the majority of high-end customers in all four stores. The number of customers who were not loyalty members tended to be the lowest among all four stores.
Oppositely, central store had the minority of high-end customers in all four stores. The number of customers who were not loyalty members tended to be the highest among all four stores.
Middle-end Customer Group: income range between 50K and 99K
middle_end <- filter(data, INCOME_RANGE == "75-99K" | INCOME_RANGE == "50-74K")
To generate the spending trend of the middle-end customer group
#finding spending trends in middle-end
middle_end$PURCHASE_ <- strptime(middle_end$PURCHASE_,format = "%d-%b-%Y")
mid.end <- middle_end %>% separate(PURCHASE_, c("year", "month", "date"), sep = '-') %>%
unite(ID, HSHD_NUM, year, month, date, sep = "-")
spend.id.mid <- data.frame(mid.end$ID, mid.end$SPEND)
spend.day.mid <- aggregate(mid.end.SPEND ~ mid.end.ID, data = spend.id.mid, FUN = sum)
spend.mid <- spend.day.mid %>%
separate(mid.end.ID, c("ID", "year","month", "date"), sep = '-') %>%
unite(date, year, month, date, sep = "-") %>%
group_by_at(vars(-mid.end.SPEND)) %>%
mutate(row_id = 1:n()) %>% ungroup() %>%
spread(key = ID, value = mid.end.SPEND) %>%
select(-row_id)
value.mid <- spend.mid[, colSums(is.na(spend.mid)) < 720]
In our analysis, we only took customers who had over 6 transactions within two years (the range of the original dataset), to be exact, 726 days intoconsideration. The spending trends of these customers will be generated and performed a time series where the rows indicate every single day from 2016/01/03 to 2017/12/30.
dim(value.mid)
## [1] 726 1234
There were totally 1233 customers who have been taken into consideration when generating the spending trend in middle-end market.
#obtain trends in middle-end market
mid.test <- value.mid[,c(2:1234)]
for (x in mid.test) {
if (trend.test(x)$estimate > 0) {
print("inc")
}
else{
print("dec")
}
}
trend.mid <- read_excel("data/trends_middle_end.xlsx") %>%
separate(trend, c("n", "trend"), sep = " ")
unique(trend.mid$trend)
## [1] "\"dec\"" "\"inc\""
#number of increasing and decreasing in middle-end market
mid.increase <- 0
mid.decrease <- 0
for (x in mid.test) {
if (trend.test(x)$estimate > 0) {
mid.increase = mid.increase + 1
}
else{
mid.decrease = mid.decrease + 1
}
}
mid.increase
## [1] 645
mid.decrease
## [1] 588
There are 645 customers have increasing spending trend among past two years, and 588 customers have decreasing spending trend among past two years.
#add the trends column to high_end table based on HSHD_ID, call new table middle
mid_ <- value.mid %>% gather(ID, spend, 2:1234)
mid.ID <- as.vector(unique(mid_$ID))
mid.trend <- as.vector(trend.mid$trend)
mid.comb <- cbind.data.frame(mid.ID, mid.trend)
mid.inc <- mid.comb[mid.comb$mid.trend == "\"inc\"", 1]
mid.dec <- mid.comb[mid.comb$mid.trend == "\"dec\"", 1]
mid.part1 <- middle_end[which(middle_end$HSHD_NUM == mid.inc), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION","X5")]
mid.part2 <- middle_end[which(middle_end$HSHD_NUM == mid.dec), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "X5")]
mid.part1["trends"] <- rep("inc")
mid.part2["trends"] <- rep("dec")
mid <- rbind.data.frame(mid.part1, mid.part2)
mid.trend.L <- ggplot(mid, aes(trends)) + scale_fill_brewer(palette = "Pastel1")
mid.trend.L + geom_histogram(aes(fill = factor(L)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "increasing and decreasing trends shows in middle-end customer loyalty")
Observation:
Most middle-end customers were the loyalty members of the retailer. More loyalty customers increased their spend over time.
Among the small portion of the middle-end customers who were not loyalty members of the retailer, more customers tended to decrease their spend over time.
mid.trend.stores <- ggplot(mid, aes(trends)) + scale_fill_brewer(palette = "Pastel1")
mid.trend.stores + geom_histogram(aes(fill = factor(BRAND_TY)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "increasing and decreasing trends of middle end markect shows in brand type selection")
Observation:
Majority of middle-end customers turned to national brands. There were more customers increased their spend over time while the rest decreased their spend over time.
The smaller portion of middle-end customers turned to private brands. Among the private-brand-customer, more than half of their increased their spend over time.
There was no big division between midd-end customers who increased spend over time or decreased their spend over time on either national brand or private brand.
mid.trend.stores <- ggplot(mid, aes(trends)) + scale_fill_brewer(palette = "Pastel1")
mid.trend.stores + geom_histogram(aes(fill = factor(STORE_R)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "increasing and decreasing trends of middle-end market shows in four stores")
Observation:
Eastern store had the majority of middle-end customers in all four stores, however, over half of its customers decreased their spend over time.
Central store had the lowest portion of middle-end customers in all four stores, and slightly more than half of its customers decreased their spend over time.
Southern store and western store were similar at the number of middle-end customers, however, more than half of customers of southern store increased their spending over time, while more than half of customers of western store increased their spend over time as well.
The central store need to take action to persuasive their customers to spend more, and the other stores, even though didn’t have as many as customers as the central store, did a fair job on promote their customers spend more over time.
mid.store.trend <- ggplot(mid, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel1")
mid.store.trend + geom_histogram(aes(fill = factor(HSHD_COMPOSITION)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "store selection by mid-end customers with different household composition")
Observation:
Overall speaking, majority of the middle-end customers had a family with two adults and kids.
Eastern store had the majority of middle-end customers in all four stores. The number of customers who were single females was lower than the other three stores.
Central store had the minority of middle-end customers in all four stores. The number of customers who were single adult at home was the lowest among all stores.
Southern store and western store were similar at the number of high-end customers, and both the two stores had more customers who were single female or single male by themselves.
mid.store.age <- ggplot(mid, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel1")
mid.store.age + geom_histogram(aes(fill = factor(L)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "store selection by mid-end customers with customer loyalty")
Observation:
Overall speaking, majority of the customers of the four stores were loyalty members.
Eastern store had the majority of mid-end customers in all four stores. The number of customers who were not loyalty members tended to be the lowest among all four stores.
Oppositely, central store had the minority of high-end customers in all four stores. The number of non-loyal customers tended to be the highest among all stores.
Low-end Customer Group: income range below 49K.
low_end <- filter(data, INCOME_RANGE == "35-49K" | INCOME_RANGE == "UNDER 35K")
To generate the spending trend of the low-end customer group
#finding spending trends in low-end
low_end$PURCHASE_ <- strptime(low_end$PURCHASE_,format = "%d-%b-%Y")
low.end <- low_end %>% separate(PURCHASE_, c("year", "month", "date"), sep = '-') %>%
unite(ID, HSHD_NUM, year, month, date, sep = "-")
spend.id.low <- data.frame(low.end$ID, low.end$SPEND)
spend.day.low <- aggregate(low.end.SPEND ~ low.end.ID, data = spend.id.low, FUN = sum)
spend.low <- spend.day.low %>%
separate(low.end.ID, c("ID", "year","month", "date"), sep = '-') %>%
unite(date, year, month, date, sep = "-") %>%
group_by_at(vars(-low.end.SPEND)) %>%
mutate(row_id = 1:n()) %>% ungroup() %>%
spread(key = ID, value = low.end.SPEND) %>%
select(-row_id)
value.low <- spend.low[, colSums(is.na(spend.low)) < 720]
In our analysis, we only took customers who had over 6 transactions within two years (the range of the original dataset), to be exact, 726 days intoconsideration. The spending trends of these customers will be generated and performed a time series where the rows indicate every single day from 2016/01/03 to 2017/12/30.
dim(value.low)
## [1] 726 1415
There were 1414 customers in total be taken into consideration for generating the spending trend in low-end market.
##obtain trends in low-end market
low.test <- value.low[,c(2:1415)]
for (x in low.test) {
if (trend.test(x)$estimate > 0) {
print("inc")
}
else{
print("dec")
}
}
trend.low <- read_excel("data/trends_low_end.xlsx") %>%
separate(trend, c("n", "trend"), sep = " ")
unique(trend.low$trend)
## [1] "\"inc\"" "\"dec\""
#number of increasing and decreasing in low-end market
low.test <- value.low[,c(2:1415)]
low.increase <- 0
low.decrease <- 0
for (x in low.test) {
if (trend.test(x)$estimate > 0) {
low.increase = low.increase + 1
}
else{
low.decrease = low.decrease + 1
}
}
low.increase
## [1] 748
low.decrease
## [1] 666
There are 748 customers have increasing spending trend among past two years, and 666 customers have decreasing spending trend among past two years.
#add the trends column to low_end table based on HSHD_ID, call new table low
low_ <- value.low %>% gather(ID, spend, 2:1415)
low.ID <- as.vector(unique(low_$ID))
low.trend <- as.vector(trend.low$trend)
low.comb <- cbind.data.frame(low.ID, low.trend)
low.inc <- low.comb[low.comb$low.trend == "\"inc\"", 1]
low.dec <- low.comb[low.comb$low.trend == "\"dec\"", 1]
low.part1 <- low_end[which(low_end$HSHD_NUM == low.inc), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER","X5")]
low.part2 <- low_end[which(low_end$HSHD_NUM == low.dec), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER", "X5")]
low.part1["trends"] <- rep("inc")
low.part2["trends"] <- rep("dec")
low <- rbind.data.frame(low.part1, low.part2)
low.trend.L <- ggplot(low, aes(trends)) + scale_fill_brewer(palette = "Pastel2")
low.trend.L + geom_histogram(aes(fill = factor(L)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "increasing and decreasing trends shows in low-end customer loyalty")
Observation:
Most low-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.
Among the small portion of the low-end customers who were not loyalty members of the retailer, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
low.trend.stores <- ggplot(low, aes(trends)) + scale_fill_brewer(palette = "Pastel2")
low.trend.stores + geom_histogram(aes(fill = factor(BRAND_TY)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "increasing and decreasing trends of low end markect shows in brand
type selection")
Observation:
Majority of low-end customers turned to national brands. More customers increased their spend over time while the rest decreased their spend over time.
A lower portion of low-end customers turned to private brands. The number of customers who either decreased their spend or increased their spend seemed to be even numbers.
For low-end market, customers didn’t have a strong preference of brand type compared with middle-end market and high-end market. The retailer did a fair job to push its customers to increase their spend over time.
low.trend.stores <- ggplot(low, aes(trends)) + scale_fill_brewer(palette = "Pastel2")
low.trend.stores + geom_histogram(aes(fill = factor(STORE_R)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "increasing and decreasing trends of low-end market shows in four stores")
Observation:
Eastern store had the majority of low-end customers in all four stores, and over half of its customers increased their spend over time.
Southern store had the lowest portion of low-end customers in all four stores, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
Central store and western store were similar at the number of low-end customers. Though they had lower portion of low-end customers compared with the other two stores, more customers increased their spend over time.
The four store did a fair job of promoting their customers to spend more.
low.store.trend <- ggplot(low, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel2")
low.store.trend + geom_histogram(aes(fill = factor(HSHD_COMPOSITION)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "store selection by low-end customers with different household composition")
Observation:
Majority of the low-end customers had a family with two adults. The second largest portion of the low-end customers was customers with a family with two adults and kids.
Eastern store had the majority of low-end customers in all four stores. The number of customers who were single males was higher than the other three stores.
Southern store had the minority of low-end customers in all four stores. The number of customers who were single adult whether male or female seemed to be even.
low.store.age <- ggplot(low, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel2")
low.store.age + geom_histogram(aes(fill = factor(L)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "store selection by low-end customers with customer loyalty")
Observation:
Overall speaking, majority of the low-end customers of the four stores were loyalty members.
Eastern store had the majority of low-end customers in all four stores. The number of customers who were not loyalty members tended to be lower than the others.
Summary on our problem statement and how we addressed them
The main goal of this analysis is to find out the spending trends of customers on whether they increased their spend over time or they decreased their spend according to our records. Other than that, we also look into details in the trends showed in loyalty, brand type of product purchased, store selections and customer features of loyalty and household composition in store selection. Data of the transactions taken place in two years provided by our professor has been used in this analysis. Our methods being used in this project include but not limited to dividend, time series analysis and plots. Finally we generate inspirations from the analysis and show in this report.
Some interesting finds we generating from our analysis:
There was not a huge difference betwwen the number of customers who whether increased or decreased their spend over time
Eastern store tended to have the best performance overall.
Loyalty members would like to revisit stores from time to time and increase their spend in the long run.
Customers were much more interested in national brands compared with private brands.
The compositions of customers in each store selection were quite similar from our analysis and the plots we generated. Family with kids tended to be the major term.
We might have some ideas and suggestions for the retailer:
Look carefully at Eastern store and learn more about the secret of the success in this particular store. Whether it is the location, the assistants in the store, or just the right promotions. Maybe reorganize other stores a little bit based on further detailed analysis.
Making sure your customers are loyal is critical. Based on our analysis, loyal customers tended to buy more over time. Non-loyalty members might decrease their spend or even quit. Keep you loyalty members while invite new customers to join your loyalty family!
Some stores such as Central Store tended to have more single customers besides the major group of customers who have adults-with-kids. These stores may have more products target on the single person while remain the abundance of househols products.
Even though customers purchased mainly and hugely on national products, there was an increasing trend in private brands ass well. It will be helpful to find out several particular brands customers crave for and increase sells potentially.
Customers are definitely in the first place for a retailer. There will not be an end in learning more about customers!
Further analysis
The trend analysis is not perfect for the reason that further hypothesis tests were not generated to prove the existence of linear regression we obtained in the previous steps. To be exact, some of the trends of customer-spending may not have linear correlation. We would kindly and sincerely suggest to add hypothesis test to develop model and provide more evidence in further trend analysis in order to improve the accurancy of the result.
Further analysis is able to perform if more detailed information on customers are provided. Then more detailed results and suggestions can be made to provide to the retailer on what kind of promotions or improvement they may have and how to take those in action.