The core of a business is to learn about its customers.
A company can never grow larger nor further if the management levels know nothing about its customer or competitor. Especially when their competitor stay close to the trend of society and always keep an eye on the customers. A company needs to learn from the behavior of the current customers, which might be the solid and easiest information they can collect, in order to keep the current customers around, attract more new customers and improve itself. So does a retailer.
To help the retailer have better understanding of its customer, this project mainly provided an analysis on the spending trends of the customer. Households with different levels of incoming will have their distinct needs. Meanwhile, they have no competition against each other. The retailer is supposed to react with groups with different ranges of income respectively. Based on the range of income of the 2500 households, we dvided them into three customer groups:
Each question been asked and explained in this project will be explicitly spread in these three customer groups.
To obtain the specific details which will help the retailer learn more about its customers and improve themselves to perform better in the future, several questions have been raised up with and will be explained and solved in this project are listed as follows:
Data Source
This dataset was released by 84.51º for academic research and case study, and was provided by our professor Bradley Boehmke for the Data Wrangling with R course at University of Cincinnati. The data source can be find at course page. The whole dataset contains a household group of 2500 households and thier household level transactions over two years. All the households are frequent shoppers at a retailer. This dataset also contains the demographic information of the 2500 households in order to help the retailer learn more about its customers. This data provided an opportunity to explore 84.51°/Kroger data and an opportunity to apply skills learned to investigate different business questions.
Methodology
Dividend has been used to divide the whole large dataset into three groups of customers based on their income ranges.
Time Series Analysis has been applied in this project to generate the spending trend of customers.
Plots have been drawed to see the distribution of and comparision between the increasing and decreasing spending trend of customers in the related demographic factors.
The mission of this analysis project is to help the retailer have a better understanding of its customers. Understand the data they collected from their customers can help it understand the market and have win-win relationship with their respectful customers. There is no doubt that the final goal of each business is to earn profit.
The spending trends of the customers in two years (2016 and 2017) will provide the retailer with a detailed picture of its performance by pointing out how many customers spending less as well as how many customers tend to spending more. The distinct factors such as store selection, brand selection, and product selection by customers will help the retailer improving itself in order to attract more customers. Other demographic factors, for example, age range and household composition of customers will help the retailer look into more specific details so it can provide its customers with more proper promotion, which will help the retailer gain profits in return.
The whole project can not only help the retailer with its performance, but also can help each reader with their understanding of customer behaviors, customer spending trends and other aspects of business. Will provide you with new ideas about business while you are looking through the whole project we did for sure!
Packages being used in this project. If you want to run the code individually, please install these packages first.
library(tidyr) # used for tidying up data
library(dplyr) # used for data manipulation
library(lubridate) # used for transforming date
library(knitr) # used for viewing data
library(printr) # used for viewing data and tables
library(leaflet) # used for viewing data
library(readr) # used for importing and reading .csv files
library(readxl) # used for reading .xlsx files
library(ggplot2) # used for data visualization
library(pastecs) # used for analysis of space-time ecological series
In this section, detailed description of data, the main procedures of importing and cleaning data, and a brief final data preview will be provided.
Original Data Variables
Variables Created in this project
read_csv() function, which was a csv file with 5000 observations and 9 variables. Named the dataset “households.”read_csv() function, which was a csv file with 151,141 observations and 5 variables. Named the dataset “products.”read_csv() function, which was a csv file with 10,625,553 observations and 9 variables. Named the dataset “transactions.”households <- read_csv("~/Desktop/Files~/UC/2018 Fall/Wrangling R/Complete-Journey-Reduced/5000_households.csv")
products <- read_csv("~/Desktop/Files~/UC/2018 Fall/Wrangling R/Complete-Journey-Reduced/5000_products.csv")
transactions <- read_csv("~/Desktop/Files~/UC/2018 Fall/Wrangling R/Complete-Journey-Reduced/5000_transactions.csv")
There are 68 missing values in Households dataset, and they are located at Household Composition variable. There is 0 missing value in Products dataset. There is 0 missing value in Transactions dataset.
The missing values just indicated that some of the household composition information were missing from the customer. The analysis will be influenced if we simply remove all the missing values from the original data source.
Clean Data Duplicates will damage our analysis and give us wrong results, so we need to
Check the duplicates data in each dataframe, and
sum(duplicated.data.frame(households))
## [1] 0
sum(duplicated.data.frame(products))
## [1] 0
trans_clean <- transactions %>% distinct()
prod_trans <- merge(trans_clean, products)
data <- merge(prod_trans, households)
head()function andsummary()function were used to generate a brief preview of the combined clean dataset.
head(data, 6)
| HSHD_NUM | PRODUCT_NUM | BASKET_NUM | PURCHASE_ | SPEND | UNITS | STORE_R | WEEK_NUM | YEAR | DEPARTMENT | COMMODITY | BRAND_TY | X5 | L | AGE_RANGE | MARITAL | INCOME_RANGE | HOMEOWNER | HSHD_COMPOSITION | HH_SIZE | CHILDREN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0001 | 1345638 | 559252 | 29-JAN-17 | 4.01 | 1 | EAST | 57 | 2017 | FOOD | MEAT - BEEF | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 0324896 | 713487 | 18-MAY-17 | 3.79 | 1 | EAST | 72 | 2017 | FOOD | INTERNATIONAL FOOD | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 1029355 | 203002 | 25-MAY-16 | 1.63 | 1 | EAST | 21 | 2016 | FOOD | PRODUCE | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 0654440 | 741608 | 07-JUN-17 | 1.99 | 1 | EAST | 75 | 2017 | FOOD | INTERNATIONAL FOOD | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 4174205 | 588333 | 19-FEB-17 | 2.50 | 2 | EAST | 60 | 2017 | FOOD | PRODUCE | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
| 0001 | 2933903 | 699493 | 09-MAY-17 | 3.99 | 1 | EAST | 71 | 2017 | FOOD | GROCERY STAPLE | NATIONAL | N | Y | 55-64 | Married | 150K+ | Homeowner | 2 Adults and Kids | 3 | 1 |
summary(data)
| HSHD_NUM | PRODUCT_NUM | BASKET_NUM | PURCHASE_ | SPEND | UNITS | STORE_R | WEEK_NUM | YEAR | DEPARTMENT | COMMODITY | BRAND_TY | X5 | L | AGE_RANGE | MARITAL | INCOME_RANGE | HOMEOWNER | HSHD_COMPOSITION | HH_SIZE | CHILDREN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Min. :-14.990 | Min. :-26.000 | Length:10625131 | Min. : 1.00 | Min. :2016 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | Length:10625131 | |
| Class :character | Class :character | Class :character | Class :character | 1st Qu.: 1.670 | 1st Qu.: 1.000 | Class :character | 1st Qu.: 27.00 | 1st Qu.:2016 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | |
| Mode :character | Mode :character | Mode :character | Mode :character | Median : 2.580 | Median : 1.000 | Mode :character | Median : 52.00 | Median :2016 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | |
| NA | NA | NA | NA | Mean : 3.598 | Mean : 1.296 | NA | Mean : 52.48 | Mean :2016 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| NA | NA | NA | NA | 3rd Qu.: 4.000 | 3rd Qu.: 1.000 | NA | 3rd Qu.: 78.00 | 3rd Qu.:2017 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| NA | NA | NA | NA | Max. :299.990 | Max. :246.000 | NA | Max. :104.00 | Max. :2017 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Time Series Analysis has been applied in this project to generate the spending trend of customers. Plots have been drawed to see the distribution of and comparision between the increasing and decreasing spending trend of customers in the related demographic factors.
High-end Customer Group: income range above 100K
high_end <- filter(data, INCOME_RANGE == "150K+" | INCOME_RANGE == "100-150K")
To generate the spending trend of the high-end customer group
#find spending trends in high-end
high_end$PURCHASE_ <- strptime(high_end$PURCHASE_,format = "%Y-%b-%d")
separate <- high_end %>% separate(PURCHASE_, c("date", "month", "year"), sep = '-')
unite <- separate %>% unite(ID, HSHD_NUM, year, month, date, sep = "-")
spend <- data.frame(unite$ID, unite$SPEND)
ag <- aggregate(unite.SPEND~unite.ID, data = spend, FUN = sum)
separate_1 <- ag %>% separate(unite.ID, c("ID", "year","month", "date"), sep = '-')
unite_1 <- separate_1 %>% unite(date, year, month, date, sep = "-")
spread <- unite_1 %>%
group_by_at(vars(-unite.SPEND)) %>%
mutate(row_id = 1:n()) %>% ungroup() %>%
spread(key = ID, value = unite.SPEND) %>%
select(-row_id)
value <- spread[, colSums(is.na(spread)) < 720]
dim(value)
## [1] 726 669
#obtain trends in high-end market
ts <- value[,c(2:669)]
for (x in ts) {
if (trend.test(x)$estimate > 0) {
print("inc")
}
else{
print("dec")
}
}
trend_1 <- read_excel("~/Desktop/Files~/UC/2018 Fall/Wrangling R/trends_high_end.xlsx")
trend_2 <- trend_1 %>% separate(trend, c("n", "trend"), sep = " ")
unique(trend_2$trend)
## [1] "\"inc\"" "\"dec\""
#number of increasing and decreasing in high_end market
up <- 0
down <- 0
for (x in ts) {
if (trend.test(x)$estimate > 0) {
up = up + 1
}
else{
down = down + 1
}
}
up
## [1] 340
down
## [1] 328
There are 340 customers have increasing spending trend among past two years, and 328 customers have decreasing spending trend among past two years.
#add the tend column to high_end table based on HSHD_ID, call new table high
gather <- value %>% gather(ID, spend, 2:669)
ID <- as.vector(unique(gather$ID))
trend <- as.vector(trend_2$trend)
htrend <- cbind.data.frame(ID, trend)
increase <- htrend[htrend$trend == "\"inc\"", 1]
decrease <- htrend[htrend$trend == "\"dec\"", 1]
part1 <- high_end[which(high_end$HSHD_NUM == increase), c("HSHD_NUM","STORE_R",
"DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
part2 <- high_end[which(high_end$HSHD_NUM == decrease), c("HSHD_NUM","STORE_R",
"DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
part1["trends"] <- rep("inc")
part2["trends"] <- rep("dec")
high <- rbind.data.frame(part1, part2)
store <- ggplot(high, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel1")
store + geom_histogram(aes(fill = factor(trends)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "High-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Store Selection by Customers")
Observation: * Eastern store had the majority of high-end customers in all four stores, and over half of its customers increased their spend.
Central store had the minority of high-end customers in all four stores, and slightly more than half of its customers decreased their spend over time.
Southern store and western store were similar at the number of high-end customers, however, more than half of customers of southern store decreased their spending over time, while more than half of customers of western store increased their spend over time.
The other three stores need to learn from eastern store, and not only try to attract more customers in the future but also try to promote the customers to buy more.
brand_ty <- ggplot(high, aes(BRAND_TY)) + scale_fill_brewer(palette = "Pastel1")
brand_ty + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "High-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Brand Type Selection by Customers")
Observation:
Majority of high-end customers turned to national brands. There were sightly more customers increased their spend over time while the rest decreased their spend over time.
The very low portion of high-end customers especially compared to national-brand-customer turned to private brands. Among the private-brand-customer, more than half of them increased their spend over time.
Even though private-brand-buyer was only a small part of the high-end customer group, many of them tended to buy more over time. The retailer can promote more on the specific rand and product which gained the best sell in the past and use the product to keep customers.
product <- ggplot(high, aes(DEPARTMENT)) + scale_fill_brewer(palette = "Pastel1")
product + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "High-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Product Type Selection by Customers")
This plot mainly shows preference of the high-end customer group of the type of product based on departments.
Observation:
Most high-end customers bought food products, and slightly over half of them increased their spend.
A small portion of high-end customers chose non-food product, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
Only a few high-end customers chose pharma, and this small group of customers decreased their spend over time.
loyalty <- ggplot(high, aes(L)) + scale_fill_brewer(palette = "Pastel1")
loyalty + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "High-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Loyalty of Customers")
Observation:
Most high-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.
Among the small portion of the high-end customers who were not loyalty members of the retailer, most customers also increased their spend over time.
Age_range <- ggplot(high, aes(AGE_RANGE)) + scale_fill_brewer(palette = "Pastel1")
Age_range + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "High-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Age Range of Customers")
Observation:
The majority of the high-end customer group were households aged between 45 and 54. And gladly we may see from the plot that most of them increased their spend over time.
The second large portion of the high-end customer group aged among 55 to 64, more than half of this portion also increased their spend over time.
The third large portion of the high-end customer group aged among 35 to 44, most of these customers also increased their spend over time.
composition <- ggplot(high, aes(HSHD_COMPOSITION)) +
scale_fill_brewer(palette = "Pastel1")
composition + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "High-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Household Composition of Customers")
Observation:
Majority of the high-end group of customers had a family with two adults and kids. The portion of customers increased their spend over time were resemble to the portion of customers decreased their spend over time.
The other customer in the high-end group with different household composition, such as single male/female, one adult and kids and two adults and kids, mostly increased their spend over time.
Middle-end Customer Group: income range between 50K and 99K
middle_end <- filter(data, INCOME_RANGE == "75-99K" | INCOME_RANGE == "50-74K")
To generate the spending trend of the middle-end customer group
#finding trends in middle-end
middle_end$PURCHASE_ <- strptime(middle_end$PURCHASE_, format = "%d-%b-%Y")
separate_m <- middle_end %>% separate(PURCHASE_, c("date", "month", "year"),
sep = '-')
unite_m <- separate_m %>% unite(ID, HSHD_NUM, year, month, date, sep = "-")
spend_m <- data.frame(unite_m$ID, unite_m$SPEND)
agg <- aggregate(unite_m.SPEND~unite_m.ID, data = spend_m, FUN = sum)
separate_1m <- agg %>% separate(unite_m.ID, c("ID", "date","month", "year"),
sep = '-')
unite_1m <- separate_1m %>% unite(date, year, month, date, sep = "-")
spread_m <- unite_1m %>%
group_by_at(vars(-unite_m.SPEND)) %>%
mutate(row_id = 1:n()) %>% ungroup() %>%
spread(key = ID, value = unite_m.SPEND) %>%
select(-row_id)
value_m <- spread_m[, colSums(is.na(spread_m)) < 720]
dim(value_m)
## [1] 726 1234
#obtain trends in middle-end market
ts_m1 <- value_m[ , c(2:700)]
ts_m2 <- value_m[ , c(701:1234)]
ts_m <- value_m[ , c(2:1234)]
for (x in ts_m1) {
if (trend.test(x)$estimate > 0) {
print("inc")
}
else{
print("dec")
}
}
for (x in ts_m2) {
if (trend.test(x)$estimate > 0) {
print("inc")
}
else{
print("dec")
}
}
trend_1m <- read_excel("~/Desktop/Files~/UC/2018 Fall/Wrangling R/trends_middle_end.xlsx")
trend_m <- trend_1m %>% separate(trend, c("n", "trend"), sep = " ")
unique(trend_m$trend)
## [1] "\"dec\"" "\"inc\""
#number of increasing and decreasing in middle-end market
count_up <- 0
count_down <- 0
for (x in ts_m) {
if (trend.test(x)$estimate > 0) {
count_up = count_up + 1
}
else{
count_down = count_down + 1
}
}
count_up
## [1] 645
count_down
## [1] 588
There are 645 customers have increasing spending trend among past two years, and 588 customers have decreasing spending trend among past two years.
#add the trends column to high_end table based on HSHD_ID, call new table middle
gather_m <- value_m %>% gather(ID, spend, 2:1234)
m_ID <- as.vector(unique(gather_m$ID))
m_trend <- as.vector(trend_m$trend)
trendID <- cbind.data.frame(m_ID, m_trend)
m_increase <- trendID[trendID$m_trend == "\"inc\"", 1]
m_decrease <- trendID[trendID$m_trend == "\"dec\"", 1]
m_part1 <- middle_end[which(middle_end$HSHD_NUM == m_increase),
c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L",
"AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
m_part2 <- middle_end[which(middle_end$HSHD_NUM == m_decrease),
c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L",
"AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
m_part1["trends"] <- rep("inc")
m_part2["trends"] <- rep("dec")
middle <- rbind.data.frame(m_part1, m_part2)
m_store <- ggplot(middle, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel2")
m_store + geom_histogram(aes(fill = factor(trends)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = 0.1) +
labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Store Selection by Customers")
Observation:
Eastern store had the majority of middle-end customers in all four stores, however, over half of its customers decreased their spend over time.
Central store had the lowest portion of middle-end customers in all four stores, and slightly more than half of its customers decreased their spend over time.
Southern store and western store were similar at the number of middle-end customers, however, more than half of customers of southern store increased their spending over time, while more than half of customers of western store increased their spend over time as well.
The central store need to take action to persuasive their customers to spend more, and the other stores, even though didn’t have as many as customers as the central store, did a fair job on promote their customers spend more over time.
m_brand_ty <- ggplot(middle, aes(BRAND_TY)) + scale_fill_brewer(palette = "Pastel2")
m_brand_ty + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Brand Type Selection by Customers")
Observation:
Majority of middle-end customers turned to national brands. There were more customers increased their spend over time while the rest decreased their spend over time.
The smaller portion of middle-end customers turned to private brands. Among the private-brand-customer, more than half of their increased their spend over time.
There was no big division between midd-end customers who increased spend over time or decreased their spend over time on either national brand or private brand.
m_product <- ggplot(middle, aes(DEPARTMENT)) + scale_fill_brewer(palette = "Pastel1")
m_product + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Product Type Selection by Customers")
This plot mainly shows preference of the middle-end customer group of the type of product based on departments.
Observation:
Most middle-end customers bought food products, and over half of them increased their spend over time.
A small portion of middle-end customers chose non-food product, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
Only a few middle-end customers chose pharma, and this small group of customers decreased their spend over time.
m_loyalty <- ggplot(middle, aes(L)) + scale_fill_brewer(palette = "Pastel1")
m_loyalty + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Loyalty of Customers")
Observation:
Most middle-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.
Among the small portion of the middle-end customers who were not loyalty members of the retailer, more customers tended to decrease their spend over time.
m_Age_range <- ggplot(middle, aes(AGE_RANGE)) + scale_fill_brewer(palette = "Pastel1")
m_Age_range + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Age Range of Customers")
Observation:
The majority of the middle-end customers were households aged between 45 and 54. Gladly we may see from the plot that most of them increased their spend over time.
The second large portion of the middle-end customer group aged among 55 to 64, more than half of this portion also increased their spend over time.
The third large portion of the middle-end customer group aged among 35 to 44, most of these customers decreased their spend over time.
The rest middle-end customers aged either below 34 or above 65 were more likely to increased their spend over time.
It is always a good choice for the retailer to target at the age group from 45 to 64 while maintain their customers in other age ranges.
m_composition <- ggplot(middle, aes(HSHD_COMPOSITION)) +
scale_fill_brewer(palette = "Pastel1")
m_composition + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Middle-end Customer Group by Increasing and Decreasing Spendd",
subtitle = "Household Composition of Customers")
Observation:
Majority of the middle-end customers had a family with two adults and kids. The portion of this group of customers who either increased or decreased their spend over time was almost 50-50.
The other customer in the middle-end group with different household composition, such as single male/female, one adult and kids and two adults and kids, mostly increased their spend over time.
Low-end Customer Group: income range below 49K.
low_end <- filter(data, INCOME_RANGE == "35-49K" | INCOME_RANGE == "UNDER 35K")
#finding trends in low-end
low_end$PURCHASE_ <- strptime(low_end$PURCHASE_,format = "%d-%b-%Y")
separate_l <- low_end %>% separate(PURCHASE_, c("date", "month", "year"), sep = '-')
unite_l <- separate_l %>% unite(ID, HSHD_NUM, year, month, date, sep = "-")
spend_l <- data.frame(unite_l$ID, unite_l$SPEND)
agg_l <- aggregate(unite_l.SPEND~unite_l.ID, data = spend_l, FUN = sum)
separate_1l <- agg_l %>% separate(unite_l.ID, c("ID", "date","month", "year"), sep = '-')
unite_1l <- separate_1l %>% unite(date, year, month, date, sep = "-")
spread_l <- unite_1l %>%
group_by_at(vars(-unite_l.SPEND)) %>%
mutate(row_id = 1:n()) %>% ungroup() %>%
spread(key = ID, value = unite_l.SPEND) %>%
select(-row_id)
value_l <- spread_l[, colSums(is.na(spread_l)) < 720]
dim(value_l)
## [1] 726 1415
##obtain trends in low-end market
ts_l1 <- value_l[,c(2:700)]
ts_l2 <- value_l[,c(701:1415)]
ts_l <- value_l[,c(2:1415)]
for (x in ts_l1) {
if (trend.test(x)$estimate > 0) {
print("inc")
}
else{
print("dec")
}
}
for (x in ts_l2) {
if (trend.test(x)$estimate > 0) {
print("inc")
}
else{
print("dec")
}
}
trend_1l <- read_excel("~/Desktop/Files~/UC/2018 Fall/Wrangling R/trends_low_end.xlsx")
trend_l <- trend_1l %>% separate(trend, c("n", "trend"), sep = " ")
unique(trend_l$trend)
## [1] "\"inc\"" "\"dec\""
#number of increasing and decreasing in low-end market
count_up <- 0
count_down <- 0
for (x in ts_l) {
if (trend.test(x)$estimate > 0) {
count_up = count_up + 1
}
else{
count_down = count_down + 1
}
}
count_up
## [1] 748
count_down
## [1] 666
There are 748 customers have increasing spending trend among past two years, and 666 customers have decreasing spending trend among past two years.
#add the trends column to low_end table based on HSHD_ID, call new table low
gather_l <- value_l %>% gather(ID, spend, 2:1415)
l_ID <- as.vector(unique(gather_l$ID))
l_trend <- as.vector(trend_l$trend)
trendID_l <- cbind.data.frame(l_ID, l_trend)
l_increase <- trendID_l[trendID_l$l_trend == "\"inc\"", 1]
l_decrease <- trendID_l[trendID_l$l_trend == "\"dec\"", 1]
l_part1 <- low_end[which(low_end$HSHD_NUM == l_increase), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
l_part2 <- low_end[which(low_end$HSHD_NUM == l_decrease), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
l_part1["trends"] <- rep("inc")
l_part2["trends"] <- rep("dec")
low <- rbind.data.frame(l_part1, l_part2)
l_store <- ggplot(low, aes(STORE_R)) + scale_fill_brewer(palette = "RdPu")
l_store + geom_histogram(aes(fill = factor(trends)),
stat = "count",
bins = 4,
binwidth = 2,
col = "white",
size = .1) +
labs(title = "Low-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Store Selection by Customers")
Observation:
Eastern store had the majority of low-end customers in all four stores, and over half of its customers increased their spend over time.
Central store had the second largest portion of low-end customers in all four stores, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
Southern store and western store were similar at the number of low-end customers. Though they had lower portion of low-end customers compared with the other two stores, more than half of stores’ customers increased their spend over time.
The four store did a fair job of promoting their customers to spend more.
l_brand_ty <- ggplot(low, aes(BRAND_TY)) + scale_fill_brewer(palette = "RdPu")
l_brand_ty + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Low-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Brand Type Selection by Customers")
Observation:
Majority of low-end customers turned to national brands. More than half of the customers increased their spend over time while the rest decreased their spend over time.
A lower portion of low-end customers turned to private brands. Among the private-brand-customer, more than half of them increased their spend over time.
For low-end market, customers didn’t have a strong preference of brand type compared with middle-end market and high-end market. The retailer did a fair job to push its customers to increase their spend over time.
l_product <- ggplot(low, aes(DEPARTMENT)) + scale_fill_brewer(palette = "Pastel1")
l_product + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Low-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Product Type Selection by Customers")
This plot mainly shows preference of the low-end customer group of the type of product based on departments.
Observation:
Most low-end customers bought food products, and over half of them increased their spend over time.
Very small portion of low-end customers chose non-food product, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
Only a few low-end customers chose pharma, and this small group of customers decreased their spend over time.
l_loyalty <- ggplot(low, aes(L)) + scale_fill_brewer(palette = "RdPu")
l_loyalty + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Low-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Loyalty of Customers")
Observation:
Most low-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.
Among the small portion of the low-end customers who were not loyalty members of the retailer, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
l_Age_range <- ggplot(low, aes(AGE_RANGE)) + scale_fill_brewer(palette = "RdPu")
l_Age_range + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Low-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Age Range of Customers")
Observation:
The majority of the low-end customer group were households aged between 65 and 74. Gladly we may see from the plot that most of them increased their spend over time.
The second large portion of the low-end customer group aged among 45 to 54, more than half of this portion also increased their spend over time.
The third large portion of the low-end customer group aged among 55 to 64, most of these customers also increased their spend over time.
Low-end customers aged between 35 and 44 and above 75 also contributed a large portion of this group of customers. Most customers in these two age ranges increased their spend over time.
Low-end market seems to attract customers in different age ranges compared with middle-end and high-end markets. The retailer may want to provide distinct promotions to attract customers and gain profits.
l_composition <- ggplot(low, aes(HSHD_COMPOSITION)) + scale_fill_brewer(palette = "RdPu")
l_composition + geom_histogram(aes(fill = factor(trends)),
stat = "count",
col = "white",
size = .1) +
labs(title = "Low-end Customer Group by Increasing and Decreasing Spend",
subtitle = "Household Composition of Customers")
Observation:
Majority of the low-end customers had a family with two adults. The portion of this group of customers who either increased or decreased their spend over time was almost 50-50.
The second largest portion of the low-end customers was customers with a family with two adults and kids. Most customers in this segment increased their spend over time.
The other customer in the low-end group with different household composition, single male/female, one adult and kids, mostly increased their spend over time.
Since it is only our mid-term project currently, there will be more data analysis work on this project. To be continued… Thank you so much for watching!