Project Proposal

\(\color{#29A887}{\text{Introduction}}\)

The core of a business is to learn about its customers.

A company can never grow larger nor further if the management levels know nothing about its customer or competitor. Especially when their competitor stay close to the trend of society and always keep an eye on the customers. A company needs to learn from the behavior of the current customers, which might be the solid and easiest information they can collect, in order to keep the current customers around, attract more new customers and improve itself. So does a retailer.

To help the retailer have better understanding of its customer, this project mainly provided an analysis on the spending trends of the customer. Households with different levels of incoming will have their distinct needs. Meanwhile, they have no competition against each other. The retailer is supposed to react with groups with different ranges of income respectively. Based on the range of income of the 2500 households, we dvided them into three customer groups:

High-end Customer Group: income range above 100K;
Middle-end Customer Group: income range between 50K and 99K;
Low-end Customer Group: income range below 49K.

Each question been asked and explained in this project will be explicitly spread in these three customer groups.

\(\color{#5097A4}{\text{Problem Statement}}\)

To obtain the specific details which will help the retailer learn more about its customers and improve themselves to perform better in the future, several questions have been raised up with and will be explained and solved in this project are listed as follows:

According to the spending trends of customers:
- How many customers are spending more over time?
- How many customers are spending less over time?
What trend can be observed in the following factors:
- Store Selection
- Brand Type Selection
- Product Type Selection
- Loyalty
- Age Range
- Household Composition

\(\color{#5097A4}{\text{Data Source and Methodology}}\)

Data Source

This dataset was released by 84.51º for academic research and case study, and was provided by our professor Bradley Boehmke for the Data Wrangling with R course at University of Cincinnati. The data source can be find at course page. The whole dataset contains a household group of 2500 households and thier household level transactions over two years. All the households are frequent shoppers at a retailer. This dataset also contains the demographic information of the 2500 households in order to help the retailer learn more about its customers. This data provided an opportunity to explore 84.51°/Kroger data and an opportunity to apply skills learned to investigate different business questions.

Methodology

Dividend has been used to divide the whole large dataset into three groups of customers based on their income ranges.
Time Series Analysis has been applied in this project to generate the spending trend of customers.
Plots have been drawed to see the distribution of and comparision between the increasing and decreasing spending trend of customers in the related demographic factors.

\(\color{#5097A4}{\text{Mission}}\)

The mission of this analysis project is to help the retailer have a better understanding of its customers. Understand the data they collected from their customers can help it understand the market and have win-win relationship with their respectful customers. There is no doubt that the final goal of each business is to earn profit.

The spending trends of the customers in two years (2016 and 2017) will provide the retailer with a detailed picture of its performance by pointing out how many customers spending less as well as how many customers tend to spending more. The distinct factors such as store selection, brand selection, and product selection by customers will help the retailer improving itself in order to attract more customers. Other demographic factors, for example, age range and household composition of customers will help the retailer look into more specific details so it can provide its customers with more proper promotion, which will help the retailer gain profits in return.

The whole project can not only help the retailer with its performance, but also can help each reader with their understanding of customer behaviors, customer spending trends and other aspects of business. Will provide you with new ideas about business while you are looking through the whole project we did for sure!

\(\color{#29A887}{\text{Packages Required}}\)

Packages being used in this project. If you want to run the code individually, please install these packages first.

library(tidyr)      # used for tidying up data
library(dplyr)      # used for data manipulation
library(lubridate)  # used for transforming date
library(knitr)      # used for viewing data
library(printr)     # used for viewing data and tables
library(leaflet)    # used for viewing data
library(readr)      # used for importing and reading .csv files
library(readxl)     # used for reading .xlsx files
library(ggplot2)    # used for data visualization
library(pastecs)    # used for analysis of space-time ecological series

\(\color{#29A887}{\text{Data Preparation}}\)

In this section, detailed description of data, the main procedures of importing and cleaning data, and a brief final data preview will be provided.

\(\color{#5097A4}{\text{Data Description}}\)

Original Data Variables

Households
- HSHD_NUM: each household(customer) has been assigned a unique ID
- L: whether the customer is a loyalty member or not
- AGE_RANGE: age range of the customer
- MARITAL: marital status of the customer
- INCOME_RANGE: income range of the customer(what we used to divide our investigated customer groups in this project)
- HOMEOWNER: whether the customer is a homeowner or not
- HSHD_COMPOSITION: the composition of the customer’s family
- HH_SIZE: the family size of the customer
- CHILDREN: the number of child the customer has
Products
- PRODUCT_NUM: each product has been assigned a unique ID
- DEPARTMENT: the department each product belongs to
- COMMODITY: product classification such as pet, meditation or cosmetics
- BRAND_TY: the brand type of each product - private or national
- X5: natural or organic flag
Transactions
- Basket_NUM: each transaction has been assigned a unique ID
- PURCHASE_: the exact purchase date of the transaction
- WEEK_NUM: the week number when the transaction happened
- YEAR: which year the transaction happened
- HSHD_NUM: foreign key from Households dataset
- PRODUCT_NUM: foreign key from Products dataset
- SPEND: the money the customer spend(get refund) in the transaction
- UNITS: the number of products the customer purchased(returned) in the transaction
- STORE_R: the region of the store the transaction happened

Variables Created in this project

trans_clean: clean dataset without duplicates
prod_trans: the two datasets(products and transactions) merged together
data: the final clean dataset
high_end: customer group with income range above 100K
middle-end: customer Group with income range between 50K and 99K
low-end: customer group with income range below 49K

\(\color{#5097A4}{\text{Import and Clean Data}}\)

Import Data
- Imported Household data using read_csv() function, which was a csv file with 5000 observations and 9 variables. Named the dataset “households.”
- Imported Products data using read_csv() function, which was a csv file with 151,141 observations and 5 variables. Named the dataset “products.”
- Imported Transactions data using read_csv() function, which was a csv file with 10,625,553 observations and 9 variables. Named the dataset “transactions.”

households <- read_csv("~/Desktop/Files~/UC/2018 Fall/Wrangling R/Complete-Journey-Reduced/5000_households.csv")
products <- read_csv("~/Desktop/Files~/UC/2018 Fall/Wrangling R/Complete-Journey-Reduced/5000_products.csv")
transactions <- read_csv("~/Desktop/Files~/UC/2018 Fall/Wrangling R/Complete-Journey-Reduced/5000_transactions.csv")

Missing Values of Data Sources

There are 68 missing values in Households dataset, and they are located at Household Composition variable. There is 0 missing value in Products dataset. There is 0 missing value in Transactions dataset.

The missing values just indicated that some of the household composition information were missing from the customer. The analysis will be influenced if we simply remove all the missing values from the original data source.

Clean Data Duplicates will damage our analysis and give us wrong results, so we need to
Check the duplicates data in each dataframe, and

sum(duplicated.data.frame(households))

## [1] 0

sum(duplicated.data.frame(products))

## [1] 0

Clean all the duplicates and get the clean data set

trans_clean <- transactions %>% distinct()

Merge Data
- To obtain a more direct observation of the household level of transactions, the next step is to merge the three separate dataset into one combined dataset, which was named as “data.”

prod_trans <- merge(trans_clean, products)
data <- merge(prod_trans, households)

\(\color{#5097A4}{\text{Final Data Preview}}\)

head() function and summary() function were used to generate a brief preview of the combined clean dataset.

head(data, 6)

HSHD_NUM	PRODUCT_NUM	BASKET_NUM	PURCHASE_	SPEND	UNITS	STORE_R	WEEK_NUM	YEAR	DEPARTMENT	COMMODITY	BRAND_TY	X5	L	AGE_RANGE	MARITAL	INCOME_RANGE	HOMEOWNER	HSHD_COMPOSITION	HH_SIZE	CHILDREN
0001	1345638	559252	29-JAN-17	4.01	1	EAST	57	2017	FOOD	MEAT - BEEF	NATIONAL	N	Y	55-64	Married	150K+	Homeowner	2 Adults and Kids	3	1
0001	0324896	713487	18-MAY-17	3.79	1	EAST	72	2017	FOOD	INTERNATIONAL FOOD	NATIONAL	N	Y	55-64	Married	150K+	Homeowner	2 Adults and Kids	3	1
0001	1029355	203002	25-MAY-16	1.63	1	EAST	21	2016	FOOD	PRODUCE	NATIONAL	N	Y	55-64	Married	150K+	Homeowner	2 Adults and Kids	3	1
0001	0654440	741608	07-JUN-17	1.99	1	EAST	75	2017	FOOD	INTERNATIONAL FOOD	NATIONAL	N	Y	55-64	Married	150K+	Homeowner	2 Adults and Kids	3	1
0001	4174205	588333	19-FEB-17	2.50	2	EAST	60	2017	FOOD	PRODUCE	NATIONAL	N	Y	55-64	Married	150K+	Homeowner	2 Adults and Kids	3	1
0001	2933903	699493	09-MAY-17	3.99	1	EAST	71	2017	FOOD	GROCERY STAPLE	NATIONAL	N	Y	55-64	Married	150K+	Homeowner	2 Adults and Kids	3	1

summary(data)

HSHD_NUM	PRODUCT_NUM	BASKET_NUM	PURCHASE_	SPEND	UNITS	STORE_R	WEEK_NUM	YEAR	DEPARTMENT	COMMODITY	BRAND_TY	X5	L	AGE_RANGE	MARITAL	INCOME_RANGE	HOMEOWNER	HSHD_COMPOSITION	HH_SIZE	CHILDREN
Length:10625131	Length:10625131	Length:10625131	Length:10625131	Min. :-14.990	Min. :-26.000	Length:10625131	Min. : 1.00	Min. :2016	Length:10625131	Length:10625131	Length:10625131	Length:10625131	Length:10625131	Length:10625131	Length:10625131	Length:10625131	Length:10625131	Length:10625131	Length:10625131	Length:10625131
Class :character	Class :character	Class :character	Class :character	1st Qu.: 1.670	1st Qu.: 1.000	Class :character	1st Qu.: 27.00	1st Qu.:2016	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character
Mode :character	Mode :character	Mode :character	Mode :character	Median : 2.580	Median : 1.000	Mode :character	Median : 52.00	Median :2016	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character
NA	NA	NA	NA	Mean : 3.598	Mean : 1.296	NA	Mean : 52.48	Mean :2016	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	3rd Qu.: 4.000	3rd Qu.: 1.000	NA	3rd Qu.: 78.00	3rd Qu.:2017	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	Max. :299.990	Max. :246.000	NA	Max. :104.00	Max. :2017	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

\(\color{#29A887}{\text{Proposed Exploratory Data Analysis}}\)

Time Series Analysis has been applied in this project to generate the spending trend of customers. Plots have been drawed to see the distribution of and comparision between the increasing and decreasing spending trend of customers in the related demographic factors.

\(\color{#5097A4}{\text{High-End Customer Group}}\)

High-end Customer Group: income range above 100K

high_end <- filter(data, INCOME_RANGE == "150K+" | INCOME_RANGE == "100-150K")

To generate the spending trend of the high-end customer group

#find spending trends in high-end

high_end$PURCHASE_ <- strptime(high_end$PURCHASE_,format = "%Y-%b-%d")
separate <- high_end %>% separate(PURCHASE_, c("date", "month", "year"), sep = '-')
unite <- separate %>% unite(ID, HSHD_NUM, year, month, date, sep = "-")
spend <- data.frame(unite$ID, unite$SPEND)

ag <- aggregate(unite.SPEND~unite.ID, data = spend, FUN = sum)

separate_1 <- ag %>% separate(unite.ID, c("ID", "year","month", "date"), sep = '-')
unite_1 <- separate_1 %>% unite(date, year, month, date, sep = "-")

spread <- unite_1  %>% 
  group_by_at(vars(-unite.SPEND)) %>%   
  mutate(row_id = 1:n()) %>% ungroup() %>%  
  spread(key = ID, value = unite.SPEND) %>%    
  select(-row_id) 


value <- spread[, colSums(is.na(spread)) < 720]

dim(value)

## [1] 726 669

#obtain trends in high-end market

ts <- value[,c(2:669)]
for (x in ts) {
  if (trend.test(x)$estimate > 0) {
    print("inc") 
  }
  else{
    print("dec")
  }
  
}


trend_1 <- read_excel("~/Desktop/Files~/UC/2018 Fall/Wrangling R/trends_high_end.xlsx")
trend_2 <- trend_1 %>% separate(trend, c("n", "trend"), sep = " ")

unique(trend_2$trend)

## [1] "\"inc\"" "\"dec\""

#number of increasing and decreasing in high_end market

up <- 0
down <- 0
for (x in ts) {
  if (trend.test(x)$estimate > 0) {
    up = up + 1 
  }
  else{
    down = down + 1 
  }
  
}

up

## [1] 340

down

## [1] 328

There are 340 customers have increasing spending trend among past two years, and 328 customers have decreasing spending trend among past two years.

#add the tend column to high_end table based on HSHD_ID, call new table high

gather <- value %>% gather(ID, spend, 2:669) 
ID <- as.vector(unique(gather$ID))
trend <- as.vector(trend_2$trend)
htrend <- cbind.data.frame(ID, trend)
increase <- htrend[htrend$trend == "\"inc\"", 1]
decrease <- htrend[htrend$trend == "\"dec\"", 1]
part1 <- high_end[which(high_end$HSHD_NUM == increase), c("HSHD_NUM","STORE_R",
  "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
part2 <- high_end[which(high_end$HSHD_NUM == decrease), c("HSHD_NUM","STORE_R", 
  "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
part1["trends"] <- rep("inc")
part2["trends"] <- rep("dec")
high <- rbind.data.frame(part1, part2)

\(\color{#003151}{\text{Store Selection}}\)

store <- ggplot(high, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel1")
store + geom_histogram(aes(fill =  factor(trends)),
                   stat = "count",
                   bins = 4,
                   binwidth = 2,
                   col = "white", 
                   size = 0.1) +   
  labs(title = "High-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Store Selection by Customers")

Observation: * Eastern store had the majority of high-end customers in all four stores, and over half of its customers increased their spend.

Central store had the minority of high-end customers in all four stores, and slightly more than half of its customers decreased their spend over time.
Southern store and western store were similar at the number of high-end customers, however, more than half of customers of southern store decreased their spending over time, while more than half of customers of western store increased their spend over time.
The other three stores need to learn from eastern store, and not only try to attract more customers in the future but also try to promote the customers to buy more.

\(\color{#003151}{\text{Brand Type Selection}}\)

brand_ty <- ggplot(high, aes(BRAND_TY)) + scale_fill_brewer(palette = "Pastel1")
brand_ty + geom_histogram(aes(fill = factor(trends)),
                       stat = "count",
                       col = "white", 
                       size = .1) +   
  labs(title = "High-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Brand Type Selection by Customers")

Observation:

Majority of high-end customers turned to national brands. There were sightly more customers increased their spend over time while the rest decreased their spend over time.
The very low portion of high-end customers especially compared to national-brand-customer turned to private brands. Among the private-brand-customer, more than half of them increased their spend over time.
Even though private-brand-buyer was only a small part of the high-end customer group, many of them tended to buy more over time. The retailer can promote more on the specific rand and product which gained the best sell in the past and use the product to keep customers.

\(\color{#003151}{\text{Product Type Selection}}\)

product <- ggplot(high, aes(DEPARTMENT)) + scale_fill_brewer(palette = "Pastel1")
product + geom_histogram(aes(fill = factor(trends)),
                         stat = "count",
                         col = "white", 
                         size = .1) +   
  labs(title = "High-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Product Type Selection by Customers")

This plot mainly shows preference of the high-end customer group of the type of product based on departments.

Observation:

Most high-end customers bought food products, and slightly over half of them increased their spend.
A small portion of high-end customers chose non-food product, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
Only a few high-end customers chose pharma, and this small group of customers decreased their spend over time.

\(\color{#003151}{\text{Loyalty}}\)

loyalty <- ggplot(high, aes(L)) + scale_fill_brewer(palette = "Pastel1")
loyalty + geom_histogram(aes(fill = factor(trends)),
                          stat = "count",
                          col = "white", 
                          size = .1) +   
  labs(title = "High-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Loyalty of Customers")

Observation:

Most high-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.
Among the small portion of the high-end customers who were not loyalty members of the retailer, most customers also increased their spend over time.

\(\color{#003151}{\text{Age Range}}\)

Age_range <- ggplot(high, aes(AGE_RANGE)) + scale_fill_brewer(palette = "Pastel1")
Age_range + geom_histogram(aes(fill = factor(trends)),
                          stat = "count",
                          col = "white", 
                          size = .1) +   
  labs(title = "High-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Age Range of Customers")

Observation:

The majority of the high-end customer group were households aged between 45 and 54. And gladly we may see from the plot that most of them increased their spend over time.
The second large portion of the high-end customer group aged among 55 to 64, more than half of this portion also increased their spend over time.
The third large portion of the high-end customer group aged among 35 to 44, most of these customers also increased their spend over time.

\(\color{#003151}{\text{Household Composition}}\)

composition <- ggplot(high, aes(HSHD_COMPOSITION))  + 
                 scale_fill_brewer(palette = "Pastel1")
composition + geom_histogram(aes(fill = factor(trends)),
                           stat = "count",
                           col = "white", 
                           size = .1) +   
  labs(title = "High-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Household Composition of Customers")

Observation:

Majority of the high-end group of customers had a family with two adults and kids. The portion of customers increased their spend over time were resemble to the portion of customers decreased their spend over time.
The other customer in the high-end group with different household composition, such as single male/female, one adult and kids and two adults and kids, mostly increased their spend over time.

\(\color{#5097A4}{\text{Middle-End Customer Group}}\)

Middle-end Customer Group: income range between 50K and 99K

middle_end <- filter(data, INCOME_RANGE == "75-99K" | INCOME_RANGE == "50-74K")

To generate the spending trend of the middle-end customer group

#finding trends in middle-end

middle_end$PURCHASE_ <- strptime(middle_end$PURCHASE_, format = "%d-%b-%Y")
separate_m <- middle_end %>% separate(PURCHASE_, c("date", "month", "year"), 
                                      sep = '-')
unite_m <- separate_m %>% unite(ID, HSHD_NUM, year, month, date, sep = "-")
spend_m <- data.frame(unite_m$ID, unite_m$SPEND)
agg <- aggregate(unite_m.SPEND~unite_m.ID, data = spend_m, FUN = sum)
separate_1m <- agg %>% separate(unite_m.ID, c("ID", "date","month", "year"), 
                                      sep = '-')
unite_1m <- separate_1m %>% unite(date, year, month, date, sep = "-")

spread_m <- unite_1m  %>% 
  group_by_at(vars(-unite_m.SPEND)) %>%   
  mutate(row_id = 1:n()) %>% ungroup() %>%  
  spread(key = ID, value = unite_m.SPEND) %>%    
  select(-row_id)  

value_m <- spread_m[, colSums(is.na(spread_m)) < 720]

dim(value_m)

## [1]  726 1234

#obtain trends in middle-end market

ts_m1 <- value_m[ , c(2:700)]
ts_m2 <- value_m[ , c(701:1234)]
ts_m <- value_m[ , c(2:1234)]

for (x in ts_m1) {
  if (trend.test(x)$estimate > 0) {
    print("inc") 
  }
  else{
    print("dec")
  }
  
}

for (x in ts_m2) {
  if (trend.test(x)$estimate > 0) {
    print("inc") 
  }
  else{
    print("dec")
  }
  
}


trend_1m <- read_excel("~/Desktop/Files~/UC/2018 Fall/Wrangling R/trends_middle_end.xlsx")
trend_m <- trend_1m %>% separate(trend, c("n", "trend"), sep = " ")

unique(trend_m$trend)

## [1] "\"dec\"" "\"inc\""

#number of increasing and decreasing in middle-end market

count_up <- 0
count_down <- 0
for (x in ts_m) {
  if (trend.test(x)$estimate > 0) {
    count_up = count_up + 1 
  }
  else{
    count_down = count_down + 1 
  }
  
}

count_up

## [1] 645

count_down

## [1] 588

There are 645 customers have increasing spending trend among past two years, and 588 customers have decreasing spending trend among past two years.

#add the trends column to high_end table based on HSHD_ID, call new table middle

gather_m <- value_m %>% gather(ID, spend, 2:1234) 
m_ID <- as.vector(unique(gather_m$ID))
m_trend <- as.vector(trend_m$trend)
trendID <- cbind.data.frame(m_ID, m_trend)
m_increase <- trendID[trendID$m_trend == "\"inc\"", 1]
m_decrease <- trendID[trendID$m_trend == "\"dec\"", 1]
m_part1 <- middle_end[which(middle_end$HSHD_NUM == m_increase), 
                       c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", 
                         "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
m_part2 <- middle_end[which(middle_end$HSHD_NUM == m_decrease), 
                       c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", 
                         "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
m_part1["trends"] <- rep("inc")
m_part2["trends"] <- rep("dec")
middle <- rbind.data.frame(m_part1, m_part2)

\(\color{#003151}{\text{Store Collection}}\)

m_store <- ggplot(middle, aes(STORE_R)) + scale_fill_brewer(palette = "Pastel2")
m_store + geom_histogram(aes(fill = factor(trends)),
                       stat = "count",
                       bins = 4,
                       binwidth = 2,
                       col = "white", 
                       size = 0.1) +   
  labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Store Selection by Customers")

Observation:

Eastern store had the majority of middle-end customers in all four stores, however, over half of its customers decreased their spend over time.
Central store had the lowest portion of middle-end customers in all four stores, and slightly more than half of its customers decreased their spend over time.
Southern store and western store were similar at the number of middle-end customers, however, more than half of customers of southern store increased their spending over time, while more than half of customers of western store increased their spend over time as well.
The central store need to take action to persuasive their customers to spend more, and the other stores, even though didn’t have as many as customers as the central store, did a fair job on promote their customers spend more over time.

\(\color{#003151}{\text{Brand Type}}\)

m_brand_ty <- ggplot(middle, aes(BRAND_TY)) + scale_fill_brewer(palette = "Pastel2")
m_brand_ty + geom_histogram(aes(fill = factor(trends)),
                          stat = "count",
                          col = "white", 
                          size = .1) +   
  labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Brand Type Selection by Customers")

Observation:

Majority of middle-end customers turned to national brands. There were more customers increased their spend over time while the rest decreased their spend over time.
The smaller portion of middle-end customers turned to private brands. Among the private-brand-customer, more than half of their increased their spend over time.
There was no big division between midd-end customers who increased spend over time or decreased their spend over time on either national brand or private brand.

\(\color{#003151}{\text{Product Type Selection}}\)

m_product <- ggplot(middle, aes(DEPARTMENT)) + scale_fill_brewer(palette = "Pastel1")
m_product + geom_histogram(aes(fill = factor(trends)),
                         stat = "count",
                         col = "white", 
                         size = .1) +   
  labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Product Type Selection by Customers")

This plot mainly shows preference of the middle-end customer group of the type of product based on departments.

Observation:

Most middle-end customers bought food products, and over half of them increased their spend over time.
A small portion of middle-end customers chose non-food product, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
Only a few middle-end customers chose pharma, and this small group of customers decreased their spend over time.

\(\color{#003151}{\text{Loyalty}}\)

m_loyalty <- ggplot(middle, aes(L)) + scale_fill_brewer(palette = "Pastel1")
m_loyalty + geom_histogram(aes(fill = factor(trends)),
                         stat = "count",
                         col = "white", 
                         size = .1) +   
  labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Loyalty of Customers")

Observation:

Most middle-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.
Among the small portion of the middle-end customers who were not loyalty members of the retailer, more customers tended to decrease their spend over time.

\(\color{#003151}{\text{Age Range}}\)

m_Age_range <- ggplot(middle, aes(AGE_RANGE)) + scale_fill_brewer(palette = "Pastel1")
m_Age_range + geom_histogram(aes(fill = factor(trends)),
                           stat = "count",
                           col = "white", 
                           size = .1) +   
  labs(title = "Middle-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Age Range of Customers")

Observation:

The majority of the middle-end customers were households aged between 45 and 54. Gladly we may see from the plot that most of them increased their spend over time.
The second large portion of the middle-end customer group aged among 55 to 64, more than half of this portion also increased their spend over time.
The third large portion of the middle-end customer group aged among 35 to 44, most of these customers decreased their spend over time.
The rest middle-end customers aged either below 34 or above 65 were more likely to increased their spend over time.
It is always a good choice for the retailer to target at the age group from 45 to 64 while maintain their customers in other age ranges.

\(\color{#003151}{\text{Household Composition}}\)

m_composition <- ggplot(middle, aes(HSHD_COMPOSITION)) + 
                    scale_fill_brewer(palette = "Pastel1")
m_composition + geom_histogram(aes(fill = factor(trends)),
                             stat = "count",
                             col = "white", 
                             size = .1) +   
  labs(title = "Middle-end Customer Group by Increasing and Decreasing Spendd", 
       subtitle = "Household Composition of Customers")

Observation:

Majority of the middle-end customers had a family with two adults and kids. The portion of this group of customers who either increased or decreased their spend over time was almost 50-50.
The other customer in the middle-end group with different household composition, such as single male/female, one adult and kids and two adults and kids, mostly increased their spend over time.

\(\color{#5097A4}{\text{Low-End Customer Group}}\)

Low-end Customer Group: income range below 49K.

low_end <- filter(data, INCOME_RANGE == "35-49K"  | INCOME_RANGE == "UNDER 35K")

#finding trends in low-end 

low_end$PURCHASE_ <- strptime(low_end$PURCHASE_,format = "%d-%b-%Y")
separate_l <- low_end %>% separate(PURCHASE_, c("date", "month", "year"), sep = '-')
unite_l <- separate_l %>% unite(ID, HSHD_NUM, year, month, date, sep = "-")
spend_l <- data.frame(unite_l$ID, unite_l$SPEND)
agg_l <- aggregate(unite_l.SPEND~unite_l.ID, data = spend_l, FUN = sum)
separate_1l <- agg_l %>% separate(unite_l.ID, c("ID", "date","month", "year"), sep = '-')
unite_1l <- separate_1l %>% unite(date, year, month, date, sep = "-")

spread_l <- unite_1l  %>% 
  group_by_at(vars(-unite_l.SPEND)) %>%   
  mutate(row_id = 1:n()) %>% ungroup() %>%  
  spread(key = ID, value = unite_l.SPEND) %>%    
  select(-row_id)  

value_l <- spread_l[, colSums(is.na(spread_l)) < 720]

dim(value_l)

## [1]  726 1415

##obtain trends in low-end market

ts_l1 <- value_l[,c(2:700)]
ts_l2 <- value_l[,c(701:1415)]
ts_l <- value_l[,c(2:1415)]

for (x in ts_l1) {
  if (trend.test(x)$estimate > 0) {
    print("inc") 
  }
  else{
    print("dec")
  }
  
}

for (x in ts_l2) {
  if (trend.test(x)$estimate > 0) {
    print("inc") 
  }
  else{
    print("dec")
  }
  
}


trend_1l <- read_excel("~/Desktop/Files~/UC/2018 Fall/Wrangling R/trends_low_end.xlsx")
trend_l <- trend_1l %>% separate(trend, c("n", "trend"), sep = " ")

unique(trend_l$trend)

## [1] "\"inc\"" "\"dec\""

#number of increasing and decreasing in low-end market

count_up <- 0
count_down <- 0
for (x in ts_l) {
  if (trend.test(x)$estimate > 0) {
    count_up = count_up + 1 
  }
  else{
    count_down = count_down + 1 
  }
  
}

count_up

## [1] 748

count_down

## [1] 666

There are 748 customers have increasing spending trend among past two years, and 666 customers have decreasing spending trend among past two years.

#add the trends column to low_end table based on HSHD_ID, call new table low

gather_l <- value_l %>% gather(ID, spend, 2:1415) 
l_ID <- as.vector(unique(gather_l$ID))
l_trend <- as.vector(trend_l$trend)
trendID_l <- cbind.data.frame(l_ID, l_trend)
l_increase <- trendID_l[trendID_l$l_trend == "\"inc\"", 1]
l_decrease <- trendID_l[trendID_l$l_trend == "\"dec\"", 1]
l_part1 <- low_end[which(low_end$HSHD_NUM == l_increase), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
l_part2 <- low_end[which(low_end$HSHD_NUM == l_decrease), c("HSHD_NUM","STORE_R", "DEPARTMENT", "BRAND_TY", "L", "AGE_RANGE", "HSHD_COMPOSITION", "HOMEOWNER")]
l_part1["trends"] <- rep("inc")
l_part2["trends"] <- rep("dec")
low <- rbind.data.frame(l_part1, l_part2)

\(\color{#003151}{\text{Store Collection}}\)

l_store <- ggplot(low, aes(STORE_R)) + scale_fill_brewer(palette = "RdPu")
l_store + geom_histogram(aes(fill = factor(trends)),
                         stat = "count",
                         bins = 4,
                         binwidth = 2,
                         col = "white", 
                         size = .1) +   
  labs(title = "Low-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Store Selection by Customers")

Observation:

Eastern store had the majority of low-end customers in all four stores, and over half of its customers increased their spend over time.
Central store had the second largest portion of low-end customers in all four stores, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
Southern store and western store were similar at the number of low-end customers. Though they had lower portion of low-end customers compared with the other two stores, more than half of stores’ customers increased their spend over time.
The four store did a fair job of promoting their customers to spend more.

\(\color{#003151}{\text{Brand Type}}\)

l_brand_ty <- ggplot(low, aes(BRAND_TY)) + scale_fill_brewer(palette = "RdPu")
l_brand_ty + geom_histogram(aes(fill = factor(trends)),
                            stat = "count",
                            col = "white", 
                            size = .1) +   
  labs(title = "Low-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Brand Type Selection by Customers")

Observation:

Majority of low-end customers turned to national brands. More than half of the customers increased their spend over time while the rest decreased their spend over time.
A lower portion of low-end customers turned to private brands. Among the private-brand-customer, more than half of them increased their spend over time.
For low-end market, customers didn’t have a strong preference of brand type compared with middle-end market and high-end market. The retailer did a fair job to push its customers to increase their spend over time.

\(\color{#003151}{\text{Product Type Selection}}\)

l_product <- ggplot(low, aes(DEPARTMENT)) + scale_fill_brewer(palette = "Pastel1")
l_product + geom_histogram(aes(fill = factor(trends)),
                           stat = "count",
                           col = "white", 
                           size = .1) +   
  labs(title = "Low-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Product Type Selection by Customers")

This plot mainly shows preference of the low-end customer group of the type of product based on departments.

Observation:

Most low-end customers bought food products, and over half of them increased their spend over time.
Very small portion of low-end customers chose non-food product, and the portion of customers who either increased or decreased their spend over time was almost 50-50.
Only a few low-end customers chose pharma, and this small group of customers decreased their spend over time.

\(\color{#003151}{\text{Loyalty}}\)

l_loyalty <- ggplot(low, aes(L)) + scale_fill_brewer(palette = "RdPu")
l_loyalty + geom_histogram(aes(fill = factor(trends)),
                           stat = "count",
                           col = "white", 
                           size = .1) +   
  labs(title = "Low-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Loyalty of Customers")

Observation:

Most low-end customers were the loyalty members of the retailer. More than half of the loyalty customers increased their spend over time.
Among the small portion of the low-end customers who were not loyalty members of the retailer, and the portion of customers who either increased or decreased their spend over time was almost 50-50.

\(\color{#003151}{\text{Age Range}}\)

l_Age_range <- ggplot(low, aes(AGE_RANGE)) + scale_fill_brewer(palette = "RdPu")
l_Age_range + geom_histogram(aes(fill = factor(trends)),
                             stat = "count",
                             col = "white", 
                             size = .1) +   
  labs(title = "Low-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Age Range of Customers")

Observation:

The majority of the low-end customer group were households aged between 65 and 74. Gladly we may see from the plot that most of them increased their spend over time.
The second large portion of the low-end customer group aged among 45 to 54, more than half of this portion also increased their spend over time.
The third large portion of the low-end customer group aged among 55 to 64, most of these customers also increased their spend over time.
Low-end customers aged between 35 and 44 and above 75 also contributed a large portion of this group of customers. Most customers in these two age ranges increased their spend over time.
Low-end market seems to attract customers in different age ranges compared with middle-end and high-end markets. The retailer may want to provide distinct promotions to attract customers and gain profits.

\(\color{#003151}{\text{Household Composition}}\)

l_composition <- ggplot(low, aes(HSHD_COMPOSITION)) + scale_fill_brewer(palette = "RdPu")
l_composition + geom_histogram(aes(fill = factor(trends)),
                               stat = "count",
                               col = "white", 
                               size = .1) +   
  labs(title = "Low-end Customer Group by Increasing and Decreasing Spend", 
       subtitle = "Household Composition of Customers")

Observation:

Majority of the low-end customers had a family with two adults. The portion of this group of customers who either increased or decreased their spend over time was almost 50-50.
The second largest portion of the low-end customers was customers with a family with two adults and kids. Most customers in this segment increased their spend over time.
The other customer in the low-end group with different household composition, single male/female, one adult and kids, mostly increased their spend over time.

\(\color{#29A887}{\text{Summary}}\)

Since it is only our mid-term project currently, there will be more data analysis work on this project. To be continued… Thank you so much for watching!

Project Proposal

Tingyu Zhao, Kaijun Sheng

\(\color{#29A887}{\text{Introduction}}\)

\(\color{#5097A4}{\text{Problem Statement}}\)

\(\color{#5097A4}{\text{Data Source and Methodology}}\)

\(\color{#5097A4}{\text{Mission}}\)

\(\color{#29A887}{\text{Packages Required}}\)

\(\color{#29A887}{\text{Data Preparation}}\)

\(\color{#5097A4}{\text{Data Description}}\)

\(\color{#5097A4}{\text{Import and Clean Data}}\)

\(\color{#5097A4}{\text{Final Data Preview}}\)

\(\color{#29A887}{\text{Proposed Exploratory Data Analysis}}\)

\(\color{#5097A4}{\text{High-End Customer Group}}\)

\(\color{#003151}{\text{Store Selection}}\)

\(\color{#003151}{\text{Brand Type Selection}}\)

\(\color{#003151}{\text{Product Type Selection}}\)

\(\color{#003151}{\text{Loyalty}}\)

\(\color{#003151}{\text{Age Range}}\)

\(\color{#003151}{\text{Household Composition}}\)

\(\color{#5097A4}{\text{Middle-End Customer Group}}\)

\(\color{#003151}{\text{Store Collection}}\)

\(\color{#003151}{\text{Brand Type}}\)

\(\color{#003151}{\text{Product Type Selection}}\)

\(\color{#003151}{\text{Loyalty}}\)

\(\color{#003151}{\text{Age Range}}\)

\(\color{#003151}{\text{Household Composition}}\)

\(\color{#5097A4}{\text{Low-End Customer Group}}\)

\(\color{#003151}{\text{Store Collection}}\)

\(\color{#003151}{\text{Brand Type}}\)

\(\color{#003151}{\text{Product Type Selection}}\)

\(\color{#003151}{\text{Loyalty}}\)

\(\color{#003151}{\text{Age Range}}\)

\(\color{#003151}{\text{Household Composition}}\)

\(\color{#29A887}{\text{Summary}}\)