Introduction

The goal of my project is to use data analytics to improve retailer (Kroger in this particular case) efficiency in purchasing and marketing by means of better understanding customer habits and establishing a long-term relationships with them. The reader interest in this could stem both from the overall approach and specific technical details that I use and by the mere fact that they are most likely a Kroger shopper and want to know how the company sees them.

For my project, I am analyzing the data from Complete Journey Customer Transactions without marketing campaigns provided by 84.51. It contains 2016 – 2017 data on 2,500 households who frequently shop at Kroger. With this dataset, the question that I want to find an answer to is which customer is buying what types of products, as well as the demographics of each customer, to better understand their shopping trends.

I also what to predict, using decision-tree analysis, whether a customer will potentially become a loyalty card member and whether it should then be marketed to them. The reason I believe this is beneficial to Kroger is that it can help the retailer to better understand their main target audience in order to stock their inventory more efficiently and to more precisely focus their marketing of a product to the consumer.

A particular trend that I want to analyze is regional differences of customer spending across the United States. I want to tackle this task analytically through a series of visualizations, tables, and using machine-learning techniques that would clearly convey my message.

Packages Required

The three packages that I use in my project are tidyverse,data.table, and scales. My reasoning behind choosing these packages is the following.

In tidyverse I will be able to clean my data as well as to use certain functions, such as ggplot() and filter(), mutate(), and select(), to be able to select the data that I want and to group it in a way that can answer the questions that I set out to resolve.

The data.table package allows me to take advantage of the fread() function, which enables importing large csv files very quickly.

Lastly, with the scales package I can convert the labels in my plots to correct data labels.

library(tidyverse) ## to have data tidy 
library(data.table) ## to read in excel files
library(scales) ## used for visualizations

Data Preparation

As a first step in data preparation, I read in all 3 data tables – households, products, and transactions – separately.

Next, with the transaction data set I separate out the purchase date into 3 separate variables: day, month, and year. My reasoning behind this was to be able to see purchase habits during certain days or certain months of the year, that I otherwise wouldn’t be able to do.

After that, I use the merge() function to combine the product and transaction data by the product number and then combine the household and transaction data by HSHD_NUM.

I continue to combine those two merged datasets by HSHD_NUM to get the final complete data. The reason to merge all these datasets together is that I want to look at the household data, while simultaneously see spending habits and product habits from different tables.

Then, I take my complete data and split it by year in order to make the data sets smaller and, subsequently, in shorter runtimes.

Finally, in each data set I remove all of the NA entries since those failed to produce answers to my inquiries – in that they contained only spending information, without relevant household and product components – and, as such, would not be beneficial to my analysis. I also replaced the null values with ‘not available’ because those values are still useful in my analysis.

# household dataset
households <- as.tibble(fread("data/5000_households.csv"))

# product dataset
products <- as.tibble(fread("data/5000_products.csv")) 

# transactions dataset
transactions <- as.tibble(fread("data/5000_transactions.csv"))
transactions <- transactions[,-9] %>% separate(PURCHASE_,
                         c("DAY","MONTH","YEAR"),sep = "-")

House_Trans_data <- as.tibble(left_join(households,transactions, by = "HSHD_NUM"))

Complete_data <- as.tibble(left_join(House_Trans_data,products,by = "PRODUCT_NUM"))

Complete_data[Complete_data == "null"] <- "NOT AVAILABLE"
Complete_data_clean <- drop_na(Complete_data)

Complete_data_2016 <- filter(Complete_data_clean, YEAR == 16)
Complete_data_2017 <- filter(Complete_data_clean, YEAR == 17)

Glimpse of Clean Dataset

## Observations: 10,486,751
## Variables: 22
## $ HSHD_NUM         <int> 1549, 1549, 1549, 1549, 1549, 1549, 1549, 154...
## $ L                <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", ...
## $ AGE_RANGE        <chr> "45-54", "45-54", "45-54", "45-54", "45-54", ...
## $ MARITAL          <chr> "Married", "Married", "Married", "Married", "...
## $ INCOME_RANGE     <chr> "35-49K", "35-49K", "35-49K", "35-49K", "35-4...
## $ HOMEOWNER        <chr> "Homeowner", "Homeowner", "Homeowner", "Homeo...
## $ HSHD_COMPOSITION <chr> "NOT AVAILABLE", "NOT AVAILABLE", "NOT AVAILA...
## $ HH_SIZE          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", ...
## $ CHILDREN         <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", ...
## $ BASKET_NUM       <int> 185728, 358614, 821434, 806822, 852769, 10465...
## $ DAY              <chr> "12", "14", "31", "22", "22", "17", "17", "15...
## $ MONTH            <chr> "MAY", "SEP", "JUL", "JUL", "AUG", "MAR", "JU...
## $ YEAR             <chr> "16", "16", "17", "17", "17", "16", "17", "16...
## $ PRODUCT_NUM      <int> 76700, 418055, 6222929, 193077, 956966, 71454...
## $ SPEND            <dbl> 1.00, 4.00, 1.00, 2.49, 5.49, 5.49, 2.50, 0.8...
## $ UNITS            <int> 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, ...
## $ STORE_R          <chr> "WEST", "WEST", "WEST", "WEST", "WEST", "WEST...
## $ WEEK_NUM         <int> 19, 37, 83, 81, 86, 11, 76, 33, 33, 85, 46, 5...
## $ DEPARTMENT       <chr> "FOOD", "FOOD", "NON-FOOD", "FOOD", "FOOD", "...
## $ COMMODITY        <chr> "GROCERY STAPLE", "FROZEN FOOD", "TOYS", "GRO...
## $ BRAND_TY         <chr> "PRIVATE", "NATIONAL", "NATIONAL", "NATIONAL"...
## $ Natural_Organic  <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", ...

First 6 rows

## # A tibble: 6 x 22
##   HSHD_NUM L     AGE_RANGE MARITAL INCOME_RANGE HOMEOWNER HSHD_COMPOSITION
##      <int> <chr> <chr>     <chr>   <chr>        <chr>     <chr>           
## 1     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 2     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 3     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 4     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 5     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 6     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## # ... with 15 more variables: HH_SIZE <chr>, CHILDREN <chr>,
## #   BASKET_NUM <int>, DAY <chr>, MONTH <chr>, YEAR <chr>,
## #   PRODUCT_NUM <int>, SPEND <dbl>, UNITS <int>, STORE_R <chr>,
## #   WEEK_NUM <int>, DEPARTMENT <chr>, COMMODITY <chr>, BRAND_TY <chr>,
## #   Natural_Organic <chr>

Last 6 rows

## # A tibble: 6 x 22
##   HSHD_NUM L     AGE_RANGE MARITAL INCOME_RANGE HOMEOWNER HSHD_COMPOSITION
##      <int> <chr> <chr>     <chr>   <chr>        <chr>     <chr>           
## 1     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 2     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 3     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 4     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 5     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 6     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## # ... with 15 more variables: HH_SIZE <chr>, CHILDREN <chr>,
## #   BASKET_NUM <int>, DAY <chr>, MONTH <chr>, YEAR <chr>,
## #   PRODUCT_NUM <int>, SPEND <dbl>, UNITS <int>, STORE_R <chr>,
## #   WEEK_NUM <int>, DEPARTMENT <chr>, COMMODITY <chr>, BRAND_TY <chr>,
## #   Natural_Organic <chr>

Proposed Exploratory Data Analysis

I want to present my analysis through a series of visualizations to show the reader how to best market certain products to a wide variety of people. The most natural way to do this is with the use of the ggplot() function to help show the trends and the amount of spending for a variety of groups. I will also utilize dplyr to help me summarize the key statistics.

My ultimate goal is to determine how different age groups differ in their spending habits and how the region and the income might play a role in that. I also want to be able to figure out whether or not – and how – various factors, such as income and family size, affect buying certain commodities or private labels and organic foods. In terms of machine learning techniques, I will try to utilize some sort of a cluster analysis or a decision tree model to determine how these variables feed into the customer spending and loyalty habits.

Complete_data_clean %>%
  select(AGE_RANGE,SPEND,YEAR) %>%
  group_by(AGE_RANGE,YEAR) %>%
  summarize(total = sum(SPEND)) %>%
  ggplot(aes(AGE_RANGE,total)) + 
  geom_bar(aes(fill = YEAR),position = "dodge",stat = "identity") +
  scale_y_continuous("Spending",labels = dollar) +
  scale_x_discrete("Age_Range") + 
  ggtitle("Spending with Respect to Age and Year")