Final Project - Complete Journey Customer Transactions

Introduction

The goal of my project is to use data analytics to improve retailer (Kroger in this particular case) efficiency in purchasing and marketing by means of better understanding customer habits and establishing a long-term relationships with them. The reader interest in this could stem both from the overall approach and specific technical details that I use and by the mere fact that they are most likely a Kroger shopper and want to know how the company sees them.

For my project, I am analyzing the data from Complete Journey Customer Transactions without marketing campaigns provided by 84.51. It contains 2016 – 2017 data on 2,500 households who frequently shop at Kroger. With this dataset, the question that I want to find an answer to is which customer is buying what types of products, as well as the demographics of each customer, to better understand their shopping trends.

I also want to predict, using decision-tree analysis, whether a customer will buy natural/organic products and to find out to what segment of the population and in what manner I would want to market natural/organic products. The reason I believe this is beneficial to Kroger is that it can help the retailer to better understand their main target audience in order to stock their inventory more efficiently and to more precisely focus their marketing of natural/organic products to the consumer.

The reason I want to concentrate on natural/organic products is manifold: a) it is one of the fastest growing retail categories; b) it is one of the higher-margin categories; c) it is associated with “local and sustainable,” which helps with marketing to certain categories of consumers; d) it is more conducive to selling as a private label; e) it is targeted to younger and/or more educated and affluent customer; f) it creates opportunities for up-sales and additional sales; and g) it has potential to attract new customers and new loyalty customers.

A particular trend that I want to analyze is regional and demographic differences of customer spending across the United States. I want to tackle this task analytically through a series of visualizations, tables, and using machine-learning techniques that would clearly convey my message.

Packages Required

The packages that I use in my project are tidyverse, data.table, rpart, rpart.plot, and scales. My reasoning behind choosing these packages is the following.

In tidyverse I will be able to clean my data as well as to use certain functions, such as ggplot() and filter(), mutate(), and select(), to be able to select the data that I want and to group it in a way that can answer the questions that I set out to resolve.

The data.table package allows me to take advantage of the fread() function, which enables importing large csv files very quickly.

The rpart and rpart.plot packages can help me create my decision tree models to help me determine the variables that will select whether a group buys natural/organic products or not.

Lastly, with the scales package I can convert the labels in my plots to correct data labels.

library(tidyverse) ## to have data tidy 
library(data.table) ## to read in excel files
library(scales) ## used for visualizations
library(rpart) ## decision tree
library(rpart.plot) ## plotting decision tree
library(rattle)  ## better decision tree plot

Data Preparation

As explained from the introductory paragraph my data came from 84.51, it contains 2016 – 2017 data on 2500 households who frequently shop at Kroger. The data was split up into 3 different tables. To explore each dataset I created a function, show_data, that would print out a glimpse of the dataset: the first six rows and the last six rows. I also created another function, count_na, which counts the “NA,” “null,” and “Not Available” values in all the columns.

As a first step in data preparation, I read in all 3 data tables – households, products, and transactions using a for-loop that takes each dataset in my data folder and makes it into a dataframe in R automatically.

Next, with the transaction data set, I separate out the purchase date into 3 separate variables: day, month, and year. My reasoning behind this was to be able to see purchase habits during certain days or certain months of the year, which may enable additional target marketing and improve inventory efficiency.

x <- c("5000_households","5000_products","5000_transactions")

for (i in seq_along(x)) {
  
    file_name <- paste0("data/", x[i], ".csv")
    
      if (file.exists(file_name)) {
        df <- as.tibble(fread(file_name))
        assign(unlist(strsplit(x[i],split = "_"))[2],df)
        rm(df)
      } 
      else {
     print("no such file exists")
      }
}

colnames(products)[5] <- "Natural_Organic"

transactions <- transactions[,-9] %>% separate(PURCHASE_,
                         c("DAY","MONTH","YEAR"),sep = "-")

Created Functions

show_data <- function(x){
 
  str_data <- str(x)
 head_data <- head(x)
 tail_data <- tail(x)

 print(str_data)
 
 cat("\n")
 cat("First Six Rows","\n")
 
  print(head_data)
  
  cat("\n")
  cat("Last Six Rows","\n")
  
  print(tail_data)

 }

#################################

na_count <- function(x){
  
  
  is_na <- colSums(is.na(x))
  
  cat("Column Sums NA")
  cat("\n")
  
  print(is_na)
  
  cat("\n")
  
  cat("Column Sums NOT AVAILABLE","\n")
  print(colSums(x == "NOT AVAILABLE"))
  
  cat("\n")
  
  cat("Column Sums null","\n")
  print(colSums(x == "null"))
}

The first conglomeration of tables, 5000_households, contains all the household data items from customers from 2016 and 2017. It has 5000 observations with 9 different variables such as Age Range, Homeowner, Children, Income Range, and other fields relating to a particular household. The first of the tables shows the glimpse of the data set, including variable names and the structure. The second and the third tables show the first six and the last six rows. The fourth table counts the NA, null and Not Available values in each column using the count_na function that I created.

## Classes 'tbl_df', 'tbl' and 'data.frame':    5000 obs. of  9 variables:
##  $ HSHD_NUM        : int  688 2590 1171 1531 403 283 3864 4899 1443 3598 ...
##  $ L               : chr  "Y" "N" "Y" "Y" ...
##  $ AGE_RANGE       : chr  "75+" "75+" "75+" "75+" ...
##  $ MARITAL         : chr  "Unknown" "Unknown" "Unknown" "Unknown" ...
##  $ INCOME_RANGE    : chr  "35-49K" "50-74K" "75-99K" "75-99K" ...
##  $ HOMEOWNER       : chr  "Homeowner" "Homeowner" "Homeowner" "Homeowner" ...
##  $ HSHD_COMPOSITION: chr  NA NA NA NA ...
##  $ HH_SIZE         : chr  "null" "null" "null" "null" ...
##  $ CHILDREN        : chr  "NOT AVAILABLE" "NOT AVAILABLE" "NOT AVAILABLE" "NOT AVAILABLE" ...
##  - attr(*, ".internal.selfref")=<externalptr> 
## NULL
## 
## First Six Rows 
## # A tibble: 6 x 9
##   HSHD_NUM L     AGE_RANGE MARITAL INCOME_RANGE HOMEOWNER HSHD_COMPOSITION
##      <int> <chr> <chr>     <chr>   <chr>        <chr>     <chr>           
## 1      688 Y     75+       Unknown 35-49K       Homeowner <NA>            
## 2     2590 N     75+       Unknown 50-74K       Homeowner <NA>            
## 3     1171 Y     75+       Unknown 75-99K       Homeowner <NA>            
## 4     1531 Y     75+       Unknown 75-99K       Homeowner <NA>            
## 5      403 N     75+       Unknown UNDER 35K    Renter    <NA>            
## 6      283 Y     25-34     Unknown 50-74K       Homeowner <NA>            
## # ... with 2 more variables: HH_SIZE <chr>, CHILDREN <chr>
## 
## Last Six Rows 
## # A tibble: 6 x 9
##   HSHD_NUM L     AGE_RANGE MARITAL INCOME_RANGE HOMEOWNER HSHD_COMPOSITION
##      <int> <chr> <chr>     <chr>   <chr>        <chr>     <chr>           
## 1     1034 Y     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 2     1947 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 3     2592 Y     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 4     2673 Y     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 5     3285 Y     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 6     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## # ... with 2 more variables: HH_SIZE <chr>, CHILDREN <chr>

## Column Sums NA
##         HSHD_NUM                L        AGE_RANGE          MARITAL 
##                0                0                0                0 
##     INCOME_RANGE        HOMEOWNER HSHD_COMPOSITION          HH_SIZE 
##                0                0               68                0 
##         CHILDREN 
##                0 
## 
## Column Sums NOT AVAILABLE 
##         HSHD_NUM                L        AGE_RANGE          MARITAL 
##                0                0              132                0 
##     INCOME_RANGE        HOMEOWNER HSHD_COMPOSITION          HH_SIZE 
##                0                0               NA                0 
##         CHILDREN 
##             2210 
## 
## Column Sums null 
##         HSHD_NUM                L        AGE_RANGE          MARITAL 
##                0                0              865              896 
##     INCOME_RANGE        HOMEOWNER HSHD_COMPOSITION          HH_SIZE 
##              865              865               NA              933 
##         CHILDREN 
##              865

The next conglomeration of tables, 5000_products, contains all the product data that each household bought. It has 151,141 observations with five variables, including the following four categories: Commodity, Department, Brand and Natural/Organic. This shows which household bought what type of product and which department it belongs to. Notice that there are no NA entries in this dataset, as seen in tables below, since NA was originally used for natural/organic products. Consequently, I had to replace NA in R with Natural_Organic.

## Classes 'tbl_df', 'tbl' and 'data.frame':    151141 obs. of  5 variables:
##  $ PRODUCT_NUM    : int  92993 93924 94272 94299 94594 94606 94613 95625 96152 96153 ...
##  $ DEPARTMENT     : chr  "NON-FOOD" "NON-FOOD" "NON-FOOD" "NON-FOOD" ...
##  $ COMMODITY      : chr  "PET" "PET" "PET" "PET" ...
##  $ BRAND_TY       : chr  "PRIVATE" "PRIVATE" "PRIVATE" "PRIVATE" ...
##  $ Natural_Organic: chr  "N" "N" "N" "N" ...
##  - attr(*, ".internal.selfref")=<externalptr> 
## NULL
## 
## First Six Rows 
## # A tibble: 6 x 5
##   PRODUCT_NUM DEPARTMENT COMMODITY BRAND_TY Natural_Organic
##         <int> <chr>      <chr>     <chr>    <chr>          
## 1       92993 NON-FOOD   PET       PRIVATE  N              
## 2       93924 NON-FOOD   PET       PRIVATE  N              
## 3       94272 NON-FOOD   PET       PRIVATE  N              
## 4       94299 NON-FOOD   PET       PRIVATE  N              
## 5       94594 NON-FOOD   PET       PRIVATE  N              
## 6       94606 NON-FOOD   PET       PRIVATE  N              
## 
## Last Six Rows 
## # A tibble: 6 x 5
##   PRODUCT_NUM DEPARTMENT COMMODITY             BRAND_TY Natural_Organic
##         <int> <chr>      <chr>                 <chr>    <chr>          
## 1     6125298 FOOD       IN STORE FOOD SERVICE NATIONAL N              
## 2     6463714 FOOD       IN STORE FOOD SERVICE NATIONAL N              
## 3     4814398 FOOD       IN STORE FOOD SERVICE NATIONAL N              
## 4     5031878 FOOD       IN STORE FOOD SERVICE NATIONAL N              
## 5     5356527 FOOD       IN STORE FOOD SERVICE NATIONAL N              
## 6     5485929 FOOD       IN STORE FOOD SERVICE PRIVATE  N

## Column Sums NA
##     PRODUCT_NUM      DEPARTMENT       COMMODITY        BRAND_TY 
##               0               0               0               0 
## Natural_Organic 
##               0 
## 
## Column Sums NOT AVAILABLE 
##     PRODUCT_NUM      DEPARTMENT       COMMODITY        BRAND_TY 
##               0               0               0               0 
## Natural_Organic 
##               0 
## 
## Column Sums null 
##     PRODUCT_NUM      DEPARTMENT       COMMODITY        BRAND_TY 
##               0               0               0               0 
## Natural_Organic 
##               0

The last table, 5000_transcations, contains all the transactional data from all households. It has 10,625,553 observations with nine variables, including Store, Region, Purchase Date, Units, and Amount Spent. This data set posed no particular problems. The only change I made to it was to split up the purchase factor into three separate variables: Month, Day and Year with the purpose to explore seasonality and daily habits.

## Classes 'tbl_df', 'tbl' and 'data.frame':    10625553 obs. of  10 variables:
##  $ BASKET_NUM : int  24 24 34 60 60 168 199 252 355 366 ...
##  $ HSHD_NUM   : int  1809 1809 1253 1595 1595 3393 4558 3491 4733 4369 ...
##  $ DAY        : chr  "03" "03" "03" "03" ...
##  $ MONTH      : chr  "JAN" "JAN" "JAN" "JAN" ...
##  $ YEAR       : chr  "16" "16" "16" "16" ...
##  $ PRODUCT_NUM: int  5817389 5829886 539501 5260099 4535660 5602916 5683532 980450 735993 138157 ...
##  $ SPEND      : num  -1.5 -1.5 2.19 0.99 2.5 4.5 3.49 2.79 1 9.98 ...
##  $ UNITS      : int  -1 -1 1 1 2 1 1 1 1 2 ...
##  $ STORE_R    : chr  "SOUTH" "SOUTH" "EAST" "WEST" ...
##  $ WEEK_NUM   : int  1 1 1 1 1 1 1 1 1 1 ...
## NULL
## 
## First Six Rows 
## # A tibble: 6 x 10
##   BASKET_NUM HSHD_NUM DAY   MONTH YEAR  PRODUCT_NUM SPEND UNITS STORE_R
##        <int>    <int> <chr> <chr> <chr>       <int> <dbl> <int> <chr>  
## 1         24     1809 03    JAN   16        5817389 -1.5     -1 SOUTH  
## 2         24     1809 03    JAN   16        5829886 -1.5     -1 SOUTH  
## 3         34     1253 03    JAN   16         539501  2.19     1 EAST   
## 4         60     1595 03    JAN   16        5260099  0.99     1 WEST   
## 5         60     1595 03    JAN   16        4535660  2.5      2 WEST   
## 6        168     3393 03    JAN   16        5602916  4.5      1 SOUTH  
## # ... with 1 more variable: WEEK_NUM <int>
## 
## Last Six Rows 
## # A tibble: 6 x 10
##   BASKET_NUM HSHD_NUM DAY   MONTH YEAR  PRODUCT_NUM SPEND UNITS STORE_R
##        <int>    <int> <chr> <chr> <chr>       <int> <dbl> <int> <chr>  
## 1     103423     4159 28    DEC   17         698658  3.29     1 SOUTH  
## 2     103429     2164 29    DEC   17        6602908  4.99     1 WEST   
## 3     103433     3413 29    DEC   17        3759093  2        2 SOUTH  
## 4     103433     3413 29    DEC   17        5709849  5.99     1 SOUTH  
## 5     103437     3817 29    DEC   17        2474555  1.79     1 SOUTH  
## 6     103537      644 30    DEC   17        3759136  2        2 CENTRAL
## # ... with 1 more variable: WEEK_NUM <int>

## Column Sums NA
##  BASKET_NUM    HSHD_NUM         DAY       MONTH        YEAR PRODUCT_NUM 
##           0           0           0           0           0           0 
##       SPEND       UNITS     STORE_R    WEEK_NUM 
##           0           0           0           0 
## 
## Column Sums NOT AVAILABLE 
##  BASKET_NUM    HSHD_NUM         DAY       MONTH        YEAR PRODUCT_NUM 
##           0           0           0           0           0           0 
##       SPEND       UNITS     STORE_R    WEEK_NUM 
##           0           0           0           0 
## 
## Column Sums null 
##  BASKET_NUM    HSHD_NUM         DAY       MONTH        YEAR PRODUCT_NUM 
##           0           0           0           0           0           0 
##       SPEND       UNITS     STORE_R    WEEK_NUM 
##           0           0           0           0

After reading all three tables and doing some basic data exploration, I use the merge() function to combine the product and transaction data by the product number and then combine the household and transaction data by HSHD_NUM.

I continue to combine those two merged datasets by HSHD_NUM to get the final complete dataset. The reason to merge all these datasets into one is to obtain a multi-dimensional view of the household data vis-a-vis spending and product habits collected from different tables.

Next, I take my complete data and split it by year in order to make the data sets smaller and, subsequently, achieve shorter runtimes.

Finally, in each data set I remove all of the NA entries since those failed to produce answers to my inquiries – in that they contained only spending information, without relevant household and product components – and, as such, would not be beneficial for my analysis. I also replaced the null values with ‘not available’ because those values are still useful in my analysis.

x <- c("5000_households","5000_products","5000_transactions")

for (i in seq_along(x)) {
  
    file_name <- paste0("data/", x[i], ".csv")
    
      if (file.exists(file_name)) {
        df <- as.tibble(fread(file_name))
        assign(unlist(strsplit(x[i],split = "_"))[2],df)
        rm(df)
      } 
      else {
     print("no such file exists")
      }
}

colnames(products)[5] <- "Natural_Organic"

transactions <- transactions[,-9] %>% separate(PURCHASE_,
                         c("DAY","MONTH","YEAR"),sep = "-")

House_Trans_data <- as.tibble(left_join(households,transactions, by = "HSHD_NUM"))

Complete_data <- as.tibble(left_join(House_Trans_data,products,by = "PRODUCT_NUM"))

Complete_data[Complete_data == "null"] <- "NOT AVAILABLE"
Complete_data_clean <- drop_na(Complete_data)

Glimpse of Clean Dataset

## Observations: 10,486,751
## Variables: 22
## $ HSHD_NUM         <int> 1549, 1549, 1549, 1549, 1549, 1549, 1549, 154...
## $ L                <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", ...
## $ AGE_RANGE        <chr> "45-54", "45-54", "45-54", "45-54", "45-54", ...
## $ MARITAL          <chr> "Married", "Married", "Married", "Married", "...
## $ INCOME_RANGE     <chr> "35-49K", "35-49K", "35-49K", "35-49K", "35-4...
## $ HOMEOWNER        <chr> "Homeowner", "Homeowner", "Homeowner", "Homeo...
## $ HSHD_COMPOSITION <chr> "NOT AVAILABLE", "NOT AVAILABLE", "NOT AVAILA...
## $ HH_SIZE          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", ...
## $ CHILDREN         <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", ...
## $ BASKET_NUM       <int> 185728, 358614, 821434, 806822, 852769, 10465...
## $ DAY              <chr> "12", "14", "31", "22", "22", "17", "17", "15...
## $ MONTH            <chr> "MAY", "SEP", "JUL", "JUL", "AUG", "MAR", "JU...
## $ YEAR             <chr> "16", "16", "17", "17", "17", "16", "17", "16...
## $ PRODUCT_NUM      <int> 76700, 418055, 6222929, 193077, 956966, 71454...
## $ SPEND            <dbl> 1.00, 4.00, 1.00, 2.49, 5.49, 5.49, 2.50, 0.8...
## $ UNITS            <int> 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, ...
## $ STORE_R          <chr> "WEST", "WEST", "WEST", "WEST", "WEST", "WEST...
## $ WEEK_NUM         <int> 19, 37, 83, 81, 86, 11, 76, 33, 33, 85, 46, 5...
## $ DEPARTMENT       <chr> "FOOD", "FOOD", "NON-FOOD", "FOOD", "FOOD", "...
## $ COMMODITY        <chr> "GROCERY STAPLE", "FROZEN FOOD", "TOYS", "GRO...
## $ BRAND_TY         <chr> "PRIVATE", "NATIONAL", "NATIONAL", "NATIONAL"...
## $ Natural_Organic  <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", ...

First 6 rows

## # A tibble: 6 x 22
##   HSHD_NUM L     AGE_RANGE MARITAL INCOME_RANGE HOMEOWNER HSHD_COMPOSITION
##      <int> <chr> <chr>     <chr>   <chr>        <chr>     <chr>           
## 1     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 2     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 3     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 4     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 5     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## 6     1549 Y     45-54     Married 35-49K       Homeowner NOT AVAILABLE   
## # ... with 15 more variables: HH_SIZE <chr>, CHILDREN <chr>,
## #   BASKET_NUM <int>, DAY <chr>, MONTH <chr>, YEAR <chr>,
## #   PRODUCT_NUM <int>, SPEND <dbl>, UNITS <int>, STORE_R <chr>,
## #   WEEK_NUM <int>, DEPARTMENT <chr>, COMMODITY <chr>, BRAND_TY <chr>,
## #   Natural_Organic <chr>

Last 6 rows

## # A tibble: 6 x 22
##   HSHD_NUM L     AGE_RANGE MARITAL INCOME_RANGE HOMEOWNER HSHD_COMPOSITION
##      <int> <chr> <chr>     <chr>   <chr>        <chr>     <chr>           
## 1     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 2     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 3     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 4     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 5     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## 6     3650 N     65-74     Married UNDER 35K    Homeowner 2 Adults and Ki…
## # ... with 15 more variables: HH_SIZE <chr>, CHILDREN <chr>,
## #   BASKET_NUM <int>, DAY <chr>, MONTH <chr>, YEAR <chr>,
## #   PRODUCT_NUM <int>, SPEND <dbl>, UNITS <int>, STORE_R <chr>,
## #   WEEK_NUM <int>, DEPARTMENT <chr>, COMMODITY <chr>, BRAND_TY <chr>,
## #   Natural_Organic <chr>

Exploratory Data Analysis

My main objective in exploration of the data set is to find patterns especially when it comes to spending patterns and identifying customers, who are buying private label vs. national label and natural/organic products vs. standard variety.

My first table below analyses total spending by age group. As one can see, the 45-54 age group spent the most, at $8,992,839, while 19-24 year-olds, who spent only $260,039, are obviously at the bottom of the ladder. There are no apparent unusual trends here. One would expect that the age group, which typically has families and are at or near peak of their educational level and earnings, would spend the most, while younger ages, who are typically single, may still live with their parents and be in school, or just starting their careers, would spend the least.

My first graph below represents spending by age and region. Analyzing the graph I observed some surprising trends. For instance, I fully expected the East Region to lead in overall sales, given the fact that it includes such populous and wealthy states as New York, New Jersey, Connecticut and Massachusetts. On the other hand, in the West, especially for the 45-54 group, I expected spending to be comparable to the East, given the most populous and wealthy state of California, as well as the wealthy states of Washington and Oregon. Another surprise was that South spending exceeded the West in the 19-44 groups. Clearly, a finer set of per capita data is needed to resolve some of these puzzles. On the other hand, a possible explanation may be that in the West customers do more shopping in alternative stores. This may be a drawback but it can also present a good marketing opportunity to try to increase sales in the region.

The next graph slices the above data by income. An interesting observation here is that for the highest income group, the Southern Region overtakes all others for this retailer, while the Western Region has the lowest spending. This is not good news for this retailer as it is often the most sought after customer group. Also in the West, the lowest-income population spends the least at this retailer. This may suggest that it is not affordable enough there. This correlates well with the age grouping. Both the highest and the lowest end suggest that additional research is needed and may present an opportunity to expand business.

Complete_data_clean %>%
  select(AGE_RANGE,SPEND) %>%
  group_by(AGE_RANGE) %>%
  summarize(total = sum(SPEND)) %>%
  arrange(desc(total))

## # A tibble: 8 x 2
##   AGE_RANGE        total
##   <chr>            <dbl>
## 1 45-54         8992839.
## 2 55-64         7353485.
## 3 NOT AVAILABLE 7123850.
## 4 35-44         5765960.
## 5 65-74         3968608.
## 6 25-34         2336082.
## 7 75+           1942278.
## 8 19-24          260039.

The next table shows that national brands outsell private brands by roughly three to one. The graph that follows shows that non-natural/organic products outsell the natural/organic ones by a far greater ratio. This creates a huge opportunity for private brands to capitalize on the growing popularity of the natural/organic category to greatly reduce the latter ratio, which, in turn, will help to reduce the former.

In other words, a retailer can use the natural/organic category to separate its private brand from that of a national brand. Two more important considerations factor into this direction. First, from the above graph, it is clear that 35-44 year-olds spend more on natural/organic products than the 45-54 group. That means that once they enter the peak of their earnings, they will spend even more on this category. Furthermore, it is also clear from the next table that adults, and especially adults with children, buy more natural/organic products than do younger people. Promoting natural/organic products should definitely pay off once younger groups come of age, form families and achieve higher earnings since they may be already gravitating to this category.

The graph that follows shows that, analyzed by region, the result is consistent with the preceding graph is that non-natural/organic category dwarfs its natural/organic counterpart. However, one interesting difference is that variations are far smaller by region than by age. This hints at that the natural/organic category should be pursued on a national scale.

In the next graph, the same data is sliced by income. Again, the highest income bracket may seem somewhat underrepresented in the natural/organic category. But we don’t have the per capita data and the number of people in this bracket is relatively small. Notice also the 50-74K income bracket outspending the 75-99K bracket and being comparable to the 100-150K bracket. This may be related to the above mention of the 35-44 year-olds spending more on natural/organic products than the 45-54 group.

## # A tibble: 2 x 2
##   BRAND_TY     total
##   <chr>        <dbl>
## 1 NATIONAL 27090698.
## 2 PRIVATE  10652443.

## # A tibble: 7 x 2
##   HSHD_COMPOSITION      total
##   <chr>                 <dbl>
## 1 2 Adults and Kids 12549461.
## 2 2 Adults           8227468.
## 3 NOT AVAILABLE      6201235.
## 4 Single Male        3372761.
## 5 1 Adult and Kids   3311890.
## 6 Single Female      2839029.
## 7 1 Adult            1241296.

Summary

First I want to address the analyzed data sets, on which this project was based. In general, the size of the datasets was quite limited. The major drawbacks for me were the following three: absence of per/capita data; absence of shopping data at other retailers (in particular in the natural/organic category, which became of special interest to me); absence of loyalty vs. non-loyalty customer data. In fairness, the second one may not have been available to 84.51. The last one, however, was crucial to interfere with my original idea to perform a decision-tree analysis in order to identify the target customer segment, which would more likely to become loyalty customers. Unfortunately, almost all of the shoppers in the 84.51 data were already loyalty card holders.

Still, based on the available data, several important conclusions can be reached:

First, while sales of private brands to national brands is smaller by the ratio of roughly one to three, the sale ratio of natural/organic products to their counterpart is much smaller. This may be indicative of the small footprint of this particular retailer in the natural/organic category but it also opens up a two-prong opportunity: to grow private labels using this category as well as to attract shoppers from the competitors by offering more choices in this rapidly growing and high-margin category.

Second, while variations in sales of natural/organic products vary greatly by age and income, regional differences are relatively smaller, which points to the necessity of a national strategy in this category.

Third, the 35-44 age group already spends more on natural/organic products than 45-54 group, despite lower incomes. Further, adults and, in particular adults with children, spend more in this category. This means that there is a generational trend supporting this line of products and as the 35-44 year-olds will approach their peak earnings and younger generations form families, complete their education and advance their careers, the natural/organic category may see a sizeable expansion. Analysis by income confirms both the generational trend and the gradient towards wealthier customers.

In summary my analysis points to the necessity of a national strategy for this retailer to increase its offerings of natural/organic products and market them to existing customers and use them to attract new customers. It also suggests a promising idea to use natural/organic category to grow private labels, which are already more profitable than national brands and even more so due to higher margins of these products. Additionally, one can use the daily, monthly and yearly data to improve inventory efficiency.

Final Project - Complete Journey Customer Transactions

Asher Serota

12/8/2018