As one of the world’s largest retailers operating in 2,769 grocery retail stores, Kroger is committed to meet the demands of an ever-evolving customer base. As a part of it’s next level growth strategy, Kroger plans to analyze it’s transactional data from April 2016 to October 2017 on the basis of what its customers looks forward to buy and factors influencing their spend capacity.
Prior installing and loading the following package would result in smooth execution of codes to reproduce the project.
| Package | Purpose |
|---|---|
| tidyverse | Creating Tidy and data and ease of installation of other package |
| ggplot2 | Creating Visuals |
| lubridate | Rectify Data format across tables |
| readr | Importing CSV files |
| DT | Creating functional tables in HTML |
| knitr | Useful in generating dynamic reports |
| rmarkdown | Converting R Markdown files into HTML/required format |
| magrittr | For Pipe operator |
Using R’s read.csv() function to import the comma separated files.
households <- read.csv("data/5000_households.csv", header = TRUE,
strip.white = TRUE, stringsAsFactors = FALSE,
na.strings = c("null", "NA"))
products <- read.csv("data/5000_products.csv", header = TRUE,
strip.white = TRUE, stringsAsFactors = FALSE,
na.strings = c("null", "NA"))
transactions <- read.csv("data/5000_transactions.csv", header = TRUE,
strip.white = TRUE, stringsAsFactors = FALSE,
na.strings = c("null", "NA"))
Step 1. White space removal from data set while importing dataset.
Step 2. Null value treatment. Identifying blank,NA,“null” as Null values.
Step 3. Renaming Columns names as per the data dictionary provided to avoid any confusion.
colnames(households) <-c("hshd_num","loyalty_flag","age_range","martial_status",
"income_range","homeowner_desc","hshd_composition",
"hh_size","children")
colnames(products) <- c("product_num","department","commodity","brand_type",
"natural_organic_flag")
colnames(transactions) <-c("hshd_num","basket_num","date","product_num","spend",
"units","store_region","week_num","year")
summary(transactions)
transactions$date <- dmy(as.character(transactions$date))
transactions <- subset(transactions, (spend>=0 & spend>=0))
boxplot(transactions$spend)
boxplot(transactions$units)
A look at the sample of 10 observations from each table, gives the user a chance to have a glimpse at the dataset.
| hshd_num | loyalty_flag | age_range | martial_status | income_range | homeowner_desc | hshd_composition | hh_size | children |
|---|---|---|---|---|---|---|---|---|
| 688 | Y | 75+ | Unknown | 35-49K | Homeowner | NA | NA | NOT AVAILABLE |
| 2590 | N | 75+ | Unknown | 50-74K | Homeowner | NA | NA | NOT AVAILABLE |
| 1171 | Y | 75+ | Unknown | 75-99K | Homeowner | NA | NA | NOT AVAILABLE |
| 1531 | Y | 75+ | Unknown | 75-99K | Homeowner | NA | NA | NOT AVAILABLE |
| 403 | N | 75+ | Unknown | UNDER 35K | Renter | NA | NA | NOT AVAILABLE |
| 283 | Y | 25-34 | Unknown | 50-74K | Homeowner | NA | NA | NOT AVAILABLE |
| 3864 | Y | 25-34 | Unknown | 50-74K | Homeowner | NA | NA | NOT AVAILABLE |
| 4899 | Y | 25-34 | Unknown | 50-74K | Homeowner | NA | NA | NOT AVAILABLE |
| 1443 | N | 35-44 | Unknown | 150K+ | Homeowner | NA | NA | NOT AVAILABLE |
| 3598 | Y | 35-44 | Unknown | 35-49K | Unknown | NA | NA | NOT AVAILABLE |
| product_num | department | commodity | brand_type | natural_organic_flag |
|---|---|---|---|---|
| 92993 | NON-FOOD | PET | PRIVATE | N |
| 93924 | NON-FOOD | PET | PRIVATE | N |
| 94272 | NON-FOOD | PET | PRIVATE | N |
| 94299 | NON-FOOD | PET | PRIVATE | N |
| 94594 | NON-FOOD | PET | PRIVATE | N |
| 94606 | NON-FOOD | PET | PRIVATE | N |
| 94613 | NON-FOOD | PET | PRIVATE | N |
| 95625 | NON-FOOD | PET | PRIVATE | N |
| 96152 | NON-FOOD | PET | PRIVATE | N |
| 96153 | NON-FOOD | PET | PRIVATE | N |
| hshd_num | basket_num | date | product_num | spend | units | store_region | week_num | year | |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 34 | 1253 | 2016-01-03 | 539501 | 2.19 | 1 | EAST | 1 | 2016 |
| 4 | 60 | 1595 | 2016-01-03 | 5260099 | 0.99 | 1 | WEST | 1 | 2016 |
| 5 | 60 | 1595 | 2016-01-03 | 4535660 | 2.50 | 2 | WEST | 1 | 2016 |
| 6 | 168 | 3393 | 2016-01-03 | 5602916 | 4.50 | 1 | SOUTH | 1 | 2016 |
| 7 | 199 | 4558 | 2016-01-03 | 5683532 | 3.49 | 1 | SOUTH | 1 | 2016 |
| 8 | 252 | 3491 | 2016-01-03 | 980450 | 2.79 | 1 | SOUTH | 1 | 2016 |
| 9 | 355 | 4733 | 2016-01-03 | 735993 | 1.00 | 1 | EAST | 1 | 2016 |
| 10 | 366 | 4369 | 2016-01-03 | 138157 | 9.98 | 2 | EAST | 1 | 2016 |
| 11 | 379 | 226 | 2016-01-03 | 83880 | 1.29 | 1 | EAST | 1 | 2016 |
| 12 | 413 | 2668 | 2016-01-03 | 89361 | 1.79 | 1 | WEST | 1 | 2016 |
| Variable | Description |
|---|---|
| hshd_num | Uniquely identifies each household |
| loyalty_flag | Estimated age range |
| age_range | If the household owns a loyalty card |
| martial_status | Maritial Status (A - Married, B - Single, U - Unknown) |
| income_range | Total Household income |
| homeowner_desc | Type of Residency(Homeowner,renter,etc) |
| hshd_composition | Composition of household |
| hh_size | Size of household up to 5+ |
| children | Number of children present up tp 3+ |
| Variable | Description |
|---|---|
| product_num | Uniquely identifies each product |
| department | Groups products at Food, Non-Food, Pharma |
| commodity | Groups products together at lower level than Department |
| brand_type | Identifies the product as Private or National |
| natural_organic_flag | Identifies the product as Natural/Organic or Inorganic |
| Variable | Description |
|---|---|
| product_num | Uniquely identifies each product |
| department | Groups products at Food, Non-Food, Pharma |
| commodity | Groups products together at lower level than Department |
| brand_type | Identifies the product as Private or National |
| natural_organic_flag | Identifies the product as Natural/Organic or Inorganic |
monthly_sales <- transactions %>%
group_by(Month = month(date), Year = year(date)) %>%
summarize(spend=sum(spend)) %>%
group_by(Month) %>%
summarize(spend = mean(spend)) %>%
arrange(Month) %>% data.frame
monthly_sales$Month <- as.factor(monthly_sales$Month)
Sales are highest in December presumably due to Christmas holidays
avg_basket_sales <- transactions %>%
group_by(hshd_num, basket_num, Month = month(date)) %>%
summarize(basket_spend=sum(spend))
avg_basket_sales2 <- avg_basket_sales %>%
group_by(Month) %>%
summarize(monthly_avg_basket_spend = mean(basket_spend)) %>%
arrange(Month) %>% data.frame
avg_basket_sales2$Month <- as.factor(avg_basket_sales2$Month)
The average basket spend is approximately constant throughout year except December when average basket amount is highest
Testing