What is the Ta Feng Grocery Dataset?

The Ta Feng Grocery Dataset is a Supermarket Dataset containining 817741 transactions from November 2000 until the end of February 2001.

The dataset contains information about 119578 shopping baskets, belonging to 32266 users, where 1129939 items were purchased from a range of 23812 products.

Fields of the Dataset are:
- Transaction date and time (time invalid and useless)
- Customer ID
- Age: 10 possible values
- Residence Area: 8 possible values
- Product subclass
- Product ID
- Amount
- Asset
- Sales price

Ta Feng Grocery Dataset

How reliable is this data?

There is no context for this data we do not know where (country or city) the data is sourced from, what type of items being sold, whether and how the data has been pre-processed or subsetted.
In addition:
The dataset does not come with any description of how it was collected and cannot hence not be reproduced or verified.
There is no detailed description of the data ie what currency is used for Sales Price or what units are used for the variables ‘Amounts’ or ‘Assets’

How did you make an assessment?

I checked the internet for descriptions and information about the data
I used univariate statistical techniques to summarise the data
Dataset commands were used to check for null values
Data visualisation was used to show the distributions of the variables

Ta Feng Grocery Dataset

How does the data quality limit or constrain your ideas for analysis?

Since the data is not current, any modelling or forecasting is not applicable to the present day.
Without a context or reliable description of the data, I would not use it to make quantitative conclusions eg price predictions
It can be used for descriptive and insight analysis ie relationship analysis, customer profiling, historical comparisons

What other datasets could be brought in to enrich these questions?

Customer Information ie Demographic information
Products Information ie Product Descriptions
Store Information ie Location, Opening Hours, Size

Stakeholders

What interesting questions could you answer with this data?

If you are a retailer?

Who are my customers (customer profiling)
What are they typically buying? (basket analysis)
When are items being purchased? (time analysis)
What is the time (in days) between visits for my customers? (customer profiling & time analysis)

If you are a consumer?

How do my shopping habits compare to others in my area (customer profiling)
When is the busiest day to shop in my area? (time and region analysis)
Is there any difference between prices for the products I buy in different Regions? (Region & Price analysis)

Stakeholders

What interesting questions could you answer with this data?

If you are a wholesaler?

What are my best selling products?
What are the best-selling products in each Region? (basket analysis)
When is the best day for sales of my product? (time series analysis)
What is the sales trend for my products?

Analysis from Wholesaler Perspective

What are the best selling products?

Top 3 Products are
Product IDs 4714981010038, 4710421090059, 4711271000014 Bar Chart of Top 10 Selling ProdID

Analysis from Wholesaler Perspective

How are the best selling products selling in the Regions?

Region E sells the most of all products
Region G sells a high volume of Product ‘4710421090059’

Analysis from Wholesaler Perspective

What is the sales trend for my best-selling products?

Bar Chart of Product Sales by ProdID

Analysis from Wholesaler Perspective

Time Series Analysis of Product ID 4714981010038

Time Series Analysis allows seasonality to be removed from data, showing trends more clearly

Time Series Decomposition of Product Sales by ProdID

Analysis from Wholesaler Perspective

How does the sales trend differ for the Top Selling Products?

Sales Trend for Top 3 Products

Analysis / Modelling on Sales Trends for Top Selling Products

What is something interesting?

The times of peak sales for the best selling products may coincide but are genrally quite different.
Product ID 4711271000014 also has high sales volumes from the middle of December until the middle of January, however there appears to be an increase in sales volumes from the middle of February.

What is something useful?

Product ID 4711271000014 sells very well in Region ‘G’
The increasing sales trend for the Product 4711271000014 at the end of the time period is an unusual trend.

Analysis / Modelling on Sales Trend for Top Selling Products

What evidence or rationale supports your findings?

The sales totals for the Top Selling Products by Region
The timer series analysis for the different Product IDs

How would you explain this to a non-technical business client?

Product ID 4711271000014 sells very well in Region ‘G’ as shown by the total sales per Region.
Product ID 4711271000014 has an unusual upward sales trend in February.

If a client wanted to track these insights ongoing

How would you go about designing an operational solution?

Automated data collection from required database
Automated dataset collation and combining on an agreed time schedule
Scripted data processing and report output
Automated report distribution

How would you go about estimating time/cost/skills needed to build something?

Obtain information about the type, access and location of the involved Database systems
Obtain information about the size and complexity of the data to be processed
Obtain information about the client requirements for the format and frequency of reporting

If a client wanted to track these insights ongoing

What are some of the top challenges you would expect?

Database systems may be different types, have different access methods and provide data in different formats
Data sizes may be large and not transportable
Reports may require interpretation or context ie the input of expert opinion before being widely distributed

Appendix

R Libraries

The R libraries used include those for data manipulation (reshape2,data.table), plotting (ggplot, grid, gridExtra), and forecasting (xts, forecast)

library(reshape2)
library(ggplot2)
library(grid)
library(gridExtra)
library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt

library(forecast)
library(xts)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:data.table':
## 
##     first, last

Data Load and Combine

The Dataset is loaded from the four input files and combined after adding column descriptors and formatting some columns.

df.TFD11 <- read.csv("D11", header = T, sep = ";", 
    stringsAsFactors = FALSE)
df.TFD12 <- read.csv("D12", header = T, sep = ";", 
    stringsAsFactors = FALSE)
df.TFD01 <- read.csv("D01", header = T, sep = ";", 
    stringsAsFactors = FALSE)
df.TFD02 <- read.csv("D02", header = T, sep = ";", 
    stringsAsFactors = FALSE)
D11cols <- colnames(df.TFD11)
D12cols <- colnames(df.TFD12)
D01cols <- colnames(df.TFD01)
D02cols <- colnames(df.TFD02)
identical(D11cols, D12cols)
identical(D11cols, D01cols)
identical(D11cols, D02cols)
df.TFD0 <- rbind(df.TFD01, df.TFD02)
df.TFD1 <- rbind(df.TFD11, df.TFD12)
df.TFDall <- rbind(df.TFD0, df.TFD1)

Data Format

Dataset Columns are defined and fields converted to required format.
Data Table is created for query analysis

TFcols <- c("DateTime", "CustID", "Age_cat", "ResArea", 
    "ProdSub", "ProdID", "Amount", "Asset", "SalesPrice")
colnames(df.TFDall) <- TFcols
dt.TF <- as.data.table(df.TFDall)
dt.TF$DateTime <- as.POSIXct(dt.TF$DateTime)
dt.TF$ProdID <- as.factor(dt.TF$ProdID)
df.TFDall$DateTime <- as.POSIXlt(df.TFDall$DateTime)
df.TFDall$Age_cat = as.factor(df.TFDall$Age_cat)
df.TFDall$ResArea = as.factor(df.TFDall$ResArea)
df.TFDall$CustID = as.factor(df.TFDall$CustID)
df.TFDall$ProdSub = as.factor(df.TFDall$ProdSub)

Data Exploration

From the summary of the dataset it is observed that the max values of colums CustID, Amount, Asset and Sales Price is much higher that the 3rd Quartile, suggesting outliers. This can be explored graphically.
The dataset was also checked for NULL values (with none found)

summary(df.TFDall)
str(df.TFDall)
sapply(df.TFDall, function(x) sum(is.na(x)))

Data Exploration

##     DateTime                       CustID          Age_cat      
##  Min.   :2000-11-01 00:00:00   20459  :  1246   D      :181213  
##  1st Qu.:2000-11-28 00:00:00   2112589:   879   E      :151023  
##  Median :2001-01-01 00:00:00   2112596:   822   C      :140805  
##  Mean   :2000-12-30 16:40:45   2113579:   776   F      : 99719  
##  3rd Qu.:2001-01-30 00:00:00   1847994:   754   B      : 66432  
##  Max.   :2001-02-28 00:00:00   426053 :   596   G      : 53719  
##                                (Other):812668   (Other):124830  
##     ResArea          ProdSub           ProdID             
##  E      :312501   100205 : 20400   Min.   :     20008819  
##  F      :245213   120103 : 15360   1st Qu.:4710085127020  
##  G      : 72092   110401 : 13099   Median :4710421090060  
##  C      : 71640   110411 : 12794   Mean   :4461639280530  
##  H      : 40666   130206 : 12008   3rd Qu.:4712500125130  
##  D      : 38674   130315 : 11874   Max.   :9789579967620  
##  (Other): 36955   (Other):732206                          
##      Amount             Asset            SalesPrice      
##  Min.   :   1.000   Min.   :     0.0   Min.   :     1.0  
##  1st Qu.:   1.000   1st Qu.:    35.0   1st Qu.:    42.0  
##  Median :   1.000   Median :    62.0   Median :    76.0  
##  Mean   :   1.382   Mean   :   112.1   Mean   :   131.9  
##  3rd Qu.:   1.000   3rd Qu.:   112.0   3rd Qu.:   132.0  
##  Max.   :1200.000   Max.   :432000.0   Max.   :444000.0  
##

## 'data.frame':    817741 obs. of  9 variables:
##  $ DateTime  : POSIXlt, format: "2001-01-01" "2001-01-01" ...
##  $ CustID    : Factor w/ 32266 levels "1069","1113",..: 1163 10676 14476 18272 30090 21913 8144 18759 7565 7013 ...
##  $ Age_cat   : Factor w/ 11 levels "A ","B ","C ",..: 6 5 5 5 1 10 4 8 6 5 ...
##  $ ResArea   : Factor w/ 8 levels "A ","B ","C ",..: 6 5 7 6 2 5 3 1 3 5 ...
##  $ ProdSub   : Factor w/ 2012 levels "100101","100102",..: 317 180 14 962 462 135 1489 1169 1014 577 ...
##  $ ProdID    : num  4710105011011 4710265849066 4712019100607 4710168702901 4710431339148 ...
##  $ Amount    : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ Asset     : int  44 150 35 94 100 144 740 676 170 36 ...
##  $ SalesPrice: int  52 129 39 119 159 190 969 849 219 59 ...

##   DateTime     CustID    Age_cat    ResArea    ProdSub     ProdID 
##          0          0          0          0          0          0 
##     Amount      Asset SalesPrice 
##          0          0          0

Data Exploration

Questions for the dataset:
How many Customers in the dataset?
How many different products?
How many Shopping Baskets?
How many items are purchased?

ncustomers = length(unique(dt.TF[, CustID]))
print(sprintf("Total Number of Customers= %10.0f", 
    ncustomers))
nproducts = length(unique(dt.TF[, ProdID]))
print(sprintf("Total Number of Products= %10.0f", nproducts))
nbaskets = nrow(unique(dt.TF[, list(DateTime, CustID, 
    ResArea)]))
print(sprintf("Total Number of Shopping Baskets= %10.0f", 
    nbaskets))
nitems = sum(dt.TF[, Amount])
print(sprintf("Total Number of Items= %10.0f", nitems))
nvalue = dt.TF[, list(sum(Amount * SalesPrice))]
print(sprintf("Total Value of Transactions = %10.0f", 
    nvalue))

Data Exploration - Amount

The distributions of variable ‘Amount’ show almost all transactions are low value transactions (less than 2), which a small number of high value transactions.
From this it can be assumed that the number of transactions can be assumed to be of more interest than the size of the transactions.

p1 = ggplot(df.TFDall, aes(x = df.TFDall$Amount)) + 
    geom_histogram(binwidth = 1, colour = "black", 
        fill = "#56B4D9", na.rm = TRUE) + ylim(c(0, 
    15000)) + xlim(c(1, 100)) + ggtitle("Histogram of Amount") + 
    labs(y = "Count", x = "")
p2 = ggplot(df.TFDall, aes(x = "", y = df.TFDall$Amount)) + 
    geom_boxplot(fill = "#56B4D9", outlier.colour = "black", 
        outlier.shape = 1, outlier.size = 2, na.rm = TRUE) + 
    ggtitle("Boxplot of Amount") + theme(axis.text.x = element_blank()) + 
    labs(y = "Amount Value") + labs(y = "Count", x = "")
grid.arrange(p1, p2, ncol = 2, nrow = 1, widths = c(0.8, 
    0.8))

Data Exploration - Amount

TaFeng Variable ‘Amount’ Distribution

Data Exploration - Asset

The distributions of variable ‘Asset’ show almost all values are less than 1000, and all except 2 values being less that 10000.
This sheds no light on this variable and it is not used further in the analysis.

p1 = ggplot(df.TFDall, aes(x = df.TFDall$Asset)) + 
    geom_histogram(binwidth = 100, colour = "black", 
        fill = "#56B4D9", na.rm = TRUE) + ylim(c(0, 
    100000)) + xlim(c(0, 10000)) + ggtitle("Histogram of TaFeng Asset") + 
    labs(y = "Count", x = "")
p2 = ggplot(df.TFDall, aes(x = "", y = df.TFDall$Asset)) + 
    geom_boxplot(fill = "#56B4D9", outlier.colour = "black", 
        outlier.shape = 1, outlier.size = 2, na.rm = TRUE) + 
    ggtitle("Boxplot of TaFeng Asset") + labs(y = "Asset Value") + 
    theme(axis.text.x = element_blank()) + labs(y = "Count", 
    x = "")
grid.arrange(p1, p2, ncol = 2, nrow = 1, widths = c(0.8, 
    0.8))

Data Exploration - Asset

TaFeng Variable ‘Asset’ Distribution

Data Exploration - Sales Price

The distributions of variable ‘Sales Price’ show almost all values are less than 2000, with only a small number of samples with Sales Prices above 10000.
These values could be outliers, however they are not removed from the dataset.

p1 = ggplot(df.TFDall, aes(x = df.TFDall$SalesPrice)) + 
    geom_histogram(binwidth = 20, colour = "black", 
        fill = "#56B4D9", na.rm = TRUE) + ylim(c(0, 
    75000)) + xlim(c(0, 2000)) + ggtitle("Histogram of TaFeng Sales Price") + 
    labs(y = "Count", x = "")
p2 = ggplot(df.TFDall, aes(x = "", y = df.TFDall$SalesPrice)) + 
    geom_boxplot(fill = "#56B4D9", outlier.colour = "black", 
        outlier.shape = 1, outlier.size = 2, na.rm = TRUE) + 
    ggtitle("Boxplot of TaFeng Sales Price") + labs(y = "Sales Price", 
    x = "")
grid.arrange(p1, p2, ncol = 2, nrow = 1, widths = c(0.8, 
    0.8))

Data Exploration - Sales Price

‘Sales Price’ Distribution

Data Analysis: Transactions per Day

Defining a Transaction to be a unique combination of Customer ID, DateTime, and ResArea, it is possible to visualise the number of transactions per day.

dt.TF_trans <- unique(dt.TF[, list(DateTime, CustID, 
    ResArea)])
df.trans <- dt.TF_trans[, list(num_trans = length(CustID)), 
    by = DateTime]
ggplot(dt.TF_trans[, list(num_trans = length(CustID)), 
    by = DateTime]) + geom_bar(aes(x = DateTime, y = num_trans), 
    stat = "identity", alpha = 0.8) + labs(y = "# Transactions", 
    x = "Date")

Data Analysis: Transactions per Day

Data Analysis: Transactions per Region

Splitting the data into regions shows the difference in the number of transactions in each Region, with Region E being the busiest region with the highest number of transactions in January.
Regions A and B can be seen to have low numbers of transactions.

ggplot(dt.TF_trans[, list(num_trans = length(CustID)), 
    by = list(DateTime, ResArea)]) + geom_bar(aes(x = DateTime, 
    y = num_trans, color = ResArea), stat = "identity", 
    alpha = 0.8) + facet_wrap(~ResArea) + labs(y = "# Transactions", 
    x = "Date")

Data Analysis: Transactions per Region

Data Analysis: Top Selling Products

Looking at the amount of each ProdID sold per day, it is possible to determine the top selling product using ‘Amount’.
It is possible to show the amount of each ProdID sold each day.

dt.TF_prodid <- dt.TF[, list(TAmount = sum(Amount)), 
    by = list(DateTime, ProdID, ResArea)]
topPr <- dt.TF_prodid[, list(Total = sum(TAmount)), 
    by = .(ProdID)]
topPr <- topPr[order(-Total)]
head(topPr)

Data Analysis: Top Selling Products

##           ProdID Total
## 1: 4714981010038 14537
## 2: 4710421090059 11790
## 3: 4711271000014 10615
## 4: 4711663700010  3810
## 5: 4710114128038  3322
## 6: 4710032501791  2865

Data Analysis: Number of Products sold

ggplot(topPr[Total >= 2700, ]) + geom_bar(aes(x = ProdID, 
    y = Total, fill = ProdID), stat = "identity", alpha = 0.8) + 
    theme(axis.text.x = element_text(angle = 25)) + 
    scale_fill_hue(l = 40)

Data Analysis: Number of Products sold

Data Analysis: Number Products Sold per Day (Top 3 Selling Products)

A more detailed analysis of the sales of the Top selling Product shows significant increases at some times in the series.

dt.TF_prodid <- dt.TF[, list(TAmount = sum(Amount)), 
    by = list(DateTime, ProdID, ResArea)]
dt.TF_prodidtop3 <- subset(dt.TF_prodid, ProdID %in% 
    c(4714981010038, 4710421090059, 4711271000014))
dt.TF_prodid_nores <- dt.TF[, list(TAmount = sum(Amount)), 
    by = list(DateTime, ProdID)]
dt.TF_prodid1 <- subset(dt.TF_prodid_nores, ProdID == 
    4714981010038)
dt.TF_prodid2 <- subset(dt.TF_prodid_nores, ProdID == 
    4710421090059)
dt.TF_prodid3 <- subset(dt.TF_prodid_nores, ProdID == 
    4711271000014)
ggplot(dt.TF_prodidtop3[, list(TAmount), by = list(DateTime, 
    ProdID)]) + geom_bar(aes(x = DateTime, y = TAmount, 
    fill = ProdID), stat = "identity", alpha = 0.8) + 
    scale_fill_manual(values = c("#007F85", "#0073B9", 
        "#AB00B6")) + labs(y = "Sales", x = "")

Data Analysis: Number Products Sold per Day (Top 3 Selling Products)

Data Analysis - Sales per Region

Splitting the data into regions shows the difference in the number of transactions in each Region, with Region E being the busiest region with the highest number of sales. Regions A and B can be seen to have low numbers of sales. Most Regions show a similar sales profile for all three top products however Region G shows the highest number of sales for Product 4710421090059.

ggplot(dt.TF_prodidtop3[, list(num_trans = sum(TAmount)), 
    by = list(ProdID, ResArea)]) + geom_bar(aes(x = ProdID, 
    y = num_trans, fill = ProdID), stat = "identity", 
    alpha = 0.8) + scale_fill_manual(values = c("#007F85", 
    "#0073B9", "#AB00B6")) + facet_wrap(~ResArea) + 
    labs(y = "Sales", x = "") + theme(axis.ticks = element_blank(), 
    axis.text.x = element_blank())

Data Analysis - Sales per Region

Time Series Analysis

Time Series Analysis of Product IDs 4714981010038,4710421090059 and 4711271000014

ts_trans <- xts(df.trans$num_trans, as.Date(df.trans$DateTime))
attr(ts_trans, "frequency") <- 7
trans_decom <- stl(as.ts(ts_trans), s.window = "periodic", 
    t.window = 7)
plot(trans_decom)

Time Series Analysis of Product ID ‘4714981010038’

ts_trans1 <- xts(dt.TF_prodid1$TAmount, as.Date(dt.TF_prodid1$DateTime))
attr(ts_trans1, "frequency") <- 7
trans_decom1 <- stl(as.ts(ts_trans1), s.window = "periodic", 
    t.window = 7)
plot(trans_decom1)

Time Series Analysis of Product IDs 4714981010038,4710421090059 and 4711271000014

Time Series Analysis of Product ID ‘4714981010038’

Time Series Analysis of Product ID ‘4710421090059’

ts_trans2 <- xts(dt.TF_prodid2$TAmount, as.Date(dt.TF_prodid2$DateTime))
attr(ts_trans2, "frequency") <- 7
trans_decom2 <- stl(as.ts(ts_trans2), s.window = "periodic", 
    t.window = 7)
plot(trans_decom2)

Time Series Analysis of Product ID ‘4711271000014’

ts_trans3 <- xts(dt.TF_prodid3$TAmount, as.Date(dt.TF_prodid3$DateTime))
attr(ts_trans3, "frequency") <- 7
trans_decom3 <- stl(as.ts(ts_trans3), s.window = "periodic", 
    t.window = 7)
plot(trans_decom3)

Time Series Analysis of Product ID ‘4710421090059’

Time Series Analysis of Product ID ‘4711271000014’

Time Series Analysis

Combine Time Series Analysis into one plot

par(mfrow = c(3, 1))
plot(trans_decom1$time.series[, "trend"], ylab = "", 
    main = "ProdID: 4714981010038")
plot(trans_decom2$time.series[, "trend"], ylab = "", 
    main = "ProdID: 4710421090059")
plot(trans_decom3$time.series[, "trend"], ylab = "", 
    main = "ProdID: 4711271000014")

Analysis of Ta Feng Grocery Dataset

What is the Ta Feng Grocery Dataset?

Ta Feng Grocery Dataset

How reliable is this data?

How did you make an assessment?

Ta Feng Grocery Dataset

How does the data quality limit or constrain your ideas for analysis?

What other datasets could be brought in to enrich these questions?

Stakeholders

What interesting questions could you answer with this data?

If you are a retailer?

If you are a consumer?

Stakeholders

What interesting questions could you answer with this data?

If you are a wholesaler?

Analysis from Wholesaler Perspective

What are the best selling products?

Analysis from Wholesaler Perspective

How are the best selling products selling in the Regions?

Analysis from Wholesaler Perspective

Analysis from Wholesaler Perspective

Time Series Analysis of Product ID 4714981010038

Analysis from Wholesaler Perspective

Analysis / Modelling on Sales Trends for Top Selling Products

What is something interesting?

What is something useful?

Analysis / Modelling on Sales Trend for Top Selling Products

What evidence or rationale supports your findings?

How would you explain this to a non-technical business client?

If a client wanted to track these insights ongoing

How would you go about designing an operational solution?

How would you go about estimating time/cost/skills needed to build something?

If a client wanted to track these insights ongoing

What are some of the top challenges you would expect?

Appendix

R Libraries

Data Load and Combine

Data Format

Data Exploration

Data Exploration

Data Exploration

Data Exploration - Amount

Data Exploration - Amount

Data Exploration - Asset

Data Exploration - Asset

Data Exploration - Sales Price

Data Exploration - Sales Price

Data Analysis: Transactions per Day

Data Analysis: Transactions per Day

Data Analysis: Transactions per Region

Data Analysis: Transactions per Region

Data Analysis: Top Selling Products

Data Analysis: Top Selling Products

Data Analysis: Number of Products sold

Data Analysis: Number of Products sold

Data Analysis: Number Products Sold per Day (Top 3 Selling Products)

Data Analysis: Number Products Sold per Day (Top 3 Selling Products)

Data Analysis - Sales per Region

Data Analysis - Sales per Region

Time Series Analysis

Time Series Analysis of Product IDs 4714981010038,4710421090059 and 4711271000014

Time Series Analysis of Product ID ‘4714981010038’

Time Series Analysis of Product IDs 4714981010038,4710421090059 and 4711271000014

Time Series Analysis of Product ID ‘4714981010038’

Time Series Analysis of Product ID ‘4710421090059’

Time Series Analysis of Product ID ‘4711271000014’

Time Series Analysis of Product ID ‘4710421090059’

Time Series Analysis of Product ID ‘4711271000014’

Time Series Analysis

Combine Time Series Analysis into one plot

Combine Time Series Analysis into one plot