The analysis of customer transactions for a period can provide useful insights on how customer engagement changes over time. It can be used to answer important questions like:
In general, these analyses can be useful for the management to get insights on which product or commodity are doing well in terms of sales and number of units sold and also can be used for selective marketing of products among each demographic group
To analyze this data, we will use the following R packages:
library(ggplot2) # visualizing data
library(dplyr) # transforming (joining, summarizing, etc.) data
library(DT) # for printing nice HTML output tables
Prior to assessing how customer engagement over time by analyzing the household transactions, data is imported, vaidated and cleaned. The source data consists of three csv files which consists of the Household data, products data and transactions data. The households data includes the data corresponding to 5000 households with demographic details such as Age range, Marital status, Income range, Homeowner description, Household compensation, Size and Number of children. The Products data has the product information such as Product number, Department, Commodity, Private/National brand and Natural/Organic flag. The transactions data has the transactions details corresponding to each household for the products they purchased for a period of two years.
The missing values are recorded mainly in three different formats. For certain variables it is marked as ‘Unknown’, certain others it is ‘null’,‘NOT AVAILABLE’ and some are marked as ‘NA’ as well.
The data is obtained from three different csv file corresponding to Households data, Products data and Transaction data respectively.
households <- read.csv("data/5000_households.csv", strip.white = T)
products <- read.csv("data/5000_products.csv", strip.white = T)
transactions <- read.csv("data/5000_transactions.csv", strip.white = T)
#using strip.white to remove the trailing spaces in the data
Once the data has been imported from the source files, the next step is to validate the data and clean the data into a tidy format.
For each data set, the columns which have incomlete/unidentifiable names are renamed with identifiable names. Since the missing values are recorded differently for each variable, we need to change those values to ‘NA’ in order to make them uniform. The code below takes care of this process. The variables are first converted into characters for this purpose and then after the data is cleaned, they are changed to appropriate data types.
Finally, the cleaned data sets are merged to a single final dataset with all the required variables.
# create tidy households data
households[] <- lapply(households, as.character)
# null values and all the data points where the data is not available are assigned a uniform
# value NA
households[households[,] == "null"] <- "NA"
households[households[,] == "NOT AVAILABLE"] <- "NA"
households[households[,] == "Not Available"] <- "NA"
households[households[,] == "Unknown"] <- "NA"
households[] <- lapply(households, factor)
households$HSHD_NUM <- as.integer(households$HSHD_NUM)
colnames(households)[2] <- "LOYALTY_FLAG"
colnames(households)[4] <- "MARITAL_STATUS"
colnames(households)[6] <- "HOMEOWNER_DESCRIPTION"
# create tidy products data
colnames(products)[5] <- "NATURAL/ORGANIC_FLAG"
colnames(products)[4] <- "BRAND_TYPE"
# create tidy transactions data
colnames(transactions)[3] <- "PURCHASE_DATE"
colnames(transactions)[7] <- "STORE_REGION"
# converting purchase date to a date column
transactions$PURCHASE_DATE <- as.character(transactions$PURCHASE_DATE)
transactions$PURCHASE_DATE <- as.Date(transactions$PURCHASE_DATE, format = "%d-%b-%y")
# create tidy merged data frame
hh_trans <- merge(households, transactions, by = "HSHD_NUM")
final_dataset <- merge(hh_trans, products, by = "PRODUCT_NUM")
The final cleaned dataset consists of 21 variables in total as shown in the data table below. The variables which we are mainly concerned about are WEEK_NUM (varies from 1-104), SPEND(spend corresponding to each transaction), UNITS (number of units for each transaction). In addition to this, we are interested in factors such as DEPARTMENT, COMMODITY, AGE_RANGE, MARITAL_STATUS, INCOME_RANGE, HSHD_COMPOSITION, HH_SIZE and CHILDREN.
data_sample <- head(final_dataset, n = 50)
datatable(data_sample)
The main goal of this analysis is to find out how customer engagement changes over time. In order to get an overall picture of this, we may need to perform the following analyses.
In general, the product analysis can be useful for the management to get insights on which product or commodity are doing well in terms of sales and number of units sold.
All the above information can be used for selective marketing of products among each demographic group