# R Packages used for this report
library(data.table)  # fast data import and manipulation
library(readr)       # reading rectangular data files
library(dplyr)       # data manipulation verbs (filter, mutate, summarise)
library(tidyr)       # reshaping data between wide and long formats
library(outliers)    # outlier detection helpers
library(ggplot2)     # plotting
library(lattice)     # additional plotting functions
This project analysed the sales of potato chips to answer several questions of interest to the client (a sketch of how these metrics might be computed appears after the list):
- Who spends the most on chips (total sales), describing customers by lifestage and by how premium their general purchasing behaviour is?
- How many customers are in each segment?
- How many chips are bought per customer in each segment?
- What is the average chip price by customer segment?
- What is each customer's total spend over the period and per transaction, to understand what proportion of their grocery spend goes to chips?
- What proportion of customers fall into each segment overall, compared against the mix of customers who purchase chips?
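For illustration only, the sketch below shows one way these segment-level metrics could be computed once the transaction and customer data are combined. The merged table is called `data` here purely as an assumed name; the merge itself is described later in the report.
#### Sketch: segment-level metrics from the merged data (illustrative)
data %>%
  group_by(LIFESTAGE, PREMIUM_CUSTOMER) %>%
  summarise(total_sales = sum(TOT_SALES),                                    # who spends the most on chips
            customers = n_distinct(LYLTY_CARD_NBR),                          # customers per segment
            units_per_customer = sum(PROD_QTY) / n_distinct(LYLTY_CARD_NBR), # chips bought per customer
            avg_price_per_unit = sum(TOT_SALES) / sum(PROD_QTY)) %>%         # average chip price
  arrange(desc(total_sales))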
The data preprocessing process includes the following steps (a brief sketch illustrating steps 4 and 5 follows this list):
1. Read/import the two Quantium datasets into RStudio using appropriate functions and packages.
2. Merge the two datasets into one and select the variables of interest.
3. Describe each variable used in the data.
4. Understand the data: examine the structure and data types, discover the unique values of factor variables, label and order the levels where necessary, and check the class and levels.
5. Tidy and manipulate the data, for example by mutating new variables from calculations and reshaping the data from an untidy to a tidy format for easier analysis.
6. Scan the data for missing values, errors, or special values, and choose a suitable method to deal with them.
7. Scan for outliers using summary statistics and appropriate plots, explain the methodology, and transform the data for better insight.
8. Provide an overall conclusion.
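As a minimal sketch of steps 4 and 5 (not code taken from the report), the customer segmentation columns could be converted to factors with ordered levels and a calculated unit-price column added; `data` again denotes the merged dataset, and PRICE_PER_UNIT is an assumed column name.
#### Sketch: factor conversion and a derived variable (illustrative)
data <- data %>%
  mutate(PREMIUM_CUSTOMER = factor(PREMIUM_CUSTOMER,
                                   levels = c("Budget", "Mainstream", "Premium"),
                                   ordered = TRUE),       # step 4: label and order the levels
         LIFESTAGE = factor(LIFESTAGE),                   # step 4: treat lifestage as a factor
         PRICE_PER_UNIT = TOT_SALES / PROD_QTY)           # step 5: mutate a new calculated variable
str(data$PREMIUM_CUSTOMER)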
### Getting the Files
filePath <- paste0(getwd(), "/")                                        # use the working directory as the file path
transactionData <- fread(paste0(filePath, "QVI_transaction_data.csv"))  # transaction-level data
customerData <- fread(paste0(filePath, "QVI_purchase_behaviour.csv"))   # customer segment data
### Exploratory Data Analysis
dim(transactionData)
## [1] 264836 8
dim(customerData)
## [1] 72637 3
names(transactionData)
## [1] "DATE" "STORE_NBR" "LYLTY_CARD_NBR" "TXN_ID"
## [5] "PROD_NBR" "PROD_NAME" "PROD_QTY" "TOT_SALES"
names(customerData)
## [1] "LYLTY_CARD_NBR" "LIFESTAGE" "PREMIUM_CUSTOMER"
head(transactionData)
head(customerData)
### Examine Data Types
str(transactionData)
## Classes 'data.table' and 'data.frame': 264836 obs. of 8 variables:
## $ DATE : int 43390 43599 43605 43329 43330 43604 43601 43601 43332 43330 ...
## $ STORE_NBR : int 1 1 1 2 2 4 4 4 5 7 ...
## $ LYLTY_CARD_NBR: int 1000 1307 1343 2373 2426 4074 4149 4196 5026 7150 ...
## $ TXN_ID : int 1 348 383 974 1038 2982 3333 3539 4525 6900 ...
## $ PROD_NBR : int 5 66 61 69 108 57 16 24 42 52 ...
## $ PROD_NAME : chr "Natural Chip Compny SeaSalt175g" "CCs Nacho Cheese 175g" "Smiths Crinkle Cut Chips Chicken 170g" "Smiths Chip Thinly S/Cream&Onion 175g" ...
## $ PROD_QTY : int 2 3 2 5 3 1 1 1 1 2 ...
## $ TOT_SALES : num 6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(customerData)
## Classes 'data.table' and 'data.frame': 72637 obs. of 3 variables:
## $ LYLTY_CARD_NBR : int 1000 1002 1003 1004 1005 1007 1009 1010 1011 1012 ...
## $ LIFESTAGE : chr "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "OLDER SINGLES/COUPLES" ...
## $ PREMIUM_CUSTOMER: chr "Premium" "Mainstream" "Budget" "Mainstream" ...
## - attr(*, ".internal.selfref")=<externalptr>
### Convert the DATE column (stored as an integer count of days with Excel-style origin 1899-12-30) to Date format
transactionData$DATE <- as.Date(transactionData$DATE, origin = "1899-12-30")
transactionData$DATE %>% head()
## [1] "2018-10-17" "2019-05-14" "2019-05-20" "2018-08-17" "2018-08-18"
## [6] "2019-05-19"
The transaction dataset consists of 264836 observations and 8 variables.
The customer dataset consists of 72637 observations and 3 variables.
Variables in Transaction Dataset:
1. DATE: date of the transaction
2. STORE_NBR: store id
3. LYLTY_CARD_NBR: customer loyalty card number
4. TXN_ID: transaction number
5. PROD_NBR: product number/ID
6. PROD_NAME: product name
7. PROD_QTY: product quantity
8. TOT_SALES: total sales
Variables in Customer Dataset:
1. LYLTY_CARD_NBR: customer loyalty card number
2. LIFESTAGE: customer life stage segment (e.g., YOUNG SINGLES/COUPLES, OLDER SINGLES/COUPLES, YOUNG FAMILIES)
3. PREMIUM_CUSTOMER: how premium the customer's general purchasing behaviour is (Budget, Mainstream, or Premium)
The two datasets are merged into one with the merge() function, joining on LYLTY_CARD_NBR, since we want to analyse total chip sales alongside the purchasing customers' backgrounds.
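A minimal sketch of this merge (the exact code is not shown in this section; `data` is the name assumed here for the merged table):
#### Merge transaction data with customer data on the loyalty card number
data <- merge(transactionData, customerData, by = "LYLTY_CARD_NBR", all.x = TRUE)
dim(data)  # should keep one row per transaction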
Data cleansing: check for missing values, outliers, whitespace, and special characters.
### Examine PROD_NAME
summary(transactionData$PROD_NAME)
## Length Class Mode
## 264836 character character
It looks like we are dealing with potato chips, but how can we check that these are all chips? We can do some basic text analysis by summarising the individual words in the product names.
#### Examine the words in PROD_NAME to see if there are any incorrect entries
#### such as products that are not chips
productWords <- data.table(unlist(strsplit(unique(transactionData$PROD_NAME), " ")))
setnames(productWords, 'words')
As we are only interested in words that will tell us if the product is chips or not, let’s remove all words with digits and special characters such as ‘&’ from our set of product words.
### Removing digits and special characters
productWords[, words := gsub("[[:punct:]]", "", words)]                   # strip special characters such as '&'
productWords <- productWords[!grepl("[[:digit:]]", words) & words != ""]  # drop words that contain digits
### Count the number of times each word appears, sorted from highest to lowest
productWords %>% count(words) %>% arrange(desc(n))
There are salsa products in the dataset but we are only interested in the chips category, so let’s remove these.
### Remove Salsa products in the dataset
dim(transactionData) # Dimension before removing Salsa
## [1] 264836 8
transactionData <- filter(transactionData, !grepl('Salsa|SALSA', PROD_NAME))
dim(transactionData) # Dimension after removing Salsa
## [1] 246742 8
### Summarise data
summary(transactionData)
## DATE STORE_NBR LYLTY_CARD_NBR TXN_ID
## Min. :2018-07-01 Min. : 1.0 Min. : 1000 Min. : 1
## 1st Qu.:2018-09-30 1st Qu.: 70.0 1st Qu.: 70015 1st Qu.: 67569
## Median :2018-12-30 Median :130.0 Median : 130367 Median : 135183
## Mean :2018-12-30 Mean :135.1 Mean : 135531 Mean : 135131
## 3rd Qu.:2019-03-31 3rd Qu.:203.0 3rd Qu.: 203084 3rd Qu.: 202654
## Max. :2019-06-30 Max. :272.0 Max. :2373711 Max. :2415841
## PROD_NBR PROD_NAME PROD_QTY TOT_SALES
## Min. : 1.00 Length:246742 Min. : 1.000 Min. : 1.700
## 1st Qu.: 26.00 Class :character 1st Qu.: 2.000 1st Qu.: 5.800
## Median : 53.00 Mode :character Median : 2.000 Median : 7.400
## Mean : 56.35 Mean : 1.908 Mean : 7.321
## 3rd Qu.: 87.00 3rd Qu.: 2.000 3rd Qu.: 8.800
## Max. :114.00 Max. :200.000 Max. :650.000
### Check for any missing values
sum(is.na(transactionData))
## [1] 0
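No missing values are present. As an additional scan (not in the original code), one could also check missing values per column and look for non-finite special values (NaN, Inf) in the numeric columns:
#### Sketch: per-column scan for missing and special values (illustrative)
colSums(is.na(transactionData))                        # NAs per column
sapply(transactionData[, c("PROD_QTY", "TOT_SALES")],
       function(x) sum(!is.finite(x)))                 # NaN/Inf counts in the numeric columns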
Let’s further check the dataset to find any outliers.
### Boxplot of PROD_QTY to find outliers
boxplot(transactionData$PROD_QTY)
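Since the outliers package is loaded above, a z-score scan can complement the boxplot; this is a sketch rather than part of the original workflow:
#### Sketch: flag extreme PROD_QTY values with z-scores (illustrative)
z <- outliers::scores(transactionData$PROD_QTY, type = "z")  # standardised score for each transaction
sum(abs(z) > 3)                                              # count of transactions more than 3 SDs from the mean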
### Filter transactions with PROD_QTY of 200 packets or more
transactionData %>% filter(PROD_QTY >= 200)
We can see that there are outliers in PROD_QTY. Let's remove these transactions.
### Exclude outliers from the dataset
transactionData <- transactionData[!(transactionData$PROD_QTY>=200),]
### Boxplot of LYLTY_CARD_NBR to find outliers
boxplot(transactionData$LYLTY_CARD_NBR)
### Inspect purchases by customers with loyalty card numbers of 226000 and above
transactionData %>% filter(LYLTY_CARD_NBR >= 226000)
There are outliers in LYLTY_CARD_NBR as well. Let's remove these transactions.
### Exclude outliers from the dataset
transactionData <- transactionData[!(transactionData$LYLTY_CARD_NBR>=226000),]
Now let's count the number of transactions by date and plot them over time.
### Count the number of transactions by date
count_date <- transactionData %>% count(DATE)
#### Setting plot themes to format graphs
theme_set(theme_bw())
theme_update(plot.title = element_text(hjust = 0.5))
#### Plot transactions over time
ggplot(count_date, aes(x = DATE, y = n)) +
geom_line() +
labs(x = "Day", y = "Number of transactions", title = "Transactions over time") +
scale_x_date(breaks = "1 month") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
It looks like there is a spike in sales between December 2018 and January 2019. Let's check the sales in this date range in more detail.
### Filter to December sales only
december <- seq(from = as.Date("2018-12-01"), to = as.Date("2018-12-31"), by = "day")
count_sales_december <- filter(count_date, DATE %in% december)
#### Plot transactions for December only
ggplot(count_sales_december, aes(x = DATE, y = n)) +
geom_line() +
labs(x = "Day", y = "Number of transactions", title = "Transactions in December") +
scale_x_date(breaks = "1 day") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
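As a further completeness check (not part of the original code), the daily counts could be compared against a full calendar of the period to reveal any dates with no recorded transactions; the date range below comes from the summary above.
#### Sketch: check for dates with zero transactions (illustrative)
all_dates <- data.table(DATE = seq(as.Date("2018-07-01"), as.Date("2019-06-30"), by = "day"))
missing_dates <- all_dates[!all_dates$DATE %in% count_date$DATE, ]
missing_dates  # dates in the period with no transactions recorded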