Required packages

# R Packages used for this report
library(data.table)
library(readr)
library(dplyr)
library(tidyr)
library(outliers)
library(ggplot2)
library(lattice)

Executive Summary

This project analysed potato chip sales to derive metrics of interest to the client, such as:
- Who spends the most on chips (total sales), with customers described by lifestage and by how premium their general purchasing behaviour is?
- How many customers are in each segment?
- How many chips are bought per customer by segment?
- What's the average chip price by customer segment?
- The customer's total spend over the period and total spend for each transaction, to understand what proportion of their grocery spend is on chips.
- The proportion of customers in each customer segment overall, to compare against the mix of customers who purchase chips.

The process includes the following data preprocessing steps:
1. Read/import the two Quantium datasets into RStudio using the appropriate functions and packages.
2. Merge the two datasets into one and select the variables needed for the analysis.
3. Describe each variable used in the data.
4. Understand the data: examine its structure and data types, discover the unique values of factor variables, label and order levels where necessary, and check classes and levels.
5. Tidy and manipulate the data, for example by mutating new calculated variables and reshaping the data from untidy to tidy format for easier analysis (a short sketch follows below).
6. Scan the data for missing values, errors, or special values, and choose a suitable method to deal with them.
7. Scan for outliers using summary statistics and appropriate plots, explain the methodology, and transform the data for better insight.

Finally, an overall conclusion is provided.
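
As an illustration of step 5 and of the average-price question above, a price-per-packet variable can be mutated from TOT_SALES and PROD_QTY and then summarised by segment. This is only a minimal sketch, assuming the merged transaction/customer dataset built later in the report; the object name chipsData is hypothetical.

### Hypothetical sketch of step 5: price per packet and average price by segment
### chipsData stands in for the merged transaction/customer data built later
chipsData %>%
  mutate(PRICE_PER_UNIT = TOT_SALES / PROD_QTY) %>%
  group_by(LIFESTAGE, PREMIUM_CUSTOMER) %>%
  summarise(AVG_PRICE = mean(PRICE_PER_UNIT))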

### Getting the Files 
filePath <- paste0(getwd(), "/")

transactionData <- fread(paste0(filePath,"QVI_transaction_data.csv"))
customerData <- fread(paste0(filePath,"QVI_purchase_behaviour.csv"))

### Exploratory Data Analysis
dim(transactionData)
## [1] 264836      8
dim(customerData)
## [1] 72637     3
names(transactionData)
## [1] "DATE"           "STORE_NBR"      "LYLTY_CARD_NBR" "TXN_ID"        
## [5] "PROD_NBR"       "PROD_NAME"      "PROD_QTY"       "TOT_SALES"
names(customerData)
## [1] "LYLTY_CARD_NBR"   "LIFESTAGE"        "PREMIUM_CUSTOMER"
head(transactionData)
head(customerData)
### Examine Data type
str(transactionData)
## Classes 'data.table' and 'data.frame':   264836 obs. of  8 variables:
##  $ DATE          : int  43390 43599 43605 43329 43330 43604 43601 43601 43332 43330 ...
##  $ STORE_NBR     : int  1 1 1 2 2 4 4 4 5 7 ...
##  $ LYLTY_CARD_NBR: int  1000 1307 1343 2373 2426 4074 4149 4196 5026 7150 ...
##  $ TXN_ID        : int  1 348 383 974 1038 2982 3333 3539 4525 6900 ...
##  $ PROD_NBR      : int  5 66 61 69 108 57 16 24 42 52 ...
##  $ PROD_NAME     : chr  "Natural Chip        Compny SeaSalt175g" "CCs Nacho Cheese    175g" "Smiths Crinkle Cut  Chips Chicken 170g" "Smiths Chip Thinly  S/Cream&Onion 175g" ...
##  $ PROD_QTY      : int  2 3 2 5 3 1 1 1 1 2 ...
##  $ TOT_SALES     : num  6 6.3 2.9 15 13.8 5.1 5.7 3.6 3.9 7.2 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(customerData)
## Classes 'data.table' and 'data.frame':   72637 obs. of  3 variables:
##  $ LYLTY_CARD_NBR  : int  1000 1002 1003 1004 1005 1007 1009 1010 1011 1012 ...
##  $ LIFESTAGE       : chr  "YOUNG SINGLES/COUPLES" "YOUNG SINGLES/COUPLES" "YOUNG FAMILIES" "OLDER SINGLES/COUPLES" ...
##  $ PREMIUM_CUSTOMER: chr  "Premium" "Mainstream" "Budget" "Mainstream" ...
##  - attr(*, ".internal.selfref")=<externalptr>
### Convert Date column to a date format
transactionData$DATE <- as.Date(transactionData$DATE, origin = "1899-12-30")
transactionData$DATE %>% head()
## [1] "2018-10-17" "2019-05-14" "2019-05-20" "2018-08-17" "2018-08-18"
## [6] "2019-05-19"

The transaction dataset consists of 264836 observations and 8 variables.
The customer dataset consists of 72637 observations and 3 variables.

Variables in Transaction Dataset:
1. DATE: date of the transaction
2. STORE_NBR: store id
3. LYLTY_CARD_NBR: customer loyalty card number
4. TXN_ID: transaction number
5. PROD_NBR: product number/ID
6. PROD_NAME: product name
7. PROD_QTY: product quantity
8. TOT_SALES: total sales

Variables in Customer Dataset:
1. LYLTY_CARD_NBR: customer loyalty card number
2. LIFESTAGE: customer life stage segment (e.g. young singles/couples, young families, older singles/couples)
3. PREMIUM_CUSTOMER: how premium the customer's general purchasing behaviour is (Budget, Mainstream, Premium)

The two datasets are merged into one using the merge() function, joining on LYLTY_CARD_NBR. We are interested in analysing total chip sales and the customer backgrounds behind them.
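
A minimal sketch of that join, keeping every transaction and attaching its customer attributes (the merged object name data is illustrative):

### Merge transaction data to customer data on loyalty card number
data <- merge(transactionData, customerData, by = "LYLTY_CARD_NBR", all.x = TRUE)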

Data cleansing: check for missing values, outliers, whitespace, and special characters.
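
The PROD_NAME values shown by str() contain runs of repeated spaces (e.g. "Natural Chip        Compny"). One possible whitespace clean-up, sketched here with base gsub():

### Collapse repeated whitespace in product names
transactionData$PROD_NAME <- gsub("\\s+", " ", transactionData$PROD_NAME)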

### Examine PROD_NAME
summary(transactionData$PROD_NAME)
##    Length     Class      Mode 
##    264836 character character

It looks like we are dealing with potato chips, but how can we check that these are all chips? We can do some basic text analysis by summarising the individual words in the product names.

#### Examine the words in PROD_NAME to see if there are any incorrect entries
#### such as products that are not chips
productWords <- data.table(unlist(strsplit(unique(transactionData$PROD_NAME), " ")))
setnames(productWords, 'words')

As we are only interested in words that will tell us if the product is chips or not, let’s remove all words with digits and special characters such as ‘&’ from our set of product words.

### Remove words containing digits or special characters
productWords <- productWords[grepl("^[[:alpha:]]+$", words)]


### Count the number of times each word appears and sort from highest to lowest
productWords %>% count(words) %>% arrange(desc(n))

There are salsa products in the dataset but we are only interested in the chips category, so let’s remove these.

### Remove Salsa products in the dataset
dim(transactionData) # Dimension before removing Salsa
## [1] 264836      8
transactionData <- filter(transactionData, !grepl('Salsa|SALSA', PROD_NAME))
dim(transactionData) # Dimension after removing Salsa
## [1] 246742      8
### Summarise data
summary(transactionData)
##       DATE              STORE_NBR     LYLTY_CARD_NBR        TXN_ID       
##  Min.   :2018-07-01   Min.   :  1.0   Min.   :   1000   Min.   :      1  
##  1st Qu.:2018-09-30   1st Qu.: 70.0   1st Qu.:  70015   1st Qu.:  67569  
##  Median :2018-12-30   Median :130.0   Median : 130367   Median : 135183  
##  Mean   :2018-12-30   Mean   :135.1   Mean   : 135531   Mean   : 135131  
##  3rd Qu.:2019-03-31   3rd Qu.:203.0   3rd Qu.: 203084   3rd Qu.: 202654  
##  Max.   :2019-06-30   Max.   :272.0   Max.   :2373711   Max.   :2415841  
##     PROD_NBR       PROD_NAME            PROD_QTY         TOT_SALES      
##  Min.   :  1.00   Length:246742      Min.   :  1.000   Min.   :  1.700  
##  1st Qu.: 26.00   Class :character   1st Qu.:  2.000   1st Qu.:  5.800  
##  Median : 53.00   Mode  :character   Median :  2.000   Median :  7.400  
##  Mean   : 56.35                      Mean   :  1.908   Mean   :  7.321  
##  3rd Qu.: 87.00                      3rd Qu.:  2.000   3rd Qu.:  8.800  
##  Max.   :114.00                      Max.   :200.000   Max.   :650.000
### Check for any missing value
sum(is.na(transactionData))
## [1] 0
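
For a column-by-column view, the same check can be broken down per variable; a minimal sketch:

### Count missing values per column
colSums(is.na(transactionData))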

Let’s further check the dataset to find any outliers.
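
In addition to the boxplots below, the outliers package loaded at the top offers a formal screening check. A minimal sketch using Grubbs' test, which asks whether the single most extreme PROD_QTY value is an outlier (the test assumes rough normality, so treat it only as a screening aid):

### Grubbs' test for a single outlier in PROD_QTY (screening only)
library(outliers)   # already loaded above
grubbs.test(transactionData$PROD_QTY)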

### Boxplot of PROD_QTY to find outliers
boxplot(transactionData$PROD_QTY)

### Filter transactions with 200 or more packets
transactionData %>% filter(PROD_QTY >= 200)

We can see that there are outliers in PROD_QTY. Let's remove these transactions and then look more closely at the customer behind them.

### Exclude outliers from the dataset
transactionData <- transactionData[!(transactionData$PROD_QTY>=200),]
### Boxplot of LYLTY_CARD_NBR to find outliers
boxplot(transactionData$LYLTY_CARD_NBR)

### Check the purchase activity of the customer with loyalty card number 226000
transactionData %>% filter(LYLTY_CARD_NBR == 226000)

The 200-packet purchases were made on loyalty card number 226000, so let's remove this customer's transactions from the dataset.

### Exclude this customer from the dataset
transactionData <- transactionData[transactionData$LYLTY_CARD_NBR != 226000, ]

Now let's count the number of transactions by date and plot them over time.

### Count the number of transactions by date
count_date <- transactionData %>% count(DATE)
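
Before plotting, it is worth checking whether every calendar day in the range appears in count_date, since days with no rows would otherwise be skipped silently by the line plot; a minimal sketch:

### Check for calendar days with no recorded transactions
all_days <- seq(min(count_date$DATE), max(count_date$DATE), by = "day")
all_days[!all_days %in% count_date$DATE]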

#### Setting plot themes to format graphs
theme_set(theme_bw())
theme_update(plot.title = element_text(hjust = 0.5))

#### Plot transactions over time
ggplot(count_date, aes(x = DATE, y = n)) +
  geom_line() +
  labs(x = "Day", y = "Number of transactions", title = "Transactions over time") +
  scale_x_date(breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

It looks like there is a spike in transactions around December 2018. Let's look at that month in more detail.

### Filter to December sales only
december <- seq(from = as.Date("2018-12-01"), to = as.Date("2018-12-31"), by = "day")
count_sales_december <- filter(count_date, DATE %in% december)

#### Plot transactions for December only
ggplot(count_sales_december, aes(x = DATE, y = n)) +
  geom_line() +
  labs(x = "Day", y = "Number of transactions", title = "Transactions in December") +
  scale_x_date(breaks = "1 day") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))