MARKET BASKET ANALYSIS

1. Defining the Question

a) Specifying the Question

What are the most prominent variables from the sales data?

b) Defining the Metric for Success

Correct identification of the most important variables from the provided sales data.

c) Understanding the Context

Carrefour Kenya was undertaking a project that would inform the marketing department on the presence of fraudulent sales. They provided a data set from which anomalies were to be detected, if they existed.

Correct identification of the most important variables from the data set is important as this can reduce the amount of factors to be considered during decision making and planning, to only the most relevant factors.

d) Recording the Experimental Design

  1. Data sourcing/loading
  2. Data Understanding
  3. Data Relevance
  4. External Dataset Validation
  5. Data Preparation
  6. Univariate Analysis
  7. Bivariate Analysis
  8. Multivariate Analysis
  9. Part 1: PCA
  10. Part 2: Feature Selection
  11. Implementing the solution
  12. Challenging the solution
  13. Conclusion
  14. Follow up questions

e) Data Relevance

The data provided should have the correct information/ records, for the analysis results to be relevant.

2. Data Understanding

a) Importing Required Libraries

# Suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# Libraries
library(data.table) # Data Table library
library (plyr)
library(psych)
library(coda)
library(base)       # Date-time conversion
library(ggplot2)    # Plotting Library
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(moments)    # Measures of distribution 
library(ggcorrplot) # Correlation plotting 
library(mice)       # Missing values 
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(devtools)
## Loading required package: usethis
library(superml)    # Label encoding
## Loading required package: R6
library(R6)
library(stats)      # PCA
library(ggbiplot)   # PCA plot
## Loading required package: scales
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
## Loading required package: grid
library(factoextra, quietly = TRUE) # PCA visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(corrplot)   # Correlation plot
## corrplot 0.92 loaded
library(clustvarsel)
## Loading required package: mclust
## Package 'mclust' version 5.4.10
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:psych':
## 
##     sim
## Package 'clustvarsel' version 2.3.4
## Type 'citation("clustvarsel")' for citing this R package in publications.
library(mclust, quiet = T)
library(FSelectorRcpp) # Feature ranking
library(caret)
## Loading required package: lattice
library(lattice)
library(viridis)
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
## 
##     viridis_pal
library(viridisLite)
library(hrbrthemes, quietly = T)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

b) Reading the Data

# Loading data set
part12 <- read.csv('dataset_1_2.csv')

c) Checking the Data

i) Number of Records

# Data set records summary
records.summary <- function(data){
  cat('Number of rows = ', nrow(data), ' and columns = ', ncol(data), '\n')
  
}
# Data set summary

# Part 1 and 2
print('Dataset 1 and 2')
## [1] "Dataset 1 and 2"
records.summary(part12)
## Number of rows =  1000  and columns =  16

ii) Top Dataset Preview

# Top data set preview
head(part12)
##    Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
## 3 631-41-3108      A        Normal   Male     Home and lifestyle      46.33
## 4 123-19-1176      A        Member   Male      Health and beauty      58.22
## 5 373-73-7910      A        Normal   Male      Sports and travel      86.31
## 6 699-14-3026      C        Normal   Male Electronic accessories      85.39
##   Quantity     Tax      Date  Time     Payment   cogs gross.margin.percentage
## 1        7 26.1415  1/5/2019 13:08     Ewallet 522.83                4.761905
## 2        5  3.8200  3/8/2019 10:29        Cash  76.40                4.761905
## 3        7 16.2155  3/3/2019 13:23 Credit card 324.31                4.761905
## 4        8 23.2880 1/27/2019 20:33     Ewallet 465.76                4.761905
## 5        7 30.2085  2/8/2019 10:37     Ewallet 604.17                4.761905
## 6        7 29.8865 3/25/2019 18:30     Ewallet 597.73                4.761905
##   gross.income Rating    Total
## 1      26.1415    9.1 548.9715
## 2       3.8200    9.6  80.2200
## 3      16.2155    7.4 340.5255
## 4      23.2880    8.4 489.0480
## 5      30.2085    5.3 634.3785
## 6      29.8865    4.1 627.6165

iii) Bottom Dataset Preview

# Bottom data set preview
tail(part12)
##       Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 995  652-49-6720      C        Member Female Electronic accessories      60.95
## 996  233-67-5758      C        Normal   Male      Health and beauty      40.35
## 997  303-96-2227      B        Normal Female     Home and lifestyle      97.38
## 998  727-02-1313      A        Member   Male     Food and beverages      31.84
## 999  347-56-2442      A        Normal   Male     Home and lifestyle      65.82
## 1000 849-09-3807      A        Member Female    Fashion accessories      88.34
##      Quantity     Tax      Date  Time Payment   cogs gross.margin.percentage
## 995         1  3.0475 2/18/2019 11:40 Ewallet  60.95                4.761905
## 996         1  2.0175 1/29/2019 13:46 Ewallet  40.35                4.761905
## 997        10 48.6900  3/2/2019 17:16 Ewallet 973.80                4.761905
## 998         1  1.5920  2/9/2019 13:22    Cash  31.84                4.761905
## 999         1  3.2910 2/22/2019 15:33    Cash  65.82                4.761905
## 1000        7 30.9190 2/18/2019 13:28    Cash 618.38                4.761905
##      gross.income Rating     Total
## 995        3.0475    5.9   63.9975
## 996        2.0175    6.2   42.3675
## 997       48.6900    4.4 1022.4900
## 998        1.5920    7.7   33.4320
## 999        3.2910    4.1   69.1110
## 1000      30.9190    6.6  649.2990

The invoice ID will be dropped as it doesn’t provide the required information, while the date column will be changed to a date data type.

d) Checking Datatypes

# Data set description
str(part12)
## 'data.frame':    1000 obs. of  16 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "A" "C" "A" "A" ...
##  $ Customer.type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...

The date columns will be changed to date format. The date and time columns will be combined for convenient data type conversion.

# Combining the date and time columns
part12$Date <- paste(part12$Date, part12$Time)
# Dropping the time column
index <- grep('Time', colnames(part12))
part12 <- part12[,-index]
colnames(part12)
##  [1] "Invoice.ID"              "Branch"                 
##  [3] "Customer.type"           "Gender"                 
##  [5] "Product.line"            "Unit.price"             
##  [7] "Quantity"                "Tax"                    
##  [9] "Date"                    "Payment"                
## [11] "cogs"                    "gross.margin.percentage"
## [13] "gross.income"            "Rating"                 
## [15] "Total"
# Checking the combined column
head(part12$Date)
## [1] "1/5/2019 13:08"  "3/8/2019 10:29"  "3/3/2019 13:23"  "1/27/2019 20:33"
## [5] "2/8/2019 10:37"  "3/25/2019 18:30"

The date and time columns have been converted.

# Unique dates to determine the required date and time format
head(unique(part12$Date), 10)
##  [1] "1/5/2019 13:08"  "3/8/2019 10:29"  "3/3/2019 13:23"  "1/27/2019 20:33"
##  [5] "2/8/2019 10:37"  "3/25/2019 18:30" "2/25/2019 14:36" "2/24/2019 11:38"
##  [9] "1/10/2019 17:15" "2/20/2019 13:27"

The format is ‘%m/%d/%Y’.

# Converting the date column to time data type, time zone = East African Time
part12$Date <- strptime(part12$Date, format = '%m/%d/%Y %H:%M')
# Preview
class(part12$Date)
## [1] "POSIXlt" "POSIXt"
# Column preview
head(part12$Date, 10)
##  [1] "2019-01-05 13:08:00 EAT" "2019-03-08 10:29:00 EAT"
##  [3] "2019-03-03 13:23:00 EAT" "2019-01-27 20:33:00 EAT"
##  [5] "2019-02-08 10:37:00 EAT" "2019-03-25 18:30:00 EAT"
##  [7] "2019-02-25 14:36:00 EAT" "2019-02-24 11:38:00 EAT"
##  [9] "2019-01-10 17:15:00 EAT" "2019-02-20 13:27:00 EAT"

The date column has been converted to the required datetime type.

Converting the character columns to factors.

# Get character columns
char <- unlist(lapply(part12, is.character))
char
##              Invoice.ID                  Branch           Customer.type 
##                    TRUE                    TRUE                    TRUE 
##                  Gender            Product.line              Unit.price 
##                    TRUE                    TRUE                   FALSE 
##                Quantity                     Tax                    Date 
##                   FALSE                   FALSE                   FALSE 
##                 Payment                    cogs gross.margin.percentage 
##                    TRUE                   FALSE                   FALSE 
##            gross.income                  Rating                   Total 
##                   FALSE                   FALSE                   FALSE
# Converting categorical columns to factor types
part12[, char]<- lapply(part12[, char], factor)
# Checking changes
head(part12[, char])
##    Invoice.ID Branch Customer.type Gender           Product.line     Payment
## 1 750-67-8428      A        Member Female      Health and beauty     Ewallet
## 2 226-31-3081      C        Normal Female Electronic accessories        Cash
## 3 631-41-3108      A        Normal   Male     Home and lifestyle Credit card
## 4 123-19-1176      A        Member   Male      Health and beauty     Ewallet
## 5 373-73-7910      A        Normal   Male      Sports and travel     Ewallet
## 6 699-14-3026      C        Normal   Male Electronic accessories     Ewallet

The columns have been successfully converted to factors.

3. External Data Set Validation

The data sets have been provided by the client, therefore, external data set validation will not be used.

4. Data Preperation

a) Validation

# Data set columns
colnames(part12)
##  [1] "Invoice.ID"              "Branch"                 
##  [3] "Customer.type"           "Gender"                 
##  [5] "Product.line"            "Unit.price"             
##  [7] "Quantity"                "Tax"                    
##  [9] "Date"                    "Payment"                
## [11] "cogs"                    "gross.margin.percentage"
## [13] "gross.income"            "Rating"                 
## [15] "Total"

Checking the invoice ID column.

# Number of unique values vs number of rows.
cat('Unique values =', length(part12$Invoice.ID), 'and  number of rows =',
    nrow(part12))
## Unique values = 1000 and  number of rows = 1000

The values are all unique, therefore, the column will be dropped.

# Dropping the invoice ID column
part12 <- part12[,-1]
# Confirming changes
colnames(part12)
##  [1] "Branch"                  "Customer.type"          
##  [3] "Gender"                  "Product.line"           
##  [5] "Unit.price"              "Quantity"               
##  [7] "Tax"                     "Date"                   
##  [9] "Payment"                 "cogs"                   
## [11] "gross.margin.percentage" "gross.income"           
## [13] "Rating"                  "Total"

b) Consistency

# Missing values
colSums(is.na(part12))
##                  Branch           Customer.type                  Gender 
##                       0                       0                       0 
##            Product.line              Unit.price                Quantity 
##                       0                       0                       0 
##                     Tax                    Date                 Payment 
##                       0                       0                       0 
##                    cogs gross.margin.percentage            gross.income 
##                       0                       0                       0 
##                  Rating                   Total 
##                       0                       0

There are no missing values in the first data set.

c) Completeness

# Checking for duplicates
sum(duplicated(part12))
## [1] 0

There are no duplicates in the first data set.

d) Uniformity

Checking the uniformity of column names and values.

# Column names
colnames(part12)
##  [1] "Branch"                  "Customer.type"          
##  [3] "Gender"                  "Product.line"           
##  [5] "Unit.price"              "Quantity"               
##  [7] "Tax"                     "Date"                   
##  [9] "Payment"                 "cogs"                   
## [11] "gross.margin.percentage" "gross.income"           
## [13] "Rating"                  "Total"

The column names are not uniform, therefore, they will be renamed.

# Renaming columns
colnames(part12)[c(10, 11, 12)] <- c('Cogs', 'Gross.margin.percentage',
                                     'Gross.income')
# Checking changes
colnames(part12)
##  [1] "Branch"                  "Customer.type"          
##  [3] "Gender"                  "Product.line"           
##  [5] "Unit.price"              "Quantity"               
##  [7] "Tax"                     "Date"                   
##  [9] "Payment"                 "Cogs"                   
## [11] "Gross.margin.percentage" "Gross.income"           
## [13] "Rating"                  "Total"

The column names have been renamed successfully.

e) Outliers

# Selecting numerical columns
num <- unlist(lapply(part12, is.numeric))
# Numeric data frame
num_df <- part12[, num]
# Number of numeric columns
length(names(num_df))
## [1] 8
## [1] 10
# Box plots
par(mfrow = c(5,2), mar=c(1,5,2,2))
for (i in 1:length(num_df)){
  boxplot(num_df[ , i], main = paste('Boxplot of',  names(num_df)[i]), 
          ylab = 'Count')
}

The Tax, Cogs, Gross margin percentage, Gross income and the Total column have outliers.

# Outliers summary function
outliers.summary <- function(data){
  total <- c(0) # Placeholder for total number of outliers
  cat('Precentage outliers per column\n')
  cat(rep('-', 16), '\n')
  for (col in names(data)){
    # IQR
    Q1 <- quantile(data[,col], probs=.25)
    Q3 <- quantile(data[,col], probs=.75)
    IQR = Q3-Q1
    # Limits
    upper.limit = Q3 + (IQR*1.5)
    lower.limit = Q1 - (IQR*1.5)
    # Outliers per column
    out <- sum(data[,col] > upper.limit | data[,col] < lower.limit)
    total <- total + out
    perc <- (out/nrow(data))*100
    print(paste0(col, ' = ', round(perc,2), '%'))
    cat('\n')
  }
  # Total percentage of outliers
  cat('\nTotal precentage of outliers\n')
  cat(rep('-', 16), '\n')
  cells <- nrow(data)*ncol(data)
  perc <- (total/cells)*100
  print(paste0('Total = ', round(perc,2), '%'))
  
}
# Outliers summary
outliers.summary(num_df)
## Precentage outliers per column
## - - - - - - - - - - - - - - - - 
## [1] "Unit.price = 0%"
## 
## [1] "Quantity = 0%"
## 
## [1] "Tax = 0.9%"
## 
## [1] "Cogs = 0.9%"
## 
## [1] "Gross.margin.percentage = 0%"
## 
## [1] "Gross.income = 0.9%"
## 
## [1] "Rating = 0%"
## 
## [1] "Total = 0.9%"
## 
## 
## Total precentage of outliers
## - - - - - - - - - - - - - - - - 
## [1] "Total = 0.45%"

The outliers are minimal, therefore they will removed in order to analyze the general trend.

# Get column indices
index <- match(colnames(num_df), colnames(part12))
index
## [1]  5  6  7 10 11 12 13 14
# Removing outliers

# Function to obtain logical vector of outliers
outliers.vector <- function(data) {
  # IQR
  Q1 <- quantile(data, probs=.25)
  Q3 <- quantile(data, probs=.75)
  IQR <- Q3-Q1
  # Limits
  upper.limit = Q3 + (IQR*1.5)
  lower.limit = Q1 - (IQR*1.5)
  # Logical vector
  data > upper.limit | data < lower.limit
}

remove.outliers <- function(data, cols = names(data)) {
  for (col in cols) {
    data <- data[!outliers.vector(data[[col]]),]
  }
  data
} 

# Function that removes rows with outliers
remove.func <- function(data, indices){
  for(col in indices){
    no <- outliers.vector(data[,col])
    data <- data[!no,]
    
  }
  data
}
# Removing outliers
# Copy of the dataset.
part12.no <- data.frame(part12)

# Function call
part12.no <- remove.func(part12.no, index)
# Number of removed rows
perc <- ((nrow(part12) - nrow(part12.no))/nrow(part12))*100
print(paste0('Removed rows = ', round((perc),2), '%'))
## [1] "Removed rows = 0.9%"

f) Feature Engineering

Splitting the time column

# Copy of the data set
part12.dates <- data.frame(part12.no)

# Columns required  = Year, month, day, hour and minute
part12.dates$Year <- format(part12.dates$Date, format = '%Y')
# Checking for unique years
unique(part12.dates$Year)
## [1] "2019"

The data only contains records for the year 2019. This column will be dropped as it provides no further information.

# Dropping the year column
part12.dates <- part12.dates[,-ncol(part12.dates)]
# Creating the remaining date and time columns
part12.dates$Month <- format(part12.dates$Date, format = '%m')
part12.dates$Day <- format(part12.dates$Date, format = '%d')
part12.dates$Hour <- format(part12.dates$Date, format = '%H')
part12.dates$Minute <- format(part12.dates$Date, format = '%M')

# Dropping the date column
part12.dates <- subset(part12.dates, select = -c(Date) )
# Checking changes
head(part12.dates)
##   Branch Customer.type Gender           Product.line Unit.price Quantity
## 1      A        Member Female      Health and beauty      74.69        7
## 2      C        Normal Female Electronic accessories      15.28        5
## 3      A        Normal   Male     Home and lifestyle      46.33        7
## 4      A        Member   Male      Health and beauty      58.22        8
## 5      A        Normal   Male      Sports and travel      86.31        7
## 6      C        Normal   Male Electronic accessories      85.39        7
##       Tax     Payment   Cogs Gross.margin.percentage Gross.income Rating
## 1 26.1415     Ewallet 522.83                4.761905      26.1415    9.1
## 2  3.8200        Cash  76.40                4.761905       3.8200    9.6
## 3 16.2155 Credit card 324.31                4.761905      16.2155    7.4
## 4 23.2880     Ewallet 465.76                4.761905      23.2880    8.4
## 5 30.2085     Ewallet 604.17                4.761905      30.2085    5.3
## 6 29.8865     Ewallet 597.73                4.761905      29.8865    4.1
##      Total Month Day Hour Minute
## 1 548.9715    01  05   13     08
## 2  80.2200    03  08   10     29
## 3 340.5255    03  03   13     23
## 4 489.0480    01  27   20     33
## 5 634.3785    02  08   10     37
## 6 627.6165    03  25   18     30

The date column has been dropped.

# Converting the month, day, hour and minutes to numeric format.
part12.dates[,(ncol(part12.dates) - 3): ncol(part12.dates)] <- sapply(
  part12.dates[,(ncol(part12.dates) - 3): ncol(part12.dates)], as.numeric)
#
# Preview changes
head(part12.dates[,(ncol(part12.dates) - 3): ncol(part12.dates)])
##   Month Day Hour Minute
## 1     1   5   13      8
## 2     3   8   10     29
## 3     3   3   13     23
## 4     1  27   20     33
## 5     2   8   10     37
## 6     3  25   18     30

5. Descriptive Analysis

a) Univariate Analysis

i) Categorical

# Count plot and normal bar plot function.
bar.plt <- function(data, col1, title, legend, colors, method, col2 = NULL){
  if (method == 'count'){
    ggplot(data, aes(x = {{col1}}, fill = {{col1}})) + geom_bar() +
    ggtitle(paste(title, 'Frequency Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)}
  else if (method == 'bar'){
    ggplot(data, aes(x = {{col1}}, y = {{col2}}, fill = {{col1}})) + geom_bar(stat = 'identity') + ggtitle(paste(title, 'Bar Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)}
  
}
head(part12)
##   Branch Customer.type Gender           Product.line Unit.price Quantity
## 1      A        Member Female      Health and beauty      74.69        7
## 2      C        Normal Female Electronic accessories      15.28        5
## 3      A        Normal   Male     Home and lifestyle      46.33        7
## 4      A        Member   Male      Health and beauty      58.22        8
## 5      A        Normal   Male      Sports and travel      86.31        7
## 6      C        Normal   Male Electronic accessories      85.39        7
##       Tax                Date     Payment   Cogs Gross.margin.percentage
## 1 26.1415 2019-01-05 13:08:00     Ewallet 522.83                4.761905
## 2  3.8200 2019-03-08 10:29:00        Cash  76.40                4.761905
## 3 16.2155 2019-03-03 13:23:00 Credit card 324.31                4.761905
## 4 23.2880 2019-01-27 20:33:00     Ewallet 465.76                4.761905
## 5 30.2085 2019-02-08 10:37:00     Ewallet 604.17                4.761905
## 6 29.8865 2019-03-25 18:30:00     Ewallet 597.73                4.761905
##   Gross.income Rating    Total
## 1      26.1415    9.1 548.9715
## 2       3.8200    9.6  80.2200
## 3      16.2155    7.4 340.5255
## 4      23.2880    8.4 489.0480
## 5      30.2085    5.3 634.3785
## 6      29.8865    4.1 627.6165

Branch

# Unique
not.cat <- unlist(lapply(part12.dates, is.numeric))
catdf <- part12.dates[,!not.cat]
catdf$Month <- cbind(part12.dates$Month)
unique(catdf$Branch)
## [1] A C B
## Levels: A B C
# Bar plot
bar.plt (catdf, Branch, title = 'Branch', method = 'count', legend = 
           'Branch', colors = c('A' = '#0276AB', 
                                   'B' = '#02557A', 
                                   'C' = '#013349'), col2 = NULL) 

Branch A has the highest frequency, and C the lowest, though the difference is minimal.

Customer Type

# Unique
unique(catdf$Customer.type)
## [1] Member Normal
## Levels: Member Normal
# Bar plot
bar.plt (catdf, Customer.type, title = 'Customer Type', method = 'count',
         legend = 'Customer Type', colors = c('Member' = '#0276AB', 
                                   'Normal' = '#013349'), col2 = NULL) 

The frequency of customers who are members is slighly higher than that of normal customers.

Gender

# Unique
unique(catdf$Gender)
## [1] Female Male  
## Levels: Female Male
# Bar plot
bar.plt (catdf, Gender, title = 'Gender', method = 'count', legend = 
           'Gender', colors = c('Male' = '#0276AB',
                                   'Female' = '#013349'), col2 = NULL) 

There is a slightly higher proportion of male customers, than female.

Product Line

# Unique
unique(catdf$Product.line)
## [1] Health and beauty      Electronic accessories Home and lifestyle    
## [4] Sports and travel      Food and beverages     Fashion accessories   
## 6 Levels: Electronic accessories Fashion accessories ... Sports and travel
#Br plot
ggplot(catdf, aes(x = Product.line, fill = Product.line)) + geom_bar() + ggtitle(paste('Product Line', 'Frequency Plot')) + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_fill_manual('Product Line', values = 
                    c('Sports and travel' = '#0276AB', 
                      'Home and lifestyle' = '#026592', 
                      'Health and beauty' = '#02557A',
                      'Food and beverages' = '#014462',
                      'Fashion accessories' = '#013349',
                      'Electronic accessories' = '#012231')) + 
  coord_flip()

Food and beverages, and fashion accessories have the highest frequency.

Payment

# Unique
unique(catdf$Payment)
## [1] Ewallet     Cash        Credit card
## Levels: Cash Credit card Ewallet
# Bar plot
bar.plt (catdf, Payment, title = 'Payment', method = 'count', legend = 
           'Payment', colors = c('Cash' = '#0276AB',
                                   'Credit card' = '#02557A',
                                   'Ewallet' = '#012231'), col2 = NULL) 

Most payment are done in cash or using e-wallets.

Month

# Unique
catdf$Month <- as.factor(catdf$Month)
unique(catdf$Month)
## [1] 1 3 2
## Levels: 1 2 3
# Bar plot
bar.plt (catdf, Month, title = 'Month', method = 'count', legend = 
           'Month', colors = c('1' = '#0276AB',
                                   '2' = '#02557A',
                                   '3' = '#012231'), col2 = NULL) 

January and February had the highest frequency, and February had the lowest.

Bivariate analysis will be done to get a clearer picture of factors that affect total sales.

ii) Numerical

# Mode function
mode <- function(col, data) {
   unique.value <- unique(data[, col])
   unique.value[which.max(tabulate(match(data[,col], unique.value)))]
}

central.tendency <- function(col, data){
  cat('Measures of Central Tendency \n')
  # Mean
  cat('Mean = ', mean(data[, col]), '\n')
  # Median
  cat('Median = ', median(data[,col]), '\n')
  # Mode
  cat('Mode = ', mode(col, data), '\n')
  
}

dispersion <- function(col, data){
  cat('\nMeasures of Dispersion \n')
  # Range
  cat('Range = ', min(data[ ,col]), '-', max(data[ ,col]), '\n')
  # IQR
  cat('IQR = ', IQR(data[ ,col]), '\n')
  # Variance
  cat('Variance = ', var(data[ ,col]), '\n')
  # Standard Deviation
  cat('Standard Deviation = ', sd(data[ ,col]), '\n')
  # Skewness
  cat('Skewness = ', skewness(data[ ,col]), '\n')
  # Kurtosis
  cat('Kurtosis = ', kurtosis(data[ ,col]), '\n')
}
# Numerical data
numc <- unlist(lapply(part12.dates, is.numeric))
numcdf <- part12.dates[,numc]
numcdf <- numcdf[, !colnames(numcdf) %in% c('Month')]

Unit Price

# Measures of central tendency
central.tendency(names(numcdf)[1], numcdf)
## Measures of Central Tendency 
## Mean =  55.29248 
## Median =  54.84 
## Mode =  83.77
# Measures of dispersion
dispersion(names(numcdf)[1], numcdf)
## 
## Measures of Dispersion 
## Range =  10.08 - 99.96 
## IQR =  44.82 
## Variance =  692.3013 
## Standard Deviation =  26.31162 
## Skewness =  0.01401477 
## Kurtosis =  1.789507
  • The average unit price is 55.
  • The deviation of unit prices is large, due to different pricing for each product category.
  • The column has a normal distribution, and kurtosis is <3, therefore mesokurtic.

Quantity

# Measures of central tendency
central.tendency(names(numcdf)[2], numcdf)
## Measures of Central Tendency 
## Mean =  5.469223 
## Median =  5 
## Mode =  1
# Measures of dispersion
dispersion(names(numcdf)[2], numcdf)
## 
## Measures of Dispersion 
## Range =  1 - 10 
## IQR =  5 
## Variance =  8.439203 
## Standard Deviation =  2.905031 
## Skewness =  0.02184004 
## Kurtosis =  1.794938
  • The average quantity is 5.5.
  • The column has a normal distribution, and kurtosis is <3, therefore mesokurtic.

Tax

# Measures of central tendency
central.tendency(names(numcdf)[3], numcdf)
## Measures of Central Tendency 
## Mean =  15.07642 
## Median =  12.036 
## Mode =  39.48
# Measures of dispersion
dispersion(names(numcdf)[3], numcdf)
## 
## Measures of Dispersion 
## Range =  0.5085 - 45.325 
## IQR =  16.3095 
## Variance =  128.1312 
## Standard Deviation =  11.3195 
## Skewness =  0.8426702 
## Kurtosis =  2.762815
  • The average quantity is 15
  • The column has a normal distribution, and kurtosis is <3, therefore mesokurtic.

Cogs

# Measures of central tendency
central.tendency(names(numcdf)[4], numcdf)
## Measures of Central Tendency 
## Mean =  301.5283 
## Median =  240.72 
## Mode =  789.6
# Measures of dispersion
dispersion(names(numcdf)[4], numcdf)
## 
## Measures of Dispersion 
## Range =  10.17 - 906.5 
## IQR =  326.19 
## Variance =  51252.47 
## Standard Deviation =  226.3901 
## Skewness =  0.8426702 
## Kurtosis =  2.762815
  • The average cogs is 301.53.
  • The column has a normal distribution, and kurtosis is <3, therefore mesokurtic.

Gross Margin Percentage

# Measures of central tendency
central.tendency(names(numcdf)[5], numcdf)
## Measures of Central Tendency 
## Mean =  4.761905 
## Median =  4.761905 
## Mode =  4.761905
# Measures of dispersion
dispersion(names(numcdf)[5], numcdf)
## 
## Measures of Dispersion 
## Range =  4.761905 - 4.761905 
## IQR =  0 
## Variance =  0 
## Standard Deviation =  0 
## Skewness =  NaN 
## Kurtosis =  NaN

The average gross margin income is 4.8. From the measures of dispersion is can be seen that the column only has one value.

Gross Income

# Measures of central tendency
central.tendency(names(numcdf)[6], numcdf)
## Measures of Central Tendency 
## Mean =  15.07642 
## Median =  12.036 
## Mode =  39.48
# Measures of dispersion
dispersion(names(numcdf)[6], numcdf)
## 
## Measures of Dispersion 
## Range =  0.5085 - 45.325 
## IQR =  16.3095 
## Variance =  128.1312 
## Standard Deviation =  11.3195 
## Skewness =  0.8426702 
## Kurtosis =  2.762815

The mean gross income is 15.

Rating

# Measures of central tendency
central.tendency(names(numcdf)[7], numcdf)
## Measures of Central Tendency 
## Mean =  6.979717 
## Median =  7 
## Mode =  6
# Measures of dispersion
dispersion(names(numcdf)[7], numcdf)
## 
## Measures of Dispersion 
## Range =  4 - 10 
## IQR =  3 
## Variance =  2.947942 
## Standard Deviation =  1.716957 
## Skewness =  0.00592866 
## Kurtosis =  1.854314

The average rating is 7.

Total

# Measures of central tendency
central.tendency(names(numcdf)[8], numcdf)
## Measures of Central Tendency 
## Mean =  316.6048 
## Median =  252.756 
## Mode =  829.08
# Measures of dispersion
dispersion(names(numcdf)[8], numcdf)
## 
## Measures of Dispersion 
## Range =  10.6785 - 951.825 
## IQR =  342.4995 
## Variance =  56505.85 
## Standard Deviation =  237.7096 
## Skewness =  0.8426702 
## Kurtosis =  2.762815

The average total sales, including tax is 316.

Days

# Days Histogram
hist(numcdf$Day, 
     main = 'Days Histogram', 
     xlab = 'Day')

The trend of sales seems to generally decrease towards the end of the end of the month, apart from between the 22nd and 26th days.

Hours

# Hours Histogram
hist(numcdf$Hour, 
     main = 'Hours Histogram', 
     xlab = 'Hour')

Most sales occurred in the morning hours.

b) Bivariate Analysis

i) Categorical-Categorical

Branch Vs Customer Type

# Target columns
Branch <- catdf$Branch
Customer <- catdf$Customer.type

# Contingency table
contingency.table <- table(Branch, Customer)
contingency.table
##       Customer
## Branch Member Normal
##      A    167    172
##      B    165    165
##      C    166    156

Branch A has the highest number of normal and member customers.

Branch Vs Gender

# Target columns
Branch <- catdf$Branch
Gender <- catdf$Gender

# Contingency table
contingency.table <- table(Branch, Gender)
contingency.table
##       Gender
## Branch Female Male
##      A    161  178
##      B    160  170
##      C    174  148

Branch A and B had more male customers, and C had more female customers.

Branch Vs Product Line

# Target columns
Branch <- catdf$Branch
Product <- catdf$Product.line

# Contingency table
contingency.table <- table(Branch, Product)
contingency.table
##       Product
## Branch Electronic accessories Fashion accessories Food and beverages
##      A                     60                  50                 58
##      B                     55                  62                 50
##      C                     55                  63                 65
##       Product
## Branch Health and beauty Home and lifestyle Sports and travel
##      A                47                 65                59
##      B                53                 48                62
##      C                52                 43                44

All branches sell the specified products, but in varying amounts. All available products seem to be popular in all the stores.

Branch Vs Payment

# Target columns
Branch <- catdf$Branch
Payment <- catdf$Payment

# Contingency table
contingency.table <- table(Branch, Payment)
contingency.table
##       Payment
## Branch Cash Credit card Ewallet
##      A  110         103     126
##      B  110         108     112
##      C  122          96     104

The most popular payment method is by cash, followed by e-wallets.

ii) Numerical-Numerical

# Scatter plot and correlation function
scatter.plt <- function(col1, col2, data, title){
  # Scatter plot
data <- ggplot(data, aes(x = {{col1}}, y= {{col2}})) + 
  geom_point(color = 'black') + ggtitle(paste(title, 'Scatter Plot')) + 
  theme(plot.title = element_text(hjust = 0.5))
plot(data)

}

Sales Vs Unit Price

# Scatter plot and correlation
scatter.plt(Unit.price, Total, data = numcdf, title = 'Sales Vs Unit Price')

correlation <- cor(numcdf$Unit.price, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = 0.623354749751846."

An increase in unit price leads to an increase in sales.

Sales Vs Quantity

# Scatter plot and correlation
scatter.plt(Quantity, Total, data = numcdf, title = 'Sales Vs Quantity')

correlation <- cor(numcdf$Quantity, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = 0.699290259869811."

An increase in quantity = increase in sales.

Sales Vs Cogs

# Scatter plot and correlation
scatter.plt(Cogs, Total, data = numcdf, title = 'Sales Vs Cogs')

correlation <- cor(numcdf$Cogs, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = 1."

Increase in cogs = increase in sales.

Sales Vs Gross Income

# Scatter plot and correlation
scatter.plt(Gross.income, Total, data = numcdf,
            title = 'Sales Vs Gross Income')

correlation <- cor(numcdf$Gross.income, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = 1."

Increase in sales = increase in gross income.

Sales Vs Rating

# Scatter plot and correlation
scatter.plt(Rating, Total, data = numcdf, title = 'Sales Vs Rating')

correlation <- cor(numcdf$Rating, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = -0.0260916394610116."

There is no relationship between the sales and the ratings.

iii) Numerical-Categorical

Sales Trend

# Line plot

line.plt <- function(col1, col2, data, title){
ggplot(data, aes(x = {{col1}}, y= {{col2}})) + geom_line(color = '#281E5D',
                                                         size = 1) +
    ggtitle(paste(title, 'Line Plot')) +
    theme(plot.title = element_text(hjust = 0.5)) 

}
# Conversion to date class
part12$Date <- as.Date(part12$Date)
# Trend
line.plt(data = part12, col1 =  Date, col2 = Total,
         title = 'Sales Trend')

The sales change drastically on a daily basis.

# Bar plot for averaged y axis
bar.plt.summary <- function(data, col1, col2, title, legend, colors){
  ggplot(data, mapping=aes(x= {{col1}}, y= {{col2}}, fill = {{col1}})) + 
  stat_summary(fun=mean, geom="bar") + ggtitle(paste(title, 'Bar Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)
}

Sales Vs Branch

# Bar plot
bar.plt.summary(part12,Branch, Total,'Sales Vs Branch', 'Branch',
                c('A' = '#0276AB', 'B' = '#02557A', 'C' = '#013349'))

Branch C has the highest sales.

Sales Vs Customer Type

# Bar plot
bar.plt.summary(part12,Customer.type, Total,'Sales Vs Customer Type',
                'Customer Type', c('Member' = '#0276AB', 'Normal' = '#013349'))

Members of the store contribute to a majority of the sales, but not by a large margin.

Sales Vs Gender

# Bar plot
bar.plt.summary(part12,Gender, Total,'Sales Vs Gender', 'Gender',
                c('Male' = '#0276AB', 'Female' = '#02557A', 'C' = '#013349'))

Women contribute to more sales than men.

Sales Vs Product Line

# Conversion to factor
# Bar plot
ggplot(part12.dates, mapping = aes(x = Product.line, y = Total, fill = Product.line))+
  geom_bar(stat = 'identity') + stat_summary(fun=median, geom="bar") + 
  ggtitle(paste('Sales Vs Product Line Plot')) + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_fill_manual('Product Line', values = 
                    c('Sports and travel' = '#0276AB', 
                      'Home and lifestyle' = '#026592', 
                      'Health and beauty' = '#02557A',
                      'Food and beverages' = '#014462',
                      'Fashion accessories' = '#013349',
                      'Electronic accessories' = '#012231')) + 
  coord_flip()

Electronics accessories, Fashion accessories and Sports and travel products have the highest sales contribution.

Sales Vs Payment

# Bar plot
bar.plt.summary(part12,Payment, Total,'Sales Vs Payment', 'Payment',
                c('Cash' = '#0276AB', 'Credit card' = '#02557A',
                  'Ewallet' = '#013349'))

Most transaction are payed for in cash and e-wallet.

Sales Vs Month

# Bar plot
part12.dates$Month <- as.factor(part12.dates$Month)
bar.plt.summary(part12.dates,Month, Total,'Sales Vs Month', 'Month',
                c('1' = '#0276AB', '2' = '#02557A', '3' = '#013349'))

Sales were highest in January, before steadily decreasing in the subsequent months.

Multivariate analysis will be carried out to analyze more complex combinations.

c) Multivariate Analysis

Correlation Matrix

# Visualize correlation matrix
# Removing the gross margin percentage as it only has one value.
numcdf <- numcdf[,!colnames(numcdf) %in%  c('Gross.margin.percentage')]
ggcorrplot(cor(numcdf), lab = TRUE, title = 'Correlation Heatmap',
           colors = c('#022D36', 'white', '#48AAAD'))

The Total - Unit Price, Total - Quantity, Gross income - Unit Price are among the highly correlated columns.

# Stacked bar plot function
stacked.bar <- function(col1, col2, col3, data, title){
  ggplot(data, aes( x = {{col1}}, y = {{col2}}, fill = {{col3}})) + 
  geom_bar(position='stack', stat='identity') +
  scale_fill_viridis(discrete = T) +
  stat_summary(fun=mean, geom="bar") + ggtitle(paste(title, 'Bar Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))
    
}

Sales Vs Gender Vs Branch

# Stacked bar plot
stacked.bar(Branch, Total, Gender, part12, 'Sales Vs Gender Vs Branch')

Branch A and B have the same gender distribution, but branch C has more female customers, as well as the highest sales.

Sales Vs Customer Type Vs Branch

# Stacked bar plot
stacked.bar(Branch, Total, Customer.type, part12,
            'Sales Vs Customer Type Vs Branch')

Branch C has the highest number of customers, and most of them are members. It also has the highest sales.

Sales Vs Payment Vs Branch

# Stacked bar plot
stacked.bar(Branch, Total, Payment, part12,
            'Sales Vs Payment Vs Branch')

In branch A, E-wallet payments have the highest popularity. In branch B, its payment via credit cards. Finally, in branch C, the most popular mode is cash payments.

Sales Vs Month Vs Branch

# Stacked bar plot
part12.dates$Month <- as.factor(part12.dates$Month)
stacked.bar(Month, Total, Branch, part12.dates,
            'Sales Vs Month Vs Branch')

Branch A had the highest sales in January, and C had the lowest. In February, overall, sales dropped, and branch A sales plummeted. There was a slight decrease for branch C, but B had fairly constant sales. In March, sales in branches A and C recovered, but for B, it remained constant.

Part 1: Dimensionality Reduction Using PCA

For unsupervised learning, the main aim of the procedure is to learn patterns from unlabeled data, therefore, the Total column will be dropped, as it is the target/column of the data set.

Preparing the data set

# Removing the target column
part12.dates.nt <- data.frame(part12.dates)
part12.dates.nt <- part12.dates[,!colnames(part12.dates.nt) %in% c('Total')]
# Data set preview
head(part12.dates.nt)
##   Branch Customer.type Gender           Product.line Unit.price Quantity
## 1      A        Member Female      Health and beauty      74.69        7
## 2      C        Normal Female Electronic accessories      15.28        5
## 3      A        Normal   Male     Home and lifestyle      46.33        7
## 4      A        Member   Male      Health and beauty      58.22        8
## 5      A        Normal   Male      Sports and travel      86.31        7
## 6      C        Normal   Male Electronic accessories      85.39        7
##       Tax     Payment   Cogs Gross.margin.percentage Gross.income Rating Month
## 1 26.1415     Ewallet 522.83                4.761905      26.1415    9.1     1
## 2  3.8200        Cash  76.40                4.761905       3.8200    9.6     3
## 3 16.2155 Credit card 324.31                4.761905      16.2155    7.4     3
## 4 23.2880     Ewallet 465.76                4.761905      23.2880    8.4     1
## 5 30.2085     Ewallet 604.17                4.761905      30.2085    5.3     2
## 6 29.8865     Ewallet 597.73                4.761905      29.8865    4.1     3
##   Day Hour Minute
## 1   5   13      8
## 2   8   10     29
## 3   3   13     23
## 4  27   20     33
## 5   8   10     37
## 6  25   18     30

Scaling and encoding is required, and the date column.

Scaling

# The gross margin percentage only contains one value, therefore, it cannot
# be scaled.

unique(part12[,c('Gross.margin.percentage')])
## [1] 4.761905

The column only contains a single value. This column does not provide relevant insights as it is constant for all records. Therefore, it will be removed.

# Manually scaling the numeric columns only as categorical columns might contain 
# zeros and constants, which cannot be re-scaled.

# Standard scaling.
part12.scaled <- data.frame(part12.dates.nt)
# Removing the gross margin percentage
part12.scaled <- part12.scaled[,-10]
num_df <- num_df[, !colnames(num_df) %in% c('Total')]
numdf <- num_df[,-5]
part12.scaled[, colnames(numdf)] <- scale(part12.scaled[, colnames(numdf)])
head(part12.scaled[, colnames(numdf)])
##   Unit.price   Quantity        Tax       Cogs Gross.income     Rating
## 1  0.7372225  0.5269400  0.9775237  0.9775237    0.9775237  1.2349071
## 2 -1.5207153 -0.1615208 -0.9944267 -0.9944267   -0.9944267  1.5261200
## 3 -0.3406283  0.5269400  0.1006301  0.1006301    0.1006301  0.2447834
## 4  0.1112633  0.8711705  0.7254367  0.7254367    0.7254367  0.8272091
## 5  1.1788525  0.5269400  1.3368150  1.3368150    1.3368150 -0.9783107
## 6  1.1438870  0.5269400  1.3083685  1.3083685    1.3083685 -1.6772215
# Scaling the day, hour and minute columns.
part12.scaled[,(ncol(part12.scaled)-2):ncol(part12.scaled)] <- 
  scale(part12.scaled[,(ncol(part12.scaled)-2):ncol(part12.scaled)])
# Data set with target
part12.sc.with.target <- data.frame(part12.scaled)
part12.sc.with.target <- cbind(part12.sc.with.target, scale(part12.dates$Total))

Label Encoding

# Categorical
num <- unlist(lapply(part12.dates, is.numeric))
cat.df <- part12.scaled[,!num]
cat.df <- cat.df[,-ncol(cat.df)]
head(cat.df)
##   Branch Customer.type Gender           Product.line     Payment
## 1      A        Member Female      Health and beauty     Ewallet
## 2      C        Normal Female Electronic accessories        Cash
## 3      A        Normal   Male     Home and lifestyle Credit card
## 4      A        Member   Male      Health and beauty     Ewallet
## 5      A        Normal   Male      Sports and travel     Ewallet
## 6      C        Normal   Male Electronic accessories     Ewallet
# Data set copy
part12.encoded <- data.frame(part12.scaled)
# Label encoding
label <- LabelEncoder$new()
part12.encoded[,colnames(cat.df)] <- sapply(part12.encoded[,colnames(cat.df)],
                                            label$fit_transform)

# Confirming that changes have been made.
head(part12.encoded[,colnames(cat.df)])
##   Branch Customer.type Gender Product.line Payment
## 1      0             0      0            3       2
## 2      2             1      0            0       0
## 3      0             1      1            4       1
## 4      0             0      1            3       2
## 5      0             1      1            5       2
## 6      2             1      1            0       2
# Ensuring that the entire column in numeric format.
part12.encoded$Month <- as.numeric(part12.encoded$Month)
head(part12.encoded)
##   Branch Customer.type Gender Product.line Unit.price   Quantity        Tax
## 1      0             0      0            3  0.7372225  0.5269400  0.9775237
## 2      2             1      0            0 -1.5207153 -0.1615208 -0.9944267
## 3      0             1      1            4 -0.3406283  0.5269400  0.1006301
## 4      0             0      1            3  0.1112633  0.8711705  0.7254367
## 5      0             1      1            5  1.1788525  0.5269400  1.3368150
## 6      2             1      1            0  1.1438870  0.5269400  1.3083685
##   Payment       Cogs Gross.income     Rating Month        Day       Hour
## 1       2  0.9775237    0.9775237  1.2349071     1 -1.1812796 -0.5977934
## 2       0 -0.9944267   -0.9944267  1.5261200     3 -0.8364981 -1.5371379
## 3       1  0.1006301    0.1006301  0.2447834     3 -1.4111340 -0.5977934
## 4       2  0.7254367    0.7254367  0.8272091     1  1.3471180  1.5940104
## 5       2  1.3368150    1.3368150 -0.9783107     2 -0.8364981 -1.5371379
## 6       2  1.3083685    1.3083685 -1.6772215     3  1.1172637  0.9677808
##         Minute
## 1 -1.309810186
## 2 -0.065983593
## 3 -0.421362620
## 4  0.170935758
## 5  0.407855109
## 6 -0.006753755

PCA

Computing PCA

# PCA on pre-scaled data
pca <- prcomp(part12.encoded)

Scree Plot

Visualizing the Eigen values.

# Scree Plot
fviz_eig(pca, addlabels = TRUE, ylim = c(0,40))

The first component accounts for most of the variance in the data.

# PCA summary
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6   PC7
## Standard deviation     1.9784 1.7185 1.04798 1.03790 0.99331 0.97426 0.971
## Proportion of Variance 0.2699 0.2036 0.07573 0.07428 0.06803 0.06545 0.065
## Cumulative Proportion  0.2699 0.4735 0.54923 0.62351 0.69154 0.75699 0.822
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.85104 0.80829 0.79224 0.50451 0.48700 0.29121
## Proportion of Variance 0.04994 0.04505 0.04328 0.01755 0.01635 0.00585
## Cumulative Proportion  0.87193 0.91697 0.96025 0.97780 0.99415 1.00000
##                             PC14      PC15
## Standard deviation     4.411e-16 1.402e-17
## Proportion of Variance 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00

The first component explains 27% of the variance. The second highest, PC2, explains 20% of the variance. Analysis will focus on these two PC’s.

# Result of variables
var.result <- get_pca_var(pca)
var.result
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

Quality of Representation

# Total cos2 of variables on Dim.1 and Dim.2
fviz_cos2(pca, choice = "var", axes = 1:2)

The product line has the highest representation in the dimensions. Followed by the Cogs, Tax and Gross income. Lastly, the Quantity and Unit price.

Contribution to the PCs

# Contributions of variables to PC1
fviz_contrib(pca, choice = "var", axes = 1)

# Contributions of variables to PC2
fviz_contrib(pca, choice = "var", axes = 2)

  • For the first PC, the Cogs, Tax, Gross Income, Quantity and Unit price columns have the highest contribution.
  • For PC2, Product line has the highest contribution.

When only numeric columns are used

# Only selecting the numeric columns
part12.num <- part12.dates.nt[,c(-1,-2,-3,-4,-8, -10, -13)]
part12.num$Quantity <- as.numeric(part12.num$Quantity)
head(part12.num)
##   Unit.price Quantity     Tax   Cogs Gross.income Rating Day Hour Minute
## 1      74.69        7 26.1415 522.83      26.1415    9.1   5   13      8
## 2      15.28        5  3.8200  76.40       3.8200    9.6   8   10     29
## 3      46.33        7 16.2155 324.31      16.2155    7.4   3   13     23
## 4      58.22        8 23.2880 465.76      23.2880    8.4  27   20     33
## 5      86.31        7 30.2085 604.17      30.2085    5.3   8   10     37
## 6      85.39        7 29.8865 597.73      29.8865    4.1  25   18     30

PCA on numeric columns only

# Computing the PCA
pca.num <- prcomp(part12.num, scale= T)
# PCA summary
summary(pca.num)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6     PC7
## Standard deviation     1.9760 1.0415 1.0355 0.9883 0.9738 0.9635 0.29146
## Proportion of Variance 0.4339 0.1205 0.1192 0.1085 0.1054 0.1031 0.00944
## Cumulative Proportion  0.4339 0.5544 0.6735 0.7821 0.8874 0.9906 1.00000
##                              PC8       PC9
## Standard deviation     1.238e-16 6.592e-17
## Proportion of Variance 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00

PC1 accounts for 43% of the variance. A scree plot of the Eigen values will be used to provide a better summary.

# Scree plot
fviz_eig(pca.num, addlabels = TRUE, ylim = c(0,50))

Quality of Representation

# Variable plot
fviz_pca_var(pca.num, repel = T,
             col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

For the numeric variable only PCA, the Cogs, Tax and Gross income have the highest contribution, while the hour, minute and rating have the lowest.

# Variables
var.result.num <- get_pca_var(pca.num)
# Correlation plot
corrplot(var.result.num$cos2, is.corr=FALSE)

From this plot, the Tax, Cogs and Gross income columns have the highest representation for Dim.1.

# Representation in the 1st and 2nd dimensions.
fviz_cos2(pca.num,  choice = "var", axes = 1:2)

The columns with the highest representation are the Tax, Cogs and Gross income.

Variable Contribution

# Contribution in individual components

# PC1
fviz_contrib(pca.num, choice = "var", axes = 1)

# PC2
fviz_contrib(pca.num, choice = "var", axes = 2)

  • For PC1, columns with the highest contribution are the Tax, Total, Cogs and Gross income.
  • For PC2, the day, unit price and quantity have the highest contribution.

Individuals are numerous, therefore, analysis will be focused on the variables.

Part 2: Feature Selection

# Data set
head(part12.scaled)
##   Branch Customer.type Gender           Product.line Unit.price   Quantity
## 1      A        Member Female      Health and beauty  0.7372225  0.5269400
## 2      C        Normal Female Electronic accessories -1.5207153 -0.1615208
## 3      A        Normal   Male     Home and lifestyle -0.3406283  0.5269400
## 4      A        Member   Male      Health and beauty  0.1112633  0.8711705
## 5      A        Normal   Male      Sports and travel  1.1788525  0.5269400
## 6      C        Normal   Male Electronic accessories  1.1438870  0.5269400
##          Tax     Payment       Cogs Gross.income     Rating Month        Day
## 1  0.9775237     Ewallet  0.9775237    0.9775237  1.2349071     1 -1.1812796
## 2 -0.9944267        Cash -0.9944267   -0.9944267  1.5261200     3 -0.8364981
## 3  0.1006301 Credit card  0.1006301    0.1006301  0.2447834     3 -1.4111340
## 4  0.7254367     Ewallet  0.7254367    0.7254367  0.8272091     1  1.3471180
## 5  1.3368150     Ewallet  1.3368150    1.3368150 -0.9783107     2 -0.8364981
## 6  1.3083685     Ewallet  1.3083685    1.3083685 -1.6772215     3  1.1172637
##         Hour       Minute
## 1 -0.5977934 -1.309810186
## 2 -1.5371379 -0.065983593
## 3 -0.5977934 -0.421362620
## 4  1.5940104  0.170935758
## 5 -1.5371379  0.407855109
## 6  0.9677808 -0.006753755

Wrapper Method

# The clustvarsel function will be used to determine the optimal subset of 
# variables.

# Forward-Backward
forward = clustvarsel(part12.encoded,  search = c("greedy"),
            direction = c("forward"), verbose = F)

.

# Result for the F-B approach
forward
## ------------------------------------------------------ 
## Variable selection for Gaussian model-based clustering
## Stepwise (forward/backward) greedy search
## ------------------------------------------------------ 
## 
##  Variable proposed Type of step   BICclust Model G     BICdiff Decision
##           Quantity          Add -2141.9736     E 9    683.1595 Accepted
##                Tax          Add 32938.0738   VEI 9  37246.7226 Accepted
##         Unit.price          Add  -366.9583   EVV 9 -31968.0127 Rejected
##                Tax       Remove -2050.8575     E 9  37155.6065 Rejected
## 
## Selected subset: Quantity, Tax
# Subset of selected variables
sub.forward <- part12.scaled[,forward$subset]
colnames(sub.forward)
## [1] "Quantity" "Tax"

Only the Quantity and tax have been selected.

Classification Plots

# Mclust
fwd = Mclust(sub.forward, verbose = F)
summary(fwd)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEI (diagonal, equal shape) model with 9 components: 
## 
##  log-likelihood   n df      BIC      ICL
##        5037.589 991 36 9826.825 9822.281
## 
## Clustering table:
##   1   2   3   4   5   6   7   8   9 
## 187 102  98  50 112  91  60 199  92
# Classification plot
plot(fwd, c("classification"), main = 'Forward-Backward Classification Result')

There are 9 clusters, the 8th cluster has the highest number of values.

Feature Ranking

# FSelectorRcpp - Information gain

# Selecting the entire data set
colnames(part12.sc.with.target)[ncol(part12.sc.with.target)] <- c('Total')
imp <- information_gain(Total ~ . , data = part12.sc.with.target, equal = T)
imp <- imp[order(imp$importance, decreasing = T), ]
head(imp)
##      attributes  importance
## 7           Tax 1.609435878
## 9          Cogs 1.609435878
## 10 Gross.income 1.609435878
## 6      Quantity 0.420954158
## 5    Unit.price 0.311023067
## 4  Product.line 0.007238866

The Tax, Cogs and Gross income have the highest feature importance.

Filter Method

# Visualize correlation matrix
corrplot(cor(num_df))

Multiple columns have high correlations.

# Removing column with a single value - zero standard deviation
num_df <- num_df[, !colnames(num_df) %in% c('Gross.margin.percentage')]
# Correlation matrix.
cm <- cor(num_df)

# Highly correlated attributes.
high.corr <- findCorrelation(cm, cutoff=0.6)

names(num_df[,high.corr])
## [1] "Tax"          "Cogs"         "Gross.income"

These columns have a Pearson’s correlation coefficient >= 0.6.

# Removing redundant features 
num.df.select <- num_df[-high.corr]

# Performing our graphical comparison
# ---
# 
par(mfrow = c(1, 2))
corrplot(cm, order = "hclust")
corrplot(cor(num.df.select), order = "hclust")

The rating, unit price and quantity remain.

6. Implementing the Solution

Analysis Summary

Univariate Analysis

Categorical:

  • Branch A has the highest frequency, and C the lowest, though the difference is minimal.
  • The frequency of customers who are members is slightly higher than that of normal customers.
  • There is a slightly higher proportion of male customers, than female.
  • Food and beverages, and fashion accessories have the highest frequency.
  • Most payment are done in cash or using e-wallets.
  • January and February had the highest frequency, and February had the lowest.

Numerical:

The average unit price is 55. The average quantity is 5.5. The average tax is 15. The average cogs is 301.53. * The average gross margin income is 4.8. From the measures of dispersion is can be seen that the column only has one value. * The average gross income is 15. * The average rating is 7. * The average total sales, including tax is 316. * The trend of sales seems to generally decrease towards the end of the end of the month, apart from between the 22nd and 26th days. * Most sales occurred in the morning hours.

Bivariate Analysis

Categorical- Categorical:

  • Branch A has the highest number of normal and member customers.
  • Branch A and B had more male customers, and C had more female customers.
  • All available products seem to be popular in all the stores.
  • The most popular payment method is by cash, followed by e-wallets.

Numerical-Numerical:

  • An increase in unit price leads to an increase in sales.
  • An increase in quantity = increase in sales.
  • Increase in cogs = increase in sales.
  • Increase in sales = increase in gross income.
  • There is no relationship between the sales and the ratings.

Numerical-Categorical:

  • The sales change drastically on a daily basis.
  • Branch C has the highest sales.
  • Members of the store contribute to a majority of the sales, but not by a large margin.
  • Women contribute to more sales than men.
  • Electronics accessories, Fashion accessories and Sports and travel products have the highest sales contribution.
  • Most transaction are payed for in cash and e-wallet.
  • Sales were highest in January, before steadily decreasing in the subsequent months.

Multivariate

  • Branch A and B have the same gender distribution, but branch C has more female customers, as well as the highest sales.
  • Branch C has the highest number of customers, and most of them are members. It also has the highest sales.
  • In branch A, E-wallet payments have the highest popularity. In branch B, its payment via credit cards. Finally, in branch C, the most popular mode is cash payments.
  • Branch A had the highest sales in January, and C had the lowest. In February, overall, sales dropped, and branch A sales plummeted. There was a slight decrease for branch C, but B had fairly constant sales. In March, sales in branches A and C recovered, but for B, it remained constant.

PCA

PC1 - 27% PC2 - 20.4%

Quality of Representation

The product line has the highest representation in the dimensions. Followed by the Cogs, Tax and Gross income. Lastly, the Quantity and Unit price.

Contribution to the PCs

  • For the first PC, the Cogs, Tax, Gross Income, Quantity and Unit price columns have the highest contribution.
  • For PC2, Product line has the highest contribution.

Quality of Representation

The columns with the highest representation are the Tax, Cogs and Gross income.

Variable Contribution

  • For PC1, columns with the highest contribution are the Tax, Cogs and Gross income.
  • For PC2, the day, unit price and quantity have the highest contribution.

Feature Selection * Wrapper method - Only the Quantity and tax have been selected. * Feature ranking - The Tax, Cogs and Gross income have the highest feature importance. * Filter method - The rating, unit price and quantity remain.

Overall, the Tax, Cogs and Gross income are the variables with the highest importance in determining sales

7. Challenging the Solution

PCA and other feature selection methods were used to determine the most important variables.

8. Conclusion

In conclusion,

Descriptive analysis: the branch, gender, payment method and month of sales greatly affected the sales incurred by the mall.

PCA and feature analysis: the tax, cogs and gross income have the highest importance when determining the sales.

9. Recommendations

  1. An analysis into the reason behind the overall lagging sales in branch A and B.

  2. An analysis into the reason behind differing popularity of payment methods in different branches for better customer service.

  3. An analysis into the reason behind the drop in sales in branch A and B in February, and why B was not affected.

  4. An analysis on the importance of the tax, cogs and gross income using business domain knowledge, to determine how they can be used to plan and improve sales and marketing strategies.

  5. Collection of more data, with more variables, to provide a more in depth understanding into customer groups, sales trends and branch performance.

10. Follow Up Questions

a) Did we have the right data?

Yes, we have the right data as it was provied by the client, to perform the requested analysis. Furthermore, meaningful insights were extracted from the data.

b) Do we need other data to answer our question?

Yes, more indepth data, as well as more data will help in getting a deeper understanding on the sales trend, different customer groups, as well the performance of different branches.

c) Did we have the right question?

. Yes, we have the right question, as this analysis was done as per the client’s request.

# Suppressing warnings
options(warn = defaultW)