What are the most prominent variables from the sales data?
Correct identification of the most important variables from the provided sales data.
Carrefour Kenya was undertaking a project to inform the marketing department about the presence of fraudulent sales. They provided a data set from which anomalies, if any, were to be detected.
Correctly identifying the most important variables in the data set matters because it narrows the factors considered during decision making and planning to only the most relevant ones.
The data provided should contain correct information and records for the analysis results to be relevant.
# Suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# Libraries
library(data.table) # Data Table library
library(plyr)
library(psych)
library(coda)
library(base) # Date-time conversion
library(ggplot2) # Plotting Library
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(moments) # Measures of distribution
library(ggcorrplot) # Correlation plotting
library(mice) # Missing values
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(devtools)
## Loading required package: usethis
library(superml) # Label encoding
## Loading required package: R6
library(R6)
library(stats) # PCA
library(ggbiplot) # PCA plot
## Loading required package: scales
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
## Loading required package: grid
library(factoextra, quietly = TRUE) # PCA visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(corrplot) # Correlation plot
## corrplot 0.92 loaded
library(clustvarsel)
## Loading required package: mclust
## Package 'mclust' version 5.4.10
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:psych':
##
## sim
## Package 'clustvarsel' version 2.3.4
## Type 'citation("clustvarsel")' for citing this R package in publications.
library(mclust, quiet = T)
library(FSelectorRcpp) # Feature ranking
library(caret)
## Loading required package: lattice
library(lattice)
library(viridis)
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
##
## viridis_pal
library(viridisLite)
library(hrbrthemes, quietly = T)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
# Loading data set
part12 <- read.csv('dataset_1_2.csv')
# Data set records summary
records.summary <- function(data){
cat('Number of rows = ', nrow(data), ' and columns = ', ncol(data), '\n')
}
# Data set summary
# Part 1 and 2
print('Dataset 1 and 2')
## [1] "Dataset 1 and 2"
records.summary(part12)
## Number of rows = 1000 and columns = 16
# Top data set preview
head(part12)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## 3 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4 123-19-1176 A Member Male Health and beauty 58.22
## 5 373-73-7910 A Normal Male Sports and travel 86.31
## 6 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross.margin.percentage
## 1 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross.income Rating Total
## 1 26.1415 9.1 548.9715
## 2 3.8200 9.6 80.2200
## 3 16.2155 7.4 340.5255
## 4 23.2880 8.4 489.0480
## 5 30.2085 5.3 634.3785
## 6 29.8865 4.1 627.6165
# Bottom data set preview
tail(part12)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 995 652-49-6720 C Member Female Electronic accessories 60.95
## 996 233-67-5758 C Normal Male Health and beauty 40.35
## 997 303-96-2227 B Normal Female Home and lifestyle 97.38
## 998 727-02-1313 A Member Male Food and beverages 31.84
## 999 347-56-2442 A Normal Male Home and lifestyle 65.82
## 1000 849-09-3807 A Member Female Fashion accessories 88.34
## Quantity Tax Date Time Payment cogs gross.margin.percentage
## 995 1 3.0475 2/18/2019 11:40 Ewallet 60.95 4.761905
## 996 1 2.0175 1/29/2019 13:46 Ewallet 40.35 4.761905
## 997 10 48.6900 3/2/2019 17:16 Ewallet 973.80 4.761905
## 998 1 1.5920 2/9/2019 13:22 Cash 31.84 4.761905
## 999 1 3.2910 2/22/2019 15:33 Cash 65.82 4.761905
## 1000 7 30.9190 2/18/2019 13:28 Cash 618.38 4.761905
## gross.income Rating Total
## 995 3.0475 5.9 63.9975
## 996 2.0175 6.2 42.3675
## 997 48.6900 4.4 1022.4900
## 998 1.5920 7.7 33.4320
## 999 3.2910 4.1 69.1110
## 1000 30.9190 6.6 649.2990
The invoice ID will be dropped as it doesn’t provide the required information, while the date column will be changed to a date data type.
# Data set description
str(part12)
## 'data.frame': 1000 obs. of 16 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ Customer.type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product.line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
The date and time columns will be combined, then converted to a datetime data type.
# Combining the date and time columns
part12$Date <- paste(part12$Date, part12$Time)
# Dropping the time column
index <- grep('Time', colnames(part12))
part12 <- part12[,-index]
colnames(part12)
## [1] "Invoice.ID" "Branch"
## [3] "Customer.type" "Gender"
## [5] "Product.line" "Unit.price"
## [7] "Quantity" "Tax"
## [9] "Date" "Payment"
## [11] "cogs" "gross.margin.percentage"
## [13] "gross.income" "Rating"
## [15] "Total"
# Checking the combined column
head(part12$Date)
## [1] "1/5/2019 13:08" "3/8/2019 10:29" "3/3/2019 13:23" "1/27/2019 20:33"
## [5] "2/8/2019 10:37" "3/25/2019 18:30"
The date and time columns have been combined.
# Unique dates to determine the required date and time format
head(unique(part12$Date), 10)
## [1] "1/5/2019 13:08" "3/8/2019 10:29" "3/3/2019 13:23" "1/27/2019 20:33"
## [5] "2/8/2019 10:37" "3/25/2019 18:30" "2/25/2019 14:36" "2/24/2019 11:38"
## [9] "1/10/2019 17:15" "2/20/2019 13:27"
The format is '%m/%d/%Y %H:%M'.
# Converting the date column to time data type, time zone = East African Time
part12$Date <- strptime(part12$Date, format = '%m/%d/%Y %H:%M')
# Preview
class(part12$Date)
## [1] "POSIXlt" "POSIXt"
# Column preview
head(part12$Date, 10)
## [1] "2019-01-05 13:08:00 EAT" "2019-03-08 10:29:00 EAT"
## [3] "2019-03-03 13:23:00 EAT" "2019-01-27 20:33:00 EAT"
## [5] "2019-02-08 10:37:00 EAT" "2019-03-25 18:30:00 EAT"
## [7] "2019-02-25 14:36:00 EAT" "2019-02-24 11:38:00 EAT"
## [9] "2019-01-10 17:15:00 EAT" "2019-02-20 13:27:00 EAT"
The date column has been converted to the required datetime type.
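As a side note, strptime returns POSIXlt, which stores each timestamp as a list of components; POSIXct (a single numeric value) is usually the safer class for a data frame column. A minimal, non-destructive sketch of the alternative:
# Optional: POSIXct stores one number per timestamp and tends to behave
# better as a data frame column than list-based POSIXlt.
head(as.POSIXct(part12$Date))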
Converting the character columns to factors.
# Get character columns
char <- unlist(lapply(part12, is.character))
char
## Invoice.ID Branch Customer.type
## TRUE TRUE TRUE
## Gender Product.line Unit.price
## TRUE TRUE FALSE
## Quantity Tax Date
## FALSE FALSE FALSE
## Payment cogs gross.margin.percentage
## TRUE FALSE FALSE
## gross.income Rating Total
## FALSE FALSE FALSE
# Converting categorical columns to factor types
part12[, char]<- lapply(part12[, char], factor)
# Checking changes
head(part12[, char])
## Invoice.ID Branch Customer.type Gender Product.line Payment
## 1 750-67-8428 A Member Female Health and beauty Ewallet
## 2 226-31-3081 C Normal Female Electronic accessories Cash
## 3 631-41-3108 A Normal Male Home and lifestyle Credit card
## 4 123-19-1176 A Member Male Health and beauty Ewallet
## 5 373-73-7910 A Normal Male Sports and travel Ewallet
## 6 699-14-3026 C Normal Male Electronic accessories Ewallet
The columns have been successfully converted to factors.
The data sets have been provided by the client, therefore, external data set validation will not be used.
# Data set columns
colnames(part12)
## [1] "Invoice.ID" "Branch"
## [3] "Customer.type" "Gender"
## [5] "Product.line" "Unit.price"
## [7] "Quantity" "Tax"
## [9] "Date" "Payment"
## [11] "cogs" "gross.margin.percentage"
## [13] "gross.income" "Rating"
## [15] "Total"
Checking the invoice ID column.
# Number of unique values vs number of rows.
cat('Unique values =', length(unique(part12$Invoice.ID)), 'and number of rows =',
    nrow(part12))
## Unique values = 1000 and number of rows = 1000
The values are all unique, therefore, the column will be dropped.
# Dropping the invoice ID column
part12 <- part12[,-1]
# Confirming changes
colnames(part12)
## [1] "Branch" "Customer.type"
## [3] "Gender" "Product.line"
## [5] "Unit.price" "Quantity"
## [7] "Tax" "Date"
## [9] "Payment" "cogs"
## [11] "gross.margin.percentage" "gross.income"
## [13] "Rating" "Total"
# Missing values
colSums(is.na(part12))
## Branch Customer.type Gender
## 0 0 0
## Product.line Unit.price Quantity
## 0 0 0
## Tax Date Payment
## 0 0 0
## cogs gross.margin.percentage gross.income
## 0 0 0
## Rating Total
## 0 0
There are no missing values in the first data set.
# Checking for duplicates
sum(duplicated(part12))
## [1] 0
There are no duplicates in the first data set.
Checking the uniformity of column names and values.
# Column names
colnames(part12)
## [1] "Branch" "Customer.type"
## [3] "Gender" "Product.line"
## [5] "Unit.price" "Quantity"
## [7] "Tax" "Date"
## [9] "Payment" "cogs"
## [11] "gross.margin.percentage" "gross.income"
## [13] "Rating" "Total"
The column names are not uniformly capitalized, therefore the lowercase ones will be renamed.
# Renaming columns
colnames(part12)[c(10, 11, 12)] <- c('Cogs', 'Gross.margin.percentage',
'Gross.income')
# Checking changes
colnames(part12)
## [1] "Branch" "Customer.type"
## [3] "Gender" "Product.line"
## [5] "Unit.price" "Quantity"
## [7] "Tax" "Date"
## [9] "Payment" "Cogs"
## [11] "Gross.margin.percentage" "Gross.income"
## [13] "Rating" "Total"
The column names have been renamed successfully.
# Selecting numerical columns
num <- unlist(lapply(part12, is.numeric))
# Numeric data frame
num_df <- part12[, num]
# Number of numeric columns
length(names(num_df))
## [1] 8
# Box plots
par(mfrow = c(5,2), mar=c(1,5,2,2))
for (i in 1:length(num_df)){
boxplot(num_df[ , i], main = paste('Boxplot of', names(num_df)[i]),
ylab = 'Count')
}
The Tax, Cogs, Gross income and Total columns have outliers.
# Outliers summary function
outliers.summary <- function(data){
total <- c(0) # Placeholder for total number of outliers
cat('Percentage outliers per column\n')
cat(rep('-', 16), '\n')
for (col in names(data)){
# IQR
Q1 <- quantile(data[,col], probs=.25)
Q3 <- quantile(data[,col], probs=.75)
IQR = Q3-Q1
# Limits
upper.limit = Q3 + (IQR*1.5)
lower.limit = Q1 - (IQR*1.5)
# Outliers per column
out <- sum(data[,col] > upper.limit | data[,col] < lower.limit)
total <- total + out
perc <- (out/nrow(data))*100
print(paste0(col, ' = ', round(perc,2), '%'))
cat('\n')
}
# Total percentage of outliers
cat('\nTotal percentage of outliers\n')
cat(rep('-', 16), '\n')
cells <- nrow(data)*ncol(data)
perc <- (total/cells)*100
print(paste0('Total = ', round(perc,2), '%'))
}
# Outliers summary
outliers.summary(num_df)
## Percentage outliers per column
## - - - - - - - - - - - - - - - -
## [1] "Unit.price = 0%"
##
## [1] "Quantity = 0%"
##
## [1] "Tax = 0.9%"
##
## [1] "Cogs = 0.9%"
##
## [1] "Gross.margin.percentage = 0%"
##
## [1] "Gross.income = 0.9%"
##
## [1] "Rating = 0%"
##
## [1] "Total = 0.9%"
##
##
## Total percentage of outliers
## - - - - - - - - - - - - - - - -
## [1] "Total = 0.45%"
The outliers are minimal, therefore they will be removed in order to analyze the general trend.
# Get column indices
index <- match(colnames(num_df), colnames(part12))
index
## [1] 5 6 7 10 11 12 13 14
# Removing outliers
# Function to obtain logical vector of outliers
outliers.vector <- function(data) {
# IQR
Q1 <- quantile(data, probs=.25)
Q3 <- quantile(data, probs=.75)
IQR <- Q3-Q1
# Limits
upper.limit = Q3 + (IQR*1.5)
lower.limit = Q1 - (IQR*1.5)
# Logical vector
data > upper.limit | data < lower.limit
}
# Alternative remover that selects columns by name (kept for reference;
# remove.func below is the version actually used)
remove.outliers <- function(data, cols = names(data)) {
for (col in cols) {
data <- data[!outliers.vector(data[[col]]),]
}
data
}
# Function that removes rows with outliers
remove.func <- function(data, indices){
for(col in indices){
no <- outliers.vector(data[,col])
data <- data[!no,]
}
data
}
# Removing outliers
# Copy of the dataset.
part12.no <- data.frame(part12)
# Function call
part12.no <- remove.func(part12.no, index)
# Number of removed rows
perc <- ((nrow(part12) - nrow(part12.no))/nrow(part12))*100
print(paste0('Removed rows = ', round((perc),2), '%'))
## [1] "Removed rows = 0.9%"
Splitting the date column into its components
# Copy of the data set
part12.dates <- data.frame(part12.no)
# Columns required = Year, month, day, hour and minute
part12.dates$Year <- format(part12.dates$Date, format = '%Y')
# Checking for unique years
unique(part12.dates$Year)
## [1] "2019"
The data only contains records for the year 2019. This column will be dropped as it provides no further information.
# Dropping the year column
part12.dates <- part12.dates[,-ncol(part12.dates)]
# Creating the remaining date and time columns
part12.dates$Month <- format(part12.dates$Date, format = '%m')
part12.dates$Day <- format(part12.dates$Date, format = '%d')
part12.dates$Hour <- format(part12.dates$Date, format = '%H')
part12.dates$Minute <- format(part12.dates$Date, format = '%M')
# Dropping the date column
part12.dates <- subset(part12.dates, select = -c(Date) )
# Checking changes
head(part12.dates)
## Branch Customer.type Gender Product.line Unit.price Quantity
## 1 A Member Female Health and beauty 74.69 7
## 2 C Normal Female Electronic accessories 15.28 5
## 3 A Normal Male Home and lifestyle 46.33 7
## 4 A Member Male Health and beauty 58.22 8
## 5 A Normal Male Sports and travel 86.31 7
## 6 C Normal Male Electronic accessories 85.39 7
## Tax Payment Cogs Gross.margin.percentage Gross.income Rating
## 1 26.1415 Ewallet 522.83 4.761905 26.1415 9.1
## 2 3.8200 Cash 76.40 4.761905 3.8200 9.6
## 3 16.2155 Credit card 324.31 4.761905 16.2155 7.4
## 4 23.2880 Ewallet 465.76 4.761905 23.2880 8.4
## 5 30.2085 Ewallet 604.17 4.761905 30.2085 5.3
## 6 29.8865 Ewallet 597.73 4.761905 29.8865 4.1
## Total Month Day Hour Minute
## 1 548.9715 01 05 13 08
## 2 80.2200 03 08 10 29
## 3 340.5255 03 03 13 23
## 4 489.0480 01 27 20 33
## 5 634.3785 02 08 10 37
## 6 627.6165 03 25 18 30
The date column has been dropped.
# Converting the month, day, hour and minutes to numeric format.
part12.dates[,(ncol(part12.dates) - 3): ncol(part12.dates)] <- sapply(
part12.dates[,(ncol(part12.dates) - 3): ncol(part12.dates)], as.numeric)
#
# Preview changes
head(part12.dates[,(ncol(part12.dates) - 3): ncol(part12.dates)])
## Month Day Hour Minute
## 1 1 5 13 8
## 2 3 8 10 29
## 3 3 3 13 23
## 4 1 27 20 33
## 5 2 8 10 37
## 6 3 25 18 30
# Count plot and normal bar plot function.
bar.plt <- function(data, col1, title, legend, colors, method, col2 = NULL){
if (method == 'count'){
ggplot(data, aes(x = {{col1}}, fill = {{col1}})) + geom_bar() +
ggtitle(paste(title, 'Frequency Plot')) +
theme(plot.title = element_text(hjust = 0.5))+
scale_fill_manual(legend, values = colors)}
else if (method == 'bar'){
ggplot(data, aes(x = {{col1}}, y = {{col2}}, fill = {{col1}})) + geom_bar(stat = 'identity') + ggtitle(paste(title, 'Bar Plot')) +
theme(plot.title = element_text(hjust = 0.5))+
scale_fill_manual(legend, values = colors)}
}
head(part12)
## Branch Customer.type Gender Product.line Unit.price Quantity
## 1 A Member Female Health and beauty 74.69 7
## 2 C Normal Female Electronic accessories 15.28 5
## 3 A Normal Male Home and lifestyle 46.33 7
## 4 A Member Male Health and beauty 58.22 8
## 5 A Normal Male Sports and travel 86.31 7
## 6 C Normal Male Electronic accessories 85.39 7
## Tax Date Payment Cogs Gross.margin.percentage
## 1 26.1415 2019-01-05 13:08:00 Ewallet 522.83 4.761905
## 2 3.8200 2019-03-08 10:29:00 Cash 76.40 4.761905
## 3 16.2155 2019-03-03 13:23:00 Credit card 324.31 4.761905
## 4 23.2880 2019-01-27 20:33:00 Ewallet 465.76 4.761905
## 5 30.2085 2019-02-08 10:37:00 Ewallet 604.17 4.761905
## 6 29.8865 2019-03-25 18:30:00 Ewallet 597.73 4.761905
## Gross.income Rating Total
## 1 26.1415 9.1 548.9715
## 2 3.8200 9.6 80.2200
## 3 16.2155 7.4 340.5255
## 4 23.2880 8.4 489.0480
## 5 30.2085 5.3 634.3785
## 6 29.8865 4.1 627.6165
Branch
# Unique
not.cat <- unlist(lapply(part12.dates, is.numeric))
catdf <- part12.dates[,!not.cat]
catdf$Month <- part12.dates$Month
unique(catdf$Branch)
## [1] A C B
## Levels: A B C
# Bar plot
bar.plt (catdf, Branch, title = 'Branch', method = 'count', legend =
'Branch', colors = c('A' = '#0276AB',
'B' = '#02557A',
'C' = '#013349'), col2 = NULL)
Branch A has the highest frequency, and C the lowest, though the difference is minimal.
Customer Type
# Unique
unique(catdf$Customer.type)
## [1] Member Normal
## Levels: Member Normal
# Bar plot
bar.plt (catdf, Customer.type, title = 'Customer Type', method = 'count',
legend = 'Customer Type', colors = c('Member' = '#0276AB',
'Normal' = '#013349'), col2 = NULL)
The frequency of customers who are members is slightly higher than that of normal customers.
Gender
# Unique
unique(catdf$Gender)
## [1] Female Male
## Levels: Female Male
# Bar plot
bar.plt (catdf, Gender, title = 'Gender', method = 'count', legend =
'Gender', colors = c('Male' = '#0276AB',
'Female' = '#013349'), col2 = NULL)
There is a slightly higher proportion of male customers than female.
Product Line
# Unique
unique(catdf$Product.line)
## [1] Health and beauty Electronic accessories Home and lifestyle
## [4] Sports and travel Food and beverages Fashion accessories
## 6 Levels: Electronic accessories Fashion accessories ... Sports and travel
# Bar plot
ggplot(catdf, aes(x = Product.line, fill = Product.line)) + geom_bar() + ggtitle(paste('Product Line', 'Frequency Plot')) +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_manual('Product Line', values =
c('Sports and travel' = '#0276AB',
'Home and lifestyle' = '#026592',
'Health and beauty' = '#02557A',
'Food and beverages' = '#014462',
'Fashion accessories' = '#013349',
'Electronic accessories' = '#012231')) +
coord_flip()
Food and beverages and fashion accessories have the highest frequencies.
Payment
# Unique
unique(catdf$Payment)
## [1] Ewallet Cash Credit card
## Levels: Cash Credit card Ewallet
# Bar plot
bar.plt (catdf, Payment, title = 'Payment', method = 'count', legend =
'Payment', colors = c('Cash' = '#0276AB',
'Credit card' = '#02557A',
'Ewallet' = '#012231'), col2 = NULL)
Most payments are made in cash or via e-wallet.
Month
# Unique
catdf$Month <- as.factor(catdf$Month)
unique(catdf$Month)
## [1] 1 3 2
## Levels: 1 2 3
# Bar plot
bar.plt (catdf, Month, title = 'Month', method = 'count', legend =
'Month', colors = c('1' = '#0276AB',
'2' = '#02557A',
'3' = '#012231'), col2 = NULL)
January and March had the highest frequencies, and February the lowest.
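The counts behind the plot can be confirmed numerically with a simple frequency table:
# Transaction counts per month (1 = January, 2 = February, 3 = March)
table(catdf$Month)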
Bivariate analysis will be done to get a clearer picture of factors that affect total sales.
# Mode function (note: this masks base::mode for the rest of the script)
mode <- function(col, data) {
unique.value <- unique(data[, col])
unique.value[which.max(tabulate(match(data[,col], unique.value)))]
}
central.tendency <- function(col, data){
cat('Measures of Central Tendency \n')
# Mean
cat('Mean = ', mean(data[, col]), '\n')
# Median
cat('Median = ', median(data[,col]), '\n')
# Mode
cat('Mode = ', mode(col, data), '\n')
}
dispersion <- function(col, data){
cat('\nMeasures of Dispersion \n')
# Range
cat('Range = ', min(data[ ,col]), '-', max(data[ ,col]), '\n')
# IQR
cat('IQR = ', IQR(data[ ,col]), '\n')
# Variance
cat('Variance = ', var(data[ ,col]), '\n')
# Standard Deviation
cat('Standard Deviation = ', sd(data[ ,col]), '\n')
# Skewness
cat('Skewness = ', skewness(data[ ,col]), '\n')
# Kurtosis
cat('Kurtosis = ', kurtosis(data[ ,col]), '\n')
}
# Numerical data
numc <- unlist(lapply(part12.dates, is.numeric))
numcdf <- part12.dates[,numc]
numcdf <- numcdf[, !colnames(numcdf) %in% c('Month')]
Unit Price
# Measures of central tendency
central.tendency(names(numcdf)[1], numcdf)
## Measures of Central Tendency
## Mean = 55.29248
## Median = 54.84
## Mode = 83.77
# Measures of dispersion
dispersion(names(numcdf)[1], numcdf)
##
## Measures of Dispersion
## Range = 10.08 - 99.96
## IQR = 44.82
## Variance = 692.3013
## Standard Deviation = 26.31162
## Skewness = 0.01401477
## Kurtosis = 1.789507
Quantity
# Measures of central tendency
central.tendency(names(numcdf)[2], numcdf)
## Measures of Central Tendency
## Mean = 5.469223
## Median = 5
## Mode = 1
# Measures of dispersion
dispersion(names(numcdf)[2], numcdf)
##
## Measures of Dispersion
## Range = 1 - 10
## IQR = 5
## Variance = 8.439203
## Standard Deviation = 2.905031
## Skewness = 0.02184004
## Kurtosis = 1.794938
Tax
# Measures of central tendency
central.tendency(names(numcdf)[3], numcdf)
## Measures of Central Tendency
## Mean = 15.07642
## Median = 12.036
## Mode = 39.48
# Measures of dispersion
dispersion(names(numcdf)[3], numcdf)
##
## Measures of Dispersion
## Range = 0.5085 - 45.325
## IQR = 16.3095
## Variance = 128.1312
## Standard Deviation = 11.3195
## Skewness = 0.8426702
## Kurtosis = 2.762815
Cogs
# Measures of central tendency
central.tendency(names(numcdf)[4], numcdf)
## Measures of Central Tendency
## Mean = 301.5283
## Median = 240.72
## Mode = 789.6
# Measures of dispersion
dispersion(names(numcdf)[4], numcdf)
##
## Measures of Dispersion
## Range = 10.17 - 906.5
## IQR = 326.19
## Variance = 51252.47
## Standard Deviation = 226.3901
## Skewness = 0.8426702
## Kurtosis = 2.762815
Gross Margin Percentage
# Measures of central tendency
central.tendency(names(numcdf)[5], numcdf)
## Measures of Central Tendency
## Mean = 4.761905
## Median = 4.761905
## Mode = 4.761905
# Measures of dispersion
dispersion(names(numcdf)[5], numcdf)
##
## Measures of Dispersion
## Range = 4.761905 - 4.761905
## IQR = 0
## Variance = 0
## Standard Deviation = 0
## Skewness = NaN
## Kurtosis = NaN
The average gross margin percentage is 4.8. From the measures of dispersion it can be seen that the column contains only one value.
Gross Income
# Measures of central tendency
central.tendency(names(numcdf)[6], numcdf)
## Measures of Central Tendency
## Mean = 15.07642
## Median = 12.036
## Mode = 39.48
# Measures of dispersion
dispersion(names(numcdf)[6], numcdf)
##
## Measures of Dispersion
## Range = 0.5085 - 45.325
## IQR = 16.3095
## Variance = 128.1312
## Standard Deviation = 11.3195
## Skewness = 0.8426702
## Kurtosis = 2.762815
The mean gross income is 15.
Rating
# Measures of central tendency
central.tendency(names(numcdf)[7], numcdf)
## Measures of Central Tendency
## Mean = 6.979717
## Median = 7
## Mode = 6
# Measures of dispersion
dispersion(names(numcdf)[7], numcdf)
##
## Measures of Dispersion
## Range = 4 - 10
## IQR = 3
## Variance = 2.947942
## Standard Deviation = 1.716957
## Skewness = 0.00592866
## Kurtosis = 1.854314
The average rating is 7.
Total
# Measures of central tendency
central.tendency(names(numcdf)[8], numcdf)
## Measures of Central Tendency
## Mean = 316.6048
## Median = 252.756
## Mode = 829.08
# Measures of dispersion
dispersion(names(numcdf)[8], numcdf)
##
## Measures of Dispersion
## Range = 10.6785 - 951.825
## IQR = 342.4995
## Variance = 56505.85
## Standard Deviation = 237.7096
## Skewness = 0.8426702
## Kurtosis = 2.762815
The average sale total, including tax, is 316.
Days
# Days Histogram
hist(numcdf$Day,
main = 'Days Histogram',
xlab = 'Day')
The trend of sales seems to generally decrease towards the end of the month, apart from between the 22nd and the 26th.
Hours
# Hours Histogram
hist(numcdf$Hour,
main = 'Hours Histogram',
xlab = 'Hour')
Most sales occurred in the morning hours.
Branch Vs Customer Type
# Target columns
Branch <- catdf$Branch
Customer <- catdf$Customer.type
# Contingency table
contingency.table <- table(Branch, Customer)
contingency.table
## Customer
## Branch Member Normal
## A 167 172
## B 165 165
## C 166 156
Branch A has the highest number of normal and member customers.
Branch Vs Gender
# Target columns
Branch <- catdf$Branch
Gender <- catdf$Gender
# Contingency table
contingency.table <- table(Branch, Gender)
contingency.table
## Gender
## Branch Female Male
## A 161 178
## B 160 170
## C 174 148
Branches A and B had more male customers, while C had more female customers.
Branch Vs Product Line
# Target columns
Branch <- catdf$Branch
Product <- catdf$Product.line
# Contingency table
contingency.table <- table(Branch, Product)
contingency.table
## Product
## Branch Electronic accessories Fashion accessories Food and beverages
## A 60 50 58
## B 55 62 50
## C 55 63 65
## Product
## Branch Health and beauty Home and lifestyle Sports and travel
## A 47 65 59
## B 53 48 62
## C 52 43 44
All branches sell every product line, though in varying amounts; each product line appears popular across all three stores.
Branch Vs Payment
# Target columns
Branch <- catdf$Branch
Payment <- catdf$Payment
# Contingency table
contingency.table <- table(Branch, Payment)
contingency.table
## Payment
## Branch Cash Credit card Ewallet
## A 110 103 126
## B 110 108 112
## C 122 96 104
Cash and e-wallet payments are equally popular overall, followed by credit cards.
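The overall counts per method can be read off the column totals of the contingency table; a quick check:
# Overall transaction counts per payment method
colSums(contingency.table)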
# Scatter plot and correlation function
scatter.plt <- function(col1, col2, data, title){
# Scatter plot
data <- ggplot(data, aes(x = {{col1}}, y= {{col2}})) +
geom_point(color = 'black') + ggtitle(paste(title, 'Scatter Plot')) +
theme(plot.title = element_text(hjust = 0.5))
plot(data)
}
Sales Vs Unit Price
# Scatter plot and correlation
scatter.plt(Unit.price, Total, data = numcdf, title = 'Sales Vs Unit Price')
correlation <- cor(numcdf$Unit.price, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = 0.623354749751846."
Unit price is moderately positively correlated with the sale total (r ≈ 0.62).
Sales Vs Quantity
# Scatter plot and correlation
scatter.plt(Quantity, Total, data = numcdf, title = 'Sales Vs Quantity')
correlation <- cor(numcdf$Quantity, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = 0.699290259869811."
Quantity is positively correlated with the sale total (r ≈ 0.70).
Sales Vs Cogs
# Scatter plot and correlation
scatter.plt(Cogs, Total, data = numcdf, title = 'Sales Vs Cogs')
correlation <- cor(numcdf$Cogs, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = 1."
Cogs and the sale total are perfectly correlated: the tax is a flat 5% of Cogs, so Total = Cogs + Tax = 1.05 * Cogs.
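Because the relationship is an accounting identity rather than an empirical pattern, it can be verified directly:
# Total = Cogs + Tax, and Tax = 0.05 * Cogs, hence Total = 1.05 * Cogs
all.equal(numcdf$Total, numcdf$Cogs * 1.05)
all.equal(numcdf$Tax, numcdf$Cogs * 0.05)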
Sales Vs Gross Income
# Scatter plot and correlation
scatter.plt(Gross.income, Total, data = numcdf,
title = 'Sales Vs Gross Income')
correlation <- cor(numcdf$Gross.income, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = 1."
Gross income equals the tax component (5% of Cogs) in this data set, so it is also perfectly correlated with the sale total.
Sales Vs Rating
# Scatter plot and correlation
scatter.plt(Rating, Total, data = numcdf, title = 'Sales Vs Rating')
correlation <- cor(numcdf$Rating, numcdf$Total)
print(paste0('Correlation = ', correlation, '.'))
## [1] "Correlation = -0.0260916394610116."
There is virtually no linear relationship between sales and ratings (r ≈ -0.03).
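A formal test can back this up; cor.test reports the estimate together with a p-value and confidence interval:
# Test whether the Total-Rating correlation differs from zero
cor.test(numcdf$Rating, numcdf$Total)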
Sales Trend
# Line plot
line.plt <- function(col1, col2, data, title){
ggplot(data, aes(x = {{col1}}, y= {{col2}})) + geom_line(color = '#281E5D',
size = 1) +
ggtitle(paste(title, 'Line Plot')) +
theme(plot.title = element_text(hjust = 0.5))
}
# Conversion to date class
part12$Date <- as.Date(part12$Date)
# Trend
line.plt(data = part12, col1 = Date, col2 = Total,
title = 'Sales Trend')
Sales fluctuate sharply from day to day, so the transaction-level trend is noisy.
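Aggregating to daily totals would smooth the transaction-level noise; a sketch of this alternative view (not part of the original analysis):
# Summing all transactions per day before plotting the trend
daily <- aggregate(Total ~ Date, data = part12, FUN = sum)
line.plt(data = daily, col1 = Date, col2 = Total, title = 'Daily Sales Trend')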
# Bar plot for averaged y axis
bar.plt.summary <- function(data, col1, col2, title, legend, colors){
ggplot(data, mapping=aes(x= {{col1}}, y= {{col2}}, fill = {{col1}})) +
stat_summary(fun=mean, geom="bar") + ggtitle(paste(title, 'Bar Plot')) +
theme(plot.title = element_text(hjust = 0.5))+
scale_fill_manual(legend, values = colors)
}
Sales Vs Branch
# Bar plot
bar.plt.summary(part12,Branch, Total,'Sales Vs Branch', 'Branch',
c('A' = '#0276AB', 'B' = '#02557A', 'C' = '#013349'))
Branch C has the highest average sale total.
Sales Vs Customer Type
# Bar plot
bar.plt.summary(part12,Customer.type, Total,'Sales Vs Customer Type',
'Customer Type', c('Member' = '#0276AB', 'Normal' = '#013349'))
Member customers have a slightly higher average sale total than normal customers.
Sales Vs Gender
# Bar plot
bar.plt.summary(part12,Gender, Total,'Sales Vs Gender', 'Gender',
c('Male' = '#0276AB', 'Female' = '#02557A', 'C' = '#013349'))
Female customers have a higher average sale total than male customers.
Sales Vs Product Line
# Bar plot of total sales per product line
ggplot(part12.dates, mapping = aes(x = Product.line, y = Total, fill = Product.line))+
geom_bar(stat = 'identity') +
ggtitle(paste('Sales Vs Product Line Plot')) +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_manual('Product Line', values =
c('Sports and travel' = '#0276AB',
'Home and lifestyle' = '#026592',
'Health and beauty' = '#02557A',
'Food and beverages' = '#014462',
'Fashion accessories' = '#013349',
'Electronic accessories' = '#012231')) +
coord_flip()
Electronic accessories, fashion accessories, and sports and travel products have the highest sales contribution.
Sales Vs Payment
# Bar plot
bar.plt.summary(part12,Payment, Total,'Sales Vs Payment', 'Payment',
c('Cash' = '#0276AB', 'Credit card' = '#02557A',
'Ewallet' = '#013349'))
Most transactions are paid for in cash or by e-wallet.
Sales Vs Month
# Bar plot
part12.dates$Month <- as.factor(part12.dates$Month)
bar.plt.summary(part12.dates,Month, Total,'Sales Vs Month', 'Month',
c('1' = '#0276AB', '2' = '#02557A', '3' = '#013349'))
Average sales were highest in January and lower in the subsequent months.
Multivariate analysis will be carried out to analyze more complex combinations.
Correlation Matrix
# Visualize correlation matrix
# Removing the gross margin percentage as it only has one value.
numcdf <- numcdf[,!colnames(numcdf) %in% c('Gross.margin.percentage')]
ggcorrplot(cor(numcdf), lab = TRUE, title = 'Correlation Heatmap',
colors = c('#022D36', 'white', '#48AAAD'))
The Total-Unit price, Total-Quantity and Gross income-Unit price pairs are among the most highly correlated columns.
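The strongest pairs can also be listed explicitly rather than read off the heatmap; one possible sketch:
# Ranking variable pairs by absolute correlation
cmat <- cor(numcdf)
cmat[lower.tri(cmat, diag = TRUE)] <- NA # keep each pair once
pairs.df <- subset(as.data.frame(as.table(cmat)), !is.na(Freq))
head(pairs.df[order(-abs(pairs.df$Freq)), ], 5)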
# Stacked bar plot function
stacked.bar <- function(col1, col2, col3, data, title){
ggplot(data, aes( x = {{col1}}, y = {{col2}}, fill = {{col3}})) +
geom_bar(position='stack', stat='identity') +
scale_fill_viridis(discrete = T) +
ggtitle(paste(title, 'Bar Plot')) +
theme(plot.title = element_text(hjust = 0.5))
}
Sales Vs Gender Vs Branch
# Stacked bar plot
stacked.bar(Branch, Total, Gender, part12, 'Sales Vs Gender Vs Branch')
Branches A and B have similar gender distributions, while branch C has more female customers as well as the highest total sales.
Sales Vs Customer Type Vs Branch
# Stacked bar plot
stacked.bar(Branch, Total, Customer.type, part12,
'Sales Vs Customer Type Vs Branch')
Branch C has the highest total sales, and most of its sales come from members.
Sales Vs Payment Vs Branch
# Stacked bar plot
stacked.bar(Branch, Total, Payment, part12,
'Sales Vs Payment Vs Branch')
In branch A, e-wallet payments are the most popular; in branch B, credit card payments; and in branch C, cash payments.
Sales Vs Month Vs Branch
# Stacked bar plot
part12.dates$Month <- as.factor(part12.dates$Month)
stacked.bar(Month, Total, Branch, part12.dates,
'Sales Vs Month Vs Branch')
Branch A had the highest sales in January, and C the lowest. In February, sales dropped overall: branch A's sales plummeted, branch C's decreased slightly, and branch B's stayed fairly constant. In March, sales in branches A and C recovered, while branch B's remained constant.
For unsupervised learning, the main aim is to learn patterns from unlabeled data; therefore the Total column, the target of the data set, will be dropped.
Preparing the data set
# Removing the target column
part12.dates.nt <- data.frame(part12.dates)
part12.dates.nt <- part12.dates.nt[,!colnames(part12.dates.nt) %in% c('Total')]
# Data set preview
head(part12.dates.nt)
## Branch Customer.type Gender Product.line Unit.price Quantity
## 1 A Member Female Health and beauty 74.69 7
## 2 C Normal Female Electronic accessories 15.28 5
## 3 A Normal Male Home and lifestyle 46.33 7
## 4 A Member Male Health and beauty 58.22 8
## 5 A Normal Male Sports and travel 86.31 7
## 6 C Normal Male Electronic accessories 85.39 7
## Tax Payment Cogs Gross.margin.percentage Gross.income Rating Month
## 1 26.1415 Ewallet 522.83 4.761905 26.1415 9.1 1
## 2 3.8200 Cash 76.40 4.761905 3.8200 9.6 3
## 3 16.2155 Credit card 324.31 4.761905 16.2155 7.4 3
## 4 23.2880 Ewallet 465.76 4.761905 23.2880 8.4 1
## 5 30.2085 Ewallet 604.17 4.761905 30.2085 5.3 2
## 6 29.8865 Ewallet 597.73 4.761905 29.8865 4.1 3
## Day Hour Minute
## 1 5 13 8
## 2 8 10 29
## 3 3 13 23
## 4 27 20 33
## 5 8 10 37
## 6 25 18 30
Scaling and encoding are required; the date column has already been split into numeric components.
Scaling
# The gross margin percentage only contains one value, therefore, it cannot
# be scaled.
unique(part12[,c('Gross.margin.percentage')])
## [1] 4.761905
The column only contains a single value. This column does not provide relevant insights as it is constant for all records. Therefore, it will be removed.
# Manually scaling the numeric columns only; the categorical columns and any
# constant columns should not be re-scaled.
# Standard scaling.
part12.scaled <- data.frame(part12.dates.nt)
# Removing the gross margin percentage
part12.scaled <- part12.scaled[,-10]
num_df <- num_df[, !colnames(num_df) %in% c('Total')]
numdf <- num_df[,-5]
part12.scaled[, colnames(numdf)] <- scale(part12.scaled[, colnames(numdf)])
head(part12.scaled[, colnames(numdf)])
## Unit.price Quantity Tax Cogs Gross.income Rating
## 1 0.7372225 0.5269400 0.9775237 0.9775237 0.9775237 1.2349071
## 2 -1.5207153 -0.1615208 -0.9944267 -0.9944267 -0.9944267 1.5261200
## 3 -0.3406283 0.5269400 0.1006301 0.1006301 0.1006301 0.2447834
## 4 0.1112633 0.8711705 0.7254367 0.7254367 0.7254367 0.8272091
## 5 1.1788525 0.5269400 1.3368150 1.3368150 1.3368150 -0.9783107
## 6 1.1438870 0.5269400 1.3083685 1.3083685 1.3083685 -1.6772215
# Scaling the day, hour and minute columns.
part12.scaled[,(ncol(part12.scaled)-2):ncol(part12.scaled)] <-
scale(part12.scaled[,(ncol(part12.scaled)-2):ncol(part12.scaled)])
# Data set with target
part12.sc.with.target <- data.frame(part12.scaled)
part12.sc.with.target <- cbind(part12.sc.with.target, scale(part12.dates$Total))
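A quick sanity check that the standardization behaved as expected (means of approximately 0 and standard deviations of 1):
# Each scaled column should have mean ~0 and standard deviation 1
round(colMeans(part12.scaled[, colnames(numdf)]), 10)
sapply(part12.scaled[, colnames(numdf)], sd)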
Label Encoding
# Categorical
num <- unlist(lapply(part12.dates, is.numeric))
cat.df <- part12.scaled[,!num]
cat.df <- cat.df[,-ncol(cat.df)]
head(cat.df)
## Branch Customer.type Gender Product.line Payment
## 1 A Member Female Health and beauty Ewallet
## 2 C Normal Female Electronic accessories Cash
## 3 A Normal Male Home and lifestyle Credit card
## 4 A Member Male Health and beauty Ewallet
## 5 A Normal Male Sports and travel Ewallet
## 6 C Normal Male Electronic accessories Ewallet
# Data set copy
part12.encoded <- data.frame(part12.scaled)
# Label encoding
label <- LabelEncoder$new()
part12.encoded[,colnames(cat.df)] <- sapply(part12.encoded[,colnames(cat.df)],
label$fit_transform)
# Confirming that changes have been made.
head(part12.encoded[,colnames(cat.df)])
## Branch Customer.type Gender Product.line Payment
## 1 0 0 0 3 2
## 2 2 1 0 0 0
## 3 0 1 1 4 1
## 4 0 0 1 3 2
## 5 0 1 1 5 2
## 6 2 1 1 0 2
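For reference, the codes above follow alphabetical level order, so a plain base-R encoder (a hypothetical alternative, not used in the analysis) would reproduce them:
# Base-R label encoding: factor levels sort alphabetically, matching the
# 0-based codes shown above
encode <- function(x) as.integer(factor(x)) - 1L
head(encode(part12.scaled$Branch))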
# Ensuring that the Month column is in numeric format.
part12.encoded$Month <- as.numeric(part12.encoded$Month)
head(part12.encoded)
## Branch Customer.type Gender Product.line Unit.price Quantity Tax
## 1 0 0 0 3 0.7372225 0.5269400 0.9775237
## 2 2 1 0 0 -1.5207153 -0.1615208 -0.9944267
## 3 0 1 1 4 -0.3406283 0.5269400 0.1006301
## 4 0 0 1 3 0.1112633 0.8711705 0.7254367
## 5 0 1 1 5 1.1788525 0.5269400 1.3368150
## 6 2 1 1 0 1.1438870 0.5269400 1.3083685
## Payment Cogs Gross.income Rating Month Day Hour
## 1 2 0.9775237 0.9775237 1.2349071 1 -1.1812796 -0.5977934
## 2 0 -0.9944267 -0.9944267 1.5261200 3 -0.8364981 -1.5371379
## 3 1 0.1006301 0.1006301 0.2447834 3 -1.4111340 -0.5977934
## 4 2 0.7254367 0.7254367 0.8272091 1 1.3471180 1.5940104
## 5 2 1.3368150 1.3368150 -0.9783107 2 -0.8364981 -1.5371379
## 6 2 1.3083685 1.3083685 -1.6772215 3 1.1172637 0.9677808
## Minute
## 1 -1.309810186
## 2 -0.065983593
## 3 -0.421362620
## 4 0.170935758
## 5 0.407855109
## 6 -0.006753755
PCA
Computing PCA
# PCA on pre-scaled data
pca <- prcomp(part12.encoded)
Scree Plot
Visualizing the eigenvalues.
# Scree Plot
fviz_eig(pca, addlabels = TRUE, ylim = c(0,40))
The first component accounts for most of the variance in the data.
# PCA summary
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9784 1.7185 1.04798 1.03790 0.99331 0.97426 0.971
## Proportion of Variance 0.2699 0.2036 0.07573 0.07428 0.06803 0.06545 0.065
## Cumulative Proportion 0.2699 0.4735 0.54923 0.62351 0.69154 0.75699 0.822
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.85104 0.80829 0.79224 0.50451 0.48700 0.29121
## Proportion of Variance 0.04994 0.04505 0.04328 0.01755 0.01635 0.00585
## Cumulative Proportion 0.87193 0.91697 0.96025 0.97780 0.99415 1.00000
## PC14 PC15
## Standard deviation 4.411e-16 1.402e-17
## Proportion of Variance 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00
The first component explains 27% of the variance, and the second, PC2, a further 20%. Analysis will focus on these two PCs.
# Result of variables
var.result <- get_pca_var(pca)
var.result
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
Quality of Representation
# Total cos2 of variables on Dim.1 and Dim.2
fviz_cos2(pca, choice = "var", axes = 1:2)
The product line has the highest representation in the first two dimensions, followed by Cogs, Tax and Gross income; Quantity and Unit price have the lowest.
Contribution to the PCs
# Contributions of variables to PC1
fviz_contrib(pca, choice = "var", axes = 1)
# Contributions of variables to PC2
fviz_contrib(pca, choice = "var", axes = 2)
When only numeric columns are used
# Only selecting the numeric columns (dropping the factor columns, the
# constant Gross.margin.percentage and Month by position)
part12.num <- part12.dates.nt[,c(-1, -2, -3, -4, -8, -10, -13)]
part12.num$Quantity <- as.numeric(part12.num$Quantity)
head(part12.num)
## Unit.price Quantity Tax Cogs Gross.income Rating Day Hour Minute
## 1 74.69 7 26.1415 522.83 26.1415 9.1 5 13 8
## 2 15.28 5 3.8200 76.40 3.8200 9.6 8 10 29
## 3 46.33 7 16.2155 324.31 16.2155 7.4 3 13 23
## 4 58.22 8 23.2880 465.76 23.2880 8.4 27 20 33
## 5 86.31 7 30.2085 604.17 30.2085 5.3 8 10 37
## 6 85.39 7 29.8865 597.73 29.8865 4.1 25 18 30
PCA on numeric columns only
# Computing the PCA
pca.num <- prcomp(part12.num, scale= T)
# PCA summary
summary(pca.num)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9760 1.0415 1.0355 0.9883 0.9738 0.9635 0.29146
## Proportion of Variance 0.4339 0.1205 0.1192 0.1085 0.1054 0.1031 0.00944
## Cumulative Proportion 0.4339 0.5544 0.6735 0.7821 0.8874 0.9906 1.00000
## PC8 PC9
## Standard deviation 1.238e-16 6.592e-17
## Proportion of Variance 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00
PC1 accounts for 43% of the variance. A scree plot of the eigenvalues will be used to provide a better summary.
# Scree plot
fviz_eig(pca.num, addlabels = TRUE, ylim = c(0,50))
Quality of Representation
# Variable plot
fviz_pca_var(pca.num, repel = T,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
For the numeric-columns-only PCA, Cogs, Tax and Gross income have the highest contributions, while Hour, Minute and Rating have the lowest.
# Variables
var.result.num <- get_pca_var(pca.num)
# Correlation plot
corrplot(var.result.num$cos2, is.corr=FALSE)
From this plot, the Tax, Cogs and Gross income columns have the highest representation for Dim.1.
# Representation in the 1st and 2nd dimensions.
fviz_cos2(pca.num, choice = "var", axes = 1:2)
The columns with the highest representation are the Tax, Cogs and Gross income.
Variable Contribution
# Contribution in individual components
# PC1
fviz_contrib(pca.num, choice = "var", axes = 1)
# PC2
fviz_contrib(pca.num, choice = "var", axes = 2)
The individual observations are numerous, therefore the analysis will focus on the variables.
# Data set
head(part12.scaled)
## Branch Customer.type Gender Product.line Unit.price Quantity
## 1 A Member Female Health and beauty 0.7372225 0.5269400
## 2 C Normal Female Electronic accessories -1.5207153 -0.1615208
## 3 A Normal Male Home and lifestyle -0.3406283 0.5269400
## 4 A Member Male Health and beauty 0.1112633 0.8711705
## 5 A Normal Male Sports and travel 1.1788525 0.5269400
## 6 C Normal Male Electronic accessories 1.1438870 0.5269400
## Tax Payment Cogs Gross.income Rating Month Day
## 1 0.9775237 Ewallet 0.9775237 0.9775237 1.2349071 1 -1.1812796
## 2 -0.9944267 Cash -0.9944267 -0.9944267 1.5261200 3 -0.8364981
## 3 0.1006301 Credit card 0.1006301 0.1006301 0.2447834 3 -1.4111340
## 4 0.7254367 Ewallet 0.7254367 0.7254367 0.8272091 1 1.3471180
## 5 1.3368150 Ewallet 1.3368150 1.3368150 -0.9783107 2 -0.8364981
## 6 1.3083685 Ewallet 1.3083685 1.3083685 -1.6772215 3 1.1172637
## Hour Minute
## 1 -0.5977934 -1.309810186
## 2 -1.5371379 -0.065983593
## 3 -0.5977934 -0.421362620
## 4 1.5940104 0.170935758
## 5 -1.5371379 0.407855109
## 6 0.9677808 -0.006753755
Wrapper Method
# The clustvarsel function will be used to determine the optimal subset of
# variables.
# Forward-Backward
forward = clustvarsel(part12.encoded, search = c("greedy"),
direction = c("forward"), verbose = F)
# Result for the F-B approach
forward
## ------------------------------------------------------
## Variable selection for Gaussian model-based clustering
## Stepwise (forward/backward) greedy search
## ------------------------------------------------------
##
## Variable proposed Type of step BICclust Model G BICdiff Decision
## Quantity Add -2141.9736 E 9 683.1595 Accepted
## Tax Add 32938.0738 VEI 9 37246.7226 Accepted
## Unit.price Add -366.9583 EVV 9 -31968.0127 Rejected
## Tax Remove -2050.8575 E 9 37155.6065 Rejected
##
## Selected subset: Quantity, Tax
# Subset of selected variables
sub.forward <- part12.scaled[,forward$subset]
colnames(sub.forward)
## [1] "Quantity" "Tax"
Only the Quantity and Tax columns were selected.
Classification Plots
# Mclust
fwd = Mclust(sub.forward, verbose = F)
summary(fwd)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEI (diagonal, equal shape) model with 9 components:
##
## log-likelihood n df BIC ICL
## 5037.589 991 36 9826.825 9822.281
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 187 102 98 50 112 91 60 199 92
# Classification plot
plot(fwd, c("classification"), main = 'Forward-Backward Classification Result')
There are 9 clusters; cluster 8 has the most observations (199).
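The nine-component solution was selected by BIC; the model-selection curve can be inspected with the standard mclust plot:
# BIC values across candidate models and numbers of components
plot(fwd, what = 'BIC')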
Feature Ranking
# FSelectorRcpp - Information gain
# Selecting the entire data set
colnames(part12.sc.with.target)[ncol(part12.sc.with.target)] <- c('Total')
imp <- information_gain(Total ~ . , data = part12.sc.with.target, equal = T)
imp <- imp[order(imp$importance, decreasing = T), ]
head(imp)
## attributes importance
## 7 Tax 1.609435878
## 9 Cogs 1.609435878
## 10 Gross.income 1.609435878
## 6 Quantity 0.420954158
## 5 Unit.price 0.311023067
## 4 Product.line 0.007238866
The Tax, Cogs and Gross income have the highest feature importance.
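FSelectorRcpp also offers gain ratio, which penalizes attributes with many distinct values; a sketch of the alternative ranking (same call, different type argument):
# Gain-ratio variant of the same feature ranking
imp.gr <- information_gain(Total ~ ., data = part12.sc.with.target,
                           type = 'gainratio', equal = TRUE)
head(imp.gr[order(imp.gr$importance, decreasing = TRUE), ])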
Filter Method
# Visualize correlation matrix
corrplot(cor(num_df))
Multiple columns have high correlations.
# Removing column with a single value - zero standard deviation
num_df <- num_df[, !colnames(num_df) %in% c('Gross.margin.percentage')]
# Correlation matrix.
cm <- cor(num_df)
# Highly correlated attributes.
high.corr <- findCorrelation(cm, cutoff=0.6)
names(num_df[,high.corr])
## [1] "Tax" "Cogs" "Gross.income"
These columns have pairwise Pearson correlation coefficients of at least 0.6 and are flagged as redundant.
# Removing redundant features
num.df.select <- num_df[-high.corr]
# Performing our graphical comparison
# ---
#
par(mfrow = c(1, 2))
corrplot(cm, order = "hclust")
corrplot(cor(num.df.select), order = "hclust")
The rating, unit price and quantity remain.
Analysis Summary
Univariate Analysis
Categorical:
* Branch A has the most transactions and C the fewest, though the differences are minimal.
* Member customers are slightly more frequent than normal customers, and male customers slightly more frequent than female.
* Food and beverages and fashion accessories are the most frequent product lines.
* Cash and e-wallet are the most common payment methods.
* January and March had the most transactions; February had the fewest.
Numerical:
* The average unit price is 55.
* The average quantity is 5.5.
* The average tax is 15.
* The average cogs is 301.53.
* The average gross margin percentage is 4.8; the measures of dispersion show the column contains only one value.
* The average gross income is 15.
* The average rating is 7.
* The average sale total, including tax, is 316.
* Sales generally decrease towards the end of the month, apart from between the 22nd and the 26th.
* Most sales occurred in the morning hours.
Bivariate Analysis
Categorical-Categorical:
* Branch A has the most member and normal customers; branches A and B skew male while branch C skews female; cash and e-wallet lead across branches.
Numerical-Numerical:
* Total is perfectly correlated with Cogs, Tax and Gross income, moderately correlated with Unit price and Quantity, and uncorrelated with Rating.
Numerical-Categorical:
* Branch C, member customers and female customers have the highest average sale totals; average sales were highest in January.
Multivariate
PCA
PC1 explains 27% of the variance; PC2 explains 20.4%.
Quality of Representation
* Full data set: the product line has the highest representation, followed by Cogs, Tax and Gross income; Quantity and Unit price have the lowest.
* Numeric columns only: Tax, Cogs and Gross income have the highest representation.
Feature Selection
* Wrapper method - only Quantity and Tax were selected.
* Feature ranking - Tax, Cogs and Gross income have the highest feature importance.
* Filter method - Rating, Unit price and Quantity remain after removing redundant columns.
Overall, Tax, Cogs and Gross income are the variables with the highest importance in determining sales.
PCA and other feature selection methods were used to determine the most important variables.
In conclusion:
* Descriptive analysis: the branch, gender, payment method and month of sale greatly affected the sales recorded by the mall.
* PCA and feature analysis: Tax, Cogs and Gross income have the highest importance in determining sales.
Recommended follow-up work:
* An analysis into the reasons behind the overall lagging sales in branches A and B.
* An analysis into the differing popularity of payment methods across branches, to improve customer service.
* An analysis into the drop in sales in branches A and C in February, and why branch B was not affected.
* An analysis, informed by business domain knowledge, of how Tax, Cogs and Gross income can be used to plan and improve sales and marketing strategies.
* Collection of more data, with more variables, to give a deeper understanding of customer groups, sales trends and branch performance.
Yes, we have the right data: it was provided by the client for the requested analysis, and meaningful insights were extracted from it.
Yes, more in-depth data, as well as more of it, would help in gaining a deeper understanding of the sales trend, the different customer groups, and the performance of the branches.
Yes, we have the right question, as this analysis was done as per the client's request.
# Restoring the default warning behaviour
options(warn = defaultW)