1. Research question

As Data analysts at Carrefour Kenya and we are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax). The project has been divided into two parts. The first is performing dimensionality reduction using PCA and feature selection.

2. Success Criteria

The most important features that will help inform Carrefour marketing strategies are identified.

3. Research Methodology

4. Understanding Data provided

The following are the attributes in the data site provided:

Invoice.ID- The unique identifier of the invoice of goods sold Branch - The branch of Carrefour supermarket the sale was recorded Customer.type - The type of customer Gender - Gender of customer Product.line - The line of product being sold(e.g groceries/electronics) Unit.price - The price per unit of the item Quantity - The quantity of goods sold Tax - The tax of the sale Date - The date the sell was recorded Time - The time the sale was recorded Payment - How the customer payed for the goods cogs - Cost of goods sold gross.margin.percentage - The percentage profit margin of goods sold gross.income - The amount of income earned from the sell Rating - The rating of the item sold Total - Total amount earned from the sale

Loading libraries

# loading libraries
#
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(reshape2)
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble  3.1.7     ✔ purrr   0.3.4
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(stringr)
library(ggplot2)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
library(corrplot)
## corrplot 0.92 loaded
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

loading data

# Loading dataset
#
sales_dataset <- read.csv("http://bit.ly/CarreFourDataset")

Previewing dataset

# previewing the first six records of dataset
#
head(sales_dataset)
##    Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
## 3 631-41-3108      A        Normal   Male     Home and lifestyle      46.33
## 4 123-19-1176      A        Member   Male      Health and beauty      58.22
## 5 373-73-7910      A        Normal   Male      Sports and travel      86.31
## 6 699-14-3026      C        Normal   Male Electronic accessories      85.39
##   Quantity     Tax      Date  Time     Payment   cogs gross.margin.percentage
## 1        7 26.1415  1/5/2019 13:08     Ewallet 522.83                4.761905
## 2        5  3.8200  3/8/2019 10:29        Cash  76.40                4.761905
## 3        7 16.2155  3/3/2019 13:23 Credit card 324.31                4.761905
## 4        8 23.2880 1/27/2019 20:33     Ewallet 465.76                4.761905
## 5        7 30.2085  2/8/2019 10:37     Ewallet 604.17                4.761905
## 6        7 29.8865 3/25/2019 18:30     Ewallet 597.73                4.761905
##   gross.income Rating    Total
## 1      26.1415    9.1 548.9715
## 2       3.8200    9.6  80.2200
## 3      16.2155    7.4 340.5255
## 4      23.2880    8.4 489.0480
## 5      30.2085    5.3 634.3785
## 6      29.8865    4.1 627.6165
# Dimensions of dataset
#
dim(sales_dataset)
## [1] 1000   16

The dataset has 1000 records and 16 columns

# Preview dataset structure
# 
str(sales_dataset)
## 'data.frame':    1000 obs. of  16 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "A" "C" "A" "A" ...
##  $ Customer.type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...

The dataset date time, continuous and categorical varaiables The datetime and categorical variables have inappropiate datatypes

# Preview the type of dataset
#
class(sales_dataset)
## [1] "data.frame"

The data set is a data frame

Validating the dataset

Since the data set is provided by the client, it is assumed to be accurate hence no need to validate it further

Data cleaning

Data validity

The following categorical variables are listed as characters instead of factors; - Branch, Customer.type, Gender, Product.line, and Payment Datetime data; - Date, and time.

The catagorical variables shall be converted to factors while the date and time shall be converted to date time

# specifying categorical variables
#
categ_cols <- c("Branch", "Customer.type",  "Gender", "Product.line", "Payment")

# Converting the categorical columns to factors
#
sales_dataset[,categ_cols] <- lapply(sales_dataset[,categ_cols] , factor)
# adding ss to the time to change format format from HH:MM to HH:MM:SS
#
sales_dataset <- sales_dataset %>% 
  mutate(Time=sprintf("%s.%02d",Time, row_number(Time)*0))
# Merging date and time column to create datetime column
#
sales_dataset$datetime <- with(sales_dataset, as.POSIXct(paste(as.Date(Date, format="%m/%d/%y"), Time)))
# previewing the sturcture of dataset
#
str(sales_dataset)
## 'data.frame':    1000 obs. of  17 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
##  $ Customer.type          : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
##  $ Gender                 : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
##  $ Product.line           : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08.00" "10:29.00" "13:23.00" "20:33.00" ...
##  $ Payment                : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...
##  $ datetime               : POSIXct, format: "2020-01-05 13:08:00" "2020-03-08 10:29:00" ...

All columns now have appropriate columns and the new datetime column has been successfully created and appropriately filled.

Checking for missing values

# Checking the number of missing values per column in the data set
#
colSums(is.na(sales_dataset))
##              Invoice.ID                  Branch           Customer.type 
##                       0                       0                       0 
##                  Gender            Product.line              Unit.price 
##                       0                       0                       0 
##                Quantity                     Tax                    Date 
##                       0                       0                       0 
##                    Time                 Payment                    cogs 
##                       0                       0                       0 
## gross.margin.percentage            gross.income                  Rating 
##                       0                       0                       0 
##                   Total                datetime 
##                       0                       0

The dataset has no null values

Duplicate values

# finding the duplicated rows in the data set and assign to a variable duplicated_rows below
#
duplicated_rows = sales_dataset[duplicated(sales_dataset),]

# Printing out the duplicated rows
duplicated_rows
##  [1] Invoice.ID              Branch                  Customer.type          
##  [4] Gender                  Product.line            Unit.price             
##  [7] Quantity                Tax                     Date                   
## [10] Time                    Payment                 cogs                   
## [13] gross.margin.percentage gross.income            Rating                 
## [16] Total                   datetime               
## <0 rows> (or 0-length row.names)

The dataset has no duplicate values

Checking for outliers

# number of rows in data frame
#
num_rows = nrow(sales_dataset)
  
# creating ID column vector
#
ID <- c(1:num_rows)
 
# binding id column to the data frame
#
sales_dataset1 <- cbind(ID , sales_dataset)
# Applying names function to get column names from numeric columns in dataset
# as a list
#
colnames <- names(select_if(sales_dataset1, is.numeric))

# Print vector of column names
#
colnames    
## [1] "ID"                      "Unit.price"             
## [3] "Quantity"                "Tax"                    
## [5] "cogs"                    "gross.margin.percentage"
## [7] "gross.income"            "Rating"                 
## [9] "Total"
# creating the modified data frame
#
mod_df<- melt(sales_dataset1, id.vars='ID',
                  measure.vars=c("cogs", "Total"))

# creating a plot of area income
#
p <- ggplot(mod_df) +
geom_boxplot(aes(x=ID, y=value, color=variable))
  
# printing the plot
#
print(p)

The cogs and total variables have outliers

# creating the modified data frame
#
mod_df<- melt(sales_dataset1, id.vars='ID',
                  measure.vars=c("Unit.price", "Quantity", "Tax","gross.income"))

# creating a plot of area income
#
p <- ggplot(mod_df) +
geom_boxplot(aes(x=ID, y=value, color=variable))
  
# printing the plot
#
print(p)

The tax and gross income have outliers

# creating the modified data frame
#
mod_df<- melt(sales_dataset1, id.vars='ID',
                  measure.vars=c("gross.margin.percentage"))

# creating a plot of area income
#
p <- ggplot(mod_df) +
geom_boxplot(aes(x=ID, y=value, color=variable))

# creating the modified data frame
#
mod_df<- melt(sales_dataset1, id.vars='ID',
                  measure.vars=c("Rating"))

# creating a plot of area income
#
p1 <- ggplot(mod_df) +
geom_boxplot(aes(x=ID, y=value, color=variable))
  
# printing the plot
#
print(p)

print(p1)

There are no outliers in the gross margin percentage and rating. The outliers in the other variables shall be retained due to the small number of records in the data set.

Feature engineering

Here we aim to extract the day, month, year, hour and minute for further analysis

# Extracting the year from the time stamp
#
sales_dataset$year <- format (as.Date(sales_dataset$datetime, format="%d/%m/%Y"),"%Y")

# Extracting the month from the time stamp
#
sales_dataset$month <- format (as.Date(sales_dataset$datetime, format="%d/%m/%Y"),"%m")

# Convert Date to Weekday in R (weekdays Function)
# 
sales_dataset$day <- weekdays(sales_dataset$datetime)

# Extracting the hour of day from the time stamp
#
sales_dataset$hour <- format (as.POSIXct(sales_dataset$datetime, format="%H:%M:%S"),"%H")

# Extracting the minute of day from the time stamp
#
sales_dataset$minute <- format (as.POSIXct(sales_dataset$datetime, format="%H:%M:%S"),"%M")

# Previewing the first six records to see the extracted variables
#
head(sales_dataset)
##    Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
## 3 631-41-3108      A        Normal   Male     Home and lifestyle      46.33
## 4 123-19-1176      A        Member   Male      Health and beauty      58.22
## 5 373-73-7910      A        Normal   Male      Sports and travel      86.31
## 6 699-14-3026      C        Normal   Male Electronic accessories      85.39
##   Quantity     Tax      Date     Time     Payment   cogs
## 1        7 26.1415  1/5/2019 13:08.00     Ewallet 522.83
## 2        5  3.8200  3/8/2019 10:29.00        Cash  76.40
## 3        7 16.2155  3/3/2019 13:23.00 Credit card 324.31
## 4        8 23.2880 1/27/2019 20:33.00     Ewallet 465.76
## 5        7 30.2085  2/8/2019 10:37.00     Ewallet 604.17
## 6        7 29.8865 3/25/2019 18:30.00     Ewallet 597.73
##   gross.margin.percentage gross.income Rating    Total            datetime year
## 1                4.761905      26.1415    9.1 548.9715 2020-01-05 13:08:00 2020
## 2                4.761905       3.8200    9.6  80.2200 2020-03-08 10:29:00 2020
## 3                4.761905      16.2155    7.4 340.5255 2020-03-03 13:23:00 2020
## 4                4.761905      23.2880    8.4 489.0480 2020-01-27 20:33:00 2020
## 5                4.761905      30.2085    5.3 634.3785 2020-02-08 10:37:00 2020
## 6                4.761905      29.8865    4.1 627.6165 2020-03-25 18:30:00 2020
##   month       day hour minute
## 1    01    Sunday   13     08
## 2    03    Sunday   10     29
## 3    03   Tuesday   13     23
## 4    01    Monday   20     33
## 5    02  Saturday   10     37
## 6    03 Wednesday   18     30
# preview the data type of created variables(year, month, day, minute, hour)
#
# Converting the attribute year minute and day from character to integer
#
as.integer(sales_dataset$year) -> sales_dataset$year
as.integer(sales_dataset$minute) -> sales_dataset$minute

# Converting continent, hour, month, and weekday from character data type
# to integer data type
#
cols <- c("hour", "day", "month")

# Applying factor conversion
#
sales_dataset[cols] <- lapply(sales_dataset[cols], factor) 

# preview the dataset structure after changes to datatypes
#
str(sales_dataset)
## 'data.frame':    1000 obs. of  22 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
##  $ Customer.type          : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
##  $ Gender                 : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
##  $ Product.line           : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08.00" "10:29.00" "13:23.00" "20:33.00" ...
##  $ Payment                : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...
##  $ datetime               : POSIXct, format: "2020-01-05 13:08:00" "2020-03-08 10:29:00" ...
##  $ year                   : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ month                  : Factor w/ 3 levels "01","02","03": 1 3 3 1 2 3 2 2 1 2 ...
##  $ day                    : Factor w/ 7 levels "Friday","Monday",..: 4 4 6 2 3 7 6 2 1 5 ...
##  $ hour                   : Factor w/ 11 levels "10","11","12",..: 4 1 4 11 1 9 5 2 8 4 ...
##  $ minute                 : int  8 29 23 33 37 30 36 38 15 27 ...

All variables have appropriate data types. The total is a summation of cost of goods sold and tax and tax. The gross income is obtained from subtracting the total from cost of goods sold before factoring in other expenses such as tax. These shall also be dropped since they all can be represented by the total.

# creating copy of data set before dropping unwanted columns
#
sales_org <- sales_dataset
# Dropping the date and time columns
#
drops <- c("Date","Time", "gross.income", "Tax", "cogs")
sales_dataset <- sales_dataset[ , !(names(sales_dataset) %in% drops)]

# previewing first six records of dataframe
#
head(sales_dataset)
##    Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
## 3 631-41-3108      A        Normal   Male     Home and lifestyle      46.33
## 4 123-19-1176      A        Member   Male      Health and beauty      58.22
## 5 373-73-7910      A        Normal   Male      Sports and travel      86.31
## 6 699-14-3026      C        Normal   Male Electronic accessories      85.39
##   Quantity     Payment gross.margin.percentage Rating    Total
## 1        7     Ewallet                4.761905    9.1 548.9715
## 2        5        Cash                4.761905    9.6  80.2200
## 3        7 Credit card                4.761905    7.4 340.5255
## 4        8     Ewallet                4.761905    8.4 489.0480
## 5        7     Ewallet                4.761905    5.3 634.3785
## 6        7     Ewallet                4.761905    4.1 627.6165
##              datetime year month       day hour minute
## 1 2020-01-05 13:08:00 2020    01    Sunday   13      8
## 2 2020-03-08 10:29:00 2020    03    Sunday   10     29
## 3 2020-03-03 13:23:00 2020    03   Tuesday   13     23
## 4 2020-01-27 20:33:00 2020    01    Monday   20     33
## 5 2020-02-08 10:37:00 2020    02  Saturday   10     37
## 6 2020-03-25 18:30:00 2020    03 Wednesday   18     30

All redundant columns have been successfully dropped.

Univariate analysis

numerical analysis

# Specifying numeric columns
#
cols <- c("Unit.price", "Quantity",  "gross.margin.percentage", "minute", "year", "Rating", "Total")

# Creating dataset with numerical variables
#
num_dataset <- sales_dataset[cols]

# Describing the continuous variables in the dataset
#
describe(num_dataset)
## num_dataset 
## 
##  7  Variables      1000  Observations
## --------------------------------------------------------------------------------
## Unit.price 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      943        1    55.67     30.6    15.28    19.31 
##      .25      .50      .75      .90      .95 
##    32.88    55.23    77.94    93.12    97.22 
## 
## lowest : 10.08 10.13 10.16 10.17 10.18, highest: 99.82 99.83 99.89 99.92 99.96
## --------------------------------------------------------------------------------
## Quantity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       10     0.99     5.51     3.36        1        1 
##      .25      .50      .75      .90      .95 
##        3        5        8       10       10 
## 
## lowest :  1  2  3  4  5, highest:  6  7  8  9 10
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency    112    91    90   109   102    98   102    85    92   119
## Proportion 0.112 0.091 0.090 0.109 0.102 0.098 0.102 0.085 0.092 0.119
## --------------------------------------------------------------------------------
## gross.margin.percentage 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0    4.762        0 
##                    
## Value      4.761905
## Frequency      1000
## Proportion        1
## --------------------------------------------------------------------------------
## minute 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       60        1     30.1    19.48        3        6 
##      .25      .50      .75      .90      .95 
##       16       30       44       53       56 
## 
## lowest :  0  1  2  3  4, highest: 55 56 57 58 59
## --------------------------------------------------------------------------------
## year 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0     2020        0 
##                
## Value      2020
## Frequency  1000
## Proportion    1
## --------------------------------------------------------------------------------
## Rating 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       61        1    6.973    1.985    4.295    4.500 
##      .25      .50      .75      .90      .95 
##    5.500    7.000    8.500    9.400    9.700 
## 
## lowest :  4.0  4.1  4.2  4.3  4.4, highest:  9.6  9.7  9.8  9.9 10.0
## --------------------------------------------------------------------------------
## Total 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      990        1      323    270.7    41.07    68.10 
##      .25      .50      .75      .90      .95 
##   124.42   253.85   471.35   718.91   822.50 
## 
## lowest :   10.6785   12.6945   13.1670   13.4190   14.6790
## highest: 1022.4900 1023.7500 1034.4600 1039.2900 1042.6500
## --------------------------------------------------------------------------------

From the descriptive statistics, gross margin percentage had zero variance from the mean. All other variables had large deviations relative to their means. The total had the largest skew(did not follow normal distribution) The other variables(unit price, quantity, and rating had a gausian distribution) All variables had positive skews and negative kurtosis

# histogram representations of numeric variables
#
hist.data.frame(sales_dataset[cols])

From the data it is clear that tax, gross income, cogs and total are all skewed to the left. The others seem to have a relatively normal distribution

Non-Numerical analysis

Here, the categorical data shall be visualized to get insights into them.

# plotting the branches 
#
ggplot(sales_dataset, aes(x = Branch)) +
    geom_bar(fill = "coral") +
    theme_classic()

From the bar plot Carrefour Kenya has 3 branches A, B, C.

#  Plotting the customer type
#
ggplot(sales_dataset, aes(x = Customer.type)) +
    geom_bar(fill = "coral") +
    theme_classic()

Carrefour Kenya has two customer types: Member and Normal

#  Plotting the customer type
#
ggplot(sales_dataset, aes(x = Gender)) +
    geom_bar(fill = "coral") +
    theme_classic()

Carrefour Kenya customers are segmented to two genders, Male and Female

#  Plotting the customer type
#
ggplot(sales_dataset, aes(x = Product.line)) +
    geom_bar(fill = "coral") +
    theme_classic()

Carrefour Kenya has six product lines. fashion accessories is the highest seller.

#  Plotting the customer type
#
ggplot(sales_dataset, aes(x = Payment)) +
    geom_bar(fill = "coral") +
    theme_classic()

Carrefour Kenya takes three payment methods: Cash, Credit card and Ewallet

#  Plotting the customer type
#
ggplot(sales_dataset, aes(x = month)) +
    geom_bar(fill = "coral") +
    theme_classic()

The data was sourced for three months(January, february, March)

#  Plotting the customer type
#
ggplot(sales_dataset, aes(x = hour)) +
    geom_bar(fill = "coral") +
    theme_classic()

The data was collected from the 10th to the 20th hour.

#  Plotting the customer type
#
ggplot(sales_dataset, aes(x = day)) +
    geom_bar(fill = "coral") +
    theme_classic()

The data was collected for seven days Monday to Thursday. Thursday seemed to have the highest level of activity. After extracting the day, month, hour, and minute variables the datetime column is redundant hence will be dropped.

From the uni variate analysis there are only three months and a year has 12 months. Hence these is not representative of the entire year. This will be dropped also. The hours for transaction ranged from hour 10 to hour 20. This can be representative of a typical working day in the supermarket hence shall be retained. The minute a transaction is made has a minimal impact in analyzing retail sales hence shall be dropped The are seven days in a week and all represented in a data set. In Specific days high transaction levels can be observed hence this variables shall be retained.

# dropping datetime column
#
drops <- c("datetime", "minute", "month")
sales_dataset <- sales_dataset[ , !(names(sales_dataset) %in% drops)]

Bivariate analysis

Covariance

# Specifying numeric columns
#
cols <- c("Unit.price", "Quantity",  "gross.margin.percentage", "year", "Rating", "Total")

# Create Covariance matrix of the numerical variables in dataset
#
cov(sales_dataset[cols])
##                           Unit.price     Quantity gross.margin.percentage year
## Unit.price               701.9653313   0.83477848                       0    0
## Quantity                   0.8347785   8.54644645                       0    0
## gross.margin.percentage    0.0000000   0.00000000                       0    0
## year                       0.0000000   0.00000000                       0    0
## Rating                    -0.3996675  -0.07945646                       0    0
## Total                   4130.0351420 507.14097799                       0    0
##                               Rating       Total
## Unit.price               -0.39966752  4130.03514
## Quantity                 -0.07945646   507.14098
## gross.margin.percentage   0.00000000     0.00000
## year                      0.00000000     0.00000
## Rating                    2.95351823   -15.39931
## Total                   -15.39930581 60459.59802

From the covariance matrix, minute and rating varied negatively relative to the other variables in dataset

Correlation

# Correlation matrix of numerical data in the Carefour dataset
#
corr <- cor(sales_dataset[cols])
## Warning in cor(sales_dataset[cols]): the standard deviation is zero
# round corr to two
#
round(corr, 2)
##                         Unit.price Quantity gross.margin.percentage year Rating
## Unit.price                    1.00     0.01                      NA   NA  -0.01
## Quantity                      0.01     1.00                      NA   NA  -0.02
## gross.margin.percentage         NA       NA                       1   NA     NA
## year                            NA       NA                      NA    1     NA
## Rating                       -0.01    -0.02                      NA   NA   1.00
## Total                         0.63     0.71                      NA   NA  -0.04
##                         Total
## Unit.price               0.63
## Quantity                 0.71
## gross.margin.percentage    NA
## year                       NA
## Rating                  -0.04
## Total                    1.00

Year and gross margin percentages were constants hence had null correlations with other variables. These shall be dropped since they provide no information to aid the analysis A correlation plot shall be plotted to visualize this further.

# dropping the year and gross margin percentages
#
drops <- c("gross.margin.percentage","year")
sales_dataset <- sales_dataset[ , !(names(sales_dataset) %in% drops)]
# specifying the columns 
#
cols <- c("Unit.price", "Quantity",  "Rating", "Total")

# specifying numeric variables dataset after dropping redundant columns
#
num_dataset <- sales_dataset[cols]

# correlation plot of remaining numeric variables
#
rcorr(as.matrix(num_dataset),type="pearson")
##            Unit.price Quantity Rating Total
## Unit.price       1.00     0.01  -0.01  0.63
## Quantity         0.01     1.00  -0.02  0.71
## Rating          -0.01    -0.02   1.00 -0.04
## Total            0.63     0.71  -0.04  1.00
## 
## n= 1000 
## 
## 
## P
##            Unit.price Quantity Rating Total 
## Unit.price            0.7336   0.7816 0.0000
## Quantity   0.7336              0.6174 0.0000
## Rating     0.7816     0.6174          0.2496
## Total      0.0000     0.0000   0.2496

From the correlation matrix above, From the data above; quantity, unit price had a high (>0.5) positive correlation with total. The p values (0.00) obtained for these strong positive correlations were less than the confidence interval (0.05). These implied that the strong correlation between these variables was statistically significant.

# Visualizing the correlations
#
corrplot(corr)

Visualize the correlations in a pair plot

# Creating a regression line function
#
upper_panel_regression_line = function(x,y, ...){
  points(x,y,...)
  linear_regression = lm(y~x)
  linear_regression_line = abline(linear_regression)
}

# pair plot of numerical variables
#
pairs(num_dataset, lower.panel=NULL, upper.panel = 
        upper_panel_regression_line)

From the pair plot, the variables with strong positive correlations had linear relationship

Preleminary test to check the test assumptions

Is the covariation linear? Yes, for variables with strong linear correlation the variables form the plot above, have a linear relationship. Are the data from each of the numeric variables follow a normal distribution?

Use Shapiro-Wilk normality test –> R function: shapiro.test() Shapiro-Wilk test shall be performed as follow:

Null hypothesis: the data are normally distributed Alternative hypothesis: the data are not normally distributed

# shapiro-wilk test for normality for numeric variables
#
apply(num_dataset,2,shapiro.test)
## $Unit.price
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.95188, p-value < 2.2e-16
## 
## 
## $Quantity
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.93258, p-value < 2.2e-16
## 
## 
## $Rating
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.9582, p-value = 2.688e-16
## 
## 
## $Total
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.90876, p-value < 2.2e-16

From the test; The “Unit.price”, “Quantity”, “Rating”, and “Total” all had p-values less than the confidence level 0.05. The test was statically significant for all numeric variables. We reject the null hypothesis and conclude that all numeric variables in the data set are not normally distributed.

Since the data set has a large number of highly correlated variables, Principal component analysis shall be performed on the data set to reduce dimensions.

Implementing the solution

Dimension reduction

The data set has a large number of variables that are highly correlated. Hence to reduce the dimensions PCA shall be applied. However the data set also has categorical variables. PCA maximizes the projected inertia of all the columns (fairly distributed) over the given component.

Directly applying the PCA over a table containing the one-hot encoded modalities, the inertia given to a categorical variable would inherently depend on the number of modalities available to the variable, and on the probabilities of these modalities. As a result, it would be impossible to give a similar weight to all the initial variables over the calculated components.

Hence we are going to apply Factor analysis for mixed data (FAMD). It can be seen roughly as a mixed between PCA and MCA. It works by scaling the continuous variables to unit variance and categorical variables are transformed into a disjunctive data table and then scaled using specific scaling of MCA.

FAMD does not require numeric data hence the dataset shall not be encoded

Application of FAMD

# Moving the target variable total to the end of the data frame
#
sales_dataset <- sales_dataset %>% relocate(Total, .after = last_col())

# Preview dataframe
head(sales_dataset)
##    Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
## 3 631-41-3108      A        Normal   Male     Home and lifestyle      46.33
## 4 123-19-1176      A        Member   Male      Health and beauty      58.22
## 5 373-73-7910      A        Normal   Male      Sports and travel      86.31
## 6 699-14-3026      C        Normal   Male Electronic accessories      85.39
##   Quantity     Payment Rating       day hour    Total
## 1        7     Ewallet    9.1    Sunday   13 548.9715
## 2        5        Cash    9.6    Sunday   10  80.2200
## 3        7 Credit card    7.4   Tuesday   13 340.5255
## 4        8     Ewallet    8.4    Monday   20 489.0480
## 5        7     Ewallet    5.3  Saturday   10 634.3785
## 6        7     Ewallet    4.1 Wednesday   18 627.6165

The total variable has been successfully moved. In developing a marketing strategy the minute a transaction is made is less relevant in the making of the decision

# The function FAMD() [FactoMiner package] can be used to compute FAMD. 
# Format
res.famd <- FAMD(sales_dataset[2:11], # Selecting the variables excluding 
# invoice.id, minute and the target variable Total for the study
                 sup.var = NULL,  
                 graph = FALSE, 
                 ncp=20)

base : a data frame with n rows (individuals) and p columns (variables). ncp: the number of dimensions kept in the results (by default 5) sup.var: a vector indicating the indexes of the supplementary variables. ind.sup: a vector indicating the indexes of the supplementary individuals. graph : a logical value. If TRUE a graph is displayed.

Visualization and interpretation

# Inspect principal components
#
eig.val <- get_eigenvalue(res.famd)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   1.3018802         4.339601                    4.339601
## Dim.2   1.2622489         4.207496                    8.547097
## Dim.3   1.2321415         4.107138                   12.654235
## Dim.4   1.1841100         3.947033                   16.601268
## Dim.5   1.1526844         3.842281                   20.443550
## Dim.6   1.1471645         3.823882                   24.267431
## Dim.7   1.1310607         3.770202                   28.037634
## Dim.8   1.1090969         3.696990                   31.734623
## Dim.9   1.0965362         3.655121                   35.389744
## Dim.10  1.0742065         3.580688                   38.970432
## Dim.11  1.0615482         3.538494                   42.508926
## Dim.12  1.0519132         3.506377                   46.015303
## Dim.13  1.0425172         3.475057                   49.490361
## Dim.14  1.0301471         3.433824                   52.924184
## Dim.15  1.0060851         3.353617                   56.277801
## Dim.16  1.0023917         3.341306                   59.619107
## Dim.17  0.9823694         3.274565                   62.893672
## Dim.18  0.9610530         3.203510                   66.097182
## Dim.19  0.9488909         3.162970                   69.260151
## Dim.20  0.9187747         3.062582                   72.322734
# visualize the contributions of the different dimensions to analysis
#
fviz_screeplot(res.famd)

Graph of variables

# Variable results
#
var <- get_famd_var(res.famd)
var
## FAMD results for variables 
##  ===================================================
##   Name       Description                      
## 1 "$coord"   "Coordinates"                    
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
# Coordinates of variables
head(var$coord)
##                      Dim.1        Dim.2      Dim.3        Dim.4       Dim.5
## Unit.price    0.0158778569 0.0597587936 0.05501640 0.1559953279 0.036112088
## Quantity      0.0748252292 0.0005448101 0.05424777 0.0023011124 0.030965523
## Rating        0.0003515504 0.0904379573 0.06856060 0.0242755683 0.018505520
## Branch        0.0488628791 0.1782762956 0.27510590 0.1199921074 0.084529222
## Customer.type 0.0109090802 0.0121764594 0.11967109 0.0001948876 0.005586064
## Gender        0.1749371668 0.1779455023 0.03027944 0.0069758429 0.040675881
##                    Dim.6        Dim.7        Dim.8        Dim.9       Dim.10
## Unit.price    0.03792533 7.848787e-05 0.0009004472 0.0558549327 4.317397e-06
## Quantity      0.02500569 5.448624e-02 0.1184833335 0.1226003780 8.313675e-06
## Rating        0.01226721 1.250436e-02 0.0553616055 0.0527799297 1.345477e-02
## Branch        0.04539697 4.188296e-02 0.0135782524 0.0379092688 1.080490e-01
## Customer.type 0.09877340 1.595633e-01 0.0152433143 0.0002501625 3.304697e-02
## Gender        0.01674923 2.039167e-02 0.0174026216 0.0031910191 3.157402e-02
##                     Dim.11       Dim.12       Dim.13       Dim.14       Dim.15
## Unit.price    2.689639e-05 5.552713e-05 4.653004e-05 0.0347934636 7.456146e-02
## Quantity      1.762724e-04 1.569750e-02 1.301944e-02 0.0289643300 1.597997e-05
## Rating        2.594583e-02 1.360725e-02 6.241517e-02 0.0058297693 6.813763e-02
## Branch        2.307060e-02 4.755874e-02 2.288884e-02 0.0485671073 4.013904e-02
## Customer.type 5.523017e-04 4.530930e-02 1.905288e-04 0.0001792863 5.489751e-02
## Gender        5.676415e-03 2.078372e-02 1.380556e-02 0.0036613491 3.614168e-04
##                     Dim.16      Dim.17       Dim.18      Dim.19       Dim.20
## Unit.price    0.0669765591 0.002145952 0.0006583885 0.001998380 0.0075438747
## Quantity      0.0000624302 0.005331724 0.0270469512 0.030969840 0.1686727122
## Rating        0.0199936058 0.014699996 0.0459594867 0.033377864 0.1893699560
## Branch        0.0211366879 0.045597436 0.0131557039 0.003576857 0.0699866884
## Customer.type 0.0006007309 0.001447561 0.0773465770 0.016965620 0.0347386027
## Gender        0.0004091263 0.007518441 0.0220398950 0.010545964 0.0003529779
# Cos2: quality of representation on the factore map
head(var$cos2)
##                      Dim.1        Dim.2        Dim.3        Dim.4        Dim.5
## Unit.price    2.521063e-04 3.571113e-03 0.0030268038 2.433454e-02 1.304083e-03
## Quantity      5.598815e-03 2.968180e-07 0.0029428208 5.295118e-06 9.588636e-04
## Rating        1.235877e-07 8.179024e-03 0.0047005553 5.893032e-04 3.424543e-04
## Branch        1.193790e-03 1.589122e-02 0.0378416277 7.199053e-03 3.572595e-03
## Customer.type 1.190080e-04 1.482662e-04 0.0143211691 3.798118e-08 3.120411e-05
## Gender        3.060301e-02 3.166460e-02 0.0009168444 4.866238e-05 1.654527e-03
##                      Dim.6        Dim.7        Dim.8        Dim.9       Dim.10
## Unit.price    0.0014383309 6.160346e-09 8.108051e-07 3.119774e-03 1.863992e-11
## Quantity      0.0006252846 2.968750e-03 1.403830e-02 1.503085e-02 6.911719e-11
## Rating        0.0001504843 1.563591e-04 3.064907e-03 2.785721e-03 1.810308e-04
## Branch        0.0010304424 8.770911e-04 9.218447e-05 7.185563e-04 5.837291e-03
## Customer.type 0.0097561847 2.546046e-02 2.323586e-04 6.258128e-08 1.092102e-03
## Gender        0.0002805366 4.158200e-04 3.028512e-04 1.018260e-05 9.969190e-04
##                     Dim.11       Dim.12       Dim.13       Dim.14       Dim.15
## Unit.price    7.234158e-10 3.083262e-09 2.165045e-09 1.210585e-03 5.559412e-03
## Quantity      3.107197e-08 2.464114e-04 1.695057e-04 8.389324e-04 2.553594e-10
## Rating        6.731860e-04 1.851571e-04 3.895653e-03 3.398621e-05 4.642737e-03
## Branch        2.661262e-04 1.130917e-03 2.619495e-04 1.179382e-03 8.055713e-04
## Customer.type 3.050372e-07 2.052933e-03 3.630122e-08 3.214359e-08 3.013736e-03
## Gender        3.222168e-05 4.319630e-04 1.905935e-04 1.340548e-05 1.306221e-07
##                     Dim.16       Dim.17       Dim.18       Dim.19       Dim.20
## Unit.price    4.485859e-03 4.605108e-06 4.334755e-07 3.993521e-06 5.691004e-05
## Quantity      3.897530e-09 2.842728e-05 7.315376e-04 9.591310e-04 2.845048e-02
## Rating        3.997443e-04 2.160899e-04 2.112274e-03 1.114082e-03 3.586098e-02
## Branch        2.233798e-04 1.039563e-03 8.653627e-05 6.396955e-06 2.449068e-03
## Customer.type 3.608777e-07 2.095433e-06 5.982493e-03 2.878322e-04 1.206771e-03
## Gender        1.673843e-07 5.652695e-05 4.857570e-04 1.112174e-04 1.245934e-07
# Contributions to the  dimensions
var$contrib
##                     Dim.1       Dim.2     Dim.3       Dim.4      Dim.5
## Unit.price     1.21960969  4.73431150  4.465104 13.17405745  3.1328685
## Quantity       5.74747430  0.04316186  4.402723  0.19433266  2.6863833
## Rating         0.02700328  7.16482772  5.564344  2.05011096  1.6054282
## Branch         3.75325468 14.12370406 22.327460 10.13352731  7.3332492
## Customer.type  0.83794809  0.96466391  9.712447  0.01645857  0.4846134
## Gender        13.43727083 14.09749740  2.457464  0.58912121  3.5287959
## Product.line  17.76582517  7.96155013 11.027411 20.24077601 18.9165763
## Payment        5.51439117 10.34348866  2.357113  3.63022453 11.0884671
## day           23.84364755 22.36952333  9.177146 13.38469906 18.0689951
## hour          27.85357524 18.19727143 28.508788 36.58669225 33.1546229
##                   Dim.6        Dim.7       Dim.8       Dim.9       Dim.10
## Unit.price     3.306007  0.006939316  0.08118742  5.09376094 4.019150e-04
## Quantity       2.179783  4.817268925 10.68286615 11.18069589 7.739364e-04
## Rating         1.069350  1.105543152  4.99159337  4.81333217 1.252531e+00
## Branch         3.957320  3.702980660  1.22426209  3.45718352 1.005849e+01
## Customer.type  8.610221 14.107407426  1.37438981  0.02281388 3.076408e+00
## Gender         1.460055  1.802879859  1.56908041  0.29100901 2.939288e+00
## Product.line  22.300928 30.443708058 28.17627872  9.99052373 6.066860e+00
## Payment        8.242191  9.674356147  9.45650606  7.63997557 3.397197e+00
## day           16.544085 15.621861132 17.13762774 15.95374806 2.920421e+01
## hour          32.330060 18.717055324 25.30620824 41.55695722 4.400383e+01
##                     Dim.11      Dim.12       Dim.13      Dim.14       Dim.15
## Unit.price     0.002533695  0.00527868  0.004463239  3.37752394  7.411049527
## Quantity       0.016605222  1.49228056  1.248846182  2.81166942  0.001588332
## Rating         2.444149726  1.29357113  5.986967363  0.56591622  6.772551853
## Branch         2.173297038  4.52116561  2.195535814  4.71457999  3.989627061
## Customer.type  0.052027943  4.30732308  0.018275842  0.01740395  5.456547222
## Gender         0.534729818  1.97580177  1.324252349  0.35542004  0.035923085
## Product.line  24.757547389 26.45945964 21.431972137 27.99759108  5.172609186
## Payment       10.751972275  2.71608510  1.674391530  5.36483830  0.262359495
## day           21.552900130 11.72909249 21.575817902 29.80619183 18.965154217
## hour          37.714236765 45.49994193 44.539477640 24.98886521 51.932590023
##                     Dim.16     Dim.17    Dim.18     Dim.19      Dim.20
## Unit.price     6.681675466  0.2184465  0.068507  0.2106016  0.82107991
## Quantity       0.006228124  0.5427412  2.814304  3.2637936 18.35844070
## Rating         1.994590155  1.4963817  4.782201  3.5175661 20.61114132
## Branch         2.108625627  4.6415773  1.368884  0.3769514  7.61739378
## Customer.type  0.059929761  0.1473540  8.048107  1.7879421  3.78097067
## Gender         0.040815016  0.7653374  2.293307  1.1113991  0.03841833
## Product.line   3.845050369  2.1650578  8.168468 14.5697535  7.32605826
## Payment       13.073562558  1.3884063  3.636465 16.2784712  0.03384746
## day           25.225238473 49.7819035 22.899180 30.1937079 19.63185593
## hour          46.964284452 38.8527944 45.920578 28.6898136 21.78079363

For better visualization, a plot detailing the contributions of each variable to dim(1-3) shall be plotted.

  • fviz_famd_var() to plot both quantitative and qualitative variables
  • fviz_contrib() to visualize the contribution of variables to the principal dimensions
# Plot of variables
fviz_famd_var(res.famd, repel = TRUE)

# Contribution to the first dimension
fviz_contrib(res.famd, "var", axes = 1)

# Contribution to the second dimension
fviz_contrib(res.famd, "var", axes = 2)

# Contribution to the third dimension
fviz_contrib(res.famd, "var", axes = 3)

Observation All variables are represented in Dim1 and Dim2 excluding quantity which is solely represented in Dim1 and rating solely represented in Dim 2

Dim1: Hour, day, and product line contribute the most to PC1 Dim2: day, hour and branch in descending orde contribute the most to PC2 Dim3: The hour and branch are the greatest contributors to PC3

Graphing Quantitative variables

# Extracting the results for quantitative variables
#
quanti.var <- get_famd_var(res.famd, "quanti.var")
quanti.var 
## FAMD results for quantitative variables 
##  ===================================================
##   Name       Description                      
## 1 "$coord"   "Coordinates"                    
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
# highlighting the most contributing quantitative variables on scatter 
# plot
fviz_famd_var(res.famd, "quanti.var", col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)

# highligting the quality of representation of the continuous variables
# on the factor map. If a variable is well represented by two dimensions,
# the sum of the cos2 is closed to one

# Color by cos2 values: quality on the factor map
fviz_famd_var(res.famd, "quanti.var", col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE)

Observations From the visualization, the most important contributing continuous variable was the rating. However it was poorly represented in Dim 1 and Dim 2. with a sum(cos2) slightly higher than 0.09

Graph of qualitative variables

# extracting qualitative variables format
#
quali.var <- get_famd_var(res.famd, "quali.var")
quali.var 
## FAMD results for qualitative variable categories 
##  ===================================================
##   Name       Description                      
## 1 "$coord"   "Coordinates"                    
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
# Visualize the qualitative variables
fviz_famd_var(res.famd, "quali.var", col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
             )

# Color by cos2 values: quality on the factor map
fviz_famd_var(res.famd, "quali.var", col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE)

observations

The strongest contributing factor was the gender. Whether the client was male or female. However these qualitative variables are not well represented by the first two dimensions. They has a sum(cos2) slightly higher than 0.4 which is low

Graph of individuals

# format to get these individuals
#
ind <- get_famd_ind(res.famd)
ind
## FAMD results for individuals 
##  ===================================================
##   Name       Description                      
## 1 "$coord"   "Coordinates"                    
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
# To plot individuals, use the function fviz_mfa_ind()
#
fviz_famd_ind(res.famd, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)
## Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 946 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# coloring the invoice details based on the total plus tax
fviz_mfa_ind(res.famd, 
             habillage = "Gender", # color by groups 
             palette = c("#00AFBB", "#E7B800"),
             addEllipses = TRUE, ellipse.type = "confidence", 
             repel = TRUE # Avoid text overlapping
             ) 
## Warning: ggrepel: 947 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

From the analysis, 20 dimensions explain 70% of the data in data set.

Feature selection

We shall select the features by employing; 1. Filter methods 2. Embedded Methods

Filter methods

We shall select features from the numerical data in the data set

# suppressing warnings
suppressWarnings(
        suppressMessages(if
                         (!require(caret, quietly=TRUE))
                install.packages("caret")))
# specifying numeric columns
cols <- c("Unit.price", "Quantity", "minute",  "Rating", "gross.income", "Tax", "cogs", "Total")

# numerical dataset
#
sales_num <- sales_org[cols]
# Calculating the correlation matrix of numerical variables
# ---
#
correlationMatrix <- cor(sales_num)
# Find attributes that are highly correlated
# ---
#
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)

# Highly correlated attributes
# ---
# 
highlyCorrelated
## [1] 5 6 7
names(sales_num[,highlyCorrelated])
## [1] "gross.income" "Tax"          "cogs"

The tax, cost of goods sold, and gross income have high correlations >/= 0.75

# Removing Redundant Features 
# ---
# 
sales_num1<-sales_num[-highlyCorrelated]
# Performing our graphical comparison
# ---
# 
par(mfrow = c(1, 2))
corrplot(correlationMatrix, order = "hclust")
corrplot(cor(sales_num1), order = "hclust")

The Cogs, gross income, and tax columns that had high correlations have been dropped

Embedded Methods

# loading wskm package
# ---
#
suppressWarnings(
        suppressMessages(if
                         (!require(wskm, quietly=TRUE))
                install.packages("wskm")))
library(wskm)
# Creating model removing total the target
#
set.seed(2)
model <- ewkm(sales_num[1:7], 3, lambda=2, maxiter=1000)
# Loading and installing our cluster package
# ---
#
suppressWarnings(
        suppressMessages(if
                         (!require(cluster, quietly=TRUE))
                install.packages("cluster")))
library("cluster")
# Cluster Plot against 1st 2 principal components
# ---
#
clusplot(sales_num[1:7], model$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=1,main='Cluster Analysis for Iris')

# Weights remain stored in the model and we can check them as follows:
# 
round(model$weights*100,2)
##   Unit.price Quantity minute Rating gross.income Tax cogs
## 1          0        0      0   0.00           50  50    0
## 2          0        0      0  99.99            0   0    0
## 3          0        0      0   0.00           50  50    0

The first component is explained the tax and the gross income The second component is full explained by the rating

Conclusion

Dimension reduction

For dimension reduction, FAMD(Factor Analysis for Mixed Data ) was applied. IT was established that 20 principal components were required to explain 72% of the information in the data set.

From the analysis The most important components that had the most contribution were the Hour, day, and product line which contributed the most to the first principal component and day, hour and branch which contributed to the second principal component

The most important contributing continuous variable was the rating and the most important qualitative variable is gender.

Feature selection Using embedded and filtering It could be concluded that all other continuous variables aside from Cost of goods sold(cogs), gross income,and tax could help the marketing department develop a marketing strategy

Further questions

A) Do we have the right data

Yes. The data provided by Carrefour supermarket is sufficient for the retailer to establish

B) Do we have the right question?

Yes. It is crucial for a supermarket to gain insights from its business dealings in order to best inform a marketing strategy and maximize profitability