As Data analysts at Carrefour Kenya and we are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax). The project has been divided into two parts. The first is performing dimensionality reduction using PCA and feature selection.
The most important features that will help inform Carrefour marketing strategies are identified.
The following are the attributes in the data site provided:
Invoice.ID- The unique identifier of the invoice of goods sold Branch - The branch of Carrefour supermarket the sale was recorded Customer.type - The type of customer Gender - Gender of customer Product.line - The line of product being sold(e.g groceries/electronics) Unit.price - The price per unit of the item Quantity - The quantity of goods sold Tax - The tax of the sale Date - The date the sell was recorded Time - The time the sale was recorded Payment - How the customer payed for the goods cogs - Cost of goods sold gross.margin.percentage - The percentage profit margin of goods sold gross.income - The amount of income earned from the sell Rating - The rating of the item sold Total - Total amount earned from the sale
# loading libraries
#
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.7 ✔ purrr 0.3.4
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(stringr)
library(ggplot2)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
library(corrplot)
## corrplot 0.92 loaded
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Loading dataset
#
sales_dataset <- read.csv("http://bit.ly/CarreFourDataset")
# previewing the first six records of dataset
#
head(sales_dataset)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## 3 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4 123-19-1176 A Member Male Health and beauty 58.22
## 5 373-73-7910 A Normal Male Sports and travel 86.31
## 6 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross.margin.percentage
## 1 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross.income Rating Total
## 1 26.1415 9.1 548.9715
## 2 3.8200 9.6 80.2200
## 3 16.2155 7.4 340.5255
## 4 23.2880 8.4 489.0480
## 5 30.2085 5.3 634.3785
## 6 29.8865 4.1 627.6165
# Dimensions of dataset
#
dim(sales_dataset)
## [1] 1000 16
The dataset has 1000 records and 16 columns
# Preview dataset structure
#
str(sales_dataset)
## 'data.frame': 1000 obs. of 16 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ Customer.type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product.line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
The dataset date time, continuous and categorical varaiables The datetime and categorical variables have inappropiate datatypes
# Preview the type of dataset
#
class(sales_dataset)
## [1] "data.frame"
The data set is a data frame
Since the data set is provided by the client, it is assumed to be accurate hence no need to validate it further
The following categorical variables are listed as characters instead of factors; - Branch, Customer.type, Gender, Product.line, and Payment Datetime data; - Date, and time.
The catagorical variables shall be converted to factors while the date and time shall be converted to date time
# specifying categorical variables
#
categ_cols <- c("Branch", "Customer.type", "Gender", "Product.line", "Payment")
# Converting the categorical columns to factors
#
sales_dataset[,categ_cols] <- lapply(sales_dataset[,categ_cols] , factor)
# adding ss to the time to change format format from HH:MM to HH:MM:SS
#
sales_dataset <- sales_dataset %>%
mutate(Time=sprintf("%s.%02d",Time, row_number(Time)*0))
# Merging date and time column to create datetime column
#
sales_dataset$datetime <- with(sales_dataset, as.POSIXct(paste(as.Date(Date, format="%m/%d/%y"), Time)))
# previewing the sturcture of dataset
#
str(sales_dataset)
## 'data.frame': 1000 obs. of 17 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
## $ Customer.type : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
## $ Product.line : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08.00" "10:29.00" "13:23.00" "20:33.00" ...
## $ Payment : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
## $ datetime : POSIXct, format: "2020-01-05 13:08:00" "2020-03-08 10:29:00" ...
All columns now have appropriate columns and the new datetime column has been successfully created and appropriately filled.
# Checking the number of missing values per column in the data set
#
colSums(is.na(sales_dataset))
## Invoice.ID Branch Customer.type
## 0 0 0
## Gender Product.line Unit.price
## 0 0 0
## Quantity Tax Date
## 0 0 0
## Time Payment cogs
## 0 0 0
## gross.margin.percentage gross.income Rating
## 0 0 0
## Total datetime
## 0 0
The dataset has no null values
# finding the duplicated rows in the data set and assign to a variable duplicated_rows below
#
duplicated_rows = sales_dataset[duplicated(sales_dataset),]
# Printing out the duplicated rows
duplicated_rows
## [1] Invoice.ID Branch Customer.type
## [4] Gender Product.line Unit.price
## [7] Quantity Tax Date
## [10] Time Payment cogs
## [13] gross.margin.percentage gross.income Rating
## [16] Total datetime
## <0 rows> (or 0-length row.names)
The dataset has no duplicate values
# number of rows in data frame
#
num_rows = nrow(sales_dataset)
# creating ID column vector
#
ID <- c(1:num_rows)
# binding id column to the data frame
#
sales_dataset1 <- cbind(ID , sales_dataset)
# Applying names function to get column names from numeric columns in dataset
# as a list
#
colnames <- names(select_if(sales_dataset1, is.numeric))
# Print vector of column names
#
colnames
## [1] "ID" "Unit.price"
## [3] "Quantity" "Tax"
## [5] "cogs" "gross.margin.percentage"
## [7] "gross.income" "Rating"
## [9] "Total"
# creating the modified data frame
#
mod_df<- melt(sales_dataset1, id.vars='ID',
measure.vars=c("cogs", "Total"))
# creating a plot of area income
#
p <- ggplot(mod_df) +
geom_boxplot(aes(x=ID, y=value, color=variable))
# printing the plot
#
print(p)
The cogs and total variables have outliers
# creating the modified data frame
#
mod_df<- melt(sales_dataset1, id.vars='ID',
measure.vars=c("Unit.price", "Quantity", "Tax","gross.income"))
# creating a plot of area income
#
p <- ggplot(mod_df) +
geom_boxplot(aes(x=ID, y=value, color=variable))
# printing the plot
#
print(p)
The tax and gross income have outliers
# creating the modified data frame
#
mod_df<- melt(sales_dataset1, id.vars='ID',
measure.vars=c("gross.margin.percentage"))
# creating a plot of area income
#
p <- ggplot(mod_df) +
geom_boxplot(aes(x=ID, y=value, color=variable))
# creating the modified data frame
#
mod_df<- melt(sales_dataset1, id.vars='ID',
measure.vars=c("Rating"))
# creating a plot of area income
#
p1 <- ggplot(mod_df) +
geom_boxplot(aes(x=ID, y=value, color=variable))
# printing the plot
#
print(p)
print(p1)
There are no outliers in the gross margin percentage and rating. The
outliers in the other variables shall be retained due to the small
number of records in the data set.
Here we aim to extract the day, month, year, hour and minute for further analysis
# Extracting the year from the time stamp
#
sales_dataset$year <- format (as.Date(sales_dataset$datetime, format="%d/%m/%Y"),"%Y")
# Extracting the month from the time stamp
#
sales_dataset$month <- format (as.Date(sales_dataset$datetime, format="%d/%m/%Y"),"%m")
# Convert Date to Weekday in R (weekdays Function)
#
sales_dataset$day <- weekdays(sales_dataset$datetime)
# Extracting the hour of day from the time stamp
#
sales_dataset$hour <- format (as.POSIXct(sales_dataset$datetime, format="%H:%M:%S"),"%H")
# Extracting the minute of day from the time stamp
#
sales_dataset$minute <- format (as.POSIXct(sales_dataset$datetime, format="%H:%M:%S"),"%M")
# Previewing the first six records to see the extracted variables
#
head(sales_dataset)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## 3 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4 123-19-1176 A Member Male Health and beauty 58.22
## 5 373-73-7910 A Normal Male Sports and travel 86.31
## 6 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs
## 1 7 26.1415 1/5/2019 13:08.00 Ewallet 522.83
## 2 5 3.8200 3/8/2019 10:29.00 Cash 76.40
## 3 7 16.2155 3/3/2019 13:23.00 Credit card 324.31
## 4 8 23.2880 1/27/2019 20:33.00 Ewallet 465.76
## 5 7 30.2085 2/8/2019 10:37.00 Ewallet 604.17
## 6 7 29.8865 3/25/2019 18:30.00 Ewallet 597.73
## gross.margin.percentage gross.income Rating Total datetime year
## 1 4.761905 26.1415 9.1 548.9715 2020-01-05 13:08:00 2020
## 2 4.761905 3.8200 9.6 80.2200 2020-03-08 10:29:00 2020
## 3 4.761905 16.2155 7.4 340.5255 2020-03-03 13:23:00 2020
## 4 4.761905 23.2880 8.4 489.0480 2020-01-27 20:33:00 2020
## 5 4.761905 30.2085 5.3 634.3785 2020-02-08 10:37:00 2020
## 6 4.761905 29.8865 4.1 627.6165 2020-03-25 18:30:00 2020
## month day hour minute
## 1 01 Sunday 13 08
## 2 03 Sunday 10 29
## 3 03 Tuesday 13 23
## 4 01 Monday 20 33
## 5 02 Saturday 10 37
## 6 03 Wednesday 18 30
# preview the data type of created variables(year, month, day, minute, hour)
#
# Converting the attribute year minute and day from character to integer
#
as.integer(sales_dataset$year) -> sales_dataset$year
as.integer(sales_dataset$minute) -> sales_dataset$minute
# Converting continent, hour, month, and weekday from character data type
# to integer data type
#
cols <- c("hour", "day", "month")
# Applying factor conversion
#
sales_dataset[cols] <- lapply(sales_dataset[cols], factor)
# preview the dataset structure after changes to datatypes
#
str(sales_dataset)
## 'data.frame': 1000 obs. of 22 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
## $ Customer.type : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
## $ Product.line : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08.00" "10:29.00" "13:23.00" "20:33.00" ...
## $ Payment : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
## $ datetime : POSIXct, format: "2020-01-05 13:08:00" "2020-03-08 10:29:00" ...
## $ year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ month : Factor w/ 3 levels "01","02","03": 1 3 3 1 2 3 2 2 1 2 ...
## $ day : Factor w/ 7 levels "Friday","Monday",..: 4 4 6 2 3 7 6 2 1 5 ...
## $ hour : Factor w/ 11 levels "10","11","12",..: 4 1 4 11 1 9 5 2 8 4 ...
## $ minute : int 8 29 23 33 37 30 36 38 15 27 ...
All variables have appropriate data types. The total is a summation of cost of goods sold and tax and tax. The gross income is obtained from subtracting the total from cost of goods sold before factoring in other expenses such as tax. These shall also be dropped since they all can be represented by the total.
# creating copy of data set before dropping unwanted columns
#
sales_org <- sales_dataset
# Dropping the date and time columns
#
drops <- c("Date","Time", "gross.income", "Tax", "cogs")
sales_dataset <- sales_dataset[ , !(names(sales_dataset) %in% drops)]
# previewing first six records of dataframe
#
head(sales_dataset)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## 3 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4 123-19-1176 A Member Male Health and beauty 58.22
## 5 373-73-7910 A Normal Male Sports and travel 86.31
## 6 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Payment gross.margin.percentage Rating Total
## 1 7 Ewallet 4.761905 9.1 548.9715
## 2 5 Cash 4.761905 9.6 80.2200
## 3 7 Credit card 4.761905 7.4 340.5255
## 4 8 Ewallet 4.761905 8.4 489.0480
## 5 7 Ewallet 4.761905 5.3 634.3785
## 6 7 Ewallet 4.761905 4.1 627.6165
## datetime year month day hour minute
## 1 2020-01-05 13:08:00 2020 01 Sunday 13 8
## 2 2020-03-08 10:29:00 2020 03 Sunday 10 29
## 3 2020-03-03 13:23:00 2020 03 Tuesday 13 23
## 4 2020-01-27 20:33:00 2020 01 Monday 20 33
## 5 2020-02-08 10:37:00 2020 02 Saturday 10 37
## 6 2020-03-25 18:30:00 2020 03 Wednesday 18 30
All redundant columns have been successfully dropped.
# Specifying numeric columns
#
cols <- c("Unit.price", "Quantity", "gross.margin.percentage", "minute", "year", "Rating", "Total")
# Creating dataset with numerical variables
#
num_dataset <- sales_dataset[cols]
# Describing the continuous variables in the dataset
#
describe(num_dataset)
## num_dataset
##
## 7 Variables 1000 Observations
## --------------------------------------------------------------------------------
## Unit.price
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 943 1 55.67 30.6 15.28 19.31
## .25 .50 .75 .90 .95
## 32.88 55.23 77.94 93.12 97.22
##
## lowest : 10.08 10.13 10.16 10.17 10.18, highest: 99.82 99.83 99.89 99.92 99.96
## --------------------------------------------------------------------------------
## Quantity
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 10 0.99 5.51 3.36 1 1
## .25 .50 .75 .90 .95
## 3 5 8 10 10
##
## lowest : 1 2 3 4 5, highest: 6 7 8 9 10
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 112 91 90 109 102 98 102 85 92 119
## Proportion 0.112 0.091 0.090 0.109 0.102 0.098 0.102 0.085 0.092 0.119
## --------------------------------------------------------------------------------
## gross.margin.percentage
## n missing distinct Info Mean Gmd
## 1000 0 1 0 4.762 0
##
## Value 4.761905
## Frequency 1000
## Proportion 1
## --------------------------------------------------------------------------------
## minute
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 60 1 30.1 19.48 3 6
## .25 .50 .75 .90 .95
## 16 30 44 53 56
##
## lowest : 0 1 2 3 4, highest: 55 56 57 58 59
## --------------------------------------------------------------------------------
## year
## n missing distinct Info Mean Gmd
## 1000 0 1 0 2020 0
##
## Value 2020
## Frequency 1000
## Proportion 1
## --------------------------------------------------------------------------------
## Rating
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 61 1 6.973 1.985 4.295 4.500
## .25 .50 .75 .90 .95
## 5.500 7.000 8.500 9.400 9.700
##
## lowest : 4.0 4.1 4.2 4.3 4.4, highest: 9.6 9.7 9.8 9.9 10.0
## --------------------------------------------------------------------------------
## Total
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 990 1 323 270.7 41.07 68.10
## .25 .50 .75 .90 .95
## 124.42 253.85 471.35 718.91 822.50
##
## lowest : 10.6785 12.6945 13.1670 13.4190 14.6790
## highest: 1022.4900 1023.7500 1034.4600 1039.2900 1042.6500
## --------------------------------------------------------------------------------
From the descriptive statistics, gross margin percentage had zero variance from the mean. All other variables had large deviations relative to their means. The total had the largest skew(did not follow normal distribution) The other variables(unit price, quantity, and rating had a gausian distribution) All variables had positive skews and negative kurtosis
# histogram representations of numeric variables
#
hist.data.frame(sales_dataset[cols])
From the data it is clear that tax, gross income, cogs and total are all
skewed to the left. The others seem to have a relatively normal
distribution
Here, the categorical data shall be visualized to get insights into them.
# plotting the branches
#
ggplot(sales_dataset, aes(x = Branch)) +
geom_bar(fill = "coral") +
theme_classic()
From the bar plot Carrefour Kenya has 3 branches A, B, C.
# Plotting the customer type
#
ggplot(sales_dataset, aes(x = Customer.type)) +
geom_bar(fill = "coral") +
theme_classic()
Carrefour Kenya has two customer types: Member and Normal
# Plotting the customer type
#
ggplot(sales_dataset, aes(x = Gender)) +
geom_bar(fill = "coral") +
theme_classic()
Carrefour Kenya customers are segmented to two genders, Male and
Female
# Plotting the customer type
#
ggplot(sales_dataset, aes(x = Product.line)) +
geom_bar(fill = "coral") +
theme_classic()
Carrefour Kenya has six product lines. fashion accessories is the
highest seller.
# Plotting the customer type
#
ggplot(sales_dataset, aes(x = Payment)) +
geom_bar(fill = "coral") +
theme_classic()
Carrefour Kenya takes three payment methods: Cash, Credit card and
Ewallet
# Plotting the customer type
#
ggplot(sales_dataset, aes(x = month)) +
geom_bar(fill = "coral") +
theme_classic()
The data was sourced for three months(January, february, March)
# Plotting the customer type
#
ggplot(sales_dataset, aes(x = hour)) +
geom_bar(fill = "coral") +
theme_classic()
The data was collected from the 10th to the 20th hour.
# Plotting the customer type
#
ggplot(sales_dataset, aes(x = day)) +
geom_bar(fill = "coral") +
theme_classic()
The data was collected for seven days Monday to Thursday. Thursday
seemed to have the highest level of activity. After extracting the day,
month, hour, and minute variables the datetime column is redundant hence
will be dropped.
From the uni variate analysis there are only three months and a year has 12 months. Hence these is not representative of the entire year. This will be dropped also. The hours for transaction ranged from hour 10 to hour 20. This can be representative of a typical working day in the supermarket hence shall be retained. The minute a transaction is made has a minimal impact in analyzing retail sales hence shall be dropped The are seven days in a week and all represented in a data set. In Specific days high transaction levels can be observed hence this variables shall be retained.
# dropping datetime column
#
drops <- c("datetime", "minute", "month")
sales_dataset <- sales_dataset[ , !(names(sales_dataset) %in% drops)]
# Specifying numeric columns
#
cols <- c("Unit.price", "Quantity", "gross.margin.percentage", "year", "Rating", "Total")
# Create Covariance matrix of the numerical variables in dataset
#
cov(sales_dataset[cols])
## Unit.price Quantity gross.margin.percentage year
## Unit.price 701.9653313 0.83477848 0 0
## Quantity 0.8347785 8.54644645 0 0
## gross.margin.percentage 0.0000000 0.00000000 0 0
## year 0.0000000 0.00000000 0 0
## Rating -0.3996675 -0.07945646 0 0
## Total 4130.0351420 507.14097799 0 0
## Rating Total
## Unit.price -0.39966752 4130.03514
## Quantity -0.07945646 507.14098
## gross.margin.percentage 0.00000000 0.00000
## year 0.00000000 0.00000
## Rating 2.95351823 -15.39931
## Total -15.39930581 60459.59802
From the covariance matrix, minute and rating varied negatively relative to the other variables in dataset
# Correlation matrix of numerical data in the Carefour dataset
#
corr <- cor(sales_dataset[cols])
## Warning in cor(sales_dataset[cols]): the standard deviation is zero
# round corr to two
#
round(corr, 2)
## Unit.price Quantity gross.margin.percentage year Rating
## Unit.price 1.00 0.01 NA NA -0.01
## Quantity 0.01 1.00 NA NA -0.02
## gross.margin.percentage NA NA 1 NA NA
## year NA NA NA 1 NA
## Rating -0.01 -0.02 NA NA 1.00
## Total 0.63 0.71 NA NA -0.04
## Total
## Unit.price 0.63
## Quantity 0.71
## gross.margin.percentage NA
## year NA
## Rating -0.04
## Total 1.00
Year and gross margin percentages were constants hence had null correlations with other variables. These shall be dropped since they provide no information to aid the analysis A correlation plot shall be plotted to visualize this further.
# dropping the year and gross margin percentages
#
drops <- c("gross.margin.percentage","year")
sales_dataset <- sales_dataset[ , !(names(sales_dataset) %in% drops)]
# specifying the columns
#
cols <- c("Unit.price", "Quantity", "Rating", "Total")
# specifying numeric variables dataset after dropping redundant columns
#
num_dataset <- sales_dataset[cols]
# correlation plot of remaining numeric variables
#
rcorr(as.matrix(num_dataset),type="pearson")
## Unit.price Quantity Rating Total
## Unit.price 1.00 0.01 -0.01 0.63
## Quantity 0.01 1.00 -0.02 0.71
## Rating -0.01 -0.02 1.00 -0.04
## Total 0.63 0.71 -0.04 1.00
##
## n= 1000
##
##
## P
## Unit.price Quantity Rating Total
## Unit.price 0.7336 0.7816 0.0000
## Quantity 0.7336 0.6174 0.0000
## Rating 0.7816 0.6174 0.2496
## Total 0.0000 0.0000 0.2496
From the correlation matrix above, From the data above; quantity, unit price had a high (>0.5) positive correlation with total. The p values (0.00) obtained for these strong positive correlations were less than the confidence interval (0.05). These implied that the strong correlation between these variables was statistically significant.
# Visualizing the correlations
#
corrplot(corr)
# Creating a regression line function
#
upper_panel_regression_line = function(x,y, ...){
points(x,y,...)
linear_regression = lm(y~x)
linear_regression_line = abline(linear_regression)
}
# pair plot of numerical variables
#
pairs(num_dataset, lower.panel=NULL, upper.panel =
upper_panel_regression_line)
From the pair plot, the variables with strong positive correlations had
linear relationship
Is the covariation linear? Yes, for variables with strong linear correlation the variables form the plot above, have a linear relationship. Are the data from each of the numeric variables follow a normal distribution?
Use Shapiro-Wilk normality test –> R function: shapiro.test() Shapiro-Wilk test shall be performed as follow:
Null hypothesis: the data are normally distributed Alternative hypothesis: the data are not normally distributed
# shapiro-wilk test for normality for numeric variables
#
apply(num_dataset,2,shapiro.test)
## $Unit.price
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.95188, p-value < 2.2e-16
##
##
## $Quantity
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.93258, p-value < 2.2e-16
##
##
## $Rating
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.9582, p-value = 2.688e-16
##
##
## $Total
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.90876, p-value < 2.2e-16
From the test; The “Unit.price”, “Quantity”, “Rating”, and “Total” all had p-values less than the confidence level 0.05. The test was statically significant for all numeric variables. We reject the null hypothesis and conclude that all numeric variables in the data set are not normally distributed.
Since the data set has a large number of highly correlated variables, Principal component analysis shall be performed on the data set to reduce dimensions.
The data set has a large number of variables that are highly correlated. Hence to reduce the dimensions PCA shall be applied. However the data set also has categorical variables. PCA maximizes the projected inertia of all the columns (fairly distributed) over the given component.
Directly applying the PCA over a table containing the one-hot encoded modalities, the inertia given to a categorical variable would inherently depend on the number of modalities available to the variable, and on the probabilities of these modalities. As a result, it would be impossible to give a similar weight to all the initial variables over the calculated components.
Hence we are going to apply Factor analysis for mixed data (FAMD). It can be seen roughly as a mixed between PCA and MCA. It works by scaling the continuous variables to unit variance and categorical variables are transformed into a disjunctive data table and then scaled using specific scaling of MCA.
FAMD does not require numeric data hence the dataset shall not be encoded
# Moving the target variable total to the end of the data frame
#
sales_dataset <- sales_dataset %>% relocate(Total, .after = last_col())
# Preview dataframe
head(sales_dataset)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## 3 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4 123-19-1176 A Member Male Health and beauty 58.22
## 5 373-73-7910 A Normal Male Sports and travel 86.31
## 6 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Payment Rating day hour Total
## 1 7 Ewallet 9.1 Sunday 13 548.9715
## 2 5 Cash 9.6 Sunday 10 80.2200
## 3 7 Credit card 7.4 Tuesday 13 340.5255
## 4 8 Ewallet 8.4 Monday 20 489.0480
## 5 7 Ewallet 5.3 Saturday 10 634.3785
## 6 7 Ewallet 4.1 Wednesday 18 627.6165
The total variable has been successfully moved. In developing a marketing strategy the minute a transaction is made is less relevant in the making of the decision
# The function FAMD() [FactoMiner package] can be used to compute FAMD.
# Format
res.famd <- FAMD(sales_dataset[2:11], # Selecting the variables excluding
# invoice.id, minute and the target variable Total for the study
sup.var = NULL,
graph = FALSE,
ncp=20)
base : a data frame with n rows (individuals) and p columns (variables). ncp: the number of dimensions kept in the results (by default 5) sup.var: a vector indicating the indexes of the supplementary variables. ind.sup: a vector indicating the indexes of the supplementary individuals. graph : a logical value. If TRUE a graph is displayed.
# Inspect principal components
#
eig.val <- get_eigenvalue(res.famd)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.3018802 4.339601 4.339601
## Dim.2 1.2622489 4.207496 8.547097
## Dim.3 1.2321415 4.107138 12.654235
## Dim.4 1.1841100 3.947033 16.601268
## Dim.5 1.1526844 3.842281 20.443550
## Dim.6 1.1471645 3.823882 24.267431
## Dim.7 1.1310607 3.770202 28.037634
## Dim.8 1.1090969 3.696990 31.734623
## Dim.9 1.0965362 3.655121 35.389744
## Dim.10 1.0742065 3.580688 38.970432
## Dim.11 1.0615482 3.538494 42.508926
## Dim.12 1.0519132 3.506377 46.015303
## Dim.13 1.0425172 3.475057 49.490361
## Dim.14 1.0301471 3.433824 52.924184
## Dim.15 1.0060851 3.353617 56.277801
## Dim.16 1.0023917 3.341306 59.619107
## Dim.17 0.9823694 3.274565 62.893672
## Dim.18 0.9610530 3.203510 66.097182
## Dim.19 0.9488909 3.162970 69.260151
## Dim.20 0.9187747 3.062582 72.322734
# visualize the contributions of the different dimensions to analysis
#
fviz_screeplot(res.famd)
# Variable results
#
var <- get_famd_var(res.famd)
var
## FAMD results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates"
## 2 "$cos2" "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
# Coordinates of variables
head(var$coord)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Unit.price 0.0158778569 0.0597587936 0.05501640 0.1559953279 0.036112088
## Quantity 0.0748252292 0.0005448101 0.05424777 0.0023011124 0.030965523
## Rating 0.0003515504 0.0904379573 0.06856060 0.0242755683 0.018505520
## Branch 0.0488628791 0.1782762956 0.27510590 0.1199921074 0.084529222
## Customer.type 0.0109090802 0.0121764594 0.11967109 0.0001948876 0.005586064
## Gender 0.1749371668 0.1779455023 0.03027944 0.0069758429 0.040675881
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## Unit.price 0.03792533 7.848787e-05 0.0009004472 0.0558549327 4.317397e-06
## Quantity 0.02500569 5.448624e-02 0.1184833335 0.1226003780 8.313675e-06
## Rating 0.01226721 1.250436e-02 0.0553616055 0.0527799297 1.345477e-02
## Branch 0.04539697 4.188296e-02 0.0135782524 0.0379092688 1.080490e-01
## Customer.type 0.09877340 1.595633e-01 0.0152433143 0.0002501625 3.304697e-02
## Gender 0.01674923 2.039167e-02 0.0174026216 0.0031910191 3.157402e-02
## Dim.11 Dim.12 Dim.13 Dim.14 Dim.15
## Unit.price 2.689639e-05 5.552713e-05 4.653004e-05 0.0347934636 7.456146e-02
## Quantity 1.762724e-04 1.569750e-02 1.301944e-02 0.0289643300 1.597997e-05
## Rating 2.594583e-02 1.360725e-02 6.241517e-02 0.0058297693 6.813763e-02
## Branch 2.307060e-02 4.755874e-02 2.288884e-02 0.0485671073 4.013904e-02
## Customer.type 5.523017e-04 4.530930e-02 1.905288e-04 0.0001792863 5.489751e-02
## Gender 5.676415e-03 2.078372e-02 1.380556e-02 0.0036613491 3.614168e-04
## Dim.16 Dim.17 Dim.18 Dim.19 Dim.20
## Unit.price 0.0669765591 0.002145952 0.0006583885 0.001998380 0.0075438747
## Quantity 0.0000624302 0.005331724 0.0270469512 0.030969840 0.1686727122
## Rating 0.0199936058 0.014699996 0.0459594867 0.033377864 0.1893699560
## Branch 0.0211366879 0.045597436 0.0131557039 0.003576857 0.0699866884
## Customer.type 0.0006007309 0.001447561 0.0773465770 0.016965620 0.0347386027
## Gender 0.0004091263 0.007518441 0.0220398950 0.010545964 0.0003529779
# Cos2: quality of representation on the factore map
head(var$cos2)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Unit.price 2.521063e-04 3.571113e-03 0.0030268038 2.433454e-02 1.304083e-03
## Quantity 5.598815e-03 2.968180e-07 0.0029428208 5.295118e-06 9.588636e-04
## Rating 1.235877e-07 8.179024e-03 0.0047005553 5.893032e-04 3.424543e-04
## Branch 1.193790e-03 1.589122e-02 0.0378416277 7.199053e-03 3.572595e-03
## Customer.type 1.190080e-04 1.482662e-04 0.0143211691 3.798118e-08 3.120411e-05
## Gender 3.060301e-02 3.166460e-02 0.0009168444 4.866238e-05 1.654527e-03
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## Unit.price 0.0014383309 6.160346e-09 8.108051e-07 3.119774e-03 1.863992e-11
## Quantity 0.0006252846 2.968750e-03 1.403830e-02 1.503085e-02 6.911719e-11
## Rating 0.0001504843 1.563591e-04 3.064907e-03 2.785721e-03 1.810308e-04
## Branch 0.0010304424 8.770911e-04 9.218447e-05 7.185563e-04 5.837291e-03
## Customer.type 0.0097561847 2.546046e-02 2.323586e-04 6.258128e-08 1.092102e-03
## Gender 0.0002805366 4.158200e-04 3.028512e-04 1.018260e-05 9.969190e-04
## Dim.11 Dim.12 Dim.13 Dim.14 Dim.15
## Unit.price 7.234158e-10 3.083262e-09 2.165045e-09 1.210585e-03 5.559412e-03
## Quantity 3.107197e-08 2.464114e-04 1.695057e-04 8.389324e-04 2.553594e-10
## Rating 6.731860e-04 1.851571e-04 3.895653e-03 3.398621e-05 4.642737e-03
## Branch 2.661262e-04 1.130917e-03 2.619495e-04 1.179382e-03 8.055713e-04
## Customer.type 3.050372e-07 2.052933e-03 3.630122e-08 3.214359e-08 3.013736e-03
## Gender 3.222168e-05 4.319630e-04 1.905935e-04 1.340548e-05 1.306221e-07
## Dim.16 Dim.17 Dim.18 Dim.19 Dim.20
## Unit.price 4.485859e-03 4.605108e-06 4.334755e-07 3.993521e-06 5.691004e-05
## Quantity 3.897530e-09 2.842728e-05 7.315376e-04 9.591310e-04 2.845048e-02
## Rating 3.997443e-04 2.160899e-04 2.112274e-03 1.114082e-03 3.586098e-02
## Branch 2.233798e-04 1.039563e-03 8.653627e-05 6.396955e-06 2.449068e-03
## Customer.type 3.608777e-07 2.095433e-06 5.982493e-03 2.878322e-04 1.206771e-03
## Gender 1.673843e-07 5.652695e-05 4.857570e-04 1.112174e-04 1.245934e-07
# Contributions to the dimensions
var$contrib
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Unit.price 1.21960969 4.73431150 4.465104 13.17405745 3.1328685
## Quantity 5.74747430 0.04316186 4.402723 0.19433266 2.6863833
## Rating 0.02700328 7.16482772 5.564344 2.05011096 1.6054282
## Branch 3.75325468 14.12370406 22.327460 10.13352731 7.3332492
## Customer.type 0.83794809 0.96466391 9.712447 0.01645857 0.4846134
## Gender 13.43727083 14.09749740 2.457464 0.58912121 3.5287959
## Product.line 17.76582517 7.96155013 11.027411 20.24077601 18.9165763
## Payment 5.51439117 10.34348866 2.357113 3.63022453 11.0884671
## day 23.84364755 22.36952333 9.177146 13.38469906 18.0689951
## hour 27.85357524 18.19727143 28.508788 36.58669225 33.1546229
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## Unit.price 3.306007 0.006939316 0.08118742 5.09376094 4.019150e-04
## Quantity 2.179783 4.817268925 10.68286615 11.18069589 7.739364e-04
## Rating 1.069350 1.105543152 4.99159337 4.81333217 1.252531e+00
## Branch 3.957320 3.702980660 1.22426209 3.45718352 1.005849e+01
## Customer.type 8.610221 14.107407426 1.37438981 0.02281388 3.076408e+00
## Gender 1.460055 1.802879859 1.56908041 0.29100901 2.939288e+00
## Product.line 22.300928 30.443708058 28.17627872 9.99052373 6.066860e+00
## Payment 8.242191 9.674356147 9.45650606 7.63997557 3.397197e+00
## day 16.544085 15.621861132 17.13762774 15.95374806 2.920421e+01
## hour 32.330060 18.717055324 25.30620824 41.55695722 4.400383e+01
## Dim.11 Dim.12 Dim.13 Dim.14 Dim.15
## Unit.price 0.002533695 0.00527868 0.004463239 3.37752394 7.411049527
## Quantity 0.016605222 1.49228056 1.248846182 2.81166942 0.001588332
## Rating 2.444149726 1.29357113 5.986967363 0.56591622 6.772551853
## Branch 2.173297038 4.52116561 2.195535814 4.71457999 3.989627061
## Customer.type 0.052027943 4.30732308 0.018275842 0.01740395 5.456547222
## Gender 0.534729818 1.97580177 1.324252349 0.35542004 0.035923085
## Product.line 24.757547389 26.45945964 21.431972137 27.99759108 5.172609186
## Payment 10.751972275 2.71608510 1.674391530 5.36483830 0.262359495
## day 21.552900130 11.72909249 21.575817902 29.80619183 18.965154217
## hour 37.714236765 45.49994193 44.539477640 24.98886521 51.932590023
## Dim.16 Dim.17 Dim.18 Dim.19 Dim.20
## Unit.price 6.681675466 0.2184465 0.068507 0.2106016 0.82107991
## Quantity 0.006228124 0.5427412 2.814304 3.2637936 18.35844070
## Rating 1.994590155 1.4963817 4.782201 3.5175661 20.61114132
## Branch 2.108625627 4.6415773 1.368884 0.3769514 7.61739378
## Customer.type 0.059929761 0.1473540 8.048107 1.7879421 3.78097067
## Gender 0.040815016 0.7653374 2.293307 1.1113991 0.03841833
## Product.line 3.845050369 2.1650578 8.168468 14.5697535 7.32605826
## Payment 13.073562558 1.3884063 3.636465 16.2784712 0.03384746
## day 25.225238473 49.7819035 22.899180 30.1937079 19.63185593
## hour 46.964284452 38.8527944 45.920578 28.6898136 21.78079363
For better visualization, a plot detailing the contributions of each variable to dim(1-3) shall be plotted.
# Plot of variables
fviz_famd_var(res.famd, repel = TRUE)
# Contribution to the first dimension
fviz_contrib(res.famd, "var", axes = 1)
# Contribution to the second dimension
fviz_contrib(res.famd, "var", axes = 2)
# Contribution to the third dimension
fviz_contrib(res.famd, "var", axes = 3)
Observation All variables are represented in Dim1 and
Dim2 excluding quantity which is solely represented in Dim1 and rating
solely represented in Dim 2
Dim1: Hour, day, and product line contribute the most to PC1 Dim2: day, hour and branch in descending orde contribute the most to PC2 Dim3: The hour and branch are the greatest contributors to PC3
# Extracting the results for quantitative variables
#
quanti.var <- get_famd_var(res.famd, "quanti.var")
quanti.var
## FAMD results for quantitative variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates"
## 2 "$cos2" "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
# highlighting the most contributing quantitative variables on scatter
# plot
fviz_famd_var(res.famd, "quanti.var", col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
# highligting the quality of representation of the continuous variables
# on the factor map. If a variable is well represented by two dimensions,
# the sum of the cos2 is closed to one
# Color by cos2 values: quality on the factor map
fviz_famd_var(res.famd, "quanti.var", col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
Observations From the visualization, the most important
contributing continuous variable was the rating. However it was poorly
represented in Dim 1 and Dim 2. with a sum(cos2) slightly higher than
0.09
# extracting qualitative variables format
#
quali.var <- get_famd_var(res.famd, "quali.var")
quali.var
## FAMD results for qualitative variable categories
## ===================================================
## Name Description
## 1 "$coord" "Coordinates"
## 2 "$cos2" "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
# Visualize the qualitative variables
fviz_famd_var(res.famd, "quali.var", col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
)
# Color by cos2 values: quality on the factor map
fviz_famd_var(res.famd, "quali.var", col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
observations
The strongest contributing factor was the gender. Whether the client was male or female. However these qualitative variables are not well represented by the first two dimensions. They has a sum(cos2) slightly higher than 0.4 which is low
# format to get these individuals
#
ind <- get_famd_ind(res.famd)
ind
## FAMD results for individuals
## ===================================================
## Name Description
## 1 "$coord" "Coordinates"
## 2 "$cos2" "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
# To plot individuals, use the function fviz_mfa_ind()
#
fviz_famd_ind(res.famd, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
## Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 946 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# coloring the invoice details based on the total plus tax
fviz_mfa_ind(res.famd,
habillage = "Gender", # color by groups
palette = c("#00AFBB", "#E7B800"),
addEllipses = TRUE, ellipse.type = "confidence",
repel = TRUE # Avoid text overlapping
)
## Warning: ggrepel: 947 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
From the analysis, 20 dimensions explain 70% of the data in data
set.
We shall select the features by employing; 1. Filter methods 2. Embedded Methods
We shall select features from the numerical data in the data set
# suppressing warnings
suppressWarnings(
suppressMessages(if
(!require(caret, quietly=TRUE))
install.packages("caret")))
# specifying numeric columns
cols <- c("Unit.price", "Quantity", "minute", "Rating", "gross.income", "Tax", "cogs", "Total")
# numerical dataset
#
sales_num <- sales_org[cols]
# Calculating the correlation matrix of numerical variables
# ---
#
correlationMatrix <- cor(sales_num)
# Find attributes that are highly correlated
# ---
#
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)
# Highly correlated attributes
# ---
#
highlyCorrelated
## [1] 5 6 7
names(sales_num[,highlyCorrelated])
## [1] "gross.income" "Tax" "cogs"
The tax, cost of goods sold, and gross income have high correlations >/= 0.75
# Removing Redundant Features
# ---
#
sales_num1<-sales_num[-highlyCorrelated]
# Performing our graphical comparison
# ---
#
par(mfrow = c(1, 2))
corrplot(correlationMatrix, order = "hclust")
corrplot(cor(sales_num1), order = "hclust")
The Cogs, gross income, and tax columns that had high correlations have
been dropped
# loading wskm package
# ---
#
suppressWarnings(
suppressMessages(if
(!require(wskm, quietly=TRUE))
install.packages("wskm")))
library(wskm)
# Creating model removing total the target
#
set.seed(2)
model <- ewkm(sales_num[1:7], 3, lambda=2, maxiter=1000)
# Loading and installing our cluster package
# ---
#
suppressWarnings(
suppressMessages(if
(!require(cluster, quietly=TRUE))
install.packages("cluster")))
library("cluster")
# Cluster Plot against 1st 2 principal components
# ---
#
clusplot(sales_num[1:7], model$cluster, color=TRUE, shade=TRUE,
labels=2, lines=1,main='Cluster Analysis for Iris')
# Weights remain stored in the model and we can check them as follows:
#
round(model$weights*100,2)
## Unit.price Quantity minute Rating gross.income Tax cogs
## 1 0 0 0 0.00 50 50 0
## 2 0 0 0 99.99 0 0 0
## 3 0 0 0 0.00 50 50 0
The first component is explained the tax and the gross income The second component is full explained by the rating
Dimension reduction
For dimension reduction, FAMD(Factor Analysis for Mixed Data ) was applied. IT was established that 20 principal components were required to explain 72% of the information in the data set.
From the analysis The most important components that had the most contribution were the Hour, day, and product line which contributed the most to the first principal component and day, hour and branch which contributed to the second principal component
The most important contributing continuous variable was the rating and the most important qualitative variable is gender.
Feature selection Using embedded and filtering It could be concluded that all other continuous variables aside from Cost of goods sold(cogs), gross income,and tax could help the marketing department develop a marketing strategy
Yes. The data provided by Carrefour supermarket is sufficient for the retailer to establish
Yes. It is crucial for a supermarket to gain insights from its business dealings in order to best inform a marketing strategy and maximize profitability