Defining the Question

I am a data analyst at Carrefour Kenya, currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax). I’ll explore a recent marketing dataset by performing various unsupervised learning techniques and later provide recommendations based on my insights. This first section of the project entails reducing the dataset to a low-dimensional dataset using the t-SNE algorithm or PCA. I will perform my analysis and provide the insights gained from it.

A sales dataset has been provided to perform dimensionality reduction on. We begin by loading and previewing the dataset.

Loading the Dataset

sales <- read.csv("http://bit.ly/CarreFourDataset")

Let’s preview the top of our dataset

head(sales)
##    Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
## 3 631-41-3108      A        Normal   Male     Home and lifestyle      46.33
## 4 123-19-1176      A        Member   Male      Health and beauty      58.22
## 5 373-73-7910      A        Normal   Male      Sports and travel      86.31
## 6 699-14-3026      C        Normal   Male Electronic accessories      85.39
##   Quantity     Tax      Date  Time     Payment   cogs gross.margin.percentage
## 1        7 26.1415  1/5/2019 13:08     Ewallet 522.83                4.761905
## 2        5  3.8200  3/8/2019 10:29        Cash  76.40                4.761905
## 3        7 16.2155  3/3/2019 13:23 Credit card 324.31                4.761905
## 4        8 23.2880 1/27/2019 20:33     Ewallet 465.76                4.761905
## 5        7 30.2085  2/8/2019 10:37     Ewallet 604.17                4.761905
## 6        7 29.8865 3/25/2019 18:30     Ewallet 597.73                4.761905
##   gross.income Rating    Total
## 1      26.1415    9.1 548.9715
## 2       3.8200    9.6  80.2200
## 3      16.2155    7.4 340.5255
## 4      23.2880    8.4 489.0480
## 5      30.2085    5.3 634.3785
## 6      29.8865    4.1 627.6165

Let’s preview the bottom of our dataset

tail(sales)
##       Invoice.ID Branch Customer.type Gender           Product.line Unit.price
## 995  652-49-6720      C        Member Female Electronic accessories      60.95
## 996  233-67-5758      C        Normal   Male      Health and beauty      40.35
## 997  303-96-2227      B        Normal Female     Home and lifestyle      97.38
## 998  727-02-1313      A        Member   Male     Food and beverages      31.84
## 999  347-56-2442      A        Normal   Male     Home and lifestyle      65.82
## 1000 849-09-3807      A        Member Female    Fashion accessories      88.34
##      Quantity     Tax      Date  Time Payment   cogs gross.margin.percentage
## 995         1  3.0475 2/18/2019 11:40 Ewallet  60.95                4.761905
## 996         1  2.0175 1/29/2019 13:46 Ewallet  40.35                4.761905
## 997        10 48.6900  3/2/2019 17:16 Ewallet 973.80                4.761905
## 998         1  1.5920  2/9/2019 13:22    Cash  31.84                4.761905
## 999         1  3.2910 2/22/2019 15:33    Cash  65.82                4.761905
## 1000        7 30.9190 2/18/2019 13:28    Cash 618.38                4.761905
##      gross.income Rating     Total
## 995        3.0475    5.9   63.9975
## 996        2.0175    6.2   42.3675
## 997       48.6900    4.4 1022.4900
## 998        1.5920    7.7   33.4320
## 999        3.2910    4.1   69.1110
## 1000      30.9190    6.6  649.2990

Data Exploration

Check the structure of the dataset

str(sales)
## 'data.frame':    1000 obs. of  16 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "A" "C" "A" "A" ...
##  $ Customer.type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...

Our dataset has 1000 rows and 16 columns: 8 are of character data type, 1 is of integer data type and the other 7 are numeric.
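
As a quick sanity check on these counts, we can tabulate the column classes directly (a one-line sketch; the exact class names depend on how read.csv parsed each column):

table(sapply(sales, class))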

Checking the summary of our dataset

summary(sales)
##   Invoice.ID           Branch          Customer.type         Gender         
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product.line         Unit.price       Quantity          Tax         
##  Length:1000        Min.   :10.08   Min.   : 1.00   Min.   : 0.5085  
##  Class :character   1st Qu.:32.88   1st Qu.: 3.00   1st Qu.: 5.9249  
##  Mode  :character   Median :55.23   Median : 5.00   Median :12.0880  
##                     Mean   :55.67   Mean   : 5.51   Mean   :15.3794  
##                     3rd Qu.:77.94   3rd Qu.: 8.00   3rd Qu.:22.4453  
##                     Max.   :99.96   Max.   :10.00   Max.   :49.6500  
##      Date               Time             Payment               cogs       
##  Length:1000        Length:1000        Length:1000        Min.   : 10.17  
##  Class :character   Class :character   Class :character   1st Qu.:118.50  
##  Mode  :character   Mode  :character   Mode  :character   Median :241.76  
##                                                           Mean   :307.59  
##                                                           3rd Qu.:448.90  
##                                                           Max.   :993.00  
##  gross.margin.percentage  gross.income         Rating           Total        
##  Min.   :4.762           Min.   : 0.5085   Min.   : 4.000   Min.   :  10.68  
##  1st Qu.:4.762           1st Qu.: 5.9249   1st Qu.: 5.500   1st Qu.: 124.42  
##  Median :4.762           Median :12.0880   Median : 7.000   Median : 253.85  
##  Mean   :4.762           Mean   :15.3794   Mean   : 6.973   Mean   : 322.97  
##  3rd Qu.:4.762           3rd Qu.:22.4453   3rd Qu.: 8.500   3rd Qu.: 471.35  
##  Max.   :4.762           Max.   :49.6500   Max.   :10.000   Max.   :1042.65

Data Cleaning

Checking for missing values

colSums(is.na(sales))
##              Invoice.ID                  Branch           Customer.type 
##                       0                       0                       0 
##                  Gender            Product.line              Unit.price 
##                       0                       0                       0 
##                Quantity                     Tax                    Date 
##                       0                       0                       0 
##                    Time                 Payment                    cogs 
##                       0                       0                       0 
## gross.margin.percentage            gross.income                  Rating 
##                       0                       0                       0 
##                   Total 
##                       0

There are no missing values present in our dataset. Let’s check for duplicates now.

sales[duplicated(sales),]
##  [1] Invoice.ID              Branch                  Customer.type          
##  [4] Gender                  Product.line            Unit.price             
##  [7] Quantity                Tax                     Date                   
## [10] Time                    Payment                 cogs                   
## [13] gross.margin.percentage gross.income            Rating                 
## [16] Total                  
## <0 rows> (or 0-length row.names)

There are no duplicates in our dataset either.

Checking for outliers using boxplots

boxplot(sales$Unit.price,
        main ="Unit Price",
        col = "orange",
        border  = 'brown',
        horizontal = TRUE,
        notch = TRUE)

The Unit price column has no outliers

boxplot(sales$Quantity,
        main ="Quantity",
        col = "green",
        border  = 'blue',
        horizontal = TRUE,
        notch = TRUE)

There are no outliers present in the quantity column

boxplot(sales$Tax,
        main ="Tax",
        col = "orange",
        border  = 'brown',
        horizontal = TRUE,
        notch = TRUE)

The Tax column has a few outliers.

boxplot(sales$cogs,
        main ="Cogs",
        col = "green",
        border  = 'blue',
        horizontal = TRUE,
        notch = TRUE)

There are a few outliers in the cogs column

boxplot(sales$gross.income,
        main ="Gross Income",
        col = "orange",
        border  = 'brown',
        horizontal = TRUE,
        notch = TRUE)

There are a few outliers in the Gross Income column.

boxplot(sales$Rating,
        main ="Rating",
        col = "orange",
        border  = 'brown',
        horizontal = TRUE,
        notch = TRUE)

There are no outliers in the rating column.
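
To back up the visual checks above, here is a small sketch that counts the points flagged by the same 1.5 * IQR rule that boxplot() uses (boxplot.stats() returns them in its out component):

num_cols <- sales[sapply(sales, is.numeric)]
sapply(num_cols, function(x) length(boxplot.stats(x)$out))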

Exploratory Data Analysis

Univariate Analysis

Measures of Central Tendency

Mean

Let’s find the mean of our numeric columns

colMeans(sales[sapply(sales,is.numeric)])
##              Unit.price                Quantity                     Tax 
##               55.672130                5.510000               15.379369 
##                    cogs gross.margin.percentage            gross.income 
##              307.587380                4.761905               15.379369 
##                  Rating                   Total 
##                6.972700              322.966749
Median

Let’s find the median of our numeric columns

unitprice_median <- median(sales$Unit.price)
unitprice_median
## [1] 55.23
qty_median <- median(sales$Quantity)
qty_median
## [1] 5
tax_median <- median(sales$Tax)
tax_median
## [1] 12.088
cogs_median <- median(sales$cogs)
cogs_median
## [1] 241.76
income_median <- median(sales$gross.income)
income_median
## [1] 12.088
rating_median <- median (sales$Rating)
rating_median
## [1] 7
Mode

Finding the mode of our numeric columns. Let’s create a mode function since R does not have one built in.

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]}
getmode(sales$Unit.price)
## [1] 83.77
getmode(sales$Quantity)
## [1] 10
getmode(sales$Tax)
## [1] 39.48
getmode(sales$cogs)
## [1] 789.6
getmode(sales$gross.income)
## [1] 39.48
getmode(sales$Rating)
## [1] 6
Range

Finding the range in our columns. We expect the result to give us the minimum and maximum values of our numeric columns.

range(sales$Unit.price)
## [1] 10.08 99.96
range(sales$Quantity)
## [1]  1 10
range(sales$Tax)
## [1]  0.5085 49.6500
range(sales$cogs)
## [1]  10.17 993.00
range(sales$gross.income)
## [1]  0.5085 49.6500
range(sales$Rating)
## [1]  4 10
Quantiles

Let’s get the quantiles in our columns.

quantile(sales$Unit.price)
##     0%    25%    50%    75%   100% 
## 10.080 32.875 55.230 77.935 99.960
quantile(sales$Quantity)
##   0%  25%  50%  75% 100% 
##    1    3    5    8   10
quantile(sales$Tax)
##        0%       25%       50%       75%      100% 
##  0.508500  5.924875 12.088000 22.445250 49.650000
quantile(sales$cogs)
##       0%      25%      50%      75%     100% 
##  10.1700 118.4975 241.7600 448.9050 993.0000
quantile(sales$gross.income)
##        0%       25%       50%       75%      100% 
##  0.508500  5.924875 12.088000 22.445250 49.650000
quantile(sales$Rating)
##   0%  25%  50%  75% 100% 
##  4.0  5.5  7.0  8.5 10.0
Variance

Finding the variance of the numeric columns shows us how the data values are dispersed around the mean.

var(sales$Unit.price)
## [1] 701.9653
var(sales$Quantity)
## [1] 8.546446
var(sales$Tax)
## [1] 137.0966
var(sales$cogs)
## [1] 54838.64
var(sales$gross.income)
## [1] 137.0966
var(sales$Rating)
## [1] 2.953518
Standard Deviation

Let’s find the standard deviation of numeric columns

sd(sales$Unit.price)
## [1] 26.49463
sd(sales$Quantity)
## [1] 2.923431
sd(sales$Tax)
## [1] 11.70883
sd(sales$cogs)
## [1] 234.1765
sd(sales$gross.income)
## [1] 11.70883
sd(sales$Rating)
## [1] 1.71858
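
For convenience, the individual calls above can be collapsed into a single pass over the numeric columns (a small sketch, not part of the original workflow):

num_cols <- sales[sapply(sales, is.numeric)]
sapply(num_cols, var)
sapply(num_cols, sd)
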
Histograms
hist(sales$Unit.price, col  = "yellow")

hist(sales$Quantity, col  = "pink")

hist(sales$Tax, col = "dark red")

hist(sales$cogs, col = "orange")

hist(sales$gross.income, col= "light blue")

hist(sales$Rating, col  ="dark green")

Bivariate Analysis

Correlation

library(corrplot)
## corrplot 0.92 loaded
correlation <- cor(sales[,c(6,7,8,12,14,15,16)])
corrplot(correlation, method = "square", type = "lower", diag = TRUE)

Most of the monetary variables are very strongly correlated: Tax, cogs, gross income and Total are perfectly correlated with one another, and each correlates strongly with Quantity and Unit price, while Rating is essentially uncorrelated with everything else.
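
As a small supporting sketch (the 0.7 threshold is chosen purely for illustration), we can list the pairs of variables whose absolute correlation exceeds it, using the correlation matrix computed above:

high_pairs <- which(abs(correlation) > 0.7 & upper.tri(correlation), arr.ind = TRUE)
data.frame(var1 = rownames(correlation)[high_pairs[, 1]],
           var2 = colnames(correlation)[high_pairs[, 2]],
           r    = correlation[high_pairs])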

Correlation Matrix

This gives us the extent to which variables correlate with each other.

cor(sales$Unit.price, sales$Quantity)
## [1] 0.01077756
cor(sales$Tax, sales$Quantity)
## [1] 0.7055102
cor(sales$cogs, sales$gross.income)
## [1] 1
cor(sales$Quantity, sales$Rating)
## [1] -0.0158149
cor(sales$Tax, sales$gross.income)
## [1] 1

Dimensionality Reduction

t-Distributed Stochastic Neighbour Embedding (t-SNE) and PCA will be applied to perform dimensionality reduction on our dataset.

Principal Component Analysis

PCA is a technique that extracts a new set of variables, called principal components, from an existing large set of variables. Let’s apply it and see how it performs.

Selecting the numerical data

df <- sales[,c(6:8,12:16)]
head(df)
##   Unit.price Quantity     Tax   cogs gross.margin.percentage gross.income
## 1      74.69        7 26.1415 522.83                4.761905      26.1415
## 2      15.28        5  3.8200  76.40                4.761905       3.8200
## 3      46.33        7 16.2155 324.31                4.761905      16.2155
## 4      58.22        8 23.2880 465.76                4.761905      23.2880
## 5      86.31        7 30.2085 604.17                4.761905      30.2085
## 6      85.39        7 29.8865 597.73                4.761905      29.8865
##   Rating    Total
## 1    9.1 548.9715
## 2    9.6  80.2200
## 3    7.4 340.5255
## 4    8.4 489.0480
## 5    5.3 634.3785
## 6    4.1 627.6165

The numeric columns are then passed to prcomp(). We also set two arguments, center and scale., to TRUE, and then preview the resulting object with summary(). Note that the constant gross.margin.percentage column is left out, since a zero-variance column cannot be rescaled to unit variance.

sales.pca <- prcomp(sales[,c(6,7,8,12,14,15,16)], center = TRUE, scale. = TRUE)
summary(sales.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4       PC5       PC6
## Standard deviation     2.2185 1.0002 0.9939 0.30001 5.138e-16 1.387e-16
## Proportion of Variance 0.7031 0.1429 0.1411 0.01286 0.000e+00 0.000e+00
## Cumulative Proportion  0.7031 0.8460 0.9871 1.00000 1.000e+00 1.000e+00
##                              PC7
## Standard deviation     1.295e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

We obtain 7 principal components, each of which explains a percentage of the total variation in the dataset: PC1 explains 70.3% of the total variation, PC2 explains 14.3%, PC3 explains 14.1%, and so on.
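
These proportions can be recomputed directly from the component standard deviations stored in the prcomp object (a quick verification sketch):

pve <- sales.pca$sdev^2 / sum(sales.pca$sdev^2)
round(pve, 4)
round(cumsum(pve), 4)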

Let’s call str() to have a look at the PCA model.

str(sales.pca)
## List of 5
##  $ sdev    : num [1:7] 2.22 1.00 9.94e-01 3.00e-01 5.14e-16 ...
##  $ rotation: num [1:7, 1:7] -0.292 -0.325 -0.45 -0.45 -0.45 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
##   .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:7] 55.67 5.51 15.38 307.59 15.38 ...
##   ..- attr(*, "names")= chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
##  $ scale   : Named num [1:7] 26.49 2.92 11.71 234.18 11.71 ...
##   ..- attr(*, "names")= chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
##  $ x       : num [1:1000, 1:7] -2.005 2.306 -0.186 -1.504 -2.8 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

We go ahead to plot the PCA model. First, we install the ggbiplot visualization package.

install.packages("devtools")
## Installing package into '/home/binti/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("Rcpp", dependencies = TRUE)
## Installing package into '/home/binti/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(devtools)
## Loading required package: usethis
install_github("vqv/ggbiplot")
## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
##   Use `force = TRUE` to force installation

Loading our ggbiplot library

library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
ggbiplot(sales.pca)

From the graph above, it’s evident that gross income, unit price, quantity and rating are important factors in this analysis.

PC1 explains 70.3% of the total variance. This means that roughly 70% of the information in the dataset (7 variables) can be captured by the first principal component (PC1) alone.

PC2 explains 14.3% of the variance.
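
This reading of the biplot can be checked against the loadings stored in the PCA model (a short sketch using the rotation matrix returned by prcomp; the sign and size of each entry show how a variable contributes to a component):

round(sales.pca$rotation[, 1:2], 3)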

To add more detail to the plot, we pass the row names as labels.

ggbiplot(sales.pca, labels=rownames(sales), obs.scale = 1, var.scale = 1)

Let’s plot PC3 and PC4.

ggbiplot(sales.pca,ellipse=TRUE,choices=c(3,4))

From the graph, PC3 explains 14.1% of the variance while PC4 explains 1.3% of the total variance in the dataset.

To add more detail to the plot, we pass the row names as labels.

ggbiplot(sales.pca,ellipse=TRUE,choices=c(3,4),   labels=rownames(sales))

t-Distributed Stochastic Neighbour Embedding (t-SNE)

t-SNE projects the data into a low-dimensional space in such a way that the clustering present in the high-dimensional space is preserved.

Loading the t-sne library

library(Rtsne)

Curating the dataset for analysis

Labels <- sales$Product.line
sales$Product.line <- as.factor(sales$Product.line)

Creating a colour for each product line to use in the plot

colors = rainbow(length(unique(sales$Product.line)))
names(colors) = unique(sales$Product.line)

Executing our algorithm on the curated data

tsne <- Rtsne(sales[,-5], dims = 2, perplexity=30, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1000 x 50 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.09 seconds (sparsity = 0.101260)!
## Learning embedding...
## Iteration 50: error is 60.785506 (50 iterations in 0.12 seconds)
## Iteration 100: error is 52.526184 (50 iterations in 0.11 seconds)
## Iteration 150: error is 50.877044 (50 iterations in 0.12 seconds)
## Iteration 200: error is 50.352214 (50 iterations in 0.12 seconds)
## Iteration 250: error is 50.074397 (50 iterations in 0.12 seconds)
## Iteration 300: error is 0.561052 (50 iterations in 0.12 seconds)
## Iteration 350: error is 0.397329 (50 iterations in 0.12 seconds)
## Iteration 400: error is 0.362099 (50 iterations in 0.12 seconds)
## Iteration 450: error is 0.344496 (50 iterations in 0.12 seconds)
## Iteration 500: error is 0.337596 (50 iterations in 0.12 seconds)
## Fitting performed in 1.21 seconds.

Checking the summary of our model

summary(tsne)
##                     Length Class  Mode   
## N                      1   -none- numeric
## Y                   2000   -none- numeric
## costs               1000   -none- numeric
## itercosts             10   -none- numeric
## origD                  1   -none- numeric
## perplexity             1   -none- numeric
## theta                  1   -none- numeric
## max_iter               1   -none- numeric
## stop_lying_iter        1   -none- numeric
## mom_switch_iter        1   -none- numeric
## momentum               1   -none- numeric
## final_momentum         1   -none- numeric
## eta                    1   -none- numeric
## exaggeration_factor    1   -none- numeric

Plotting the embedding, with each point labelled and coloured by its product line

plot(tsne$Y, t='n', main="tsne")
text(tsne$Y, labels=sales$Product.line, col=colors[sales$Product.line])

Feature Selection

Filter Methods

These methods apply a metric to assign a score to each feature; the features are then ranked by their scores. We’ll use the findCorrelation function from the caret package to create a subset of variables. This function allows us to remove redundancy by correlation in the given dataset. We use numeric variables only, since we are working with a correlation matrix.

df <- sales[,c(6,7,8,12,14,15,16)]
head(df)
##   Unit.price Quantity     Tax   cogs gross.income Rating    Total
## 1      74.69        7 26.1415 522.83      26.1415    9.1 548.9715
## 2      15.28        5  3.8200  76.40       3.8200    9.6  80.2200
## 3      46.33        7 16.2155 324.31      16.2155    7.4 340.5255
## 4      58.22        8 23.2880 465.76      23.2880    8.4 489.0480
## 5      86.31        7 30.2085 604.17      30.2085    5.3 634.3785
## 6      85.39        7 29.8865 597.73      29.8865    4.1 627.6165

Loading our caret package.

suppressWarnings(
        suppressMessages(if
                         (!require(caret, quietly=TRUE))
                install.packages("caret")))
library(caret)

Let’s install the corrplot package for correlation.

suppressWarnings(
        suppressMessages(if
                         (!require(corrplot, quietly=TRUE))
                install.packages("corrplot")))
library(corrplot)

Getting the correlation matrix

corrmatrix <- cor(df)
corrmatrix
##                Unit.price    Quantity        Tax       cogs gross.income
## Unit.price    1.000000000  0.01077756  0.6339621  0.6339621    0.6339621
## Quantity      0.010777564  1.00000000  0.7055102  0.7055102    0.7055102
## Tax           0.633962089  0.70551019  1.0000000  1.0000000    1.0000000
## cogs          0.633962089  0.70551019  1.0000000  1.0000000    1.0000000
## gross.income  0.633962089  0.70551019  1.0000000  1.0000000    1.0000000
## Rating       -0.008777507 -0.01581490 -0.0364417 -0.0364417   -0.0364417
## Total         0.633962089  0.70551019  1.0000000  1.0000000    1.0000000
##                    Rating      Total
## Unit.price   -0.008777507  0.6339621
## Quantity     -0.015814905  0.7055102
## Tax          -0.036441705  1.0000000
## cogs         -0.036441705  1.0000000
## gross.income -0.036441705  1.0000000
## Rating        1.000000000 -0.0364417
## Total        -0.036441705  1.0000000

Finding the variables that are highly correlated

highlyCorrelated <- findCorrelation(corrmatrix, cutoff=0.75)
highlyCorrelated
## [1] 4 7 3

List of highly correlated variables

names(df[,highlyCorrelated])
## [1] "cogs"  "Total" "Tax"

From the output above, the following variables are highly correlated: cogs, Total and Tax.

Let’s remove the redundant features

df_new <- df[-highlyCorrelated]
head(df_new)
##   Unit.price Quantity gross.income Rating
## 1      74.69        7      26.1415    9.1
## 2      15.28        5       3.8200    9.6
## 3      46.33        7      16.2155    7.4
## 4      58.22        8      23.2880    8.4
## 5      86.31        7      30.2085    5.3
## 6      85.39        7      29.8865    4.1

Graphical Comparison

par(mfrow = c(1, 2))
corrplot(corrmatrix, order = "hclust")
corrplot(cor(df_new), order = "hclust")

Embedded Methods

Here we’ll use the ewkm function from the wskm package, a weighted subspace clustering algorithm that is well suited to very high-dimensional data.

Let’s load our wskm package

suppressWarnings(
        suppressMessages(if
                         (!require(wskm, quietly=TRUE))
                install.packages("wskm")))
library(wskm)
set.seed(23)
model <- ewkm(df,3, lambda = 2, maxiter = 1000)

Loading the cluster package

suppressWarnings(
        suppressMessages(if
                         (!require(cluster, quietly=TRUE))
                install.packages("cluster")))
library("cluster")

We plot the clusters against the first 2 principal components.

clusplot(df, model$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=1,main='Cluster Analysis for Supermarket sales')

From the plot above, we note that the first two principal components explain 84.6% of the point variability. Weights are calculated for each variable and cluster. The weights are a measure of the relative importance of each variable with regard to an observation’s membership of its cluster. The weights are incorporated into the distance function, which typically reduces the distance for more important variables.

Weights remain stored in the model and they can be checked as follows:

round(model$weights * 100)
##   Unit.price Quantity Tax cogs gross.income Rating Total
## 1          0        0   0    0            0    100     0
## 2          0        0  50    0           50      0     0
## 3          0        0  50    0           50      0     0

Feature Ranking

Here, we will be using the FSelector package. This is a package that contains functions for selecting attributes from the dataset

Loading the package required

suppressWarnings(
        suppressMessages(if
                         (!require(FSelector, quietly=TRUE))
                install.packages("FSelector")))
install.packages('FSelector')
## Installing package into '/home/binti/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

From the FSelector package, the correlation coefficient is used as a unit of evaluation. This is one of the algorithms contained in the FSelector package that can be used to rank the variables.

scores <- linear.correlation(Total~.,df)
scores
##              attr_importance
## Unit.price         0.6339621
## Quantity           0.7055102
## Tax                1.0000000
## cogs               1.0000000
## gross.income       1.0000000
## Rating             0.0364417

From the output above, we get a table with the variable names on the left and their scores on the right. To help make a decision, we can define a cut-off; for example, to keep the top 5 representative variables we use the cutoff.k function from the FSelector package.

subset <- cutoff.k(scores, 5)
as.data.frame(subset)
##         subset
## 1          Tax
## 2         cogs
## 3 gross.income
## 4     Quantity
## 5   Unit.price

The cut-off can also be set as a percentage, which indicates that we want to work with that percentage of the best variables.

subset2 <-cutoff.k.percent(scores, 0.4)
as.data.frame(subset2)
##   subset2
## 1     Tax
## 2    cogs

We can use an entropy-based approach instead of the correlation coefficient scores.

scores2 <- information.gain(Total~., df)
scores2
##              attr_importance
## Unit.price         0.3084863
## Quantity           0.4211154
## Tax                1.6094379
## cogs               1.6094379
## gross.income       1.6094379
## Rating             0.0000000

Choosing variables by cut-off method

subset3 <- cutoff.k(scores2, 5)
as.data.frame(subset3)
##        subset3
## 1          Tax
## 2         cogs
## 3 gross.income
## 4     Quantity
## 5   Unit.price

Conclusions and Recommendations

  • From our analysis, the most important variables for determining the sales in the Carrefour supermarket are Tax, cogs, gross income, Quantity and Unit price.
  • When coming up with a marketing strategy, promotions, discounts or adverts, the team should take into consideration the tax, the unit price of the commodities, the quantity sold, the ratings given and the gross income generated per sale, in order to come up with tailor-made strategies.