Feature Selection for the Carrefour Supermarket Dataset in R

1. Defining the Question

Research Question

You are a data analyst at Carrefour Kenya undertaking a project to inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax). The project is divided into four parts in which you explore a recent marketing dataset using various unsupervised learning techniques and then provide recommendations based on your insights.

a.) Specifying the question

Determine which features contribute the most information to the dataset.

b.) The metric for success

Perform feature selection on the dataset using four methods: i.) the filter method, ii.) wrapper methods, iii.) embedded methods and iv.) feature ranking.

c.) Understanding the context

The dataset provided contains 16 attributes of customers who visit the three branches of the supermarket. Out of these, we need to select the attributes that carry the most information about the dataset.

d.) Experimental design taken

  1. Reading the data

  2. Checking the data - data understanding

  3. Implementing the solution

  4. Challenge the solution

  5. Follow up Questions

  6. Conclusion

  7. Recommendations

e.) Data appropriateness to answer the given question

The dataset contains several customer attributes, but only a handful of them carry most of the information, so the data is appropriate for a feature-selection exercise.

2. Loading the dataset

# import Tibble package
library(tibble)


# url <- "http://bit.ly/CarreFourDataset"

# load the dataset as dataframe

df <- read.csv('http://bit.ly/CarreFourDataset') 

# convert dataframe to Tibble

df <- as_tibble(df)

#check data structure of our dataset

class(df)
## [1] "tbl_df"     "tbl"        "data.frame"

3. Checking the data

# Preview the first 6 records
#
head(df)
## # A tibble: 6 × 16
##   Invoice.ID  Branch Customer.type Gender Product.line Unit.price Quantity   Tax
##   <chr>       <chr>  <chr>         <chr>  <chr>             <dbl>    <int> <dbl>
## 1 750-67-8428 A      Member        Female Health and …       74.7        7 26.1 
## 2 226-31-3081 C      Normal        Female Electronic …       15.3        5  3.82
## 3 631-41-3108 A      Normal        Male   Home and li…       46.3        7 16.2 
## 4 123-19-1176 A      Member        Male   Health and …       58.2        8 23.3 
## 5 373-73-7910 A      Normal        Male   Sports and …       86.3        7 30.2 
## 6 699-14-3026 C      Normal        Male   Electronic …       85.4        7 29.9 
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## #   gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## #   Total <dbl>
# Preview the last 6 records
#
tail(df)
## # A tibble: 6 × 16
##   Invoice.ID  Branch Customer.type Gender Product.line Unit.price Quantity   Tax
##   <chr>       <chr>  <chr>         <chr>  <chr>             <dbl>    <int> <dbl>
## 1 652-49-6720 C      Member        Female Electronic …       61.0        1  3.05
## 2 233-67-5758 C      Normal        Male   Health and …       40.4        1  2.02
## 3 303-96-2227 B      Normal        Female Home and li…       97.4       10 48.7 
## 4 727-02-1313 A      Member        Male   Food and be…       31.8        1  1.59
## 5 347-56-2442 A      Normal        Male   Home and li…       65.8        1  3.29
## 6 849-09-3807 A      Member        Female Fashion acc…       88.3        7 30.9 
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## #   gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## #   Total <dbl>
# check shape of the dataset

dim(df)
## [1] 1000   16

Our dataset contains 1000 records and 16 columns.

# check column datatypes

sapply(df, class)
##              Invoice.ID                  Branch           Customer.type 
##             "character"             "character"             "character" 
##                  Gender            Product.line              Unit.price 
##             "character"             "character"               "numeric" 
##                Quantity                     Tax                    Date 
##               "integer"               "numeric"             "character" 
##                    Time                 Payment                    cogs 
##             "character"             "character"               "numeric" 
## gross.margin.percentage            gross.income                  Rating 
##               "numeric"               "numeric"               "numeric" 
##                   Total 
##               "numeric"
# inspect variable classes

str(df)
## tibble [1,000 × 16] (S3: tbl_df/tbl/data.frame)
##  $ Invoice.ID             : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr [1:1000] "A" "C" "A" "A" ...
##  $ Customer.type          : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr [1:1000] "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr [1:1000] "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num [1:1000] 549 80.2 340.5 489 634.4 ...

We need to convert the Date and Time columns to standard date and time formats and the categorical columns to the factor datatype.

# convert  date and time to standard date and time format

df$Date <- as.POSIXct(df$Date, format = "%m/%d/%Y")
df$Date <- as.Date(df$Date)

df$Time <- as.POSIXct(df$Time, format = "%H:%M")
df$Time <- format(df$Time,"%H:%M")

 

# helper to convert categorical (character) columns to factor

to.factor <- function(column){
   as.factor(column)
}


# categorical columns

df$Gender <- to.factor(df$Gender)
df$Branch <- to.factor(df$Branch)
df$Customer.type <- to.factor(df$Customer.type)
df$Product.line <- to.factor(df$Product.line)
df$Payment <- to.factor(df$Payment)

# preview dataset
head(df)
## # A tibble: 6 × 16
##   Invoice.ID  Branch Customer.type Gender Product.line Unit.price Quantity   Tax
##   <chr>       <fct>  <fct>         <fct>  <fct>             <dbl>    <int> <dbl>
## 1 750-67-8428 A      Member        Female Health and …       74.7        7 26.1 
## 2 226-31-3081 C      Normal        Female Electronic …       15.3        5  3.82
## 3 631-41-3108 A      Normal        Male   Home and li…       46.3        7 16.2 
## 4 123-19-1176 A      Member        Male   Health and …       58.2        8 23.3 
## 5 373-73-7910 A      Normal        Male   Sports and …       86.3        7 30.2 
## 6 699-14-3026 C      Normal        Male   Electronic …       85.4        7 29.9 
## # … with 8 more variables: Date <date>, Time <chr>, Payment <fct>, cogs <dbl>,
## #   gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## #   Total <dbl>

4. Data cleaning

Validation

The dataset was provided by the client.

Outliers

# select only numerical variables
 
library("plyr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data_num <- select_if(df, is.numeric) 
head(data_num)
## # A tibble: 6 × 8
##   Unit.price Quantity   Tax  cogs gross.margin.percen… gross.income Rating Total
##        <dbl>    <int> <dbl> <dbl>                <dbl>        <dbl>  <dbl> <dbl>
## 1       74.7        7 26.1  523.                  4.76        26.1     9.1 549. 
## 2       15.3        5  3.82  76.4                 4.76         3.82    9.6  80.2
## 3       46.3        7 16.2  324.                  4.76        16.2     7.4 341. 
## 4       58.2        8 23.3  466.                  4.76        23.3     8.4 489. 
## 5       86.3        7 30.2  604.                  4.76        30.2     5.3 634. 
## 6       85.4        7 29.9  598.                  4.76        29.9     4.1 628.
# Check for outliers


# import ggplot2 for plotting
library(ggplot2)


# vectorize column names

columns <-  c(names(data_num))

# function to plot a boxplot for a given numeric column

boxplots <- function(column_name){
  column <- unlist(data_num[, column_name])
  p1 <- ggplot(data = data_num, mapping = aes(x = column)) +
    geom_boxplot() +
    xlab(names(data_num[, column_name]))
  print(p1)
}

# plot one boxplot per numeric column (8 plots)

for (x in 1:length(columns)){
   boxplots(x)
}

Outliers exist in Tax, cogs, gross.income and Total columns.
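
To back this up numerically, the flagged points can be counted with the usual 1.5 × IQR rule; a minimal sketch, assuming data_num is the numeric subset created above:

# count outliers per numeric column using the 1.5 * IQR rule

count_outliers <- function(column){
  q <- quantile(column, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(column < q[1] - 1.5 * iqr | column > q[2] + 1.5 * iqr)
}

sapply(data_num, count_outliers)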

Missing data

# check for missing values
colSums(is.na(df))
##              Invoice.ID                  Branch           Customer.type 
##                       0                       0                       0 
##                  Gender            Product.line              Unit.price 
##                       0                       0                       0 
##                Quantity                     Tax                    Date 
##                       0                       0                       0 
##                    Time                 Payment                    cogs 
##                       0                       0                       0 
## gross.margin.percentage            gross.income                  Rating 
##                       0                       0                       0 
##                   Total 
##                       0

Our dataset has no missing values.

Duplicates

# check for duplicates

anyDuplicated(df)
## [1] 0

Our dataset has no duplicated rows.

Uniformity

# convert column names to lowercase 
names(df) <- tolower(names(df))

Relevance

# check relevance of columns
unique(df$gross.margin.percentage)
## [1] 4.761905
# drop column since it is a constant
df <- df[,-13]
data_num <- data_num[,-5]

The gross margin percentage is constant (4.761905) for every record, so it carries no information and is dropped.
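
As a cross-check, constant and near-constant columns can also be flagged automatically; a sketch using caret's nearZeroVar (caret is loaded again below for the filter method), best run before the column is dropped:

# flag zero- and near-zero-variance columns automatically

library(caret)
nzv <- nearZeroVar(df, saveMetrics = TRUE)
nzv[nzv$zeroVar | nzv$nzv, ]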

5. Exploratory Data Analysis

A comprehensive exploratory data analysis was done in the PCA exercise.

Link: https://rpubs.com/Magguire/913785

6. Implementing the solution

Feature Engineering

# Convert factor columns to their integer level codes so they can be used in the correlation- and clustering-based methods

fact_to_num <- function(column){
  as.integer(column)
}
df$branch <- fact_to_num(df$branch)
df$product.line <- fact_to_num(df$product.line)
df$customer.type <- fact_to_num(df$customer.type)
df$gender <- fact_to_num(df$gender)
df$payment <- fact_to_num(df$payment)
# extract month, day of the week and hour

# import lubridate package for date and time manipulation

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# weekday
df$weekday <- as.integer(wday(df$date, week_start=1)) # Monday as first day

# day of month
df$day <- as.integer(format(as.Date(df$date), "%d"))

# Month
df$month <- as.integer(format(as.Date(df$date), "%m"))

# Hour
df$hour <- as.integer(format(strptime(df$time,"%H:%M"),'%H'))
# drop unnecessary columns

# drop invoice id, date and time
df <- select(df,-c(invoice.id, date, time))

# preview dataset
head(df)
## # A tibble: 6 × 16
##   branch customer.type gender product.line unit.price quantity   tax payment
##    <int>         <int>  <int>        <int>      <dbl>    <int> <dbl>   <int>
## 1      1             1      1            4       74.7        7 26.1        3
## 2      3             2      1            1       15.3        5  3.82       1
## 3      1             2      2            5       46.3        7 16.2        2
## 4      1             1      2            4       58.2        8 23.3        3
## 5      1             2      2            6       86.3        7 30.2        3
## 6      3             2      2            1       85.4        7 29.9        3
## # … with 8 more variables: cogs <dbl>, gross.income <dbl>, rating <dbl>,
## #   total <dbl>, weekday <int>, day <int>, month <int>, hour <int>

Performing Feature Selection

i.) Filter Method

# loading caret package for findCorrelation function

library(caret)
## Loading required package: lattice
# load corrplot package to plot correlation
library(corrplot)
## corrplot 0.92 loaded
# correlation matrix
# ---
#
corr <- cor(df)
corr
##                    branch customer.type       gender product.line   unit.price
## branch         1.00000000  -0.019607869 -0.056317558 -0.053937557  0.028202440
## customer.type -0.01960787   1.000000000  0.039996160 -0.036800311 -0.020237875
## gender        -0.05631756   0.039996160  1.000000000  0.005193197  0.015444630
## product.line  -0.05393756  -0.036800311  0.005193197  1.000000000  0.019321028
## unit.price     0.02820244  -0.020237875  0.015444630  0.019321028  1.000000000
## quantity       0.01596379  -0.016762706 -0.074258307  0.020256001  0.010777564
## tax            0.04104666  -0.019670283 -0.049450989  0.031620725  0.633962089
## payment       -0.05010429   0.018073436  0.044577609  0.029896383 -0.015941048
## cogs           0.04104666  -0.019670283 -0.049450989  0.031620725  0.633962089
## gross.income   0.04104666  -0.019670283 -0.049450989  0.031620725  0.633962089
## rating         0.01023848   0.018888672  0.004800208 -0.020528973 -0.008777507
## total          0.04104666  -0.019670283 -0.049450989  0.031620725  0.633962089
## weekday       -0.03950365   0.048747377 -0.019730770  0.017043488  0.027211544
## day           -0.01990763   0.026396585  0.035570507 -0.056815945  0.051708609
## month         -0.02926049   0.005279579 -0.015200915 -0.001394159 -0.018218384
## hour           0.03300711  -0.018893298  0.084081139 -0.060849916  0.008242210
##                   quantity          tax      payment         cogs gross.income
## branch         0.015963788  0.041046665 -0.050104288  0.041046665  0.041046665
## customer.type -0.016762706 -0.019670283  0.018073436 -0.019670283 -0.019670283
## gender        -0.074258307 -0.049450989  0.044577609 -0.049450989 -0.049450989
## product.line   0.020256001  0.031620725  0.029896383  0.031620725  0.031620725
## unit.price     0.010777564  0.633962089 -0.015941048  0.633962089  0.633962089
## quantity       1.000000000  0.705510186 -0.003920990  0.705510186  0.705510186
## tax            0.705510186  1.000000000 -0.012433637  1.000000000  1.000000000
## payment       -0.003920990 -0.012433637  1.000000000 -0.012433637 -0.012433637
## cogs           0.705510186  1.000000000 -0.012433637  1.000000000  1.000000000
## gross.income   0.705510186  1.000000000 -0.012433637  1.000000000  1.000000000
## rating        -0.015814905 -0.036441705 -0.005381289 -0.036441705 -0.036441705
## total          0.705510186  1.000000000 -0.012433637  1.000000000  1.000000000
## weekday       -0.006818718  0.003096704 -0.035126428  0.003096704  0.003096704
## day           -0.027744137  0.009624700 -0.006506096  0.009624700  0.009624700
## month          0.032895342  0.015114331 -0.023007131  0.015114331  0.015114331
## hour          -0.007316886 -0.002770440  0.045420537 -0.002770440 -0.002770440
##                     rating        total      weekday          day        month
## branch         0.010238476  0.041046665 -0.039503650 -0.019907626 -0.029260490
## customer.type  0.018888672 -0.019670283  0.048747377  0.026396585  0.005279579
## gender         0.004800208 -0.049450989 -0.019730770  0.035570507 -0.015200915
## product.line  -0.020528973  0.031620725  0.017043488 -0.056815945 -0.001394159
## unit.price    -0.008777507  0.633962089  0.027211544  0.051708609 -0.018218384
## quantity      -0.015814905  0.705510186 -0.006818718 -0.027744137  0.032895342
## tax           -0.036441705  1.000000000  0.003096704  0.009624700  0.015114331
## payment       -0.005381289 -0.012433637 -0.035126428 -0.006506096 -0.023007131
## cogs          -0.036441705  1.000000000  0.003096704  0.009624700  0.015114331
## gross.income  -0.036441705  1.000000000  0.003096704  0.009624700  0.015114331
## rating         1.000000000 -0.036441705  0.032613920 -0.025270643 -0.046169543
## total         -0.036441705  1.000000000  0.003096704  0.009624700  0.015114331
## weekday        0.032613920  0.003096704  1.000000000 -0.095228079 -0.131061937
## day           -0.025270643  0.009624700 -0.095228079  1.000000000  0.077649975
## month         -0.046169543  0.015114331 -0.131061937  0.077649975  1.000000000
## hour          -0.030587644 -0.002770440  0.014484537  0.021185360  0.018556725
##                       hour
## branch         0.033007115
## customer.type -0.018893298
## gender         0.084081139
## product.line  -0.060849916
## unit.price     0.008242210
## quantity      -0.007316886
## tax           -0.002770440
## payment        0.045420537
## cogs          -0.002770440
## gross.income  -0.002770440
## rating        -0.030587644
## total         -0.002770440
## weekday        0.014484537
## day            0.021185360
## month          0.018556725
## hour           1.000000000
# Find attributes that are highly correlated (cutoff = 0.7)

highly_correlated <- findCorrelation(corr, cutoff=0.7)

# display index of columns
highly_correlated
## [1]  9 12  7 10
# display column names of selected highly correlated variables
names(df[,highly_correlated])
## [1] "cogs"         "total"        "tax"          "gross.income"
# create dataset without highly correlated variables

df2 <- df[-c(highly_correlated)]

# graphical comparison plot between dataset with and without highly correlated variables.

# create subplot (1row,2columns)
par(mfrow = c(1, 2))

# whole dataset
corrplot(corr, order = "hclust")

# dataset without highly correlated variables
corrplot(cor(df2), order = "hclust")

In this method, the important features are those that remain after dropping the highly correlated variables: tax, cogs, gross.income and total are perfectly correlated with one another (and strongly correlated with quantity), so they add little beyond redundancy.
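
For reference, the dropped and retained column names can be listed explicitly from the indices returned by findCorrelation; a small sketch:

# columns dropped by the filter method, and the columns that remain

dropped <- names(df)[highly_correlated]
retained <- setdiff(names(df), dropped)
dropped
retained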

ii.) Wrapper Method

# load clustvarsel package for model-based clustering to find the optimal subset of variables in our dataset that capture the most information.

library(clustvarsel)
## Loading required package: mclust
## Package 'mclust' version 5.4.10
## Type 'citation("mclust")' for citing this R package in publications.
## Package 'clustvarsel' version 2.3.4
## Type 'citation("clustvarsel")' for citing this R package in publications.
# load mclust package for clustering
library(mclust)
# using default method = sequential forward greedy search. Headlong search is potentially quicker but less optimal 
# G - numbers of mixture components (clusters) for which the BIC is to be calculated

out <-  clustvarsel(df, G = 1:5)
out
## ------------------------------------------------------ 
## Variable selection for Gaussian model-based clustering
## Stepwise (forward/backward) greedy search
## ------------------------------------------------------ 
## 
##  Variable proposed Type of step  BICclust Model G    BICdiff Decision
##              month          Add -2398.034     E 3 1079.12536 Accepted
##             gender          Add -3529.743   EII 5  333.68515 Accepted
##            payment          Add -5925.380   VEV 4   83.54031 Accepted
##              month       Remove -2427.732   EII 5  -20.48804 Accepted
##              month          Add -5925.380   VEV 4  -20.48804 Rejected
##             gender       Remove -2278.203     E 2 1315.86502 Rejected
## 
## Selected subset: gender, payment

The search selects gender and payment as the most informative variables.

# subsets of our model

out$subset
##  gender payment 
##       3       8
# Having identified the variables that we use, we proceed to build the clustering model:

Subset <- df[,out$subset]
model <-  Mclust(Subset, G = 1:5)
summary(model)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EII (spherical, equal volume) model with 5 components: 
## 
##  log-likelihood    n df       BIC       ICL
##       -1168.771 1000 15 -2441.159 -2441.207
## 
## Clustering table:
##   1   2   3   4   5 
## 160 178 314 185 163

The model distinguishes 5 clusters based on these features.

# Plotting clusters

plot(model,c("classification"))
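
To get a feel for what the clusters capture, the cluster assignments can be cross-tabulated against the two selected variables; a quick sketch (gender and payment are the integer-coded factors created in the feature engineering step):

# cross-tabulate cluster assignments against the selected variables

table(cluster = model$classification, gender = Subset$gender)
table(cluster = model$classification, payment = Subset$payment)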

iii.) Embedded Methods

Here we use entropy-weighted k-means (ewkm), a weighted subspace clustering algorithm that is well suited to high-dimensional data; the per-cluster weights it learns indicate how important each variable is.

# loading wskm package for  ewkm function

library(wskm)
## Loading required package: latticeExtra
## 
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
## 
##     layer
## Loading required package: fpc
# loading cluster package to plot clusters
library(cluster)
# building our model

set.seed(1234) # for reproducible cluster assignments

model <- ewkm(df, 3, lambda=2, maxiter=1000)
# Cluster Plot against 1st 2 principal components

clusplot(df, model$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=1,main='Cluster Analysis')


# calculating weights of each variable

round(model$weights*100,2)
##   branch customer.type gender product.line unit.price quantity tax payment cogs
## 1      0         50.00  50.00            0          0        0   0       0    0
## 2      0         43.39  56.60            0          0        0   0       0    0
## 3      0         45.77  54.22            0          0        0   0       0    0
##   gross.income rating total weekday day month hour
## 1            0      0     0       0   0     0    0
## 2            0      0     0       0   0     0    0
## 3            0      0     0       0   0     0    0

The most important variables according to this method are customer type and gender.
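
The same reading can be extracted programmatically from the weight matrix returned by ewkm; a small sketch that lists the two highest-weighted variables for each cluster (the weight columns follow the order of names(df)):

# top two weighted variables per cluster

apply(model$weights, 1, function(w) names(df)[order(w, decreasing = TRUE)[1:2]])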

iv.) Feature Ranking

# Loading FSelector package responsible for selecting attributes.

library(FSelector)
head(df)
## # A tibble: 6 × 16
##   branch customer.type gender product.line unit.price quantity   tax payment
##    <int>         <int>  <int>        <int>      <dbl>    <int> <dbl>   <int>
## 1      1             1      1            4       74.7        7 26.1        3
## 2      3             2      1            1       15.3        5  3.82       1
## 3      1             2      2            5       46.3        7 16.2        2
## 4      1             1      2            4       58.2        8 23.3        3
## 5      1             2      2            6       86.3        7 30.2        3
## 6      3             2      2            1       85.4        7 29.9        3
## # … with 8 more variables: cogs <dbl>, gross.income <dbl>, rating <dbl>,
## #   total <dbl>, weekday <int>, day <int>, month <int>, hour <int>
# Linear correlation model

Scores <- linear.correlation(rating~., df)

Scores
##               attr_importance
## branch            0.010238476
## customer.type     0.018888672
## gender            0.004800208
## product.line      0.020528973
## unit.price        0.008777507
## quantity          0.015814905
## tax               0.036441705
## payment           0.005381289
## cogs              0.036441705
## gross.income      0.036441705
## total             0.036441705
## weekday           0.032613920
## day               0.025270643
## month             0.046169543
## hour              0.030587644
# Find the top 8 representative variables (attribute importance)

subset <- cutoff.k(Scores, 8)
as.data.frame(subset)
##         subset
## 1        month
## 2          tax
## 3         cogs
## 4 gross.income
## 5        total
## 6      weekday
## 7         hour
## 8          day
# Using an entropy-based method (information gain) instead of the correlation coefficient

Scores2 <- information.gain(rating~.,df)

# Finding the top 8 representative variables (attribute importance)

subset2 <- cutoff.k(Scores2, 8)
as.data.frame(subset2)

7. Challenging the solution

Since no label was provided by the client, we used rating as the target variable for the final (feature-ranking) method.
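
An alternative worth checking is to rank the features against total, the sales figure named in the research question, rather than rating; a sketch using the same FSelector calls. Note that tax, cogs and gross.income are linear functions of total, so they would dominate such a ranking and might be dropped before interpreting it.

# rank features against total (sales) instead of rating

Scores_total <- linear.correlation(total~., df)
cutoff.k(Scores_total, 8)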

8. Follow up Questions

Do we have the right data?

The data was provided by the client.

Do we have the right question?

The research question was also provided by the client.

9. Conclusion

Using four methods, we were able to determine the most important attributes. The filter method reduces redundancy by dropping highly correlated variables (correlation > 0.7). The wrapper and embedded methods look for clusters based on the variables they consider most informative, while the final method ranks the features against a target variable using correlation and information gain.

10. Recommendations

Our findings suggest that customer type, gender, payment, branch and the columns containing date and time information carry the most important information in the data. The supermarket should plan its marketing strategies accordingly, using the findings from the exploratory data analysis step.