Research Question
You are a Data analyst at Carrefour Kenya and are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax). Your project has been divided into four parts where you’ll explore a recent marketing dataset by performing various unsupervised learning techniques and later providing recommendations based on your insights.
Determine which features contribute the most information to the dataset.
Perform feature selection on the dataset using the 4 methods: i.) Filter method ii.) Wrapper methods iii.) Embedded methods iv.) Feature Ranking
The dataset provided contains 18 attributes of customers who visit the three branches of the supermarket. Out of these, we need to select the attributes that give out the most information about the dataset.
Reading the data
Checking the data - data understanding
Implementing the solution
Challenge the solution
Follow up Questions
Conclusion.
Recommendations.
The dataset contains several customer attributes. Out of these, only a handful carry the most information of the dataset.
# import Tibble package
library(tibble)
# url <- http://bit.ly/CarreFourDataset
# load the dataset as dataframe
df <- read.csv('http://bit.ly/CarreFourDataset')
# convert dataframe to Tibble
df <- as_tibble(df)
#check data structure of our dataset
class(df)
## [1] "tbl_df" "tbl" "data.frame"
# Previewing our first 5 records
#
head(df)
## # A tibble: 6 × 16
## Invoice.ID Branch Customer.type Gender Product.line Unit.price Quantity Tax
## <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 750-67-8428 A Member Female Health and … 74.7 7 26.1
## 2 226-31-3081 C Normal Female Electronic … 15.3 5 3.82
## 3 631-41-3108 A Normal Male Home and li… 46.3 7 16.2
## 4 123-19-1176 A Member Male Health and … 58.2 8 23.3
## 5 373-73-7910 A Normal Male Sports and … 86.3 7 30.2
## 6 699-14-3026 C Normal Male Electronic … 85.4 7 29.9
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## # gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## # Total <dbl>
# Previewing our last 5 records
#
tail(df)
## # A tibble: 6 × 16
## Invoice.ID Branch Customer.type Gender Product.line Unit.price Quantity Tax
## <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 652-49-6720 C Member Female Electronic … 61.0 1 3.05
## 2 233-67-5758 C Normal Male Health and … 40.4 1 2.02
## 3 303-96-2227 B Normal Female Home and li… 97.4 10 48.7
## 4 727-02-1313 A Member Male Food and be… 31.8 1 1.59
## 5 347-56-2442 A Normal Male Home and li… 65.8 1 3.29
## 6 849-09-3807 A Member Female Fashion acc… 88.3 7 30.9
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## # gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## # Total <dbl>
# check shape of the dataset
dim(df)
## [1] 1000 16
Our dataset contains 1000 records and 16 columns
# check column datatypes
sapply(df, class)
## Invoice.ID Branch Customer.type
## "character" "character" "character"
## Gender Product.line Unit.price
## "character" "character" "numeric"
## Quantity Tax Date
## "integer" "numeric" "character"
## Time Payment cogs
## "character" "character" "numeric"
## gross.margin.percentage gross.income Rating
## "numeric" "numeric" "numeric"
## Total
## "numeric"
# inspect variable classes
str(df)
## tibble [1,000 × 16] (S3: tbl_df/tbl/data.frame)
## $ Invoice.ID : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr [1:1000] "A" "C" "A" "A" ...
## $ Customer.type : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr [1:1000] "Female" "Female" "Male" "Male" ...
## $ Product.line : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr [1:1000] "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num [1:1000] 549 80.2 340.5 489 634.4 ...
We need to convert time into time datatype and categorical columns to factor datatype.
# convert date and time to standard date and time format
df$Date <- as.POSIXct(df$Date, format = "%m/%d/%Y")
df$Date <- as.Date(df$Date)
df$Time <- as.POSIXct(df$Time, format = "%H:%M")
df$Time <- format(df$Time,"%H:%M")
# convert logical values to factor
logical.to.factor <- function(column){
as.factor(column)
}
# logical columns
df$Gender <- logical.to.factor(df$Gender)
df$Branch <- logical.to.factor(df$Branch)
df$Customer.type <- logical.to.factor(df$Customer.type)
df$Product.line <- logical.to.factor(df$Product.line)
df$Payment <- logical.to.factor(df$Payment)
# preview dataset
head(df)
## # A tibble: 6 × 16
## Invoice.ID Branch Customer.type Gender Product.line Unit.price Quantity Tax
## <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
## 1 750-67-8428 A Member Female Health and … 74.7 7 26.1
## 2 226-31-3081 C Normal Female Electronic … 15.3 5 3.82
## 3 631-41-3108 A Normal Male Home and li… 46.3 7 16.2
## 4 123-19-1176 A Member Male Health and … 58.2 8 23.3
## 5 373-73-7910 A Normal Male Sports and … 86.3 7 30.2
## 6 699-14-3026 C Normal Male Electronic … 85.4 7 29.9
## # … with 8 more variables: Date <date>, Time <chr>, Payment <fct>, cogs <dbl>,
## # gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## # Total <dbl>
Dataset is provided by the client.
# select only numerical variables
library("plyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data_num <- select_if(df, is.numeric)
head(data_num)
## # A tibble: 6 × 8
## Unit.price Quantity Tax cogs gross.margin.percen… gross.income Rating Total
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 74.7 7 26.1 523. 4.76 26.1 9.1 549.
## 2 15.3 5 3.82 76.4 4.76 3.82 9.6 80.2
## 3 46.3 7 16.2 324. 4.76 16.2 7.4 341.
## 4 58.2 8 23.3 466. 4.76 23.3 8.4 489.
## 5 86.3 7 30.2 604. 4.76 30.2 5.3 634.
## 6 85.4 7 29.9 598. 4.76 29.9 4.1 628.
# Check for outliers
# import ggplot2 for plotting
library(ggplot2)
# vectorize column names
columns <- c(names(data_num))
# plot boxplot
boxplots <- function(column_name){
column <- unlist(data_num[,column_name])
p1 <- ggplot(data=data_num, mapping=aes(column))+geom_boxplot() + xlab(names(data_num[,column_name]))
print(p1)
}
# create 8 subplots
num <- c()
for (x in 1:length(columns)){
plot <- boxplots(x)
}
Outliers exist in Tax, cogs, gross.income and Total columns.
# check for missing values
colSums(is.na(df))
## Invoice.ID Branch Customer.type
## 0 0 0
## Gender Product.line Unit.price
## 0 0 0
## Quantity Tax Date
## 0 0 0
## Time Payment cogs
## 0 0 0
## gross.margin.percentage gross.income Rating
## 0 0 0
## Total
## 0
Our dataset has no missing values.
# check for duplicates
anyDuplicated(df)
## [1] 0
Our dataset has no duplicated rows.
# convert column names to lowercase
names(df) <- tolower(names(df))
# check relevance of columns
unique(df$gross.margin.percentage)
## [1] 4.761905
# drop column since it is a constant
df <- df[,-13]
data_num <- data_num[,-5]
The gross margin percentage is a constant equal to 4.761905.
A comprehensive exploratory data analysis has been done in the pca exercise.
# Convert factors to numeric
fact_to_num <- function(column){
as.integer(column)
}
df$branch <- fact_to_num(df$branch)
df$product.line <- fact_to_num(df$product.line)
df$customer.type <- fact_to_num(df$customer.type)
df$gender <- fact_to_num(df$gender)
df$payment <- fact_to_num(df$payment)
# extract month, day of the week and hour
# import lubridate package for date and time manipulation
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# weekday
df$weekday <- as.integer(wday(df$date, week_start=1)) # Monday as first day
# day of month
df$day <- as.integer(format(as.Date(df$date), "%d"))
# Month
df$month <- as.integer(format(as.Date(df$date), "%m"))
# Hour
df$hour <- as.integer(format(strptime(df$time,"%H:%M"),'%H'))
# drop unnecessary columns
# drop invoice id, date and time
df <- select(df,-c(invoice.id, date, time))
# preview dataset
head(df)
## # A tibble: 6 × 16
## branch customer.type gender product.line unit.price quantity tax payment
## <int> <int> <int> <int> <dbl> <int> <dbl> <int>
## 1 1 1 1 4 74.7 7 26.1 3
## 2 3 2 1 1 15.3 5 3.82 1
## 3 1 2 2 5 46.3 7 16.2 2
## 4 1 1 2 4 58.2 8 23.3 3
## 5 1 2 2 6 86.3 7 30.2 3
## 6 3 2 2 1 85.4 7 29.9 3
## # … with 8 more variables: cogs <dbl>, gross.income <dbl>, rating <dbl>,
## # total <dbl>, weekday <int>, day <int>, month <int>, hour <int>
# loading caret package for findCorrelation function
library(caret)
## Loading required package: lattice
# load corrplot package to plot correlation
library(corrplot)
## corrplot 0.92 loaded
# correlation matrix
# ---
#
corr <- cor(df)
corr
## branch customer.type gender product.line unit.price
## branch 1.00000000 -0.019607869 -0.056317558 -0.053937557 0.028202440
## customer.type -0.01960787 1.000000000 0.039996160 -0.036800311 -0.020237875
## gender -0.05631756 0.039996160 1.000000000 0.005193197 0.015444630
## product.line -0.05393756 -0.036800311 0.005193197 1.000000000 0.019321028
## unit.price 0.02820244 -0.020237875 0.015444630 0.019321028 1.000000000
## quantity 0.01596379 -0.016762706 -0.074258307 0.020256001 0.010777564
## tax 0.04104666 -0.019670283 -0.049450989 0.031620725 0.633962089
## payment -0.05010429 0.018073436 0.044577609 0.029896383 -0.015941048
## cogs 0.04104666 -0.019670283 -0.049450989 0.031620725 0.633962089
## gross.income 0.04104666 -0.019670283 -0.049450989 0.031620725 0.633962089
## rating 0.01023848 0.018888672 0.004800208 -0.020528973 -0.008777507
## total 0.04104666 -0.019670283 -0.049450989 0.031620725 0.633962089
## weekday -0.03950365 0.048747377 -0.019730770 0.017043488 0.027211544
## day -0.01990763 0.026396585 0.035570507 -0.056815945 0.051708609
## month -0.02926049 0.005279579 -0.015200915 -0.001394159 -0.018218384
## hour 0.03300711 -0.018893298 0.084081139 -0.060849916 0.008242210
## quantity tax payment cogs gross.income
## branch 0.015963788 0.041046665 -0.050104288 0.041046665 0.041046665
## customer.type -0.016762706 -0.019670283 0.018073436 -0.019670283 -0.019670283
## gender -0.074258307 -0.049450989 0.044577609 -0.049450989 -0.049450989
## product.line 0.020256001 0.031620725 0.029896383 0.031620725 0.031620725
## unit.price 0.010777564 0.633962089 -0.015941048 0.633962089 0.633962089
## quantity 1.000000000 0.705510186 -0.003920990 0.705510186 0.705510186
## tax 0.705510186 1.000000000 -0.012433637 1.000000000 1.000000000
## payment -0.003920990 -0.012433637 1.000000000 -0.012433637 -0.012433637
## cogs 0.705510186 1.000000000 -0.012433637 1.000000000 1.000000000
## gross.income 0.705510186 1.000000000 -0.012433637 1.000000000 1.000000000
## rating -0.015814905 -0.036441705 -0.005381289 -0.036441705 -0.036441705
## total 0.705510186 1.000000000 -0.012433637 1.000000000 1.000000000
## weekday -0.006818718 0.003096704 -0.035126428 0.003096704 0.003096704
## day -0.027744137 0.009624700 -0.006506096 0.009624700 0.009624700
## month 0.032895342 0.015114331 -0.023007131 0.015114331 0.015114331
## hour -0.007316886 -0.002770440 0.045420537 -0.002770440 -0.002770440
## rating total weekday day month
## branch 0.010238476 0.041046665 -0.039503650 -0.019907626 -0.029260490
## customer.type 0.018888672 -0.019670283 0.048747377 0.026396585 0.005279579
## gender 0.004800208 -0.049450989 -0.019730770 0.035570507 -0.015200915
## product.line -0.020528973 0.031620725 0.017043488 -0.056815945 -0.001394159
## unit.price -0.008777507 0.633962089 0.027211544 0.051708609 -0.018218384
## quantity -0.015814905 0.705510186 -0.006818718 -0.027744137 0.032895342
## tax -0.036441705 1.000000000 0.003096704 0.009624700 0.015114331
## payment -0.005381289 -0.012433637 -0.035126428 -0.006506096 -0.023007131
## cogs -0.036441705 1.000000000 0.003096704 0.009624700 0.015114331
## gross.income -0.036441705 1.000000000 0.003096704 0.009624700 0.015114331
## rating 1.000000000 -0.036441705 0.032613920 -0.025270643 -0.046169543
## total -0.036441705 1.000000000 0.003096704 0.009624700 0.015114331
## weekday 0.032613920 0.003096704 1.000000000 -0.095228079 -0.131061937
## day -0.025270643 0.009624700 -0.095228079 1.000000000 0.077649975
## month -0.046169543 0.015114331 -0.131061937 0.077649975 1.000000000
## hour -0.030587644 -0.002770440 0.014484537 0.021185360 0.018556725
## hour
## branch 0.033007115
## customer.type -0.018893298
## gender 0.084081139
## product.line -0.060849916
## unit.price 0.008242210
## quantity -0.007316886
## tax -0.002770440
## payment 0.045420537
## cogs -0.002770440
## gross.income -0.002770440
## rating -0.030587644
## total -0.002770440
## weekday 0.014484537
## day 0.021185360
## month 0.018556725
## hour 1.000000000
# Find attributes that are highly correlated(min correlation=0.75)
highly_correlated <- findCorrelation(corr, cutoff=0.7)
# display index of columns
highly_correlated
## [1] 9 12 7 10
# display column names of selected highly correlated variables
names(df[,highly_correlated])
## [1] "cogs" "total" "tax" "gross.income"
# create dataset without highly correlated variables
df2 <- df[-c(highly_correlated)]
# graphical comparison plot between dataset with and without highly correlated variables.
# create subplot (1row,2columns)
par(mfrow = c(1, 2))
# whole dataset
corrplot(corr, order = "hclust")
# dataset without highly correlated variables
corrplot(cor(df2), order = "hclust")
The features that are most important in this method are those that remain after dropping highly correlated variables.
# load clustvarsel package for model-based clustering to find the optimal subset of variables in our dataset that capture the most information.
library(clustvarsel)
## Loading required package: mclust
## Package 'mclust' version 5.4.10
## Type 'citation("mclust")' for citing this R package in publications.
## Package 'clustvarsel' version 2.3.4
## Type 'citation("clustvarsel")' for citing this R package in publications.
# load mclust package for clustering
library(mclust)
# using default method = sequential forward greedy search. Headlong search is potentially quicker but less optimal
# G - numbers of mixture components (clusters) for which the BIC is to be calculated
out <- clustvarsel(df, G = 1:5)
out
## ------------------------------------------------------
## Variable selection for Gaussian model-based clustering
## Stepwise (forward/backward) greedy search
## ------------------------------------------------------
##
## Variable proposed Type of step BICclust Model G BICdiff Decision
## month Add -2398.034 E 3 1079.12536 Accepted
## gender Add -3529.743 EII 5 333.68515 Accepted
## payment Add -5925.380 VEV 4 83.54031 Accepted
## month Remove -2427.732 EII 5 -20.48804 Accepted
## month Add -5925.380 VEV 4 -20.48804 Rejected
## gender Remove -2278.203 E 2 1315.86502 Rejected
##
## Selected subset: gender, payment
The model selects gender and payment as the most important.
# subsets of our model
out$subset
## gender payment
## 3 8
# Having identified the variables that we use, we proceed to build the clustering model:
Subset <- df[,out$subset]
model <- Mclust(Subset, G = 1:5)
summary(model)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EII (spherical, equal volume) model with 5 components:
##
## log-likelihood n df BIC ICL
## -1168.771 1000 15 -2441.159 -2441.207
##
## Clustering table:
## 1 2 3 4 5
## 160 178 314 185 163
The model distinguishes 5 clusters based on these features.
# Plotting clusters
plot(model,c("classification"))
This is a weighted subspace clustering algorithm that is well suited to very high dimensional data.
# loading wskm package for ewkm function
library(wskm)
## Loading required package: latticeExtra
##
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
##
## layer
## Loading required package: fpc
# loading cluster package to plot clusters
library(cluster)
# building our model
set.seed(1234) # prevents randomness of sample for the model
model <- ewkm(df, 3, lambda=2, maxiter=1000)
# Cluster Plot against 1st 2 principal components
clusplot(df, model$cluster, color=TRUE, shade=TRUE,
labels=2, lines=1,main='Cluster Analysis')
Fifth pc - Unit price, day and payment
# calculating weights of each variable
round(model$weights*100,2)
## branch customer.type gender product.line unit.price quantity tax payment cogs
## 1 0 50.00 50.00 0 0 0 0 0 0
## 2 0 43.39 56.60 0 0 0 0 0 0
## 3 0 45.77 54.22 0 0 0 0 0 0
## gross.income rating total weekday day month hour
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
The most important variables according tothis method is customer type and gender.
# Loading FSelector package responsible for selecting attributes.
library(FSelector)
head(df)
## # A tibble: 6 × 16
## branch customer.type gender product.line unit.price quantity tax payment
## <int> <int> <int> <int> <dbl> <int> <dbl> <int>
## 1 1 1 1 4 74.7 7 26.1 3
## 2 3 2 1 1 15.3 5 3.82 1
## 3 1 2 2 5 46.3 7 16.2 2
## 4 1 1 2 4 58.2 8 23.3 3
## 5 1 2 2 6 86.3 7 30.2 3
## 6 3 2 2 1 85.4 7 29.9 3
## # … with 8 more variables: cogs <dbl>, gross.income <dbl>, rating <dbl>,
## # total <dbl>, weekday <int>, day <int>, month <int>, hour <int>
# Linear correlation model
Scores <- linear.correlation(rating~., df)
Scores
## attr_importance
## branch 0.010238476
## customer.type 0.018888672
## gender 0.004800208
## product.line 0.020528973
## unit.price 0.008777507
## quantity 0.015814905
## tax 0.036441705
## payment 0.005381289
## cogs 0.036441705
## gross.income 0.036441705
## total 0.036441705
## weekday 0.032613920
## day 0.025270643
## month 0.046169543
## hour 0.030587644
# Let's find top 8 representative variables(attribute importance)
subset <- cutoff.k(Scores, 8)
as.data.frame(subset)
## subset
## 1 month
## 2 tax
## 3 cogs
## 4 gross.income
## 5 total
## 6 weekday
## 7 hour
## 8 day
# Using entropy - based method instead of correlation coefficient;
Scores2 <- information.gain(rating~.,df)
# Finding top 8 representative variables(attribute importance)
subset <- cutoff.k(Scores, 8)
as.data.frame(subset)
## subset
## 1 month
## 2 tax
## 3 cogs
## 4 gross.income
## 5 total
## 6 weekday
## 7 hour
## 8 day
Since no label was provided by the client, we decided to use ranking as our label for the final method.
The data was provided by the client.
The research question was also provided by the client.
Using four methods, we are able to determine the most important attributes. The filter method helps to reduce redundancy by filtering out highly correlated variables(cor >0.7). The next two methods try to find clusters based on variables they consider important while the final model ranks the features using a correlation model.
Our findings suggest that: customer type, gender, payment, branch and columns containing date and time information carry the most important information concerning our data. The Supermarket should plan accordingly using the findings provided in the Exploratory Analysis Step.