Identify the most relevant features that will result in the highest number of sales.
Successfully identify the features that would inform the marketing department's choice of the most relevant marketing strategies.
As a data analyst at Carrefour Kenya, I am currently undertaking a project to inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax). The project is divided into four parts, in which I explore a recent marketing dataset using various unsupervised learning techniques and then provide recommendations based on the insights.
# calling the libraries
library(tibble)
library(magrittr) # tibble does not export the %>% pipe used below, so load it explicitly
# loading the data
df <- read.csv('http://bit.ly/CarreFourDataset')
# convert dataframe to tibble
df <- as_tibble(df)
# checking the class of the dataset
df %>% class()
## [1] "tbl_df" "tbl" "data.frame"
# Determining the number of rows and columns in the dataset
cat('Number of rows:', nrow(df), '| Number of columns:', ncol(df))
## Number of rows: 1000 | Number of columns: 16
# previewing the top of the dataset
df %>% head()
## # A tibble: 6 × 16
## Invoice.ID Branch Customer.type Gender Product.line Unit.price Quantity Tax
## <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 750-67-8428 A Member Female Health and … 74.7 7 26.1
## 2 226-31-3081 C Normal Female Electronic … 15.3 5 3.82
## 3 631-41-3108 A Normal Male Home and li… 46.3 7 16.2
## 4 123-19-1176 A Member Male Health and … 58.2 8 23.3
## 5 373-73-7910 A Normal Male Sports and … 86.3 7 30.2
## 6 699-14-3026 C Normal Male Electronic … 85.4 7 29.9
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## # gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## # Total <dbl>
# previewing the bottom of the dataset
df %>% tail()
## # A tibble: 6 × 16
## Invoice.ID Branch Customer.type Gender Product.line Unit.price Quantity Tax
## <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 652-49-6720 C Member Female Electronic … 61.0 1 3.05
## 2 233-67-5758 C Normal Male Health and … 40.4 1 2.02
## 3 303-96-2227 B Normal Female Home and li… 97.4 10 48.7
## 4 727-02-1313 A Member Male Food and be… 31.8 1 1.59
## 5 347-56-2442 A Normal Male Home and li… 65.8 1 3.29
## 6 849-09-3807 A Member Female Fashion acc… 88.3 7 30.9
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## # gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## # Total <dbl>
# checking the structure of the dataset and datatype of each column
df %>% str()
## tibble [1,000 × 16] (S3: tbl_df/tbl/data.frame)
## $ Invoice.ID : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr [1:1000] "A" "C" "A" "A" ...
## $ Customer.type : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr [1:1000] "Female" "Female" "Male" "Male" ...
## $ Product.line : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr [1:1000] "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num [1:1000] 549 80.2 340.5 489 634.4 ...
All the columns have correct data types except the Date column, which is stored as a character string and should be converted to a date type.
# Converting the date column to date format
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
df$Date <- mdy(df$Date)
# converting the character columns to factors
df <- as.data.frame(unclass(df),
stringsAsFactors = TRUE)
# checking if the changes are implemented
df %>% str()
## 'data.frame': 1000 obs. of 16 variables:
## $ Invoice.ID : Factor w/ 1000 levels "101-17-6199",..: 815 143 654 19 340 734 316 265 703 727 ...
## $ Branch : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
## $ Customer.type : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
## $ Product.line : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : Date, format: "2019-01-05" "2019-03-08" ...
## $ Time : Factor w/ 506 levels "10:00","10:01",..: 147 24 156 486 30 394 215 78 342 160 ...
## $ Payment : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
# Checking for value validity
df %>% summary()
## Invoice.ID Branch Customer.type Gender
## 101-17-6199: 1 A:340 Member:501 Female:501
## 101-81-4070: 1 B:332 Normal:499 Male :499
## 102-06-2002: 1 C:328
## 102-77-2261: 1
## 105-10-6182: 1
## 105-31-1824: 1
## (Other) :994
## Product.line Unit.price Quantity Tax
## Electronic accessories:170 Min. :10.08 Min. : 1.00 Min. : 0.5085
## Fashion accessories :178 1st Qu.:32.88 1st Qu.: 3.00 1st Qu.: 5.9249
## Food and beverages :174 Median :55.23 Median : 5.00 Median :12.0880
## Health and beauty :152 Mean :55.67 Mean : 5.51 Mean :15.3794
## Home and lifestyle :160 3rd Qu.:77.94 3rd Qu.: 8.00 3rd Qu.:22.4453
## Sports and travel :166 Max. :99.96 Max. :10.00 Max. :49.6500
##
## Date Time Payment cogs
## Min. :2019-01-01 14:42 : 7 Cash :344 Min. : 10.17
## 1st Qu.:2019-01-24 19:48 : 7 Credit card:311 1st Qu.:118.50
## Median :2019-02-13 17:38 : 6 Ewallet :345 Median :241.76
## Mean :2019-02-14 10:11 : 5 Mean :307.59
## 3rd Qu.:2019-03-08 11:40 : 5 3rd Qu.:448.90
## Max. :2019-03-30 11:51 : 5 Max. :993.00
## (Other):965
## gross.margin.percentage gross.income Rating Total
## Min. :4.762 Min. : 0.5085 Min. : 4.000 Min. : 10.68
## 1st Qu.:4.762 1st Qu.: 5.9249 1st Qu.: 5.500 1st Qu.: 124.42
## Median :4.762 Median :12.0880 Median : 7.000 Median : 253.85
## Mean :4.762 Mean :15.3794 Mean : 6.973 Mean : 322.97
## 3rd Qu.:4.762 3rd Qu.:22.4453 3rd Qu.: 8.500 3rd Qu.: 471.35
## Max. :4.762 Max. :49.6500 Max. :10.000 Max. :1042.65
##
No numeric column has an entry below 0, so all the records look valid.
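As a quick programmatic version of that check (a minimal sketch using base R), I can pull the minimum of every numeric column directly; per the summary above, the smallest value is 0.5085, for tax and gross income.
# minimum of every numeric column; values below 0 would flag invalid records
sapply(Filter(is.numeric, df), min)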
# Checking for missing values
sum(is.na(df))
## [1] 0
There are no missing values.
# Check for duplicates
sum(duplicated(df))
## [1] 0
There are no duplicates.
# Checking uniformity in column names
colnames(df)
## [1] "Invoice.ID" "Branch"
## [3] "Customer.type" "Gender"
## [5] "Product.line" "Unit.price"
## [7] "Quantity" "Tax"
## [9] "Date" "Time"
## [11] "Payment" "cogs"
## [13] "gross.margin.percentage" "gross.income"
## [15] "Rating" "Total"
The column names mix sentence case and lower case. For uniformity, I'll convert them all to lower case.
names(df) <- tolower(names(df))
# viewing the results
names(df)
## [1] "invoice.id" "branch"
## [3] "customer.type" "gender"
## [5] "product.line" "unit.price"
## [7] "quantity" "tax"
## [9] "date" "time"
## [11] "payment" "cogs"
## [13] "gross.margin.percentage" "gross.income"
## [15] "rating" "total"
All the column names are now uniform.
# selecting only the numerical columns
num_cols <- df[, c(6:8,12:16)]
length(num_cols)
## [1] 8
# Box plots to visualize outliers (a 2 x 4 grid fits all 8 numeric columns)
par(mfrow = c(2, 4))
for (i in 1:length(num_cols)) {
  boxplot(num_cols[i], main = paste('Boxplot for', names(num_cols)[i]),
          ylab = 'Value')
}
Tax, cogs, gross.income and total are the only columns with outliers. These look like genuine high-value transactions rather than data errors, so I will not remove them.
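To back up that reading of the box plots, a small sketch that tallies the points falling outside the usual 1.5 * IQR whiskers in each column (a check only, not a removal step):
# count of 1.5 * IQR outliers in each numeric column
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(num_cols, count_outliers)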
# Splitting the date column into year, month and day
# Unique years
year <- format(df$date, format="%y")
unique(year)
## [1] "19"
The data contains records for a single year, 2019.
# Unique months
month <- format(df$date, format="%m")
# adding months column to the dataset
df$month <- month
sort(unique(month))
## [1] "01" "02" "03"
The data covers only the first three months of 2019.
# Day of the week
day <- wday(df$date)
# adding days column to the dataset
df$day <- day
# converting to factor
df$day <- as.factor(df$day)
sort(unique(day))
## [1] 1 2 3 4 5 6 7
# Now that the date column is split and the columns of interest added, the column can be dropped
df = subset(df, select = -c(date))
df %>% head(2)
## invoice.id branch customer.type gender product.line unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## quantity tax time payment cogs gross.margin.percentage gross.income
## 1 7 26.1415 13:08 Ewallet 522.83 4.761905 26.1415
## 2 5 3.8200 10:29 Cash 76.40 4.761905 3.8200
## rating total month day
## 1 9.1 548.9715 01 7
## 2 9.6 80.2200 03 6
# Statistical description of the variables
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ lubridate::intersect() masks base::intersect()
## ✖ dplyr::lag() masks stats::lag()
## ✖ lubridate::setdiff() masks base::setdiff()
## ✖ lubridate::union() masks base::union()
library(dplyr)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(num_cols)
## vars n mean sd median trimmed mad min
## unit.price 1 1000 55.67 26.49 55.23 55.62 33.37 10.08
## quantity 2 1000 5.51 2.92 5.00 5.51 2.97 1.00
## tax 3 1000 15.38 11.71 12.09 14.00 11.13 0.51
## cogs 4 1000 307.59 234.18 241.76 279.91 222.65 10.17
## gross.margin.percentage 5 1000 4.76 0.00 4.76 4.76 0.00 4.76
## gross.income 6 1000 15.38 11.71 12.09 14.00 11.13 0.51
## rating 7 1000 6.97 1.72 7.00 6.97 2.22 4.00
## total 8 1000 322.97 245.89 253.85 293.91 233.78 10.68
## max range skew kurtosis se
## unit.price 99.96 89.88 0.01 -1.22 0.84
## quantity 10.00 9.00 0.01 -1.22 0.09
## tax 49.65 49.14 0.89 -0.09 0.37
## cogs 993.00 982.83 0.89 -0.09 7.41
## gross.margin.percentage 4.76 0.00 NaN NaN 0.00
## gross.income 49.65 49.14 0.89 -0.09 0.37
## rating 10.00 6.00 0.01 -1.16 0.05
## total 1042.65 1031.97 0.89 -0.09 7.78
The table above gives measures of central tendency (mean, median), dispersion (standard deviation, MAD, minimum, maximum, range) and shape (skew, kurtosis) for each numeric variable.
Frequency distribution of the numeric columns
# function for plotting histograms
histogram <- function(column, title, xlab, ylim){
  hist(column, main = title, xlab = xlab, ylim = ylim,
       ylab = "Frequency", col = "darkred")
}
# histogram for unit price
histogram(df$unit.price, "Histogram for Unit Price", "Unit Price", c(0, 150))
Unit prices between 90 and 100 form the most frequent bin, but overall the distribution is fairly flat rather than normal.
# histogram for quantity
histogram(df$quantity, "Histogram for quantity", "Quantity", c(0, 250))
Quantities of 1 to 2 are the most common. The distribution is non-normal.
# histogram for tax
histogram(df$tax, "Histogram for Tax", "Tax", c(0, 250))
The distribution of the tax column is right-skewed; most transactions attracted a tax between 0 and 10.
# histogram for cogs
histogram(df$cogs, "Histogram for Cogs", "Cogs", c(0, 250))
The distribution of the cogs column is right-skewed; most transactions fall in the lowest cogs bins.
# histogram for gross income
histogram(df$gross.income, "Histogram for Gross Income", "Gross Income", c(0, 250))
The distribution of the gross income column is right-skewed; most transactions earned a gross income between 0 and 10.
# histogram for rating
histogram(df$rating, "Histogram for Rating", "Rating", c(0, 120))
Ratings between 4 and 4.5 form the most frequent bin, though ratings are spread fairly evenly between 4 and 10.
# histogram for total
histogram(df$total, "Histogram for Total", "Total", c(0, 250))
The most frequent bin of totals contains slightly more than 200 transactions; the distribution is right-skewed.
Bar plot representations for categorical columns
# barplot function
bar <- function(column, title, xlab){
  # create frequency table
  freq <- table(column)
  # sort frequency table in decreasing order
  sorted_freq <- freq[order(freq, decreasing = TRUE)]
  # widen the bottom margin so the rotated labels fit
  par(mar = c(7, 4, 2, 2) + 0.2)
  # plot a bar graph of (at most) the 10 values with the highest counts;
  # head() avoids NA bars when a column has fewer than 10 levels
  barplot(head(sorted_freq, 10), main = title, ylab = "Frequency",
          las = 2, col = "darkgreen")
  title(xlab = xlab, line = 5)
}
# plotting a bar graph for branch
bar(df$branch, "Bar plot for Branch", "Branches")
Branch A has the most transactions (340), followed closely by B (332) and C (328).
# plotting a bar graph for gender
bar(df$gender, "Bar plot for Gender", "Gender")
Males and females are almost equally represented (501 female, 499 male).
# plotting a bar graph for customer type
bar(df$customer.type, "Bar plot for Customer Type", "Customer Type")
Both customer types have almost equal representation (501 members, 499 normal).
# plotting a bar graph for product line
bar(df$product.line, "Bar plot for Product Line", "Product line")
The most bought product line is fashion accessories (178 transactions), while health and beauty is the least bought (152).
# plotting a bar graph for payment
bar(df$payment, "Bar plot for Payment", "Payment")
E-wallet (345) and cash (344) are the most used payment methods, with credit card (311) close behind.
# plotting a bar graph for month
bar(df$month, "Bar plot for Month", "Month")
January has the highest number of transactions (352), followed by March (345) and February (303).
# plotting a bar graph for day
bar(df$day, "Bar plot for Days", "Day")
Note: lubridate's wday() numbers days from Sunday by default (1 = Sunday, 7 = Saturday), not from Monday. Read with that mapping, the two tallest bars, days 7 and 3, correspond to Saturday and Tuesday rather than Sunday and Wednesday.
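To sidestep the numbering ambiguity altogether, the codes can be mapped to day names (a sketch; day_names follows lubridate's default Sunday-first order, and df is left untouched):
# label the numeric day codes (1 = Sunday under lubridate's default)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
table(day_names[as.integer(df$day)])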
library(ggplot2)
# create function to plot stacked bar chart
stacked <- function(column1, column2, title, value1, value2, legend){
  # create a dataframe of the two columns to plot
  data <- data.frame(attribute1 = column1, attribute2 = column2)
  # plot stacked bar graph
  ggplot(data = data) +
    geom_bar(aes(fill = attribute2, x = attribute1)) +
    ggtitle(title) + xlab(value1) + ylab(value2) + labs(fill = legend) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
          plot.title = element_text(hjust = 0.5))
}
# Branch vs Gender
stacked(df$branch, df$gender, "Branch against Gender", "Branch", "Gender", "Gender")
Branches B and C have more female customers than branch A, which has a roughly equal number of males and females.
# Product line vs Branch
stacked(df$product.line, df$branch, "Product line against Branch", "Product line", "Branch", "Branch")
# Branch vs Customer type
stacked(df$branch, df$customer.type, "Branch against Customer Type", "Branch", "Customer Type", "Customer Type")
# product line vs customer type
stacked(df$product.line, df$customer.type, "Product line against Customer Type", "Product line", "Customer Type", "Customer Type")
# product line vs gender
stacked(df$product.line, df$gender, "Product line against Gender", "Product line", "Gender", "Gender")
# payment vs customer type
stacked(df$payment, df$customer.type, "Payment against Customer Type", "Payment", "Customer Type", "Customer Type")
# Month against Product type
# create contingency table
month <- df$month
ptype <- df$product.line
mon <- table(month,ptype)
mon
## ptype
## month Electronic accessories Fashion accessories Food and beverages
## 01 54 64 56
## 02 54 60 62
## 03 62 54 56
## ptype
## month Health and beauty Home and lifestyle Sports and travel
## 01 49 59 70
## 02 46 38 43
## 03 57 63 53
In January, sports and travel products were the most bought, while health and beauty products were the least bought.
In February, food and beverages were the most bought, while home and lifestyle products were the least bought.
In March, home and lifestyle products were the most bought, while sports and travel products were the least bought.
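Row-wise percentages make these month-to-month comparisons easier to read than raw counts; a small optional sketch:
# share of each product line within each month, in percent
round(prop.table(mon, margin = 1) * 100, 1)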
# checking for covariance in the numerical columns
cov(num_cols, use="complete.obs")
## unit.price quantity tax cogs
## unit.price 701.9653313 0.83477848 196.6683401 3933.36680
## quantity 0.8347785 8.54644645 24.1495704 482.99141
## tax 196.6683401 24.14957038 137.0965941 2741.93188
## cogs 3933.3668019 482.99140761 2741.9318829 54838.63766
## gross.margin.percentage 0.0000000 0.00000000 0.0000000 0.00000
## gross.income 196.6683401 24.14957038 137.0965941 2741.93188
## rating -0.3996675 -0.07945646 -0.7333003 -14.66601
## total 4130.0351420 507.14097799 2879.0284770 57580.56954
## gross.margin.percentage gross.income rating
## unit.price 0 196.6683401 -0.39966752
## quantity 0 24.1495704 -0.07945646
## tax 0 137.0965941 -0.73330028
## cogs 0 2741.9318829 -14.66600553
## gross.margin.percentage 0 0.0000000 0.00000000
## gross.income 0 137.0965941 -0.73330028
## rating 0 -0.7333003 2.95351823
## total 0 2879.0284770 -15.39930581
## total
## unit.price 4130.03514
## quantity 507.14098
## tax 2879.02848
## cogs 57580.56954
## gross.margin.percentage 0.00000
## gross.income 2879.02848
## rating -15.39931
## total 60459.59802
Covariance measures how two random variables vary together: a large negative value indicates a negative relationship, a large positive value a positive relationship, and a value close to zero a weak one. Unlike correlation, covariance is scale-dependent, so its magnitude is not comparable across variable pairs.
From the results above, unit price and tax, unit price and cogs, cogs and tax, and gross income and total, among others, have positive covariance, indicating positive relationships.
Rating has a negative covariance with every other numerical variable, indicating a negative relationship.
Gross margin percentage is constant (zero variance), so its covariance with every variable is exactly zero and its correlation is undefined.
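Because covariance is scale-dependent, correlation (the covariance of the standardized variables) is the comparable quantity. A one-pair illustration (a sketch; both lines should return the same value):
# covariance after standardizing equals the correlation of the raw columns
cov(scale(df$total), scale(df$tax))
cor(df$total, df$tax)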
# correlation matrix to see the strengths of correlation between the variables
# import ggcorrplot package
library(ggcorrplot)
corr <- round(cor(num_cols, use='complete.obs'),1)
## Warning in cor(num_cols, use = "complete.obs"): the standard deviation is zero
ggcorrplot(corr, lab=TRUE, title='Correlation Heatmap', colors=c('#022D36', 'white', '#48AAAD'))
There is a strong positive correlation between total and tax, total and cogs, gross income and tax, gross income and cogs, tax and cogs, and total and gross income.
The remaining pairs are weakly or moderately correlated.
# Scatter plot for total and tax
plot(df$total, df$tax, xlab="total", ylab="tax")
There is a strong positive linear correlation between the variables
# Scatter plot for gross income and tax
plot(df$gross.income, df$tax, xlab="gross income", ylab="tax")
There is a strong positive linear correlation between the variables
# Scatter plot for total and cogs
plot(df$total, df$cogs, xlab="total", ylab="cogs")
There is a strong positive linear correlation between the variables
# Scatter plot for gross income and quantity
plot(df$gross.income, df$quantity, xlab="gross income", ylab="quantity")
There is a moderately positive correlation between the variables.
# Barplot function: bar height is the sum of column2 over all transactions in each category
barc <- function(column1, column2, xlabel, ylabel, title){
  ggplot(data = df, aes(x = column1, y = column2)) +
    stat_summary(fun = 'median') +
    geom_bar(stat = "identity", fill = "darkblue") +
    xlab(xlabel) + ylab(ylabel) + ggtitle(title) +
    theme(plot.title = element_text(hjust = 0.5))
}
# Product line vs Gross income
barc(df$product.line, df$gross.income, 'Product Line', 'Gross income', 'Product line vs Gross income')
## Warning: Removed 6 rows containing missing values (geom_segment).
Food and beverages brought in the highest total gross income.
# Product line vs Total
barc(df$product.line, df$total, 'Product Line', 'Total', 'Product line vs Total')
## Warning: Removed 6 rows containing missing values (geom_segment).
Food and beverages brought in the highest total sales.
# Product line vs Tax
barc(df$product.line, df$tax, 'Product Line', 'Tax', 'Product line vs Tax')
## Warning: Removed 6 rows containing missing values (geom_segment).
Food and beverages attracted the highest total tax, which follows directly from their sales, since the tax is a flat 5% of cogs.
# plot quantity vs. total (color represents product line)
ggplot(df, aes(x = quantity,
y = total,
color=product.line)) +
geom_point() +
labs(title = "Totals by product line and quantity")
As the quantity increases, so does the total, for every product line.
# plot unit price vs. tax (color represents product line)
ggplot(df, aes(x = unit.price,
y = tax,
color=product.line)) +
geom_point() +
labs(title = "Unit Price by tax and product line")
The tax for all the products goes higher as the unit price increases
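The relationship is exact here: consistent with the data preview, tax is 5% of unit price times quantity. A quick check (a sketch; it should return TRUE if the relationship holds for every row):
# tax = 5% of (unit price x quantity) in this dataset
all.equal(df$tax, 0.05 * df$unit.price * df$quantity)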
# plot rating vs. total (color represents product line)
ggplot(df, aes(x = rating,
y = total,
color=product.line)) +
geom_point() +
labs(title = "Totals by product line and rating")
Ratings show no clear pattern across product lines; most transactions, whatever their rating, have totals below 350.
# Invoice ID is an identifier, so it adds nothing to dimensionality reduction
# gross.margin.percentage is constant (zero variance), so it cannot be scaled
# the time column (506 distinct levels) is also dropped
df = subset(df, select = -c(invoice.id, gross.margin.percentage, time))
df %>% head(2)
## branch customer.type gender product.line unit.price quantity
## 1 A Member Female Health and beauty 74.69 7
## 2 C Normal Female Electronic accessories 15.28 5
## tax payment cogs gross.income rating total month day
## 1 26.1415 Ewallet 522.83 26.1415 9.1 548.9715 01 7
## 2 3.8200 Cash 76.40 3.8200 9.6 80.2200 03 6
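The next chunk integer-codes the factors, which is simple but imposes an artificial ordering on the levels (e.g. Cash < Credit card < Ewallet). A one-hot encoding via base R's model.matrix is a common alternative; a minimal sketch, kept separate from df and not used downstream:
# expand the factor columns into 0/1 indicator columns; "- 1" drops the intercept
df_onehot <- as.data.frame(model.matrix(~ . - 1, data = df))
dim(df_onehot)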
# To perform PCA, all the columns have to be numerical.
# Converting the categorical columns to numerical
df$branch = as.integer(df$branch)
df$customer.type = as.integer(df$customer.type)
df$gender = as.integer(df$gender)
df$product.line = as.integer(df$product.line)
df$payment = as.integer(df$payment)
df$month = as.integer(df$month)
df$day = as.integer(df$day)
# previewing the changes made
df %>% head(2)
## branch customer.type gender product.line unit.price quantity tax payment
## 1 1 1 1 4 74.69 7 26.1415 3
## 2 3 2 1 1 15.28 5 3.8200 1
## cogs gross.income rating total month day
## 1 522.83 26.1415 9.1 548.9715 1 7
## 2 76.40 3.8200 9.6 80.2200 3 6
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# running the principal component analysis (centered and scaled)
df.pca <- prcomp(df, center = TRUE, scale. = TRUE)
df.pca %>% summary()
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.2203 1.07087 1.04213 1.02714 1.01085 0.98201 0.97962
## Proportion of Variance 0.3521 0.08191 0.07757 0.07536 0.07299 0.06888 0.06855
## Cumulative Proportion 0.3521 0.43404 0.51162 0.58698 0.65996 0.72884 0.79739
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.97047 0.95235 0.94761 0.29963 2.954e-16 2.043e-16
## Proportion of Variance 0.06727 0.06478 0.06414 0.00641 0.000e+00 0.000e+00
## Cumulative Proportion 0.86466 0.92945 0.99359 1.00000 1.000e+00 1.000e+00
## PC14
## Standard deviation 1.451e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
14 principal components are obtained, each explaining a share of the total variance: PC1 explains 35.21% and PC2 explains 8.19%. The last three components have essentially zero variance because tax, gross income, cogs and total are exact linear combinations of one another (gross income equals tax, tax is 5% of cogs, and total is cogs plus tax).
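Those dependencies can be confirmed directly (each check should return TRUE):
# exact linear relationships behind the zero-variance components
all.equal(df$gross.income, df$tax)     # gross income equals tax
all.equal(df$tax, 0.05 * df$cogs)      # tax is 5% of cogs
all.equal(df$total, df$cogs + df$tax)  # total is cogs plus tax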
# checking the outputs available from the PCA results
var=get_pca_var(df.pca)
var
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
#getting the Eigenvalues
eig.val=get_eigenvalue(df.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.929865e+00 3.521332e+01 35.21332
## Dim.2 1.146761e+00 8.191147e+00 43.40447
## Dim.3 1.086027e+00 7.757339e+00 51.16181
## Dim.4 1.055012e+00 7.535799e+00 58.69761
## Dim.5 1.021816e+00 7.298687e+00 65.99629
## Dim.6 9.643398e-01 6.888141e+00 72.88444
## Dim.7 9.596465e-01 6.854618e+00 79.73905
## Dim.8 9.418214e-01 6.727296e+00 86.46635
## Dim.9 9.069626e-01 6.478304e+00 92.94465
## Dim.10 8.979714e-01 6.414082e+00 99.35873
## Dim.11 8.977715e-02 6.412654e-01 100.00000
## Dim.12 8.728133e-32 6.234380e-31 100.00000
## Dim.13 4.175243e-32 2.982316e-31 100.00000
## Dim.14 2.106113e-32 1.504366e-31 100.00000
The first 10 principal components account for 99.36% of the total variance.
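A compact way to choose a component count from this table (a sketch; the 90% threshold is illustrative, not a rule):
# smallest number of components whose cumulative variance reaches 90%
min(which(eig.val[, "cumulative.variance.percent"] >= 90))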
#Plotting the PC using a scree plot
fviz_eig(df.pca, addlabels = T, ylim = c(0, 50))
# looking at the pca objects
str(df.pca)
## List of 5
## $ sdev : num [1:14] 2.22 1.07 1.04 1.03 1.01 ...
## $ rotation: num [1:14, 1:14] 0.0226 -0.0125 -0.0282 0.0174 0.2912 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:14] "branch" "customer.type" "gender" "product.line" ...
## .. ..$ : chr [1:14] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:14] 1.99 1.5 1.5 3.45 55.67 ...
## ..- attr(*, "names")= chr [1:14] "branch" "customer.type" "gender" "product.line" ...
## $ scale : Named num [1:14] 0.818 0.5 0.5 1.715 26.495 ...
## ..- attr(*, "names")= chr [1:14] "branch" "customer.type" "gender" "product.line" ...
## $ x : num [1:1000, 1:14] 2.041 -2.284 0.104 1.459 2.753 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:1000] "1" "2" "3" "4" ...
## .. ..$ : chr [1:14] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
#Using visualization to check the contribution of each feature.
fviz_pca_var(df.pca, col.var = "blue")
#Plotting the contribution of attributes in each PC
fviz_contrib(df.pca,choice = "var", axes = 1)
fviz_contrib(df.pca, choice = "var", axes = 2)
fviz_contrib(df.pca, choice = "var", axes = 3)
fviz_contrib(df.pca, choice = "var", axes = 4)
fviz_contrib(df.pca,choice = "var", axes = 5)
fviz_contrib(df.pca, choice = "var", axes = 6)
fviz_contrib(df.pca, choice = "var", axes = 7)
fviz_contrib(df.pca, choice = "var", axes = 8)
fviz_contrib(df.pca,choice = "var", axes = 9)
fviz_contrib(df.pca, choice = "var", axes = 10)
# using the ewkm function from the wskm package
library(wskm)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked _by_ '.GlobalEnv':
##
## histogram
## Loading required package: latticeExtra
##
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
##
## layer
## Loading required package: fpc
set.seed(2345) # setting seed for reproducibility
model <- ewkm(df, 3, lambda=0.5, maxiter=100)
# loading the cluster package
library("cluster")
# plotting the clusters of the first 2 principal components
clusplot(df, model$cluster, color=TRUE, shade=TRUE,
labels=2, lines=1,main='Cluster Analysis for Carrefour')
# Calculating the weights for each variable and cluster
# Weights measure the relative importance of each variable with regard to the membership of the observations in that cluster.
round(model$weights*100,2)
## branch customer.type gender product.line unit.price quantity tax payment cogs
## 1 0 99.99 0 0 0 0 0 0 0
## 2 0 50.00 50 0 0 0 0 0 0
## 3 0 99.99 0 0 0 0 0 0 0
## gross.income rating total month day
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
Customer type and gender receive nearly all the weight, making them the most important features according to this method.
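One caveat: ewkm was run on the unscaled data, where the monetary columns have far larger ranges than the integer-coded categoricals. A variant worth trying (a sketch, not a definitive re-run) standardizes the columns first so that scale does not drive the weights:
# re-cluster on standardized features and compare the resulting weights
set.seed(2345)
model_scaled <- ewkm(scale(df), 3, lambda = 0.5, maxiter = 100)
round(model_scaled$weights * 100, 2)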
# Calling the libraries
library(clustvarsel)
## Loading required package: mclust
## Package 'mclust' version 5.4.10
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:psych':
##
## sim
## The following object is masked from 'package:purrr':
##
## map
## Package 'clustvarsel' version 2.3.4
## Type 'citation("clustvarsel")' for citing this R package in publications.
library(mclust)
# Sequential forward greedy search (default)
out = clustvarsel(df, G = 1:14)
out
## ------------------------------------------------------
## Variable selection for Gaussian model-based clustering
## Stepwise (forward/backward) greedy search
## ------------------------------------------------------
##
## Variable proposed Type of step BICclust Model G BICdiff Decision
## quantity Add -4192.1562 E 9 804.0515 Accepted
## day Add 460.6869 VEV 9 8894.5229 Accepted
## gender Add -8556.5739 VEV 10 -7551.8666 Rejected
## day Remove -4192.1562 E 9 8894.5229 Rejected
##
## Selected subset: quantity, day
# The selection algorithm indicates that the clustering subset consists of the
# variables quantity and day; the other candidate variables are rejected.
# Having identified the variables to use, we proceed to build the clustering model:
Subset1 = df[,out$subset]
mod = Mclust(Subset1, G = 1:14)
summary(mod)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEV (ellipsoidal, equal volume and shape) model with 10 components:
##
## log-likelihood n df BIC ICL
## -3552.247 1000 41 -7387.713 -7503.809
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9 10
## 126 90 77 98 99 93 119 93 93 112
# plotting the clusters
plot(mod,c("classification"))
Day and quantity are the most important features according to this method.
# calling the library
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(corrplot)
## corrplot 0.92 loaded
# Calculating the correlation matrix
correlationMatrix <- cor(df)
# Find attributes that are highly correlated
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)
# Highly correlated attributes
highlyCorrelated
## [1] 7 9 10
names(df[,highlyCorrelated])
## [1] "tax" "cogs" "gross.income"
# Removing Redundant Features
df1 <- df[-highlyCorrelated]
# Graphical comparison of the correlation matrices before and after removal
par(mfrow = c(1, 2))
corrplot(correlationMatrix, order = "hclust")
corrplot(cor(df1), order = "hclust")
Using the filter method, tax, cogs and gross income are flagged as redundant: each is an exact linear function of the other monetary columns, so dropping them loses no information.
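As a sanity check (a small sketch), the strongest remaining pairwise correlation in the reduced set should now fall below the 0.75 cutoff:
# largest absolute pairwise correlation left after dropping the flagged columns
cm1 <- cor(df1)
max(abs(cm1[upper.tri(cm1)]))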
Branch A is the most visited, with 340 transactions.
The most frequent transaction times in the stores are between 12:30 pm and 1:52 pm.
Most products bought have a unit price between 90 and 100.
The most bought product line is fashion accessories, while the least bought is health and beauty.
Most customers visit the stores on Saturdays (day 7; note that lubridate numbers days from Sunday = 1).
Fashion accessories are the best-represented product line, and they are bought mostly by women.
Food and beverages account for the highest total sales and tax.
In January, sports and travel products were the most bought, while health and beauty products were the least bought.
In February, food and beverages were the most bought, while home and lifestyle products were the least bought.
In March, home and lifestyle products were the most bought, while sports and travel products were the least bought.
The feature selection methods give the following variables as the most important: payment, gender, month, day, quantity, total, branch, unit price, customer type, rating and product line. A detailed analysis of these features appears above in the EDA section.