Research Question
You are a Data analyst at Carrefour Kenya and are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax). Your project has been divided into four parts where you’ll explore a recent marketing dataset by performing various unsupervised learning techniques and later providing recommendations based on your insights.
Perform dimensionality reduction and feature selection on the dataset for feature extraction on the dataset provided.
Determining which features contributing most to variability in the dataset hence useful for modelling.
The dataset provided contains attributes of customers who visit the three branches of the supermarket.
Reading the data
Checking the data - data understanding
Implementing the solution
Challenge the solution
Follow up Questions
Conclusion.
Recommendations.
Through EDA and PCA, we are able to determine which features contribute to the most variability in our dataset and which features can be used in our modelling phase.
# import Tibble package
library(tibble)
# url <- http://bit.ly/CarreFourDataset
# load the dataset as dataframe
df <- read.csv('http://bit.ly/CarreFourDataset')
# convert dataframe to Tibble
df <- as_tibble(df)
#check data structure of our dataset
class(df)
## [1] "tbl_df" "tbl" "data.frame"
# Previewing our first 5 records
#
head(df)
## # A tibble: 6 × 16
## Invoice.ID Branch Customer.type Gender Product.line Unit.price Quantity Tax
## <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 750-67-8428 A Member Female Health and … 74.7 7 26.1
## 2 226-31-3081 C Normal Female Electronic … 15.3 5 3.82
## 3 631-41-3108 A Normal Male Home and li… 46.3 7 16.2
## 4 123-19-1176 A Member Male Health and … 58.2 8 23.3
## 5 373-73-7910 A Normal Male Sports and … 86.3 7 30.2
## 6 699-14-3026 C Normal Male Electronic … 85.4 7 29.9
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## # gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## # Total <dbl>
# Previewing our last 5 records
#
tail(df)
## # A tibble: 6 × 16
## Invoice.ID Branch Customer.type Gender Product.line Unit.price Quantity Tax
## <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 652-49-6720 C Member Female Electronic … 61.0 1 3.05
## 2 233-67-5758 C Normal Male Health and … 40.4 1 2.02
## 3 303-96-2227 B Normal Female Home and li… 97.4 10 48.7
## 4 727-02-1313 A Member Male Food and be… 31.8 1 1.59
## 5 347-56-2442 A Normal Male Home and li… 65.8 1 3.29
## 6 849-09-3807 A Member Female Fashion acc… 88.3 7 30.9
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## # gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## # Total <dbl>
# check shape of the dataset
dim(df)
## [1] 1000 16
Our dataset contains 1000 records and 16 columns
# check column datatypes
sapply(df, class)
## Invoice.ID Branch Customer.type
## "character" "character" "character"
## Gender Product.line Unit.price
## "character" "character" "numeric"
## Quantity Tax Date
## "integer" "numeric" "character"
## Time Payment cogs
## "character" "character" "numeric"
## gross.margin.percentage gross.income Rating
## "numeric" "numeric" "numeric"
## Total
## "numeric"
# inspect variable classes
str(df)
## tibble [1,000 × 16] (S3: tbl_df/tbl/data.frame)
## $ Invoice.ID : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr [1:1000] "A" "C" "A" "A" ...
## $ Customer.type : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr [1:1000] "Female" "Female" "Male" "Male" ...
## $ Product.line : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr [1:1000] "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num [1:1000] 549 80.2 340.5 489 634.4 ...
We need to convert time into time datatype and categorical columns to factor datatype.
# convert date and time to standard date and time format
df$Date <- as.POSIXct(df$Date, format = "%m/%d/%Y")
df$Date <- as.Date(df$Date)
df$Time <- as.POSIXct(df$Time, format = "%H:%M")
df$Time <- format(df$Time,"%H:%M")
# convert logical values to factor
logical.to.factor <- function(column){
as.factor(column)
}
# logical columns
df$Gender <- logical.to.factor(df$Gender)
df$Branch <- logical.to.factor(df$Branch)
df$Customer.type <- logical.to.factor(df$Customer.type)
df$Product.line <- logical.to.factor(df$Product.line)
df$Payment <- logical.to.factor(df$Payment)
# preview dataset
head(df)
## # A tibble: 6 × 16
## Invoice.ID Branch Customer.type Gender Product.line Unit.price Quantity Tax
## <chr> <fct> <fct> <fct> <fct> <dbl> <int> <dbl>
## 1 750-67-8428 A Member Female Health and … 74.7 7 26.1
## 2 226-31-3081 C Normal Female Electronic … 15.3 5 3.82
## 3 631-41-3108 A Normal Male Home and li… 46.3 7 16.2
## 4 123-19-1176 A Member Male Health and … 58.2 8 23.3
## 5 373-73-7910 A Normal Male Sports and … 86.3 7 30.2
## 6 699-14-3026 C Normal Male Electronic … 85.4 7 29.9
## # … with 8 more variables: Date <date>, Time <chr>, Payment <fct>, cogs <dbl>,
## # gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## # Total <dbl>
Dataset is provided by the client.
# select only numerical variables
library("plyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data_num <- select_if(df, is.numeric)
head(data_num)
## # A tibble: 6 × 8
## Unit.price Quantity Tax cogs gross.margin.percen… gross.income Rating Total
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 74.7 7 26.1 523. 4.76 26.1 9.1 549.
## 2 15.3 5 3.82 76.4 4.76 3.82 9.6 80.2
## 3 46.3 7 16.2 324. 4.76 16.2 7.4 341.
## 4 58.2 8 23.3 466. 4.76 23.3 8.4 489.
## 5 86.3 7 30.2 604. 4.76 30.2 5.3 634.
## 6 85.4 7 29.9 598. 4.76 29.9 4.1 628.
# Check for outliers
# import ggplot2 for plotting
library(ggplot2)
# vectorize column names
columns <- c(names(data_num))
# plot boxplot
boxplots <- function(column_name){
column <- unlist(data_num[,column_name])
p1 <- ggplot(data=data_num, mapping=aes(column))+geom_boxplot() + xlab(names(data_num[,column_name]))
print(p1)
}
# create 8 subplots
num <- c()
for (x in 1:length(columns)){
plot <- boxplots(x)
}
#library(patchwork)
Outliers exist in Tax, cogs, gross.income and Total columns.
# check for missing values
colSums(is.na(df))
## Invoice.ID Branch Customer.type
## 0 0 0
## Gender Product.line Unit.price
## 0 0 0
## Quantity Tax Date
## 0 0 0
## Time Payment cogs
## 0 0 0
## gross.margin.percentage gross.income Rating
## 0 0 0
## Total
## 0
Our dataset has no missing values.
# check for duplicates
anyDuplicated(df)
## [1] 0
Our dataset has no duplicated rows.
# convert column names to lowercase
names(df) <- tolower(names(df))
# check relevance of columns
unique(df$gross.margin.percentage)
## [1] 4.761905
# drop column since it is a constant
df <- df[,-13]
data_num <- data_num[,-5]
The gross margin percentage is a constant equal to 4.761905.
# checking summary of dataset
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(df)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## vars n mean sd median trimmed mad min max
## invoice.id* 1 1000 500.50 288.82 500.50 500.50 370.65 1.00 1000.00
## branch* 2 1000 1.99 0.82 2.00 1.99 1.48 1.00 3.00
## customer.type* 3 1000 1.50 0.50 1.00 1.50 0.00 1.00 2.00
## gender* 4 1000 1.50 0.50 1.00 1.50 0.00 1.00 2.00
## product.line* 5 1000 3.45 1.72 3.00 3.44 1.48 1.00 6.00
## unit.price 6 1000 55.67 26.49 55.23 55.62 33.37 10.08 99.96
## quantity 7 1000 5.51 2.92 5.00 5.51 2.97 1.00 10.00
## tax 8 1000 15.38 11.71 12.09 14.00 11.13 0.51 49.65
## date 9 1000 NaN NA NA NaN NA Inf -Inf
## time* 10 1000 252.18 147.07 249.00 252.49 190.51 1.00 506.00
## payment* 11 1000 2.00 0.83 2.00 2.00 1.48 1.00 3.00
## cogs 12 1000 307.59 234.18 241.76 279.91 222.65 10.17 993.00
## gross.income 13 1000 15.38 11.71 12.09 14.00 11.13 0.51 49.65
## rating 14 1000 6.97 1.72 7.00 6.97 2.22 4.00 10.00
## total 15 1000 322.97 245.89 253.85 293.91 233.78 10.68 1042.65
## range skew kurtosis se
## invoice.id* 999.00 0.00 -1.20 9.13
## branch* 2.00 0.02 -1.51 0.03
## customer.type* 1.00 0.00 -2.00 0.02
## gender* 1.00 0.00 -2.00 0.02
## product.line* 5.00 0.06 -1.28 0.05
## unit.price 89.88 0.01 -1.22 0.84
## quantity 9.00 0.01 -1.22 0.09
## tax 49.14 0.89 -0.09 0.37
## date -Inf NA NA NA
## time* 505.00 0.00 -1.25 4.65
## payment* 2.00 0.00 -1.55 0.03
## cogs 982.83 0.89 -0.09 7.41
## gross.income 49.14 0.89 -0.09 0.37
## rating 6.00 0.01 -1.16 0.05
## total 1031.97 0.89 -0.09 7.78
# First and last dates in the dataset
paste('The dataset contains records from', min(df$date), 'to', max(df$date))
## [1] "The dataset contains records from 2018-12-31 to 2019-03-29"
# Check interquatile range of numerical columns.
sapply(data_num, IQR)
## Unit.price Quantity Tax cogs gross.income Rating
## 45.06000 5.00000 16.52037 330.40750 16.52037 3.00000
## Total
## 346.92787
# import moments package
library(moments)
# Check skewness of numerical columns.
sapply(data_num, skewness)
## Unit.price Quantity Tax cogs gross.income Rating
## 0.007066827 0.012921628 0.891230392 0.891230392 0.891230392 0.008996129
## Total
## 0.891230392
Tax, Cogs, Gross Income and Total are moderately positively/right skewed while the remaining columns are fairly symmetrical.
# Check kurtosis of numerical columns.
sapply(data_num, kurtosis)
## Unit.price Quantity Tax cogs gross.income Rating
## 1.781499 1.784528 2.912530 2.912530 2.912530 1.848169
## Total
## 2.912530
The distribution of numerical columns is platykurtic(broad and light tailed)
# Pie chart function
# import plotrix for 3D plot
library(plotrix)
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
##
## rescale
piechart <- function(column, title){
# define labels and values
slices <- c(tabulate(column))
lbls <- c(levels(column))
# convert values to percentage
pct <- round(slices/sum(slices)*100)
# add percentages to labels
lbls <- paste(lbls,":", pct)
# ad % to labels
lbls <- paste(lbls,"%",sep="")
# plot pie chart in 3D
pie3D(slices,labels=lbls, col=rainbow(length(lbls)),
explode=0.1, radius=1, start=40, main=title)
}
# Gender representation
piechart(df$gender, 'Gender representation')
There is equal representation of male and female in the dataset.
# Customer type representation
piechart(df$customer.type, 'Customer type representation')
There is also an equal representation of customer type in our dataset
# Branch representation
piechart(df$branch, 'Branch representation')
There is almost an equal representation of branch type per customer.
# Payment representation
piechart(df$payment, 'Payment representation')
Most customers use Ewallet and cash methods of payment than credit cards.
# histograms function
histogram <- function(column,title, xlab){
ggplot(df, aes(x=column)) + geom_histogram(color="black", fill="#283747") + ggtitle(title) + xlab(xlab) + theme(plot.title = element_text(hjust=0.5))
}
# Tax Distribution
histogram(df$tax, 'Tax Distribution', 'Tax')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Tax distribution is right skewed. Majority of data lies between values 4 and 8.
# Cogs Distribution
histogram(df$cogs, 'Cogs Distribution', 'Cogs')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Cogs distribution is right skewed, majority of data lies between 0 and 250.
# Gross Income Distribution
histogram(df$gross.income, 'Gross Income Distribution', 'Gross Income')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Gross Income distribution is right skewed, majority of data lies between 0 and 10.
# Unit price Distribution
histogram(df$unit.price, 'Unit Price Distribution', 'Unit Price')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Unit price has an fairly symmetrical distribution. Skewness ~ 0.
# Unit price Distribution
histogram(df$rating, 'Rating Distribution', 'Rating')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Rating also has an fairly symmetrical distribution. Skewness ~ 0.
# Total Distribution
histogram(df$total, 'Total Distribution', 'Total')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Total distribution is right skewed, majority of data lies between 0 and 250.
# Barplot function
bar <- function(column,title, xlab, angle){
ggplot(df, aes(x=column)) + geom_bar(color="black", fill="#154360") + ggtitle(title) + xlab(xlab) + theme(axis.text.x = element_text(angle = angle, vjust = 1, hjust=1), plot.title = element_text(hjust=0.5))
}
# Quantity representation
bar(as.factor(df$quantity), 'Quantity Distribution', 'Quantity', 360)
Most customers purchased 10 items(itemset).
# Product Category representation
bar(df$product.line, 'Product Category Representation', 'Product Category',45)
Leading product category bought by customers represented in the dataset is Fashion accessories, Food and beverages and Electronic accessories in that order.
# import package for date manipulation
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Day of the week representation
bar(as.factor(wday(df$date, label=TRUE, abbr=FALSE)), 'Day of the Week Representation', 'Day of the Week', 45)
Friday has the highest representation in the dataset, followed by Monday then Tuesday.
# Month representation
bar(as.factor(format(as.Date(df$date), "%m")), 'Month Representation', 'Month', 45)
Only January, February, March and December are represented in our dataset. Representation proportion is largest for January and lowest for December.
# Hour representation
bar(as.factor(format(strptime(df$time,"%H:%M"),'%H')), 'Hour Representation', 'Hour', 0)
Most sales were recorded in the 19th hour and the least sales in the 17th hour.
# create funtion to plot stacked barchart
stacked <- function(column1, column2, title, value1, value2, legend){
# vectorize the two columns to plot
attribute1 <- c(column1)
attribute2 <- c(column2)
# create dataframe of the two columns
data <- data.frame(attribute1, attribute2)
# plot stacked bargraph
ggplot(data=data) + geom_bar(aes(fill=attribute2, x=attribute1)) + ggtitle(title) + xlab(value1) + ylab(value2) + theme(plot.title = element_text(hjust=0.5)) + labs(fill=legend) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5))
}
# Branch against Customer Type
stacked(df$branch, df$customer.type, "Branch against Customer Type Representation", "Branch", "Customer Type",
"Customer Type")
Customers are distributed fairly uniformly across all three branches regardless of their membership status.
# Branch against Gender Representation
stacked(df$branch, df$gender, "Branch against Gender Representation", "Branch", "Gender",
"Gender")
Branch A and B recieve relatively more male customers. Branch C recieves relatively fewer male customers.
# Customer Type against Gender Representation
stacked(df$customer.type, df$gender, " Customer Type against Gender Representation", " Customer Type", "Gender", "Gender")
Slighly more female customers are registered members than male customers.
# Payment against Gender Representation
stacked(df$payment, df$gender, " Payment against Gender Representation", "Payment", "Gender", "Gender")
Most male customers prefer Ewallet while most female customers prefer cash mode of payment. An almost equal number of male and female customers prefer credit cards, but this number is relatively low.
# Customer Type against payment Representation
stacked(df$payment, df$customer.type, " Customer Type against Mode of Payment Representation", " Payment mode", "Customer Type", "Customer Type")
Distribution of customers who prefer cash and Ewallet with respect to mebership status is fairly equal. Customers who use credit card as mode of payment are mostly members.
# plot mosaicplot for revenue against region
mosaic <- function(col1,col2, xlab, ylab, title){
# contingecy table
col1 <- col1
col2 <- col2
cont <- table(col1,col2)
print(cont)
mosaicplot(cont, xlab=xlab, ylab=ylab, main=title, color="#68a3b3", las=1)
}
# Product Category against Branch Representation
mosaic(df$branch, df$product.line, "Branch", "Product Category", "Product Category against Branch Representation")
## col2
## col1 Electronic accessories Fashion accessories Food and beverages
## A 60 51 58
## B 55 62 50
## C 55 65 66
## col2
## col1 Health and beauty Home and lifestyle Sports and travel
## A 47 65 59
## B 53 50 62
## C 52 45 45
Fashion accessories and Food and beverages have recorded more sales in Branch C. Branch A leads in electronic accessories and home and lifestyle accessories. Branch B leads in sports and travel.
# Branch against payment Representation
mosaic(df$payment, df$branch, "Payment", "Branch", "Branch against Mode of Payment Representation")
## col2
## col1 A B C
## Cash 110 110 124
## Credit card 104 109 98
## Ewallet 126 113 106
Most customers in branch C prefer cash as the mode of payment, while those visiting branch A prefer Ewallet and those visiting Branch B have no preference for mode of payment.
# Product Category against Gender Representation
mosaic(df$gender, df$product.line, "Gender", "Product Category", "Product Category against Gender Representation")
## col2
## col1 Electronic accessories Fashion accessories Food and beverages
## Female 84 96 90
## Male 86 82 84
## col2
## col1 Health and beauty Home and lifestyle Sports and travel
## Female 64 79 88
## Male 88 81 78
Apparently, more male customers purchase health and beauty products than female customers. Female customers purchase more fashion accessories, food and beverage products and sports and travel products.
# Product Category against Payment Representation
mosaic(df$payment, df$product.line, "Payment", "Product Category", "Product Category against Mode of Payment Representation")
## col2
## col1 Electronic accessories Fashion accessories Food and beverages
## Cash 71 57 57
## Credit card 46 56 61
## Ewallet 53 65 56
## col2
## col1 Health and beauty Home and lifestyle Sports and travel
## Cash 49 51 59
## Credit card 50 45 53
## Ewallet 53 64 54
Electronic accessories are mostly paid through cash. Food and beverages are mostly paid through credit card. Home and lifestyle products are paid mainly through Ewallet.
# Product Category against Customer Type Representation
mosaic(df$customer.type, df$product.line, "Customer Type", "Product Category", "Product Category against Customer Type Representation")
## col2
## col1 Electronic accessories Fashion accessories Food and beverages
## Member 78 86 94
## Normal 92 92 80
## col2
## col1 Health and beauty Home and lifestyle Sports and travel
## Member 73 83 87
## Normal 79 77 79
Electronic accessories are mostly bought by non-members while sports and travel and food and beverages products are mainly bought by members.
# covariance matrix
cov(data_num)
## Unit.price Quantity Tax cogs gross.income
## Unit.price 701.9653313 0.83477848 196.6683401 3933.36680 196.6683401
## Quantity 0.8347785 8.54644645 24.1495704 482.99141 24.1495704
## Tax 196.6683401 24.14957038 137.0965941 2741.93188 137.0965941
## cogs 3933.3668019 482.99140761 2741.9318829 54838.63766 2741.9318829
## gross.income 196.6683401 24.14957038 137.0965941 2741.93188 137.0965941
## Rating -0.3996675 -0.07945646 -0.7333003 -14.66601 -0.7333003
## Total 4130.0351420 507.14097799 2879.0284770 57580.56954 2879.0284770
## Rating Total
## Unit.price -0.39966752 4130.03514
## Quantity -0.07945646 507.14098
## Tax -0.73330028 2879.02848
## cogs -14.66600553 57580.56954
## gross.income -0.73330028 2879.02848
## Rating 2.95351823 -15.39931
## Total -15.39930581 60459.59802
All variables have positive covariance towards one another except for rating which has negative covariance to all other variables.
# correlation matrix
# import ggcorrplot package for plotting correlation heatmap
library(ggcorrplot)
corr <- round(cor(data_num),2)
ggcorrplot(corr, lab=TRUE, title='Correlation Heatmap', colors=c('#E799A3','#43302E','#E799A3'))
Tax, cog, gross income and total have a correlation of 1. Rating is the only variable with a weak correlation value to other variables.
# Drop Quantity. Tax, cog, gross income and total columns. We shall only retain unit price and rating
df <- select(df,-c(quantity, tax, cogs, gross.income, total))
data_num <- select(data_num,-c(Tax, cogs, gross.income, Quantity, Total))
# preview dataset
head(df)
## # A tibble: 6 × 10
## invoice.id branch customer.type gender product.line unit.price date
## <chr> <fct> <fct> <fct> <fct> <dbl> <date>
## 1 750-67-8428 A Member Female Health and beau… 74.7 2019-01-04
## 2 226-31-3081 C Normal Female Electronic acce… 15.3 2019-03-07
## 3 631-41-3108 A Normal Male Home and lifest… 46.3 2019-03-02
## 4 123-19-1176 A Member Male Health and beau… 58.2 2019-01-26
## 5 373-73-7910 A Normal Male Sports and trav… 86.3 2019-02-07
## 6 699-14-3026 C Normal Male Electronic acce… 85.4 2019-03-24
## # … with 3 more variables: time <chr>, payment <fct>, rating <dbl>
corr <- round(cor(data_num),2)
ggcorrplot(corr, lab=TRUE, title='Correlation Heatmap', colors=c('#E799A3','#43302E','#E799A3'))
# scatterplot function
scatter <- function(column1, column2, x_label, y_label, title){
# Change point shapes and color by the levels of branch
ggplot(df, aes(x=column1, y=column2, shape=branch, color=branch)) +
geom_point() + labs(title=title) + xlab(x_label) + ylab(y_label)
}
# plot scatter for unit price against rating according to branch
scatter(df$rating, df$unit.price, 'Rating', 'Unit Price', 'Rating against Unit Price')
There is no significant pattern since rating and unit price are not correlated.
# Barplot function
barc <- function(column1,column2, xlabel, ylabel, title, angle){
# unlist columns
column1 <- unlist(column1)
column2 <- unlist(column2)
# since distribution are not gaussian, we shall use median in place of mean
mean.df <- aggregate( column2 ~ column1, df, median)
names(mean.df)[2] <- "average.unit.price"
# plot
ggplot(data=mean.df, aes(x=unlist(mean.df[1]), y=unlist(mean.df[2]))) + geom_bar(stat="identity",fill="steelblue") + xlab(xlabel) +ylab(ylabel) + ggtitle(title) + theme(axis.text.x = element_text(angle = angle, vjust = 1, hjust=1), plot.title = element_text(hjust=0.5))
}
# Product Category against Median Unit Price
barc(df$product.line,df$unit.price, 'Product Category', 'Unit Price', 'Product Category against Median Unit Price', 45)
Sports and travel category has the highest unit prices while the electronic accessories category is the least expensive.
# Product line against rating
barc(df$product.line,df$rating, 'Product category', 'Rating', 'Product Category against Rating', 45)
Food and beverages and Health and beauty categories have the highest rating. Electronic accessories and Sports and travel accessories have the least rating. However the supermarket has a high rating overall.
# Branch against Average Unit Price
barc(df$branch,df$unit.price, 'Branch', 'Unit Price', 'Branch against Average Unit Price', 0)
Least expensive items are bought from branch A while most expensive items are bought from Branch C.
# Branch against rating
barc(df$branch, df$rating, 'Branch', 'Rating', 'Branch against Rating', 0)
Branch A received highest rating(maybe due to lower prices) while Branch B received the lowest rating.
# Customer Type against rating
barc(df$customer.type, df$rating, 'Customer Type', 'Rating', 'Customer Type against Rating', 0)
There is no significant difference in average ratings based on customer type.
# Gender against rating
barc(df$gender, df$rating, 'Gender', 'Rating', 'Gender against Rating', 0)
Ratings received are received from female than male customers.
# Payment against unit price
barc(df$payment, df$unit.price, 'Mode of payment', 'Unit Price', 'Mode of payment against Unit Price', 0)
Customers prefer paying for fairly expensive products by cash and least expensive products by credit card.
# lineplot for date against unit prices
ggplot(data=df, aes(x=date, y=unit.price, group=1)) +
geom_line(color="#5F6A6A")+
geom_point() +
scale_x_date(date_labels = "%Y %b %d", date_breaks = "5 days") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5)) + xlab("Day") + ggtitle("Unit Price Trend") + ylab("Unit Price")
Unit prices are consistent.
# Convert factors to numeric
fact_to_num <- function(column){
as.integer(column)
}
df$branch <- fact_to_num(df$branch)
df$product.line <- fact_to_num(df$product.line)
df$customer.type <- fact_to_num(df$customer.type)
df$gender <- fact_to_num(df$gender)
df$payment <- fact_to_num(df$payment)
# extract month, day of the week and hour
# weekday
df$weekday <- as.integer(wday(df$date, week_start=1)) # Monday as first day
# day of month
df$day <- as.integer(format(as.Date(df$date), "%d"))
# Month
df$month <- as.integer(format(as.Date(df$date), "%m"))
# Hour
df$hour <- as.integer(format(strptime(df$time,"%H:%M"),'%H'))
# drop unnecessary columns
# drop invoice id, date and time
df <- select(df,-c(invoice.id, date, time))
# preview dataset
head(df)
## # A tibble: 6 × 11
## branch customer.type gender product.line unit.price payment rating weekday
## <int> <int> <int> <int> <dbl> <int> <dbl> <int>
## 1 1 1 1 4 74.7 3 9.1 5
## 2 3 2 1 1 15.3 1 9.6 4
## 3 1 2 2 5 46.3 2 7.4 6
## 4 1 1 2 4 58.2 3 8.4 6
## 5 1 2 2 6 86.3 3 5.3 4
## 6 3 2 2 1 85.4 3 4.1 7
## # … with 3 more variables: day <int>, month <int>, hour <int>
# import factoetra package
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# peform pca
df.pca <- prcomp(df, scale=TRUE)
# Print variable observations
get_pca_var(df.pca)
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
#getting the Eigenvalues to determine variability contribution of each variable
get_eigenvalue(df.pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.2359462 11.235875 11.23587
## Dim.2 1.1419900 10.381727 21.61760
## Dim.3 1.1095499 10.086818 31.70442
## Dim.4 1.0585258 9.622962 41.32738
## Dim.5 1.0355471 9.414064 50.74145
## Dim.6 0.9878240 8.980218 59.72167
## Dim.7 0.9536648 8.669680 68.39135
## Dim.8 0.9266610 8.424191 76.81554
## Dim.9 0.9073199 8.248363 85.06390
## Dim.10 0.8569319 7.790290 92.85419
## Dim.11 0.7860391 7.145810 100.00000
The contribution of each feature is significant since all 10 features account for almost 93% of the variability in the dataset. However, we can decide to use 8 dimensions which account for almost 77% variability.
# Plot the principal components(Scree plot)
fviz_eig(df.pca, addlabels = T, ylim = c(0, 15))
Percentage accountability in variability difference between the first and last principal component is very small(11.2-7.8=3.4).
# visualize principal components
fviz_pca_var(df.pca, col.var = "#512E5F")
#Plotting the contribution of features in each Principal Component
# First Principal Component
fviz_contrib(df.pca,choice = "var", axes = 1)
Weekday, month and day contribute most in the first principal component.
# Second Principal Component
fviz_contrib(df.pca,choice = "var", axes = 2)
Gender, branch, payment and hour contribute most in the second principal component.
# Third Principal Component
fviz_contrib(df.pca, choice = "var", axes = 3)
Product.line, branch and hour contribute most in the third principal component.
# Fourth Principal Component
fviz_contrib(df.pca,choice = "var", axes = 4)
Fourth pc - Customer type, unit price, hour, rating and product line.
# Fifth Principal Component
fviz_contrib(df.pca,choice = "var", axes = 5)
Fifth pc - Unit price, day and payment
# Sixth Principal Component
fviz_contrib(df.pca,choice = "var", axes = 6)
Sixth pc - rating, weekday and payment.
# Seventh Principal Component
fviz_contrib(df.pca,choice = "var", axes = 7)
7th pc - Mode of payment, customer type. branch, gender and rating.
# Eighth Principal Component
fviz_contrib(df.pca,choice = "var", axes = 8)
8th pc - Branch, product category, gender, customer type, day and mode of payment.
We shall settle on 8 principal components which account for 76.82 variability in our dataset.
After removing highly correlated features, we remain with features whose variable contributions are very close to one another. In order to determine which of these are more important than others, we may require more data.
Number of principal components to use is dependent on the data scientist hence no proof that 8 principal components would be enough to train a machine learning model on.
The data was provided by the client.
The research question was also provided by the client.
8 principal components are able to account of 76.82% variability on the dataset. We can use these 8 principal components for modelling.
We the data science team recommend the following:
Determine why more female customers shop in branch C targetting fashion accessories and food and beverages.
Determine why more male customers shop in branch A targetting electronic accessories and home and lifestyle products.
The leading product in sales in Branch B is sports and travel(mainly bought by female in general) but it receives more male customers than female. Determine why this is the case.
The supermarket receives more customers during the 19th hour hence should look into attendant customer ratio to prevent missing out on impatient customers and reduce waiting and service time.
In the represented year(2019), January recorded the highest number of sales. More data can help determine if this is usually the case. If so, discounts/offers can be advertised at this period to attract more customers.