Principal Component Analysis for Carrefour Supermarket dataset in R

1. Defining the Question

Research Question

You are a Data analyst at Carrefour Kenya and are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax). Your project has been divided into four parts where you’ll explore a recent marketing dataset by performing various unsupervised learning techniques and later providing recommendations based on your insights.

a.) Specifying the question

Perform dimensionality reduction and feature selection on the dataset for feature extraction on the dataset provided.

b.) The metric for success

Determining which features contributing most to variability in the dataset hence useful for modelling.

c.) Understanding the context

The dataset provided contains attributes of customers who visit the three branches of the supermarket.

d.) Experimental design taken

  1. Reading the data

  2. Checking the data - data understanding

  3. Implementing the solution

  4. Challenge the solution

  5. Follow up Questions

  6. Conclusion.

  7. Recommendations.

e.) Data appropriateness to answer the given question.

Through EDA and PCA, we are able to determine which features contribute to the most variability in our dataset and which features can be used in our modelling phase.

2. Loading the dataset

# import Tibble package
library(tibble)


# url <- http://bit.ly/CarreFourDataset

# load the dataset as dataframe

df <- read.csv('http://bit.ly/CarreFourDataset') 

# convert dataframe to Tibble

df <- as_tibble(df)

#check data structure of our dataset

class(df)
## [1] "tbl_df"     "tbl"        "data.frame"

3. Checking the data

# Previewing our first 5 records
#
head(df)
## # A tibble: 6 × 16
##   Invoice.ID  Branch Customer.type Gender Product.line Unit.price Quantity   Tax
##   <chr>       <chr>  <chr>         <chr>  <chr>             <dbl>    <int> <dbl>
## 1 750-67-8428 A      Member        Female Health and …       74.7        7 26.1 
## 2 226-31-3081 C      Normal        Female Electronic …       15.3        5  3.82
## 3 631-41-3108 A      Normal        Male   Home and li…       46.3        7 16.2 
## 4 123-19-1176 A      Member        Male   Health and …       58.2        8 23.3 
## 5 373-73-7910 A      Normal        Male   Sports and …       86.3        7 30.2 
## 6 699-14-3026 C      Normal        Male   Electronic …       85.4        7 29.9 
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## #   gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## #   Total <dbl>
# Previewing our last 5 records
#
tail(df)
## # A tibble: 6 × 16
##   Invoice.ID  Branch Customer.type Gender Product.line Unit.price Quantity   Tax
##   <chr>       <chr>  <chr>         <chr>  <chr>             <dbl>    <int> <dbl>
## 1 652-49-6720 C      Member        Female Electronic …       61.0        1  3.05
## 2 233-67-5758 C      Normal        Male   Health and …       40.4        1  2.02
## 3 303-96-2227 B      Normal        Female Home and li…       97.4       10 48.7 
## 4 727-02-1313 A      Member        Male   Food and be…       31.8        1  1.59
## 5 347-56-2442 A      Normal        Male   Home and li…       65.8        1  3.29
## 6 849-09-3807 A      Member        Female Fashion acc…       88.3        7 30.9 
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## #   gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## #   Total <dbl>
# check shape of the dataset

dim(df)
## [1] 1000   16

Our dataset contains 1000 records and 16 columns

# check column datatypes

sapply(df, class)
##              Invoice.ID                  Branch           Customer.type 
##             "character"             "character"             "character" 
##                  Gender            Product.line              Unit.price 
##             "character"             "character"               "numeric" 
##                Quantity                     Tax                    Date 
##               "integer"               "numeric"             "character" 
##                    Time                 Payment                    cogs 
##             "character"             "character"               "numeric" 
## gross.margin.percentage            gross.income                  Rating 
##               "numeric"               "numeric"               "numeric" 
##                   Total 
##               "numeric"
# inspect variable classes

str(df)
## tibble [1,000 × 16] (S3: tbl_df/tbl/data.frame)
##  $ Invoice.ID             : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr [1:1000] "A" "C" "A" "A" ...
##  $ Customer.type          : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr [1:1000] "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr [1:1000] "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num [1:1000] 549 80.2 340.5 489 634.4 ...

We need to convert time into time datatype and categorical columns to factor datatype.

# convert  date and time to standard date and time format

df$Date <- as.POSIXct(df$Date, format = "%m/%d/%Y")
df$Date <- as.Date(df$Date)

df$Time <- as.POSIXct(df$Time, format = "%H:%M")
df$Time <- format(df$Time,"%H:%M")

 

# convert logical values to factor

logical.to.factor <- function(column){
  
   as.factor(column)
  
}


# logical columns

df$Gender <- logical.to.factor(df$Gender)
df$Branch <- logical.to.factor(df$Branch)
df$Customer.type <- logical.to.factor(df$Customer.type)
df$Product.line <- logical.to.factor(df$Product.line)
df$Payment <- logical.to.factor(df$Payment)

# preview dataset
head(df)
## # A tibble: 6 × 16
##   Invoice.ID  Branch Customer.type Gender Product.line Unit.price Quantity   Tax
##   <chr>       <fct>  <fct>         <fct>  <fct>             <dbl>    <int> <dbl>
## 1 750-67-8428 A      Member        Female Health and …       74.7        7 26.1 
## 2 226-31-3081 C      Normal        Female Electronic …       15.3        5  3.82
## 3 631-41-3108 A      Normal        Male   Home and li…       46.3        7 16.2 
## 4 123-19-1176 A      Member        Male   Health and …       58.2        8 23.3 
## 5 373-73-7910 A      Normal        Male   Sports and …       86.3        7 30.2 
## 6 699-14-3026 C      Normal        Male   Electronic …       85.4        7 29.9 
## # … with 8 more variables: Date <date>, Time <chr>, Payment <fct>, cogs <dbl>,
## #   gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## #   Total <dbl>

4. Data cleaning

Validation

Dataset is provided by the client.

Outliers

# select only numerical variables
 
library("plyr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data_num <- select_if(df, is.numeric) 
head(data_num)
## # A tibble: 6 × 8
##   Unit.price Quantity   Tax  cogs gross.margin.percen… gross.income Rating Total
##        <dbl>    <int> <dbl> <dbl>                <dbl>        <dbl>  <dbl> <dbl>
## 1       74.7        7 26.1  523.                  4.76        26.1     9.1 549. 
## 2       15.3        5  3.82  76.4                 4.76         3.82    9.6  80.2
## 3       46.3        7 16.2  324.                  4.76        16.2     7.4 341. 
## 4       58.2        8 23.3  466.                  4.76        23.3     8.4 489. 
## 5       86.3        7 30.2  604.                  4.76        30.2     5.3 634. 
## 6       85.4        7 29.9  598.                  4.76        29.9     4.1 628.
# Check for outliers


# import ggplot2 for plotting
library(ggplot2)


# vectorize column names

columns <-  c(names(data_num))

# plot boxplot

boxplots <- function(column_name){
  column <- unlist(data_num[,column_name])
  p1 <- ggplot(data=data_num, mapping=aes(column))+geom_boxplot() + xlab(names(data_num[,column_name]))
  print(p1)
}

# create 8 subplots 

num <- c()

for (x in 1:length(columns)){
   plot <- boxplots(x)
   
}

#library(patchwork)

Outliers exist in Tax, cogs, gross.income and Total columns.

Missing data

# check for missing values
colSums(is.na(df))
##              Invoice.ID                  Branch           Customer.type 
##                       0                       0                       0 
##                  Gender            Product.line              Unit.price 
##                       0                       0                       0 
##                Quantity                     Tax                    Date 
##                       0                       0                       0 
##                    Time                 Payment                    cogs 
##                       0                       0                       0 
## gross.margin.percentage            gross.income                  Rating 
##                       0                       0                       0 
##                   Total 
##                       0

Our dataset has no missing values.

Duplicates

# check for duplicates

anyDuplicated(df)
## [1] 0

Our dataset has no duplicated rows.

Uniformity

# convert column names to lowercase 
names(df) <- tolower(names(df))

Relevance

# check relevance of columns
unique(df$gross.margin.percentage)
## [1] 4.761905
# drop column since it is a constant
df <- df[,-13]
data_num <- data_num[,-5]

The gross margin percentage is a constant equal to 4.761905.

5. Exploratory Data Analysis

a.) Univariate Analysis

# checking summary of dataset
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(df)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
##                vars    n   mean     sd median trimmed    mad   min     max
## invoice.id*       1 1000 500.50 288.82 500.50  500.50 370.65  1.00 1000.00
## branch*           2 1000   1.99   0.82   2.00    1.99   1.48  1.00    3.00
## customer.type*    3 1000   1.50   0.50   1.00    1.50   0.00  1.00    2.00
## gender*           4 1000   1.50   0.50   1.00    1.50   0.00  1.00    2.00
## product.line*     5 1000   3.45   1.72   3.00    3.44   1.48  1.00    6.00
## unit.price        6 1000  55.67  26.49  55.23   55.62  33.37 10.08   99.96
## quantity          7 1000   5.51   2.92   5.00    5.51   2.97  1.00   10.00
## tax               8 1000  15.38  11.71  12.09   14.00  11.13  0.51   49.65
## date              9 1000    NaN     NA     NA     NaN     NA   Inf    -Inf
## time*            10 1000 252.18 147.07 249.00  252.49 190.51  1.00  506.00
## payment*         11 1000   2.00   0.83   2.00    2.00   1.48  1.00    3.00
## cogs             12 1000 307.59 234.18 241.76  279.91 222.65 10.17  993.00
## gross.income     13 1000  15.38  11.71  12.09   14.00  11.13  0.51   49.65
## rating           14 1000   6.97   1.72   7.00    6.97   2.22  4.00   10.00
## total            15 1000 322.97 245.89 253.85  293.91 233.78 10.68 1042.65
##                  range skew kurtosis   se
## invoice.id*     999.00 0.00    -1.20 9.13
## branch*           2.00 0.02    -1.51 0.03
## customer.type*    1.00 0.00    -2.00 0.02
## gender*           1.00 0.00    -2.00 0.02
## product.line*     5.00 0.06    -1.28 0.05
## unit.price       89.88 0.01    -1.22 0.84
## quantity          9.00 0.01    -1.22 0.09
## tax              49.14 0.89    -0.09 0.37
## date              -Inf   NA       NA   NA
## time*           505.00 0.00    -1.25 4.65
## payment*          2.00 0.00    -1.55 0.03
## cogs            982.83 0.89    -0.09 7.41
## gross.income     49.14 0.89    -0.09 0.37
## rating            6.00 0.01    -1.16 0.05
## total          1031.97 0.89    -0.09 7.78
# First and last dates in the dataset

paste('The dataset contains records from', min(df$date), 'to', max(df$date)) 
## [1] "The dataset contains records from 2018-12-31 to 2019-03-29"
# Check interquatile range of numerical columns.
sapply(data_num, IQR)
##   Unit.price     Quantity          Tax         cogs gross.income       Rating 
##     45.06000      5.00000     16.52037    330.40750     16.52037      3.00000 
##        Total 
##    346.92787
# import moments package
library(moments)

# Check skewness of numerical columns.
sapply(data_num, skewness)
##   Unit.price     Quantity          Tax         cogs gross.income       Rating 
##  0.007066827  0.012921628  0.891230392  0.891230392  0.891230392  0.008996129 
##        Total 
##  0.891230392

Tax, Cogs, Gross Income and Total are moderately positively/right skewed while the remaining columns are fairly symmetrical.

# Check kurtosis of numerical columns.
sapply(data_num, kurtosis)
##   Unit.price     Quantity          Tax         cogs gross.income       Rating 
##     1.781499     1.784528     2.912530     2.912530     2.912530     1.848169 
##        Total 
##     2.912530

The distribution of numerical columns is platykurtic(broad and light tailed)

# Pie chart function

# import plotrix for 3D plot
library(plotrix)
## 
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
## 
##     rescale
piechart <- function(column, title){
  
  # define labels and values
  slices <- c(tabulate(column))
  lbls <- c(levels(column))
  
  # convert values to percentage
  pct <- round(slices/sum(slices)*100)
  
  # add percentages to labels
  lbls <- paste(lbls,":", pct) 
  
  # ad % to labels
  lbls <- paste(lbls,"%",sep="") 
  
  # plot pie chart in 3D
  pie3D(slices,labels=lbls, col=rainbow(length(lbls)), 
        explode=0.1, radius=1, start=40, main=title)
}
# Gender representation
piechart(df$gender, 'Gender representation')

There is equal representation of male and female in the dataset.

# Customer type representation
piechart(df$customer.type, 'Customer type representation')

There is also an equal representation of customer type in our dataset

# Branch representation
piechart(df$branch, 'Branch representation')

There is almost an equal representation of branch type per customer.

# Payment representation
piechart(df$payment, 'Payment representation')

Most customers use Ewallet and cash methods of payment than credit cards.

# histograms function
histogram <- function(column,title, xlab){
  
 ggplot(df, aes(x=column)) + geom_histogram(color="black", fill="#283747") + ggtitle(title) + xlab(xlab) + theme(plot.title = element_text(hjust=0.5))
  
}  
# Tax Distribution
histogram(df$tax, 'Tax Distribution', 'Tax')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Tax distribution is right skewed. Majority of data lies between values 4 and 8.

# Cogs Distribution
histogram(df$cogs, 'Cogs Distribution', 'Cogs')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Cogs distribution is right skewed, majority of data lies between 0 and 250.

# Gross Income Distribution
histogram(df$gross.income, 'Gross Income Distribution', 'Gross Income')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Gross Income distribution is right skewed, majority of data lies between 0 and 10.

# Unit price Distribution
histogram(df$unit.price, 'Unit Price Distribution', 'Unit Price')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Unit price has an fairly symmetrical distribution. Skewness ~ 0.

# Unit price Distribution
histogram(df$rating, 'Rating Distribution', 'Rating')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Rating also has an fairly symmetrical distribution. Skewness ~ 0.

# Total Distribution
histogram(df$total, 'Total Distribution', 'Total')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Total distribution is right skewed, majority of data lies between 0 and 250.

# Barplot function
bar <- function(column,title, xlab, angle){
  
 ggplot(df, aes(x=column)) + geom_bar(color="black", fill="#154360") + ggtitle(title) + xlab(xlab) + theme(axis.text.x = element_text(angle = angle, vjust = 1, hjust=1), plot.title = element_text(hjust=0.5))
  
}  
# Quantity representation
bar(as.factor(df$quantity), 'Quantity Distribution', 'Quantity', 360)

Most customers purchased 10 items(itemset).

# Product Category representation
bar(df$product.line, 'Product Category Representation', 'Product Category',45)

Leading product category bought by customers represented in the dataset is Fashion accessories, Food and beverages and Electronic accessories in that order.

# import package for date manipulation
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# Day of the week representation
bar(as.factor(wday(df$date, label=TRUE, abbr=FALSE)), 'Day of the Week Representation', 'Day of the Week', 45)

Friday has the highest representation in the dataset, followed by Monday then Tuesday.

# Month representation
bar(as.factor(format(as.Date(df$date), "%m")), 'Month Representation', 'Month', 45)

Only January, February, March and December are represented in our dataset. Representation proportion is largest for January and lowest for December.

# Hour representation
bar(as.factor(format(strptime(df$time,"%H:%M"),'%H')), 'Hour Representation', 'Hour', 0)

Most sales were recorded in the 19th hour and the least sales in the 17th hour.

b.) Bivariate Analysis

Categorical - Categorical

# create funtion to plot stacked barchart
stacked <- function(column1, column2, title, value1, value2, legend){
  
    # vectorize the two columns to plot
    attribute1 <- c(column1) 
    attribute2 <- c(column2)
    
    # create dataframe of the two columns
    data <- data.frame(attribute1, attribute2)
    
    # plot stacked bargraph
    ggplot(data=data) + geom_bar(aes(fill=attribute2, x=attribute1)) + ggtitle(title) + xlab(value1) + ylab(value2) + theme(plot.title = element_text(hjust=0.5)) + labs(fill=legend) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5)) 
    
    
}
# Branch against Customer Type

stacked(df$branch, df$customer.type, "Branch against Customer Type Representation", "Branch", "Customer Type",
"Customer Type")

Customers are distributed fairly uniformly across all three branches regardless of their membership status.

# Branch against Gender Representation

stacked(df$branch, df$gender, "Branch against Gender Representation", "Branch", "Gender",
"Gender")

Branch A and B recieve relatively more male customers. Branch C recieves relatively fewer male customers.

# Customer Type against Gender Representation

stacked(df$customer.type, df$gender, " Customer Type against Gender Representation", " Customer Type", "Gender", "Gender")

Slighly more female customers are registered members than male customers.

# Payment against Gender Representation

stacked(df$payment, df$gender, " Payment against Gender Representation", "Payment", "Gender", "Gender")

Most male customers prefer Ewallet while most female customers prefer cash mode of payment. An almost equal number of male and female customers prefer credit cards, but this number is relatively low.

# Customer Type against payment Representation

stacked(df$payment, df$customer.type, " Customer Type against Mode of Payment Representation", " Payment mode", "Customer Type", "Customer Type")

Distribution of customers who prefer cash and Ewallet with respect to mebership status is fairly equal. Customers who use credit card as mode of payment are mostly members.

# plot mosaicplot for revenue against region

mosaic <- function(col1,col2, xlab, ylab, title){
 
  # contingecy table
  
  col1 <- col1
  col2 <- col2
  cont <- table(col1,col2)
  print(cont)
  
mosaicplot(cont, xlab=xlab, ylab=ylab, main=title, color="#68a3b3", las=1)
}
# Product Category against Branch Representation

mosaic(df$branch, df$product.line, "Branch", "Product Category", "Product Category against Branch Representation")
##     col2
## col1 Electronic accessories Fashion accessories Food and beverages
##    A                     60                  51                 58
##    B                     55                  62                 50
##    C                     55                  65                 66
##     col2
## col1 Health and beauty Home and lifestyle Sports and travel
##    A                47                 65                59
##    B                53                 50                62
##    C                52                 45                45

Fashion accessories and Food and beverages have recorded more sales in Branch C. Branch A leads in electronic accessories and home and lifestyle accessories. Branch B leads in sports and travel.

# Branch against payment Representation

mosaic(df$payment, df$branch, "Payment", "Branch", "Branch against Mode of Payment Representation")
##              col2
## col1            A   B   C
##   Cash        110 110 124
##   Credit card 104 109  98
##   Ewallet     126 113 106

Most customers in branch C prefer cash as the mode of payment, while those visiting branch A prefer Ewallet and those visiting Branch B have no preference for mode of payment.

# Product Category against Gender Representation

mosaic(df$gender, df$product.line, "Gender", "Product Category", "Product Category against Gender Representation")
##         col2
## col1     Electronic accessories Fashion accessories Food and beverages
##   Female                     84                  96                 90
##   Male                       86                  82                 84
##         col2
## col1     Health and beauty Home and lifestyle Sports and travel
##   Female                64                 79                88
##   Male                  88                 81                78

Apparently, more male customers purchase health and beauty products than female customers. Female customers purchase more fashion accessories, food and beverage products and sports and travel products.

# Product Category against Payment Representation

mosaic(df$payment, df$product.line, "Payment", "Product Category", "Product Category against Mode of Payment Representation")
##              col2
## col1          Electronic accessories Fashion accessories Food and beverages
##   Cash                            71                  57                 57
##   Credit card                     46                  56                 61
##   Ewallet                         53                  65                 56
##              col2
## col1          Health and beauty Home and lifestyle Sports and travel
##   Cash                       49                 51                59
##   Credit card                50                 45                53
##   Ewallet                    53                 64                54

Electronic accessories are mostly paid through cash. Food and beverages are mostly paid through credit card. Home and lifestyle products are paid mainly through Ewallet.

# Product Category against Customer Type Representation

mosaic(df$customer.type, df$product.line, "Customer Type", "Product Category", "Product Category against Customer Type Representation")
##         col2
## col1     Electronic accessories Fashion accessories Food and beverages
##   Member                     78                  86                 94
##   Normal                     92                  92                 80
##         col2
## col1     Health and beauty Home and lifestyle Sports and travel
##   Member                73                 83                87
##   Normal                79                 77                79

Electronic accessories are mostly bought by non-members while sports and travel and food and beverages products are mainly bought by members.

Numerical - Numerical

# covariance matrix
cov(data_num)
##                Unit.price     Quantity          Tax        cogs gross.income
## Unit.price    701.9653313   0.83477848  196.6683401  3933.36680  196.6683401
## Quantity        0.8347785   8.54644645   24.1495704   482.99141   24.1495704
## Tax           196.6683401  24.14957038  137.0965941  2741.93188  137.0965941
## cogs         3933.3668019 482.99140761 2741.9318829 54838.63766 2741.9318829
## gross.income  196.6683401  24.14957038  137.0965941  2741.93188  137.0965941
## Rating         -0.3996675  -0.07945646   -0.7333003   -14.66601   -0.7333003
## Total        4130.0351420 507.14097799 2879.0284770 57580.56954 2879.0284770
##                    Rating       Total
## Unit.price    -0.39966752  4130.03514
## Quantity      -0.07945646   507.14098
## Tax           -0.73330028  2879.02848
## cogs         -14.66600553 57580.56954
## gross.income  -0.73330028  2879.02848
## Rating         2.95351823   -15.39931
## Total        -15.39930581 60459.59802

All variables have positive covariance towards one another except for rating which has negative covariance to all other variables.

# correlation matrix

# import ggcorrplot package for plotting correlation heatmap
library(ggcorrplot)

corr <- round(cor(data_num),2)

ggcorrplot(corr, lab=TRUE, title='Correlation Heatmap', colors=c('#E799A3','#43302E','#E799A3'))

Tax, cog, gross income and total have a correlation of 1. Rating is the only variable with a weak correlation value to other variables.

# Drop Quantity. Tax, cog, gross income and total columns. We shall only retain unit price and rating

df <- select(df,-c(quantity, tax, cogs, gross.income, total))

data_num <- select(data_num,-c(Tax, cogs, gross.income, Quantity, Total))

# preview dataset
head(df)
## # A tibble: 6 × 10
##   invoice.id  branch customer.type gender product.line     unit.price date      
##   <chr>       <fct>  <fct>         <fct>  <fct>                 <dbl> <date>    
## 1 750-67-8428 A      Member        Female Health and beau…       74.7 2019-01-04
## 2 226-31-3081 C      Normal        Female Electronic acce…       15.3 2019-03-07
## 3 631-41-3108 A      Normal        Male   Home and lifest…       46.3 2019-03-02
## 4 123-19-1176 A      Member        Male   Health and beau…       58.2 2019-01-26
## 5 373-73-7910 A      Normal        Male   Sports and trav…       86.3 2019-02-07
## 6 699-14-3026 C      Normal        Male   Electronic acce…       85.4 2019-03-24
## # … with 3 more variables: time <chr>, payment <fct>, rating <dbl>
corr <- round(cor(data_num),2)

ggcorrplot(corr, lab=TRUE, title='Correlation Heatmap', colors=c('#E799A3','#43302E','#E799A3'))

# scatterplot function

scatter <- function(column1, column2, x_label, y_label, title){
  
# Change point shapes and color by the levels of branch
ggplot(df, aes(x=column1, y=column2, shape=branch, color=branch)) +
geom_point() + labs(title=title) + xlab(x_label) + ylab(y_label)
  
}

# plot scatter for unit price against rating according to branch
scatter(df$rating, df$unit.price, 'Rating', 'Unit Price', 'Rating against Unit Price')

There is no significant pattern since rating and unit price are not correlated.

Categorical - Numerical

# Barplot function




barc <- function(column1,column2, xlabel, ylabel, title, angle){
  
  # unlist columns
  column1 <- unlist(column1)
  column2 <- unlist(column2)
  
  # since distribution are not gaussian, we shall use median in place of mean
  mean.df <- aggregate( column2 ~ column1, df, median)
  names(mean.df)[2] <- "average.unit.price"
  
  # plot
  ggplot(data=mean.df, aes(x=unlist(mean.df[1]), y=unlist(mean.df[2])))  +  geom_bar(stat="identity",fill="steelblue") +  xlab(xlabel) +ylab(ylabel) + ggtitle(title) + theme(axis.text.x = element_text(angle = angle, vjust = 1, hjust=1), plot.title = element_text(hjust=0.5))

  
}  
# Product Category against Median Unit Price

barc(df$product.line,df$unit.price, 'Product Category', 'Unit Price', 'Product Category against Median Unit Price', 45)

Sports and travel category has the highest unit prices while the electronic accessories category is the least expensive.

# Product line against rating

barc(df$product.line,df$rating, 'Product category', 'Rating', 'Product Category against Rating', 45)

Food and beverages and Health and beauty categories have the highest rating. Electronic accessories and Sports and travel accessories have the least rating. However the supermarket has a high rating overall.

# Branch against Average Unit Price
barc(df$branch,df$unit.price, 'Branch', 'Unit Price', 'Branch against Average Unit Price', 0)

Least expensive items are bought from branch A while most expensive items are bought from Branch C.

# Branch against rating

barc(df$branch, df$rating, 'Branch', 'Rating', 'Branch against Rating', 0)

Branch A received highest rating(maybe due to lower prices) while Branch B received the lowest rating.

# Customer Type against rating

barc(df$customer.type, df$rating, 'Customer Type', 'Rating', 'Customer Type against Rating', 0)

There is no significant difference in average ratings based on customer type.

# Gender against rating

barc(df$gender, df$rating, 'Gender', 'Rating', 'Gender against Rating', 0)

Ratings received are received from female than male customers.

# Payment against unit price

barc(df$payment, df$unit.price, 'Mode of payment', 'Unit Price', 'Mode of payment against Unit Price', 0)

Customers prefer paying for fairly expensive products by cash and least expensive products by credit card.

# lineplot for date against unit prices

ggplot(data=df, aes(x=date, y=unit.price, group=1)) +
  geom_line(color="#5F6A6A")+
  geom_point() +
  scale_x_date(date_labels = "%Y %b %d", date_breaks = "5 days") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5)) + xlab("Day") + ggtitle("Unit Price Trend") + ylab("Unit Price") 

Unit prices are consistent.

6. Implementing the solution

Feature Engineering

# Convert factors to numeric 

fact_to_num <- function(column){
  as.integer(column)
}
df$branch <- fact_to_num(df$branch)
df$product.line <- fact_to_num(df$product.line)
df$customer.type <- fact_to_num(df$customer.type)
df$gender <- fact_to_num(df$gender)
df$payment <- fact_to_num(df$payment)
# extract month, day of the week and hour

# weekday
df$weekday <- as.integer(wday(df$date, week_start=1)) # Monday as first day

# day of month
df$day <- as.integer(format(as.Date(df$date), "%d"))

# Month
df$month <- as.integer(format(as.Date(df$date), "%m"))

# Hour
df$hour <- as.integer(format(strptime(df$time,"%H:%M"),'%H'))
# drop unnecessary columns

# drop invoice id, date and time
df <- select(df,-c(invoice.id, date, time))

# preview dataset
head(df)
## # A tibble: 6 × 11
##   branch customer.type gender product.line unit.price payment rating weekday
##    <int>         <int>  <int>        <int>      <dbl>   <int>  <dbl>   <int>
## 1      1             1      1            4       74.7       3    9.1       5
## 2      3             2      1            1       15.3       1    9.6       4
## 3      1             2      2            5       46.3       2    7.4       6
## 4      1             1      2            4       58.2       3    8.4       6
## 5      1             2      2            6       86.3       3    5.3       4
## 6      3             2      2            1       85.4       3    4.1       7
## # … with 3 more variables: day <int>, month <int>, hour <int>

Performing PCA for dimensionality reduction

# import factoetra package

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# peform pca

df.pca <- prcomp(df, scale=TRUE)
# Print variable observations

get_pca_var(df.pca)
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
#getting the Eigenvalues to determine variability contribution of each variable

get_eigenvalue(df.pca)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   1.2359462        11.235875                    11.23587
## Dim.2   1.1419900        10.381727                    21.61760
## Dim.3   1.1095499        10.086818                    31.70442
## Dim.4   1.0585258         9.622962                    41.32738
## Dim.5   1.0355471         9.414064                    50.74145
## Dim.6   0.9878240         8.980218                    59.72167
## Dim.7   0.9536648         8.669680                    68.39135
## Dim.8   0.9266610         8.424191                    76.81554
## Dim.9   0.9073199         8.248363                    85.06390
## Dim.10  0.8569319         7.790290                    92.85419
## Dim.11  0.7860391         7.145810                   100.00000

The contribution of each feature is significant since all 10 features account for almost 93% of the variability in the dataset. However, we can decide to use 8 dimensions which account for almost 77% variability.

# Plot the principal components(Scree plot)

fviz_eig(df.pca, addlabels = T, ylim = c(0, 15))

Percentage accountability in variability difference between the first and last principal component is very small(11.2-7.8=3.4).

# visualize principal components

fviz_pca_var(df.pca, col.var = "#512E5F")

#Plotting the contribution of features in each Principal Component

# First Principal Component

fviz_contrib(df.pca,choice = "var", axes = 1)

Weekday, month and day contribute most in the first principal component.

# Second Principal Component

fviz_contrib(df.pca,choice = "var", axes = 2)

Gender, branch, payment and hour contribute most in the second principal component.

# Third Principal Component

fviz_contrib(df.pca, choice = "var", axes = 3)

Product.line, branch and hour contribute most in the third principal component.

# Fourth Principal Component

fviz_contrib(df.pca,choice = "var", axes = 4)

Fourth pc - Customer type, unit price, hour, rating and product line.

# Fifth Principal Component

fviz_contrib(df.pca,choice = "var", axes = 5)

Fifth pc - Unit price, day and payment

# Sixth Principal Component

fviz_contrib(df.pca,choice = "var", axes = 6)

Sixth pc - rating, weekday and payment.

# Seventh Principal Component

fviz_contrib(df.pca,choice = "var", axes = 7)

7th pc - Mode of payment, customer type. branch, gender and rating.

# Eighth Principal Component

fviz_contrib(df.pca,choice = "var", axes = 8)

8th pc - Branch, product category, gender, customer type, day and mode of payment.

We shall settle on 8 principal components which account for 76.82 variability in our dataset.

7. Challenging the solution

After removing highly correlated features, we remain with features whose variable contributions are very close to one another. In order to determine which of these are more important than others, we may require more data.

Number of principal components to use is dependent on the data scientist hence no proof that 8 principal components would be enough to train a machine learning model on.

8. Follow up Questions

Do we have the right data?

The data was provided by the client.

Do we have the right question?

The research question was also provided by the client.

9. Conclusion

8 principal components are able to account of 76.82% variability on the dataset. We can use these 8 principal components for modelling.

10. Recommendations

We the data science team recommend the following: