options(warn=-1)

PART I

Dimensionality Reduction with Carrefour Supermarket

1a) Defining the Question

This section of the project entails reducing the dataset to a lower-dimensional representation using the t-SNE algorithm or PCA. You will be required to perform your analysis and provide the insights gained from it.

b) Defining the Metric of Success

The project will be considered a success when we successfully reduce the dataset to a lower-dimensional representation suitable for analysis.

c) Understanding the Context

Carrefour is a French multinational retail corporation headquartered in Massy, France. The eighth-largest retailer in the world by revenue, it operates a chain of hypermarkets, grocery stores and convenience stores which, as of January 2021, comprised 12,225 stores in over 30 countries (Wikipedia).

In Kenya, Carrefour has 8 outlets mostly located in the suburbs of Kenya’s capital city, Nairobi. The retailer’s expansion into Kenya has benefited from the failure of previously-dominant supermarket chains such as Nakumatt and Uchumi as Carrefour rushed in to occupy the retail spaces and market share they vacated.

d) Recording the Experimental Design

  1. Data sourcing/loading
  2. Data Understanding
  3. Data Relevance
  4. External Dataset Validation
  5. Data Preparation
  6. Univariate Analysis
  7. Bivariate Analysis
  8. Multivariate Analysis
  9. Implementing the solution
  10. Challenging the solution
  11. Conclusion
  12. Follow up questions

e) Data Relevance

The data is valid and has been provided by Carrefour.

2. Data Sourcing

Libraries

# let's load the libraries we need
library(e1071)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(devtools)
## Loading required package: usethis
library(Rtsne)
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
library(CatEncoders)
## 
## Attaching package: 'CatEncoders'
## The following object is masked from 'package:base':
## 
##     transform
library(caret)
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(magrittr)
library(knitr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble  3.1.7     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::lift()      masks caret::lift()
## ✖ purrr::set_names() masks magrittr::set_names()
## ✖ purrr::transpose() masks data.table::transpose()
library(devtools)
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following object is masked from 'package:e1071':
## 
##     impute
## The following objects are masked from 'package:base':
## 
##     format.pval, units

a) Reading the Data

carrefour <- fread('Supermarket.csv')

b) Checking the data

# let's check the top view of the dataset
head(carrefour)
# let's check the bottom view of the dataset
tail(carrefour)
# let's check the number of rows and columns of the dataset
cat('Number of rows = ', nrow(carrefour), 'and number of columns = ', ncol(carrefour), '.')
## Number of rows =  1000 and number of columns =  16 .
# let's check the data types of the columns in the dataset
sapply(carrefour, class)
##              Invoice ID                  Branch           Customer type 
##             "character"             "character"             "character" 
##                  Gender            Product line              Unit price 
##             "character"             "character"               "numeric" 
##                Quantity                     Tax                    Date 
##               "integer"               "numeric"             "character" 
##                    Time                 Payment                    cogs 
##             "character"             "character"               "numeric" 
## gross margin percentage            gross income                  Rating 
##               "numeric"               "numeric"               "numeric" 
##                   Total 
##               "numeric"

The dataset has multiple data types, all of which are appropriate for their columns.

# let's see the structure of the dataset
str(carrefour)
## Classes 'data.table' and 'data.frame':   1000 obs. of  16 variables:
##  $ Invoice ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "A" "C" "A" "A" ...
##  $ Customer type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
##  $ Product line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross margin percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# let's get the unique values for each column
carrefour %>% summarise_all(n_distinct)

The output above shows the number of distinct values in each column.

3. External Data Validation

The data was provided by the company about the brand and is based on previous, related marketing data, so there is no need for external validation.

4. Data Preparation

a) Validation

# let's see the columns in our dataset
colnames(carrefour)
##  [1] "Invoice ID"              "Branch"                 
##  [3] "Customer type"           "Gender"                 
##  [5] "Product line"            "Unit price"             
##  [7] "Quantity"                "Tax"                    
##  [9] "Date"                    "Time"                   
## [11] "Payment"                 "cogs"                   
## [13] "gross margin percentage" "gross income"           
## [15] "Rating"                  "Total"

b) Uniformity

# let's rename the columns filling the white spaces
if (!require("stringr")) install.packages("stringr")
library("stringr")

names(carrefour) <- str_replace_all(names(carrefour), " ", "_")
names(carrefour)
##  [1] "Invoice_ID"              "Branch"                 
##  [3] "Customer_type"           "Gender"                 
##  [5] "Product_line"            "Unit_price"             
##  [7] "Quantity"                "Tax"                    
##  [9] "Date"                    "Time"                   
## [11] "Payment"                 "cogs"                   
## [13] "gross_margin_percentage" "gross_income"           
## [15] "Rating"                  "Total"

The column names are now uniform.

# checking the data types of the variables
glimpse(carrefour)
## Rows: 1,000
## Columns: 16
## $ Invoice_ID              <chr> "750-67-8428", "226-31-3081", "631-41-3108", "…
## $ Branch                  <chr> "A", "C", "A", "A", "A", "C", "A", "C", "A", "…
## $ Customer_type           <chr> "Member", "Normal", "Normal", "Member", "Norma…
## $ Gender                  <chr> "Female", "Female", "Male", "Male", "Male", "M…
## $ Product_line            <chr> "Health and beauty", "Electronic accessories",…
## $ Unit_price              <dbl> 74.69, 15.28, 46.33, 58.22, 86.31, 85.39, 68.8…
## $ Quantity                <int> 7, 5, 7, 8, 7, 7, 6, 10, 2, 3, 4, 4, 5, 10, 10…
## $ Tax                     <dbl> 26.1415, 3.8200, 16.2155, 23.2880, 30.2085, 29…
## $ Date                    <chr> "1/5/2019", "3/8/2019", "3/3/2019", "1/27/2019…
## $ Time                    <chr> "13:08", "10:29", "13:23", "20:33", "10:37", "…
## $ Payment                 <chr> "Ewallet", "Cash", "Credit card", "Ewallet", "…
## $ cogs                    <dbl> 522.83, 76.40, 324.31, 465.76, 604.17, 597.73,…
## $ gross_margin_percentage <dbl> 4.761905, 4.761905, 4.761905, 4.761905, 4.7619…
## $ gross_income            <dbl> 26.1415, 3.8200, 16.2155, 23.2880, 30.2085, 29…
## $ Rating                  <dbl> 9.1, 9.6, 7.4, 8.4, 5.3, 4.1, 5.8, 8.0, 7.2, 5…
## $ Total                   <dbl> 548.9715, 80.2200, 340.5255, 489.0480, 634.378…

The data types are valid; Date and Time are stored as character for now, and Date will be converted to the Date class later.

c) Completeness

# let's check for duplicate values
duplicates <- table(duplicated.data.frame(carrefour))
duplicates
## 
## FALSE 
##  1000

We don't have any duplicates in the dataset.

d) Missing Values

missing.values <- sum(is.na(carrefour))
cat(missing.values)
## 0

The dataset does not have any missing values.
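
A quick per-column check confirms this:

# missing values per column
colSums(is.na(carrefour))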

# let's select all numeric columns and store them in a variable
num_cols <- select_if(carrefour, is.numeric)

# let's see the shape of numerical data
dim(num_cols)
## [1] 1000    8
# let's check for outliers in the numerical columns
boxplot(num_cols, notch =  TRUE)

The boxplots indicate that the Unit_price, Tax, cogs, gross_income and Total columns have outliers. We will not impute or remove them because they represent actual values of unit price, tax and gross income from the supermarket's data.
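
To quantify what the boxplots show, we can count the values falling outside the usual 1.5 × IQR fences for each column (a small sketch; the helper function is ours):

# count values outside 1.5 * IQR of the quartiles (illustrative helper)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(num_cols, count_outliers)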

5. Exploratory Data Analysis

# let's get the descriptive statistics of the dataset
describe(carrefour)
## carrefour 
## 
##  16  Variables      1000  Observations
## --------------------------------------------------------------------------------
## Invoice_ID 
##        n  missing distinct 
##     1000        0     1000 
## 
## lowest : 101-17-6199 101-81-4070 102-06-2002 102-77-2261 105-10-6182
## highest: 894-41-5205 895-03-6665 895-66-0685 896-34-0956 898-04-2717
## --------------------------------------------------------------------------------
## Branch 
##        n  missing distinct 
##     1000        0        3 
##                             
## Value          A     B     C
## Frequency    340   332   328
## Proportion 0.340 0.332 0.328
## --------------------------------------------------------------------------------
## Customer_type 
##        n  missing distinct 
##     1000        0        2 
##                         
## Value      Member Normal
## Frequency     501    499
## Proportion  0.501  0.499
## --------------------------------------------------------------------------------
## Gender 
##        n  missing distinct 
##     1000        0        2 
##                         
## Value      Female   Male
## Frequency     501    499
## Proportion  0.501  0.499
## --------------------------------------------------------------------------------
## Product_line 
##        n  missing distinct 
##     1000        0        6 
## 
## lowest : Electronic accessories Fashion accessories    Food and beverages     Health and beauty      Home and lifestyle    
## highest: Fashion accessories    Food and beverages     Health and beauty      Home and lifestyle     Sports and travel     
##                                                                                
## Value      Electronic accessories    Fashion accessories     Food and beverages
## Frequency                     170                    178                    174
## Proportion                  0.170                  0.178                  0.174
##                                                                                
## Value           Health and beauty     Home and lifestyle      Sports and travel
## Frequency                     152                    160                    166
## Proportion                  0.152                  0.160                  0.166
## --------------------------------------------------------------------------------
## Unit_price 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      943        1    55.67     30.6    15.28    19.31 
##      .25      .50      .75      .90      .95 
##    32.88    55.23    77.94    93.12    97.22 
## 
## lowest : 10.08 10.13 10.16 10.17 10.18, highest: 99.82 99.83 99.89 99.92 99.96
## --------------------------------------------------------------------------------
## Quantity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       10     0.99     5.51     3.36        1        1 
##      .25      .50      .75      .90      .95 
##        3        5        8       10       10 
## 
## lowest :  1  2  3  4  5, highest:  6  7  8  9 10
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency    112    91    90   109   102    98   102    85    92   119
## Proportion 0.112 0.091 0.090 0.109 0.102 0.098 0.102 0.085 0.092 0.119
## --------------------------------------------------------------------------------
## Tax 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      990        1    15.38    12.89    1.956    3.243 
##      .25      .50      .75      .90      .95 
##    5.925   12.088   22.445   34.234   39.166 
## 
## lowest :  0.5085  0.6045  0.6270  0.6390  0.6990
## highest: 48.6900 48.7500 49.2600 49.4900 49.6500
## --------------------------------------------------------------------------------
## Date 
##        n  missing distinct 
##     1000        0       89 
## 
## lowest : 1/1/2019  1/10/2019 1/11/2019 1/12/2019 1/13/2019
## highest: 3/5/2019  3/6/2019  3/7/2019  3/8/2019  3/9/2019 
## --------------------------------------------------------------------------------
## Time 
##        n  missing distinct 
##     1000        0      506 
## 
## lowest : 10:00 10:01 10:02 10:03 10:04, highest: 20:52 20:54 20:55 20:57 20:59
## --------------------------------------------------------------------------------
## Payment 
##        n  missing distinct 
##     1000        0        3 
##                                               
## Value             Cash Credit card     Ewallet
## Frequency          344         311         345
## Proportion       0.344       0.311       0.345
## --------------------------------------------------------------------------------
## cogs 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      990        1    307.6    257.8    39.11    64.86 
##      .25      .50      .75      .90      .95 
##   118.50   241.76   448.91   684.68   783.33 
## 
## lowest :  10.17  12.09  12.54  12.78  13.98, highest: 973.80 975.00 985.20 989.80 993.00
## --------------------------------------------------------------------------------
## gross_margin_percentage 
##        n  missing distinct     Info     Mean      Gmd 
##     1000        0        1        0    4.762        0 
##                    
## Value      4.761905
## Frequency      1000
## Proportion        1
## --------------------------------------------------------------------------------
## gross_income 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      990        1    15.38    12.89    1.956    3.243 
##      .25      .50      .75      .90      .95 
##    5.925   12.088   22.445   34.234   39.166 
## 
## lowest :  0.5085  0.6045  0.6270  0.6390  0.6990
## highest: 48.6900 48.7500 49.2600 49.4900 49.6500
## --------------------------------------------------------------------------------
## Rating 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       61        1    6.973    1.985    4.295    4.500 
##      .25      .50      .75      .90      .95 
##    5.500    7.000    8.500    9.400    9.700 
## 
## lowest :  4.0  4.1  4.2  4.3  4.4, highest:  9.6  9.7  9.8  9.9 10.0
## --------------------------------------------------------------------------------
## Total 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      990        1      323    270.7    41.07    68.10 
##      .25      .50      .75      .90      .95 
##   124.42   253.85   471.35   718.91   822.50 
## 
## lowest :   10.6785   12.6945   13.1670   13.4190   14.6790
## highest: 1022.4900 1023.7500 1034.4600 1039.2900 1042.6500
## --------------------------------------------------------------------------------
# let's compute summary statistics for the numerical data, starting with the standard deviation
info1 <- "\n standard deviation:\n"
cat(paste(info1))
## 
##  standard deviation:
print(sapply(num_cols, sd))
##              Unit_price                Quantity                     Tax 
##               23.610564                1.457738               10.124388 
##                    cogs gross_margin_percentage            gross_income 
##              202.487764                0.000000               10.124388 
##                  Rating                   Total 
##                1.951876              212.612153
# let's find the variance of the numerical data
info2 <- "\n variance:\n"
cat(paste(info2))
## 
##  variance:
print(sapply(num_cols, var))
##              Unit_price                Quantity                     Tax 
##              557.458736                2.125000              102.503237 
##                    cogs gross_margin_percentage            gross_income 
##            41001.294686                0.000000              102.503237 
##                  Rating                   Total 
##                3.809821            45203.927391
#  let's find the mean of the numerical data
info7 <- "\n mean:\n"
cat(paste(info7))
## 
##  mean:
print(sapply(num_cols, mean))
##              Unit_price                Quantity                     Tax 
##               63.577500                7.125000               23.374000 
##                    cogs gross_margin_percentage            gross_income 
##              467.480000                4.761905               23.374000 
##                  Rating                   Total 
##                7.212500              490.854000
# let's find the quartiles
info8 <- "\n quantiles:\n"
cat(paste(info8))
## 
##  quantiles:
print(sapply(num_cols, quantile))
##      Unit_price Quantity      Tax     cogs gross_margin_percentage gross_income
## 0%      15.2800     5.00  3.82000  76.4000                4.761905      3.82000
## 25%     55.2475     6.75 19.54288 390.8575                4.761905     19.54288
## 50%     71.2000     7.00 24.71475 494.2950                4.761905     24.71475
## 75%     77.3650     7.25 29.96700 599.3400                4.761905     29.96700
## 100%    86.3100    10.00 36.78000 735.6000                4.761905     36.78000
##      Rating    Total
## 0%    4.100  80.2200
## 25%   5.675 410.4004
## 50%   7.700 519.0097
## 75%   8.575 629.3070
## 100%  9.600 772.3800
# let's find the Interquartile Range
info3 <- "interquartile:\n"
cat(paste(info3))
## interquartile:
sapply(num_cols, IQR)
##              Unit_price                Quantity                     Tax 
##                22.11750                 0.50000                10.42413 
##                    cogs gross_margin_percentage            gross_income 
##               208.48250                 0.00000                10.42413 
##                  Rating                   Total 
##                 2.90000               218.90663
# let's find the minimum values
info4 <- "\n minimum:\n"
cat(paste(info4))
## 
##  minimum:
min <- sapply(num_cols, min)
print(min)
##              Unit_price                Quantity                     Tax 
##               15.280000                5.000000                3.820000 
##                    cogs gross_margin_percentage            gross_income 
##               76.400000                4.761905                3.820000 
##                  Rating                   Total 
##                4.100000               80.220000
# let's find the maximum values
info5 <- "\n maximum:\n"
cat(paste(info5))
## 
##  maximum:
max <- sapply(num_cols, max)
print(max)
##              Unit_price                Quantity                     Tax 
##               86.310000               10.000000               36.780000 
##                    cogs gross_margin_percentage            gross_income 
##              735.600000                4.761905               36.780000 
##                  Rating                   Total 
##                9.600000              772.380000
# let's find the range
info6 <- "\n range:\n"
cat(paste(info6))
## 
##  range:
range <- max - min
print(range)
##              Unit_price                Quantity                     Tax 
##                   71.03                    5.00                   32.96 
##                    cogs gross_margin_percentage            gross_income 
##                  659.20                    0.00                   32.96 
##                  Rating                   Total 
##                    5.50                  692.16

6. Skewness and Kurtosis

# let's find the skewness and kurtosis of the data
stats <- data.frame(
  skew = apply(num_cols,2, skewness),
  Kurt = apply(num_cols,2, kurtosis))

# printing the rounded values
stat <- round(stats, 3)
stat

Summary:

Interpretation of skewness for each column:

  1. If the skewness is less than −1 or greater than +1, the distribution is highly skewed.
  2. If the skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
  3. If the skewness is between −½ and +½, the distribution is approximately symmetric.

Interpretation of kurtosis for each column:

Data that follows a mesokurtic distribution shows an excess kurtosis of zero or close to zero; this is what normally distributed data looks like. A leptokurtic distribution shows positive excess kurtosis, with heavy tails on either side indicating large outliers. A platykurtic distribution shows negative excess kurtosis, with flat, thin tails indicating few extreme values.
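
The skewness thresholds above can be applied programmatically to the table we just printed (a sketch whose cut() breakpoints mirror the three rules):

# classify each column's skewness using the thresholds above
skew_vals <- apply(num_cols, 2, skewness)
skew_class <- cut(abs(skew_vals), breaks = c(-Inf, 0.5, 1, Inf),
                  labels = c("approximately symmetric", "moderately skewed", "highly skewed"))
data.frame(skewness = round(skew_vals, 3), interpretation = skew_class)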

names(num_cols)
## [1] "Unit_price"              "Quantity"               
## [3] "Tax"                     "cogs"                   
## [5] "gross_margin_percentage" "gross_income"           
## [7] "Rating"                  "Total"
# selecting all categorical columns
char_cols <- carrefour %>% select(c(2,3,4, 5, 11))
names(char_cols)
## [1] "Branch"        "Customer_type" "Gender"        "Product_line" 
## [5] "Payment"
# let's plot the correlations between the numerical columns, coloured by each categorical variable

# loading the library
library("GGally")
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# plotting the correlation for branch
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Branch), title="Correlation Between Numerical Variables by branch") 

# plotting the correlation for customer type
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Customer_type), title="Correlation Between Numerical Variables by customer type") 

# plotting the correlation for gender
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Gender), title="Correlation Between Numerical Variables by Gender") 

# plotting the correlation for payment
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Payment), title="Correlation Between Numerical Variables by payment") 

# plotting the correlation for product line
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Product_line), title="Correlation Between Numerical Variables by Product line") 

# let's plot the categorical variables against each other, coloured by gender
ggpairs(char_cols, columns = 2:4, ggplot2::aes(colour = carrefour$Gender),
        title = "Correlation Between Categorical Variables by Gender",
        lower = list(combo = wrap("facethist", binwidth = 1, size = 2.5)))

Pearson’s Correlation

cor(num_cols, method = "pearson", use = "complete.obs")
##                           Unit_price    Quantity        Tax       cogs
## Unit_price               1.000000000  0.01077756  0.6339621  0.6339621
## Quantity                 0.010777564  1.00000000  0.7055102  0.7055102
## Tax                      0.633962089  0.70551019  1.0000000  1.0000000
## cogs                     0.633962089  0.70551019  1.0000000  1.0000000
## gross_margin_percentage           NA          NA         NA         NA
## gross_income             0.633962089  0.70551019  1.0000000  1.0000000
## Rating                  -0.008777507 -0.01581490 -0.0364417 -0.0364417
## Total                    0.633962089  0.70551019  1.0000000  1.0000000
##                         gross_margin_percentage gross_income       Rating
## Unit_price                                   NA    0.6339621 -0.008777507
## Quantity                                     NA    0.7055102 -0.015814905
## Tax                                          NA    1.0000000 -0.036441705
## cogs                                         NA    1.0000000 -0.036441705
## gross_margin_percentage                       1           NA           NA
## gross_income                                 NA    1.0000000 -0.036441705
## Rating                                       NA   -0.0364417  1.000000000
## Total                                        NA    1.0000000 -0.036441705
##                              Total
## Unit_price               0.6339621
## Quantity                 0.7055102
## Tax                      1.0000000
## cogs                     1.0000000
## gross_margin_percentage         NA
## gross_income             1.0000000
## Rating                  -0.0364417
## Total                    1.0000000

Rating is the only variable with negative (though very weak) correlations to the others. Tax, cogs, gross income and Total are perfectly correlated with one another (correlation of 1), and each of them correlates strongly with Quantity (0.71) and Unit price (0.63).
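
These perfect correlations are structural rather than coincidental: the constant gross_margin_percentage of 4.7619 (= 100/21) is consistent with Tax being a flat 5% levy on cogs, gross income equalling Tax, and Total = cogs + Tax. A quick check (a sketch; if these identities hold, the correlations of exactly 1 follow):

# verify the deterministic relationships behind the perfect correlations
all.equal(carrefour$Tax, 0.05 * carrefour$cogs)
all.equal(carrefour$gross_income, carrefour$Tax)
all.equal(carrefour$Total, carrefour$cogs + carrefour$Tax)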

# let's convert the categorical columns into numerical by encoding
data <- data.frame(carrefour)

data$Branch <- as.factor(as.integer(data$Branch))
data$Customer_type <- as.factor(as.integer(data$Customer_type))
data$Gender <- as.numeric(as.factor(as.integer(data$Gender)))
data$Product_line <- as.factor(as.integer(data$Product_line))
data$Payment <- as.factor(as.integer(data$Payment))

sapply(data, class)
##              Invoice_ID                  Branch           Customer_type 
##             "character"                "factor"                "factor" 
##                  Gender            Product_line              Unit_price 
##               "numeric"                "factor"               "numeric" 
##                Quantity                     Tax                    Date 
##               "integer"               "numeric"             "character" 
##                    Time                 Payment                    cogs 
##             "character"                "factor"               "numeric" 
## gross_margin_percentage            gross_income                  Rating 
##               "numeric"               "numeric"               "numeric" 
##                   Total 
##               "numeric"
# preview today's date in a formatted form for reference
format(Sys.Date(), "%a %b %d")
## [1] "Sun Jun 12"
# change date column to be in the date/time class
data$Date <- as.Date(data$Date, "%m/%d/%Y")

# data types of the variables after conversion

sapply(data, class)
##              Invoice_ID                  Branch           Customer_type 
##             "character"                "factor"                "factor" 
##                  Gender            Product_line              Unit_price 
##               "numeric"                "factor"               "numeric" 
##                Quantity                     Tax                    Date 
##               "integer"               "numeric"                  "Date" 
##                    Time                 Payment                    cogs 
##             "character"                "factor"               "numeric" 
## gross_margin_percentage            gross_income                  Rating 
##               "numeric"               "numeric"               "numeric" 
##                   Total 
##               "numeric"
# corrplot
corr_ <- cor(num_cols)

corrplot(corr_, method = 'color')

7. Dimensionality Reduction

Principal Component Analysis

PCA is an unsupervised, non-parametric statistical technique used for dimensionality reduction in machine learning. It projects the data onto a smaller set of orthogonal components that capture the directions of greatest variance, focusing on the variables that actually differentiate the observations.
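
Mechanically, PCA on standardized variables amounts to an eigen-decomposition of their correlation matrix. A minimal sketch (note it keeps Total, which we drop before the prcomp run below, so its proportions will differ slightly):

# PCA by hand: eigen-decomposition of the correlation matrix
keep <- setdiff(names(num_cols), "gross_margin_percentage")  # drop the constant column
eig <- eigen(cor(as.data.frame(num_cols)[, keep]))
round(eig$values / sum(eig$values), 4)  # proportion of variance per component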

# let's see the data types and the columns
sapply(num_cols, class)
##              Unit_price                Quantity                     Tax 
##               "numeric"               "integer"               "numeric" 
##                    cogs gross_margin_percentage            gross_income 
##               "numeric"               "numeric"               "numeric" 
##                  Rating                   Total 
##               "numeric"               "numeric"
# assigning the  numerical columns to a variable
df <- num_cols

# Since Total = cogs + Tax (i.e. Quantity * Unit_price plus tax), we drop the Total column
df$Total <- NULL

# drop the gross_margin_percentage column since it has only one value in all rows
df$gross_margin_percentage <- NULL

df_1 <- select_if(df, is.numeric)

#Preview the data
head(df_1)
# let's apply the PCA function
pca_apply <- prcomp(df, center = TRUE, scale. = TRUE)

# let's see a summary of the "prcomp" object

summary(pca_apply)
## Importance of components:
##                           PC1    PC2    PC3    PC4      PC5       PC6
## Standard deviation     1.9817 1.0002 0.9939 0.2909 2.51e-16 1.477e-16
## Proportion of Variance 0.6545 0.1667 0.1646 0.0141 0.00e+00 0.000e+00
## Cumulative Proportion  0.6545 0.8213 0.9859 1.0000 1.00e+00 1.000e+00

The first principal component, PC1, accounts for 65.45% of the variance in the dataset; PC2 and PC3 each account for over 16%, bringing the cumulative variance explained by the first three components to 98.59%.
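
Since factoextra is already loaded, a scree plot visualizes this breakdown directly:

# scree plot: percentage of variance explained by each component
fviz_eig(pca_apply, addlabels = TRUE)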

# let's use the str() function to view the details of the pca object
str(pca_apply)
## List of 5
##  $ sdev    : num [1:6] 1.98 1.00 9.94e-01 2.91e-01 2.51e-16 ...
##  $ rotation: num [1:6, 1:6] -0.328 -0.365 -0.503 -0.503 -0.503 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "Unit_price" "Quantity" "Tax" "cogs" ...
##   .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:6] 55.67 5.51 15.38 307.59 15.38 ...
##   ..- attr(*, "names")= chr [1:6] "Unit_price" "Quantity" "Tax" "cogs" ...
##  $ scale   : Named num [1:6] 26.49 2.92 11.71 234.18 11.71 ...
##   ..- attr(*, "names")= chr [1:6] "Unit_price" "Quantity" "Tax" "cogs" ...
##  $ x       : num [1:1000, 1:6] -1.781 2.087 -0.173 -1.343 -2.497 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"
# plotting our pca
library(ggbiplot)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
## The following object is masked from 'package:purrr':
## 
##     compact
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## Loading required package: scales
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
ggbiplot(pca_apply)

From the PCA biplot above, Tax, cogs and gross income load most heavily on PC1, with Quantity and Unit price contributing moderately.
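
This can be checked against the loadings directly:

# loadings (rotation matrix) for the first three components
round(pca_apply$rotation[, 1:3], 3)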

# let's use the square of sdev and calculate the percentage of variation each PC has.
pca_variation <- pca_apply$sdev^2
pca_var_percentage <- round(pca_variation/sum(pca_variation) * 100, 1)
barplot(pca_var_percentage, main = "Variation Plot", xlab = "PCs 1 - 6 respectively",
        ylab = "Percentage Variance", ylim = c(0, 100))

We can clearly see that PC1's bar takes almost 65% of the variation in the data, PC2 and PC3 take about 17% each, and the remaining components account for 1% or less.

# Adding more detail to the plot, we provide the row names as labels
ggbiplot(pca_apply, labels=rownames(data), obs.scale = 1, var.scale = 1)

names(df)
## [1] "Unit_price"   "Quantity"     "Tax"          "cogs"         "gross_income"
## [6] "Rating"
# let's colour the biplot points by branch and draw an ellipse per group
ggbiplot(pca_apply, ellipse = TRUE, groups = carrefour$Branch,
         obs.scale = 1, var.scale = 1)
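
The project brief allows t-SNE as an alternative to PCA, and Rtsne was loaded earlier. A hedged sketch of a 2-D t-SNE embedding of the same numeric features (the perplexity and seed are our choices; duplicates are tolerated rather than removed so that the point count matches the branch labels):

# t-SNE embedding of the numeric features (illustrative sketch)
set.seed(42)
tsne_out <- Rtsne(as.matrix(df), dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(tsne_out$Y, col = factor(carrefour$Branch), pch = 19,
     xlab = "t-SNE 1", ylab = "t-SNE 2", main = "t-SNE Projection Coloured by Branch")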

Feature Selection

a) Using Filter Methods

# let's calculate the correlation matrix
correlationMatrix <- cor(df_1)

# let's get the highly correlated attributes
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = .75)
highlyCorrelated
## [1] 3 4
names(df_1[, ..highlyCorrelated])
## [1] "Tax"  "cogs"

We need to remove the variables whose correlation exceeds the 0.75 cutoff and compare the results graphically.

# removing the features we don't need
df_2 <- df_1[, -c(3,4)]
head(df_2)
# graphical comparison of relevant attributes

par(mfrow = c(1, 2))

corrplot(correlationMatrix, order = "hclust")

corrplot(cor(df_2), order = "hclust", title="Correlation Matrix", mar=c(0,0,1,0))

The first method dropped the tax and cogs variables.

b) Embedded Methods

# a cluster plot of the first 2 principal components
library(cluster)
library(wskm)
## Loading required package: latticeExtra
## 
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
## 
##     layer
## Loading required package: fpc
set.seed(2)
model<- ewkm(df_1, 4, lambda=3, maxiter=1000)

clusplot(df_1, model$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=1,main='Cluster Analysis supermarket')

# let's calculate the weights, which give the relative importance of each variable
round(model$weights*100,2)
##   Unit_price Quantity Tax cogs gross_income Rating
## 1          0     0.00  50    0           50   0.00
## 2          0     0.00   0    0            0  99.99
## 3          0     0.00  50    0           50   0.00
## 4          0    99.99   0    0            0   0.00

The cluster plot shows that the first two principal components account for 82.13% of the point variability in the data.

The cluster weights show that in clusters 1 and 3 the most important features are tax and gross income. In cluster 2, rating is the most important feature with a weight of 99.99%, while in cluster 4 quantity is the most important feature, also at 99.99%.
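
The dominant feature per cluster can also be extracted programmatically:

# name the highest-weight feature in each cluster
apply(model$weights, 1, function(w) colnames(model$weights)[which.max(w)])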

PART II

Association Analysis

1) Data Loading

# Loading the arules library
#
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
# Loading our transactions dataset from our csv file
# ---
# We will use the read.transactions function which will load data from comma-separated files 
# and convert them to the class transactions, which is the kind of data that 
# we will require while working with models of association rules
# ---
# 
path <-"Supermarket_Sales.csv"

Transactions<-read.transactions(path, sep = ",")
Transactions
## transactions in sparse format with
##  7501 transactions (rows) and
##  119 items (columns)
# Verifying the object's class
# ---
# This should show us transactions as the type of data that we will need
# ---
# 
class(Transactions)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
# Previewing our first 5 transactions
#
inspect(Transactions[1:5])
##     items               
## [1] {almonds,           
##      antioxydant juice, 
##      avocado,           
##      cottage cheese,    
##      energy drink,      
##      frozen smoothie,   
##      green grapes,      
##      green tea,         
##      honey,             
##      low fat yogurt,    
##      mineral water,     
##      olive oil,         
##      salad,             
##      salmon,            
##      shrimp,            
##      spinach,           
##      tomato juice,      
##      vegetables mix,    
##      whole weat flour,  
##      yams}              
## [2] {burgers,           
##      eggs,              
##      meatballs}         
## [3] {chutney}           
## [4] {avocado,           
##      turkey}            
## [5] {energy bar,        
##      green tea,         
##      milk,              
##      mineral water,     
##      whole wheat rice}
# If we wanted to preview the items that make up our dataset,
# alternatively we can do the following
# ---
# 
items<-as.data.frame(itemLabels(Transactions))
colnames(items) <- "Item"
head(items, 10)    
# Generating a summary of the transaction dataset
# ---
# This would give us some information such as the most purchased items, 
# distribution of the item sets (no. of items purchased in each transaction), etc.
# ---
# 
summary(Transactions)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus
# Exploring the frequency of some items 
# i.e. items 8 to 10, and performing 
# some operations in percentage terms of the total transactions 
# 
itemFrequency(Transactions[, 8:10],type = "absolute")
##   black tea blueberries  body spray 
##         107          69          86
round(itemFrequency(Transactions[, 8:10],type = "relative")*100,2)
##   black tea blueberries  body spray 
##        1.43        0.92        1.15
# Producing a chart of frequencies and filtering 
# to consider only items with a minimum percentage 
# of support/ considering a top x of items
# ---
# Displaying top 10 most common items in the transactions dataset 
# and the items whose relative importance is at least 10%
# 
par(mfrow = c(1, 2))

# plot the frequency of items
itemFrequencyPlot(Transactions, topN = 10,col="darkgreen")
itemFrequencyPlot(Transactions, support = 0.1,col="darkred")

# Building a model based on association rules 
# using the apriori function 
# ---
# We use Min Support as 0.001 and confidence as 0.8
# ---
# 
rules <- apriori (Transactions, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules
## set of 74 rules
# We use measures of significance and interest on the rules, 
# determining which ones are interesting and which to discard.
# ---
# However since we built the model using 0.001 Min support 
# and confidence as 0.8 we obtained 74 rules.
# However, in order to illustrate the sensitivity of the model to these two parameters, 
# we will see what happens if we increase the support or lower the confidence level
# 

# Building an apriori model with Min Support as 0.002 and confidence as 0.8.
rules2 <- apriori (Transactions,parameter = list(supp = 0.002, conf = 0.8)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 15 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Building an apriori model with Min Support as 0.001 and confidence as 0.6.
rules3 <- apriori (Transactions, parameter = list(supp = 0.001, conf = 0.6)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [545 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules2
## set of 2 rules
rules3
## set of 545 rules

In our first example, we increased the minimum support from 0.001 to 0.002 and the number of rules fell from 74 to only 2. This would lead us to understand that using a high level of support can make the model lose interesting rules. In the second example, we decreased the minimum confidence level to 0.6 and the number of rules went from 74 to 545. This would mean that using a low confidence level increases the number of rules considerably, and many of them will not be useful.
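
To make this sensitivity explicit, the two parameters can be swept over a small grid and the resulting rule counts tabulated (a sketch; the grid values are our choice):

# count the rules produced for each support/confidence combination
grid <- expand.grid(supp = c(0.001, 0.002), conf = c(0.6, 0.8))
grid$n_rules <- mapply(function(s, cf)
  length(apriori(Transactions, parameter = list(supp = s, conf = cf),
                 control = list(verbose = FALSE))),
  grid$supp, grid$conf)
grid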

# We can perform an exploration of our model 
# through the use of the summary function as shown
# ---
# Upon running the code, the function would give us information about the model 
# i.e. the size of rules, depending on the items that contain these rules. 
# In our above case, most rules have 3 or 4 items, though some rules do have up to 6. 
# More statistical information such as support, lift and confidence is also provided.
# ---
# 
summary(rules)
## set of 74 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
## 15 42 16  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.041   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001067   Min.   :0.8000   Min.   :0.001067   Min.   : 3.356  
##  1st Qu.:0.001067   1st Qu.:0.8000   1st Qu.:0.001333   1st Qu.: 3.432  
##  Median :0.001133   Median :0.8333   Median :0.001333   Median : 3.795  
##  Mean   :0.001256   Mean   :0.8504   Mean   :0.001479   Mean   : 4.823  
##  3rd Qu.:0.001333   3rd Qu.:0.8889   3rd Qu.:0.001600   3rd Qu.: 4.877  
##  Max.   :0.002533   Max.   :1.0000   Max.   :0.002666   Max.   :12.722  
##      count       
##  Min.   : 8.000  
##  1st Qu.: 8.000  
##  Median : 8.500  
##  Mean   : 9.419  
##  3rd Qu.:10.000  
##  Max.   :19.000  
## 
## mining info:
##          data ntransactions support confidence
##  Transactions          7501   0.001        0.8
##                                                                      call
##  apriori(data = Transactions, parameter = list(supp = 0.001, conf = 0.8))
# Observing rules built in our model i.e. first 5 model rules
# ---
# 
inspect(rules[1:5])
##     lhs                              rhs             support     confidence
## [1] {frozen smoothie, spinach}    => {mineral water} 0.001066524 0.8888889 
## [2] {bacon, pancakes}             => {spaghetti}     0.001733102 0.8125000 
## [3] {nonfat milk, turkey}         => {mineral water} 0.001199840 0.8181818 
## [4] {ground beef, nonfat milk}    => {mineral water} 0.001599787 0.8571429 
## [5] {mushroom cream sauce, pasta} => {escalope}      0.002532996 0.9500000 
##     coverage    lift      count
## [1] 0.001199840  3.729058  8   
## [2] 0.002133049  4.666587 13   
## [3] 0.001466471  3.432428  9   
## [4] 0.001866418  3.595877 12   
## [5] 0.002666311 11.976387 19
# Interpretation of the first rule:
# ---
# If someone buys a frozen smoothie and spinach, they are about 89% likely to buy mineral water too
# ---
# Ordering these rules by a criteria such as the level of confidence
# then looking at the first five rules.
# We can also use different criteria such as: (by = "lift" or by = "support")
# 
rules<-sort(rules, by="confidence", decreasing=TRUE)
inspect(rules[1:5])
##     lhs                        rhs                 support confidence    coverage      lift count
## [1] {french fries,                                                                               
##      mushroom cream sauce,                                                                       
##      pasta}                 => {escalope}      0.001066524       1.00 0.001066524 12.606723     8
## [2] {ground beef,                                                                                
##      light cream,                                                                                
##      olive oil}             => {mineral water} 0.001199840       1.00 0.001199840  4.195190     9
## [3] {cake,                                                                                       
##      meatballs,                                                                                  
##      mineral water}         => {milk}          0.001066524       1.00 0.001066524  7.717078     8
## [4] {cake,                                                                                       
##      olive oil,                                                                                  
##      shrimp}                => {mineral water} 0.001199840       1.00 0.001199840  4.195190     9
## [5] {mushroom cream sauce,                                                                       
##      pasta}                 => {escalope}      0.002532996       0.95 0.002666311 11.976387    19
# Interpretation
# ---
# The first four rules have a confidence of 100%; the fifth has 95%
# ---
# If we're interested in making a promotion relating to the sale of mineral water, 
# we could create a subset of rules concerning these products 
# ---
# This would tell us the items that the customers bought before purchasing mineral water
# ---
# 
mineral_water <- subset(rules, subset = rhs %pin% "mineral water")
 
# Then order by confidence
mineral_water <- sort(mineral_water, by = "confidence", decreasing = TRUE)
inspect(mineral_water[1:5])
##     lhs                     rhs                 support confidence    coverage     lift count
## [1] {ground beef,                                                                            
##      light cream,                                                                            
##      olive oil}          => {mineral water} 0.001199840  1.0000000 0.001199840 4.195190     9
## [2] {cake,                                                                                   
##      olive oil,                                                                              
##      shrimp}             => {mineral water} 0.001199840  1.0000000 0.001199840 4.195190     9
## [3] {red wine,                                                                               
##      soup}               => {mineral water} 0.001866418  0.9333333 0.001999733 3.915511    14
## [4] {ground beef,                                                                            
##      pancakes,                                                                               
##      whole wheat rice}   => {mineral water} 0.001333156  0.9090909 0.001466471 3.813809    10
## [5] {frozen vegetables,                                                                      
##      milk,                                                                                   
##      spaghetti,                                                                              
##      turkey}             => {mineral water} 0.001199840  0.9000000 0.001333156 3.775671     9
# What if we wanted to determine items that customers might buy 
# who have previously bought mineral water?
# ---
# 
# Subset the rules
mineral_water <- subset(rules, subset = lhs %pin% "mineral water")

# Order by confidence
mineral_water <- sort(mineral_water, by = "confidence", decreasing = TRUE)

# inspect top 5
inspect(mineral_water[1:5])
##     lhs                                     rhs           support    
## [1] {cake, meatballs, mineral water}     => {milk}        0.001066524
## [2] {eggs, mineral water, pasta}         => {shrimp}      0.001333156
## [3] {herb & pepper, mineral water, rice} => {ground beef} 0.001333156
## [4] {light cream, mineral water, shrimp} => {spaghetti}   0.001066524
## [5] {grated cheese, mineral water, rice} => {ground beef} 0.001066524
##     confidence coverage    lift      count
## [1] 1.0000000  0.001066524  7.717078  8   
## [2] 0.9090909  0.001466471 12.722185 10   
## [3] 0.9090909  0.001466471  9.252498 10   
## [4] 0.8888889  0.001199840  5.105326  8   
## [5] 0.8888889  0.001199840  9.046887  8

PART III

Anomaly Detection

# Installing anomalize package
# ---
if (!require("anomalize")) install.packages("anomalize")
## Loading required package: anomalize
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library("anomalize")
library(tidyverse)
library(anomalize)
# Loading the dataset
ts <- read.csv("Supermarket_Sales_Forecasting.csv", header = TRUE, sep = ",")
head(ts)
# we have to convert the data frame into a tibble object and parse the Date column
tib_ts <- ts %>% rownames_to_column() %>% as_tibble() %>% 
  mutate(date = as.Date(Date, "%m/%d/%Y")) %>% select(-one_of('Date'))

# preview the tibble
head(tib_ts)
library(tibbletime)
## 
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
## 
##     filter
x <- as_tbl_time(tib_ts, index = date)

x %>% 
  as_period("daily")%>% 
  # let's perform time series decomposition: the data is split into seasonal, trend and remainder components using the time_decompose() function.
  
  time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
  
  # anomalize can detect and flag anomalies in the decomposed data 
  
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  
  # Visualize the outcome.
  
  plot_anomaly_decomposition()
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 5 weeks
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 47 weeks
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

# Anomaly Detection
x %>% 
  as_period("daily")%>% 
  time_decompose(Sales) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 5 weeks
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 47 weeks

The anomalies detected lie on the upper bound close to the month of February, but they are insignificant.
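
To list the flagged observations rather than only plotting them, the same pipeline can be filtered on the anomaly flag (a sketch):

# extract the observations flagged as anomalies
x %>% 
  as_period("daily") %>% 
  time_decompose(Sales) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  filter(anomaly == "Yes")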

Conclusion

It is evident that mineral water is the top-selling product: several item combinations (for example ground beef, light cream and olive oil) led to a mineral water purchase with 100% confidence, and customers who bought mineral water alongside eggs and pasta went on to buy shrimp 90.9% of the time, with a similar confidence for ground beef.

The combination of PCA and feature selection techniques showed that the most important features were gross income and unit price.

Branch A is more popular with men while branch C is more popular with women. The top-selling categories were fashion accessories and electronic accessories.

There were no significant anomalies during the period under investigation.