options(warn=-1)
This section of the project entails reducing the dataset to a low-dimensional representation using the t-SNE algorithm or PCA, performing the analysis and reporting the insights gained.
The project will be considered a success when the dataset is successfully reduced to a low-dimensional form suitable for analysis.
Carrefour is a French multinational retail corporation headquartered in Massy, France. The eighth-largest retailer in the world by revenue, it operates a chain of hypermarkets, groceries and convenience stores which, as of January 2021, comprised 12,225 stores in over 30 countries (Wikipedia).
In Kenya, Carrefour has 8 outlets mostly located in the suburbs of Kenya’s capital city, Nairobi. The retailer’s expansion into Kenya has benefited from the failure of previously-dominant supermarket chains such as Nakumatt and Uchumi as Carrefour rushed in to occupy the retail spaces and market share they vacated.
The data is valid and was provided by Carrefour.
# let's install the libraries we need
library(e1071)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(devtools)
## Loading required package: usethis
library(Rtsne)
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
library(CatEncoders)
##
## Attaching package: 'CatEncoders'
## The following object is masked from 'package:base':
##
## transform
library(caret)
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(magrittr)
library(knitr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.7 ✔ stringr 1.4.0
## ✔ tidyr 1.2.0 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ tidyr::extract() masks magrittr::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ purrr::set_names() masks magrittr::set_names()
## ✖ purrr::transpose() masks data.table::transpose()
library(devtools)
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following object is masked from 'package:e1071':
##
## impute
## The following objects are masked from 'package:base':
##
## format.pval, units
carrefour <- fread('Supermarket.csv')
# let's check the top view of the dataset
head(carrefour)
# let's check the bottom view of the dataset
tail(carrefour)
# let's check the number of rows and columns of the dataset
cat('Number of rows = ', nrow(carrefour), 'and number of columns = ', ncol(carrefour), '.')
## Number of rows = 1000 and number of columns = 16 .
# let's check the data types of the columns in the dataset
sapply(carrefour, class)
## Invoice ID Branch Customer type
## "character" "character" "character"
## Gender Product line Unit price
## "character" "character" "numeric"
## Quantity Tax Date
## "integer" "numeric" "character"
## Time Payment cogs
## "character" "character" "numeric"
## gross margin percentage gross income Rating
## "numeric" "numeric" "numeric"
## Total
## "numeric"
The dataset contains multiple data types, and they are correct for their respective columns.
# let's see the structure of the dataset
str(carrefour)
## Classes 'data.table' and 'data.frame': 1000 obs. of 16 variables:
## $ Invoice ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ Customer type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross margin percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
## - attr(*, ".internal.selfref")=<externalptr>
# let's get the unique values for each column
carrefour %>% summarise_all(n_distinct)
The output above shows the number of unique values in each column.
The data was provided by the company and is based on previous, related marketing data, so there is no need for external validation.
# let's see the columns in our dataset
colnames(carrefour)
## [1] "Invoice ID" "Branch"
## [3] "Customer type" "Gender"
## [5] "Product line" "Unit price"
## [7] "Quantity" "Tax"
## [9] "Date" "Time"
## [11] "Payment" "cogs"
## [13] "gross margin percentage" "gross income"
## [15] "Rating" "Total"
# let's rename the columns by replacing the white spaces with underscores
if (!require("stringr")) install.packages("stringr")
library("stringr")
names(carrefour) <- str_replace_all(names(carrefour), " ", "_")
names(carrefour)
## [1] "Invoice_ID" "Branch"
## [3] "Customer_type" "Gender"
## [5] "Product_line" "Unit_price"
## [7] "Quantity" "Tax"
## [9] "Date" "Time"
## [11] "Payment" "cogs"
## [13] "gross_margin_percentage" "gross_income"
## [15] "Rating" "Total"
The column names are now uniform.
# checking the data types of the variables
glimpse(carrefour)
## Rows: 1,000
## Columns: 16
## $ Invoice_ID <chr> "750-67-8428", "226-31-3081", "631-41-3108", "…
## $ Branch <chr> "A", "C", "A", "A", "A", "C", "A", "C", "A", "…
## $ Customer_type <chr> "Member", "Normal", "Normal", "Member", "Norma…
## $ Gender <chr> "Female", "Female", "Male", "Male", "Male", "M…
## $ Product_line <chr> "Health and beauty", "Electronic accessories",…
## $ Unit_price <dbl> 74.69, 15.28, 46.33, 58.22, 86.31, 85.39, 68.8…
## $ Quantity <int> 7, 5, 7, 8, 7, 7, 6, 10, 2, 3, 4, 4, 5, 10, 10…
## $ Tax <dbl> 26.1415, 3.8200, 16.2155, 23.2880, 30.2085, 29…
## $ Date <chr> "1/5/2019", "3/8/2019", "3/3/2019", "1/27/2019…
## $ Time <chr> "13:08", "10:29", "13:23", "20:33", "10:37", "…
## $ Payment <chr> "Ewallet", "Cash", "Credit card", "Ewallet", "…
## $ cogs <dbl> 522.83, 76.40, 324.31, 465.76, 604.17, 597.73,…
## $ gross_margin_percentage <dbl> 4.761905, 4.761905, 4.761905, 4.761905, 4.7619…
## $ gross_income <dbl> 26.1415, 3.8200, 16.2155, 23.2880, 30.2085, 29…
## $ Rating <dbl> 9.1, 9.6, 7.4, 8.4, 5.3, 4.1, 5.8, 8.0, 7.2, 5…
## $ Total <dbl> 548.9715, 80.2200, 340.5255, 489.0480, 634.378…
The data types are valid and correct
# let's check for duplicate values
duplicates <- table(duplicated.data.frame(carrefour))
duplicates
##
## FALSE
## 1000
There are no duplicate records in the dataset.
missing.values <- sum(is.na(carrefour))
cat(missing.values)
## 0
The dataset does not have any missing values.
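Since VIM was loaded earlier but not used, a quick visual confirmation of the missingness pattern could be added. This is a minimal sketch rather than part of the original workflow; with no NAs present, every bar should show a count of zero.
# hedged sketch: visualise the missing-value pattern with VIM (all counts should be zero here)
aggr(carrefour, numbers = TRUE, prop = FALSE)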
# let's select all numeric columns and store them in a variable
num_cols <- select_if(carrefour, is.numeric)
# let's see the shape of numerical data
dim(num_cols)
## [1] 1000 8
# let's check for outliers in the numerical columns
boxplot(num_cols, notch = TRUE)
The boxplots indicate that the Tax, cogs, gross_income and Total columns contain outliers. We will not remove or impute them because they represent actual tax, cost-of-goods and gross-income values from the supermarket's records; a quick count per column is sketched below.
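To make the boxplot reading concrete, here is a minimal sketch counting outliers per numeric column; the standard 1.5 * IQR rule is our own assumption about what counts as an outlier.
# hedged sketch: count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column
sapply(num_cols, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})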
# let's get the descriptive statistics of the dataset
describe(carrefour)
## carrefour
##
## 16 Variables 1000 Observations
## --------------------------------------------------------------------------------
## Invoice_ID
## n missing distinct
## 1000 0 1000
##
## lowest : 101-17-6199 101-81-4070 102-06-2002 102-77-2261 105-10-6182
## highest: 894-41-5205 895-03-6665 895-66-0685 896-34-0956 898-04-2717
## --------------------------------------------------------------------------------
## Branch
## n missing distinct
## 1000 0 3
##
## Value A B C
## Frequency 340 332 328
## Proportion 0.340 0.332 0.328
## --------------------------------------------------------------------------------
## Customer_type
## n missing distinct
## 1000 0 2
##
## Value Member Normal
## Frequency 501 499
## Proportion 0.501 0.499
## --------------------------------------------------------------------------------
## Gender
## n missing distinct
## 1000 0 2
##
## Value Female Male
## Frequency 501 499
## Proportion 0.501 0.499
## --------------------------------------------------------------------------------
## Product_line
## n missing distinct
## 1000 0 6
##
## lowest : Electronic accessories Fashion accessories Food and beverages Health and beauty Home and lifestyle
## highest: Fashion accessories Food and beverages Health and beauty Home and lifestyle Sports and travel
##
## Value Electronic accessories Fashion accessories Food and beverages
## Frequency 170 178 174
## Proportion 0.170 0.178 0.174
##
## Value Health and beauty Home and lifestyle Sports and travel
## Frequency 152 160 166
## Proportion 0.152 0.160 0.166
## --------------------------------------------------------------------------------
## Unit_price
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 943 1 55.67 30.6 15.28 19.31
## .25 .50 .75 .90 .95
## 32.88 55.23 77.94 93.12 97.22
##
## lowest : 10.08 10.13 10.16 10.17 10.18, highest: 99.82 99.83 99.89 99.92 99.96
## --------------------------------------------------------------------------------
## Quantity
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 10 0.99 5.51 3.36 1 1
## .25 .50 .75 .90 .95
## 3 5 8 10 10
##
## lowest : 1 2 3 4 5, highest: 6 7 8 9 10
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 112 91 90 109 102 98 102 85 92 119
## Proportion 0.112 0.091 0.090 0.109 0.102 0.098 0.102 0.085 0.092 0.119
## --------------------------------------------------------------------------------
## Tax
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 990 1 15.38 12.89 1.956 3.243
## .25 .50 .75 .90 .95
## 5.925 12.088 22.445 34.234 39.166
##
## lowest : 0.5085 0.6045 0.6270 0.6390 0.6990
## highest: 48.6900 48.7500 49.2600 49.4900 49.6500
## --------------------------------------------------------------------------------
## Date
## n missing distinct
## 1000 0 89
##
## lowest : 1/1/2019 1/10/2019 1/11/2019 1/12/2019 1/13/2019
## highest: 3/5/2019 3/6/2019 3/7/2019 3/8/2019 3/9/2019
## --------------------------------------------------------------------------------
## Time
## n missing distinct
## 1000 0 506
##
## lowest : 10:00 10:01 10:02 10:03 10:04, highest: 20:52 20:54 20:55 20:57 20:59
## --------------------------------------------------------------------------------
## Payment
## n missing distinct
## 1000 0 3
##
## Value Cash Credit card Ewallet
## Frequency 344 311 345
## Proportion 0.344 0.311 0.345
## --------------------------------------------------------------------------------
## cogs
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 990 1 307.6 257.8 39.11 64.86
## .25 .50 .75 .90 .95
## 118.50 241.76 448.91 684.68 783.33
##
## lowest : 10.17 12.09 12.54 12.78 13.98, highest: 973.80 975.00 985.20 989.80 993.00
## --------------------------------------------------------------------------------
## gross_margin_percentage
## n missing distinct Info Mean Gmd
## 1000 0 1 0 4.762 0
##
## Value 4.761905
## Frequency 1000
## Proportion 1
## --------------------------------------------------------------------------------
## gross_income
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 990 1 15.38 12.89 1.956 3.243
## .25 .50 .75 .90 .95
## 5.925 12.088 22.445 34.234 39.166
##
## lowest : 0.5085 0.6045 0.6270 0.6390 0.6990
## highest: 48.6900 48.7500 49.2600 49.4900 49.6500
## --------------------------------------------------------------------------------
## Rating
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 61 1 6.973 1.985 4.295 4.500
## .25 .50 .75 .90 .95
## 5.500 7.000 8.500 9.400 9.700
##
## lowest : 4.0 4.1 4.2 4.3 4.4, highest: 9.6 9.7 9.8 9.9 10.0
## --------------------------------------------------------------------------------
## Total
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 990 1 323 270.7 41.07 68.10
## .25 .50 .75 .90 .95
## 124.42 253.85 471.35 718.91 822.50
##
## lowest : 10.6785 12.6945 13.1670 13.4190 14.6790
## highest: 1022.4900 1023.7500 1034.4600 1039.2900 1042.6500
## --------------------------------------------------------------------------------
# let's compute summary statistics for the numerical data
info1 <- "\n standard deviation:\n"
cat(paste(info1))
##
## standard deviation:
print(sapply(num_cols[1:8], sd))
## Unit_price Quantity Tax
## 23.610564 1.457738 10.124388
## cogs gross_margin_percentage gross_income
## 202.487764 0.000000 10.124388
## Rating Total
## 1.951876 212.612153
# let's find the variance of the numerical data
info2 <- "\n variance:\n"
cat(paste(info2))
##
## variance:
print(sapply(num_cols[1:8], var))
## Unit_price Quantity Tax
## 557.458736 2.125000 102.503237
## cogs gross_margin_percentage gross_income
## 41001.294686 0.000000 102.503237
## Rating Total
## 3.809821 45203.927391
# let's find the mean of the numerical data
info7 <- "\n mean:\n"
cat(paste(info7))
##
## mean:
print(sapply(num_cols[1:8], mean))
## Unit_price Quantity Tax
## 63.577500 7.125000 23.374000
## cogs gross_margin_percentage gross_income
## 467.480000 4.761905 23.374000
## Rating Total
## 7.212500 490.854000
# let's find the quartiles
info8 <- "\n quantiles:\n"
cat(paste(info8))
##
## quantiles:
print(sapply(num_cols[1:8], quantile))
## Unit_price Quantity Tax cogs gross_margin_percentage gross_income
## 0% 15.2800 5.00 3.82000 76.4000 4.761905 3.82000
## 25% 55.2475 6.75 19.54288 390.8575 4.761905 19.54288
## 50% 71.2000 7.00 24.71475 494.2950 4.761905 24.71475
## 75% 77.3650 7.25 29.96700 599.3400 4.761905 29.96700
## 100% 86.3100 10.00 36.78000 735.6000 4.761905 36.78000
## Rating Total
## 0% 4.100 80.2200
## 25% 5.675 410.4004
## 50% 7.700 519.0097
## 75% 8.575 629.3070
## 100% 9.600 772.3800
# let's find the Interquartile Range
info3 <- "interquartile:\n"
cat(paste(info3))
## interquartile:
sapply(num_cols[1:8], IQR)
## Unit_price Quantity Tax
## 22.11750 0.50000 10.42413
## cogs gross_margin_percentage gross_income
## 208.48250 0.00000 10.42413
## Rating Total
## 2.90000 218.90663
# let's find the minimum values
info4 <- "\n minimum:\n"
cat(paste(info4))
##
## minimum:
min <- sapply(num_cols[1:8], min)
print(min)
## Unit_price Quantity Tax
## 15.280000 5.000000 3.820000
## cogs gross_margin_percentage gross_income
## 76.400000 4.761905 3.820000
## Rating Total
## 4.100000 80.220000
# let's find the maximum values
info5 <- "\n maximum:\n"
cat(paste(info5))
##
## maximum:
max <- sapply(num_cols[1:8], max)
print(max)
## Unit_price Quantity Tax
## 86.310000 10.000000 36.780000
## cogs gross_margin_percentage gross_income
## 735.600000 4.761905 36.780000
## Rating Total
## 9.600000 772.380000
# let's find the range
info6 <- "\n range:\n"
cat(paste(info6))
##
## range:
range = max - min
print(range)
## Unit_price Quantity Tax
## 71.03 5.00 32.96
## cogs gross_margin_percentage gross_income
## 659.20 0.00 32.96
## Rating Total
## 5.50 692.16
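One caveat worth flagging: because num_cols is a data.table, num_cols[1:8] subsets the first eight rows rather than selecting columns, so the standard deviations, variances, means, quantiles, ranges and IQRs printed above describe only the first eight transactions. A minimal sketch of the column-wise summaries over all 1,000 rows, assuming num_cols is unchanged:
# hedged sketch: summarise every numeric column over all rows (sapply iterates over columns)
summary_stats <- data.frame(
  mean = sapply(num_cols, mean),
  sd   = sapply(num_cols, sd),
  var  = sapply(num_cols, var),
  IQR  = sapply(num_cols, IQR)
)
round(summary_stats, 3)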
# let's find the skewness and kurtosis of the data
stats <- data.frame(
skew = apply(num_cols,2, skewness),
Kurt = apply(num_cols,2, kurtosis))
# printing the rounded values
stat <- round(stats, 3)
stat
Data that follows a mesokurtic distribution has an excess kurtosis of zero or close to zero; this is what normally distributed data looks like. A leptokurtic distribution has positive excess kurtosis, with heavy tails on either side that indicate large outliers. A platykurtic distribution has negative excess kurtosis; its flat, light tails indicate only small outliers. The sketch below classifies each numeric column accordingly.
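A minimal sketch, assuming e1071's kurtosis() (which returns excess kurtosis) and a rough threshold of +/- 0.5 of our own choosing for "close to zero":
# hedged sketch: label each column's distribution shape from its excess kurtosis
kurt_vals <- apply(num_cols, 2, kurtosis)
data.frame(
  kurtosis = round(kurt_vals, 3),
  shape = ifelse(abs(kurt_vals) < 0.5, "mesokurtic (near-normal tails)",
          ifelse(kurt_vals > 0, "leptokurtic (heavy tails)", "platykurtic (light tails)"))
)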
names(num_cols)
## [1] "Unit_price" "Quantity"
## [3] "Tax" "cogs"
## [5] "gross_margin_percentage" "gross_income"
## [7] "Rating" "Total"
# selecting all categorical columns
char_cols <- carrefour %>% select(c(2,3,4, 5, 11))
names(char_cols)
## [1] "Branch" "Customer_type" "Gender" "Product_line"
## [5] "Payment"
# let's plot pairwise relationships of selected numerical columns, coloured by each categorical variable
# loading the library
library("GGally")
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# plotting the correlation for branch
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Branch), title="Correlation Between Numerical Variables by branch")
# plotting the correlation for customer type
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Customer_type), title="Correlation Between Numerical Variables by customer type")
# plotting the correlation for gender
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Gender), title="Correlation Between Numerical Variables by Gender")
# plotting the correlation for payment
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Payment), title="Correlation Between Numerical Variables by payment")
# plotting the correlation for product line
ggpairs(num_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Product_line), title="Correlation Between Numerical Variables by Product line")
# let's plot pairwise relationships of selected categorical columns, coloured by Gender
ggpairs(char_cols, columns = 2:4, ggplot2::aes(colour=carrefour$Gender), title="Correlation Between Categorical Variables by Gender", lower =list(combo=wrap("facethist", binwidth=1, size=2.5)) )
cor(num_cols, method = "pearson", use = "complete.obs")
## Unit_price Quantity Tax cogs
## Unit_price 1.000000000 0.01077756 0.6339621 0.6339621
## Quantity 0.010777564 1.00000000 0.7055102 0.7055102
## Tax 0.633962089 0.70551019 1.0000000 1.0000000
## cogs 0.633962089 0.70551019 1.0000000 1.0000000
## gross_margin_percentage NA NA NA NA
## gross_income 0.633962089 0.70551019 1.0000000 1.0000000
## Rating -0.008777507 -0.01581490 -0.0364417 -0.0364417
## Total 0.633962089 0.70551019 1.0000000 1.0000000
## gross_margin_percentage gross_income Rating
## Unit_price NA 0.6339621 -0.008777507
## Quantity NA 0.7055102 -0.015814905
## Tax NA 1.0000000 -0.036441705
## cogs NA 1.0000000 -0.036441705
## gross_margin_percentage 1 NA NA
## gross_income NA 1.0000000 -0.036441705
## Rating NA -0.0364417 1.000000000
## Total NA 1.0000000 -0.036441705
## Total
## Unit_price 0.6339621
## Quantity 0.7055102
## Tax 1.0000000
## cogs 1.0000000
## gross_margin_percentage NA
## gross_income 1.0000000
## Rating -0.0364417
## Total 1.0000000
Rating is the only variable with negative correlations, and they are weak: an increase in Rating is barely associated with any change in the other variables. Tax, cogs, gross_income and Total are perfectly correlated with one another, and each is strongly correlated with Unit_price (0.63) and Quantity (0.71); a quick check of the underlying identities is sketched below.
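The perfect correlations are explained by accounting identities that appear to hold in the data preview; verifying them is an assumption check rather than a documented fact about the dataset.
# hedged sketch: check the identities suggested by the perfect correlations
all.equal(carrefour$Total, carrefour$cogs + carrefour$Tax)   # Total = cogs + Tax
all.equal(carrefour$gross_income, carrefour$Tax)             # gross income equals the tax amount
all.equal(carrefour$Tax, 0.05 * carrefour$cogs)              # Tax appears to be 5% of cogs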
# let's convert the categorical columns into numerical codes by encoding
# (note: as.integer() on a character vector yields NA, so we factor first)
data <- data.frame(carrefour)
data$Branch <- as.factor(as.integer(as.factor(data$Branch)))
data$Customer_type <- as.factor(as.integer(as.factor(data$Customer_type)))
data$Gender <- as.numeric(as.factor(data$Gender))
data$Product_line <- as.factor(as.integer(as.factor(data$Product_line)))
data$Payment <- as.factor(as.integer(as.factor(data$Payment)))
sapply(data, class)
## Invoice_ID Branch Customer_type
## "character" "factor" "factor"
## Gender Product_line Unit_price
## "numeric" "factor" "numeric"
## Quantity Tax Date
## "integer" "numeric" "character"
## Time Payment cogs
## "character" "factor" "numeric"
## gross_margin_percentage gross_income Rating
## "numeric" "numeric" "numeric"
## Total
## "numeric"
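CatEncoders was loaded at the start but never used; the same label encoding could be expressed through it. This is a hedged sketch of that route (function names as we understand the CatEncoders API), not a change to the pipeline above.
# hedged sketch: label-encode Branch with CatEncoders, keeping the fitted encoder for reuse
branch_encoder <- LabelEncoder.fit(carrefour$Branch)
head(transform(branch_encoder, carrefour$Branch))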
# converting date into date time
format(Sys.Date(), "%a %b %d")
## [1] "Sun Jun 12"
# change the Date column to the Date class (the raw dates are in month/day/year format)
data$Date <- as.Date(data$Date, "%m/%d/%Y")
# data types of the variables
sapply(data, class)
## Invoice_ID Branch Customer_type
## "character" "factor" "factor"
## Gender Product_line Unit_price
## "numeric" "factor" "numeric"
## Quantity Tax Date
## "integer" "numeric" "Date"
## Time Payment cogs
## "character" "factor" "numeric"
## gross_margin_percentage gross_income Rating
## "numeric" "numeric" "numeric"
## Total
## "numeric"
# corrplot
corr_ <- cor(num_cols)
corrplot(corr_, method = 'color')
Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique used for dimensionality reduction in machine learning; it concentrates the variation in the data into a small number of uncorrelated components, focusing on the directions that actually differentiate the observations.
# let's see the data types and the columns
sapply(num_cols, class)
## Unit_price Quantity Tax
## "numeric" "integer" "numeric"
## cogs gross_margin_percentage gross_income
## "numeric" "numeric" "numeric"
## Rating Total
## "numeric" "numeric"
# assigning the numerical columns to a variable
df <- num_cols
# Since Total is obtained as Tax + Quantity * Unit_price, we drop the Total column
df$Total <- NULL
# Drop the gross_margin_percentage column since it has only one value in all rows
df$gross_margin_percentage <- NULL
df_1 <- select_if(df, is.numeric)
#Preview the data
head(df_1)
# let's apply the PCA function
pca_apply <- prcomp(df, center = TRUE, scale. = TRUE)
# lets see the variables in the class "prcomp"
summary(pca_apply)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.9817 1.0002 0.9939 0.2909 2.51e-16 1.477e-16
## Proportion of Variance 0.6545 0.1667 0.1646 0.0141 0.00e+00 0.000e+00
## Cumulative Proportion 0.6545 0.8213 0.9859 1.0000 1.00e+00 1.000e+00
The first principal component (PC1) accounts for 65.45% of the variance in the dataset; PC2 and PC3 each account for roughly 16.5% more, bringing the cumulative variance explained by the first three components to 98.59%.
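As a quick visual check of the same breakdown, factoextra (loaded earlier) can draw a scree plot of the component variances; this is a sketch, with the axis limit chosen arbitrarily.
# hedged sketch: scree plot of the percentage of variance explained by each component
fviz_eig(pca_apply, addlabels = TRUE, ylim = c(0, 70))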
# let's use the str() function to view the details of the pca object
str(pca_apply)
## List of 5
## $ sdev : num [1:6] 1.98 1.00 9.94e-01 2.91e-01 2.51e-16 ...
## $ rotation: num [1:6, 1:6] -0.328 -0.365 -0.503 -0.503 -0.503 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "Unit_price" "Quantity" "Tax" "cogs" ...
## .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:6] 55.67 5.51 15.38 307.59 15.38 ...
## ..- attr(*, "names")= chr [1:6] "Unit_price" "Quantity" "Tax" "cogs" ...
## $ scale : Named num [1:6] 26.49 2.92 11.71 234.18 11.71 ...
## ..- attr(*, "names")= chr [1:6] "Unit_price" "Quantity" "Tax" "cogs" ...
## $ x : num [1:1000, 1:6] -1.781 2.087 -0.173 -1.343 -2.497 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
# plotting our pca
library(ggbiplot)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Loading required package: scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
ggbiplot(pca_apply)
From the PCA biplot above, Tax, cogs and gross_income load most heavily on PC1 (loadings of about -0.50 each), with Unit_price and Quantity also contributing.
# let's use the square of sdev and calculate the percentage of variation each PC has.
pca_variation <- pca_apply$sdev^2
pca_var_percentage <- round(pca_variation/sum(pca_variation) * 100, 1)
barplot(pca_var_percentage, main = "Variation Plot", xlab = "PCs 1 - 6 respectively",
ylab = "Percentage Variance", ylim = c(0, 100))
The barplot clearly shows that PC1 accounts for almost 65% of the variation in the data, PC2 and PC3 account for roughly 17% each, and the remaining components explain about 1% or less of the variation.
# Adding more detail to the plot, we provide arguments rownames as labels
ggbiplot(pca_apply, labels=rownames(data), obs.scale = 1, var.scale = 1)
names(df)
## [1] "Unit_price" "Quantity" "Tax" "cogs" "gross_income"
## [6] "Rating"
# let's group the data
data.cluster <- c(rep("Unit_price", 4), rep("Quantity", 3), rep("Tax", 5), rep("cogs", 4), rep("gross_income", 5), rep("Rating", 4))
ggbiplot(pca_apply,ellipse=TRUE, labels=rownames(num_cols), groups=data.cluster, obs.scale = 1, var.scale = 1)
# let's calculate the correlation matrix
correlationMatrix <- cor(df_1)
# let's get the highly correlated attributes
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = .75)
highlyCorrelated
## [1] 3 4
names(df_1[, ..highlyCorrelated])
## [1] "Tax" "cogs"
We remove the variables with correlations above the 0.75 cutoff and compare the correlation matrices graphically.
# removing the highly correlated features
df_2 <- df_1[, -c(3,4)]
head(df_2)
# graphical comparison of the correlation structure before and after removing the correlated features
par(mfrow = c(1, 2))
corrplot(correlationMatrix, order = "hclust")
corrplot(cor(df_2), order = "hclust", title="Correlation Matrix", mar=c(0,0,1,0))
The filter method (findCorrelation) flagged and dropped the Tax and cogs variables.
# a cluster plot of the first 2 principal components
library(cluster)
library(wskm)
## Loading required package: latticeExtra
##
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
##
## layer
## Loading required package: fpc
set.seed(2)
model<- ewkm(df_1, 4, lambda=3, maxiter=1000)
clusplot(df_1, model$cluster, color=TRUE, shade=TRUE,
labels=2, lines=1,main='Cluster Analysis supermarket')
# let's calculate the weights, i.e. the relative importance of each variable within each cluster
round(model$weights*100,2)
## Unit_price Quantity Tax cogs gross_income Rating
## 1 0 0.00 50 0 50 0.00
## 2 0 0.00 0 0 0 99.99
## 3 0 0.00 50 0 50 0.00
## 4 0 99.99 0 0 0 0.00
The cluster plot implies that the first two principal components account for 82.12% of the point variability in the data.
The cluster weights show that in clusters 1 and 3 the most important features are Tax and gross_income (50% each). In cluster 2, Rating is the most important feature with a weight of 99.99%, while in cluster 4, Quantity is the most important feature, also at 99.99%.
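The project brief also allows t-SNE, and Rtsne was loaded at the start but not used. Below is a minimal sketch of a t-SNE embedding of the same numeric features; the perplexity, iteration count and colouring by Branch are illustrative assumptions, not tuned choices.
# hedged sketch: 2-D t-SNE embedding of the numeric features
set.seed(42)
tsne_out <- Rtsne(as.matrix(df_1), dims = 2, perplexity = 30,
                  pca = TRUE, check_duplicates = FALSE, max_iter = 500)
plot(tsne_out$Y, col = as.integer(as.factor(carrefour$Branch)), pch = 19,
     xlab = "Dim 1", ylab = "Dim 2",
     main = "t-SNE embedding coloured by Branch")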
# Loading the arules library
#
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
# Loading our transactions dataset from our csv file
# ---
# We will use the read.transactions function, which will load data from comma-separated files
# and convert them to the class transactions, which is the kind of data that
# we will require while working with models of association rules
# ---
#
path <-"Supermarket_Sales.csv"
Transactions<-read.transactions(path, sep = ",")
Transactions
## transactions in sparse format with
## 7501 transactions (rows) and
## 119 items (columns)
# Verifying the object's class
# ---
# This should show us transactions as the type of data that we will need
# ---
#
class(Transactions)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
# Previewing our first 5 transactions
#
inspect(Transactions[1:5])
## items
## [1] {almonds,
## antioxydant juice,
## avocado,
## cottage cheese,
## energy drink,
## frozen smoothie,
## green grapes,
## green tea,
## honey,
## low fat yogurt,
## mineral water,
## olive oil,
## salad,
## salmon,
## shrimp,
## spinach,
## tomato juice,
## vegetables mix,
## whole weat flour,
## yams}
## [2] {burgers,
## eggs,
## meatballs}
## [3] {chutney}
## [4] {avocado,
## turkey}
## [5] {energy bar,
## green tea,
## milk,
## mineral water,
## whole wheat rice}
# If we wanted to preview the items that make up our dataset,
# alternatively we can do the following
# ---
#
items<-as.data.frame(itemLabels(Transactions))
colnames(items) <- "Item"
head(items, 10)
# Generating a summary of the transaction dataset
# ---
# This would give us some information such as the most purchased items,
# distribution of the item sets (no. of items purchased in each transaction), etc.
# ---
#
summary(Transactions)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
# Exploring the frequency of some items,
# i.e. the items in columns 8 to 10, first as absolute counts and then
# in percentage terms of the total transactions
#
itemFrequency(Transactions[, 8:10],type = "absolute")
## black tea blueberries body spray
## 107 69 86
round(itemFrequency(Transactions[, 8:10],type = "relative")*100,2)
## black tea blueberries body spray
## 1.43 0.92 1.15
# Producing a chart of frequencies and filtering
# to consider only items with a minimum percentage
# of support / considering a top x of items
# ---
# Displaying the top 10 most common items in the transactions dataset
# and the items whose relative support is at least 10%
#
par(mfrow = c(1, 2))
# plot the frequency of items
itemFrequencyPlot(Transactions, topN = 10,col="darkgreen")
itemFrequencyPlot(Transactions, support = 0.1,col="darkred")
# Building a model based on association rules
# using the apriori function
# ---
# We use Min Support as 0.001 and confidence as 0.8
# ---
#
rules <- apriori (Transactions, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 74 rules
# We use measures of significance and interest on the rules
# to determine which ones are interesting and which to discard.
# ---
# Having built the model with a minimum support of 0.001
# and a confidence of 0.8, we obtained 74 rules.
# To illustrate the sensitivity of the model to these two parameters,
# we will see what happens if we increase the support or lower the confidence level
#
# Building an apriori model with Min Support as 0.002 and confidence as 0.8.
rules2 <- apriori (Transactions,parameter = list(supp = 0.002, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 15
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Building an apriori model with Min Support as 0.001 and confidence as 0.6.
rules3 <- apriori (Transactions, parameter = list(supp = 0.001, conf = 0.6))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [545 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules2
## set of 2 rules
rules3
## set of 545 rules
In the first example, we increased the minimum support from 0.001 to 0.002 and the number of rules fell from 74 to only 2. This suggests that using too high a support threshold can make the model lose interesting rules. In the second example, we lowered the minimum confidence level from 0.8 to 0.6 and the number of rules grew from 74 to 545. This means that a low confidence level increases the number of rules considerably, and many of them will not be useful. One way to trim a large rule set is sketched below.
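One way to trim a large rule set is to drop redundant rules and rank what remains by lift; this is a hedged sketch using arules' is.redundant(), with the top-5 cut chosen arbitrarily.
# hedged sketch: prune redundant rules, then look at the strongest remaining rules by lift
rules_pruned <- rules[!is.redundant(rules)]
rules_pruned
inspect(head(sort(rules_pruned, by = "lift"), 5))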
# We can perform an exploration of our model
# through the use of the summary function as shown
# ---
# Upon running the code, the function would give us information about the model
# i.e. the size of rules, depending on the items that contain these rules.
# In our case above, most rules have 3 or 4 items, though some have up to 6.
# More statistical information such as support, lift and confidence is also provided.
# ---
#
summary(rules)
## set of 74 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 15 42 16 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.041 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.8000 Min. :0.001067 Min. : 3.356
## 1st Qu.:0.001067 1st Qu.:0.8000 1st Qu.:0.001333 1st Qu.: 3.432
## Median :0.001133 Median :0.8333 Median :0.001333 Median : 3.795
## Mean :0.001256 Mean :0.8504 Mean :0.001479 Mean : 4.823
## 3rd Qu.:0.001333 3rd Qu.:0.8889 3rd Qu.:0.001600 3rd Qu.: 4.877
## Max. :0.002533 Max. :1.0000 Max. :0.002666 Max. :12.722
## count
## Min. : 8.000
## 1st Qu.: 8.000
## Median : 8.500
## Mean : 9.419
## 3rd Qu.:10.000
## Max. :19.000
##
## mining info:
## data ntransactions support confidence
## Transactions 7501 0.001 0.8
## call
## apriori(data = Transactions, parameter = list(supp = 0.001, conf = 0.8))
# Observing rules built in our model i.e. first 5 model rules
# ---
#
inspect(rules[1:5])
## lhs rhs support confidence
## [1] {frozen smoothie, spinach} => {mineral water} 0.001066524 0.8888889
## [2] {bacon, pancakes} => {spaghetti} 0.001733102 0.8125000
## [3] {nonfat milk, turkey} => {mineral water} 0.001199840 0.8181818
## [4] {ground beef, nonfat milk} => {mineral water} 0.001599787 0.8571429
## [5] {mushroom cream sauce, pasta} => {escalope} 0.002532996 0.9500000
## coverage lift count
## [1] 0.001199840 3.729058 8
## [2] 0.002133049 4.666587 13
## [3] 0.001466471 3.432428 9
## [4] 0.001866418 3.595877 12
## [5] 0.002666311 11.976387 19
# Interpretation of the first rule:
# ---
# If someone buys a frozen smoothie and spinach, they are about 89% likely to buy mineral water too
# ---
# Ordering these rules by a criteria such as the level of confidence
# then looking at the first five rules.
# We can also use different criteria such as: (by = "lift" or by = "support")
#
rules<-sort(rules, by="confidence", decreasing=TRUE)
inspect(rules[1:5])
## lhs rhs support confidence coverage lift count
## [1] {french fries,
## mushroom cream sauce,
## pasta} => {escalope} 0.001066524 1.00 0.001066524 12.606723 8
## [2] {ground beef,
## light cream,
## olive oil} => {mineral water} 0.001199840 1.00 0.001199840 4.195190 9
## [3] {cake,
## meatballs,
## mineral water} => {milk} 0.001066524 1.00 0.001066524 7.717078 8
## [4] {cake,
## olive oil,
## shrimp} => {mineral water} 0.001199840 1.00 0.001199840 4.195190 9
## [5] {mushroom cream sauce,
## pasta} => {escalope} 0.002532996 0.95 0.002666311 11.976387 19
# Interpretation
# ---
# The first four rules have a confidence of 100%, and the fifth of 95%
# ---
# If we're interested in making a promotion relating to the sale of mineral water,
# we could create a subset of rules concerning this product
# ---
# This would tell us the items that customers bought before purchasing mineral water
# ---
#
#
mineral_water_rules <- subset(rules, subset = rhs %pin% "mineral water")
# Then order by confidence
mineral_water_rules <- sort(mineral_water_rules, by="confidence", decreasing=TRUE)
inspect(mineral_water_rules[1:5])
## lhs rhs support confidence coverage lift count
## [1] {ground beef,
## light cream,
## olive oil} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [2] {cake,
## olive oil,
## shrimp} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [3] {red wine,
## soup} => {mineral water} 0.001866418 0.9333333 0.001999733 3.915511 14
## [4] {ground beef,
## pancakes,
## whole wheat rice} => {mineral water} 0.001333156 0.9090909 0.001466471 3.813809 10
## [5] {frozen vegetables,
## milk,
## spaghetti,
## turkey} => {mineral water} 0.001199840 0.9000000 0.001333156 3.775671 9
# What if we wanted to determine items that customers might buy
# who have previously bought mineral water?
# ---
#
# Subset the rules
mineral_water_rules <- subset(rules, subset = lhs %pin% "mineral water")
# Order by confidence
mineral_water_rules <- sort(mineral_water_rules, by="confidence", decreasing=TRUE)
# inspect top 5
inspect(mineral_water_rules[1:5])
## lhs rhs support
## [1] {cake, meatballs, mineral water} => {milk} 0.001066524
## [2] {eggs, mineral water, pasta} => {shrimp} 0.001333156
## [3] {herb & pepper, mineral water, rice} => {ground beef} 0.001333156
## [4] {light cream, mineral water, shrimp} => {spaghetti} 0.001066524
## [5] {grated cheese, mineral water, rice} => {ground beef} 0.001066524
## confidence coverage lift count
## [1] 1.0000000 0.001066524 7.717078 8
## [2] 0.9090909 0.001466471 12.722185 10
## [3] 0.9090909 0.001466471 9.252498 10
## [4] 0.8888889 0.001199840 5.105326 8
## [5] 0.8888889 0.001199840 9.046887 8
# Installing anomalize package
# ---
if (!require("anomalize")) install.packages("anomalize")
## Loading required package: anomalize
## ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library("anomalize")
library(tidyverse)
library(anomalize)
# Loading the dataset
ts <- read.csv("Supermarket_Sales_Forecasting.csv",TRUE,",")
head(ts)
# we convert the data frame into a tibble, parse the Date column, and drop the original Date column
tib_ts <- ts %>% rownames_to_column() %>% as.tibble() %>%
mutate(date = as.Date(Date, "%m/%d/%y"))%>% select(-one_of('Date'))
# preview the tibble
head(tib_ts)
library(tibbletime)
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
x <- as_tbl_time(tib_ts, index = date)
x %>%
as_period("daily")%>%
# let's perform time series decomposition: the data is decomposed into seasonal, trend and remainder components using the time_decompose() function
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
#anomalize can detect and flag anomalies in the decomposed data
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
# Visualize the outcome.
plot_anomaly_decomposition()
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 5 weeks
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 47 weeks
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# Anomaly Detection
x %>%
as_period("daily")%>%
time_decompose(Sales) %>%
anomalize(remainder) %>%
time_recompose() %>%
plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 5 weeks
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 47 weeks
The anomalies detected sit at the upper bound of the recomposed series, close to the month of February, but they are not significant. The sketch below lists the flagged observations.
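To go beyond the plot, the same pipeline can return the flagged observations as a table; a minimal sketch, reusing the objects defined above.
# hedged sketch: list the observations flagged as anomalies instead of only plotting them
x %>%
  as_period("daily") %>%
  time_decompose(Sales) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  dplyr::filter(anomaly == "Yes")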
It is evident that mineral water is the top-selling product. Customers whose baskets contained cake, meatballs and mineral water went on to buy milk 100% of the time, while baskets containing mineral water together with eggs and pasta, or with herb & pepper and rice, led to shrimp or ground beef purchases about 90.9% of the time.
The combination of PCA and feature selection techniques showed that the most important features were gross income and unit price.
Branch A is more popular with men, while branch C is more popular with women. The top-selling categories were electronic accessories and fashion accessories.
There were no significant anomalies during the period under investigation.