I am a data analyst at Carrefour Kenya, currently undertaking a project to inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax). I’ll explore a recent marketing dataset by performing various unsupervised learning techniques and later provide recommendations based on my insights. This first section of the project entails reducing the dataset to a low-dimensional dataset using the t-SNE algorithm or PCA. I will perform my analysis and provide the insights gained from it.
A sales dataset has been provided on which to perform dimensionality reduction. We begin by loading and previewing the dataset at hand.
sales <- read.csv("http://bit.ly/CarreFourDataset")
Let’s preview the top of our dataset
head(sales)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## 3 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4 123-19-1176 A Member Male Health and beauty 58.22
## 5 373-73-7910 A Normal Male Sports and travel 86.31
## 6 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross.margin.percentage
## 1 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross.income Rating Total
## 1 26.1415 9.1 548.9715
## 2 3.8200 9.6 80.2200
## 3 16.2155 7.4 340.5255
## 4 23.2880 8.4 489.0480
## 5 30.2085 5.3 634.3785
## 6 29.8865 4.1 627.6165
Let’s preview the bottom of our dataset
tail(sales)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 995 652-49-6720 C Member Female Electronic accessories 60.95
## 996 233-67-5758 C Normal Male Health and beauty 40.35
## 997 303-96-2227 B Normal Female Home and lifestyle 97.38
## 998 727-02-1313 A Member Male Food and beverages 31.84
## 999 347-56-2442 A Normal Male Home and lifestyle 65.82
## 1000 849-09-3807 A Member Female Fashion accessories 88.34
## Quantity Tax Date Time Payment cogs gross.margin.percentage
## 995 1 3.0475 2/18/2019 11:40 Ewallet 60.95 4.761905
## 996 1 2.0175 1/29/2019 13:46 Ewallet 40.35 4.761905
## 997 10 48.6900 3/2/2019 17:16 Ewallet 973.80 4.761905
## 998 1 1.5920 2/9/2019 13:22 Cash 31.84 4.761905
## 999 1 3.2910 2/22/2019 15:33 Cash 65.82 4.761905
## 1000 7 30.9190 2/18/2019 13:28 Cash 618.38 4.761905
## gross.income Rating Total
## 995 3.0475 5.9 63.9975
## 996 2.0175 6.2 42.3675
## 997 48.6900 4.4 1022.4900
## 998 1.5920 7.7 33.4320
## 999 3.2910 4.1 69.1110
## 1000 30.9190 6.6 649.2990
Check the structure of the dataset
str(sales)
## 'data.frame': 1000 obs. of 16 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ Customer.type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product.line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
Our dataset has 1000 rows and 16 columns: 8 columns are of character type, 1 is of integer type, and the other 7 are numeric.
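As a quick cross-check of this breakdown, here is a minimal sketch that counts the columns of each class (it only assumes the sales data frame loaded above):
# Count how many columns fall into each data type
table(sapply(sales, class))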
Checking the summary of our dataset
summary(sales)
## Invoice.ID Branch Customer.type Gender
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.line Unit.price Quantity Tax
## Length:1000 Min. :10.08 Min. : 1.00 Min. : 0.5085
## Class :character 1st Qu.:32.88 1st Qu.: 3.00 1st Qu.: 5.9249
## Mode :character Median :55.23 Median : 5.00 Median :12.0880
## Mean :55.67 Mean : 5.51 Mean :15.3794
## 3rd Qu.:77.94 3rd Qu.: 8.00 3rd Qu.:22.4453
## Max. :99.96 Max. :10.00 Max. :49.6500
## Date Time Payment cogs
## Length:1000 Length:1000 Length:1000 Min. : 10.17
## Class :character Class :character Class :character 1st Qu.:118.50
## Mode :character Mode :character Mode :character Median :241.76
## Mean :307.59
## 3rd Qu.:448.90
## Max. :993.00
## gross.margin.percentage gross.income Rating Total
## Min. :4.762 Min. : 0.5085 Min. : 4.000 Min. : 10.68
## 1st Qu.:4.762 1st Qu.: 5.9249 1st Qu.: 5.500 1st Qu.: 124.42
## Median :4.762 Median :12.0880 Median : 7.000 Median : 253.85
## Mean :4.762 Mean :15.3794 Mean : 6.973 Mean : 322.97
## 3rd Qu.:4.762 3rd Qu.:22.4453 3rd Qu.: 8.500 3rd Qu.: 471.35
## Max. :4.762 Max. :49.6500 Max. :10.000 Max. :1042.65
Checking for missing values
colSums(is.na(sales))
## Invoice.ID Branch Customer.type
## 0 0 0
## Gender Product.line Unit.price
## 0 0 0
## Quantity Tax Date
## 0 0 0
## Time Payment cogs
## 0 0 0
## gross.margin.percentage gross.income Rating
## 0 0 0
## Total
## 0
There are no missing values present in our dataset. Let’s check for duplicates now.
sales[duplicated(sales),]
## [1] Invoice.ID Branch Customer.type
## [4] Gender Product.line Unit.price
## [7] Quantity Tax Date
## [10] Time Payment cogs
## [13] gross.margin.percentage gross.income Rating
## [16] Total
## <0 rows> (or 0-length row.names)
There are no duplicates in our dataset either.
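Since every invoice should be unique, we can also check the Invoice.ID column directly; a minimal sketch:
# Number of repeated invoice IDs (expected to be 0)
sum(duplicated(sales$Invoice.ID))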
Checking for outliers using boxplots
boxplot(sales$Unit.price,
main ="Unit Price",
col = "orange",
border = 'brown',
horizontal = TRUE,
notch = TRUE)
The Unit price column has no outliers
boxplot(sales$Quantity,
main ="Quantity",
col = "green",
border = 'blue',
horizontal = TRUE,
notch = TRUE)
There are no outliers present in the quantity column
boxplot(sales$Tax,
main ="Tax",
col = "orange",
border = 'brown',
horizontal = TRUE,
notch = TRUE)
The Tax column has a few outliers present
boxplot(sales$cogs,
main ="Cogs",
col = "green",
border = 'blue',
horizontal = TRUE,
notch = TRUE)
There are a few outliers in the cogs column
boxplot(sales$gross.income,
main ="Gross Income",
col = "orange",
border = 'brown',
horizontal = TRUE,
notch = TRUE)
There are a few outliers in the Gross Income column.
boxplot(sales$Rating,
main ="Rating",
col = "orange",
border = 'brown',
horizontal = TRUE,
notch = TRUE)
There are no outliers in the rating column.
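To complement the boxplots, here is a small sketch that counts outliers for every numeric column using the usual 1.5 * IQR rule. The helper count_outliers is defined here for illustration only and is not part of any package.
# Count values lying beyond 1.5 * IQR from the quartiles for each numeric column
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(sales[sapply(sales, is.numeric)], count_outliers)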
Let’s find the mean of our numeric columns
colMeans(sales[sapply(sales,is.numeric)])
## Unit.price Quantity Tax
## 55.672130 5.510000 15.379369
## cogs gross.margin.percentage gross.income
## 307.587380 4.761905 15.379369
## Rating Total
## 6.972700 322.966749
Let’s find the median of our numeric columns
unitprice_median <- median(sales$Unit.price)
unitprice_median
## [1] 55.23
qty_median <- median(sales$Quantity)
qty_median
## [1] 5
tax_median <- median(sales$Tax)
tax_median
## [1] 12.088
cogs_median <- median(sales$cogs)
cogs_median
## [1] 241.76
income_median <- median(sales$gross.income)
income_median
## [1] 12.088
rating_median <- median(sales$Rating)
rating_median
## [1] 7
Finding the mode of our numeric columns. Let’s define a mode function first, since R has no built-in one.
getmode <- function(v) {
  uniqv <- unique(v)                           # distinct values
  uniqv[which.max(tabulate(match(v, uniqv)))]  # value with the highest frequency
}
getmode(sales$Unit.price)
## [1] 83.77
getmode(sales$Quantity)
## [1] 10
getmode(sales$Tax)
## [1] 39.48
getmode(sales$cogs)
## [1] 789.6
getmode(sales$gross.income)
## [1] 39.48
getmode(sales$Rating)
## [1] 6
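The same getmode function can be applied to all numeric columns in one call; a minimal sketch:
# Mode of every numeric column at once
sapply(sales[sapply(sales, is.numeric)], getmode)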
Finding the range in our columns. We expect the result to give us the minimum and maximum values of our numeric columns.
range(sales$Unit.price)
## [1] 10.08 99.96
range(sales$Quantity)
## [1] 1 10
range(sales$Tax)
## [1] 0.5085 49.6500
range(sales$cogs)
## [1] 10.17 993.00
range(sales$gross.income)
## [1] 0.5085 49.6500
range(sales$Rating)
## [1] 4 10
Let’s get the quantiles in our columns.
quantile(sales$Unit.price)
## 0% 25% 50% 75% 100%
## 10.080 32.875 55.230 77.935 99.960
quantile(sales$Quantity)
## 0% 25% 50% 75% 100%
## 1 3 5 8 10
quantile(sales$Tax)
## 0% 25% 50% 75% 100%
## 0.508500 5.924875 12.088000 22.445250 49.650000
quantile(sales$cogs)
## 0% 25% 50% 75% 100%
## 10.1700 118.4975 241.7600 448.9050 993.0000
quantile(sales$gross.income)
## 0% 25% 50% 75% 100%
## 0.508500 5.924875 12.088000 22.445250 49.650000
quantile(sales$Rating)
## 0% 25% 50% 75% 100%
## 4.0 5.5 7.0 8.5 10.0
Finding the variance of the numeric columns shows us how the data values are dispersed around the mean.
var(sales$Unit.price)
## [1] 701.9653
var(sales$Quantity)
## [1] 8.546446
var(sales$Tax)
## [1] 137.0966
var(sales$cogs)
## [1] 54838.64
var(sales$gross.income)
## [1] 137.0966
var(sales$Rating)
## [1] 2.953518
Let’s find the standard deviation of numeric columns
sd(sales$Unit.price)
## [1] 26.49463
sd(sales$Quantity)
## [1] 2.923431
sd(sales$Tax)
## [1] 11.70883
sd(sales$cogs)
## [1] 234.1765
sd(sales$gross.income)
## [1] 11.70883
sd(sales$Rating)
## [1] 1.71858
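The individual var() and sd() calls above can be collapsed into a single dispersion summary; a minimal sketch assuming the same sales data frame:
# Variance and standard deviation of all numeric columns side by side
num_cols <- sales[sapply(sales, is.numeric)]
round(data.frame(variance = sapply(num_cols, var),
                 std_dev  = sapply(num_cols, sd)), 4)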
hist(sales$Unit.price, col = "yellow")
hist(sales$Quantity, col = "pink")
hist(sales$Tax, col = "dark red")
hist(sales$cogs, col = "orange")
hist(sales$gross.income, col= "light blue")
hist(sales$Rating, col ="dark green")
library(corrplot)
## corrplot 0.92 loaded
correlation <- cor(sales[,c(6,7,8,12,14,15,16)])
corrplot(correlation, method = "square", type = "lower", diag = TRUE)
Most of the monetary variables are very strongly correlated; in fact Tax, cogs, gross income, and Total are perfectly correlated with one another. Computing pairwise correlations gives us the exact extent to which variables correlate with each other.
cor(sales$Unit.price, sales$Quantity)
## [1] 0.01077756
cor(sales$Tax, sales$Quantity)
## [1] 0.7055102
cor(sales$cogs, sales$gross.income)
## [1] 1
cor(sales$Quantity, sales$Rating)
## [1] -0.0158149
cor(sales$Tax, sales$gross.income)
## [1] 1
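We can also pull out the (near-)perfectly correlated pairs programmatically from the correlation matrix computed above; a minimal sketch:
# Pairs of distinct variables with correlation above 0.99
high_pairs <- which(correlation > 0.99 & upper.tri(correlation), arr.ind = TRUE)
data.frame(var1 = rownames(correlation)[high_pairs[, 1]],
           var2 = colnames(correlation)[high_pairs[, 2]],
           r    = correlation[high_pairs])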
t-Distributed Stochastic Neighbour Embedding (t-SNE) and PCA will be applied to perform dimensionality reduction on our dataset.
PCA is a technique that extracts a new set of variables, called principal components, from an existing large set of variables. Let’s apply it and see how it performs.
Selecting the numerical data
df <- sales[,c(6:8,12:16)]
head(df)
## Unit.price Quantity Tax cogs gross.margin.percentage gross.income
## 1 74.69 7 26.1415 522.83 4.761905 26.1415
## 2 15.28 5 3.8200 76.40 4.761905 3.8200
## 3 46.33 7 16.2155 324.31 4.761905 16.2155
## 4 58.22 8 23.2880 465.76 4.761905 23.2880
## 5 86.31 7 30.2085 604.17 4.761905 30.2085
## 6 85.39 7 29.8865 597.73 4.761905 29.8865
## Rating Total
## 1 9.1 548.9715
## 2 9.6 80.2200
## 3 7.4 340.5255
## 4 8.4 489.0480
## 5 5.3 634.3785
## 6 4.1 627.6165
The numeric columns are then passed to prcomp(), excluding gross.margin.percentage since it is constant and cannot be scaled. We also set two arguments, center and scale., to TRUE, then preview our object with summary().
sales.pca <- prcomp(sales[,c(6,7,8,12,14,15,16)], center = TRUE, scale. = TRUE)
summary(sales.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.2185 1.0002 0.9939 0.30001 5.138e-16 1.387e-16
## Proportion of Variance 0.7031 0.1429 0.1411 0.01286 0.000e+00 0.000e+00
## Cumulative Proportion 0.7031 0.8460 0.9871 1.00000 1.000e+00 1.000e+00
## PC7
## Standard deviation 1.295e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
We obtain 7 principal components, each of which explains a percentage of the total variation in the dataset. PC1 explains 70.3% of the total variation, PC2 and PC3 each explain about 14%, and the remaining components contribute very little.
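These proportions can be recomputed directly from the component standard deviations, and a simple scree plot makes the drop-off visible; a minimal sketch:
# Proportion of variance explained by each principal component
prop_var <- sales.pca$sdev^2 / sum(sales.pca$sdev^2)
round(prop_var, 4)
# Scree plot of the variance explained
plot(prop_var, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained", main = "Scree plot")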
Let’s call str() to have a look at the PCA model.
str(sales.pca)
## List of 5
## $ sdev : num [1:7] 2.22 1.00 9.94e-01 3.00e-01 5.14e-16 ...
## $ rotation: num [1:7, 1:7] -0.292 -0.325 -0.45 -0.45 -0.45 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:7] 55.67 5.51 15.38 307.59 15.38 ...
## ..- attr(*, "names")= chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
## $ scale : Named num [1:7] 26.49 2.92 11.71 234.18 11.71 ...
## ..- attr(*, "names")= chr [1:7] "Unit.price" "Quantity" "Tax" "cogs" ...
## $ x : num [1:1000, 1:7] -2.005 2.306 -0.186 -1.504 -2.8 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
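Before plotting, it can also help to inspect the loadings (the rotation matrix) to see how each original variable contributes to the first two components; a minimal sketch:
# Loadings of the original variables on PC1 and PC2
round(sales.pca$rotation[, 1:2], 3)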
We go ahead to plot the PCA model. First, we install the ggbiplot visualization package.
install.packages("devtools")
## Installing package into '/home/binti/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("Rcpp", dependencies = TRUE)
## Installing package into '/home/binti/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(devtools)
## Loading required package: usethis
install_github("vqv/ggbiplot")
## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
## Use `force = TRUE` to force installation
Loading our ggbiplot library
library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
ggbiplot(sales.pca)
From the biplot above, it’s evident that gross income, unit price, rating, and quantity are important factors in this analysis.
PC1 explains 70.3% of the total variance, meaning that just over two-thirds of the information in the 7 variables can be encapsulated by that single principal component (PC1).
PC2 explains 14.3% of the variance.
To add more detail to the plot, we pass the row names as labels.
ggbiplot(sales.pca, labels=rownames(sales), obs.scale = 1, var.scale = 1)
Let’s plot PC3 and PC4.
ggbiplot(sales.pca,ellipse=TRUE,choices=c(3,4))
From the graph, PC3 explains 14.1% of the variance while PC4 explains 1.3% of the total variance in the dataset.
Again, to add more detail to the plot, we pass the row names as labels.
ggbiplot(sales.pca,ellipse=TRUE,choices=c(3,4), labels=rownames(sales))
t-SNE finds a way to project data into a low-dimensional space so that the clustering present in the high-dimensional space is preserved.
Loading the t-sne library
library(Rtsne)
Curating the dataset for analysis
Labels <- sales$Product.line
sales$Product.line <- as.factor(sales$Product.line)
Creating a colour palette for the product lines to use in the plot
colors = rainbow(length(unique(sales$Product.line)))
names(colors) = unique(sales$Product.line)
Executing our algorithm on the curated data
tsne <- Rtsne(sales[,-5], dims = 2, perplexity=30, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1000 x 50 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.09 seconds (sparsity = 0.101260)!
## Learning embedding...
## Iteration 50: error is 60.785506 (50 iterations in 0.12 seconds)
## Iteration 100: error is 52.526184 (50 iterations in 0.11 seconds)
## Iteration 150: error is 50.877044 (50 iterations in 0.12 seconds)
## Iteration 200: error is 50.352214 (50 iterations in 0.12 seconds)
## Iteration 250: error is 50.074397 (50 iterations in 0.12 seconds)
## Iteration 300: error is 0.561052 (50 iterations in 0.12 seconds)
## Iteration 350: error is 0.397329 (50 iterations in 0.12 seconds)
## Iteration 400: error is 0.362099 (50 iterations in 0.12 seconds)
## Iteration 450: error is 0.344496 (50 iterations in 0.12 seconds)
## Iteration 500: error is 0.337596 (50 iterations in 0.12 seconds)
## Fitting performed in 1.21 seconds.
Checking the summary of our model
summary(tsne)
## Length Class Mode
## N 1 -none- numeric
## Y 2000 -none- numeric
## costs 1000 -none- numeric
## itercosts 10 -none- numeric
## origD 1 -none- numeric
## perplexity 1 -none- numeric
## theta 1 -none- numeric
## max_iter 1 -none- numeric
## stop_lying_iter 1 -none- numeric
## mom_switch_iter 1 -none- numeric
## momentum 1 -none- numeric
## final_momentum 1 -none- numeric
## eta 1 -none- numeric
## exaggeration_factor 1 -none- numeric
Plotting the resulting embedding, colouring each point by its product line
plot(tsne$Y, t='n', main="tsne")
text(tsne$Y, labels=sales$Product.line, col=colors[sales$Product.line])
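As a robustness check, here is a hedged sketch that re-runs t-SNE on only the seven scaled numeric columns used for PCA. check_duplicates is disabled in case scaled rows coincide, and because t-SNE is stochastic the exact layout will differ from the run above.
# t-SNE on the scaled numeric columns only; results vary between runs
set.seed(42)
tsne_num <- Rtsne(scale(sales[, c(6, 7, 8, 12, 14, 15, 16)]),
                  dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(tsne_num$Y, t = 'n', main = "tsne (numeric columns only)")
text(tsne_num$Y, labels = sales$Product.line, col = colors[sales$Product.line])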
These methods apply a metric to assign a score to each feature; the features are then ranked by their scores. We’ll be using the findCorrelation function from the caret package to create a subset of variables. This function allows us to remove redundancy based on correlation in the given dataset. We use numerical variables only since we are working with a correlation matrix.
df <- sales[,c(6,7,8,12,14,15,16)]
head(df)
## Unit.price Quantity Tax cogs gross.income Rating Total
## 1 74.69 7 26.1415 522.83 26.1415 9.1 548.9715
## 2 15.28 5 3.8200 76.40 3.8200 9.6 80.2200
## 3 46.33 7 16.2155 324.31 16.2155 7.4 340.5255
## 4 58.22 8 23.2880 465.76 23.2880 8.4 489.0480
## 5 86.31 7 30.2085 604.17 30.2085 5.3 634.3785
## 6 85.39 7 29.8865 597.73 29.8865 4.1 627.6165
Loading our caret package.
suppressWarnings(
suppressMessages(if
(!require(caret, quietly=TRUE))
install.packages("caret")))
library(caret)
Let’s install the corrplot package for correlation.
suppressWarnings(
suppressMessages(if
(!require(corrplot, quietly=TRUE))
install.packages("corrplot")))
library(corrplot)
Getting the correlation matrix
corrmatrix <- cor(df)
corrmatrix
## Unit.price Quantity Tax cogs gross.income
## Unit.price 1.000000000 0.01077756 0.6339621 0.6339621 0.6339621
## Quantity 0.010777564 1.00000000 0.7055102 0.7055102 0.7055102
## Tax 0.633962089 0.70551019 1.0000000 1.0000000 1.0000000
## cogs 0.633962089 0.70551019 1.0000000 1.0000000 1.0000000
## gross.income 0.633962089 0.70551019 1.0000000 1.0000000 1.0000000
## Rating -0.008777507 -0.01581490 -0.0364417 -0.0364417 -0.0364417
## Total 0.633962089 0.70551019 1.0000000 1.0000000 1.0000000
## Rating Total
## Unit.price -0.008777507 0.6339621
## Quantity -0.015814905 0.7055102
## Tax -0.036441705 1.0000000
## cogs -0.036441705 1.0000000
## gross.income -0.036441705 1.0000000
## Rating 1.000000000 -0.0364417
## Total -0.036441705 1.0000000
Finding variables that are highly correlated
highlyCorrelated <- findCorrelation(corrmatrix, cutoff=0.75)
highlyCorrelated
## [1] 4 7 3
List of highly correlated variables
names(df[,highlyCorrelated])
## [1] "cogs" "Total" "Tax"
From the output above, the following variables are the most highly correlated: cogs, Total, and Tax.
Let’s remove the redundant features
df_new <- df[-highlyCorrelated]
head(df_new)
## Unit.price Quantity gross.income Rating
## 1 74.69 7 26.1415 9.1
## 2 15.28 5 3.8200 9.6
## 3 46.33 7 16.2155 7.4
## 4 58.22 8 23.2880 8.4
## 5 86.31 7 30.2085 5.3
## 6 85.39 7 29.8865 4.1
Graphical Comparison
par(mfrow = c(1, 2))
corrplot(corrmatrix, order = "hclust")
corrplot(cor(df_new), order = "hclust")
Here we’ll use the ewkm function from the wskm package, a weighted subspace clustering algorithm that is well suited to very high-dimensional data.
Let’s load our wskm package
suppressWarnings(
suppressMessages(if
(!require(wskm, quietly=TRUE))
install.packages("wskm")))
library(wskm)
set.seed(23)
model <- ewkm(df,3, lambda = 2, maxiter = 1000)
Loading the cluster package
suppressWarnings(
suppressMessages(if
(!require(cluster, quietly=TRUE))
install.packages("cluster")))
library("cluster")
We plot the clusters against the first 2 principal components.
clusplot(df, model$cluster, color=TRUE, shade=TRUE,
labels=2, lines=1,main='Cluster Analysis for Supermarket sales')
From the plot above, we note that the first two principal components explain 84.6% of the point variability. Weights are calculated for each variable and cluster. The weights are a measure of the relative importance of each variable with regard to an observation’s membership of a cluster. The weights are incorporated into the distance function, which typically reduces the distance for the more important variables.
Weights remain stored in the model and they can be checked as follows:
round(model$weights * 100)
## Unit.price Quantity Tax cogs gross.income Rating Total
## 1 0 0 0 0 0 100 0
## 2 0 0 50 0 50 0 0
## 3 0 0 50 0 50 0 0
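For context, a quick look at how many observations fall into each of the three clusters; a minimal sketch:
# Number of observations assigned to each cluster
table(model$cluster)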
Here, we will be using the FSelector package, which contains functions for selecting attributes from a dataset.
Loading the package required
suppressWarnings(
suppressMessages(if
(!require(FSelector, quietly=TRUE))
install.packages("FSelector")))
install.packages('FSelector')
## Installing package into '/home/binti/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
From the FSelector package, the correlation coefficient is used as the unit of evaluation. This is one of the algorithms contained in the FSelector package that can be used to rank the variables.
scores <- linear.correlation(Total~.,df)
scores
## attr_importance
## Unit.price 0.6339621
## Quantity 0.7055102
## Tax 1.0000000
## cogs 1.0000000
## gross.income 1.0000000
## Rating 0.0364417
The output above lists the variables on the left and their importance scores on the right. To help make a decision, we define a cut-off; for example, we can select the top 5 representative variables using the cutoff.k function from the FSelector package.
subset <- cutoff.k(scores, 5)
as.data.frame(subset)
## subset
## 1 Tax
## 2 cogs
## 3 gross.income
## 4 Quantity
## 5 Unit.price
The cut-off can also be set as a percentage, indicating that our aim is to work with that percentage of the best variables.
subset2 <-cutoff.k.percent(scores, 0.4)
as.data.frame(subset2)
## subset2
## 1 Tax
## 2 cogs
We can use an entropy-based approach instead of the correlation coefficient scores.
scores2 <- information.gain(Total~., df)
scores2
## attr_importance
## Unit.price 0.3084863
## Quantity 0.4211154
## Tax 1.6094379
## cogs 1.6094379
## gross.income 1.6094379
## Rating 0.0000000
Choosing variables by the cut-off method
subset3 <- cutoff.k(scores2, 5)
as.data.frame(subset3)
## subset3
## 1 Tax
## 2 cogs
## 3 gross.income
## 4 Quantity
## 5 Unit.price