DIMENSIONALITY REDUCTION AND FEATURE SELECTION

1. DEFINING THE QUESTION

a) Specifying the Question

Identify the most relevant features that will result in the highest number of sales.

b) Defining the Metric of Success

Successfully identify the features that would inform the marketing department on the most relevant marketing strategies.

c) Understanding the Context

As a Data analyst at Carrefour Kenya, I am currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax). The project has been divided into four parts where I’ll explore a recent marketing dataset by performing various unsupervised learning techniques and later providing recommendations based on the insights.

d) Recording the Experimental Design

Load, preview and understand the data
Data cleaning and manipulation
Exploratory Data Analysis
Dimensionality reduction
Feature selection
Conclusion

2. LOADING AND PREVIEWING THE DATA

Loading the dataset

# calling the library 
library(tibble)

# loading the data

df <- read.csv('http://bit.ly/CarreFourDataset')

# convert dataframe to tibble 
df <- as_tibble(df)

# checking the class of the dataset
df %>% class()

## [1] "tbl_df"     "tbl"        "data.frame"

Previewing the data

# Determining the number of rows and columns in the dataset
cat('Number of rows are', nrow(df), 'and the number of columns are', ncol(df))

## Number of rows are 1000 and the number of columns are 16

# previewing the top of the dataset
df %>% head()

## # A tibble: 6 × 16
##   Invoice.ID  Branch Customer.type Gender Product.line Unit.price Quantity   Tax
##   <chr>       <chr>  <chr>         <chr>  <chr>             <dbl>    <int> <dbl>
## 1 750-67-8428 A      Member        Female Health and …       74.7        7 26.1 
## 2 226-31-3081 C      Normal        Female Electronic …       15.3        5  3.82
## 3 631-41-3108 A      Normal        Male   Home and li…       46.3        7 16.2 
## 4 123-19-1176 A      Member        Male   Health and …       58.2        8 23.3 
## 5 373-73-7910 A      Normal        Male   Sports and …       86.3        7 30.2 
## 6 699-14-3026 C      Normal        Male   Electronic …       85.4        7 29.9 
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## #   gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## #   Total <dbl>

# previewing the bottom of the dataset
df %>% tail()

## # A tibble: 6 × 16
##   Invoice.ID  Branch Customer.type Gender Product.line Unit.price Quantity   Tax
##   <chr>       <chr>  <chr>         <chr>  <chr>             <dbl>    <int> <dbl>
## 1 652-49-6720 C      Member        Female Electronic …       61.0        1  3.05
## 2 233-67-5758 C      Normal        Male   Health and …       40.4        1  2.02
## 3 303-96-2227 B      Normal        Female Home and li…       97.4       10 48.7 
## 4 727-02-1313 A      Member        Male   Food and be…       31.8        1  1.59
## 5 347-56-2442 A      Normal        Male   Home and li…       65.8        1  3.29
## 6 849-09-3807 A      Member        Female Fashion acc…       88.3        7 30.9 
## # … with 8 more variables: Date <chr>, Time <chr>, Payment <chr>, cogs <dbl>,
## #   gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## #   Total <dbl>

# checking the structure of the dataset and datatype of each column
df %>% str()

## tibble [1,000 × 16] (S3: tbl_df/tbl/data.frame)
##  $ Invoice.ID             : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr [1:1000] "A" "C" "A" "A" ...
##  $ Customer.type          : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr [1:1000] "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : chr [1:1000] "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr [1:1000] "13:08" "10:29" "13:23" "20:33" ...
##  $ Payment                : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num [1:1000] 549 80.2 340.5 489 634.4 ...

All the columns have correct data types, except the date column

# Converting the date column to date format

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

df$Date <- mdy(df$Date)

# converting the character columns to factors
df <- as.data.frame(unclass(df),
                     stringsAsFactors = TRUE)

# checking if the changes are implemented
df %>% str()

## 'data.frame':    1000 obs. of  16 variables:
##  $ Invoice.ID             : Factor w/ 1000 levels "101-17-6199",..: 815 143 654 19 340 734 316 265 703 727 ...
##  $ Branch                 : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
##  $ Customer.type          : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
##  $ Gender                 : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
##  $ Product.line           : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax                    : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Date                   : Date, format: "2019-01-05" "2019-03-08" ...
##  $ Time                   : Factor w/ 506 levels "10:00","10:01",..: 147 24 156 486 30 394 215 78 342 160 ...
##  $ Payment                : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...

3. DATA CLEANING AND MANIPULATION

a) Validation

# Checking for value validity
df %>% summary()

##        Invoice.ID  Branch  Customer.type    Gender   
##  101-17-6199:  1   A:340   Member:501    Female:501  
##  101-81-4070:  1   B:332   Normal:499    Male  :499  
##  102-06-2002:  1   C:328                             
##  102-77-2261:  1                                     
##  105-10-6182:  1                                     
##  105-31-1824:  1                                     
##  (Other)    :994                                     
##                  Product.line   Unit.price       Quantity          Tax         
##  Electronic accessories:170   Min.   :10.08   Min.   : 1.00   Min.   : 0.5085  
##  Fashion accessories   :178   1st Qu.:32.88   1st Qu.: 3.00   1st Qu.: 5.9249  
##  Food and beverages    :174   Median :55.23   Median : 5.00   Median :12.0880  
##  Health and beauty     :152   Mean   :55.67   Mean   : 5.51   Mean   :15.3794  
##  Home and lifestyle    :160   3rd Qu.:77.94   3rd Qu.: 8.00   3rd Qu.:22.4453  
##  Sports and travel     :166   Max.   :99.96   Max.   :10.00   Max.   :49.6500  
##                                                                                
##       Date                 Time            Payment         cogs       
##  Min.   :2019-01-01   14:42  :  7   Cash       :344   Min.   : 10.17  
##  1st Qu.:2019-01-24   19:48  :  7   Credit card:311   1st Qu.:118.50  
##  Median :2019-02-13   17:38  :  6   Ewallet    :345   Median :241.76  
##  Mean   :2019-02-14   10:11  :  5                     Mean   :307.59  
##  3rd Qu.:2019-03-08   11:40  :  5                     3rd Qu.:448.90  
##  Max.   :2019-03-30   11:51  :  5                     Max.   :993.00  
##                       (Other):965                                     
##  gross.margin.percentage  gross.income         Rating           Total        
##  Min.   :4.762           Min.   : 0.5085   Min.   : 4.000   Min.   :  10.68  
##  1st Qu.:4.762           1st Qu.: 5.9249   1st Qu.: 5.500   1st Qu.: 124.42  
##  Median :4.762           Median :12.0880   Median : 7.000   Median : 253.85  
##  Mean   :4.762           Mean   :15.3794   Mean   : 6.973   Mean   : 322.97  
##  3rd Qu.:4.762           3rd Qu.:22.4453   3rd Qu.: 8.500   3rd Qu.: 471.35  
##  Max.   :4.762           Max.   :49.6500   Max.   :10.000   Max.   :1042.65  
##

There’s no entry below less than 0 therefore all the records are valid.

b) Consistency

# Checking for missing values
sum(is.na(df))

## [1] 0

There are no missing values.

c) Completeness

# Check for duplicates
sum(duplicated(df))

## [1] 0

There are no duplicates.

d) Uniformity

# Checking uniformity in column names 
colnames(df)

##  [1] "Invoice.ID"              "Branch"                 
##  [3] "Customer.type"           "Gender"                 
##  [5] "Product.line"            "Unit.price"             
##  [7] "Quantity"                "Tax"                    
##  [9] "Date"                    "Time"                   
## [11] "Payment"                 "cogs"                   
## [13] "gross.margin.percentage" "gross.income"           
## [15] "Rating"                  "Total"

There is a mix of both sentence cases and lower case. For uniformity, I’ll change the letter case to lower

names(df) <- tolower(names(df))

# viewing the results 
names(df)

##  [1] "invoice.id"              "branch"                 
##  [3] "customer.type"           "gender"                 
##  [5] "product.line"            "unit.price"             
##  [7] "quantity"                "tax"                    
##  [9] "date"                    "time"                   
## [11] "payment"                 "cogs"                   
## [13] "gross.margin.percentage" "gross.income"           
## [15] "rating"                  "total"

All the column names are now uniform.

e) Outliers

# selecting only the numerical columns
num_cols <- df[, c(6:8,12:16)]
length(num_cols)

## [1] 8

# Box plots to visualize outliers
par(mfrow = c(2,3))
for (i in 1:length(num_cols)){
  boxplot(num_cols[i], main = paste('Boxplot for',  names(num_cols)[i]), 
          ylab = 'Count')
}

Tax, cogs, gross.income and total are the only columns with outliers. I will not remove them.

f) More cleaning actions

# Splitting the date column into year, month and day
# Unique years
year <- format(df$date, format="%y")
unique(year)

## [1] "19"

The data contains records for only 1 year, 2019

# Unique months
month <- format(df$date, format="%m")

# adding months column to the dataset 
df$month <- month

sort(unique(month))

## [1] "01" "02" "03"

The data was collected during the first 3 months

# Day of the week 
day <- wday(df$date)

# adding days column to the dataset 
df$day <- day

# converting to factor
df$day <- as.factor(df$day)
sort(unique(day))

## [1] 1 2 3 4 5 6 7

# Now that the date column is split and the columns of interest added, the column can be dropped
df = subset(df, select = -c(date))
df %>% head(2)

##    invoice.id branch customer.type gender           product.line unit.price
## 1 750-67-8428      A        Member Female      Health and beauty      74.69
## 2 226-31-3081      C        Normal Female Electronic accessories      15.28
##   quantity     tax  time payment   cogs gross.margin.percentage gross.income
## 1        7 26.1415 13:08 Ewallet 522.83                4.761905      26.1415
## 2        5  3.8200 10:29    Cash  76.40                4.761905       3.8200
##   rating    total month day
## 1    9.1 548.9715    01   7
## 2    9.6  80.2200    03   6

4. Exploratory Data Analysis

a) Univariate Analysis

# Statistical description of the variables
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ✔ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ lubridate::date()        masks base::date()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ lubridate::intersect()   masks base::intersect()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ lubridate::setdiff()     masks base::setdiff()
## ✖ lubridate::union()       masks base::union()

library(dplyr)
library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describe(num_cols)

##                         vars    n   mean     sd median trimmed    mad   min
## unit.price                 1 1000  55.67  26.49  55.23   55.62  33.37 10.08
## quantity                   2 1000   5.51   2.92   5.00    5.51   2.97  1.00
## tax                        3 1000  15.38  11.71  12.09   14.00  11.13  0.51
## cogs                       4 1000 307.59 234.18 241.76  279.91 222.65 10.17
## gross.margin.percentage    5 1000   4.76   0.00   4.76    4.76   0.00  4.76
## gross.income               6 1000  15.38  11.71  12.09   14.00  11.13  0.51
## rating                     7 1000   6.97   1.72   7.00    6.97   2.22  4.00
## total                      8 1000 322.97 245.89 253.85  293.91 233.78 10.68
##                             max   range skew kurtosis   se
## unit.price                99.96   89.88 0.01    -1.22 0.84
## quantity                  10.00    9.00 0.01    -1.22 0.09
## tax                       49.65   49.14 0.89    -0.09 0.37
## cogs                     993.00  982.83 0.89    -0.09 7.41
## gross.margin.percentage    4.76    0.00  NaN      NaN 0.00
## gross.income              49.65   49.14 0.89    -0.09 0.37
## rating                    10.00    6.00 0.01    -1.16 0.05
## total                   1042.65 1031.97 0.89    -0.09 7.78

The table above gives us the measures of central tendency (mean, median) and measures of dispersion (standard deviation, minimum, maximum, range, skew, kurtosis) values.

Frequency distribution of the numeric columns

# function for plotting histograms
histogram <- function(column,title, xlab, ylim){
  
  hist(column, main= title, xlab=xlab, ylim=ylim, ylab = "Frequency", col = "darkred")
  
}

# histogram for unit price
histogram(df$unit.price, "Histogram for Unit Price", "Unit Price", c(0, 150))

The unit price between 90 and 100 had the highest frequency for most products. The distribution is non-normal.

# histogram for quantity
histogram(df$quantity, "Histogram for quantity", "Quantity", c(0, 250))

Most products had a quantity between 1 and 2. The distribution is non-normal.

# histogram for tax
histogram(df$tax, "Histogram for Tax", "Tax", c(0, 250))

The distribution for the tax column is right skewed. Most products had a tax between 0-10

# histogram for cogs
histogram(df$cogs, "Histogram for Cogs", "Cogs", c(0, 250))

The distribution for the cogs column is right skewed. Most products had a tax between 0-10

# histogram for gross margin percentage
histogram(df$gross.income, "Histogram for Gross Income", "Gross Income", c(0, 250))

The distribution for the gross income column is right skewed. Most products had a gross income between 0-10

# histogram for rating
histogram(df$rating, "Histogram for Rating", "Rating", c(0, 120))

Most products got a rating between 4 and 4.5

# histogram for total
histogram(df$total, "Histogram for Total", "Total", c(0, 250))

Most transactions had a total frequency of slightly more than 200. The column is right skewed.

Bar plot representations for categorical columns

# barplot function
bar <- function(column, title, xlab, ylim){
  
  # create frequency table
  freq <- table(column)
  
  # sort frequency table
  sorted_freq <- (freq[order(freq,decreasing=TRUE)])
  
  # adjust margins of frequency table
  par(mar = c(7, 4, 2, 2) + 0.2)
  
  
  # plot bar graph for first 10 values with the highest count
  barplot(sorted_freq[1:10], main=title, ylab="Frequency" ,las=2, col="darkgreen")
  title(xlab = xlab, line=30)
  
}

# plotting a bar graph for branch
bar(df$branch, "Bar plot for Branch", "Branches")

Branch A had the highest representation, B & C had equal representations.

# plotting a bar graph for gender
bar(df$gender, "Bar plot for Gender", "Gender")

Both males and females are equally represented.

# plotting a bar graph for customer type
bar(df$customer.type, "Bar plot for Customer Type", "Customer Type")

Both types of customers had equal representation.

# plotting a bar graph for product line
bar(df$product.line, "Bar plot for Product Line", "Product line")

The most bought product type is fashion accessories

# plotting a bar graph for payment
bar(df$payment, "Bar plot for Payment", "Payment")

Cash and e-wallet are the most used modes of payment.

# plotting a bar graph for payment
bar(df$month, "Bar plot for Month", "Month")

January had the highest number of representation.

# plotting a bar graph for day
bar(df$day, "Bar plot for Days", "Day")

Day 1 is Monday. Sunday had the highest representation followed by Wednesday.

b) Bivariate Analysis

Categorical - Categorical

library(ggplot2)
# create function to plot stacked bar chart
stacked <- function(column1, column2, title, value1, value2, legend){
  
    # vectorize the two columns to plot
    attribute1 <- c(column1) 
    attribute2 <- c(column2)
    
    # create dataframe of the two columns
    data <- data.frame(attribute1, attribute2)
    
    # plot stacked bar graph
    ggplot(data=data) + geom_bar(aes(fill=attribute2, x=attribute1)) + ggtitle(title) + xlab(value1) + ylab(value2) + theme(plot.title = element_text(hjust=0.5)) + labs(fill=legend) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5)) 
    
    
}

# Branch vs Gender
stacked(df$branch, df$gender, "Branch against Gender", "Branch", "Gender", "Gender")

Branches B and C have more female customers than A, which has an equal number of the males and females.

# Product line vs Branch
stacked(df$product.line, df$branch, "Product line against Branch", "Product line", "Branch", "Branch")

Electronic accessories products are more in branch A
Fashion accessories products are more in branch B
Food and beverages are more in branch C
Health and beauty products are more more in branch B
Home and lifestyle products are more in branch A
Sports and travel products are more in store B

#  Branch vs Customer type
stacked(df$branch, df$customer.type, "Branch against Customer Type", "Branch", "Customer Type", "Customer Type")

Branch A has more customers than B and C.
All the brancehes have an equal number of customer types

# product line vs customer type
stacked(df$product.line, df$customer.type, "Product line against Customer Type", "Product line", "Customer Type", "Customer Type")

Electronic accessories are mostly bought by normal customers
Fashion accessories are mostly bought by member customers
Food and beverages are mostly bought by member customers
Health and beauty products get bought by both types of customers
Home and lifestyle products are bought mostly by member customers
Sports and travel products are bought by member customers

# product line vs gender 
stacked(df$product.line, df$gender, "Product line against Gender", "Product line", "Gender", "Gender")

Electronic accessories get bought by both genders
Fashion accessories are mostly bought by females
Food and beverages are mostly bought by females
Health and beauty products are mostly bought by males
Home and lifestyle products are bought mostly by females
Sports and travel products are bought by females

# payment vs customer type
stacked(df$payment, df$customer.type, "Payment against Customer Type", "Payment", "Customer Type", "Customer Type")

Cash is used by both member and normal customers
Credit card is used mostly by member customers
E-wallet is used mostly by normal customers

# Month against Product type

# create contingency table 
month <- df$month
ptype <- df$product.line
mon <- table(month,ptype)
mon

##      ptype
## month Electronic accessories Fashion accessories Food and beverages
##    01                     54                  64                 56
##    02                     54                  60                 62
##    03                     62                  54                 56
##      ptype
## month Health and beauty Home and lifestyle Sports and travel
##    01                49                 59                70
##    02                46                 38                43
##    03                57                 63                53

In January, sports and travel products were the most bought while health and beauty products were the least bought.
In February, food and bevareges were the most bought and home while lifestyle products the least bought.
In March, home and lifestyle products were the most bought while sports and travel products were the least bought.

Numerical - numerical

# checking for covariance in the numerical columns
cov(num_cols, use="complete.obs")

##                           unit.price     quantity          tax        cogs
## unit.price               701.9653313   0.83477848  196.6683401  3933.36680
## quantity                   0.8347785   8.54644645   24.1495704   482.99141
## tax                      196.6683401  24.14957038  137.0965941  2741.93188
## cogs                    3933.3668019 482.99140761 2741.9318829 54838.63766
## gross.margin.percentage    0.0000000   0.00000000    0.0000000     0.00000
## gross.income             196.6683401  24.14957038  137.0965941  2741.93188
## rating                    -0.3996675  -0.07945646   -0.7333003   -14.66601
## total                   4130.0351420 507.14097799 2879.0284770 57580.56954
##                         gross.margin.percentage gross.income       rating
## unit.price                                    0  196.6683401  -0.39966752
## quantity                                      0   24.1495704  -0.07945646
## tax                                           0  137.0965941  -0.73330028
## cogs                                          0 2741.9318829 -14.66600553
## gross.margin.percentage                       0    0.0000000   0.00000000
## gross.income                                  0  137.0965941  -0.73330028
## rating                                        0   -0.7333003   2.95351823
## total                                         0 2879.0284770 -15.39930581
##                               total
## unit.price               4130.03514
## quantity                  507.14098
## tax                      2879.02848
## cogs                    57580.56954
## gross.margin.percentage     0.00000
## gross.income             2879.02848
## rating                    -15.39931
## total                   60459.59802

Covariance measures how two random variables vary together. A high negative covariance indicates negative correlation while a high positive covariance indicates positive correlation. A value close to zero indicates weak covariance.

From the results above, we can deduce that the following have positive covariance indicating positive correlation: unit price and tax, unit price and cogs, cogs and tax, gross income and total just to mention a few.

Rating and all other numerical variables have a negative covariance indicating negative correlation.

Gross margin percentage has a weak correlation with all the variables.

# correlation matrix to see the strengths of correlation between the variables

# import ggcorrplot package
library(ggcorrplot)

corr <- round(cor(num_cols, use='complete.obs'),1)

## Warning in cor(num_cols, use = "complete.obs"): the standard deviation is zero

ggcorrplot(corr, lab=TRUE, title='Correlation Heatmap', colors=c('#022D36', 'white', '#48AAAD'))

There is a strong positive correlation between total and tax, total and cogs, gross income and tax, gross income and cogs, tax and cogs, total and gross income.

The rest are weakly or moderately correlated.

# Scatter plot for total and tax
plot(df$total, df$tax, xlab="total", ylab="tax")

There is a strong positive linear correlation between the variables

# Scatter plot for gross income and tax
plot(df$gross.income, df$tax, xlab="gross income", ylab="tax")

There is a strong positive linear correlation between the variables

# Scatter plot for gross income and tax
plot(df$gross.income, df$tax, xlab="gross income", ylab="tax")

There is a strong positive linear correlation between the variables

# Scatter plot for total and cogs
plot(df$total, df$cogs, xlab="total", ylab="cogs")

There is a strong positive linear correlation between the variables

# Scatter plot for total and cogs
plot(df$gross.income, df$quantity, xlab="gross income", ylab="quantity")

There is a moderately positive correlation between the variables.

Numerical - Categorical

# Barplot function

barc <- function(column1,column2, xlabel, ylabel, title){
  
  ggplot(data=df, aes(x=column1, y=column2)) +  stat_summary(fun='median') +  geom_bar(stat="identity",fill="darkblue") + xlab(xlabel) +ylab(ylabel) + ggtitle(title) + theme(plot.title = element_text(hjust=0.5))
  
}

# Product line vs Gross income
barc(df$product.line, df$gross.income, 'Product Line', 'Gross income', 'Product line vs Gross income')

## Warning: Removed 6 rows containing missing values (geom_segment).

Food and beverages brought in the most gross income

# Product line vs Total
barc(df$product.line, df$total, 'Product Line', 'Total', 'Product line vs Total')

## Warning: Removed 6 rows containing missing values (geom_segment).

Food and beverages brought in the most totals

# Product line vs Total
barc(df$product.line, df$tax, 'Product Line', 'Tax', 'Product line vs Tax')

## Warning: Removed 6 rows containing missing values (geom_segment).

Food and beverages were highly taxed

c) Multivariate Analysis

# plot quantity vs. total (color represents product line)
ggplot(df, aes(x = quantity, 
                     y = total, 
                     color=product.line)) +
  geom_point() +
  labs(title = "Totals by product line and quantity")

As the quantity increases so does the totals for all products

# plot unit price vs. tax (color represents product line)
ggplot(df, aes(x = unit.price, 
                     y = tax, 
                     color=product.line)) +
  geom_point() +
  labs(title = "Unit Price by tax and product line")

The tax for all the products goes higher as the unit price increases

# plot quantity vs. total (color represents product line)
ggplot(df, aes(x = rating, 
                     y = total, 
                     color=product.line)) +
  geom_point() +
  labs(title = "Totals by product line and rating")

The rating for all the products are more for products that are below 350

5. Dimensionality Reduction using PCA

# The first column is not necessary for reduction
# gross margin percentage does not have variance therefore can't be scaled
df = subset(df, select = -c(invoice.id, gross.margin.percentage, time))
df %>% head(2)

##   branch customer.type gender           product.line unit.price quantity
## 1      A        Member Female      Health and beauty      74.69        7
## 2      C        Normal Female Electronic accessories      15.28        5
##       tax payment   cogs gross.income rating    total month day
## 1 26.1415 Ewallet 522.83      26.1415    9.1 548.9715    01   7
## 2  3.8200    Cash  76.40       3.8200    9.6  80.2200    03   6

# To perform PCA, all the columns have to be numerical.
# Converting the categorical columns to numerical 

df$branch = as.integer(df$branch)
df$customer.type = as.integer(df$customer.type)
df$gender = as.integer(df$gender)
df$product.line = as.integer(df$product.line)
df$payment = as.integer(df$payment)
df$month = as.integer(df$month)
df$day = as.integer(df$day)

# previewing the changes made
df %>% head(2)

##   branch customer.type gender product.line unit.price quantity     tax payment
## 1      1             1      1            4      74.69        7 26.1415       3
## 2      3             2      1            1      15.28        5  3.8200       1
##     cogs gross.income rating    total month day
## 1 522.83      26.1415    9.1 548.9715     1   7
## 2  76.40       3.8200    9.6  80.2200     3   6

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#running our principal component
df.pca <- prcomp(df, center = TRUE, scale. = TRUE)
df.pca %>% summary()

## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2203 1.07087 1.04213 1.02714 1.01085 0.98201 0.97962
## Proportion of Variance 0.3521 0.08191 0.07757 0.07536 0.07299 0.06888 0.06855
## Cumulative Proportion  0.3521 0.43404 0.51162 0.58698 0.65996 0.72884 0.79739
##                            PC8     PC9    PC10    PC11      PC12      PC13
## Standard deviation     0.97047 0.95235 0.94761 0.29963 2.954e-16 2.043e-16
## Proportion of Variance 0.06727 0.06478 0.06414 0.00641 0.000e+00 0.000e+00
## Cumulative Proportion  0.86466 0.92945 0.99359 1.00000 1.000e+00 1.000e+00
##                             PC14
## Standard deviation     1.451e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

14 principal components are obtained and each explain the total variation of the dataset. PC1 explains 35.21% of the the total variance, PC2 explains 8.1%

# checking the possible variables we can get from running principal component(PC)
var=get_pca_var(df.pca)
var

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

#getting the Eigenvalues
eig.val=get_eigenvalue(df.pca)
eig.val

##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  4.929865e+00     3.521332e+01                    35.21332
## Dim.2  1.146761e+00     8.191147e+00                    43.40447
## Dim.3  1.086027e+00     7.757339e+00                    51.16181
## Dim.4  1.055012e+00     7.535799e+00                    58.69761
## Dim.5  1.021816e+00     7.298687e+00                    65.99629
## Dim.6  9.643398e-01     6.888141e+00                    72.88444
## Dim.7  9.596465e-01     6.854618e+00                    79.73905
## Dim.8  9.418214e-01     6.727296e+00                    86.46635
## Dim.9  9.069626e-01     6.478304e+00                    92.94465
## Dim.10 8.979714e-01     6.414082e+00                    99.35873
## Dim.11 8.977715e-02     6.412654e-01                   100.00000
## Dim.12 8.728133e-32     6.234380e-31                   100.00000
## Dim.13 4.175243e-32     2.982316e-31                   100.00000
## Dim.14 2.106113e-32     1.504366e-31                   100.00000

10 principal components account for 99.35% of the data

#Plotting the PC using a scree plot
fviz_eig(df.pca, addlabels = T, ylim = c(0, 50))

# looking at the pca objects 

str(df.pca)

## List of 5
##  $ sdev    : num [1:14] 2.22 1.07 1.04 1.03 1.01 ...
##  $ rotation: num [1:14, 1:14] 0.0226 -0.0125 -0.0282 0.0174 0.2912 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:14] "branch" "customer.type" "gender" "product.line" ...
##   .. ..$ : chr [1:14] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:14] 1.99 1.5 1.5 3.45 55.67 ...
##   ..- attr(*, "names")= chr [1:14] "branch" "customer.type" "gender" "product.line" ...
##  $ scale   : Named num [1:14] 0.818 0.5 0.5 1.715 26.495 ...
##   ..- attr(*, "names")= chr [1:14] "branch" "customer.type" "gender" "product.line" ...
##  $ x       : num [1:1000, 1:14] 2.041 -2.284 0.104 1.459 2.753 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:1000] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:14] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

#Using visualization to check the contribution of each feature. 
fviz_pca_var(df.pca, col.var = "blue")

#Plotting the contribution of attributes in each PC
fviz_contrib(df.pca,choice = "var", axes = 1)

fviz_contrib(df.pca, choice = "var", axes = 2)

fviz_contrib(df.pca, choice = "var", axes = 3)

fviz_contrib(df.pca, choice = "var", axes = 4)

fviz_contrib(df.pca,choice = "var", axes = 5)

fviz_contrib(df.pca, choice = "var", axes = 6)

fviz_contrib(df.pca, choice = "var", axes = 7)

fviz_contrib(df.pca, choice = "var", axes = 8)

fviz_contrib(df.pca,choice = "var", axes = 9)

fviz_contrib(df.pca, choice = "var", axes = 10)

Dim1: cogs, total, tax, gross income, quantity and unit price contribute the most
Dim2: gender, branch, payment and month contribute the most
Dim3: product line, day and customer type contribute the most
Dim4: rating, month and gender contributes the most
Dim5: Customer type, unit price and quantity contribute the most

6. Feature Selection

a) Using Embedded methods

# using the ewkm function from the wskm package
library(wskm)

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked _by_ '.GlobalEnv':
## 
##     histogram

## Loading required package: latticeExtra

## 
## Attaching package: 'latticeExtra'

## The following object is masked from 'package:ggplot2':
## 
##     layer

## Loading required package: fpc

set.seed(2345)    # setting seed for reproducibility

model <- ewkm(df, 3, lambda=0.5, maxiter=100)

# loading the cluster package
library("cluster")

# plotting the clusters of the first 2 principal components
clusplot(df, model$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=1,main='Cluster Analysis for Carrefour')

# Calculating the weights for each variable and cluster 
# Weights are a measure of the relative importance of each variable with regards to the membership of the observations to that cluster.

round(model$weights*100,2)

##   branch customer.type gender product.line unit.price quantity tax payment cogs
## 1      0         99.99      0            0          0        0   0       0    0
## 2      0         50.00     50            0          0        0   0       0    0
## 3      0         99.99      0            0          0        0   0       0    0
##   gross.income rating total month day
## 1            0      0     0     0   0
## 2            0      0     0     0   0
## 3            0      0     0     0   0

Customer type and gender are the most important features according to this method.

b) Using Wrapper method

# Calling the libraries
library(clustvarsel)

## Loading required package: mclust

## Package 'mclust' version 5.4.10
## Type 'citation("mclust")' for citing this R package in publications.

## 
## Attaching package: 'mclust'

## The following object is masked from 'package:psych':
## 
##     sim

## The following object is masked from 'package:purrr':
## 
##     map

## Package 'clustvarsel' version 2.3.4

## Type 'citation("clustvarsel")' for citing this R package in publications.

library(mclust)

# Sequential forward greedy search (default)

out = clustvarsel(df, G = 1:14)
out

## ------------------------------------------------------ 
## Variable selection for Gaussian model-based clustering
## Stepwise (forward/backward) greedy search
## ------------------------------------------------------ 
## 
##  Variable proposed Type of step   BICclust Model  G    BICdiff Decision
##           quantity          Add -4192.1562     E  9   804.0515 Accepted
##                day          Add   460.6869   VEV  9  8894.5229 Accepted
##             gender          Add -8556.5739   VEV 10 -7551.8666 Rejected
##                day       Remove -4192.1562     E  9  8894.5229 Rejected
## 
## Selected subset: quantity, day

# The selection algorithm would indicate that the subset we use for the clustering model is composed of variables quantity and day, other variables are rejected. 
# Having identified the variables that we use, we proceed to build the clustering model:

Subset1 = df[,out$subset]
mod = Mclust(Subset1, G = 1:14)
summary(mod)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEV (ellipsoidal, equal volume and shape) model with 10 components: 
## 
##  log-likelihood    n df       BIC       ICL
##       -3552.247 1000 41 -7387.713 -7503.809
## 
## Clustering table:
##   1   2   3   4   5   6   7   8   9  10 
## 126  90  77  98  99  93 119  93  93 112

# plotting the clusters
plot(mod,c("classification"))

Day and quantity are the most important features according to this method.

c) Using Filter method

# calling the library
library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(corrplot)

## corrplot 0.92 loaded

# Calculating the correlation matrix

correlationMatrix <- cor(df)

# Find attributes that are highly correlated

highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)

# Highly correlated attributes

highlyCorrelated

## [1]  7  9 10

names(df[,highlyCorrelated])

## [1] "tax"          "cogs"         "gross.income"

# Removing Redundant Features 

df1 <- df[-highlyCorrelated]

# Performing our graphical comparison
# ---
# 
par(mfrow = c(1, 2))
corrplot(correlationMatrix, order = "hclust")
corrplot(cor(df1), order = "hclust")

Using filter methods, tax, cogs and gross income are the irrelevant variables in the dataset.

7. CONCLUSION

Branch A is the most visited
The most frequent time that transactions happen in the sores is between 12.30pm and 1.52pm
Most products bought have a unit price between 90 and 100
The most bought product line is fashion accessories while the least bought product line is health and beauty products.
Most customers go to the stores on Sundays.
The stores mostly have fashion accessories products, which are mostly bought by women.
Food and beverages have the most totals and taxes.
In January, sports and travel products were the most bought while health and beauty products were the least bought.
In February, food and beverages were the most bought and home while lifestyle products the least bought.
In March, home and lifestyle products were the most bought while sports and travel products were the least bought.
The feature selection methods gives the following variables as the most important: payment, gender, month, day, quantity, total, branch, unit price, customer type, rating and product line. A detailed analysis of the given features has been done above under the EDA section.

8. RECOMMENDATIONS

Ensure availability of more items in the other branches to improve the number of visits.
Do regular promotions of products during other weekdays so the stores can be visited often.
Offer quality products and probably give discounts during months where particular product lines are mostly bought by customers.

DIMENSIONALITY REDUCTION AND FEATURE SELECTION

Rehema

2022-06-10

1. DEFINING THE QUESTION

a) Specifying the Question

b) Defining the Metric of Success

c) Understanding the Context

d) Recording the Experimental Design

2. LOADING AND PREVIEWING THE DATA

Loading the dataset

Previewing the data

3. DATA CLEANING AND MANIPULATION

a) Validation

b) Consistency

c) Completeness

d) Uniformity

e) Outliers

f) More cleaning actions

4. Exploratory Data Analysis

a) Univariate Analysis

b) Bivariate Analysis

Categorical - Categorical

Numerical - numerical

Numerical - Categorical

c) Multivariate Analysis

5. Dimensionality Reduction using PCA

6. Feature Selection

a) Using Embedded methods

b) Using Wrapper method

c) Using Filter method

7. CONCLUSION

8. RECOMMENDATIONS