Name: Marwan Otrok

1. Introduction

1.1 Customer Segmentation:

A company with a large customer base can group its customers into clusters or segments that share demographic, behavioral, or geographic characteristics. This process, called segmentation, helps organizations understand their customers’ behavior and traits so that they can design more personalized marketing strategies for specific groups. Data-driven segmentation can give a company a competitive advantage over rivals in the market and support business growth.

1.2 Clustering and Dimension Reduction:

Clustering algorithms are often used to perform customer segmentation, which involves dividing customers into distinct groups based on their shared characteristics. The primary goal of clustering is to group data objects into different subsets, so that objects within the same subset are similar to one another, while those in different subsets are dissimilar. Dimension reduction is a technique used to reduce the number of variables or features in a dataset while retaining as much of the original information as possible, thus making the data more manageable and easier to analyze.
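
As a minimal illustration of the two ideas, the sketch below runs k-means and PCA with base R on simulated data. The two groups, their sizes, and the four numeric features are hypothetical and are not taken from the hotel dataset.

```r
set.seed(42)

# Hypothetical data: two well-separated groups of 50 customers, 4 numeric features
group_a <- matrix(rnorm(50 * 4, mean = 0), ncol = 4)
group_b <- matrix(rnorm(50 * 4, mean = 6), ncol = 4)
x <- scale(rbind(group_a, group_b))  # standardize so every feature has equal weight

# Clustering: partition the 100 observations into 2 groups
km <- kmeans(x, centers = 2, nstart = 10)
table(km$cluster)

# Dimension reduction: share of total variance captured by each principal component
pca <- prcomp(x)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 2)
```

When the first few components capture most of the variance, the data can be plotted or clustered in that reduced space with little loss of information.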

1.3 Problem:

In practice, data sets extracted from property management systems in the hospitality industry often include not only numerical data types, but also categorical data. This can pose a challenge because conventional clustering algorithms that rely on distance-based similarity calculations are not well-suited for clustering binary and mixed/categorical attributes. For example, the k-means algorithm is a popular choice for clustering large datasets, but it faces issues when calculating the cost function using Euclidean distance, which is only appropriate for numerical data. Similarly, dimension reduction can also pose a challenge when dealing with mixed or categorical data in the hospitality industry. Traditional dimension reduction techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) are only applicable to numerical data. In many cases, datasets comprise diverse types of variables, and dismissing nominal variables may lead to a notable reduction in analytical efficacy. Consequently, algorithms capable of handling mixed-type data can prove invaluable in such scenarios.
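
The issue with Euclidean distance on categorical data can be seen in a small sketch. The three channel labels below are hypothetical example values; the point is only that integer coding imposes an arbitrary order on categories, whereas one-hot coding does not.

```r
# Hypothetical toy data: three customers who differ only in booking channel
channel <- factor(c("Corporate", "Direct", "Travel Agent/Operator"))

# Integer coding: categories become 1, 2, 3 and acquire an artificial order
int_coded <- as.integer(channel)
d_int <- as.matrix(dist(int_coded))

# One-hot (dummy) coding: one 0/1 column per category
one_hot <- model.matrix(~ channel - 1)
d_hot <- as.matrix(dist(one_hot))

d_int  # customer 1 appears twice as far from customer 3 as from customer 2
d_hot  # every pair of distinct categories is equally far apart
```

One-hot coding removes the artificial ordering, but the resulting distances still depend on how the binary columns are scaled relative to the numeric ones, which is one motivation for methods designed for mixed data.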

1.4 Objective of this Analysis:

The primary aim of this article is to introduce and compare several methods of cluster analysis and dimension reduction applied to continuous and mixed data from “A Hotel’s customers personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015-2018)”. The analysis is divided into two parts. In the first part, only continuous and count variables are selected, and the K-means, PAM, and CLARA algorithms are applied to the data. In the second part, some categorical variables are added to the previous dataset and two types of clustering are applied to the new data: k-prototypes, and k-means on data whose categorical variables are converted to binary values and treated as numeric. Lastly, factor analysis of mixed data (FAMD) is used to perform dimension reduction and explore the underlying structure of the mixed data. FAMD can be thought of as an extension of traditional factor analysis that handles mixed data types. It constructs factors from the original data that capture the underlying relationships between the variables, and it can be used to identify the most important features for a given problem.

1.5 Dataset

The dataset was obtained from the data article “A Hotel’s customers personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015-2018)”. The article offers a hotel customer dataset covering three years, along with a detailed description of its features, and the data is well suited to clustering and segmentation. The authors published the dataset to help address the lack of real-world business data available for educational purposes. It contains 31 variables describing a total of 83,590 instances (customers) and comprises three full years of customer behavioral data.

Some of the algorithms used in this analysis are computationally complex and cannot be executed efficiently on a large dataset. While clustering algorithms such as CLARA or CLARANS can handle large datasets, not all of the methods used here have comparable alternatives. For this reason, the analysis is limited to customers with Portuguese nationality. They are among the top 3 nationalities to book and stay at this hotel, and it is interesting to explore the segmentation of local guests further. Additionally, observations with missing values in any of the variables used in the analysis were excluded. After cleaning and simplification, the dataset contains 6,639 observations with 11 continuous/count variables and 15 categorical variables (13 of them binary).

Link to the article: https://doi.org/10.1016/j.dib.2020.106583

1.5.1 Continuous and count data (used in both the first and second parts)

  1. Age: Customer’s age (in years) at the last day of the extraction period.

  2. DaysSinceCreation: Number of days since the customer record was created (number of days elapsed between the creation date and the last day of the extraction period)

  3. AverageLeadTime: The average number of days elapsed between the customer’s booking date and arrival date. In other words, this variable is calculated by dividing the sum of the number of days elapsed between the moment each booking was made and its arrival date, by the total of bookings made by the customer

  4. TotalRevenue: Total amount spent by the customer on lodging and other expenses (the sum of LodgingRevenue and OtherRevenue).

  5. BookingsCanceled: Number of bookings the customer made but subsequently canceled (the customer informed the hotel he/she would not come to stay)

  6. BookingsNoShowed: Number of bookings the customer made but that subsequently resulted in a “no-show” (the customer did not cancel, but did not check in to stay at the hotel)

  7. BookingsCheckedIn: Number of bookings the customer made that ended in a stay

  8. PersonsNights: The total number of persons/nights that the customer stayed at the hotel. This value is calculated by summing the persons/nights of all the customer’s checked-in bookings. The persons/nights of each booking is the number of nights stayed multiplied by the sum of adults and children

  9. RoomNights: Total of room/nights the customer stayed at the hotel (checked-in bookings). Room/nights are the multiplication of the number of rooms of each booking by the number of nights of the booking

  10. DaysSinceLastStay: The number of days elapsed between the last day of the extraction and the customer’s last arrival date (of a checked-in booking). A value of −1 indicates the customer never stayed at the hotel

  11. DaysSinceFirstStay: The number of days elapsed between the last day of the extraction and the customer’s first arrival date (of a checked-in booking). A value of −1 indicates the customer never stayed at the hotel
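
The definitions of AverageLeadTime, PersonsNights, and RoomNights above can be checked with a small worked example. The booking-level table below is hypothetical (the published dataset is already aggregated per customer), and its column names are assumptions made for illustration.

```r
# Hypothetical checked-in bookings for a single customer
bookings <- data.frame(
  lead_time = c(30, 60),  # days between booking date and arrival date
  nights    = c(2, 3),
  adults    = c(2, 2),
  children  = c(0, 1),
  rooms     = c(1, 2)
)

# AverageLeadTime: sum of lead times divided by the number of bookings
average_lead_time <- sum(bookings$lead_time) / nrow(bookings)

# PersonsNights: nights * (adults + children), summed over bookings
persons_nights <- sum(bookings$nights * (bookings$adults + bookings$children))

# RoomNights: rooms * nights, summed over bookings
room_nights <- sum(bookings$rooms * bookings$nights)

c(AverageLeadTime = average_lead_time,
  PersonsNights   = persons_nights,
  RoomNights      = room_nights)
```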

1.5.2 Categorical/Binary data for the second Part

  1. DistributionChannel: Categorical - Distribution channel usually used by the customer to make bookings at the hotel

  2. MarketSegment: Categorical - Current market segment of the customer

  3. SRHighFloor: Boolean - Indication if the customer usually asks for a room on a higher floor (0: No, 1: Yes)

  4. SRLowFloor: Boolean - Indication if the customer usually asks for a room on a lower floor (0: No, 1: Yes)

  5. SRAccessibleRoom: Boolean - Indication if the customer usually asks for an accessible room (0: No, 1: Yes)

  6. SRMediumFloor: Boolean - Indication if the customer usually asks for a room on a middle floor (0: No, 1: Yes)

  7. SRBathtub: Boolean - Indication if the customer usually asks for a room with a bathtub (0: No, 1: Yes)

  8. SRShower: Boolean - Indication if the customer usually asks for a room with a shower (0: No, 1: Yes)

  9. SRCrib: Boolean - Indication if the customer usually asks for a crib (0: No, 1: Yes)

  10. SRKingSizeBed: Boolean - Indication if the customer usually asks for a room with a king-size bed (0: No, 1: Yes)

  11. SRTwinBed: Boolean - Indication if the customer usually asks for a room with a twin bed (0: No, 1: Yes)

  12. SRNearElevator: Boolean - Indication if the customer usually asks for a room near the elevator (0: No, 1: Yes)

  13. SRAwayFromElevator: Boolean - Indication if the customer usually asks for a room away from the elevator (0: No, 1: Yes)

  14. SRNoAlcoholInMiniBar: Boolean - Indication if the customer usually asks for a room with no alcohol in the mini-bar (0: No, 1: Yes)

  15. SRQuietRoom: Boolean - Indication if the customer usually asks for a room away from the noise (0: No, 1: Yes)

2.1 Importing Libraries

# Load order matters because of masking:
#   - psych masks ggplot2::`%+%` and ggplot2::alpha
#   - kableExtra masks dplyr::group_rows; gridExtra masks dplyr::combine
#   - clusterSim attaches MASS, which masks dplyr::select
#   - CatEncoders masks base::transform
library(NbClust)
library(tidyverse)   # attaches ggplot2, dplyr, stringr, tidyr, readr, purrr, tibble, forcats
library(kableExtra)
library(factoextra)
library(readxl)
library(psych)
library(GGally)
library(rfm)
library(corrplot)
library(hopkins)
library(gridExtra)
library(ClusterR)
library(fpc)
library(clusterSim)  # also attaches cluster and MASS
library(FactoMineR)
library(knitr)
library(clustMixType)
library(CatEncoders)

2.2 Loading Data

##   ID Nationality  Age DaysSinceCreation
## 1  1         PRT   51               150
## 2  2         PRT NULL              1095
## 3  3         DEU   31              1095
## 4  4         FRA   60              1095
## 5  5         FRA   51              1095
## 6  6         JPN   54              1095
##                                                             NameHash
## 1 0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E
## 2 0x21EDE41906B45079E75385B5AA33287CA09DE1AB86DE66EF88352FD1BE8DE368
## 3 0x31C5E4B74E23231295FDB724AD578C02C4A723F4BA2B4AF99F129EC2F4B3AD41
## 4 0xFF534C83C0EF23D1CE516BC80A65D0197003D27937D485BC549171D52CE13CEA
## 5 0x9C1DEF02C9BE242842C1C1ABF2C5AA249A1EEB4763B47FF457133EE9199F1037
## 6 0x6E70C1504EB27252542F58E4D3C8C83516E093334721A3CE1DD194FE3F98DA0F
##                                                            DocIDHash
## 1 0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137
## 2 0x5FA1E0098A31497057C5A6B9FE9D49FD6DD47CCE7C268E6548699E78E587AAEA
## 3 0xC7CF344F5B03295037595B1337AC905CA188F1B5B3A56C8C6E1A24202C9C672C
## 4 0xBD3823A9B4EC35D6CAF4B27AE423A677C0200DB61E823EE8BE57787729DCBDB8
## 5 0xE175754CF77247B202DD0820F49407C762C14A603B3A6CFEA2A4DC06A5F7E00C
## 6 0xE82EC1D6938A04CF19E1F7F55A402E7ABC686261537A24EAE7FF5CA92646528E
##   AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled BookingsNoShowed
## 1              45         371.00       105.30                1                0
## 2              61         280.00        53.00                0                0
## 3               0           0.00         0.00                0                0
## 4              93         240.00        60.00                0                0
## 5               0           0.00         0.00                0                0
## 6              58         230.00        24.00                0                0
##   BookingsCheckedIn PersonsNights RoomNights DaysSinceLastStay
## 1                 3             8          5               151
## 2                 1            10          5              1100
## 3                 0             0          0                -1
## 4                 1            10          5              1100
## 5                 0             0          0                -1
## 6                 1             4          2              1097
##   DaysSinceFirstStay   DistributionChannel         MarketSegment SRHighFloor
## 1               1074             Corporate             Corporate           0
## 2               1100 Travel Agent/Operator Travel Agent/Operator           0
## 3                 -1 Travel Agent/Operator Travel Agent/Operator           0
## 4               1100 Travel Agent/Operator Travel Agent/Operator           0
## 5                 -1 Travel Agent/Operator Travel Agent/Operator           0
## 6               1097 Travel Agent/Operator                 Other           0
##   SRLowFloor SRAccessibleRoom SRMediumFloor SRBathtub SRShower SRCrib
## 1          0                0             0         0        0      0
## 2          0                0             0         0        0      0
## 3          0                0             0         0        0      0
## 4          0                0             0         0        0      0
## 5          0                0             0         0        0      0
## 6          0                0             0         0        0      0
##   SRKingSizeBed SRTwinBed SRNearElevator SRAwayFromElevator
## 1             0         0              0                  0
## 2             0         0              0                  0
## 3             0         0              0                  0
## 4             0         0              0                  0
## 5             0         0              0                  0
## 6             0         0              0                  0
##   SRNoAlcoholInMiniBar SRQuietRoom
## 1                    0           0
## 2                    0           0
## 3                    0           0
## 4                    0           0
## 5                    0           0
## 6                    0           0

3. Initial Analysis

In the first steps of the analysis, basic statistics are computed to become familiar with the data at hand.

print(class(df))
## [1] "data.frame"
print(dim(df))
## [1] 83590    31
print(str(df))
## 'data.frame':    83590 obs. of  31 variables:
##  $ ID                  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Nationality         : chr  "PRT" "PRT" "DEU" "FRA" ...
##  $ Age                 : chr  "51" "NULL" "31" "60" ...
##  $ DaysSinceCreation   : num  150 1095 1095 1095 1095 ...
##  $ NameHash            : chr  "0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E" "0x21EDE41906B45079E75385B5AA33287CA09DE1AB86DE66EF88352FD1BE8DE368" "0x31C5E4B74E23231295FDB724AD578C02C4A723F4BA2B4AF99F129EC2F4B3AD41" "0xFF534C83C0EF23D1CE516BC80A65D0197003D27937D485BC549171D52CE13CEA" ...
##  $ DocIDHash           : chr  "0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137" "0x5FA1E0098A31497057C5A6B9FE9D49FD6DD47CCE7C268E6548699E78E587AAEA" "0xC7CF344F5B03295037595B1337AC905CA188F1B5B3A56C8C6E1A24202C9C672C" "0xBD3823A9B4EC35D6CAF4B27AE423A677C0200DB61E823EE8BE57787729DCBDB8" ...
##  $ AverageLeadTime     : num  45 61 0 93 0 58 0 38 0 96 ...
##  $ LodgingRevenue      : chr  "371.00" "280.00" "0.00" "240.00" ...
##  $ OtherRevenue        : chr  "105.30" "53.00" "0.00" "60.00" ...
##  $ BookingsCanceled    : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ BookingsNoShowed    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BookingsCheckedIn   : num  3 1 0 1 0 1 0 1 0 1 ...
##  $ PersonsNights       : num  8 10 0 10 0 4 0 10 0 6 ...
##  $ RoomNights          : num  5 5 0 5 0 2 0 5 0 3 ...
##  $ DaysSinceLastStay   : num  151 1100 -1 1100 -1 ...
##  $ DaysSinceFirstStay  : num  1074 1100 -1 1100 -1 ...
##  $ DistributionChannel : chr  "Corporate" "Travel Agent/Operator" "Travel Agent/Operator" "Travel Agent/Operator" ...
##  $ MarketSegment       : chr  "Corporate" "Travel Agent/Operator" "Travel Agent/Operator" "Travel Agent/Operator" ...
##  $ SRHighFloor         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRLowFloor          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRAccessibleRoom    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRMediumFloor       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRBathtub           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRShower            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRCrib              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRKingSizeBed       : num  0 0 0 0 0 0 0 1 1 0 ...
##  $ SRTwinBed           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRNearElevator      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRAwayFromElevator  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRNoAlcoholInMiniBar: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRQuietRoom         : num  0 0 0 0 0 0 0 0 0 0 ...
## NULL
# identifying the number of unique values for each variable
apply(df, 2, function(x) length(unique(x)))
##                   ID          Nationality                  Age 
##                83590                  188                  106 
##    DaysSinceCreation             NameHash            DocIDHash 
##                 1095                80642                76993 
##      AverageLeadTime       LodgingRevenue         OtherRevenue 
##                  418                10257                 4490 
##     BookingsCanceled     BookingsNoShowed    BookingsCheckedIn 
##                    6                    4                   29 
##        PersonsNights           RoomNights    DaysSinceLastStay 
##                   56                   48                 1105 
##   DaysSinceFirstStay  DistributionChannel        MarketSegment 
##                 1108                    4                    7 
##          SRHighFloor           SRLowFloor     SRAccessibleRoom 
##                    2                    2                    2 
##        SRMediumFloor            SRBathtub             SRShower 
##                    2                    2                    2 
##               SRCrib        SRKingSizeBed            SRTwinBed 
##                    2                    2                    2 
##       SRNearElevator   SRAwayFromElevator SRNoAlcoholInMiniBar 
##                    2                    2                    2 
##          SRQuietRoom 
##                    2

3.1 Data Cleaning

The output of the describe function below reveals a few issues that require correction. For instance, the Age and AverageLeadTime variables contain negative values, which need to be removed. Additionally, the Age column has missing values (3,779 of 83,590 records), which are dropped. Relative to the total number of records, this loss is small.
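
The missing ages arise because the raw Age column stores the text "NULL", which as.integer() cannot convert. The sketch below, on a hypothetical four-element sample rather than the real column, shows one way to mark those entries as NA explicitly (which avoids the coercion warning) and to build a single validity filter using the same thresholds as the cleaning steps that follow.

```r
# Hypothetical sample of raw Age values as read from the file
age_raw <- c("51", "NULL", "31", "-11")

# Mark the "NULL" placeholder as missing before converting
age_raw[age_raw == "NULL"] <- NA
age <- as.integer(age_raw)

# Keep only non-missing ages in the plausible range 1 to 100
valid <- !is.na(age) & age > 0 & age < 101
age[valid]
```

Testing for NA first matters: a comparison such as `age > 0` returns NA for missing values, and indexing a data frame with NA produces all-NA rows, which is why the missing ages are removed before the range filters are applied.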

df$Age <- as.integer(df$Age)
## Warning: NAs introduced by coercion
class(df$Age)
## [1] "integer"
describe(df)
##                      vars     n     mean       sd  median  trimmed      mad min
## ID                      1 83590 41795.50 24130.50 41795.5 41795.50 30982.63   1
## Nationality*            2 83590    76.55    45.87    58.0    73.34    37.06   1
## Age                     3 79811    45.40    16.57    46.0    45.68    17.79 -11
## DaysSinceCreation       4 83590   453.64   313.39   397.0   435.98   369.17   0
## NameHash*               5 83590 40325.44 23285.98 40360.5 40328.59 29905.52   1
## DocIDHash*              6 83590 38046.94 21830.94 36692.5 37936.41 27232.40   1
## AverageLeadTime         7 83590    66.20    87.76    29.0    49.18    43.00  -1
## LodgingRevenue*         8 83590  3554.28  3139.98  3173.0  3275.90  4271.37   1
## OtherRevenue*           9 83590  1737.87  1542.10  1445.0  1648.17  2140.87   1
## BookingsCanceled       10 83590     0.00     0.07     0.0     0.00     0.00   0
## BookingsNoShowed       11 83590     0.00     0.03     0.0     0.00     0.00   0
## BookingsCheckedIn      12 83590     0.79     0.70     1.0     0.83     0.00   0
## PersonsNights          13 83590     4.65     4.57     4.0     4.03     4.45   0
## RoomNights             14 83590     2.36     2.28     2.0     2.14     1.48   0
## DaysSinceLastStay      15 83590   401.07   347.20   366.0   378.32   493.71  -1
## DaysSinceFirstStay     16 83590   403.35   347.97   369.0   380.98   498.15  -1
## DistributionChannel*   17 83590     3.62     0.84     4.0     3.81     0.00   1
## MarketSegment*         18 83590     5.63     1.04     6.0     5.73     0.00   1
## SRHighFloor            19 83590     0.05     0.21     0.0     0.00     0.00   0
## SRLowFloor             20 83590     0.00     0.04     0.0     0.00     0.00   0
## SRAccessibleRoom       21 83590     0.00     0.02     0.0     0.00     0.00   0
## SRMediumFloor          22 83590     0.00     0.03     0.0     0.00     0.00   0
## SRBathtub              23 83590     0.00     0.05     0.0     0.00     0.00   0
## SRShower               24 83590     0.00     0.04     0.0     0.00     0.00   0
## SRCrib                 25 83590     0.01     0.11     0.0     0.00     0.00   0
## SRKingSizeBed          26 83590     0.35     0.48     0.0     0.32     0.00   0
## SRTwinBed              27 83590     0.14     0.35     0.0     0.05     0.00   0
## SRNearElevator         28 83590     0.00     0.02     0.0     0.00     0.00   0
## SRAwayFromElevator     29 83590     0.00     0.06     0.0     0.00     0.00   0
## SRNoAlcoholInMiniBar   30 83590     0.00     0.01     0.0     0.00     0.00   0
## SRQuietRoom            31 83590     0.09     0.28     0.0     0.00     0.00   0
##                        max range  skew kurtosis    se
## ID                   83590 83589  0.00    -1.20 83.46
## Nationality*           188   187  0.71    -0.68  0.16
## Age                    122   133 -0.16    -0.29  0.06
## DaysSinceCreation     1095  1095  0.40    -1.14  1.08
## NameHash*            80642 80641  0.00    -1.20 80.54
## DocIDHash*           76993 76992  0.06    -1.14 75.51
## AverageLeadTime        588   589  1.91     4.48  0.30
## LodgingRevenue*      10257 10256  0.46    -1.00 10.86
## OtherRevenue*         4490  4489  0.27    -1.45  5.33
## BookingsCanceled         9     9 58.30  5270.38  0.00
## BookingsNoShowed         3     3 55.13  3587.55  0.00
## BookingsCheckedIn       66    66 26.88  1836.80  0.00
## PersonsNights          116   116  1.93    12.44  0.02
## RoomNights             185   185 11.19   647.65  0.01
## DaysSinceLastStay     1104  1105  0.31    -1.29  1.20
## DaysSinceFirstStay    1186  1187  0.30    -1.29  1.20
## DistributionChannel*     4     3 -1.87     1.87  0.00
## MarketSegment*           7     6 -1.17     1.48  0.00
## SRHighFloor              1     1  4.26    16.11  0.00
## SRLowFloor               1     1 26.56   703.37  0.00
## SRAccessibleRoom         1     1 63.07  3975.38  0.00
## SRMediumFloor            1     1 33.79  1140.04  0.00
## SRBathtub                1     1 18.66   346.21  0.00
## SRShower                 1     1 24.11   579.53  0.00
## SRCrib                   1     1  8.52    70.66  0.00
## SRKingSizeBed            1     1  0.62    -1.62  0.00
## SRTwinBed                1     1  2.04     2.18  0.00
## SRNearElevator           1     1 54.61  2980.29  0.00
## SRAwayFromElevator       1     1 16.80   280.29  0.00
## SRNoAlcoholInMiniBar     1     1 91.41  8353.80  0.00
## SRQuietRoom              1     1  2.90     6.41  0.00
# Removal of Missing Values (NAs)
colSums(is.na(df))
##                   ID          Nationality                  Age 
##                    0                    0                 3779 
##    DaysSinceCreation             NameHash            DocIDHash 
##                    0                    0                    0 
##      AverageLeadTime       LodgingRevenue         OtherRevenue 
##                    0                    0                    0 
##     BookingsCanceled     BookingsNoShowed    BookingsCheckedIn 
##                    0                    0                    0 
##        PersonsNights           RoomNights    DaysSinceLastStay 
##                    0                    0                    0 
##   DaysSinceFirstStay  DistributionChannel        MarketSegment 
##                    0                    0                    0 
##          SRHighFloor           SRLowFloor     SRAccessibleRoom 
##                    0                    0                    0 
##        SRMediumFloor            SRBathtub             SRShower 
##                    0                    0                    0 
##               SRCrib        SRKingSizeBed            SRTwinBed 
##                    0                    0                    0 
##       SRNearElevator   SRAwayFromElevator SRNoAlcoholInMiniBar 
##                    0                    0                    0 
##          SRQuietRoom 
##                    0
dim(df)
## [1] 83590    31
df <- df[complete.cases(df$Age),]
dim(df)
## [1] 79811    31
df[is.na(df$Age),]
##  [1] ID                   Nationality          Age                 
##  [4] DaysSinceCreation    NameHash             DocIDHash           
##  [7] AverageLeadTime      LodgingRevenue       OtherRevenue        
## [10] BookingsCanceled     BookingsNoShowed     BookingsCheckedIn   
## [13] PersonsNights        RoomNights           DaysSinceLastStay   
## [16] DaysSinceFirstStay   DistributionChannel  MarketSegment       
## [19] SRHighFloor          SRLowFloor           SRAccessibleRoom    
## [22] SRMediumFloor        SRBathtub            SRShower            
## [25] SRCrib               SRKingSizeBed        SRTwinBed           
## [28] SRNearElevator       SRAwayFromElevator   SRNoAlcoholInMiniBar
## [31] SRQuietRoom         
## <0 rows> (or 0-length row.names)
colSums(is.na(df))
##                   ID          Nationality                  Age 
##                    0                    0                    0 
##    DaysSinceCreation             NameHash            DocIDHash 
##                    0                    0                    0 
##      AverageLeadTime       LodgingRevenue         OtherRevenue 
##                    0                    0                    0 
##     BookingsCanceled     BookingsNoShowed    BookingsCheckedIn 
##                    0                    0                    0 
##        PersonsNights           RoomNights    DaysSinceLastStay 
##                    0                    0                    0 
##   DaysSinceFirstStay  DistributionChannel        MarketSegment 
##                    0                    0                    0 
##          SRHighFloor           SRLowFloor     SRAccessibleRoom 
##                    0                    0                    0 
##        SRMediumFloor            SRBathtub             SRShower 
##                    0                    0                    0 
##               SRCrib        SRKingSizeBed            SRTwinBed 
##                    0                    0                    0 
##       SRNearElevator   SRAwayFromElevator SRNoAlcoholInMiniBar 
##                    0                    0                    0 
##          SRQuietRoom 
##                    0
# Keep only positive values of Age (drops negative and zero ages)
df <- df[df$Age > 0,]  
# Remove negative values in AverageLeadTime
df <- df[df$AverageLeadTime >= 0,]
dim(df)
## [1] 79743    31
# Remove outliers: ages above 100
df <- df[df$Age < 101,]
dim(df)
## [1] 79735    31
# Count, then drop, customers who never stayed at the hotel (DaysSinceLastStay == -1)
length(which(df$DaysSinceLastStay == -1))
## [1] 19025
df <- df[df$DaysSinceLastStay != -1,]
describe(df$DaysSinceLastStay)
##    vars     n   mean     sd median trimmed   mad min  max range skew kurtosis
## X1    1 60710 518.39 301.16    519  515.47 400.3   0 1104  1104 0.05    -1.18
##      se
## X1 1.22
print(dim(df))
## [1] 60710    31
# Total Revenue
# This new variable will be created (sum of Lodging & Other revenue) to fully represent the monetary dimension
df$TotalRevenue <- as.numeric(df$LodgingRevenue) + as.numeric(df$OtherRevenue)
class(df$TotalRevenue)
## [1] "numeric"
str(df)
## 'data.frame':    60710 obs. of  32 variables:
##  $ ID                  : num  1 4 6 8 10 12 14 16 17 19 ...
##  $ Nationality         : chr  "PRT" "FRA" "JPN" "FRA" ...
##  $ Age                 : int  51 60 54 32 25 58 42 68 72 24 ...
##  $ DaysSinceCreation   : num  150 1095 1095 1095 1095 ...
##  $ NameHash            : chr  "0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E" "0xFF534C83C0EF23D1CE516BC80A65D0197003D27937D485BC549171D52CE13CEA" "0x6E70C1504EB27252542F58E4D3C8C83516E093334721A3CE1DD194FE3F98DA0F" "0x5A3A2D6A659769FCA243FC2A97644D27A75FB9AA4DF38D55145E5BEBDB4F06AA" ...
##  $ DocIDHash           : chr  "0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137" "0xBD3823A9B4EC35D6CAF4B27AE423A677C0200DB61E823EE8BE57787729DCBDB8" "0xE82EC1D6938A04CF19E1F7F55A402E7ABC686261537A24EAE7FF5CA92646528E" "0xB27F5644C88A7148360EFFF55D8F40565BAC3084B4C4A03F9EED486EB2437B2C" ...
##  $ AverageLeadTime     : num  45 93 58 38 96 60 87 11 11 109 ...
##  $ LodgingRevenue      : chr  "371.00" "240.00" "230.00" "535.00" ...
##  $ OtherRevenue        : chr  "105.30" "60.00" "24.00" "94.00" ...
##  $ BookingsCanceled    : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ BookingsNoShowed    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BookingsCheckedIn   : num  3 1 1 1 1 1 1 1 1 1 ...
##  $ PersonsNights       : num  8 10 4 10 6 10 8 6 3 8 ...
##  $ RoomNights          : num  5 5 2 5 3 5 4 3 3 4 ...
##  $ DaysSinceLastStay   : num  151 1100 1097 1100 1098 ...
##  $ DaysSinceFirstStay  : num  1074 1100 1097 1100 1098 ...
##  $ DistributionChannel : chr  "Corporate" "Travel Agent/Operator" "Travel Agent/Operator" "Travel Agent/Operator" ...
##  $ MarketSegment       : chr  "Corporate" "Travel Agent/Operator" "Other" "Other" ...
##  $ SRHighFloor         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ SRLowFloor          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRAccessibleRoom    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRMediumFloor       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRBathtub           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRShower            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRCrib              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRKingSizeBed       : num  0 0 0 1 0 0 0 0 0 1 ...
##  $ SRTwinBed           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRNearElevator      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRAwayFromElevator  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRNoAlcoholInMiniBar: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRQuietRoom         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TotalRevenue        : num  476 300 254 629 243 ...
# Top 15 nationalities that book and stay at this hotel
df %>%
  count(Nationality) %>%
  arrange(desc(n)) %>%
  slice(1:15)
##    Nationality    n
## 1          FRA 9381
## 2          DEU 7890
## 3          PRT 6639
## 4          GBR 6565
## 5          ESP 3937
## 6          ITA 2561
## 7          USA 2430
## 8          BEL 2328
## 9          NLD 2060
## 10         BRA 2022
## 11         CHE 1589
## 12         IRL 1489
## 13         AUT 1094
## 14         CAN  978
## 15         SWE  907

Portuguese guests are the third-largest nationality group, after French and German guests. They were selected for the analysis because they can be considered local guests, on the assumption that local guests are more likely to visit the hotel regularly than foreign guests (e.g. on business trips). Restricting the analysis to a single nationality also keeps the dataset small enough for algorithms whose running time grows quickly with the number of observations.

df_PRT <- subset(df, Nationality %in% c("PRT"))
dim(df_PRT)
## [1] 6639   32

4. First Part of Analysis (Dataset with only Continuous Data)

Clustering algorithms based on distance measures or density estimates are better suited to continuous or interval data. In this first part of the analysis, the clustering algorithms listed below are applied to a dataset that contains only continuous and count variables. The results are then compared, in the second part of the analysis, with those of algorithms specifically designed for mixed data, such as k-prototypes.

  1. K-means
  2. Partitioning Around Medoids (PAM)
  3. Clustering Large Applications based on Randomized Search (CLARANS)

4.1 Dataset Preparation

# This dataset will only include continuous and count variables for Part 1 analysis (excluding mixed data)
df_PRT_1 <- data.frame(df_PRT$Age, df_PRT$DaysSinceCreation, df_PRT$AverageLeadTime, df_PRT$BookingsCanceled, df_PRT$BookingsNoShowed, df_PRT$BookingsCheckedIn, df_PRT$PersonsNights, df_PRT$RoomNights, df_PRT$DaysSinceLastStay, df_PRT$DaysSinceFirstStay, df_PRT$TotalRevenue)
head(df_PRT_1)
##   df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1         51                      150                     45
## 2         39                     1095                      1
## 3         71                     1095                     85
## 4         43                     1095                     78
## 5         38                     1095                     98
## 6         28                     1094                    103
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1                       1                       0                        3
## 2                       0                       0                        9
## 3                       0                       0                        1
## 4                       0                       0                        1
## 5                       0                       0                        1
## 6                       0                       0                        1
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1                    8                 5                      151
## 2                   18                14                      591
## 3                    6                 3                     1098
## 4                    6                 3                     1098
## 5                    3                 3                     1098
## 6                   12                 4                     1098
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1                      1074              476.30
## 2                      1117             1338.99
## 3                      1098              234.60
## 4                      1098              281.10
## 5                      1098              158.10
## 6                      1098             1728.00

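4.2 Data Standardization

Standardizing the data puts all variables on the same scale (mean 0 and standard deviation 1), so that variables with large ranges, such as TotalRevenue, do not dominate the distance calculations. Standardizing before applying the Hopkins statistic and the clustering algorithms helps make the results meaningful and interpretable. Note that standardization does not by itself reduce the influence of outliers, since the mean and standard deviation are themselves sensitive to extreme values.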

# Scaling
df_PRT_scaled <- scale(df_PRT_1)
head(df_PRT_scaled)
##      df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## [1,]  0.4088669                -1.263162            -0.06144993
## [2,] -0.4658037                 1.659981            -0.66333119
## [3,]  1.8666513                 1.659981             0.48571486
## [4,] -0.1742468                 1.659981             0.38996102
## [5,] -0.5386930                 1.659981             0.66354341
## [6,] -1.2675852                 1.656887             0.73193901
##      df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## [1,]              4.66405102             -0.05799844                1.0314323
## [2,]             -0.07785323             -0.05799844                4.4907295
## [3,]             -0.07785323             -0.05799844               -0.1216668
## [4,]             -0.07785323             -0.05799844               -0.1216668
## [5,]             -0.07785323             -0.05799844               -0.1216668
## [6,]             -0.07785323             -0.05799844               -0.1216668
##      df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## [1,]            0.7894798         0.7450725               -1.2076818
## [2,]            2.9915454         3.4657733                0.1441954
## [3,]            0.3490667         0.1404723                1.7019267
## [4,]            0.3490667         0.1404723                1.7019267
## [5,]           -0.3115530         0.1404723                1.7019267
## [6,]            1.6703061         0.4427724                1.7019267
##      df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## [1,]                  1.586808           0.2919791
## [2,]                  1.719782           2.2157474
## [3,]                  1.661026          -0.2470033
## [4,]                  1.661026          -0.1433100
## [5,]                  1.661026          -0.4175956
## [6,]                  1.661026           3.0832260
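
As a minimal sketch on made-up values (the toy data frame below is hypothetical and not taken from the hotel dataset), the following shows what scale() does to each column: it subtracts the column mean and divides by the column standard deviation.

```r
# Toy data (hypothetical values): two variables on very different scales
toy <- data.frame(Age = c(25, 40, 55, 70),
                  TotalRevenue = c(150, 300, 1200, 5000))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy)

# The same result computed by hand for one column
manual_revenue <- (toy$TotalRevenue - mean(toy$TotalRevenue)) / sd(toy$TotalRevenue)

colMeans(toy_scaled)        # approximately 0 for every column
apply(toy_scaled, 2, sd)    # 1 for every column
all.equal(as.numeric(toy_scaled[, "TotalRevenue"]), manual_revenue)
```

After scaling, a difference of one unit means "one standard deviation" for every variable, which is what makes Euclidean distances comparable across variables.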

4.3 Clustering Tendency

The Hopkins statistic is used in cluster analysis to assess the clustering tendency of a dataset. It compares nearest-neighbour distances in the observed data with those of points drawn from a uniform random distribution over the same space. With the convention used here, a value close to 1 indicates that the data is highly clustered, a value around 0.5 indicates randomly distributed data, and a value close to 0 indicates regularly spaced data.

#hopkins(df_PRT_1)
hopkins(df_PRT_scaled)
## [1] 0.9999953

The above value returned by the hopkins function (0.9999953) suggests that the data has a very strong clustering tendency. In other words, the data is unlikely to be randomly distributed and is a good candidate for cluster analysis.
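
To make the statistic more concrete, here is a simplified sketch of how a Hopkins-type statistic can be computed in base R. The function name hopkins_sketch and the synthetic data are assumptions made for illustration; the hopkins function used above may differ in details such as sampling and edge handling.

```r
# Simplified Hopkins statistic: H = sum(u^d) / (sum(u^d) + sum(w^d))
#   w: distance from m sampled data points to their nearest other data point
#   u: distance from m uniform random points (inside the data's bounding box)
#      to their nearest data point
hopkins_sketch <- function(x, m = 20) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)
  idx <- sample(n, m)

  # Nearest-neighbour distances within the data (excluding the point itself)
  dist_mat <- as.matrix(dist(x))
  diag(dist_mat) <- Inf
  w <- apply(dist_mat[idx, , drop = FALSE], 1, min)

  # Uniform random points inside the bounding box of the data
  rnd <- sapply(seq_len(d), function(j) runif(m, min(x[, j]), max(x[, j])))
  u <- apply(rnd, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  sum(u^d) / (sum(u^d) + sum(w^d))
}

set.seed(1)
# Synthetic data: two tight groups versus uniformly scattered points
clustered <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
                   matrix(rnorm(200, mean = 10, sd = 0.3), ncol = 2))
uniform <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_sketch(clustered)   # expected to be close to 1
h_uniform <- hopkins_sketch(uniform)       # expected to be around 0.5
c(clustered = h_clustered, uniform = h_uniform)
```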

4.4 K-means

4.4.1 Optimal Number of Clusters:

Selecting the optimal number of clusters in k-means is an important step in the clustering process. There are several methods that can be used to determine the optimal number of clusters, including the elbow method, the silhouette method, and the gap statistic.

4.4.1.1 Silhouette:

The silhouette method is a way to measure the quality of a clustering solution by assessing the degree of similarity between each data point and its assigned cluster compared to other clusters. The silhouette function in R computes the silhouette width for each data point, which ranges from -1 to 1, with larger values indicating better clustering.
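
As a small worked sketch (toy values, not from the hotel data), the silhouette width of a point i can be computed by hand as (b - a) / max(a, b), where a is the mean distance from i to the other points in its own cluster and b is the mean distance from i to the points of the nearest other cluster:

```r
# Toy data (hypothetical): two well-separated groups in one dimension
x <- c(1, 2, 3, 10, 11, 12)
cluster <- c(1, 1, 1, 2, 2, 2)
d <- as.matrix(dist(x))

silhouette_width <- sapply(seq_along(x), function(i) {
  own <- cluster == cluster[i]
  own[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, own])                 # mean distance within its own cluster
  other <- cluster != cluster[i]
  # Mean distance to each other cluster; keep the nearest one
  b <- min(tapply(d[i, other], cluster[other], mean))
  (b - a) / max(a, b)
})

round(silhouette_width, 2)
mean(silhouette_width)   # average silhouette width of the whole solution
```

For the first point, a = 1.5 and b = 10, which gives a width of 0.85. Values near 1 mean the point sits well inside its cluster; negative values mean it is on average closer to another cluster.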

fviz_nbclust(df_PRT_scaled, FUNcluster = kmeans,
             method = "silhouette", k.max = 8, nboot = 100)

In the plot above, k = 2, 3 and 4 give the highest average silhouette widths, indicating that these solutions produce the best-defined clusters. The remaining values of k have lower average silhouette widths, indicating less well-defined clusters. It is therefore useful to examine the distribution of silhouette widths for each of k = 2, 3 and 4 to further assess the quality of the clustering solution.

4.4.1.2 Within-cluster Sum of Squares (WSS)

The method argument is set to “wss” to select the WSS method for determining the appropriate number of clusters. The resulting plot shows the total WSS for cluster solutions ranging from 1 to 8. The appropriate number of clusters is found at the “elbow” of the plot, where the WSS begins to level off, which is k = 3 in this case.

fviz_nbclust(df_PRT_scaled, FUNcluster = kmeans, method = "wss", k.max = 8, nboot = 100)
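
The quantity behind the elbow plot can be reproduced with base R alone. The sketch below uses synthetic data (an assumption for illustration) and records the total within-cluster sum of squares returned by kmeans for each k:

```r
set.seed(42)
# Synthetic data (hypothetical): three groups in two dimensions
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2),
             cbind(rnorm(50, mean = 0), rnorm(50, mean = 6)))

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is the k after which adding a cluster reduces WSS only slightly
data.frame(k = 1:6, wss = round(wss, 1))
```

For k = 1 the WSS equals the total sum of squares of the data, so the curve always starts at its maximum; the useful information is where the decrease flattens.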

4.4.1.3 Gap Statistic:

The output of fviz_nbclust with method set to “gap_stat” provides the gap statistic values for a range of cluster solutions. The optimal number of clusters is typically the value that maximizes the gap statistic, as this indicates the largest difference between the observed data and the null reference distribution. In the output, the optimal number of clusters is marked by the blue vertical line: k = 2.

fviz_nbclust(df_PRT_scaled, FUNcluster = kmeans, method = "gap_stat", k.max = 8, nboot = 50)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 331950)
## Warning: did not converge in 10 iterations
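
The warnings above are raised by kmeans itself, whose defaults are iter.max = 10 and nstart = 1. As a hedged sketch on synthetic data (not the hotel data), the base kmeans call below raises both settings; whether and how such arguments can be forwarded through fviz_nbclust should be checked against its documentation.

```r
set.seed(123)
# Synthetic data (hypothetical): two overlapping groups
toy <- rbind(matrix(rnorm(400, mean = 0), ncol = 2),
             matrix(rnorm(400, mean = 2), ncol = 2))

# More iterations give each run time to converge; several random starts
# reduce the chance of keeping a poor local optimum
fit <- kmeans(toy, centers = 2, iter.max = 50, nstart = 25)

fit$iter     # iterations used by the best run
fit$ifault   # 0 means no convergence problem was reported
```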

4.4.1.4 Akaike Information Criterion (AIC)

AIC values are shown for 1 to 8 clusters. The optimal number of clusters is the value that minimizes the AIC criterion, which in this case is 3, with an AIC value of 45589.2.

opt<-Optimal_Clusters_KMeans(df_PRT_scaled, max_clusters=8, plot_clusters=TRUE, criterion="AIC")

4.4.1.5 Bayesian Information Criterion (BIC)

The optimal number of clusters according to the BIC criterion is the one that minimizes the BIC score, in other words the number of clusters that best balances the goodness of fit of the model against its complexity.
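
The trade-off between fit and complexity can be illustrated with one commonly used approximation for k-means, AIC = WSS + 2mk and BIC = WSS + log(n)mk, where m is the number of variables, k the number of clusters and n the number of observations. This is an assumption made for illustration: the helper kmeans_ic below is hypothetical, and the values computed by Optimal_Clusters_KMeans may be defined differently.

```r
# Approximate information criteria for a fitted k-means model:
#   AIC = WSS + 2 * (m * k),  BIC = WSS + log(n) * (m * k)
kmeans_ic <- function(fit, n) {
  m <- ncol(fit$centers)      # number of variables
  k <- nrow(fit$centers)      # number of clusters
  wss <- fit$tot.withinss     # goodness of fit (lower is better)
  c(AIC = wss + 2 * m * k, BIC = wss + log(n) * m * k)
}

set.seed(7)
# Synthetic data (hypothetical): two groups in two dimensions
toy <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
             matrix(rnorm(200, mean = 5), ncol = 2))

ic <- t(sapply(1:5, function(k) {
  kmeans_ic(kmeans(toy, centers = k, nstart = 10), nrow(toy))
}))
rownames(ic) <- paste0("k=", 1:5)
ic
```

The penalty term grows with k, and the BIC penalty is larger than the AIC penalty whenever log(n) exceeds 2, so BIC tends to favour fewer clusters.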

From the below plot, the optimal number of clusters is k=3.

opt2<-Optimal_Clusters_KMeans(df_PRT_scaled, max_clusters=8, plot_clusters=TRUE, criterion="BIC")

Given the above results, the candidate numbers of clusters for k-means are k = 2, 3 and 4.

4.4.2 K-means Application

k-means clustering is applied to explore the structure of a dataset and identify natural groupings of observations.

# Perform k-means clustering for k = 2
kmeans2 <- eclust(df_PRT_scaled, "kmeans", hc_metric="euclidean", k=2)

fviz_cluster(kmeans2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeans2)
##   cluster size ave.sil.width
## 1       1 3329          0.39
## 2       2 3310          0.24

# Get cluster assignments for each observation
cluster_assignments <- kmeans2$cluster

# Assign the cluster assignments to the original data frame
df_PRT_clusters <- data.frame(df_PRT_scaled, cluster = cluster_assignments)

# Get the mean value of each variable for each cluster
aggregate(df_PRT_clusters[,1:ncol(df_PRT_scaled)], by=list(df_PRT_clusters$cluster), FUN=mean)
##   Group.1  df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1       1  0.02219716                0.8610029            -0.04073196
## 2       2 -0.02232457               -0.8659452             0.04096577
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.05221361             -0.04715477              -0.05879883
## 2              0.05251333              0.04742545               0.05913635
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1          -0.01487938       -0.05258565                0.8597482
## 2           0.01496479        0.05288750               -0.8646833
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1                 0.8605461         -0.09380382
## 2                -0.8654858          0.09434227

For k = 2, the average silhouette width is 0.31, which suggests a weak to moderate clustering structure. The mean value of each variable is computed for each cluster in order to identify which variables are most strongly associated with each cluster: if one cluster has a much higher mean for a particular variable than the other, that variable is likely to be strongly associated with that cluster. This helps explain why cluster 2 shows a lower silhouette coefficient (0.24), which indicates that it may contain points that are not well clustered with the others and could be considered outliers.

Cluster 2 has higher means for variables such as TotalRevenue, RoomNights and BookingsCheckedIn. This is consistent with the low silhouette coefficients of certain data points: they represent customers who spend far more than the others because of multiple bookings at the hotel and a greater number of room nights. These customers are a minority, but they could potentially be categorized as loyal.

# Perform k-means clustering for k = 3
kmeans3 <- eclust(df_PRT_scaled, "kmeans", hc_metric="euclidean", k=3)

fviz_cluster(kmeans3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeans3)
##   cluster size ave.sil.width
## 1       1 3319          0.36
## 2       2 3274          0.30
## 3       3   46         -0.08

# Get cluster assignments for each observation
cluster_assignments <- kmeans3$cluster

# Assign the cluster assignments to the original data frame
df_PRT_clusters <- data.frame(df_PRT_scaled, cluster = cluster_assignments)

# Get the mean value of each variable for each cluster
aggregate(df_PRT_clusters[,1:ncol(df_PRT_scaled)], by=list(df_PRT_clusters$cluster), FUN=mean)
##   Group.1  df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1       1  0.02711539                0.8620822            -0.03616890
## 2       2 -0.03024945               -0.8820315             0.04159471
## 3       3  0.19653745                0.5765284            -0.35079304
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.06070865             -0.04712210              -0.06295221
## 2             -0.04454113             -0.03227149              -0.05351632
## 3              7.55042752              5.69684993               8.35110466
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1          -0.01624132       -0.06382388                0.8601425
## 2          -0.06054171       -0.04160946               -0.8581631
## 3           5.48083698        7.56653979               -0.9823244
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1                 0.8609211         -0.09351338
## 2                -0.8821248          0.02698286
## 3                 0.6669449          4.82671804

For k = 3, the average silhouette width is 0.33, which again suggests a weak to moderate clustering structure. In this case, the potential outliers are grouped in cluster 3 (46 customers), which has a negative average silhouette width (-0.08).

# Perform k-means clustering for k = 4
kmeans4 <- eclust(df_PRT_scaled, "kmeans", hc_metric="euclidean", k=4)

fviz_cluster(kmeans4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeans4)
##   cluster size ave.sil.width
## 1       1 3009          0.39
## 2       2 2743          0.40
## 3       3   21          0.01
## 4       4  866         -0.09

# Get cluster assignments for each observation
cluster_assignments <- kmeans4$cluster

# Assign the cluster assignments to the original data frame
df_PRT_clusters <- data.frame(df_PRT_scaled, cluster = cluster_assignments)

# Get the mean value of each variable for each cluster
aggregate(df_PRT_clusters[,1:ncol(df_PRT_scaled)], by=list(df_PRT_clusters$cluster), FUN=mean)
##   Group.1   df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1       1 -0.009355879                0.8873338             -0.1518803
## 2       2 -0.094740998               -0.9341515             -0.2781166
## 3       3  0.183257427                0.3699388             -0.4184098
## 4       4  0.328150105               -0.1332314              1.4187856
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.06209416             -0.04600157              -0.09139268
## 2             -0.05883720             -0.03606468              -0.07647608
## 3             12.11561485             10.82887921              12.64478724
## 4              0.10831854              0.01147537               0.25315697
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1           -0.1548689        -0.1515803                0.8996174
## 2           -0.2667570        -0.1791304               -0.9042274
## 3            7.0601238         9.7277038               -1.1665695
## 4            1.2118391         0.8581732               -0.2334354
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1                 0.8850805          -0.2033953
## 2                -0.9369314          -0.1704813
## 3                 0.5434815           4.7688032
## 4                -0.1208053           1.1310644

For k = 4, the average silhouette width is 0.33, again a weak to moderate clustering structure. Clusters 3 and 4 have average silhouette widths close to or below zero (0.01 and -0.09), which suggests that they are not well clustered.

However, let's look at other methods to see whether they yield different results.

Calinski-Harabasz index is calculated for the clusters obtained from the k-means clustering algorithm. Calinski-Harabasz index is a measure of the separation between clusters and is computed by comparing the ratio of the between-cluster dispersion and the within-cluster dispersion for a range of cluster solutions.
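
As a sketch of the formula on synthetic data (an assumption for illustration), the index can be computed directly from the sums of squares returned by kmeans: CH = (BSS / (k - 1)) / (WSS / (n - k)), where BSS is the between-cluster and WSS the within-cluster sum of squares.

```r
set.seed(11)
# Synthetic data (hypothetical): two groups in two dimensions
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 10)

n <- nrow(toy)            # number of observations
k <- nrow(fit$centers)    # number of clusters

# Ratio of between-cluster to within-cluster dispersion,
# each divided by its degrees of freedom
ch <- (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
ch
```

Higher values indicate clusters that are compact relative to how far apart they are, so candidate values of k are compared by looking for the largest index.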

round(calinhara(df_PRT_scaled,kmeans2$cluster),digits=2)
## [1] 1712.95
round(calinhara(df_PRT_scaled,kmeans3$cluster),digits=2)
## [1] 2006.59
round(calinhara(df_PRT_scaled,kmeans4$cluster),digits=2)
## [1] 1790.25

The above results show that the Calinski-Harabasz index is highest for k=3 (2006.59), compared with 1712.95 for k=2 and 1790.25 for k=4. This suggests that k=3 might be a good choice for the number of clusters, as it leads to well-separated clusters without over-segmenting the data.

kmeans2$centers
##    df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1  0.02219716                0.8610029            -0.04073196
## 2 -0.02232457               -0.8659452             0.04096577
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.05221361             -0.04715477              -0.05879883
## 2              0.05251333              0.04742545               0.05913635
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1          -0.01487938       -0.05258565                0.8597482
## 2           0.01496479        0.05288750               -0.8646833
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1                 0.8605461         -0.09380382
## 2                -0.8654858          0.09434227
kmeans3$centers
##    df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1  0.02711539                0.8620822            -0.03616890
## 2 -0.03024945               -0.8820315             0.04159471
## 3  0.19653745                0.5765284            -0.35079304
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.06070865             -0.04712210              -0.06295221
## 2             -0.04454113             -0.03227149              -0.05351632
## 3              7.55042752              5.69684993               8.35110466
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1          -0.01624132       -0.06382388                0.8601425
## 2          -0.06054171       -0.04160946               -0.8581631
## 3           5.48083698        7.56653979               -0.9823244
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1                 0.8609211         -0.09351338
## 2                -0.8821248          0.02698286
## 3                 0.6669449          4.82671804

The results above show the cluster centers for k=2 and k=3. For k=3, the three clusters can be described as follows.

Cluster 1 has above-average values in df_PRT.DaysSinceCreation, df_PRT.DaysSinceLastStay and df_PRT.DaysSinceFirstStay, slightly below-average values in df_PRT.TotalRevenue, and close-to-average values in other dimensions.

Cluster 2 has below-average values in df_PRT.DaysSinceCreation, df_PRT.DaysSinceLastStay and df_PRT.DaysSinceFirstStay, slightly above-average values in df_PRT.TotalRevenue, and close-to-average values in other dimensions.

Cluster 3 has very high values in df_PRT.BookingsCanceled, df_PRT.BookingsNoShowed, df_PRT.BookingsCheckedIn, df_PRT.PersonsNights, df_PRT.RoomNights, and df_PRT.TotalRevenue, and below-average values in df_PRT.AverageLeadTime and df_PRT.DaysSinceLastStay. This suggests a small group of frequent customers who make many bookings and generate high revenue, but who also cancel or no-show far more often than average.

4.5 Partitioning Around Medoids (PAM)

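PAM is especially useful when the dissimilarity measure used for clustering is not the Euclidean distance, since it only requires pairwise dissimilarities between observations; it is, however, computationally more demanding than k-means on large datasets. Unlike k-means, which uses the mean of each cluster as its center, PAM uses a medoid: the observation in the cluster with the smallest total dissimilarity to all other observations in that cluster. PAM is more robust to outliers than k-means, because an extreme observation is unlikely to be chosen as a medoid and does not pull the cluster center towards itself.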
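
The difference between a mean-based center and a medoid can be seen on a tiny made-up cluster with one extreme value (the numbers are hypothetical, chosen only for illustration):

```r
# Toy cluster (hypothetical values) with one extreme observation
x <- matrix(c(1, 2, 3, 4, 100), ncol = 1)

# Mean-based center, as used by k-means: pulled towards the outlier
centroid <- colMeans(x)

# Medoid, as used by PAM: the observation with the smallest total
# dissimilarity to all other observations in the cluster
d <- as.matrix(dist(x))
medoid <- x[which.min(rowSums(d)), , drop = FALSE]

centroid   # 22
medoid     # 3
```

The mean (22) lies far from every typical observation, while the medoid (3) is an actual data point in the middle of the bulk of the cluster.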

4.5.1 Optimal Number of Clusters

4.5.1.1 Silhouette
fviz_nbclust(df_PRT_scaled, FUNcluster = cluster::pam, method = "silhouette", k.max = 8, nboot = 100)

In the plot above, we can see that, unlike with k-means, only k=2 has the highest average silhouette width, indicating that it gives the best-defined clusters. The remaining values of k have lower average silhouette widths, indicating less well-defined clusters.

4.5.1.2 Within-cluster Sum of Squares (WSS)
fviz_nbclust(df_PRT_scaled, FUNcluster = cluster::pam, method = "wss", k.max = 8, nboot = 100)

The appropriate number of clusters can be determined from the “elbow” point in the plot, where the WSS begins to level off, which can be either k=2 or k=3 in this case.

# Perform PAM clustering using 2 clusters
pam2 <- eclust(df_PRT_scaled, "pam", hc_metric="euclidean", k=2)

fviz_cluster(pam2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(pam2)
##   cluster size ave.sil.width
## 1       1 3163          0.34
## 2       2 3476          0.29

# Perform PAM clustering using 3 clusters
pam3 <- eclust(df_PRT_scaled, "pam", hc_metric="euclidean", k=3)

fviz_cluster(pam3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(pam3)
##   cluster size ave.sil.width
## 1       1 2274         -0.03
## 2       2 2387          0.35
## 3       3 1978          0.30

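For k=2, the results of PAM are similar to those obtained from k-means, with an average silhouette width of about 0.31 in both cases. For k=3, PAM produces three clusters of comparable size rather than isolating a small group of outliers, and cluster 1 has a negative average silhouette width (-0.03).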

4.7 Principal Component Analysis (PCA)

PCA is often used as a pre-processing step before applying clustering algorithms. It is particularly useful when the original variables are highly correlated or when the number of variables is very high. By reducing the number of variables, PCA can make it easier to visualize the data and identify patterns that may be difficult to see in the original high-dimensional space.

pca <- PCA(df_PRT_scaled, ncp = 2, scale.unit = FALSE, graph = FALSE)
get_eigenvalue(pca)
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.478576524      31.62818694                    31.62819
## Dim.2  2.948872899      26.81197400                    58.44016
## Dim.3  1.300611460      11.82552177                    70.26568
## Dim.4  1.020574814       9.27935056                    79.54503
## Dim.5  0.785002084       7.13745766                    86.68249
## Dim.6  0.638227362       5.80294100                    92.48543
## Dim.7  0.383710116       3.48879928                    95.97423
## Dim.8  0.273390857       2.48574585                    98.45998
## Dim.9  0.133220338       1.21127643                    99.67125
## Dim.10 0.033457133       0.30420157                    99.97546
## Dim.11 0.002699537       0.02454494                   100.00000
fviz_eig(pca, addlabels=TRUE)

The first principal component (Dim.1) has the largest eigenvalue of 3.4786, which accounts for 31.63% of the total variance. The second principal component (Dim.2) has an eigenvalue of 2.9489, explaining an additional 26.81% of the total variance.

The cumulative percentage of variance explained is useful for determining how many principal components to retain in the analysis. In this case, the first two principal components explain 58.44% of the total variance, the first three explain 70.27%, and the first four explain 79.55%. Typically, a threshold of 70-80% is used to decide how many principal components to keep.
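
The eigenvalue table can be reproduced in outline with base R. The sketch below uses prcomp on synthetic data instead of the PCA function used above (both the data and the choice of prcomp are assumptions for illustration) and derives the percentage and cumulative percentage of variance from the eigenvalues:

```r
set.seed(3)
# Synthetic data (hypothetical): three correlated variables plus one independent
z <- rnorm(200)
toy <- cbind(a = z + rnorm(200, sd = 0.3),
             b = z + rnorm(200, sd = 0.3),
             c = -z + rnorm(200, sd = 0.3),
             d = rnorm(200))

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

eigenvalue <- pca_toy$sdev^2                            # variance of each component
variance_percent <- 100 * eigenvalue / sum(eigenvalue)  # share of total variance
cumulative_percent <- cumsum(variance_percent)

# Smallest number of components whose cumulative share reaches 70%
n_keep <- which(cumulative_percent >= 70)[1]

data.frame(eigenvalue, variance_percent, cumulative_percent)
n_keep
```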

pcavar <- get_pca_var(pca)
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)

By looking at both the contribution and the squared cosine (cos2) values in the above visualization, we can identify which variables drive the principal components and are therefore most important for distinguishing between groups of observations. All variables lie far from the origin and have a high contribution to their corresponding principal components, except Age and AverageLeadTime, which lie close to the origin and have a low contribution. For Dim 2, all variables are considered well aligned given their high cos2 values, while for Dim 1 most of the variables are considered moderately aligned.

fviz_contrib(pca, choice = "var", axes = 1, top = 10)

fviz_contrib(pca, choice = "var", axes = 2, top = 10)

The variables with a high contribution to the first and second principal components have the highest impact on each component and are strongly related to the other variables that are well represented by the same component. These variables may be the most important to consider when interpreting the results of the clustering applied to the principal components.

fviz_pca_ind(pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

4.7.1 K-means on PCA

K-means will be applied to the reduced data as shown below.

datapca <- data.frame(pca$ind$coord)

hopkins(datapca)
## [1] 0.9895887
get_clust_tendency(datapca, 100, graph=FALSE, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.9791029

The number of clusters is determined using the elbow method and silhouette analysis; the results obtained are similar to the previous ones (k=2 and k=3).

fviz_nbclust(datapca, FUNcluster = kmeans, method = "silhouette", k.max = 8, nboot = 100)

fviz_nbclust(datapca, FUNcluster = kmeans, method = "wss", k.max = 8, nboot = 100)

kmeanspca3 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=3, graph = TRUE)

fviz_cluster(kmeanspca3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeanspca3)
##   cluster size ave.sil.width
## 1       1 3269          0.51
## 2       2   44          0.35
## 3       3 3326          0.54

kmeanspca2 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=2, graph = TRUE)

fviz_silhouette(kmeanspca2)
##   cluster size ave.sil.width
## 1       1 3338          0.56
## 2       2 3301          0.44

kmeanspca4 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=4, graph = TRUE)

fviz_silhouette(kmeanspca4)
##   cluster size ave.sil.width
## 1       1 3230          0.53
## 2       2  106          0.32
## 3       3    8          0.39
## 4       4 3295          0.55

round(calinhara(datapca,kmeanspca2$cluster),digits=2)
## [1] 3583.4
round(calinhara(datapca,kmeanspca3$cluster),digits=2)
## [1] 5743.88
round(calinhara(datapca,kmeanspca4$cluster),digits=2)
## [1] 6177.33

For k=3, the average silhouette width increases from 0.33 (without PCA) to 0.52 when PCA is applied before k-means, which indicates that the clusters are better separated in the reduced space. Note that the silhouette is computed here on the two retained components, which explain 58.44% of the total variance, so the two values are not measured in the same space. PCA reduces the dimensionality of the data by transforming the original variables into a new set of uncorrelated variables known as principal components. These components are ordered by the variance they capture, and the first few usually account for most of the variability in the data.

5. Second Part of Analysis with Mixed Data

5.1 K-means on Categorical Attributes Converted to Binary Values

Converting categorical attributes to binary values (also known as one-hot encoding) is a common approach to use with k-means clustering.
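
As a minimal sketch of the encoding step (toy values chosen for illustration; the use of model.matrix is an assumption and not necessarily the approach taken later in this analysis), a categorical column can be expanded into one binary indicator column per category:

```r
# Toy data (hypothetical values) with one numeric and one categorical variable
toy <- data.frame(
  Age = c(51, 39, 71, 43),
  DistributionChannel = factor(c("Corporate", "Direct", "Direct",
                                 "Travel Agent/Operator"))
)

# One indicator column per category ("- 1" drops the intercept so that
# no category is left out as a reference level)
dummies <- model.matrix(~ DistributionChannel - 1, data = toy)

# Combine the numeric variable with the binary indicators
encoded <- cbind(Age = toy$Age, dummies)
encoded
```

Each row has exactly one indicator set to 1. The resulting matrix is fully numeric, so k-means can be applied to it, although Euclidean distance on binary indicators remains a rough measure of dissimilarity between categories.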

5.1.1 Data Preparation

In this dataset, the categorical data are included in order to apply other algorithms for mixed data.

df_PRT1 <- subset(df, Nationality %in% c("PRT"))
dim(df_PRT1)
## [1] 6639   32
str(df_PRT1)
## 'data.frame':    6639 obs. of  32 variables:
##  $ ID                  : num  1 31 32 34 55 84 123 127 135 155 ...
##  $ Nationality         : chr  "PRT" "PRT" "PRT" "PRT" ...
##  $ Age                 : int  51 39 71 43 38 28 41 44 45 60 ...
##  $ DaysSinceCreation   : num  150 1095 1095 1095 1095 ...
##  $ NameHash            : chr  "0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E" "0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0" "0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB" "0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725" ...
##  $ DocIDHash           : chr  "0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137" "0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7" "0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5" "0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19" ...
##  $ AverageLeadTime     : num  45 1 85 78 98 103 2 1 0 10 ...
##  $ LodgingRevenue      : chr  "371.00" "1083.50" "180.60" "180.60" ...
##  $ OtherRevenue        : chr  "105.30" "255.49" "54.00" "100.50" ...
##  $ BookingsCanceled    : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ BookingsNoShowed    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BookingsCheckedIn   : num  3 9 1 1 1 1 1 1 1 1 ...
##  $ PersonsNights       : num  8 18 6 6 3 12 9 8 4 4 ...
##  $ RoomNights          : num  5 14 3 3 3 4 3 2 1 2 ...
##  $ DaysSinceLastStay   : num  151 591 1098 1098 1098 ...
##  $ DaysSinceFirstStay  : num  1074 1117 1098 1098 1098 ...
##  $ DistributionChannel : chr  "Corporate" "Direct" "Direct" "Direct" ...
##  $ MarketSegment       : chr  "Corporate" "Corporate" "Direct" "Direct" ...
##  $ SRHighFloor         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRLowFloor          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRAccessibleRoom    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRMediumFloor       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRBathtub           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRShower            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRCrib              : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ SRKingSizeBed       : num  0 1 0 0 0 1 1 1 0 0 ...
##  $ SRTwinBed           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRNearElevator      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRAwayFromElevator  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRNoAlcoholInMiniBar: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRQuietRoom         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TotalRevenue        : num  476 1339 235 281 158 ...
df_PRT_Mixed_cat <- data.frame(df_PRT1$Age, df_PRT1$DaysSinceCreation, df_PRT1$AverageLeadTime, df_PRT1$BookingsCanceled, df_PRT1$BookingsNoShowed, df_PRT1$BookingsCheckedIn, df_PRT1$PersonsNights, df_PRT1$RoomNights, df_PRT1$DaysSinceLastStay, df_PRT1$DaysSinceFirstStay, df_PRT1$TotalRevenue, df_PRT1$DistributionChannel, df_PRT1$MarketSegment,df_PRT1$SRHighFloor,df_PRT1$SRLowFloor,df_PRT1$SRAccessibleRoom,df_PRT1$SRMediumFloor, df_PRT1$SRBathtub, df_PRT1$SRShower, df_PRT1$SRCrib, df_PRT1$SRKingSizeBed, df_PRT1$SRTwinBed, df_PRT1$SRNearElevator, df_PRT1$SRQuietRoom)
head(df_PRT1, 10)
##      ID Nationality Age DaysSinceCreation
## 1     1         PRT  51               150
## 31   31         PRT  39              1095
## 32   32         PRT  71              1095
## 34   34         PRT  43              1095
## 55   55         PRT  38              1095
## 84   84         PRT  28              1094
## 123 123         PRT  41              1094
## 127 127         PRT  44              1094
## 135 135         PRT  45              1094
## 155 155         PRT  60              1094
##                                                               NameHash
## 1   0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E
## 31  0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0
## 32  0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB
## 34  0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725
## 55  0x2A35DCD5D94A57E99396B16436898B1AAF7EFF78870E8B1A8E13E199342455A9
## 84  0x613CA8C9AFBBF94118FA89BCB0DE2E4BDAAEB8C64D55C582B63733BF8E346DB2
## 123 0x016F11562D0B2768F09DD51B23584B07BFE1EF495DD75F82E22C6147AE836E24
## 127 0x919DA3A2B8AC15A644DF27E3C805E24559105B240B156E6EE90993D87A872435
## 135 0x988D34CEA1793D704A3B545CDA0BC0BE066C3972E3733058C006098E9C018C34
## 155 0x051DB339C5BD13EBFE6D0557DAFBFE8674CEDEA2ABB5CE3E14212115002C9BF4
##                                                              DocIDHash
## 1   0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137
## 31  0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7
## 32  0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5
## 34  0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19
## 55  0x8B41F2FBEB60DA7D6896D59D0750771A46CCDC2E28C75EA9F9E659334AC92D66
## 84  0xBD9CD5C6C82AC95EF28B4F079CF49BDD3F94FC584739F11FD131799BDAFC19BD
## 123 0x428A0B30331DA53CBFFD23E890A0DB0243E959987453DC7E8FEDDE6946F49F84
## 127 0x9B1681DDF3FE35DCD8C667C18C4AB503CB06433DD875D9BAF7368EC6E7C8651F
## 135 0xDEF0030D19EC08947FB032692E0AEC4960B29269E4A6A94916E5DF48A339051B
## 155 0x14E8C31EE4B9734A52DD9E968C9DAAE804B47A4BBB06876AD0E951E6007CF861
##     AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled
## 1                45         371.00       105.30                1
## 31                1        1083.50       255.49                0
## 32               85         180.60        54.00                0
## 34               78         180.60       100.50                0
## 55               98         142.80        15.30                0
## 84              103         770.10       957.90                0
## 123               2         404.50       473.30                0
## 127               1         376.00       224.00                0
## 135               0         165.00        24.00                0
## 155              10         249.10       173.50                0
##     BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights
## 1                  0                 3             8          5
## 31                 0                 9            18         14
## 32                 0                 1             6          3
## 34                 0                 1             6          3
## 55                 0                 1             3          3
## 84                 0                 1            12          4
## 123                0                 1             9          3
## 127                0                 1             8          2
## 135                0                 1             4          1
## 155                0                 1             4          2
##     DaysSinceLastStay DaysSinceFirstStay   DistributionChannel
## 1                 151               1074             Corporate
## 31                591               1117                Direct
## 32               1098               1098                Direct
## 34               1098               1098                Direct
## 55               1098               1098 Travel Agent/Operator
## 84               1098               1098                Direct
## 123              1097               1097                Direct
## 127              1096               1096                Direct
## 135              1095               1095                Direct
## 155              1096               1096 Travel Agent/Operator
##             MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor
## 1               Corporate           0          0                0             0
## 31              Corporate           0          0                0             0
## 32                 Direct           0          0                0             0
## 34                 Direct           0          0                0             0
## 55  Travel Agent/Operator           0          0                0             0
## 84                 Direct           0          0                0             0
## 123                Direct           0          0                0             0
## 127                Direct           0          0                0             0
## 135                Direct           0          0                0             0
## 155 Travel Agent/Operator           0          0                0             0
##     SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator
## 1           0        0      0             0         0              0
## 31          0        0      0             1         0              0
## 32          0        0      0             0         0              0
## 34          0        0      1             0         0              0
## 55          0        0      0             0         0              0
## 84          0        0      0             1         0              0
## 123         0        0      1             1         0              0
## 127         0        0      0             1         0              0
## 135         0        0      0             0         0              0
## 155         0        0      0             0         0              0
##     SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom TotalRevenue
## 1                    0                    0           0       476.30
## 31                   0                    0           0      1338.99
## 32                   0                    0           0       234.60
## 34                   0                    0           0       281.10
## 55                   0                    0           0       158.10
## 84                   0                    0           0      1728.00
## 123                  0                    0           0       877.80
## 127                  0                    0           0       600.00
## 135                  0                    0           0       189.00
## 155                  0                    0           0       422.60
str(df_PRT_Mixed_cat)
## 'data.frame':    6639 obs. of  24 variables:
##  $ df_PRT1.Age                : int  51 39 71 43 38 28 41 44 45 60 ...
##  $ df_PRT1.DaysSinceCreation  : num  150 1095 1095 1095 1095 ...
##  $ df_PRT1.AverageLeadTime    : num  45 1 85 78 98 103 2 1 0 10 ...
##  $ df_PRT1.BookingsCanceled   : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.BookingsNoShowed   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.BookingsCheckedIn  : num  3 9 1 1 1 1 1 1 1 1 ...
##  $ df_PRT1.PersonsNights      : num  8 18 6 6 3 12 9 8 4 4 ...
##  $ df_PRT1.RoomNights         : num  5 14 3 3 3 4 3 2 1 2 ...
##  $ df_PRT1.DaysSinceLastStay  : num  151 591 1098 1098 1098 ...
##  $ df_PRT1.DaysSinceFirstStay : num  1074 1117 1098 1098 1098 ...
##  $ df_PRT1.TotalRevenue       : num  476 1339 235 281 158 ...
##  $ df_PRT1.DistributionChannel: chr  "Corporate" "Direct" "Direct" "Direct" ...
##  $ df_PRT1.MarketSegment      : chr  "Corporate" "Corporate" "Direct" "Direct" ...
##  $ df_PRT1.SRHighFloor        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRLowFloor         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRAccessibleRoom   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRMediumFloor      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRBathtub          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRShower           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRCrib             : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ df_PRT1.SRKingSizeBed      : num  0 1 0 0 0 1 1 1 0 0 ...
##  $ df_PRT1.SRTwinBed          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRNearElevator     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRQuietRoom        : num  0 0 0 0 0 0 0 0 0 0 ...
# Fit a label encoder on the DistributionChannel categories
labs <- LabelEncoder.fit(df_PRT1$DistributionChannel)

# Replace the DistributionChannel labels with their integer codes
df_PRT1$DistributionChannel <- transform(labs, df_PRT1$DistributionChannel)

# Fit a label encoder on the MarketSegment categories
labs <- LabelEncoder.fit(df_PRT1$MarketSegment)

# Replace the MarketSegment labels with their integer codes
df_PRT1$MarketSegment <- transform(labs, df_PRT1$MarketSegment)

str(df_PRT1)
## 'data.frame':    6639 obs. of  32 variables:
##  $ ID                  : num  1 31 32 34 55 84 123 127 135 155 ...
##  $ Nationality         : chr  "PRT" "PRT" "PRT" "PRT" ...
##  $ Age                 : int  51 39 71 43 38 28 41 44 45 60 ...
##  $ DaysSinceCreation   : num  150 1095 1095 1095 1095 ...
##  $ NameHash            : chr  "0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E" "0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0" "0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB" "0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725" ...
##  $ DocIDHash           : chr  "0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137" "0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7" "0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5" "0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19" ...
##  $ AverageLeadTime     : num  45 1 85 78 98 103 2 1 0 10 ...
##  $ LodgingRevenue      : chr  "371.00" "1083.50" "180.60" "180.60" ...
##  $ OtherRevenue        : chr  "105.30" "255.49" "54.00" "100.50" ...
##  $ BookingsCanceled    : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ BookingsNoShowed    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BookingsCheckedIn   : num  3 9 1 1 1 1 1 1 1 1 ...
##  $ PersonsNights       : num  8 18 6 6 3 12 9 8 4 4 ...
##  $ RoomNights          : num  5 14 3 3 3 4 3 2 1 2 ...
##  $ DaysSinceLastStay   : num  151 591 1098 1098 1098 ...
##  $ DaysSinceFirstStay  : num  1074 1117 1098 1098 1098 ...
##  $ DistributionChannel : int  1 2 2 2 4 2 2 2 2 4 ...
##  $ MarketSegment       : int  3 3 4 4 7 4 4 4 4 7 ...
##  $ SRHighFloor         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRLowFloor          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRAccessibleRoom    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRMediumFloor       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRBathtub           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRShower            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRCrib              : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ SRKingSizeBed       : num  0 1 0 0 0 1 1 1 0 0 ...
##  $ SRTwinBed           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRNearElevator      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRAwayFromElevator  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRNoAlcoholInMiniBar: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRQuietRoom         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TotalRevenue        : num  476 1339 235 281 158 ...
df_PRT_Cat <- data.frame(df_PRT1$Age, df_PRT1$DaysSinceCreation, df_PRT1$AverageLeadTime, df_PRT1$BookingsCanceled, df_PRT1$BookingsNoShowed, df_PRT1$BookingsCheckedIn, df_PRT1$PersonsNights, df_PRT1$RoomNights, df_PRT1$DaysSinceLastStay, df_PRT1$DaysSinceFirstStay, df_PRT1$TotalRevenue, df_PRT1$DistributionChannel, df_PRT1$MarketSegment,df_PRT1$SRLowFloor,df_PRT1$SRAccessibleRoom,df_PRT1$SRMediumFloor, df_PRT1$SRBathtub, df_PRT1$SRShower, df_PRT1$SRCrib, df_PRT1$SRKingSizeBed, df_PRT1$SRTwinBed, df_PRT1$SRNearElevator, df_PRT1$SRQuietRoom)
head(df_PRT_Cat, 10)
##    df_PRT1.Age df_PRT1.DaysSinceCreation df_PRT1.AverageLeadTime
## 1           51                       150                      45
## 2           39                      1095                       1
## 3           71                      1095                      85
## 4           43                      1095                      78
## 5           38                      1095                      98
## 6           28                      1094                     103
## 7           41                      1094                       2
## 8           44                      1094                       1
## 9           45                      1094                       0
## 10          60                      1094                      10
##    df_PRT1.BookingsCanceled df_PRT1.BookingsNoShowed df_PRT1.BookingsCheckedIn
## 1                         1                        0                         3
## 2                         0                        0                         9
## 3                         0                        0                         1
## 4                         0                        0                         1
## 5                         0                        0                         1
## 6                         0                        0                         1
## 7                         0                        0                         1
## 8                         0                        0                         1
## 9                         0                        0                         1
## 10                        0                        0                         1
##    df_PRT1.PersonsNights df_PRT1.RoomNights df_PRT1.DaysSinceLastStay
## 1                      8                  5                       151
## 2                     18                 14                       591
## 3                      6                  3                      1098
## 4                      6                  3                      1098
## 5                      3                  3                      1098
## 6                     12                  4                      1098
## 7                      9                  3                      1097
## 8                      8                  2                      1096
## 9                      4                  1                      1095
## 10                     4                  2                      1096
##    df_PRT1.DaysSinceFirstStay df_PRT1.TotalRevenue df_PRT1.DistributionChannel
## 1                        1074               476.30                           1
## 2                        1117              1338.99                           2
## 3                        1098               234.60                           2
## 4                        1098               281.10                           2
## 5                        1098               158.10                           4
## 6                        1098              1728.00                           2
## 7                        1097               877.80                           2
## 8                        1096               600.00                           2
## 9                        1095               189.00                           2
## 10                       1096               422.60                           4
##    df_PRT1.MarketSegment df_PRT1.SRLowFloor df_PRT1.SRAccessibleRoom
## 1                      3                  0                        0
## 2                      3                  0                        0
## 3                      4                  0                        0
## 4                      4                  0                        0
## 5                      7                  0                        0
## 6                      4                  0                        0
## 7                      4                  0                        0
## 8                      4                  0                        0
## 9                      4                  0                        0
## 10                     7                  0                        0
##    df_PRT1.SRMediumFloor df_PRT1.SRBathtub df_PRT1.SRShower df_PRT1.SRCrib
## 1                      0                 0                0              0
## 2                      0                 0                0              0
## 3                      0                 0                0              0
## 4                      0                 0                0              1
## 5                      0                 0                0              0
## 6                      0                 0                0              0
## 7                      0                 0                0              1
## 8                      0                 0                0              0
## 9                      0                 0                0              0
## 10                     0                 0                0              0
##    df_PRT1.SRKingSizeBed df_PRT1.SRTwinBed df_PRT1.SRNearElevator
## 1                      0                 0                      0
## 2                      1                 0                      0
## 3                      0                 0                      0
## 4                      0                 0                      0
## 5                      0                 0                      0
## 6                      1                 0                      0
## 7                      1                 0                      0
## 8                      1                 0                      0
## 9                      0                 0                      0
## 10                     0                 0                      0
##    df_PRT1.SRQuietRoom
## 1                    0
## 2                    0
## 3                    0
## 4                    0
## 5                    0
## 6                    0
## 7                    0
## 8                    0
## 9                    0
## 10                   0
str(df_PRT_Cat)
## 'data.frame':    6639 obs. of  23 variables:
##  $ df_PRT1.Age                : int  51 39 71 43 38 28 41 44 45 60 ...
##  $ df_PRT1.DaysSinceCreation  : num  150 1095 1095 1095 1095 ...
##  $ df_PRT1.AverageLeadTime    : num  45 1 85 78 98 103 2 1 0 10 ...
##  $ df_PRT1.BookingsCanceled   : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.BookingsNoShowed   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.BookingsCheckedIn  : num  3 9 1 1 1 1 1 1 1 1 ...
##  $ df_PRT1.PersonsNights      : num  8 18 6 6 3 12 9 8 4 4 ...
##  $ df_PRT1.RoomNights         : num  5 14 3 3 3 4 3 2 1 2 ...
##  $ df_PRT1.DaysSinceLastStay  : num  151 591 1098 1098 1098 ...
##  $ df_PRT1.DaysSinceFirstStay : num  1074 1117 1098 1098 1098 ...
##  $ df_PRT1.TotalRevenue       : num  476 1339 235 281 158 ...
##  $ df_PRT1.DistributionChannel: int  1 2 2 2 4 2 2 2 2 4 ...
##  $ df_PRT1.MarketSegment      : int  3 3 4 4 7 4 4 4 4 7 ...
##  $ df_PRT1.SRLowFloor         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRAccessibleRoom   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRMediumFloor      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRBathtub          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRShower           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRCrib             : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ df_PRT1.SRKingSizeBed      : num  0 1 0 0 0 1 1 1 0 0 ...
##  $ df_PRT1.SRTwinBed          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRNearElevator     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT1.SRQuietRoom        : num  0 0 0 0 0 0 0 0 0 0 ...
dim(df_PRT_Cat)
## [1] 6639   23
df_PRT_Cat_scaled <- scale(df_PRT_Cat)
head(df_PRT_Cat_scaled)
##      df_PRT1.Age df_PRT1.DaysSinceCreation df_PRT1.AverageLeadTime
## [1,]   0.4088669                 -1.263162             -0.06144993
## [2,]  -0.4658037                  1.659981             -0.66333119
## [3,]   1.8666513                  1.659981              0.48571486
## [4,]  -0.1742468                  1.659981              0.38996102
## [5,]  -0.5386930                  1.659981              0.66354341
## [6,]  -1.2675852                  1.656887              0.73193901
##      df_PRT1.BookingsCanceled df_PRT1.BookingsNoShowed
## [1,]               4.66405102              -0.05799844
## [2,]              -0.07785323              -0.05799844
## [3,]              -0.07785323              -0.05799844
## [4,]              -0.07785323              -0.05799844
## [5,]              -0.07785323              -0.05799844
## [6,]              -0.07785323              -0.05799844
##      df_PRT1.BookingsCheckedIn df_PRT1.PersonsNights df_PRT1.RoomNights
## [1,]                 1.0314323             0.7894798          0.7450725
## [2,]                 4.4907295             2.9915454          3.4657733
## [3,]                -0.1216668             0.3490667          0.1404723
## [4,]                -0.1216668             0.3490667          0.1404723
## [5,]                -0.1216668            -0.3115530          0.1404723
## [6,]                -0.1216668             1.6703061          0.4427724
##      df_PRT1.DaysSinceLastStay df_PRT1.DaysSinceFirstStay df_PRT1.TotalRevenue
## [1,]                -1.2076818                   1.586808            0.2919791
## [2,]                 0.1441954                   1.719782            2.2157474
## [3,]                 1.7019267                   1.661026           -0.2470033
## [4,]                 1.7019267                   1.661026           -0.1433100
## [5,]                 1.7019267                   1.661026           -0.4175956
## [6,]                 1.7019267                   1.661026            3.0832260
##      df_PRT1.DistributionChannel df_PRT1.MarketSegment df_PRT1.SRLowFloor
## [1,]                  -1.9056150            -1.4831535        -0.04073544
## [2,]                  -1.0338877            -1.4831535        -0.04073544
## [3,]                  -1.0338877            -0.7622621        -0.04073544
## [4,]                  -1.0338877            -0.7622621        -0.04073544
## [5,]                   0.7095668             1.4004123        -0.04073544
## [6,]                  -1.0338877            -0.7622621        -0.04073544
##      df_PRT1.SRAccessibleRoom df_PRT1.SRMediumFloor df_PRT1.SRBathtub
## [1,]              -0.01735787           -0.03883679        -0.0475831
## [2,]              -0.01735787           -0.03883679        -0.0475831
## [3,]              -0.01735787           -0.03883679        -0.0475831
## [4,]              -0.01735787           -0.03883679        -0.0475831
## [5,]              -0.01735787           -0.03883679        -0.0475831
## [6,]              -0.01735787           -0.03883679        -0.0475831
##      df_PRT1.SRShower df_PRT1.SRCrib df_PRT1.SRKingSizeBed df_PRT1.SRTwinBed
## [1,]      -0.03684103     -0.1191848            -0.6558488        -0.3125756
## [2,]      -0.03684103     -0.1191848             1.5245119        -0.3125756
## [3,]      -0.03684103     -0.1191848            -0.6558488        -0.3125756
## [4,]      -0.03684103      8.3890700            -0.6558488        -0.3125756
## [5,]      -0.03684103     -0.1191848            -0.6558488        -0.3125756
## [6,]      -0.03684103     -0.1191848             1.5245119        -0.3125756
##      df_PRT1.SRNearElevator df_PRT1.SRQuietRoom
## [1,]            -0.01227294           -0.216006
## [2,]            -0.01227294           -0.216006
## [3,]            -0.01227294           -0.216006
## [4,]            -0.01227294           -0.216006
## [5,]            -0.01227294           -0.216006
## [6,]            -0.01227294           -0.216006
hopkins(df_PRT_Cat_scaled)
## [1] 1
fviz_nbclust(df_PRT_Cat_scaled, FUNcluster = kmeans, method = "silhouette", k.max = 8, nboot = 100)

fviz_nbclust(df_PRT_Cat_scaled, FUNcluster = kmeans, method = "wss", k.max = 8, nboot = 100)

kmeanscat2 <- eclust(df_PRT_Cat_scaled, "kmeans", hc_metric="euclidean", k=2)

kmeanscat3 <- eclust(df_PRT_Cat_scaled, "kmeans", hc_metric="euclidean", k=3)

kmeanscat4 <- eclust(df_PRT_Cat_scaled, "kmeans", hc_metric="euclidean", k=4)

fviz_silhouette(kmeanscat2)
##   cluster size ave.sil.width
## 1       1 4251          0.19
## 2       2 2388          0.13

fviz_silhouette(kmeanscat3)
##   cluster size ave.sil.width
## 1       1 3761          0.17
## 2       2 2217          0.20
## 3       3  661         -0.06

fviz_silhouette(kmeanscat4)
##   cluster size ave.sil.width
## 1       1 1952          0.13
## 2       2 1592          0.20
## 3       3  647         -0.11
## 4       4 2448          0.24

round(calinhara(df_PRT_Cat_scaled,kmeanscat2$cluster),digits=2)
## [1] 569.19
round(calinhara(df_PRT_Cat_scaled,kmeanscat3$cluster),digits=2) 
## [1] 460.48
round(calinhara(df_PRT_Cat_scaled,kmeanscat4$cluster),digits=2)
## [1] 494.16

The resulting average silhouette width is the lowest of all the algorithms applied, for both the mixed data and the continuous-only data.
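
For reference, the silhouette width of a point compares its mean distance to the other members of its own cluster (a) with its mean distance to the nearest other cluster (b), as s = (b - a) / max(a, b). The base R sketch below computes it on four made-up points, not on the hotel data; the helper name `silhouette_widths` is hypothetical and only illustrates the definition.

```r
# Toy illustration of silhouette width; data and helper name are hypothetical.
silhouette_widths <- function(x, cluster) {
  d <- as.matrix(dist(x))             # pairwise Euclidean distances
  n <- nrow(d)
  sapply(seq_len(n), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                   # exclude the point itself
    if (!any(own)) return(0)          # singleton cluster: width set to 0
    a <- mean(d[i, own])              # mean distance within own cluster
    others <- setdiff(unique(cluster), cluster[i])
    # mean distance to each other cluster; keep the nearest one
    b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
}

# Two tight, well-separated groups of two points each.
toy <- matrix(c(0, 0,
                0, 1,
                5, 5,
                5, 6), ncol = 2, byrow = TRUE)
toy_cluster <- c(1, 1, 2, 2)

sil <- silhouette_widths(toy, toy_cluster)
mean(sil)                             # average silhouette width of the toy data
```

Widths close to 1 indicate compact, well-separated clusters. The average widths of 0.13 to 0.24 reported above, together with the negative widths of the smallest cluster, point to weak cluster structure.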

Using k-means on binary data can produce biased or misleading results, particularly when the data are imbalanced. K-means minimizes the sum of squared Euclidean distances between data points and cluster centroids, and the centroid of a binary variable is a fractional mean that does not correspond to either of its two values, so the objective is not well suited to binary data.
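
One way to see the imbalance problem: after `scale()`, a rare binary flag takes a very large z-score whenever it equals 1, so a handful of rows can dominate the squared Euclidean distance. The sketch below uses a made-up flag that is 1 in 2% of 1,000 rows; it is an illustration under that assumption, not a computation on the hotel data.

```r
# Hypothetical rare binary flag: 20 ones among 1000 rows (2%).
flag <- c(rep(1, 20), rep(0, 980))
z <- as.numeric(scale(flag))          # z-scores, as scale() produces them

z_one  <- z[1]                        # z-score when the flag is 1
z_zero <- z[1000]                     # z-score when the flag is 0

# Squared contribution of this single column to the distance
# between a row with the flag set and a row without it.
gap_sq <- (z_one - z_zero)^2
round(c(z_one = z_one, z_zero = z_zero, gap_sq = gap_sq), 2)
```

The same effect is visible in the scaled output above, where SRCrib equal to 1 maps to about 8.39 while 0 maps to about -0.12.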

Therefore, while it is possible to apply k-means to categorical attributes that have been converted to binary or numeric codes, the potential drawbacks should be kept in mind and the results evaluated carefully.
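
A further caveat with label encoding is that integer codes impose an order and a spacing on nominal categories that do not exist in the data. The base R sketch below uses `factor()` as a stand-in for the encoder (an assumption: its codes are alphabetical and need not match the codes produced earlier) and contrasts the result with one-hot encoding via `model.matrix()`.

```r
# Toy channel values; factor() stands in for the label encoder (assumption).
channel <- c("Corporate", "Direct", "Travel Agent/Operator", "Direct")

codes <- as.integer(factor(channel))  # alphabetical integer codes
codes

# With integer codes, "Corporate" appears twice as far from
# "Travel Agent/Operator" as from "Direct", which is an encoding artifact.
d_codes <- as.matrix(dist(codes))

# One-hot encoding keeps every pair of distinct categories equally far apart.
onehot <- model.matrix(~ channel - 1)
d_onehot <- as.matrix(dist(onehot))
round(d_onehot, 2)
```

One-hot columns are themselves binary, so the concerns about k-means on binary data still apply; the sketch only shows that the artificial ordering disappears.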

5.2 K-Prototype

K-prototype is an extension of the k-means clustering algorithm for datasets that contain both categorical and continuous variables, and it clusters the data on both types of variables at once. For customer segmentation, K-prototype can group customers by demographic variables such as age, gender, and income, together with behavioral variables such as purchase history and preferences.
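
As a rough sketch of the idea, k-prototypes measures the dissimilarity between an observation and a cluster prototype as the squared Euclidean distance on the numeric variables plus a weight (usually written lambda) times the number of mismatching categorical variables. The helper name `kproto_dissimilarity`, the value of `lambda`, and the toy values below are assumptions for illustration only, not the implementation used in this analysis.

```r
# Hypothetical helper: mixed-type dissimilarity in the spirit of k-prototypes.
kproto_dissimilarity <- function(x_num, x_cat, p_num, p_cat, lambda) {
  numeric_part <- sum((x_num - p_num)^2)     # squared Euclidean distance
  categorical_part <- sum(x_cat != p_cat)    # simple matching: count mismatches
  numeric_part + lambda * categorical_part
}

# Toy customer and prototype (numeric values assumed to be already scaled).
customer_num  <- c(Age = 0.4, TotalRevenue = 0.3)
customer_cat  <- c(DistributionChannel = "Corporate", MarketSegment = "Corporate")
prototype_num <- c(Age = -0.1, TotalRevenue = 0.1)
prototype_cat <- c(DistributionChannel = "Direct", MarketSegment = "Corporate")

d <- kproto_dissimilarity(customer_num, customer_cat,
                          prototype_num, prototype_cat, lambda = 0.5)
d
```

The weight lambda controls how much a categorical mismatch counts relative to the numeric distance: larger values let the categorical variables drive the clustering, and a value of 0 reduces the measure to the squared Euclidean distance on the numeric variables alone. In R, `kproto()` from the clustMixType package implements k-prototypes.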

5.2.1 Data Preparation

df_PRT_Mixed <- data.frame(df_PRT$Age, df_PRT$DaysSinceCreation, df_PRT$AverageLeadTime, df_PRT$BookingsCanceled, df_PRT$BookingsNoShowed, df_PRT$BookingsCheckedIn, df_PRT$PersonsNights, df_PRT$RoomNights, df_PRT$DaysSinceLastStay, df_PRT$DaysSinceFirstStay, df_PRT$TotalRevenue, df_PRT$DistributionChannel, df_PRT$MarketSegment,df_PRT$SRHighFloor,df_PRT$SRLowFloor,df_PRT$SRAccessibleRoom,df_PRT$SRMediumFloor, df_PRT$SRBathtub, df_PRT$SRShower, df_PRT$SRCrib, df_PRT$SRKingSizeBed, df_PRT$SRTwinBed, df_PRT$SRNearElevator, df_PRT$SRQuietRoom)
head(df_PRT, 10)
##      ID Nationality Age DaysSinceCreation
## 1     1         PRT  51               150
## 31   31         PRT  39              1095
## 32   32         PRT  71              1095
## 34   34         PRT  43              1095
## 55   55         PRT  38              1095
## 84   84         PRT  28              1094
## 123 123         PRT  41              1094
## 127 127         PRT  44              1094
## 135 135         PRT  45              1094
## 155 155         PRT  60              1094
##                                                               NameHash
## 1   0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E
## 31  0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0
## 32  0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB
## 34  0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725
## 55  0x2A35DCD5D94A57E99396B16436898B1AAF7EFF78870E8B1A8E13E199342455A9
## 84  0x613CA8C9AFBBF94118FA89BCB0DE2E4BDAAEB8C64D55C582B63733BF8E346DB2
## 123 0x016F11562D0B2768F09DD51B23584B07BFE1EF495DD75F82E22C6147AE836E24
## 127 0x919DA3A2B8AC15A644DF27E3C805E24559105B240B156E6EE90993D87A872435
## 135 0x988D34CEA1793D704A3B545CDA0BC0BE066C3972E3733058C006098E9C018C34
## 155 0x051DB339C5BD13EBFE6D0557DAFBFE8674CEDEA2ABB5CE3E14212115002C9BF4
##                                                              DocIDHash
## 1   0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137
## 31  0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7
## 32  0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5
## 34  0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19
## 55  0x8B41F2FBEB60DA7D6896D59D0750771A46CCDC2E28C75EA9F9E659334AC92D66
## 84  0xBD9CD5C6C82AC95EF28B4F079CF49BDD3F94FC584739F11FD131799BDAFC19BD
## 123 0x428A0B30331DA53CBFFD23E890A0DB0243E959987453DC7E8FEDDE6946F49F84
## 127 0x9B1681DDF3FE35DCD8C667C18C4AB503CB06433DD875D9BAF7368EC6E7C8651F
## 135 0xDEF0030D19EC08947FB032692E0AEC4960B29269E4A6A94916E5DF48A339051B
## 155 0x14E8C31EE4B9734A52DD9E968C9DAAE804B47A4BBB06876AD0E951E6007CF861
##     AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled
## 1                45         371.00       105.30                1
## 31                1        1083.50       255.49                0
## 32               85         180.60        54.00                0
## 34               78         180.60       100.50                0
## 55               98         142.80        15.30                0
## 84              103         770.10       957.90                0
## 123               2         404.50       473.30                0
## 127               1         376.00       224.00                0
## 135               0         165.00        24.00                0
## 155              10         249.10       173.50                0
##     BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights
## 1                  0                 3             8          5
## 31                 0                 9            18         14
## 32                 0                 1             6          3
## 34                 0                 1             6          3
## 55                 0                 1             3          3
## 84                 0                 1            12          4
## 123                0                 1             9          3
## 127                0                 1             8          2
## 135                0                 1             4          1
## 155                0                 1             4          2
##     DaysSinceLastStay DaysSinceFirstStay   DistributionChannel
## 1                 151               1074             Corporate
## 31                591               1117                Direct
## 32               1098               1098                Direct
## 34               1098               1098                Direct
## 55               1098               1098 Travel Agent/Operator
## 84               1098               1098                Direct
## 123              1097               1097                Direct
## 127              1096               1096                Direct
## 135              1095               1095                Direct
## 155              1096               1096 Travel Agent/Operator
##             MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor
## 1               Corporate           0          0                0             0
## 31              Corporate           0          0                0             0
## 32                 Direct           0          0                0             0
## 34                 Direct           0          0                0             0
## 55  Travel Agent/Operator           0          0                0             0
## 84                 Direct           0          0                0             0
## 123                Direct           0          0                0             0
## 127                Direct           0          0                0             0
## 135                Direct           0          0                0             0
## 155 Travel Agent/Operator           0          0                0             0
##     SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator
## 1           0        0      0             0         0              0
## 31          0        0      0             1         0              0
## 32          0        0      0             0         0              0
## 34          0        0      1             0         0              0
## 55          0        0      0             0         0              0
## 84          0        0      0             1         0              0
## 123         0        0      1             1         0              0
## 127         0        0      0             1         0              0
## 135         0        0      0             0         0              0
## 155         0        0      0             0         0              0
##     SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom TotalRevenue
## 1                    0                    0           0       476.30
## 31                   0                    0           0      1338.99
## 32                   0                    0           0       234.60
## 34                   0                    0           0       281.10
## 55                   0                    0           0       158.10
## 84                   0                    0           0      1728.00
## 123                  0                    0           0       877.80
## 127                  0                    0           0       600.00
## 135                  0                    0           0       189.00
## 155                  0                    0           0       422.60
str(df_PRT_Mixed)
## 'data.frame':    6639 obs. of  24 variables:
##  $ df_PRT.Age                : int  51 39 71 43 38 28 41 44 45 60 ...
##  $ df_PRT.DaysSinceCreation  : num  150 1095 1095 1095 1095 ...
##  $ df_PRT.AverageLeadTime    : num  45 1 85 78 98 103 2 1 0 10 ...
##  $ df_PRT.BookingsCanceled   : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.BookingsNoShowed   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.BookingsCheckedIn  : num  3 9 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.PersonsNights      : num  8 18 6 6 3 12 9 8 4 4 ...
##  $ df_PRT.RoomNights         : num  5 14 3 3 3 4 3 2 1 2 ...
##  $ df_PRT.DaysSinceLastStay  : num  151 591 1098 1098 1098 ...
##  $ df_PRT.DaysSinceFirstStay : num  1074 1117 1098 1098 1098 ...
##  $ df_PRT.TotalRevenue       : num  476 1339 235 281 158 ...
##  $ df_PRT.DistributionChannel: chr  "Corporate" "Direct" "Direct" "Direct" ...
##  $ df_PRT.MarketSegment      : chr  "Corporate" "Corporate" "Direct" "Direct" ...
##  $ df_PRT.SRHighFloor        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.SRLowFloor         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.SRAccessibleRoom   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.SRMediumFloor      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.SRBathtub          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.SRShower           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.SRCrib             : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ df_PRT.SRKingSizeBed      : num  0 1 0 0 0 1 1 1 0 0 ...
##  $ df_PRT.SRTwinBed          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.SRNearElevator     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ df_PRT.SRQuietRoom        : num  0 0 0 0 0 0 0 0 0 0 ...
# Scaling continuous variables
df_PRT$Age <- scale(df_PRT$Age)
df_PRT$DaysSinceCreation <- scale(df_PRT$DaysSinceCreation)
df_PRT$AverageLeadTime <- scale(df_PRT$AverageLeadTime)
df_PRT$BookingsCanceled <- scale(df_PRT$BookingsCanceled)
df_PRT$BookingsNoShowed <- scale(df_PRT$BookingsNoShowed)
df_PRT$BookingsCheckedIn <- scale(df_PRT$BookingsCheckedIn)
df_PRT$PersonsNights <- scale(df_PRT$PersonsNights)
df_PRT$RoomNights <- scale(df_PRT$RoomNights)
df_PRT$DaysSinceLastStay <- scale(df_PRT$DaysSinceLastStay)
df_PRT$DaysSinceFirstStay <- scale(df_PRT$DaysSinceFirstStay)
df_PRT$TotalRevenue <- scale(df_PRT$TotalRevenue)

# Convert categorical variables to factors using the "as.factor" function
df_PRT$DistributionChannel <- as.factor(df_PRT$DistributionChannel)
df_PRT$MarketSegment <- as.factor(df_PRT$MarketSegment)
df_PRT$SRHighFloor <- as.factor(df_PRT$SRHighFloor)
df_PRT$SRLowFloor <- as.factor(df_PRT$SRLowFloor)
df_PRT$SRAccessibleRoom <- as.factor(df_PRT$SRAccessibleRoom)
df_PRT$SRMediumFloor <- as.factor(df_PRT$SRMediumFloor)
df_PRT$SRBathtub <- as.factor(df_PRT$SRBathtub)
df_PRT$SRShower <- as.factor(df_PRT$SRShower)
df_PRT$SRCrib <- as.factor(df_PRT$SRCrib)
df_PRT$SRKingSizeBed <- as.factor(df_PRT$SRKingSizeBed)
df_PRT$SRTwinBed <- as.factor(df_PRT$SRTwinBed)
df_PRT$SRNearElevator <- as.factor(df_PRT$SRNearElevator)
df_PRT$SRQuietRoom <- as.factor(df_PRT$SRQuietRoom)
# Assemble the clustering data set: 11 scaled numeric variables followed by 13 factors
df_PRT_Mixed_K <- data.frame(
  df_PRT$Age, df_PRT$DaysSinceCreation, df_PRT$AverageLeadTime,
  df_PRT$BookingsCanceled, df_PRT$BookingsNoShowed, df_PRT$BookingsCheckedIn,
  df_PRT$PersonsNights, df_PRT$RoomNights, df_PRT$DaysSinceLastStay,
  df_PRT$DaysSinceFirstStay, df_PRT$TotalRevenue,
  df_PRT$DistributionChannel, df_PRT$MarketSegment,
  df_PRT$SRHighFloor, df_PRT$SRLowFloor, df_PRT$SRAccessibleRoom,
  df_PRT$SRMediumFloor, df_PRT$SRBathtub, df_PRT$SRShower, df_PRT$SRCrib,
  df_PRT$SRKingSizeBed, df_PRT$SRTwinBed, df_PRT$SRNearElevator,
  df_PRT$SRQuietRoom
)
head(df_PRT, 10)
##      ID Nationality         Age DaysSinceCreation
## 1     1         PRT  0.40886692         -1.263162
## 31   31         PRT -0.46580373          1.659981
## 32   32         PRT  1.86665135          1.659981
## 34   34         PRT -0.17424685          1.659981
## 55   55         PRT -0.53869296          1.659981
## 84   84         PRT -1.26758517          1.656887
## 123 123         PRT -0.32002529          1.656887
## 127 127         PRT -0.10135763          1.656887
## 135 135         PRT -0.02846841          1.656887
## 155 155         PRT  1.06486991          1.656887
##                                                               NameHash
## 1   0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E
## 31  0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0
## 32  0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB
## 34  0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725
## 55  0x2A35DCD5D94A57E99396B16436898B1AAF7EFF78870E8B1A8E13E199342455A9
## 84  0x613CA8C9AFBBF94118FA89BCB0DE2E4BDAAEB8C64D55C582B63733BF8E346DB2
## 123 0x016F11562D0B2768F09DD51B23584B07BFE1EF495DD75F82E22C6147AE836E24
## 127 0x919DA3A2B8AC15A644DF27E3C805E24559105B240B156E6EE90993D87A872435
## 135 0x988D34CEA1793D704A3B545CDA0BC0BE066C3972E3733058C006098E9C018C34
## 155 0x051DB339C5BD13EBFE6D0557DAFBFE8674CEDEA2ABB5CE3E14212115002C9BF4
##                                                              DocIDHash
## 1   0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137
## 31  0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7
## 32  0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5
## 34  0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19
## 55  0x8B41F2FBEB60DA7D6896D59D0750771A46CCDC2E28C75EA9F9E659334AC92D66
## 84  0xBD9CD5C6C82AC95EF28B4F079CF49BDD3F94FC584739F11FD131799BDAFC19BD
## 123 0x428A0B30331DA53CBFFD23E890A0DB0243E959987453DC7E8FEDDE6946F49F84
## 127 0x9B1681DDF3FE35DCD8C667C18C4AB503CB06433DD875D9BAF7368EC6E7C8651F
## 135 0xDEF0030D19EC08947FB032692E0AEC4960B29269E4A6A94916E5DF48A339051B
## 155 0x14E8C31EE4B9734A52DD9E968C9DAAE804B47A4BBB06876AD0E951E6007CF861
##     AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled
## 1       -0.06144993         371.00       105.30       4.66405102
## 31      -0.66333119        1083.50       255.49      -0.07785323
## 32       0.48571486         180.60        54.00      -0.07785323
## 34       0.38996102         180.60       100.50      -0.07785323
## 55       0.66354341         142.80        15.30      -0.07785323
## 84       0.73193901         770.10       957.90      -0.07785323
## 123     -0.64965207         404.50       473.30      -0.07785323
## 127     -0.66333119         376.00       224.00      -0.07785323
## 135     -0.67701031         165.00        24.00      -0.07785323
## 155     -0.54021911         249.10       173.50      -0.07785323
##     BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights
## 1        -0.05799844         1.0314323    0.78947982  0.7450725
## 31       -0.05799844         4.4907295    2.99154543  3.4657733
## 32       -0.05799844        -0.1216668    0.34906670  0.1404723
## 34       -0.05799844        -0.1216668    0.34906670  0.1404723
## 55       -0.05799844        -0.1216668   -0.31155298  0.1404723
## 84       -0.05799844        -0.1216668    1.67030607  0.4427724
## 123      -0.05799844        -0.1216668    1.00968638  0.1404723
## 127      -0.05799844        -0.1216668    0.78947982 -0.1618278
## 135      -0.05799844        -0.1216668   -0.09134642 -0.4641279
## 155      -0.05799844        -0.1216668   -0.09134642 -0.1618278
##     DaysSinceLastStay DaysSinceFirstStay   DistributionChannel
## 1          -1.2076818           1.586808             Corporate
## 31          0.1441954           1.719782                Direct
## 32          1.7019267           1.661026                Direct
## 34          1.7019267           1.661026                Direct
## 55          1.7019267           1.661026 Travel Agent/Operator
## 84          1.7019267           1.661026                Direct
## 123         1.6988542           1.657934                Direct
## 127         1.6957818           1.654841                Direct
## 135         1.6927093           1.651749                Direct
## 155         1.6957818           1.654841 Travel Agent/Operator
##             MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor
## 1               Corporate           0          0                0             0
## 31              Corporate           0          0                0             0
## 32                 Direct           0          0                0             0
## 34                 Direct           0          0                0             0
## 55  Travel Agent/Operator           0          0                0             0
## 84                 Direct           0          0                0             0
## 123                Direct           0          0                0             0
## 127                Direct           0          0                0             0
## 135                Direct           0          0                0             0
## 155 Travel Agent/Operator           0          0                0             0
##     SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator
## 1           0        0      0             0         0              0
## 31          0        0      0             1         0              0
## 32          0        0      0             0         0              0
## 34          0        0      1             0         0              0
## 55          0        0      0             0         0              0
## 84          0        0      0             1         0              0
## 123         0        0      1             1         0              0
## 127         0        0      0             1         0              0
## 135         0        0      0             0         0              0
## 155         0        0      0             0         0              0
##     SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom TotalRevenue
## 1                    0                    0           0    0.2919791
## 31                   0                    0           0    2.2157474
## 32                   0                    0           0   -0.2470033
## 34                   0                    0           0   -0.1433100
## 55                   0                    0           0   -0.4175956
## 84                   0                    0           0    3.0832260
## 123                  0                    0           0    1.1873100
## 127                  0                    0           0    0.5678258
## 135                  0                    0           0   -0.3486897
## 155                  0                    0           0    0.1722300
str(df_PRT_Mixed_K)
## 'data.frame':    6639 obs. of  24 variables:
##  $ df_PRT.Age                : num  0.409 -0.466 1.867 -0.174 -0.539 ...
##  $ df_PRT.DaysSinceCreation  : num  -1.26 1.66 1.66 1.66 1.66 ...
##  $ df_PRT.AverageLeadTime    : num  -0.0614 -0.6633 0.4857 0.39 0.6635 ...
##  $ df_PRT.BookingsCanceled   : num  4.6641 -0.0779 -0.0779 -0.0779 -0.0779 ...
##  $ df_PRT.BookingsNoShowed   : num  -0.058 -0.058 -0.058 -0.058 -0.058 ...
##  $ df_PRT.BookingsCheckedIn  : num  1.031 4.491 -0.122 -0.122 -0.122 ...
##  $ df_PRT.PersonsNights      : num  0.789 2.992 0.349 0.349 -0.312 ...
##  $ df_PRT.RoomNights         : num  0.745 3.466 0.14 0.14 0.14 ...
##  $ df_PRT.DaysSinceLastStay  : num  -1.208 0.144 1.702 1.702 1.702 ...
##  $ df_PRT.DaysSinceFirstStay : num  1.59 1.72 1.66 1.66 1.66 ...
##  $ df_PRT.TotalRevenue       : num  0.292 2.216 -0.247 -0.143 -0.418 ...
##  $ df_PRT.DistributionChannel: Factor w/ 4 levels "Corporate","Direct",..: 1 2 2 2 4 2 2 2 2 4 ...
##  $ df_PRT.MarketSegment      : Factor w/ 7 levels "Aviation","Complementary",..: 3 3 4 4 7 4 4 4 4 7 ...
##  $ df_PRT.SRHighFloor        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRLowFloor         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRAccessibleRoom   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRMediumFloor      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRBathtub          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRShower           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRCrib             : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
##  $ df_PRT.SRKingSizeBed      : Factor w/ 2 levels "0","1": 1 2 1 1 1 2 2 2 1 1 ...
##  $ df_PRT.SRTwinBed          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRNearElevator     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRQuietRoom        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
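
The scaling and factor conversion steps above can also be written as two loops over column names, which avoids repeating one line per variable. The following is only a minimal sketch on a four-row toy data frame whose values are copied from the first rows printed above; the names toy, num_cols and cat_cols are introduced here for illustration and are not part of the original analysis. Because only four rows are used, the scaled values will differ from those in the report.

```r
# Toy data frame: first four rows of a few columns, as printed in the report
toy <- data.frame(
  Age           = c(51, 39, 71, 43),
  TotalRevenue  = c(476.30, 1338.99, 234.60, 281.10),
  MarketSegment = c("Corporate", "Corporate", "Direct", "Direct"),
  SRCrib        = c(0, 0, 0, 1)
)

# Hypothetical helper vectors naming the columns of each type
num_cols <- c("Age", "TotalRevenue")
cat_cols <- c("MarketSegment", "SRCrib")

# scale() returns a one-column matrix; as.numeric() keeps a plain numeric vector
toy[num_cols] <- lapply(toy[num_cols], function(x) as.numeric(scale(x)))

# Convert the categorical and binary columns to factors
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

str(toy)
```

After this step every numeric column has mean 0 and standard deviation 1, and every categorical column is a factor, which is the input format kproto() expects.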

5.2.2 Lambda Parameter

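In k-prototype clustering, lambda is the parameter that controls the trade-off between the numeric and the categorical part of the dissimilarity. The distance between an observation and a cluster prototype is the squared Euclidean distance over the numeric variables plus lambda times the number of categorical variables on which the two disagree (simple matching), so lambda is the weight given to the categorical features. When lambda is set to 0, the categorical variables are ignored and the algorithm becomes equivalent to k-means on the numeric variables. As lambda increases, categorical mismatches increasingly dominate the cluster assignment. Lambda is not bounded above by 1; the value estimated for this data set is 6.284189. By default, lambda is estimated as the ratio of the average variance of the numeric variables to the average variation of the categorical variables, but the estimate can instead be based on the standard deviation of the numeric variables (num.method = 2 rather than num.method = 1 in lambdaest()). The standard deviation is expressed in the units of the data rather than in squared units, so extreme values inflate it less than they inflate the variance, which makes the resulting lambda somewhat less sensitive to outliers. Because the numeric variables were standardized in the previous step, each of them has variance and standard deviation equal to 1, so the two options should give the same estimate for this particular data set.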
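
To make the role of lambda concrete, the following base R sketch writes out the mixed distance and the variance-based heuristic by hand. It is an illustration under stated assumptions, not the clustMixType source: the function names kproto_dist and lambda_heuristic and the toy data are made up here, and the categorical variation is assumed to be 1 minus the sum of squared level proportions.

```r
# Mixed distance between one observation and one prototype (hypothetical helper):
# squared Euclidean part + lambda * number of categorical mismatches
kproto_dist <- function(x_num, p_num, x_cat, p_cat, lambda) {
  sum((x_num - p_num)^2) + lambda * sum(x_cat != p_cat)
}

# Heuristic lambda (hypothetical helper): average numeric spread divided by
# average categorical variation, where the variation of a factor is taken as
# 1 - sum(p_j^2) over its level proportions p_j
lambda_heuristic <- function(df, use_sd = FALSE) {
  is_num  <- sapply(df, is.numeric)
  spread  <- sapply(df[is_num], if (use_sd) sd else var)
  cat_var <- sapply(df[!is_num], function(f) 1 - sum(prop.table(table(f))^2))
  mean(spread) / mean(cat_var)
}

# Toy data (made-up values, not the hotel data)
toy <- data.frame(
  stays   = c(1, 2, 3, 4),
  channel = factor(c("Direct", "Direct", "Corporate", "Direct")),
  crib    = factor(c("0", "1", "0", "0"))
)

lambda_var <- lambda_heuristic(toy)                 # variance-based
lambda_sd  <- lambda_heuristic(toy, use_sd = TRUE)  # standard-deviation-based

# Distance of the first toy row to a made-up prototype (stays = 2.5, "Direct", "1")
kproto_dist(x_num = 1, p_num = 2.5,
            x_cat = c("Direct", "0"), p_cat = c("Direct", "1"),
            lambda = lambda_var)
```

The same ratio can be checked against the lambdaest() output printed later in this report: an average numeric variance of 1 divided by an average categorical variation of 0.1591295 gives approximately 6.284189.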

5.2.3 Optimal Number of Clusters (for Lambda calculated with variance)

5.2.3.1 Silhouette
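# Fit k-prototypes for k = 2, ..., 10 and store the silhouette index of each solution
Essil <- sapply(2:10, function(i) {
  kpres <- kproto(df_PRT_Mixed_K, k = i)  # lambda is estimated internally
  validation_kproto(method = "silhouette", object = kpres)
})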
## # NAs in variables:
##                 df_PRT.Age   df_PRT.DaysSinceCreation 
##                          0                          0 
##     df_PRT.AverageLeadTime    df_PRT.BookingsCanceled 
##                          0                          0 
##    df_PRT.BookingsNoShowed   df_PRT.BookingsCheckedIn 
##                          0                          0 
##       df_PRT.PersonsNights          df_PRT.RoomNights 
##                          0                          0 
##   df_PRT.DaysSinceLastStay  df_PRT.DaysSinceFirstStay 
##                          0                          0 
##        df_PRT.TotalRevenue df_PRT.DistributionChannel 
##                          0                          0 
##       df_PRT.MarketSegment         df_PRT.SRHighFloor 
##                          0                          0 
##          df_PRT.SRLowFloor    df_PRT.SRAccessibleRoom 
##                          0                          0 
##       df_PRT.SRMediumFloor           df_PRT.SRBathtub 
##                          0                          0 
##            df_PRT.SRShower              df_PRT.SRCrib 
##                          0                          0 
##       df_PRT.SRKingSizeBed           df_PRT.SRTwinBed 
##                          0                          0 
##      df_PRT.SRNearElevator         df_PRT.SRQuietRoom 
##                          0                          0 
## 0 observation(s) with NAs.
## 
## Estimated lambda: 6.284189 
## 
## 0 observation(s) with NAs.
plot(2:10, Essil, type = "b", ylab = "Silhouette", xlab = "Number of clusters")

k = 2, 4, and 8 have the highest average silhouette widths, indicating better-defined clusters than the other candidate values of k.
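
For reference, the average silhouette width compares, for each observation, its mean dissimilarity a to the other members of its own cluster with its mean dissimilarity b to the nearest other cluster, using s = (b - a) / max(a, b). The sketch below computes this by hand on a toy example. The data, the cluster labels and the function name silhouette_width are made up for illustration, and plain Euclidean distance is assumed; for the mixed hotel data the dissimilarity matrix would be based on the k-prototypes distance instead.

```r
# Toy example (made-up values, not the hotel data): six 1-D points in two clusters
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
d  <- as.matrix(dist(x))  # Euclidean dissimilarity matrix

# Silhouette width of every observation, given a dissimilarity matrix and labels
silhouette_width <- function(d, cl) {
  sapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE                  # exclude the observation itself
    if (!any(own)) return(0)         # convention for singleton clusters
    a <- mean(d[i, own])             # mean dissimilarity within own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(g) mean(d[i, cl == g])))  # nearest other cluster
    (b - a) / max(a, b)
  })
}

s <- silhouette_width(d, cl)
mean(s)  # average silhouette width of the toy partition
```

Values close to 1 indicate observations that sit well inside their cluster, values near 0 indicate observations between clusters, and negative values suggest a possible misassignment.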

# Perform k-prototype clustering with 2 clusters
kp2Var <- kproto(df_PRT_Mixed_K, k = 2, lambda = lambdaest(df_PRT_Mixed_K, num.method = 1))
## # NAs in variables:
##                 df_PRT.Age   df_PRT.DaysSinceCreation 
##                          0                          0 
##     df_PRT.AverageLeadTime    df_PRT.BookingsCanceled 
##                          0                          0 
##    df_PRT.BookingsNoShowed   df_PRT.BookingsCheckedIn 
##                          0                          0 
##       df_PRT.PersonsNights          df_PRT.RoomNights 
##                          0                          0 
##   df_PRT.DaysSinceLastStay  df_PRT.DaysSinceFirstStay 
##                          0                          0 
##        df_PRT.TotalRevenue df_PRT.DistributionChannel 
##                          0                          0 
##       df_PRT.MarketSegment         df_PRT.SRHighFloor 
##                          0                          0 
##          df_PRT.SRLowFloor    df_PRT.SRAccessibleRoom 
##                          0                          0 
##       df_PRT.SRMediumFloor           df_PRT.SRBathtub 
##                          0                          0 
##            df_PRT.SRShower              df_PRT.SRCrib 
##                          0                          0 
##       df_PRT.SRKingSizeBed           df_PRT.SRTwinBed 
##                          0                          0 
##      df_PRT.SRNearElevator         df_PRT.SRQuietRoom 
##                          0                          0 
## 0 observation(s) with NAs.
## 
## Numeric variances:
##                df_PRT.Age  df_PRT.DaysSinceCreation    df_PRT.AverageLeadTime 
##                         1                         1                         1 
##   df_PRT.BookingsCanceled   df_PRT.BookingsNoShowed  df_PRT.BookingsCheckedIn 
##                         1                         1                         1 
##      df_PRT.PersonsNights         df_PRT.RoomNights  df_PRT.DaysSinceLastStay 
##                         1                         1                         1 
## df_PRT.DaysSinceFirstStay       df_PRT.TotalRevenue 
##                         1                         1 
## Average numeric variance: 1 
## 
## Heuristic for categorical variables: (method = 1) 
## df_PRT.DistributionChannel       df_PRT.MarketSegment 
##               0.5107484357               0.7795660169 
##         df_PRT.SRHighFloor          df_PRT.SRLowFloor 
##               0.0682879136               0.0033082616 
##    df_PRT.SRAccessibleRoom       df_PRT.SRMediumFloor 
##               0.0006023189               0.0030079643 
##           df_PRT.SRBathtub            df_PRT.SRShower 
##               0.0045085433               0.0027075763 
##              df_PRT.SRCrib       df_PRT.SRKingSizeBed 
##               0.0276238119               0.4206373758 
##           df_PRT.SRTwinBed      df_PRT.SRNearElevator 
##               0.1621899432               0.0003012048 
##         df_PRT.SRQuietRoom 
##               0.0851944063 
## Average categorical variation: 0.1591295 
## 
## Estimated lambda: 6.284189 
## 
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp2Var$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
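
The estimated lambda printed above is consistent with a simple ratio: the average numeric variance divided by the average categorical variation. The sketch below recomputes it from the rounded values shown in the output. The vector name `cat_variation` and the shortened labels are illustrative assumptions, and the result can differ from the printed value in the last digit because the inputs are rounded.

```r
# Per-variable categorical variation, copied from the lambdaest() output above
# (names shortened for readability; this vector is illustrative only).
cat_variation <- c(
  DistributionChannel = 0.5107484357,
  MarketSegment       = 0.7795660169,
  SRHighFloor         = 0.0682879136,
  SRLowFloor          = 0.0033082616,
  SRAccessibleRoom    = 0.0006023189,
  SRMediumFloor       = 0.0030079643,
  SRBathtub           = 0.0045085433,
  SRShower            = 0.0027075763,
  SRCrib              = 0.0276238119,
  SRKingSizeBed       = 0.4206373758,
  SRTwinBed           = 0.1621899432,
  SRNearElevator      = 0.0003012048,
  SRQuietRoom         = 0.0851944063
)

# Every numeric variance in the output is 1, so their average is 1.
avg_num_variance <- 1

# Average categorical variation across the 13 categorical variables.
avg_cat_variation <- mean(cat_variation)

# Lambda as the ratio of the two averages.
lambda <- avg_num_variance / avg_cat_variation

print(avg_cat_variation)
print(lambda)
```

A lambda well above 1 means that a mismatch on a categorical variable is weighted more heavily than one unit of squared distance on a standardized numeric variable.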

# Perform k-prototype clustering with 4 clusters
kp4Var <- kproto(df_PRT_Mixed_K, k = 4, lambda = lambdaest(df_PRT_Mixed_K, num.method = 1))
## # NAs in variables:
##                 df_PRT.Age   df_PRT.DaysSinceCreation 
##                          0                          0 
##     df_PRT.AverageLeadTime    df_PRT.BookingsCanceled 
##                          0                          0 
##    df_PRT.BookingsNoShowed   df_PRT.BookingsCheckedIn 
##                          0                          0 
##       df_PRT.PersonsNights          df_PRT.RoomNights 
##                          0                          0 
##   df_PRT.DaysSinceLastStay  df_PRT.DaysSinceFirstStay 
##                          0                          0 
##        df_PRT.TotalRevenue df_PRT.DistributionChannel 
##                          0                          0 
##       df_PRT.MarketSegment         df_PRT.SRHighFloor 
##                          0                          0 
##          df_PRT.SRLowFloor    df_PRT.SRAccessibleRoom 
##                          0                          0 
##       df_PRT.SRMediumFloor           df_PRT.SRBathtub 
##                          0                          0 
##            df_PRT.SRShower              df_PRT.SRCrib 
##                          0                          0 
##       df_PRT.SRKingSizeBed           df_PRT.SRTwinBed 
##                          0                          0 
##      df_PRT.SRNearElevator         df_PRT.SRQuietRoom 
##                          0                          0 
## 0 observation(s) with NAs.
## 
## Numeric variances:
##                df_PRT.Age  df_PRT.DaysSinceCreation    df_PRT.AverageLeadTime 
##                         1                         1                         1 
##   df_PRT.BookingsCanceled   df_PRT.BookingsNoShowed  df_PRT.BookingsCheckedIn 
##                         1                         1                         1 
##      df_PRT.PersonsNights         df_PRT.RoomNights  df_PRT.DaysSinceLastStay 
##                         1                         1                         1 
## df_PRT.DaysSinceFirstStay       df_PRT.TotalRevenue 
##                         1                         1 
## Average numeric variance: 1 
## 
## Heuristic for categorical variables: (method = 1) 
## df_PRT.DistributionChannel       df_PRT.MarketSegment 
##               0.5107484357               0.7795660169 
##         df_PRT.SRHighFloor          df_PRT.SRLowFloor 
##               0.0682879136               0.0033082616 
##    df_PRT.SRAccessibleRoom       df_PRT.SRMediumFloor 
##               0.0006023189               0.0030079643 
##           df_PRT.SRBathtub            df_PRT.SRShower 
##               0.0045085433               0.0027075763 
##              df_PRT.SRCrib       df_PRT.SRKingSizeBed 
##               0.0276238119               0.4206373758 
##           df_PRT.SRTwinBed      df_PRT.SRNearElevator 
##               0.1621899432               0.0003012048 
##         df_PRT.SRQuietRoom 
##               0.0851944063 
## Average categorical variation: 0.1591295 
## 
## Estimated lambda: 6.284189 
## 
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp4Var$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

# Perform k-prototype clustering with 8 clusters
kp8Var <- kproto(df_PRT_Mixed_K, k = 8, lambda = lambdaest(df_PRT_Mixed_K, num.method = 1))
## # NAs in variables:
##                 df_PRT.Age   df_PRT.DaysSinceCreation 
##                          0                          0 
##     df_PRT.AverageLeadTime    df_PRT.BookingsCanceled 
##                          0                          0 
##    df_PRT.BookingsNoShowed   df_PRT.BookingsCheckedIn 
##                          0                          0 
##       df_PRT.PersonsNights          df_PRT.RoomNights 
##                          0                          0 
##   df_PRT.DaysSinceLastStay  df_PRT.DaysSinceFirstStay 
##                          0                          0 
##        df_PRT.TotalRevenue df_PRT.DistributionChannel 
##                          0                          0 
##       df_PRT.MarketSegment         df_PRT.SRHighFloor 
##                          0                          0 
##          df_PRT.SRLowFloor    df_PRT.SRAccessibleRoom 
##                          0                          0 
##       df_PRT.SRMediumFloor           df_PRT.SRBathtub 
##                          0                          0 
##            df_PRT.SRShower              df_PRT.SRCrib 
##                          0                          0 
##       df_PRT.SRKingSizeBed           df_PRT.SRTwinBed 
##                          0                          0 
##      df_PRT.SRNearElevator         df_PRT.SRQuietRoom 
##                          0                          0 
## 0 observation(s) with NAs.
## 
## Numeric variances:
##                df_PRT.Age  df_PRT.DaysSinceCreation    df_PRT.AverageLeadTime 
##                         1                         1                         1 
##   df_PRT.BookingsCanceled   df_PRT.BookingsNoShowed  df_PRT.BookingsCheckedIn 
##                         1                         1                         1 
##      df_PRT.PersonsNights         df_PRT.RoomNights  df_PRT.DaysSinceLastStay 
##                         1                         1                         1 
## df_PRT.DaysSinceFirstStay       df_PRT.TotalRevenue 
##                         1                         1 
## Average numeric variance: 1 
## 
## Heuristic for categorical variables: (method = 1) 
## df_PRT.DistributionChannel       df_PRT.MarketSegment 
##               0.5107484357               0.7795660169 
##         df_PRT.SRHighFloor          df_PRT.SRLowFloor 
##               0.0682879136               0.0033082616 
##    df_PRT.SRAccessibleRoom       df_PRT.SRMediumFloor 
##               0.0006023189               0.0030079643 
##           df_PRT.SRBathtub            df_PRT.SRShower 
##               0.0045085433               0.0027075763 
##              df_PRT.SRCrib       df_PRT.SRKingSizeBed 
##               0.0276238119               0.4206373758 
##           df_PRT.SRTwinBed      df_PRT.SRNearElevator 
##               0.1621899432               0.0003012048 
##         df_PRT.SRQuietRoom 
##               0.0851944063 
## Average categorical variation: 0.1591295 
## 
## Estimated lambda: 6.284189 
## 
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp8Var$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

5.2.4 Optimal Number of Clusters (for Lambda calculated with standard deviation)
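
The output in this section reports a numeric standard deviation of 1 for every variable and the same estimated lambda (6.284189) that was obtained with the variance-based heuristic. That is what one would expect if the numeric variables were standardized beforehand, since a variance of 1 implies a standard deviation of 1. The sketch below illustrates this on simulated data; the data frame `raw` and its columns are hypothetical, and only the average categorical variation is taken from the output.

```r
# Simulated numeric data standing in for the real numeric columns (assumption).
set.seed(1)
raw <- data.frame(
  age     = rnorm(200, mean = 45, sd = 15),
  revenue = rexp(200, rate = 1 / 300)
)

# Standardize to mean 0 and variance 1, as the reported variances suggest.
z <- as.data.frame(scale(raw))

# After standardization, variance and standard deviation are both 1.
num_var <- sapply(z, var)
num_sd  <- sapply(z, sd)

# Average categorical variation as printed in the output.
avg_cat_variation <- 0.1591295

# Both heuristics therefore lead to the same ratio.
lambda_var <- mean(num_var) / avg_cat_variation
lambda_sd  <- mean(num_sd)  / avg_cat_variation

print(c(lambda_var = lambda_var, lambda_sd = lambda_sd))
```

On unstandardized data the two heuristics would generally give different values of lambda, so the choice of `num.method` only matters when the numeric variables keep their original scale.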

5.2.4.1 WSS
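
As a self-contained illustration of the WSS criterion used in this subsection, the sketch below records the total within-cluster sum of squares for several values of k on simulated numeric data. It uses `stats::kmeans` as a stand-in for `kproto`, so it is not the procedure applied to `df_PRT_Mixed_K`; the matrix `toy` and its three groups are assumptions made for the example.

```r
# Simulated data with three well-separated groups (assumption for illustration).
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6.
wss <- numeric(6)
for (k in 1:6) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  wss[k] <- fit$tot.withinss
}

# Reduction in WSS gained by each additional cluster; the elbow is the point
# after which these reductions become small.
wss_drop <- -diff(wss)

print(round(wss, 1))
print(round(wss_drop, 1))
```

In data like this, the reductions tend to be large up to the true number of groups and small afterwards, which is the pattern to look for when reading the WSS values stored in `Es2`.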
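# Total within-cluster sum of squares for k = 1, ..., 10,
# with lambda estimated from the standard deviations (num.method = 2)
Es2 <- numeric(10)
for (i in 1:10) {
  kpres <- kproto(df_PRT_Mixed_K, k = i, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
  Es2[i] <- kpres$tot.withinss
}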
## # NAs in variables:
##                 df_PRT.Age   df_PRT.DaysSinceCreation 
##                          0                          0 
##     df_PRT.AverageLeadTime    df_PRT.BookingsCanceled 
##                          0                          0 
##    df_PRT.BookingsNoShowed   df_PRT.BookingsCheckedIn 
##                          0                          0 
##       df_PRT.PersonsNights          df_PRT.RoomNights 
##                          0                          0 
##   df_PRT.DaysSinceLastStay  df_PRT.DaysSinceFirstStay 
##                          0                          0 
##        df_PRT.TotalRevenue df_PRT.DistributionChannel 
##                          0                          0 
##       df_PRT.MarketSegment         df_PRT.SRHighFloor 
##                          0                          0 
##          df_PRT.SRLowFloor    df_PRT.SRAccessibleRoom 
##                          0                          0 
##       df_PRT.SRMediumFloor           df_PRT.SRBathtub 
##                          0                          0 
##            df_PRT.SRShower              df_PRT.SRCrib 
##                          0                          0 
##       df_PRT.SRKingSizeBed           df_PRT.SRTwinBed 
##                          0                          0 
##      df_PRT.SRNearElevator         df_PRT.SRQuietRoom 
##                          0                          0 
## 0 observation(s) with NAs.
## 
## Numeric standard deviations:
##                df_PRT.Age  df_PRT.DaysSinceCreation    df_PRT.AverageLeadTime 
##                         1                         1                         1 
##   df_PRT.BookingsCanceled   df_PRT.BookingsNoShowed  df_PRT.BookingsCheckedIn 
##                         1                         1                         1 
##      df_PRT.PersonsNights         df_PRT.RoomNights  df_PRT.DaysSinceLastStay 
##                         1                         1                         1 
## df_PRT.DaysSinceFirstStay       df_PRT.TotalRevenue 
##                         1                         1 
## Average numeric standard deviation: 1 
## 
## Heuristic for categorical variables: (method = 1) 
## df_PRT.DistributionChannel       df_PRT.MarketSegment 
##               0.5107484357               0.7795660169 
##         df_PRT.SRHighFloor          df_PRT.SRLowFloor 
##               0.0682879136               0.0033082616 
##    df_PRT.SRAccessibleRoom       df_PRT.SRMediumFloor 
##               0.0006023189               0.0030079643 
##           df_PRT.SRBathtub            df_PRT.SRShower 
##               0.0045085433               0.0027075763 
##              df_PRT.SRCrib       df_PRT.SRKingSizeBed 
##               0.0276238119               0.4206373758 
##           df_PRT.SRTwinBed      df_PRT.SRNearElevator 
##               0.1621899432               0.0003012048 
##         df_PRT.SRQuietRoom 
##               0.0851944063 
## Average categorical variation: 0.1591295 
## 
## Estimated lambda: 6.284189 
## 
## 0 observation(s) with NAs.
## 
## # NAs in variables:
##                 df_PRT.Age   df_PRT.DaysSinceCreation 
##                          0                          0 
##     df_PRT.AverageLeadTime    df_PRT.BookingsCanceled 
##                          0                          0 
##    df_PRT.BookingsNoShowed   df_PRT.BookingsCheckedIn 
##                          0                          0 
##       df_PRT.PersonsNights          df_PRT.RoomNights 
##                          0                          0 
##   df_PRT.DaysSinceLastStay  df_PRT.DaysSinceFirstStay 
##                          0                          0 
##        df_PRT.TotalRevenue df_PRT.DistributionChannel 
##                          0                          0 
##       df_PRT.MarketSegment         df_PRT.SRHighFloor 
##                          0                          0 
##          df_PRT.SRLowFloor    df_PRT.SRAccessibleRoom 
##                          0                          0 
##       df_PRT.SRMediumFloor           df_PRT.SRBathtub 
##                          0                          0 
##            df_PRT.SRShower              df_PRT.SRCrib 
##                          0                          0 
##       df_PRT.SRKingSizeBed           df_PRT.SRTwinBed 
##                          0                          0 
##      df_PRT.SRNearElevator         df_PRT.SRQuietRoom 
##                          0                          0 
## 0 observation(s) with NAs.
## 
## Numeric standard deviations:
##                df_PRT.Age  df_PRT.DaysSinceCreation    df_PRT.AverageLeadTime 
##                         1                         1                         1 
##   df_PRT.BookingsCanceled   df_PRT.BookingsNoShowed  df_PRT.BookingsCheckedIn 
##                         1                         1                         1 
##      df_PRT.PersonsNights         df_PRT.RoomNights  df_PRT.DaysSinceLastStay 
##                         1                         1                         1 
## df_PRT.DaysSinceFirstStay       df_PRT.TotalRevenue 
##                         1                         1 
## Average numeric standard deviation: 1 
## 
## Heuristic for categorical variables: (method = 1) 
## df_PRT.DistributionChannel       df_PRT.MarketSegment 
##               0.5107484357               0.7795660169 
##         df_PRT.SRHighFloor          df_PRT.SRLowFloor 
##               0.0682879136               0.0033082616 
##    df_PRT.SRAccessibleRoom       df_PRT.SRMediumFloor 
##               0.0006023189               0.0030079643 
##           df_PRT.SRBathtub            df_PRT.SRShower 
##               0.0045085433               0.0027075763 
##              df_PRT.SRCrib       df_PRT.SRKingSizeBed 
##               0.0276238119               0.4206373758 
##           df_PRT.SRTwinBed      df_PRT.SRNearElevator 
##               0.1621899432               0.0003012048 
##         df_PRT.SRQuietRoom 
##               0.0851944063 
## Average categorical variation: 0.1591295 
## 
## Estimated lambda: 6.284189 
## 
## 0 observation(s) with NAs.
## 
## # NAs in variables:
##                 df_PRT.Age   df_PRT.DaysSinceCreation 
##                          0                          0 
##     df_PRT.AverageLeadTime    df_PRT.BookingsCanceled 
##                          0                          0 
##    df_PRT.BookingsNoShowed   df_PRT.BookingsCheckedIn 
##                          0                          0 
##       df_PRT.PersonsNights          df_PRT.RoomNights 
##                          0                          0 
##   df_PRT.DaysSinceLastStay  df_PRT.DaysSinceFirstStay 
##                          0                          0 
##        df_PRT.TotalRevenue df_PRT.DistributionChannel 
##                          0                          0 
##       df_PRT.MarketSegment         df_PRT.SRHighFloor 
##                          0                          0 
##          df_PRT.SRLowFloor    df_PRT.SRAccessibleRoom 
##                          0                          0 
##       df_PRT.SRMediumFloor           df_PRT.SRBathtub 
##                          0                          0 
##            df_PRT.SRShower              df_PRT.SRCrib 
##                          0                          0 
##       df_PRT.SRKingSizeBed           df_PRT.SRTwinBed 
##                          0                          0 
##      df_PRT.SRNearElevator         df_PRT.SRQuietRoom 
##                          0                          0 
## 0 observation(s) with NAs.
## 
## Numeric standard deviations:
##                df_PRT.Age  df_PRT.DaysSinceCreation    df_PRT.AverageLeadTime 
##                         1                         1                         1 
##   df_PRT.BookingsCanceled   df_PRT.BookingsNoShowed  df_PRT.BookingsCheckedIn 
##                         1                         1                         1 
##      df_PRT.PersonsNights         df_PRT.RoomNights  df_PRT.DaysSinceLastStay 
##                         1                         1                         1 
## df_PRT.DaysSinceFirstStay       df_PRT.TotalRevenue 
##                         1                         1 
## Average numeric standard deviation: 1 
## 
## Heuristic for categorical variables: (method = 1) 
## df_PRT.DistributionChannel       df_PRT.MarketSegment 
##               0.5107484357               0.7795660169 
##         df_PRT.SRHighFloor          df_PRT.SRLowFloor 
##               0.0682879136               0.0033082616 
##    df_PRT.SRAccessibleRoom       df_PRT.SRMediumFloor 
##               0.0006023189               0.0030079643 
##           df_PRT.SRBathtub            df_PRT.SRShower 
##               0.0045085433               0.0027075763 
##              df_PRT.SRCrib       df_PRT.SRKingSizeBed 
##               0.0276238119               0.4206373758 
##           df_PRT.SRTwinBed      df_PRT.SRNearElevator 
##               0.1621899432               0.0003012048 
##         df_PRT.SRQuietRoom 
##               0.0851944063 
## Average categorical variation: 0.1591295 
## 
## Estimated lambda: 6.284189 
## 
## 0 observation(s) with NAs.
plot(1:10, Es2, type = "b", ylab = "Total Within Sum Of Squares", xlab = "Number of clusters")

From the elbow plot above, k = 4, 5, 8, and 9 are also candidate values for the optimal number of clusters when running the k-prototypes algorithm.
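One way to make the visual reading of an elbow plot more explicit is to tabulate how much the total within-cluster sum of squares falls each time a cluster is added. The following is only a minimal sketch in base R: the vector `wss_example` holds made-up illustrative numbers, not the values of `Es2` from this analysis, and the names `drop` and `rel_drop` are hypothetical.

```r
# Hypothetical total within-cluster sum of squares for k = 1..10
# (illustrative numbers only, NOT the results of this analysis).
wss_example <- c(1000, 700, 560, 430, 380, 360, 345, 300, 270, 262)

# Absolute reduction obtained when moving from k to k + 1 clusters
drop <- -diff(wss_example)

# Reduction relative to the value at the smaller k
rel_drop <- drop / head(wss_example, -1)

# One row per k = 2..10; a k after which the reductions flatten out
# is a candidate elbow
elbow_table <- data.frame(k = 2:10, drop = drop, rel_drop = round(rel_drop, 3))
print(elbow_table)
```

A table like this does not replace the plot, but it can help when several values of k look similar by eye.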

5.2.4.2 Silhouette
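Essil2 <- rep(NA_real_, 10)  # silhouette is undefined for k = 1, so index 1 stays NA
for (i in 2:10) {
  # Fit k-prototypes with i clusters; lambda is passed by name for clarity
  kpres <- kproto(df_PRT_Mixed_K, k = i, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
  # Store the silhouette index of this solution
  Essil2[i] <- validation_kproto(method = "silhouette", object = kpres)
}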
## # NAs in variables:
##                 df_PRT.Age   df_PRT.DaysSinceCreation 
##                          0                          0 
##     df_PRT.AverageLeadTime    df_PRT.BookingsCanceled 
##                          0                          0 
##    df_PRT.BookingsNoShowed   df_PRT.BookingsCheckedIn 
##                          0                          0 
##       df_PRT.PersonsNights          df_PRT.RoomNights 
##                          0                          0 
##   df_PRT.DaysSinceLastStay  df_PRT.DaysSinceFirstStay 
##                          0                          0 
##        df_PRT.TotalRevenue df_PRT.DistributionChannel 
##                          0                          0 
##       df_PRT.MarketSegment         df_PRT.SRHighFloor 
##                          0                          0 
##          df_PRT.SRLowFloor    df_PRT.SRAccessibleRoom 
##                          0                          0 
##       df_PRT.SRMediumFloor           df_PRT.SRBathtub 
##                          0                          0 
##            df_PRT.SRShower              df_PRT.SRCrib 
##                          0                          0 
##       df_PRT.SRKingSizeBed           df_PRT.SRTwinBed 
##                          0                          0 
##      df_PRT.SRNearElevator         df_PRT.SRQuietRoom 
##                          0                          0 
## 0 observation(s) with NAs.
## 
## Numeric standard deviations:
##                df_PRT.Age  df_PRT.DaysSinceCreation    df_PRT.AverageLeadTime 
##                         1                         1                         1 
##   df_PRT.BookingsCanceled   df_PRT.BookingsNoShowed  df_PRT.BookingsCheckedIn 
##                         1                         1                         1 
##      df_PRT.PersonsNights         df_PRT.RoomNights  df_PRT.DaysSinceLastStay 
##                         1                         1                         1 
## df_PRT.DaysSinceFirstStay       df_PRT.TotalRevenue 
##                         1                         1 
## Average numeric standard deviation: 1 
## 
## Heuristic for categorical variables: (method = 1) 
## df_PRT.DistributionChannel       df_PRT.MarketSegment 
##               0.5107484357               0.7795660169 
##         df_PRT.SRHighFloor          df_PRT.SRLowFloor 
##               0.0682879136               0.0033082616 
##    df_PRT.SRAccessibleRoom       df_PRT.SRMediumFloor 
##               0.0006023189               0.0030079643 
##           df_PRT.SRBathtub            df_PRT.SRShower 
##               0.0045085433               0.0027075763 
##              df_PRT.SRCrib       df_PRT.SRKingSizeBed 
##               0.0276238119               0.4206373758 
##           df_PRT.SRTwinBed      df_PRT.SRNearElevator 
##               0.1621899432               0.0003012048 
##         df_PRT.SRQuietRoom 
##               0.0851944063 
## Average categorical variation: 0.1591295 
## 
## Estimated lambda: 6.284189 
## 
## 0 observation(s) with NAs.
plot(1:10, Essil2, type = "b", ylab = "Silhouette", xlab = "Number of clusters")

k = 9 has the highest average silhouette width, followed by k = 4, which indicates that these solutions produce better-defined clusters than the other values of k.
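As a rough illustration of how an average silhouette width per k can be computed and compared, the sketch below uses toy mixed-type data rather than the hotel dataset. It also swaps in PAM on a Gower dissimilarity (from the `cluster` package) in place of k-prototypes, so it shows the selection logic only, not the method used in this report. The data frame `toy` and its columns are hypothetical. The silhouette is undefined for a single cluster, so the candidate range starts at k = 2.

```r
library(cluster)  # provides daisy(), pam() and silhouette()

set.seed(42)

# Toy mixed-type data (hypothetical, not the hotel dataset):
# two numeric columns and one factor column
toy <- data.frame(
  x1 = c(rnorm(30, mean = 0), rnorm(30, mean = 4)),
  x2 = c(rnorm(30, mean = 0), rnorm(30, mean = 4)),
  ch = factor(c(rep("Direct", 30), rep("Corporate", 30)))
)

# Gower dissimilarity handles numeric and factor columns together
d <- daisy(toy, metric = "gower")

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  cl <- pam(d, k = k, diss = TRUE)$clustering
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- ks

# Candidate k with the largest average silhouette width
best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

Ranking the whole vector, for example with `sort(avg_sil, decreasing = TRUE)`, gives the runner-up values of k as well, which is the comparison made above between the best and second-best solutions.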

# Perform k-prototype clustering with 4 clusters
kp4Sd <- kproto(df_PRT_Mixed_K, k = 4, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
## # NAs in variables:
##                 df_PRT.Age   df_PRT.DaysSinceCreation 
##                          0                          0 
##     df_PRT.AverageLeadTime    df_PRT.BookingsCanceled 
##                          0                          0 
##    df_PRT.BookingsNoShowed   df_PRT.BookingsCheckedIn 
##                          0                          0 
##       df_PRT.PersonsNights          df_PRT.RoomNights 
##                          0                          0 
##   df_PRT.DaysSinceLastStay  df_PRT.DaysSinceFirstStay 
##                          0                          0 
##        df_PRT.TotalRevenue df_PRT.DistributionChannel 
##                          0                          0 
##       df_PRT.MarketSegment         df_PRT.SRHighFloor 
##                          0                          0 
##          df_PRT.SRLowFloor    df_PRT.SRAccessibleRoom 
##                          0                          0 
##       df_PRT.SRMediumFloor           df_PRT.SRBathtub 
##                          0                          0 
##            df_PRT.SRShower              df_PRT.SRCrib 
##                          0                          0 
##       df_PRT.SRKingSizeBed           df_PRT.SRTwinBed 
##                          0                          0 
##      df_PRT.SRNearElevator         df_PRT.SRQuietRoom 
##                          0                          0 
## 0 observation(s) with NAs.
## 
## Numeric standard deviations:
##                df_PRT.Age  df_PRT.DaysSinceCreation    df_PRT.AverageLeadTime 
##                         1                         1                         1 
##   df_PRT.BookingsCanceled   df_PRT.BookingsNoShowed  df_PRT.BookingsCheckedIn 
##                         1                         1                         1 
##      df_PRT.PersonsNights         df_PRT.RoomNights  df_PRT.DaysSinceLastStay 
##                         1                         1                         1 
## df_PRT.DaysSinceFirstStay       df_PRT.TotalRevenue 
##                         1                         1 
## Average numeric standard deviation: 1 
## 
## Heuristic for categorical variables: (method = 1) 
## df_PRT.DistributionChannel       df_PRT.MarketSegment 
##               0.5107484357               0.7795660169 
##         df_PRT.SRHighFloor          df_PRT.SRLowFloor 
##               0.0682879136               0.0033082616 
##    df_PRT.SRAccessibleRoom       df_PRT.SRMediumFloor 
##               0.0006023189               0.0030079643 
##           df_PRT.SRBathtub            df_PRT.SRShower 
##               0.0045085433               0.0027075763 
##              df_PRT.SRCrib       df_PRT.SRKingSizeBed 
##               0.0276238119               0.4206373758 
##           df_PRT.SRTwinBed      df_PRT.SRNearElevator 
##               0.1621899432               0.0003012048 
##         df_PRT.SRQuietRoom 
##               0.0851944063 
## Average categorical variation: 0.1591295 
## 
## Estimated lambda: 6.284189 
## 
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp4Sd$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
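The printed `lambdaest()` output contains enough information to check the estimate by hand. With `num.method = 2` the numeric side is summarized by the average standard deviation (1 here, since the numeric columns are standardized), and the categorical side by the average of the per-variable heuristic values. The reported lambda appears to be the ratio of the two. The sketch below recomputes it from the printed values only; the variable names are illustrative and not part of the original analysis.

```r
# Per-variable categorical heuristic values, copied from the printed output
cat_var <- c(
  DistributionChannel = 0.5107484357,
  MarketSegment       = 0.7795660169,
  SRHighFloor         = 0.0682879136,
  SRLowFloor          = 0.0033082616,
  SRAccessibleRoom    = 0.0006023189,
  SRMediumFloor       = 0.0030079643,
  SRBathtub           = 0.0045085433,
  SRShower            = 0.0027075763,
  SRCrib              = 0.0276238119,
  SRKingSizeBed       = 0.4206373758,
  SRTwinBed           = 0.1621899432,
  SRNearElevator      = 0.0003012048,
  SRQuietRoom         = 0.0851944063
)

# Average numeric standard deviation, as printed (columns are standardized)
avg_num_sd <- 1

# Average categorical variation; the output reports 0.1591295
avg_cat <- mean(cat_var)

# Ratio of the two averages; approximately 6.284, matching the printed estimate
lambda_hat <- avg_num_sd / avg_cat

print(avg_cat)
print(lambda_hat)
```

Because the estimate depends only on the data and not on k, it could be computed once, stored, and passed to each `kproto()` call, which would also avoid repeating the same printout for every value of k.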

# Perform k-prototype clustering with 5 clusters
kp5Sd <- kproto(df_PRT_Mixed_K, k = 5, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
clusplot(df_PRT_Mixed_K, kp5Sd$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

# Perform k-prototype clustering with 8 clusters
kp8Sd <- kproto(df_PRT_Mixed_K, k = 8, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
clusplot(df_PRT_Mixed_K, kp8Sd$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

# Perform k-prototype clustering with 9 clusters
kp9Sd <- kproto(df_PRT_Mixed_K, k = 9, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
clusplot(df_PRT_Mixed_K, kp9Sd$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

kp2Var$centers
##    df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1  0.03022443                0.8618161            -0.03120781
## 2 -0.03102709               -0.8847031             0.03203658
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1              0.01379821             0.002827812               0.03862871
## 2             -0.01416465            -0.002902910              -0.03965457
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1           0.06239875        0.03781794                0.8385431
## 2          -0.06405586       -0.03882226               -0.8608121
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1                 0.8619061         -0.03137890      Travel Agent/Operator
## 2                -0.8847955          0.03221222      Travel Agent/Operator
##   df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1                Other                  0                 0
## 2                Other                  0                 0
##   df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1                       0                    0                0               0
## 2                       0                    0                0               0
##   df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1             0                    0                0                     0
## 2             0                    0                0                     0
##   df_PRT.SRQuietRoom
## 1                  0
## 2                  0
kp4Var$centers
##   df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 -0.1229554               -0.1467219            -0.03964562
## 2  0.2881788                0.5500291             0.52793903
## 3 -0.0852627               -0.3025952            -0.42557610
## 4  0.2144957                0.5737608            -0.34384775
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.05192347             -0.04566126              -0.04481980
## 2             -0.06983874             -0.05121937              -0.09503184
## 3             -0.03777374             -0.01985941              -0.04278337
## 4              7.71994487              5.82473545               8.46251520
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1            0.1533192        0.03550415               -0.1348671
## 2           -0.1484140       -0.03409533                0.5729967
## 3           -0.2267508       -0.20414658               -0.3054045
## 4            5.4236046        7.62407903               -0.9848269
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1                -0.1469145           0.1356722      Travel Agent/Operator
## 2                 0.5488496          -0.1386580      Travel Agent/Operator
## 3                -0.3033911          -0.1941631                     Direct
## 4                 0.6662951           4.8153723                  Corporate
##   df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1                Other                  0                 0
## 2               Groups                  0                 0
## 3               Direct                  0                 0
## 4            Corporate                  0                 0
##   df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1                       0                    0                0               0
## 2                       0                    0                0               0
## 3                       0                    0                0               0
## 4                       0                    0                0               0
##   df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1             0                    0                0                     0
## 2             0                    0                0                     0
## 3             0                    0                0                     0
## 4             0                    1                0                     0
##   df_PRT.SRQuietRoom
## 1                  0
## 2                  0
## 3                  0
## 4                  0
kp8Var$centers
##     df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 -0.061254664               -1.0161557            -0.36456323
## 2  0.036566373               -0.2319464            -0.49615505
## 3  0.414825661                0.4075808             0.92227193
## 4 -0.184595319                0.7893740            -0.19848037
## 5 -0.003912145               -0.2663671            -0.09466020
## 6 -0.156279758                0.8217644            -0.45456177
## 7 -0.123844728               -1.1116870             0.03481985
## 8 -0.057610051                0.3013993             0.01400926
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.05977735             -0.05799844              -0.08210686
## 2              0.16060460              0.20133054               0.23287803
## 3             -0.07393753             -0.05799844              -0.11690586
## 4             -0.07785323             -0.05799844              -0.11098996
## 5             -0.07785323             -0.05799844              -0.07541951
## 6             -0.04464661             -0.05799844              -0.06433485
## 7             -0.07785323             -0.04636124              -0.10828461
## 8              0.21908683              0.12747924               0.31213396
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1          -0.19991077      -0.222518252               -0.9776815
## 2          -0.31376770      -0.026748269               -0.3880886
## 3          -0.11080315      -0.003313678                0.4536805
## 4           0.04866146      -0.063238125                0.8275117
## 5           0.02052323      -0.037351256               -0.2479936
## 6          -0.14932798      -0.205436881                0.8086575
## 7           0.01854390      -0.023541552               -1.0566356
## 8           0.52620782       0.421221160                0.2382958
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1                -1.0157199         -0.13876781                     Direct
## 2                -0.2304069         -0.14671198                  Corporate
## 3                 0.4074129         -0.09671927      Travel Agent/Operator
## 4                 0.7875424         -0.06736439      Travel Agent/Operator
## 5                -0.2670119          0.10021160      Travel Agent/Operator
## 6                 0.8190367         -0.18854062                     Direct
## 7                -1.1122927          0.08686339      Travel Agent/Operator
## 8                 0.3045435          0.40461341      Travel Agent/Operator
##   df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1               Direct                  0                 0
## 2            Corporate                  0                 0
## 3               Groups                  0                 0
## 4                Other                  0                 0
## 5                Other                  0                 0
## 6               Direct                  0                 0
## 7                Other                  0                 0
## 8                Other                  0                 0
##   df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1                       0                    0                0               0
## 2                       0                    0                0               0
## 3                       0                    0                0               0
## 4                       0                    0                0               0
## 5                       0                    0                0               0
## 6                       0                    0                0               0
## 7                       0                    0                0               0
## 8                       0                    0                0               0
##   df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1             0                    0                0                     0
## 2             0                    0                0                     0
## 3             0                    0                0                     0
## 4             0                    0                0                     0
## 5             0                    1                0                     0
## 6             0                    0                0                     0
## 7             0                    0                0                     0
## 8             0                    1                0                     0
##   df_PRT.SRQuietRoom
## 1                  0
## 2                  0
## 3                  0
## 4                  0
## 5                  1
## 6                  0
## 7                  0
## 8                  0
kp4Sd$centers
##    df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1  0.04301233                0.1072099              0.2245371
## 2 -0.05636145               -1.0641813             -0.4022000
## 3 -0.12799166                0.8798871             -0.4603846
## 4  0.21449566                0.5737608             -0.3438478
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.06585393             -0.04692611             -0.082673114
## 2             -0.01811270             -0.01062496             -0.014074487
## 3             -0.03902494             -0.04568231             -0.003642332
## 4              7.71994487              5.82473545              8.462515197
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1           0.06092893        0.00764666                0.1316017
## 2          -0.25849533       -0.16849265               -1.0503773
## 3          -0.18488349       -0.16615960                0.8252034
## 4           5.42360455        7.62407903               -0.9848269
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1                 0.1068864          0.03334654      Travel Agent/Operator
## 2                -1.0652561         -0.14345495                     Direct
## 3                 0.8784614         -0.18368616                     Direct
## 4                 0.6662951          4.81537235                  Corporate
##   df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1                Other                  0                 0
## 2               Direct                  0                 0
## 3               Direct                  0                 0
## 4            Corporate                  0                 0
##   df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1                       0                    0                0               0
## 2                       0                    0                0               0
## 3                       0                    0                0               0
## 4                       0                    0                0               0
##   df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1             0                    0                0                     0
## 2             0                    0                0                     0
## 3             0                    0                0                     0
## 4             0                    1                0                     0
##   df_PRT.SRQuietRoom
## 1                  0
## 2                  0
## 3                  0
## 4                  0
kp5Sd$centers
##     df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1  0.393312571               0.08496927              1.1975527
## 2 -0.136699029              -0.95366843             -0.1491111
## 3  0.004884595               0.95935902             -0.1361254
## 4 -0.104120577              -0.14525267             -0.4105885
## 5  0.214495665               0.57376085             -0.3438478
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.07785323             -0.03496904              -0.10014964
## 2             -0.04763397             -0.03243739              -0.05399954
## 3             -0.07312079             -0.05199402              -0.06671622
## 4             -0.01793747             -0.03627847              -0.02453451
## 5              7.71994487              5.82473545               8.46251520
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1          0.202824260        0.12658677                0.1150763
## 2         -0.090644751       -0.04399013               -0.9178029
## 3         -0.003329724       -0.05382035                0.9643873
## 4         -0.167663493       -0.17128601               -0.1686804
## 5          5.423604552        7.62407903               -0.9848269
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1                0.08088115          0.04049591      Travel Agent/Operator
## 2               -0.95467703          0.04384743      Travel Agent/Operator
## 3                0.95819619         -0.05296965      Travel Agent/Operator
## 4               -0.14264283         -0.14165073                     Direct
## 5                0.66629505          4.81537235                  Corporate
##    df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1 Travel Agent/Operator                  0                 0
## 2                 Other                  0                 0
## 3                 Other                  0                 0
## 4                Direct                  0                 0
## 5             Corporate                  0                 0
##   df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1                       0                    0                0               0
## 2                       0                    0                0               0
## 3                       0                    0                0               0
## 4                       0                    0                0               0
## 5                       0                    0                0               0
##   df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1             0                    0                0                     0
## 2             0                    0                0                     0
## 3             0                    0                0                     0
## 4             0                    0                0                     0
## 5             0                    1                0                     0
##   df_PRT.SRQuietRoom
## 1                  0
## 2                  0
## 3                  0
## 4                  0
## 5                  0
kp8Sd$centers
##    df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1  0.19848212                0.5534327            -0.34342633
## 2 -0.19909924                0.8237105            -0.46773856
## 3 -0.01048388                0.8586176             0.04882682
## 4  0.22977485               -0.4741717             1.01501688
## 5 -0.03251781               -1.0706102            -0.39948478
## 6  0.24175668                1.0227551             0.22150706
## 7 -0.19860636               -0.3978247            -0.26394827
## 8 -0.03669783               -0.8537098            -0.07299062
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             7.789397007              5.95843394               8.52657626
## 2            -0.039874587             -0.03046328              -0.03986801
## 3            -0.049728052             -0.05799844               0.02195764
## 4            -0.072421491             -0.04421509              -0.11110003
## 5            -0.043059379             -0.04538538              -0.06425359
## 6            -0.077853225             -0.04695912              -0.10315374
## 7            -0.061431046             -0.05799844              -0.08722358
## 8            -0.007254155              0.01664712               0.01567502
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1           5.48888801        7.71858598               -0.9998726
## 2          -0.19011183       -0.19434059                0.7868890
## 3           0.43814195        0.19282798                0.8152028
## 4          -0.03888025        0.08922329               -0.4247424
## 5          -0.26123324       -0.22805493               -1.0368066
## 6          -0.05417393       -0.04423580                1.0530115
## 7          -0.07971646       -0.11497779               -0.3664215
## 8          -0.01293540       -0.01442834               -0.8630454
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1                 0.6480450          4.84865759                  Corporate
## 2                 0.8209011         -0.19188243                     Direct
## 3                 0.8632603          0.33662092      Travel Agent/Operator
## 4                -0.4732629          0.13244813      Travel Agent/Operator
## 5                -1.0708035         -0.15932247                     Direct
## 6                 1.0191316         -0.19298918      Travel Agent/Operator
## 7                -0.3966113         -0.04374251      Travel Agent/Operator
## 8                -0.8582784         -0.03990061      Travel Agent/Operator
##    df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1             Corporate                  0                 0
## 2                Direct                  0                 0
## 3                 Other                  0                 0
## 4                Groups                  0                 0
## 5                Direct                  0                 0
## 6 Travel Agent/Operator                  0                 0
## 7                 Other                  0                 0
## 8                 Other                  0                 0
##   df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1                       0                    0                0               0
## 2                       0                    0                0               0
## 3                       0                    0                0               0
## 4                       0                    0                0               0
## 5                       0                    0                0               0
## 6                       0                    0                0               0
## 7                       0                    0                0               0
## 8                       0                    0                0               0
##   df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1             0                    1                0                     0
## 2             0                    0                0                     0
## 3             0                    1                0                     0
## 4             0                    0                0                     0
## 5             0                    0                0                     0
## 6             0                    0                0                     0
## 7             0                    0                0                     0
## 8             0                    1                0                     0
##   df_PRT.SRQuietRoom
## 1                  0
## 2                  0
## 3                  0
## 4                  0
## 5                  0
## 6                  0
## 7                  0
## 8                  0
kp9Sd$centers
##    df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 -0.10912617               0.42212414            -0.46653696
## 2  0.01725412              -1.12752595            -0.32208154
## 3 -0.15643681              -1.36123997            -0.13662319
## 4  0.06387621               1.14397679            -0.11932906
## 5  0.31143337               0.31005134            -0.29120330
## 6 -0.01980585               0.29187986            -0.01972745
## 7 -0.26730490              -0.32929994            -0.24850353
## 8  0.40044176               0.09174735             1.26780366
## 9  0.03956153               0.22965152            -0.39795627
##   df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1             -0.04709018             -0.04684658              -0.06182107
## 2             -0.03051308             -0.03797704              -0.02381647
## 3             -0.06176079             -0.04438661              -0.08775212
## 4             -0.07339655             -0.04668936              -0.10486884
## 5              1.95439145              0.92427623               3.27291568
## 6             -0.07382784             -0.04778379              -0.06782941
## 7             -0.06564755             -0.05799844              -0.09421206
## 8             -0.07785323             -0.03247805              -0.11310721
## 9             14.78011342             13.57924830              13.94614191
##   df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1           -0.2229806       -0.22206365                0.4049520
## 2           -0.1935721       -0.16886970               -1.1285292
## 3           -0.1376795       -0.08796258               -1.3106887
## 4           -0.0845167       -0.09704917                1.1740229
## 5            2.8994182        3.49662028               -0.8107716
## 6            0.1264300       -0.01273069                0.2980056
## 7           -0.1726840       -0.08946249               -0.2955166
## 8            0.1585168        0.10777391                0.1350856
## 9            8.0122550       10.82174222               -1.2382014
##   df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1                0.41963546        -0.188831485                     Direct
## 2               -1.12665119        -0.160706366                     Direct
## 3               -1.36268294        -0.001398094      Travel Agent/Operator
## 4                1.14258807        -0.126101412      Travel Agent/Operator
## 5                0.31821808         3.652950764                  Corporate
## 6                0.29227489         0.048872250      Travel Agent/Operator
## 7               -0.32661316        -0.063176002      Travel Agent/Operator
## 8                0.08952496        -0.011145755      Travel Agent/Operator
## 9                0.47332754         5.318602039                  Corporate
##    df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1                Direct                  0                 0
## 2                Direct                  0                 0
## 3                 Other                  0                 0
## 4                 Other                  0                 0
## 5             Corporate                  0                 0
## 6                 Other                  0                 0
## 7                 Other                  0                 0
## 8 Travel Agent/Operator                  0                 0
## 9             Corporate                  0                 0
##   df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1                       0                    0                0               0
## 2                       0                    0                0               0
## 3                       0                    0                0               0
## 4                       0                    0                0               0
## 5                       0                    0                0               0
## 6                       0                    0                0               0
## 7                       0                    0                0               0
## 8                       0                    0                0               0
## 9                       0                    0                0               0
##   df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1             0                    0                0                     0
## 2             0                    1                0                     0
## 3             0                    0                0                     0
## 4             0                    0                0                     0
## 5             0                    1                0                     0
## 6             0                    1                0                     0
## 7             0                    0                0                     0
## 8             0                    0                0                     0
## 9             0                    1                0                     0
##   df_PRT.SRQuietRoom
## 1                  0
## 2                  0
## 3                  0
## 4                  0
## 5                  0
## 6                  0
## 7                  0
## 8                  0
## 9                  0
validation_kproto(method = "silhouette", object = kp2Var)
## [1] 0.241107
validation_kproto(method = "silhouette", object = kp4Var)
## [1] 0.2033768
validation_kproto(method = "silhouette", object = kp8Var)
## [1] 0.1317051
validation_kproto(method = "silhouette", object = kp4Sd)
## [1] 0.1807716
validation_kproto(method = "silhouette", object = kp5Sd)
## [1] 0.2272975
validation_kproto(method = "silhouette", object = kp8Sd)
## [1] 0.2254407
validation_kproto(method = "silhouette", object = kp9Sd)
## [1] 0.1840354

The silhouette statistic is low (between about 0.13 and 0.24) for the clusterings generated with lambda estimated from both the variance and the standard deviation. These results are weaker than expected from an approach designed for mixed data, and they suggest that the k-prototypes clustering was ineffective for this dataset.

This may be due to poor separation of the clusters: the algorithm may have failed to identify distinct groups in the data, so the silhouette value is low for many data points. This can happen if the categories of the categorical variables overlap heavily across clusters, or if the continuous variables do not separate the clusters sufficiently.
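To illustrate how overlap lowers the silhouette statistic, recall that for observation i the silhouette value is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean dissimilarity to the other members of its own cluster and b(i) is the lowest mean dissimilarity to the members of any other cluster. The following is a minimal sketch on simulated data, not on the hotel dataset; the helper names make_groups and avg_sil and the gap values are assumptions chosen only for illustration.

```r
library(cluster)  # pam() and its silhouette information

set.seed(123)

# Two hypothetical 2-D groups of 50 points each; 'gap' shifts the second group
make_groups <- function(gap) {
  rbind(
    matrix(rnorm(100, mean = 0),   ncol = 2),
    matrix(rnorm(100, mean = gap), ncol = 2)
  )
}

# Average silhouette width of a PAM clustering with k clusters
avg_sil <- function(x, k = 2) pam(x, k = k)$silinfo$avg.width

sil_separated   <- avg_sil(make_groups(gap = 8))    # groups far apart
sil_overlapping <- avg_sil(make_groups(gap = 0.5))  # groups heavily overlapping

c(separated = sil_separated, overlapping = sil_overlapping)
```

With well-separated groups, a(i) is much smaller than b(i) and the average silhouette width approaches 1. With overlapping groups, a(i) and b(i) are close to each other and the average falls towards 0, which is the situation described above.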

5.3 Gower Distance

Gower distance can be used for clustering when the data contain a mixture of variable types. Distance metrics such as Euclidean or Manhattan distance are defined only for numerical variables and are sensitive to differences in scale, so they are not directly applicable to mixed data. Gower distance instead computes a partial dissimilarity for each variable using a measure suited to its type, scales it to the range [0, 1], and averages the results across variables, which makes it a better choice for mixed data.

For quantitative (interval) variables, Gower’s distance uses the range-normalized Manhattan distance between two observations. For ordinal variables, the values are first ranked and the Manhattan distance is then applied with a special adjustment for ties. For nominal variables, each variable is first converted into k binary columns and the Dice coefficient is then used; for a single nominal variable this amounts to a distance of 0 when the two categories match and 1 when they differ.
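The per-variable rules above can be checked by hand on a tiny example. The sketch below uses a hypothetical three-row data frame (the values and the helper gower_pair are assumptions, not part of the analysis) with one numerical and one nominal variable, and compares the manual result with daisy().

```r
library(cluster)  # daisy()

# Toy data with hypothetical values, not taken from the hotel dataset
toy <- data.frame(
  revenue = c(100, 300, 500),
  channel = factor(c("Direct", "Corporate", "Direct"))
)

# Manual Gower distance between rows i and j of a data frame
gower_pair <- function(df, i, j) {
  parts <- sapply(names(df), function(v) {
    x <- df[[v]]
    if (is.numeric(x)) {
      abs(x[i] - x[j]) / diff(range(x))  # range-normalized Manhattan distance
    } else {
      as.numeric(x[i] != x[j])           # nominal: 0 if equal, 1 otherwise
    }
  })
  mean(parts)                            # unweighted average over variables
}

manual     <- gower_pair(toy, 1, 2)
from_daisy <- as.matrix(daisy(toy, metric = "gower"))[1, 2]
c(manual = manual, daisy = from_daisy)
```

For rows 1 and 2 the numerical part is |100 - 300| / 400 = 0.5 and the nominal part is 1 because the channels differ, so the Gower distance is (0.5 + 1) / 2 = 0.75.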

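One advantage of Gower’s distance is that it does not assume a specific distribution for the data, although the range normalization of the numerical variables remains sensitive to extreme values. It is also computationally expensive for large datasets, because the full pairwise dissimilarity matrix must be computed and stored, and its size grows quadratically with the number of observations.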
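The cost can be estimated with simple arithmetic. The sketch below assumes that one dissimilarity is stored per pair of observations as an 8-byte double and takes the number of observations (6639) from the str() output of df_PRT_Mixed_K shown later in this section; the figure is a rough lower bound, not a measurement.

```r
n       <- 6639                   # observations in df_PRT_Mixed_K
n_pairs <- n * (n - 1) / 2        # one dissimilarity per unordered pair
mib     <- n_pairs * 8 / 1024^2   # assumption: 8 bytes per stored value

n_pairs       # number of pairwise dissimilarities
round(mib)    # approximate size in MiB

# Doubling the number of observations roughly quadruples the storage
ratio <- (2 * n * (2 * n - 1) / 2) / n_pairs
ratio
```

For the present dataset this is about 22 million dissimilarities, which is manageable, but the quadratic growth makes the approach impractical for much larger customer bases.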

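# type = list(logratio = 3) makes daisy() log-transform column 3 (df_PRT.AverageLeadTime).
# The data are already standardized, so this column contains negative values,
# which is most likely the source of the NaN warning below.
gower_dist <- daisy(df_PRT_Mixed_K,
                    metric = "gower",
                    type = list(logratio = 3))
## Warning in daisy(df_PRT_Mixed_K, metric = "gower", type = list(logratio = 3)):
## NaNs produced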
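# Average silhouette width of a PAM clustering for k = 2, ..., 10
Esgower <- numeric(10)  # Esgower[1] stays 0: no silhouette is computed for k = 1
for (i in 2:10) {
  pames <- pam(gower_dist, diss = TRUE, k = i)
  Esgower[i] <- pames$silinfo$avg.width
}
plot(1:10, Esgower, type = "b", ylab = "Silhouette", xlab = "Number of Clusters")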

In this case, k = 2, k = 5, and k = 10 are selected for clustering.

pamgower <- pam(gower_dist, diss = TRUE, k=2)
fviz_silhouette(pamgower)
##   cluster size ave.sil.width
## 1       1 1941          0.32
## 2       2 4698          0.28

pamgower <- pam(gower_dist, diss = TRUE, k=5)
fviz_silhouette(pamgower)
##   cluster size ave.sil.width
## 1       1  735          0.36
## 2       2 1403          0.32
## 3       3 1456          0.36
## 4       4 1394          0.30
## 5       5 1651          0.23

pamgower <- pam(gower_dist, diss = TRUE, k=10)
fviz_silhouette(pamgower)
##    cluster size ave.sil.width
## 1        1  693          0.33
## 2        2  640          0.42
## 3        3  755          0.36
## 4        4  619          0.40
## 5        5  545          0.43
## 6        6  734          0.31
## 7        7  715          0.26
## 8        8  603          0.35
## 9        9  732          0.38
## 10      10  603          0.29

The results are slightly better than those of the k-prototypes algorithm, with average silhouette widths per cluster between 0.23 and 0.43, and similar to those obtained by clustering only the continuous data in part 1. Overall, however, they are not good enough to justify adopting this clustering.

5.4 Factor Analysis of Mixed Data (FAMD) with K-means

FAMD is a method for analyzing data sets that contain a mixture of continuous, categorical, and count variables. It extends Principal Component Analysis (PCA) to mixed data and can be seen as combining PCA for the continuous variables with Multiple Correspondence Analysis (MCA) for the categorical ones. It constructs a set of principal components that capture the most important features of the data while balancing the influence of the different types of variables.

FAMD can reveal patterns and relationships in the data that are not apparent when the variables are analyzed separately. It can also be used for data reduction and to identify the variables that matter most for a particular analysis.
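The idea of balancing continuous and categorical variables can be sketched in base R. The block below is an illustration of the principle under stated assumptions, not FactoMineR's implementation: continuous variables are standardized with the population standard deviation, each factor is expanded into indicator columns that are divided by the square root of the category proportion and then centered, and an ordinary PCA is run on the combined matrix. The toy data and all object names are hypothetical.

```r
set.seed(42)
n <- 30

# Toy mixed data: two continuous variables and one factor with three levels
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(rep(c("a", "b", "c"), each = 10))
)

# Continuous part: center and scale with the population standard deviation
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))
num <- sapply(toy[c("x1", "x2")], function(x) (x - mean(x)) / pop_sd(x))

# Categorical part: one indicator column per level, weighted and centered
ind      <- model.matrix(~ g - 1, data = toy)
p        <- colMeans(ind)                      # proportion of each category
cat_part <- scale(sweep(ind, 2, sqrt(p), "/"), center = TRUE, scale = FALSE)

# PCA on the combined matrix via the eigenvalues of its covariance matrix
Z   <- cbind(num, cat_part)
eig <- eigen(crossprod(Z) / n, symmetric = TRUE)$values

round(eig, 4)
sum(eig)  # total inertia
```

Under this weighting each continuous variable contributes 1 to the total inertia and a factor with m levels contributes m - 1, so the toy example has a total of 2 + 2 = 4. The same count for the 24 variables analyzed here (11 continuous variables, plus 3 + 6 + 11 from the factors) gives 31, which agrees with the 31 dimensions listed in the eigenvalue table of this section.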

str(df_PRT_Mixed_K)
## 'data.frame':    6639 obs. of  24 variables:
##  $ df_PRT.Age                : num  0.409 -0.466 1.867 -0.174 -0.539 ...
##  $ df_PRT.DaysSinceCreation  : num  -1.26 1.66 1.66 1.66 1.66 ...
##  $ df_PRT.AverageLeadTime    : num  -0.0614 -0.6633 0.4857 0.39 0.6635 ...
##  $ df_PRT.BookingsCanceled   : num  4.6641 -0.0779 -0.0779 -0.0779 -0.0779 ...
##  $ df_PRT.BookingsNoShowed   : num  -0.058 -0.058 -0.058 -0.058 -0.058 ...
##  $ df_PRT.BookingsCheckedIn  : num  1.031 4.491 -0.122 -0.122 -0.122 ...
##  $ df_PRT.PersonsNights      : num  0.789 2.992 0.349 0.349 -0.312 ...
##  $ df_PRT.RoomNights         : num  0.745 3.466 0.14 0.14 0.14 ...
##  $ df_PRT.DaysSinceLastStay  : num  -1.208 0.144 1.702 1.702 1.702 ...
##  $ df_PRT.DaysSinceFirstStay : num  1.59 1.72 1.66 1.66 1.66 ...
##  $ df_PRT.TotalRevenue       : num  0.292 2.216 -0.247 -0.143 -0.418 ...
##  $ df_PRT.DistributionChannel: Factor w/ 4 levels "Corporate","Direct",..: 1 2 2 2 4 2 2 2 2 4 ...
##  $ df_PRT.MarketSegment      : Factor w/ 7 levels "Aviation","Complementary",..: 3 3 4 4 7 4 4 4 4 7 ...
##  $ df_PRT.SRHighFloor        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRLowFloor         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRAccessibleRoom   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRMediumFloor      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRBathtub          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRShower           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRCrib             : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
##  $ df_PRT.SRKingSizeBed      : Factor w/ 2 levels "0","1": 1 2 1 1 1 2 2 2 1 1 ...
##  $ df_PRT.SRTwinBed          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRNearElevator     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ df_PRT.SRQuietRoom        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
famd <- FAMD(df_PRT_Mixed_K, ncp=50, graph=FALSE)
get_eigenvalue(famd)
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.584092743     11.561589494                    11.56159
## Dim.2  3.067958598      9.896640639                    21.45823
## Dim.3  2.174948290      7.015962227                    28.47419
## Dim.4  1.896785604      6.118663237                    34.59286
## Dim.5  1.720483490      5.549946743                    40.14280
## Dim.6  1.658684137      5.350593989                    45.49340
## Dim.7  1.128821432      3.641359458                    49.13476
## Dim.8  1.117848927      3.605964282                    52.74072
## Dim.9  1.065240947      3.436261120                    56.17698
## Dim.10 1.026606299      3.311633223                    59.48861
## Dim.11 1.015636043      3.276245299                    62.76486
## Dim.12 1.011654193      3.263400624                    66.02826
## Dim.13 0.991563147      3.198590796                    69.22685
## Dim.14 0.987386367      3.185117314                    72.41197
## Dim.15 0.978225394      3.155565788                    75.56753
## Dim.16 0.969621045      3.127809824                    78.69534
## Dim.17 0.939634021      3.031077486                    81.72642
## Dim.18 0.930299737      3.000966893                    84.72739
## Dim.19 0.896980746      2.893486278                    87.62087
## Dim.20 0.771394522      2.488369426                    90.10924
## Dim.21 0.697714913      2.250693267                    92.35994
## Dim.22 0.620553495      2.001785468                    94.36172
## Dim.23 0.510055836      1.645341407                    96.00706
## Dim.24 0.360962624      1.164395561                    97.17146
## Dim.25 0.292455201      0.943403875                    98.11486
## Dim.26 0.253859205      0.818900661                    98.93376
## Dim.27 0.132623884      0.427818982                    99.36158
## Dim.28 0.126186683      0.407053816                    99.76864
## Dim.29 0.037931497      0.122359667                    99.89100
## Dim.30 0.031102913      0.100331978                    99.99133
## Dim.31 0.002688065      0.008671177                   100.00000
fviz_eig(famd, ncp=24, addlabels=TRUE)

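# Refit with 12 dimensions: those with an eigenvalue above 1 in the table above
# (about 66% of the cumulative variance)
famd <- FAMD(df_PRT_Mixed_K, ncp=12, graph=FALSE)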
famdvar <- get_famd_var(famd)
fviz_famd_var(famd, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)
## Warning: ggrepel: 13 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

fviz_contrib(famd, choice = "var", axes = 1, top = 24)

fviz_contrib(famd, choice = "var", axes = 2, top = 20)

fviz_contrib(famd, choice = "var", axes = 3, top = 20)

fviz_contrib(famd, choice = "var", axes = 4, top = 20)

fviz_contrib(famd, choice = "var", axes = 5, top = 20)

fviz_contrib(famd, choice = "var", axes = 6, top = 20)

datafamd <- data.frame(famd$ind$coord)
hopkins(datafamd)
## [1] 1
get_clust_tendency(datafamd, 2, graph=F, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.9872454
fviz_nbclust(datafamd, FUNcluster = kmeans, method = "silhouette", k.max = 8, nboot = 100)

fviz_nbclust(datafamd, FUNcluster = kmeans, method = "wss", k.max = 8, nboot = 100)

kmeansfamd4 <- eclust(datafamd, "kmeans", hc_metric="euclidean", k=4, graph = T)

fviz_cluster(kmeansfamd4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeansfamd4)
##   cluster size ave.sil.width
## 1       1 2189          0.19
## 2       2 1748          0.10
## 3       3   48          0.11
## 4       4 2654          0.20

kmeansfamd7 <- eclust(datafamd, "kmeans", hc_metric="euclidean", k=7, graph = T)

fviz_cluster(kmeansfamd7, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeansfamd7)
##   cluster size ave.sil.width
## 1       1 1212          0.23
## 2       2 1470          0.22
## 3       3  432          0.10
## 4       4 1238          0.24
## 5       5 1517          0.17
## 6       6  734          0.26
## 7       7   36          0.13

The average silhouette widths obtained here (between 0.10 and 0.26 per cluster) differ from those obtained with PCA and are too low to be considered satisfactory.

6. Summary

Exploring and comparing several approaches to clustering and dimension reduction helps to identify the most suitable technique for a given dataset, since each technique has its own assumptions and requirements. It also helps to understand the underlying structure of the dataset and the relationships between its variables. Applying different techniques makes it possible to identify the combination that extracts the most useful information from the data, which in turn supports informed decisions in areas such as marketing in the tourism sector.

7. References:

7.1 Nuno Antonio, Ana de Almeida, Luís Nunes, A hotel’s customers personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015–2018), Data in Brief, Volume 33, 2020, 106583, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2020.106583 (https://www.sciencedirect.com/science/article/pii/S2352340920314645)

7.2 van Leeuwen, Rik and Koole, Ger, Data-Driven Market Segmentation in Hospitality Using Unsupervised Machine Learning. Available at SSRN: https://ssrn.com/abstract=4091700 or http://dx.doi.org/10.2139/ssrn.4091700

7.3 Charrad, Malika, Ghazzali, Nadia, Boiteau, Véronique, and Niknafs, Azam. (2013). An examination of indices for determining the number of clusters: NbClust Package.

7.4 G. Szepannek. clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal Vol. 10/2, 2018, ISSN 2073-4859, https://journal.r-project.org/archive/2018/RJ-2018-048/RJ-2018-048.pdf

7.5 Francois Husson, Julie Josse, Sebastien Le, and Jeremy Mazet. (2022). Package ‘FactoMineR’: Multivariate Exploratory Data Analysis and Data Mining. Version 2.7. Retrieved from https://cran.r-project.org/package=FactoMineR.