When a company has a large customer base, it can group customers into clusters, or segments, that share demographic, behavioral, or geographic characteristics. This is called segmentation, and its purpose is to help organizations better understand their customers' behavior and traits so that they can design more personalized marketing strategies for specific groups. Data-driven segmentation can give a company a competitive advantage over its rivals and contribute to business growth.
Clustering algorithms are often used to perform customer segmentation, which involves dividing customers into distinct groups based on their shared characteristics. The primary goal of clustering is to group data objects into different subsets, so that objects within the same subset are similar to one another, while those in different subsets are dissimilar. Dimension reduction is a technique used to reduce the number of variables or features in a dataset while retaining as much of the original information as possible, thus making the data more manageable and easier to analyze.
In practice, data sets extracted from property management systems in the hospitality industry often include not only numerical data types, but also categorical data. This can pose a challenge because conventional clustering algorithms that rely on distance-based similarity calculations are not well-suited for clustering binary and mixed/categorical attributes. For example, the k-means algorithm is a popular choice for clustering large datasets, but it faces issues when calculating the cost function using Euclidean distance, which is only appropriate for numerical data. Similarly, dimension reduction can also pose a challenge when dealing with mixed or categorical data in the hospitality industry. Traditional dimension reduction techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) are only applicable to numerical data. In many cases, datasets comprise diverse types of variables, and dismissing nominal variables may lead to a notable reduction in analytical efficacy. Consequently, algorithms capable of handling mixed-type data can prove invaluable in such scenarios.
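To make the point about distance measures concrete, the following minimal sketch contrasts Euclidean distance on integer-coded categories with the Gower dissimilarity from `cluster::daisy()`. The data frame `guests` and all of its values are invented for illustration and are not taken from the hotel dataset.

```r
library(cluster)  # provides daisy() for Gower dissimilarity

# Hypothetical toy data: one numeric and one nominal variable
guests <- data.frame(
  Age     = c(30, 30, 30, 60),
  Channel = factor(c("Corporate", "Direct", "Travel Agent", "Corporate"))
)

# Naive approach: replace categories by integer codes and use Euclidean distance.
# The codes impose an artificial order, so "Corporate" appears closer to
# "Direct" than to "Travel Agent".
coded    <- data.frame(Age = guests$Age, Channel = as.integer(guests$Channel))
d_euclid <- as.matrix(dist(coded))

# Gower dissimilarity treats the factor as nominal: any two different
# categories contribute the same amount to the dissimilarity.
d_gower <- as.matrix(daisy(guests, metric = "gower"))

d_euclid[1, 2:3]  # guest 1 vs guests 2 and 3, integer-coded
d_gower[1, 2:3]   # guest 1 vs guests 2 and 3, Gower
```

Guests 1 to 3 have the same age and differ only in channel, so a sensible measure should place guests 2 and 3 equally far from guest 1. The Gower dissimilarity does this, whereas the integer coding does not.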
The primary aim of this article is to introduce and compare several methods of cluster analysis and dimension reduction applied to continuous and mixed data from “A Hotel's customers personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015-2018)”. The article is divided into two parts. In the first part, only continuous and count variables are selected, and the k-means, PAM and CLARA algorithms are applied to them. In the second part, several categorical variables are added to that dataset and two clustering approaches are applied to the new data: k-prototypes, and k-means on data in which the categorical variables have been converted to binary values and treated as numeric. Lastly, factor analysis of mixed data (FAMD) is used to perform dimension reduction and explore the underlying structure of the mixed data. FAMD can be thought of as an extension of traditional factor analysis that handles mixed data types. It constructs factors from the original data that capture the relationships between the variables, and it can be used to identify the most important features for a given problem.
The chosen dataset was obtained from the data article “A Hotel's customers personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015-2018)”. The article offers a hotel customer dataset covering three years together with a detailed description of its features, and the dataset is well suited to clustering and segmentation. The authors' stated aim is to address the shortage of real-world business data that can be used for educational purposes, which is why they released the dataset to the public. The dataset has 31 variables describing a total of 83,590 instances (customers) and comprises three full years of customer behavioral data.
Some of the algorithms used here are computationally expensive and cannot be run efficiently on a large dataset. While there are clustering algorithms such as CLARA or CLARANS that can handle large data sets, not all of the methods used in the analysis have comparable alternatives for large data sets. For this reason, only customers with Portuguese nationality are used in the analysis: they are one of the top 3 nationalities to book and stay at this hotel, and it is interesting to explore the segmentation of local guests further. Additionally, observations with missing values in any of the variables used in the analysis were excluded. After data cleaning and simplification, the dataset contains 6639 observations and includes 11 continuous/count variables and 15 categorical variables (including 13 binary).
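As a rough illustration of the second approach mentioned above (k-means on categorical variables converted to binary columns), the sketch below one-hot encodes a nominal variable with `model.matrix()` and then runs `kmeans()`. The data frame `guests` and its values are invented, and the choice of two clusters is arbitrary; this is not the encoding pipeline used later in the analysis.

```r
# Hypothetical mixed-type toy data (values invented for illustration)
guests <- data.frame(
  Age     = c(25, 32, 47, 51, 38, 60),
  Channel = factor(c("Direct", "Corporate", "Direct",
                     "Travel Agent", "Corporate", "Direct"))
)

# One 0/1 indicator column per category; "- 1" drops the intercept so that
# no level is absorbed into a baseline
dummies <- model.matrix(~ Channel - 1, data = guests)

# Combine the standardized numeric variable with the indicator columns
X <- cbind(Age = as.numeric(scale(guests$Age)), dummies)

set.seed(1)  # k-means starts from random centers
km <- kmeans(X, centers = 2, nstart = 25)
table(km$cluster)
```

Treating the indicators as numeric lets Euclidean distance be computed, but the relative weight of the binary columns against the scaled numeric ones is then an implicit modelling choice, which is one motivation for methods such as k-prototypes.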
Link to the article: https://doi.org/10.1016/j.dib.2020.106583
Age: Customer’s age (in years) at the last day of the extraction period.
DaysSinceCreation: Number of days since the customer record was created (number of days elapsed between the creation date and the last day of the extraction period)
AverageLeadTime: The average number of days elapsed between the customer’s booking date and arrival date. In other words, this variable is calculated by dividing the sum of the number of days elapsed between the moment each booking was made and its arrival date, by the total of bookings made by the customer
TotalRevenue: Total amount of money spent by the customer on lodging and other expenses (derived in this analysis as the sum of LodgingRevenue and OtherRevenue).
BookingsCanceled: Number of bookings the customer made but subsequently canceled (the customer informed the hotel he/she would not come to stay)
BookingsNoShowed: Number of bookings the customer made but which ended as a “no-show” (the customer did not cancel, but did not check in to stay at the hotel)
BookingsCheckedIn: Number of bookings the customer made that ended in a stay
PersonsNights: The total number of persons/nights that the customer stayed at the hotel. This value is calculated by summing the persons/nights of all of the customer's checked-in bookings. The persons/nights of each booking is the number of nights stayed multiplied by the sum of adults and children
RoomNights: Total of room/nights the customer stayed at the hotel (checked-in bookings). Room/nights are the multiplication of the number of rooms of each booking by the number of nights of the booking
DaysSinceLastStay: The number of days elapsed between the last day of the extraction and the customer’s last arrival date (of a checked-in booking). A value of −1 indicates the customer never stayed at the hotel
DaysSinceFirstStay: The number of days elapsed between the last day of the extraction and the customer’s first arrival date (of a checked-in booking). A value of −1 indicates the customer never stayed at the hotel
DistributionChannel: Categorical - Distribution channel usually used by the customer to make bookings at the hotel
MarketSegment: Categorical - Current market segment of the customer
SRHighFloor: Boolean - Indication if the customer usually asks for a room on a higher floor (0: No, 1: Yes)
SRLowFloor: Boolean - Indication if the customer usually asks for a room on a lower floor (0: No, 1: Yes)
SRAccessibleRoom: Boolean - Indication if the customer usually asks for an accessible room (0: No, 1: Yes)
SRMediumFloor: Boolean - Indication if the customer usually asks for a room on a middle floor (0: No, 1: Yes)
SRBathtub: Boolean - Indication if the customer usually asks for a room with a bathtub (0: No, 1: Yes)
SRShower: Boolean - Indication if the customer usually asks for a room with a shower (0: No, 1: Yes)
SRCrib: Boolean - Indication if the customer usually asks for a crib (0: No, 1: Yes)
SRKingSizeBed: Boolean - Indication if the customer usually asks for a room with a king-size bed (0: No, 1: Yes)
SRTwinBed: Boolean - Indication if the customer usually asks for a room with a twin bed (0: No, 1: Yes)
SRNearElevator: Boolean - Indication if the customer usually asks for a room near the elevator (0: No, 1: Yes)
SRAwayFromElevator: Boolean - Indication if the customer usually asks for a room away from the elevator (0: No, 1: Yes)
SRNoAlcoholInMiniBar: Boolean - Indication if the customer usually asks for a room with no alcohol in the mini-bar (0: No, 1: Yes)
SRQuietRoom: Boolean - Indication if the customer usually asks for a room away from the noise (0: No, 1: Yes)
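The definitions of PersonsNights, RoomNights and AverageLeadTime given in the list above can be checked with a small worked example. The `bookings` table below is hypothetical (two invented checked-in bookings of a single customer), and its column names are illustrative rather than fields of the published dataset.

```r
# Hypothetical checked-in bookings of a single customer (values invented)
bookings <- data.frame(
  nights    = c(2, 3),
  rooms     = c(1, 2),
  adults    = c(2, 3),
  children  = c(0, 1),
  lead_days = c(10, 30)   # days between booking date and arrival date
)

# PersonsNights: nights * (adults + children), summed over bookings
persons_nights <- sum(bookings$nights * (bookings$adults + bookings$children))

# RoomNights: rooms * nights, summed over bookings
room_nights <- sum(bookings$rooms * bookings$nights)

# AverageLeadTime: total lead days divided by the number of bookings
average_lead_time <- sum(bookings$lead_days) / nrow(bookings)

c(PersonsNights   = persons_nights,    # 2*2 + 3*4 = 16
  RoomNights      = room_nights,       # 1*2 + 2*3 = 8
  AverageLeadTime = average_lead_time) # (10 + 30) / 2 = 20
```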
# Load the packages used throughout the analysis.
# Duplicate library() calls and package start-up messages are omitted;
# the masking they reported is noted next to the package that causes it.
library(NbClust)
library(tidyverse)     # tidyverse 1.3.2 at render time; attaches ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
library(kableExtra)    # masks dplyr::group_rows()
library(factoextra)
library(readxl)
library(psych)         # describe(); masks ggplot2::`%+%` and ggplot2::alpha()
library(GGally)
library(rfm)
library(corrplot)      # corrplot 0.92 at render time
library(hopkins)
library(gridExtra)     # masks dplyr::combine()
library(ClusterR)      # also loads gtools, which masks psych::logit()
library(fpc)
library(clusterSim)    # also loads cluster and MASS; MASS masks dplyr::select()
library(FactoMineR)
library(knitr)
library(clustMixType)
library(CatEncoders)   # masks base::transform()
## ID Nationality Age DaysSinceCreation
## 1 1 PRT 51 150
## 2 2 PRT NULL 1095
## 3 3 DEU 31 1095
## 4 4 FRA 60 1095
## 5 5 FRA 51 1095
## 6 6 JPN 54 1095
## NameHash
## 1 0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E
## 2 0x21EDE41906B45079E75385B5AA33287CA09DE1AB86DE66EF88352FD1BE8DE368
## 3 0x31C5E4B74E23231295FDB724AD578C02C4A723F4BA2B4AF99F129EC2F4B3AD41
## 4 0xFF534C83C0EF23D1CE516BC80A65D0197003D27937D485BC549171D52CE13CEA
## 5 0x9C1DEF02C9BE242842C1C1ABF2C5AA249A1EEB4763B47FF457133EE9199F1037
## 6 0x6E70C1504EB27252542F58E4D3C8C83516E093334721A3CE1DD194FE3F98DA0F
## DocIDHash
## 1 0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137
## 2 0x5FA1E0098A31497057C5A6B9FE9D49FD6DD47CCE7C268E6548699E78E587AAEA
## 3 0xC7CF344F5B03295037595B1337AC905CA188F1B5B3A56C8C6E1A24202C9C672C
## 4 0xBD3823A9B4EC35D6CAF4B27AE423A677C0200DB61E823EE8BE57787729DCBDB8
## 5 0xE175754CF77247B202DD0820F49407C762C14A603B3A6CFEA2A4DC06A5F7E00C
## 6 0xE82EC1D6938A04CF19E1F7F55A402E7ABC686261537A24EAE7FF5CA92646528E
## AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled BookingsNoShowed
## 1 45 371.00 105.30 1 0
## 2 61 280.00 53.00 0 0
## 3 0 0.00 0.00 0 0
## 4 93 240.00 60.00 0 0
## 5 0 0.00 0.00 0 0
## 6 58 230.00 24.00 0 0
## BookingsCheckedIn PersonsNights RoomNights DaysSinceLastStay
## 1 3 8 5 151
## 2 1 10 5 1100
## 3 0 0 0 -1
## 4 1 10 5 1100
## 5 0 0 0 -1
## 6 1 4 2 1097
## DaysSinceFirstStay DistributionChannel MarketSegment SRHighFloor
## 1 1074 Corporate Corporate 0
## 2 1100 Travel Agent/Operator Travel Agent/Operator 0
## 3 -1 Travel Agent/Operator Travel Agent/Operator 0
## 4 1100 Travel Agent/Operator Travel Agent/Operator 0
## 5 -1 Travel Agent/Operator Travel Agent/Operator 0
## 6 1097 Travel Agent/Operator Other 0
## SRLowFloor SRAccessibleRoom SRMediumFloor SRBathtub SRShower SRCrib
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## SRKingSizeBed SRTwinBed SRNearElevator SRAwayFromElevator
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## SRNoAlcoholInMiniBar SRQuietRoom
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
In the first steps of the analysis, basic statistics are obtained to become more familiar with the data at hand.
print(class(df))
## [1] "data.frame"
print(dim(df))
## [1] 83590 31
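# str() prints the structure itself and returns NULL, so wrapping it in print() adds the trailing "NULL" line
print(str(df))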
## 'data.frame': 83590 obs. of 31 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Nationality : chr "PRT" "PRT" "DEU" "FRA" ...
## $ Age : chr "51" "NULL" "31" "60" ...
## $ DaysSinceCreation : num 150 1095 1095 1095 1095 ...
## $ NameHash : chr "0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E" "0x21EDE41906B45079E75385B5AA33287CA09DE1AB86DE66EF88352FD1BE8DE368" "0x31C5E4B74E23231295FDB724AD578C02C4A723F4BA2B4AF99F129EC2F4B3AD41" "0xFF534C83C0EF23D1CE516BC80A65D0197003D27937D485BC549171D52CE13CEA" ...
## $ DocIDHash : chr "0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137" "0x5FA1E0098A31497057C5A6B9FE9D49FD6DD47CCE7C268E6548699E78E587AAEA" "0xC7CF344F5B03295037595B1337AC905CA188F1B5B3A56C8C6E1A24202C9C672C" "0xBD3823A9B4EC35D6CAF4B27AE423A677C0200DB61E823EE8BE57787729DCBDB8" ...
## $ AverageLeadTime : num 45 61 0 93 0 58 0 38 0 96 ...
## $ LodgingRevenue : chr "371.00" "280.00" "0.00" "240.00" ...
## $ OtherRevenue : chr "105.30" "53.00" "0.00" "60.00" ...
## $ BookingsCanceled : num 1 0 0 0 0 0 0 0 0 0 ...
## $ BookingsNoShowed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BookingsCheckedIn : num 3 1 0 1 0 1 0 1 0 1 ...
## $ PersonsNights : num 8 10 0 10 0 4 0 10 0 6 ...
## $ RoomNights : num 5 5 0 5 0 2 0 5 0 3 ...
## $ DaysSinceLastStay : num 151 1100 -1 1100 -1 ...
## $ DaysSinceFirstStay : num 1074 1100 -1 1100 -1 ...
## $ DistributionChannel : chr "Corporate" "Travel Agent/Operator" "Travel Agent/Operator" "Travel Agent/Operator" ...
## $ MarketSegment : chr "Corporate" "Travel Agent/Operator" "Travel Agent/Operator" "Travel Agent/Operator" ...
## $ SRHighFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRLowFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRAccessibleRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRMediumFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRBathtub : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRShower : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRCrib : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRKingSizeBed : num 0 0 0 0 0 0 0 1 1 0 ...
## $ SRTwinBed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRNearElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRAwayFromElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRNoAlcoholInMiniBar: num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRQuietRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## NULL
# identifying the number of unique values for each variable
apply(df, 2, function(x) length(unique(x)))
## ID Nationality Age
## 83590 188 106
## DaysSinceCreation NameHash DocIDHash
## 1095 80642 76993
## AverageLeadTime LodgingRevenue OtherRevenue
## 418 10257 4490
## BookingsCanceled BookingsNoShowed BookingsCheckedIn
## 6 4 29
## PersonsNights RoomNights DaysSinceLastStay
## 56 48 1105
## DaysSinceFirstStay DistributionChannel MarketSegment
## 1108 4 7
## SRHighFloor SRLowFloor SRAccessibleRoom
## 2 2 2
## SRMediumFloor SRBathtub SRShower
## 2 2 2
## SRCrib SRKingSizeBed SRTwinBed
## 2 2 2
## SRNearElevator SRAwayFromElevator SRNoAlcoholInMiniBar
## 2 2 2
## SRQuietRoom
## 2
The str() output above and the describe() output below show a few things that require modification. Age is stored as text, with missing values recorded as the string “NULL”, so it is first converted to integer, which turns those entries into NA. The variables Age and AverageLeadTime also contain some negative values, which need to be removed. Additionally, the rows with missing Age values are dropped. Relative to the total number of records, these removals do not affect the data much.
df$Age <- as.integer(df$Age)
## Warning: NAs introduced by coercion
class(df$Age)
## [1] "integer"
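The coercion step above can be reproduced on a tiny vector to show why the warning appears and where the NA values come from. The vector `age_raw` is invented for illustration.

```r
# Hypothetical raw column: ages read as text, with "NULL" marking missing values
age_raw <- c("51", "NULL", "31", "-7")

# as.integer() cannot parse "NULL", so it returns NA and emits a warning;
# suppressWarnings() is used here only because the coercion is intentional
age <- suppressWarnings(as.integer(age_raw))

is.na(age)                   # TRUE only for the former "NULL" entry
sum(is.na(age))              # number of missing ages
age[!is.na(age) & age > 0]   # keep known, strictly positive ages
```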
describe(df)
## vars n mean sd median trimmed mad min
## ID 1 83590 41795.50 24130.50 41795.5 41795.50 30982.63 1
## Nationality* 2 83590 76.55 45.87 58.0 73.34 37.06 1
## Age 3 79811 45.40 16.57 46.0 45.68 17.79 -11
## DaysSinceCreation 4 83590 453.64 313.39 397.0 435.98 369.17 0
## NameHash* 5 83590 40325.44 23285.98 40360.5 40328.59 29905.52 1
## DocIDHash* 6 83590 38046.94 21830.94 36692.5 37936.41 27232.40 1
## AverageLeadTime 7 83590 66.20 87.76 29.0 49.18 43.00 -1
## LodgingRevenue* 8 83590 3554.28 3139.98 3173.0 3275.90 4271.37 1
## OtherRevenue* 9 83590 1737.87 1542.10 1445.0 1648.17 2140.87 1
## BookingsCanceled 10 83590 0.00 0.07 0.0 0.00 0.00 0
## BookingsNoShowed 11 83590 0.00 0.03 0.0 0.00 0.00 0
## BookingsCheckedIn 12 83590 0.79 0.70 1.0 0.83 0.00 0
## PersonsNights 13 83590 4.65 4.57 4.0 4.03 4.45 0
## RoomNights 14 83590 2.36 2.28 2.0 2.14 1.48 0
## DaysSinceLastStay 15 83590 401.07 347.20 366.0 378.32 493.71 -1
## DaysSinceFirstStay 16 83590 403.35 347.97 369.0 380.98 498.15 -1
## DistributionChannel* 17 83590 3.62 0.84 4.0 3.81 0.00 1
## MarketSegment* 18 83590 5.63 1.04 6.0 5.73 0.00 1
## SRHighFloor 19 83590 0.05 0.21 0.0 0.00 0.00 0
## SRLowFloor 20 83590 0.00 0.04 0.0 0.00 0.00 0
## SRAccessibleRoom 21 83590 0.00 0.02 0.0 0.00 0.00 0
## SRMediumFloor 22 83590 0.00 0.03 0.0 0.00 0.00 0
## SRBathtub 23 83590 0.00 0.05 0.0 0.00 0.00 0
## SRShower 24 83590 0.00 0.04 0.0 0.00 0.00 0
## SRCrib 25 83590 0.01 0.11 0.0 0.00 0.00 0
## SRKingSizeBed 26 83590 0.35 0.48 0.0 0.32 0.00 0
## SRTwinBed 27 83590 0.14 0.35 0.0 0.05 0.00 0
## SRNearElevator 28 83590 0.00 0.02 0.0 0.00 0.00 0
## SRAwayFromElevator 29 83590 0.00 0.06 0.0 0.00 0.00 0
## SRNoAlcoholInMiniBar 30 83590 0.00 0.01 0.0 0.00 0.00 0
## SRQuietRoom 31 83590 0.09 0.28 0.0 0.00 0.00 0
## max range skew kurtosis se
## ID 83590 83589 0.00 -1.20 83.46
## Nationality* 188 187 0.71 -0.68 0.16
## Age 122 133 -0.16 -0.29 0.06
## DaysSinceCreation 1095 1095 0.40 -1.14 1.08
## NameHash* 80642 80641 0.00 -1.20 80.54
## DocIDHash* 76993 76992 0.06 -1.14 75.51
## AverageLeadTime 588 589 1.91 4.48 0.30
## LodgingRevenue* 10257 10256 0.46 -1.00 10.86
## OtherRevenue* 4490 4489 0.27 -1.45 5.33
## BookingsCanceled 9 9 58.30 5270.38 0.00
## BookingsNoShowed 3 3 55.13 3587.55 0.00
## BookingsCheckedIn 66 66 26.88 1836.80 0.00
## PersonsNights 116 116 1.93 12.44 0.02
## RoomNights 185 185 11.19 647.65 0.01
## DaysSinceLastStay 1104 1105 0.31 -1.29 1.20
## DaysSinceFirstStay 1186 1187 0.30 -1.29 1.20
## DistributionChannel* 4 3 -1.87 1.87 0.00
## MarketSegment* 7 6 -1.17 1.48 0.00
## SRHighFloor 1 1 4.26 16.11 0.00
## SRLowFloor 1 1 26.56 703.37 0.00
## SRAccessibleRoom 1 1 63.07 3975.38 0.00
## SRMediumFloor 1 1 33.79 1140.04 0.00
## SRBathtub 1 1 18.66 346.21 0.00
## SRShower 1 1 24.11 579.53 0.00
## SRCrib 1 1 8.52 70.66 0.00
## SRKingSizeBed 1 1 0.62 -1.62 0.00
## SRTwinBed 1 1 2.04 2.18 0.00
## SRNearElevator 1 1 54.61 2980.29 0.00
## SRAwayFromElevator 1 1 16.80 280.29 0.00
## SRNoAlcoholInMiniBar 1 1 91.41 8353.80 0.00
## SRQuietRoom 1 1 2.90 6.41 0.00
# Removal of Missing Values (NAs)
colSums(is.na(df))
## ID Nationality Age
## 0 0 3779
## DaysSinceCreation NameHash DocIDHash
## 0 0 0
## AverageLeadTime LodgingRevenue OtherRevenue
## 0 0 0
## BookingsCanceled BookingsNoShowed BookingsCheckedIn
## 0 0 0
## PersonsNights RoomNights DaysSinceLastStay
## 0 0 0
## DaysSinceFirstStay DistributionChannel MarketSegment
## 0 0 0
## SRHighFloor SRLowFloor SRAccessibleRoom
## 0 0 0
## SRMediumFloor SRBathtub SRShower
## 0 0 0
## SRCrib SRKingSizeBed SRTwinBed
## 0 0 0
## SRNearElevator SRAwayFromElevator SRNoAlcoholInMiniBar
## 0 0 0
## SRQuietRoom
## 0
dim(df)
## [1] 83590 31
df <- df[complete.cases(df$Age),]
dim(df)
## [1] 79811 31
df[is.na(df$Age),]
## [1] ID Nationality Age
## [4] DaysSinceCreation NameHash DocIDHash
## [7] AverageLeadTime LodgingRevenue OtherRevenue
## [10] BookingsCanceled BookingsNoShowed BookingsCheckedIn
## [13] PersonsNights RoomNights DaysSinceLastStay
## [16] DaysSinceFirstStay DistributionChannel MarketSegment
## [19] SRHighFloor SRLowFloor SRAccessibleRoom
## [22] SRMediumFloor SRBathtub SRShower
## [25] SRCrib SRKingSizeBed SRTwinBed
## [28] SRNearElevator SRAwayFromElevator SRNoAlcoholInMiniBar
## [31] SRQuietRoom
## <0 rows> (or 0-length row.names)
colSums(is.na(df))
## ID Nationality Age
## 0 0 0
## DaysSinceCreation NameHash DocIDHash
## 0 0 0
## AverageLeadTime LodgingRevenue OtherRevenue
## 0 0 0
## BookingsCanceled BookingsNoShowed BookingsCheckedIn
## 0 0 0
## PersonsNights RoomNights DaysSinceLastStay
## 0 0 0
## DaysSinceFirstStay DistributionChannel MarketSegment
## 0 0 0
## SRHighFloor SRLowFloor SRAccessibleRoom
## 0 0 0
## SRMediumFloor SRBathtub SRShower
## 0 0 0
## SRCrib SRKingSizeBed SRTwinBed
## 0 0 0
## SRNearElevator SRAwayFromElevator SRNoAlcoholInMiniBar
## 0 0 0
## SRQuietRoom
## 0
# Remove non-positive values in Age (the filter below also drops an age of 0)
df <- df[df$Age > 0,]
# Remove negative values in AverageLeadTime
df <- df[df$AverageLeadTime >= 0,]
dim(df)
## [1] 79743 31
# Removal of outliers: drop implausible ages above 100
df <- df[df$Age < 101,]
dim(df)
## [1] 79735 31
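# DaysSinceLastStay == -1 marks customers who never stayed at the hotel; count them, then drop them
length(which(df$DaysSinceLastStay == -1))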
## [1] 19025
df <- df[df$DaysSinceLastStay != -1,]
describe(df$DaysSinceLastStay)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 60710 518.39 301.16 519 515.47 400.3 0 1104 1104 0.05 -1.18
## se
## X1 1.22
print(dim(df))
## [1] 60710 31
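The cleaning rules applied so far (known age, plausible age range, non-negative lead time, at least one stay) can be summarized in a single logical filter. The sketch below applies them to a hypothetical six-row table `toy` with invented values; it mirrors the conditions used above but is not the code that produced the row counts shown.

```r
# Hypothetical miniature of the customer table (values invented)
toy <- data.frame(
  Age               = c(51, NA, -11, 122, 35, 40),
  AverageLeadTime   = c(45, 10, 20, 30, -1, 60),
  DaysSinceLastStay = c(151, 300, 20, 40, 500, -1)
)

keep <- !is.na(toy$Age) &            # age must be known
  toy$Age > 0 & toy$Age < 101 &      # plausible age range
  toy$AverageLeadTime >= 0 &         # no negative lead times
  toy$DaysSinceLastStay != -1        # -1 flags customers who never stayed

toy_clean <- toy[which(keep), ]      # which() guards against NA in the index
nrow(toy_clean)                      # only the first row passes every rule
```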
# Total Revenue
# This new variable will be created (sum of Lodging & Other revenue) to fully represent the monetary dimension
df$TotalRevenue <- as.numeric(df$LodgingRevenue) + as.numeric(df$OtherRevenue)
class(df$TotalRevenue)
## [1] "numeric"
str(df)
## 'data.frame': 60710 obs. of 32 variables:
## $ ID : num 1 4 6 8 10 12 14 16 17 19 ...
## $ Nationality : chr "PRT" "FRA" "JPN" "FRA" ...
## $ Age : int 51 60 54 32 25 58 42 68 72 24 ...
## $ DaysSinceCreation : num 150 1095 1095 1095 1095 ...
## $ NameHash : chr "0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E" "0xFF534C83C0EF23D1CE516BC80A65D0197003D27937D485BC549171D52CE13CEA" "0x6E70C1504EB27252542F58E4D3C8C83516E093334721A3CE1DD194FE3F98DA0F" "0x5A3A2D6A659769FCA243FC2A97644D27A75FB9AA4DF38D55145E5BEBDB4F06AA" ...
## $ DocIDHash : chr "0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137" "0xBD3823A9B4EC35D6CAF4B27AE423A677C0200DB61E823EE8BE57787729DCBDB8" "0xE82EC1D6938A04CF19E1F7F55A402E7ABC686261537A24EAE7FF5CA92646528E" "0xB27F5644C88A7148360EFFF55D8F40565BAC3084B4C4A03F9EED486EB2437B2C" ...
## $ AverageLeadTime : num 45 93 58 38 96 60 87 11 11 109 ...
## $ LodgingRevenue : chr "371.00" "240.00" "230.00" "535.00" ...
## $ OtherRevenue : chr "105.30" "60.00" "24.00" "94.00" ...
## $ BookingsCanceled : num 1 0 0 0 0 0 0 0 0 0 ...
## $ BookingsNoShowed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BookingsCheckedIn : num 3 1 1 1 1 1 1 1 1 1 ...
## $ PersonsNights : num 8 10 4 10 6 10 8 6 3 8 ...
## $ RoomNights : num 5 5 2 5 3 5 4 3 3 4 ...
## $ DaysSinceLastStay : num 151 1100 1097 1100 1098 ...
## $ DaysSinceFirstStay : num 1074 1100 1097 1100 1098 ...
## $ DistributionChannel : chr "Corporate" "Travel Agent/Operator" "Travel Agent/Operator" "Travel Agent/Operator" ...
## $ MarketSegment : chr "Corporate" "Travel Agent/Operator" "Other" "Other" ...
## $ SRHighFloor : num 0 0 0 0 0 0 0 0 0 1 ...
## $ SRLowFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRAccessibleRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRMediumFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRBathtub : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRShower : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRCrib : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRKingSizeBed : num 0 0 0 1 0 0 0 0 0 1 ...
## $ SRTwinBed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRNearElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRAwayFromElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRNoAlcoholInMiniBar: num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRQuietRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TotalRevenue : num 476 300 254 629 243 ...
# top 15 nationalities that book and stay at this hotel
df %>%
count(Nationality) %>%
arrange(desc(n)) %>%
slice(1:15)
## Nationality n
## 1 FRA 9381
## 2 DEU 7890
## 3 PRT 6639
## 4 GBR 6565
## 5 ESP 3937
## 6 ITA 2561
## 7 USA 2430
## 8 BEL 2328
## 9 NLD 2060
## 10 BRA 2022
## 11 CHE 1589
## 12 IRL 1489
## 13 AUT 1094
## 14 CAN 978
## 15 SWE 907
Portugal is among the top 3 nationalities of guests, after France and Germany. Guests with Portuguese nationality were selected because they are the local guests, on the assumption that local guests are more likely than foreign guests to visit the hotel regularly (e.g. on business trips). In addition, restricting the data was necessary for simplification, because some of the algorithms have a time complexity that makes them impractical on a large dataset.
df_PRT <- subset(df, Nationality %in% c("PRT"))
dim(df_PRT)
## [1] 6639 32
Clustering algorithms that are based on distance measures or density estimates are best suited to continuous or interval data. In the first part of the analysis, the dataset therefore includes only continuous and count variables, and the clustering algorithms are applied to these. The results are later compared, in the second part of the analysis, with clustering algorithms that are specifically designed for mixed data, such as k-prototypes.
# This dataset will only include continuous and count variables for Part 1 analysis (excluding mixed data)
df_PRT_1 <- data.frame(df_PRT$Age, df_PRT$DaysSinceCreation, df_PRT$AverageLeadTime, df_PRT$BookingsCanceled, df_PRT$BookingsNoShowed, df_PRT$BookingsCheckedIn, df_PRT$PersonsNights, df_PRT$RoomNights, df_PRT$DaysSinceLastStay, df_PRT$DaysSinceFirstStay, df_PRT$TotalRevenue)
head(df_PRT_1)
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 51 150 45
## 2 39 1095 1
## 3 71 1095 85
## 4 43 1095 78
## 5 38 1095 98
## 6 28 1094 103
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 0 3
## 2 0 0 9
## 3 0 0 1
## 4 0 0 1
## 5 0 0 1
## 6 0 0 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 8 5 151
## 2 18 14 591
## 3 6 3 1098
## 4 6 3 1098
## 5 3 3 1098
## 6 12 4 1098
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1074 476.30
## 2 1117 1338.99
## 3 1098 234.60
## 4 1098 281.10
## 5 1098 158.10
## 6 1098 1728.00
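Standardizing the data puts all variables on the same scale (mean 0, standard deviation 1), so that variables measured in large units, such as revenue or day counts, do not dominate the distance calculations. Standardizing before computing the Hopkins statistic and running the clustering algorithms therefore makes the results more meaningful and easier to interpret. Note that standardization does not by itself make the results robust to outliers, because extreme values remain extreme after scaling.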
# Scaling
df_PRT_scaled<- scale(df_PRT_1)
head(df_PRT_scaled)
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## [1,] 0.4088669 -1.263162 -0.06144993
## [2,] -0.4658037 1.659981 -0.66333119
## [3,] 1.8666513 1.659981 0.48571486
## [4,] -0.1742468 1.659981 0.38996102
## [5,] -0.5386930 1.659981 0.66354341
## [6,] -1.2675852 1.656887 0.73193901
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## [1,] 4.66405102 -0.05799844 1.0314323
## [2,] -0.07785323 -0.05799844 4.4907295
## [3,] -0.07785323 -0.05799844 -0.1216668
## [4,] -0.07785323 -0.05799844 -0.1216668
## [5,] -0.07785323 -0.05799844 -0.1216668
## [6,] -0.07785323 -0.05799844 -0.1216668
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## [1,] 0.7894798 0.7450725 -1.2076818
## [2,] 2.9915454 3.4657733 0.1441954
## [3,] 0.3490667 0.1404723 1.7019267
## [4,] 0.3490667 0.1404723 1.7019267
## [5,] -0.3115530 0.1404723 1.7019267
## [6,] 1.6703061 0.4427724 1.7019267
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## [1,] 1.586808 0.2919791
## [2,] 1.719782 2.2157474
## [3,] 1.661026 -0.2470033
## [4,] 1.661026 -0.1433100
## [5,] 1.661026 -0.4175956
## [6,] 1.661026 3.0832260
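For reference, `scale()` with its default arguments subtracts each column's mean and divides by its sample standard deviation. The short check below confirms this on an invented two-column matrix `x`; the values are illustrative only.

```r
# Hypothetical values (invented): revenue is on a far larger scale than age
x <- cbind(Age = c(25, 40, 55), TotalRevenue = c(200, 1500, 400))

# Manual z-scores: subtract the column means, then divide by the column sds
centered <- sweep(x, 2, colMeans(x))                 # default FUN is "-"
z_manual <- sweep(centered, 2, apply(x, 2, sd), "/")

z_scale <- scale(x)

all.equal(z_manual, z_scale, check.attributes = FALSE)  # same numbers
round(colMeans(z_scale), 10)   # each column is centered at 0
apply(z_scale, 2, sd)          # and has unit standard deviation
```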
The Hopkins statistic is a measure used in cluster analysis to assess the clustering tendency of a dataset. It compares the spatial distribution of the observed data with that of points drawn at random from a uniform distribution over the same region. With the convention used by the hopkins package, a value close to 1 indicates that the data is highly clustered, a value around 0.5 indicates randomly (uniformly) distributed data, and a value close to 0 indicates regularly spaced data.
hopkins(df_PRT_scaled)
## [1] 0.9999953
The value returned by hopkins() is very close to 1, which suggests that the data has a very strong clustering tendency. In other words, the data is highly likely to form groups and is a good candidate for cluster analysis.
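To show what the statistic measures, here is a minimal base-R sketch of one common formulation: nearest-neighbour distances from uniformly drawn points (u) are compared with nearest-neighbour distances between observed points (w), each raised to the power of the number of columns d. The helper `hopkins_sketch`, the sample size `m = 20` and the simulated data are all assumptions made for illustration; this is not the implementation in the hopkins package and will not reproduce the value reported above.

```r
# Hypothetical helper illustrating a Hopkins-type statistic
hopkins_sketch <- function(X, m = 20) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # Distance from a point p to its nearest neighbour among the rows of Y
  nn_dist <- function(p, Y) min(sqrt(rowSums(sweep(Y, 2, p)^2)))

  # (1) m points drawn uniformly from the bounding box of the data
  U <- sapply(seq_len(d), function(j) runif(m, min(X[, j]), max(X[, j])))
  u <- apply(U, 1, nn_dist, Y = X)

  # (2) m observed points, each compared with the rest of the data
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))

  sum(u^d) / (sum(u^d) + sum(w^d))
}

set.seed(1)
# Two tight, well-separated groups versus structureless uniform noise
clustered <- rbind(matrix(rnorm(100, 0, 0.1), ncol = 2),
                   matrix(rnorm(100, 5, 0.1), ncol = 2))
random    <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered)  # expected to be close to 1
h_random    <- hopkins_sketch(random)     # expected to be near 0.5
c(clustered = h_clustered, random = h_random)
```

In clustered data the uniform points mostly fall in empty space, so the u distances dwarf the w distances and the ratio approaches 1.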
Selecting the optimal number of clusters in k-means is an important step in the clustering process. There are several methods that can be used to determine the optimal number of clusters, including the elbow method, the silhouette method, and the gap statistic.
The silhouette method is a way to measure the quality of a clustering solution by assessing the degree of similarity between each data point and its assigned cluster compared to other clusters. The silhouette function in R computes the silhouette width for each data point, which ranges from -1 to 1, with larger values indicating better clustering.
# Average silhouette width for k up to 8 (nboot only applies to method = "gap_stat", so it is omitted here)
fviz_nbclust(df_PRT_scaled, FUNcluster = kmeans, method = "silhouette", k.max = 8)
In the plot above, the average silhouette width is highest for k = 2, 3 and 4, which indicates that these solutions produce the best-separated clusters. The remaining values of k give lower average silhouette widths, so their clusters may be less well defined. It is useful to examine the distribution of silhouette widths for each of k = 2, 3 and 4 to assess the quality of the clustering solution further.
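The quantity plotted by the silhouette method can also be computed directly with `cluster::silhouette()`. The sketch below does so on simulated two-group data; the matrix `toy`, the range of k and `nstart = 25` are illustrative assumptions rather than settings taken from the analysis.

```r
library(cluster)  # silhouette()

set.seed(42)
# Hypothetical two-group toy data (simulated, not the hotel data)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(2:5, function(k) {
  km  <- kmeans(toy, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(toy))  # one row per observation
  mean(sil[, "sil_width"])
})
names(avg_sil) <- 2:5

avg_sil                    # each value lies between -1 and 1
names(which.max(avg_sil))  # k with the largest average silhouette width
```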
Here the method argument is set to “wss”, the total within-cluster sum of squares (WSS), to determine the appropriate number of clusters. The resulting plot shows the WSS for cluster solutions ranging from k = 1 to 8. The appropriate number of clusters is read off at the “elbow” of the curve, where the WSS begins to level off, which is k = 3 in this case.
# Total within-cluster sum of squares for k = 1..8 (elbow method)
fviz_nbclust(df_PRT_scaled, FUNcluster = kmeans, method = "wss", k.max = 8)
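The WSS curve behind the elbow method is simply `tot.withinss` from `kmeans()` evaluated over a range of k. A minimal sketch on simulated three-group data follows; the matrix `toy` and `nstart = 25` are assumptions made for illustration.

```r
set.seed(42)
# Hypothetical three-group toy data (simulated, not the hotel data)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2),
             matrix(rnorm(60, mean = 8), ncol = 2))

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# For k = 1 the WSS equals the total sum of squares around the column means
total_ss <- sum(scale(toy, scale = FALSE)^2)

data.frame(k = 1:8, wss = round(wss, 1))
```

The elbow is the value of k after which the drop in WSS becomes small compared with the drops before it; the judgment is visual and can be ambiguous.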
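With method = “gap_stat”, fviz_nbclust() plots the gap statistic for a range of cluster solutions. The gap statistic compares the within-cluster dispersion of the observed data with the dispersion expected under a null reference distribution, so larger values indicate stronger cluster structure. The optimal number of clusters is typically taken as the value that maximizes the gap statistic or, under a one-standard-error rule, as the smallest k whose gap lies within one standard error of a maximum. In the plot, the optimal number of clusters is denoted by the blue vertical line: k = 2.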
# Gap statistic for k = 1..8 with 50 bootstrap reference datasets
fviz_nbclust(df_PRT_scaled, FUNcluster = kmeans, method = "gap_stat", k.max = 8, nboot = 50)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 331950)
## Warning: did not converge in 10 iterations
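The gap statistic and the selection rules can be inspected directly with `cluster::clusGap()` and `cluster::maxSE()`. The sketch below uses simulated data, and `B = 50`, `nstart = 25` and `iter.max = 50` are illustrative choices. Raising `iter.max` above the `kmeans()` default of 10 is one plausible way to reduce convergence warnings like those shown above, although whether it removes them on the hotel data is not verified here.

```r
library(cluster)  # clusGap(), maxSE()

set.seed(42)
# Hypothetical two-group toy data (simulated, not the hotel data)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))

# Extra arguments (nstart, iter.max) are passed on to kmeans()
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 5, B = 50,
               nstart = 25, iter.max = 50)
tab <- gap$Tab  # columns: logW, E.logW, gap, SE.sim

# Two selection rules: the global maximum of the gap, and the
# one-standard-error rule applied at the first local maximum
k_max <- which.max(tab[, "gap"])
k_se  <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
c(global_max = unname(k_max), first_SE_max = k_se)
```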
AIC values are shown for 1 to 8 clusters. The optimal number of clusters is the value that minimizes the AIC criterion, which in this case is 3, with an AIC value of 45589.2.
opt<-Optimal_Clusters_KMeans(df_PRT_scaled, max_clusters=8, plot_clusters=TRUE, criterion="AIC")
The optimal number of clusters according to the BIC criterion is the one that minimizes the BIC score, that is, the number of clusters that best balances the goodness of fit of the model against its complexity.
In the plot below, the optimal number of clusters is k = 3.
opt2<-Optimal_Clusters_KMeans(df_PRT_scaled, max_clusters=8, plot_clusters=TRUE, criterion="BIC")
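To illustrate the trade-off between fit and complexity, the sketch below computes AIC- and BIC-style scores for k-means using a commonly cited heuristic: the total within-cluster sum of squares plus a penalty on the number of estimated center coordinates. This is an assumption made for illustration; it is not confirmed here that Optimal_Clusters_KMeans uses exactly this formula. The data are synthetic and kmeans_ic is a hypothetical helper.

```r
set.seed(42)
# Assumption: three 2-D Gaussian blobs stand in for df_PRT_scaled
blob_centers <- rbind(c(0, 0), c(4, 4), c(0, 4))
x <- blob_centers[rep(1:3, each = 50), ] + matrix(rnorm(300, sd = 0.5), ncol = 2)

# Heuristic information criteria for a k-means fit:
#   D = total within-cluster sum of squares (lack of fit)
#   m * k = number of estimated center coordinates (complexity)
kmeans_ic <- function(fit, n) {
  m <- ncol(fit$centers)
  k <- nrow(fit$centers)
  D <- fit$tot.withinss
  c(AIC = D + 2 * m * k, BIC = D + log(n) * m * k)
}

ic <- t(sapply(1:8, function(k) {
  kmeans_ic(kmeans(x, centers = k, nstart = 25), n = nrow(x))
}))
rownames(ic) <- 1:8

print(round(ic, 1))
print(apply(ic, 2, which.min))  # k minimizing each criterion
```

Because log(n) exceeds 2 for any realistic sample size, the BIC-style score penalizes additional clusters more heavily than the AIC-style score.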
Given the above results, the candidate numbers of clusters for k-means are k = 2, 3, and 4.
k-means clustering is applied to explore the structure of a dataset and identify natural groupings of observations.
# Perform k-means clustering for k = 2
kmeans2 <- eclust(df_PRT_scaled, "kmeans", hc_metric="euclidean", k=2)
fviz_cluster(kmeans2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeans2)
## cluster size ave.sil.width
## 1 1 3329 0.39
## 2 2 3310 0.24
# Get cluster assignments for each observation
cluster_assignments <- kmeans2$cluster
# Assign the cluster assignments to the original data frame
df_PRT_clusters <- data.frame(df_PRT_scaled, cluster = cluster_assignments)
# Get the mean value of each variable for each cluster
aggregate(df_PRT_clusters[,1:ncol(df_PRT_scaled)], by=list(df_PRT_clusters$cluster), FUN=mean)
## Group.1 df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 0.02219716 0.8610029 -0.04073196
## 2 2 -0.02232457 -0.8659452 0.04096577
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.05221361 -0.04715477 -0.05879883
## 2 0.05251333 0.04742545 0.05913635
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 -0.01487938 -0.05258565 0.8597482
## 2 0.01496479 0.05288750 -0.8646833
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 0.8605461 -0.09380382
## 2 -0.8654858 0.09434227
For k = 2, the average silhouette width is 0.31, which suggests a moderately good, though not strong, clustering solution. To understand why cluster 2 has the lower silhouette coefficient (0.24), the mean of each variable is computed per cluster: if one cluster has a much higher mean for a particular variable than the other, that variable is likely to be strongly associated with that cluster. A low silhouette coefficient indicates that the cluster may contain points, possibly outliers, that are not well clustered with the other data points.
Cluster 2 has slightly above-average means for TotalRevenue, RoomNights, and BookingsCheckedIn. This is consistent with the low silhouette coefficients of certain data points in that cluster: they represent customers who spend far more than the others because of multiple bookings at the same hotel and a greater number of booked room nights. These customers are a minority, but they could potentially be categorized as loyal.
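The overall average silhouette width is the size-weighted mean of the per-cluster averages. The short sketch below recomputes it from the rounded values printed by fviz_silhouette for k = 2; the variable names are introduced only for this check.

```r
# Per-cluster sizes and average silhouette widths reported for k = 2
size  <- c(3329, 3310)
width <- c(0.39, 0.24)

# Overall average = mean of per-point widths = size-weighted mean of cluster means
overall <- weighted.mean(width, w = size)

# Share of observations in each cluster
share <- size / sum(size)

print(round(overall, 3))
print(round(share, 3))
```

Because the per-cluster widths are already rounded to two decimals, the recomputed value can differ from the reported 0.31 in the second decimal place.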
# Perform k-means clustering for k = 3
kmeans3 <- eclust(df_PRT_scaled, "kmeans", hc_metric="euclidean", k=3)
fviz_cluster(kmeans3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeans3)
## cluster size ave.sil.width
## 1 1 3319 0.36
## 2 2 3274 0.30
## 3 3 46 -0.08
# Get cluster assignments for each observation
cluster_assignments <- kmeans3$cluster
# Assign the cluster assignments to the original data frame
df_PRT_clusters <- data.frame(df_PRT_scaled, cluster = cluster_assignments)
# Get the mean value of each variable for each cluster
aggregate(df_PRT_clusters[,1:ncol(df_PRT_scaled)], by=list(df_PRT_clusters$cluster), FUN=mean)
## Group.1 df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 0.02711539 0.8620822 -0.03616890
## 2 2 -0.03024945 -0.8820315 0.04159471
## 3 3 0.19653745 0.5765284 -0.35079304
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.06070865 -0.04712210 -0.06295221
## 2 -0.04454113 -0.03227149 -0.05351632
## 3 7.55042752 5.69684993 8.35110466
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 -0.01624132 -0.06382388 0.8601425
## 2 -0.06054171 -0.04160946 -0.8581631
## 3 5.48083698 7.56653979 -0.9823244
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 0.8609211 -0.09351338
## 2 -0.8821248 0.02698286
## 3 0.6669449 4.82671804
For k = 3, the average silhouette width is 0.33, which again suggests a moderately good clustering solution. Here, the potential outliers are all grouped in cluster 3 (46 customers), which has a low, slightly negative silhouette coefficient (-0.08).
# Perform k-means clustering for k = 4
kmeans4 <- eclust(df_PRT_scaled, "kmeans", hc_metric="euclidean", k=4)
fviz_cluster(kmeans4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeans4)
## cluster size ave.sil.width
## 1 1 3009 0.39
## 2 2 2743 0.40
## 3 3 21 0.01
## 4 4 866 -0.09
# Get cluster assignments for each observation
cluster_assignments <- kmeans4$cluster
# Assign the cluster assignments to the original data frame
df_PRT_clusters <- data.frame(df_PRT_scaled, cluster = cluster_assignments)
# Get the mean value of each variable for each cluster
aggregate(df_PRT_clusters[,1:ncol(df_PRT_scaled)], by=list(df_PRT_clusters$cluster), FUN=mean)
## Group.1 df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 -0.009355879 0.8873338 -0.1518803
## 2 2 -0.094740998 -0.9341515 -0.2781166
## 3 3 0.183257427 0.3699388 -0.4184098
## 4 4 0.328150105 -0.1332314 1.4187856
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.06209416 -0.04600157 -0.09139268
## 2 -0.05883720 -0.03606468 -0.07647608
## 3 12.11561485 10.82887921 12.64478724
## 4 0.10831854 0.01147537 0.25315697
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 -0.1548689 -0.1515803 0.8996174
## 2 -0.2667570 -0.1791304 -0.9042274
## 3 7.0601238 9.7277038 -1.1665695
## 4 1.2118391 0.8581732 -0.2334354
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 0.8850805 -0.2033953
## 2 -0.9369314 -0.1704813
## 3 0.5434815 4.7688032
## 4 -0.1208053 1.1310644
For k = 4, the average silhouette width is also 0.33, again suggesting a moderately good clustering solution. Clusters 3 and 4 have low silhouette coefficients (0.01 and -0.09), which suggests that they are not well clustered.
However, let's look at other methods to see whether they yield different results.
The Calinski-Harabasz index is calculated for the clusters obtained from the k-means algorithm. It measures the separation between clusters as the ratio of the between-cluster dispersion to the within-cluster dispersion, each scaled by its degrees of freedom; higher values indicate more compact and better-separated clusters.
round(calinhara(df_PRT_scaled,kmeans2$cluster),digits=2)
## [1] 1712.95
round(calinhara(df_PRT_scaled,kmeans3$cluster),digits=2)
## [1] 2006.59
round(calinhara(df_PRT_scaled,kmeans4$cluster),digits=2)
## [1] 1790.25
The above results show that the Calinski-Harabasz index peaks at k=3 (2006.59) and is lower for both k=2 (1712.95) and k=4 (1790.25). This suggests that k=3 might be a good choice for the number of clusters, as it leads to well-separated clusters without over-segmenting the data.
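The index can be reproduced from the components of a kmeans() fit as (BSS / (k - 1)) / (WSS / (n - k)). The sketch below uses synthetic data as a stand-in for df_PRT_scaled (an assumption), and ch_index is a hypothetical helper; on the same clustering it is expected to agree with calinhara up to floating-point error.

```r
set.seed(42)
# Assumption: three 2-D Gaussian blobs stand in for df_PRT_scaled
blob_centers <- rbind(c(0, 0), c(4, 4), c(0, 4))
x <- blob_centers[rep(1:3, each = 50), ] + matrix(rnorm(300, sd = 0.5), ncol = 2)

# Calinski-Harabasz index from a kmeans fit:
# between-cluster dispersion per degree of freedom divided by
# within-cluster dispersion per degree of freedom
ch_index <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

ch <- sapply(2:5, function(k) {
  ch_index(kmeans(x, centers = k, nstart = 25), n = nrow(x))
})
names(ch) <- 2:5

print(round(ch, 2))
print(names(which.max(ch)))  # k with the highest index
```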
kmeans2$centers
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 0.02219716 0.8610029 -0.04073196
## 2 -0.02232457 -0.8659452 0.04096577
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.05221361 -0.04715477 -0.05879883
## 2 0.05251333 0.04742545 0.05913635
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 -0.01487938 -0.05258565 0.8597482
## 2 0.01496479 0.05288750 -0.8646833
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 0.8605461 -0.09380382
## 2 -0.8654858 0.09434227
kmeans3$centers
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 0.02711539 0.8620822 -0.03616890
## 2 -0.03024945 -0.8820315 0.04159471
## 3 0.19653745 0.5765284 -0.35079304
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.06070865 -0.04712210 -0.06295221
## 2 -0.04454113 -0.03227149 -0.05351632
## 3 7.55042752 5.69684993 8.35110466
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 -0.01624132 -0.06382388 0.8601425
## 2 -0.06054171 -0.04160946 -0.8581631
## 3 5.48083698 7.56653979 -0.9823244
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 0.8609211 -0.09351338
## 2 -0.8821248 0.02698286
## 3 0.6669449 4.82671804
The centers for k=3 above describe three distinct clusters in the data.
Cluster 1 has above-average values in df_PRT.DaysSinceCreation, df_PRT.DaysSinceLastStay, and df_PRT.DaysSinceFirstStay, slightly below-average values in df_PRT.TotalRevenue, and close-to-average values in the other dimensions.
Cluster 2 has below-average values in df_PRT.DaysSinceCreation, df_PRT.DaysSinceLastStay, and df_PRT.DaysSinceFirstStay, marginally above-average values in df_PRT.TotalRevenue, and close-to-average values in the other dimensions.
Cluster 3 has very high values in df_PRT.BookingsCanceled, df_PRT.BookingsNoShowed, df_PRT.BookingsCheckedIn, df_PRT.PersonsNights, df_PRT.RoomNights, and df_PRT.TotalRevenue, and somewhat below-average values in df_PRT.AverageLeadTime. This suggests a small group of frequent customers who make many bookings and generate high revenue, but who also cancel or fail to show up far more often than average, which may represent lost revenue.
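Since the centers are expressed in standard-deviation units, a simple way to read them is to list, for each cluster, the variables whose center lies more than some threshold away from zero. The sketch below uses the k = 3 centers reported above, rounded to two decimals; the 0.5 threshold and the names centers3 and distinctive are assumptions made for illustration.

```r
vars <- c("Age", "DaysSinceCreation", "AverageLeadTime", "BookingsCanceled",
          "BookingsNoShowed", "BookingsCheckedIn", "PersonsNights", "RoomNights",
          "DaysSinceLastStay", "DaysSinceFirstStay", "TotalRevenue")

# k = 3 cluster centers from kmeans3$centers, rounded to two decimals
centers3 <- rbind(
  c( 0.03,  0.86, -0.04, -0.06, -0.05, -0.06, -0.02, -0.06,  0.86,  0.86, -0.09),
  c(-0.03, -0.88,  0.04, -0.04, -0.03, -0.05, -0.06, -0.04, -0.86, -0.88,  0.03),
  c( 0.20,  0.58, -0.35,  7.55,  5.70,  8.35,  5.48,  7.57, -0.98,  0.67,  4.83)
)
dimnames(centers3) <- list(paste0("cluster", 1:3), vars)

# Variables whose center is more than 0.5 standard deviations from the mean
threshold <- 0.5  # illustrative cut-off, not taken from the original analysis
distinctive <- lapply(seq_len(nrow(centers3)), function(i) {
  colnames(centers3)[abs(centers3[i, ]) > threshold]
})
names(distinctive) <- rownames(centers3)

print(distinctive)
```

With this cut-off, clusters 1 and 2 are distinguished only by the three recency and tenure variables, while cluster 3 stands out on the booking, stay, and revenue variables.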
PAM (Partitioning Around Medoids) is especially useful when the dissimilarity measure used in clustering is not the Euclidean distance, although it scales less well to large datasets than k-means. Unlike k-means, which uses the mean of each cluster as its center point, PAM uses a medoid: the actual observation with the smallest total dissimilarity to all other points in the cluster. PAM is more robust to outliers than k-means, because an outlier is unlikely to be chosen as a medoid and cannot pull the center toward itself the way it shifts a mean.
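The difference between a mean and a medoid can be seen on a single toy cluster with one extreme point. This is a minimal sketch on made-up data; pts, centroid, and medoid are illustrative names.

```r
set.seed(1)
# 20 points around the origin plus one extreme outlier (made-up data)
pts <- rbind(matrix(rnorm(40, sd = 0.3), ncol = 2), c(25, 25))

# k-means style center: the coordinate-wise mean, which the outlier drags away
centroid <- colMeans(pts)

# PAM style center: the observation with the smallest total distance to all others
D <- as.matrix(dist(pts))
medoid_idx <- which.min(rowSums(D))
medoid <- pts[medoid_idx, ]

print(round(centroid, 2))
print(medoid_idx)
print(round(medoid, 2))
```

The mean is shifted noticeably toward the outlier, whereas the medoid remains one of the points in the dense part of the cluster.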
fviz_nbclust(df_PRT_scaled, FUNcluster = cluster::pam, method = "silhouette", k.max = 8)
In the plot above, we can see that, unlike with k-means, only k=2 attains the highest average silhouette width, indicating that it gives the best-defined clusters. The remaining values of k have lower average silhouette widths, indicating that their clusters may be less well-defined.
fviz_nbclust(df_PRT_scaled, FUNcluster = cluster::pam, method = "wss", k.max = 8)
The appropriate number of clusters can be determined from the “elbow” point in the plot, where the WSS begins to level off, which can be k=2 or k=3 in this case.
# Perform PAM clustering using 2 clusters
pam2<-eclust(df_PRT_scaled, "pam", hc_metric="euclidean", k=2)
fviz_cluster(pam2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 3163 0.34
## 2 2 3476 0.29
# Perform PAM clustering using 3 clusters
pam3<-eclust(df_PRT_scaled, "pam", hc_metric="euclidean", k=3)
fviz_cluster(pam3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(pam3)
## cluster size ave.sil.width
## 1 1 2274 -0.03
## 2 2 2387 0.35
## 3 3 1978 0.30
The results of PAM are similar to the previous results obtained from k-means.
CLARA (Clustering Large Applications), used here through cluster::clara, can be a good choice when dealing with large data sets and outliers: it applies PAM to subsamples of the data and keeps the best set of medoids. It should not be confused with CLARANS, a related but different randomized search algorithm.
fviz_nbclust(df_PRT_scaled, FUNcluster = cluster::clara, method = "silhouette", k.max = 8)
In the plot above, we can see that k=2 and k=3 have the highest average silhouette widths, indicating well-defined clusters. The remaining candidate numbers of clusters have lower average silhouette widths, indicating that their clusters may be less well-defined. The result is similar to the previous two algorithms.
fviz_nbclust(df_PRT_scaled, FUNcluster = cluster::clara, method = "wss", k.max = 8)
The appropriate number of clusters can be determined from the “elbow” point in the plot, where the WSS begins to level off, which can be k=2 or k=3 in this case, similar to the optimal number of clusters found with the previous algorithms.
cl2<-eclust(df_PRT_scaled, "clara", k=2)
fviz_silhouette(cl2)
## cluster size ave.sil.width
## 1 1 3378 0.24
## 2 2 3261 0.39
cl3<-eclust(df_PRT_scaled, "clara", k=3)
fviz_silhouette(cl3)
## cluster size ave.sil.width
## 1 1 3081 0.32
## 2 2 69 -0.14
## 3 3 3489 0.33
The results obtained above are similar to those of the previous algorithms.
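To make the subsampling idea behind cluster::clara concrete, the sketch below calls it directly on a larger synthetic data set (an assumption, not the hotel data). The arguments samples and sampsize control how many subsamples are drawn and how large each one is; rngR = TRUE makes the sampling follow set.seed.

```r
library(cluster)  # provides clara()

set.seed(42)
# Assumption: 5000 synthetic points in two well-separated groups
big <- rbind(matrix(rnorm(5000, mean = 0), ncol = 2),
             matrix(rnorm(5000, mean = 5), ncol = 2))

# PAM is run on 20 subsamples of 200 points each; the medoids that give the
# lowest average dissimilarity over the whole data set are kept
fit <- clara(big, k = 2, metric = "euclidean",
             samples = 20, sampsize = 200, rngR = TRUE)

print(fit$medoids)              # the selected medoids (actual observations)
print(table(fit$clustering))    # cluster sizes over all 5000 points
print(fit$silinfo$avg.width)    # average silhouette width on the best subsample
```

Larger values of samples and sampsize make the result less dependent on a lucky subsample, at the cost of more computation.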
PCA is often used as a pre-processing step before applying clustering algorithms. It is particularly useful when the original variables are highly correlated or when the number of variables is very high. By reducing the number of variables, PCA can make it easier to visualize the data and identify patterns that may be difficult to see in the original high-dimensional space.
pca <- PCA(df_PRT_scaled, ncp=2, scale.unit = FALSE, graph=FALSE)
get_eigenvalue(pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.478576524 31.62818694 31.62819
## Dim.2 2.948872899 26.81197400 58.44016
## Dim.3 1.300611460 11.82552177 70.26568
## Dim.4 1.020574814 9.27935056 79.54503
## Dim.5 0.785002084 7.13745766 86.68249
## Dim.6 0.638227362 5.80294100 92.48543
## Dim.7 0.383710116 3.48879928 95.97423
## Dim.8 0.273390857 2.48574585 98.45998
## Dim.9 0.133220338 1.21127643 99.67125
## Dim.10 0.033457133 0.30420157 99.97546
## Dim.11 0.002699537 0.02454494 100.00000
fviz_eig(pca, addlabels=TRUE)
The first principal component (Dim.1) has the largest eigenvalue of 3.4786, which accounts for 31.63% of the total variance. The second principal component (Dim.2) has an eigenvalue of 2.9489, explaining an additional 26.81% of the total variance.
The cumulative percentage of variance explained is useful for determining how many principal components to retain in the analysis. In this case, the first two principal components explain 58.44% of the total variance, the first three explain 70.27%, and the first four explain 79.55%. Typically, a threshold of 70-80% is used to decide how many principal components to keep.
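The threshold rule can be checked directly from the eigenvalues reported above. The sketch below recomputes the percentage and cumulative percentage of variance and finds the smallest number of components that reaches 70% and 80%; the variable names are introduced only for this example.

```r
# Eigenvalues as reported by get_eigenvalue(pca)
eig <- c(3.478576524, 2.948872899, 1.300611460, 1.020574814, 0.785002084,
         0.638227362, 0.383710116, 0.273390857, 0.133220338, 0.033457133,
         0.002699537)

# Share of total variance per component and its running total, in percent
var_pct <- 100 * eig / sum(eig)
cum_pct <- cumsum(var_pct)

# Smallest number of components whose cumulative variance reaches a threshold
n_for <- function(threshold) which(cum_pct >= threshold)[1]

print(round(cum_pct, 2))
print(c(at_70 = n_for(70), at_80 = n_for(80)))
```

Three components are needed to reach 70% and five to reach 80%, so the two components retained with ncp=2 fall short of either threshold and trade some information for easier visualization.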
pcavar <- get_pca_var(pca)
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)
By looking at both the contribution and the squared cosine (cos2) in the above visualization, we can identify the variables that are best represented by the first two principal components and are therefore most likely to drive the separation between different groups of observations. Most variables lie far from the origin and have a high contribution to their corresponding principal component; the exceptions are Age and AverageLeadTime, which lie close to the origin and have a low contribution. For Dim 2, the variables are considered well aligned given their high cos2 values, while for Dim 1 most of the variables are only moderately aligned.
fviz_contrib(pca, choice = "var", axes = 1, top = 10)
fviz_contrib(pca, choice = "var", axes = 2, top = 10)
The variables with a high contribution to the first and second principal components have the greatest influence on those components, and they tend to be strongly correlated with the other variables that the same component represents well. These variables are the most important to consider when interpreting subsequent analyses that rely on the principal components, such as the clustering on the reduced data.
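Contributions and cos2 values can be derived from the loadings of a PCA. The sketch below uses prcomp() on a small synthetic, standardized data set (an assumption) rather than the FactoMineR object above; under these conditions the quantities should correspond to what fviz_contrib and the cos2 coloring display, but that correspondence is stated here as an expectation, not verified against the package.

```r
set.seed(42)
n <- 200
z <- rnorm(n)
# Synthetic data: a and b share a common factor, c and d are independent noise
dat <- scale(data.frame(a = z + rnorm(n, sd = 0.3),
                        b = z + rnorm(n, sd = 0.3),
                        c = rnorm(n),
                        d = rnorm(n)))

p <- prcomp(dat, center = FALSE, scale. = FALSE)  # data are already standardized
loadings <- p$rotation

# Contribution of each variable to each component, in percent
# (squared loadings; every column sums to 100)
contrib <- 100 * loadings^2

# Variable coordinates = loadings scaled by the component standard deviations;
# for standardized data these are correlations between variables and components
coord <- sweep(loadings, 2, p$sdev, `*`)

# cos2: quality of representation of each variable on each component
# (every row sums to 1 over all components)
cos2 <- coord^2

print(round(contrib[, 1:2], 1))
print(round(cos2[, 1:2], 2))
```

In this toy example the correlated variables a and b dominate the first component, which mirrors how groups of correlated booking variables dominate a component in the real data.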
fviz_pca_ind(pca, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
K-means will be applied to the reduced data as shown below.
datapca <- data.frame(pca$ind$coord)
hopkins(datapca)
## [1] 0.9895887
get_clust_tendency(datapca, 100, graph=F, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.9791029
Both Hopkins statistics are close to 1. Under the convention where values near 0.5 indicate randomly distributed data and values near 1 indicate clustered data, this suggests that the reduced data has a strong cluster tendency.
The number of clusters is determined using the elbow method and silhouette analysis; the results obtained are similar to the previous ones (k=2 and k=3).
fviz_nbclust(datapca, FUNcluster = kmeans, method = "silhouette", k.max = 8)
fviz_nbclust(datapca, FUNcluster = kmeans, method = "wss", k.max = 8)
kmeanspca3 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=3, graph = TRUE)
fviz_cluster(kmeanspca3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeanspca3)
## cluster size ave.sil.width
## 1 1 3269 0.51
## 2 2 44 0.35
## 3 3 3326 0.54
kmeanspca2 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=2, graph = TRUE)
fviz_silhouette(kmeanspca2)
## cluster size ave.sil.width
## 1 1 3338 0.56
## 2 2 3301 0.44
kmeanspca4 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=4, graph = TRUE)
fviz_silhouette(kmeanspca4)
## cluster size ave.sil.width
## 1 1 3230 0.53
## 2 2 106 0.32
## 3 3 8 0.39
## 4 4 3295 0.55
round(calinhara(datapca,kmeanspca2$cluster),digits=2)
## [1] 3583.4
round(calinhara(datapca,kmeanspca3$cluster),digits=2)
## [1] 5743.88
round(calinhara(datapca,kmeanspca4$cluster),digits=2)
## [1] 6177.33
The average silhouette width for k = 3 increases from 0.33 (without PCA) to 0.52 when PCA is applied before k-means, which suggests that the clusters are more compact and better separated in the reduced space. Note that the two values are computed in different feature spaces, so the comparison is indicative rather than exact. PCA reduces the dimensionality of the data by transforming the original variables into a new set of uncorrelated variables known as principal components. The components are ordered by the variance they capture, so the first few usually account for most of the variability in the data; here, the two retained components explain 58.44% of the total variance.
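The effect described above can be reproduced on synthetic data in which only two of ten dimensions carry cluster structure. This is a sketch under that assumption, using prcomp() instead of FactoMineR; avg_sil is a hypothetical helper and the data do not come from the hotel data set.

```r
library(cluster)  # provides silhouette()

set.seed(42)
n <- 300
grp <- rep(1:3, each = n / 3)
# Two informative dimensions with three groups, plus eight pure-noise dimensions
signal <- cbind(c(0, 4, 0)[grp], c(0, 0, 4)[grp]) + matrix(rnorm(2 * n, sd = 0.5), ncol = 2)
noise <- matrix(rnorm(8 * n), ncol = 8)
x <- cbind(signal, noise)

# Average silhouette width of a k-means solution, measured in the given space
avg_sil <- function(data, k) {
  km <- kmeans(data, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(data))[, "sil_width"])
}

sil_full <- avg_sil(x, 3)               # clustering in all ten dimensions

scores <- prcomp(x)$x[, 1:2]            # keep the first two principal components
sil_pca <- avg_sil(scores, 3)           # clustering in the reduced space

print(round(c(full = sil_full, pca = sil_pca), 3))
```

The higher value in the reduced space partly reflects that noisy dimensions no longer enter the distances, which is the same reason the two silhouette values in the text are not strictly comparable.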
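Converting categorical attributes to binary indicator columns (also known as one-hot encoding) is a common approach to use with k-means clustering. In this section, the two categorical columns are instead mapped to integer codes with LabelEncoder.fit (label encoding), which imposes an arbitrary ordering on the categories.
In this dataset, the categorical variables are included so that algorithms designed for mixed data can also be applied.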
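The difference between the two encodings can be shown on a handful of channel values using base R only. The sketch below is illustrative: the five values are a made-up sample in the style of DistributionChannel, and model.matrix() is used as one possible way to build one-hot columns.

```r
# A made-up sample of distribution channels
channel <- factor(c("Corporate", "Direct", "Direct",
                    "Travel Agent/Operator", "Direct"))

# Label encoding: one integer per category, in alphabetical level order.
# The codes imply an ordering and spacing that the categories do not have.
label_encoded <- as.integer(channel)

# One-hot encoding: one 0/1 indicator column per category (no intercept)
one_hot <- model.matrix(~ channel - 1)
colnames(one_hot) <- levels(channel)

print(label_encoded)
print(one_hot)
```

With label encoding, a distance-based method treats the first and third categories as twice as far apart as the first and second; with one-hot encoding, all pairs of distinct categories are equally far apart.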
df_PRT1 <- subset(df, Nationality %in% c("PRT"))
dim(df_PRT1)
## [1] 6639 32
str(df_PRT1)
## 'data.frame': 6639 obs. of 32 variables:
## $ ID : num 1 31 32 34 55 84 123 127 135 155 ...
## $ Nationality : chr "PRT" "PRT" "PRT" "PRT" ...
## $ Age : int 51 39 71 43 38 28 41 44 45 60 ...
## $ DaysSinceCreation : num 150 1095 1095 1095 1095 ...
## $ NameHash : chr "0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E" "0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0" "0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB" "0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725" ...
## $ DocIDHash : chr "0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137" "0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7" "0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5" "0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19" ...
## $ AverageLeadTime : num 45 1 85 78 98 103 2 1 0 10 ...
## $ LodgingRevenue : chr "371.00" "1083.50" "180.60" "180.60" ...
## $ OtherRevenue : chr "105.30" "255.49" "54.00" "100.50" ...
## $ BookingsCanceled : num 1 0 0 0 0 0 0 0 0 0 ...
## $ BookingsNoShowed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BookingsCheckedIn : num 3 9 1 1 1 1 1 1 1 1 ...
## $ PersonsNights : num 8 18 6 6 3 12 9 8 4 4 ...
## $ RoomNights : num 5 14 3 3 3 4 3 2 1 2 ...
## $ DaysSinceLastStay : num 151 591 1098 1098 1098 ...
## $ DaysSinceFirstStay : num 1074 1117 1098 1098 1098 ...
## $ DistributionChannel : chr "Corporate" "Direct" "Direct" "Direct" ...
## $ MarketSegment : chr "Corporate" "Corporate" "Direct" "Direct" ...
## $ SRHighFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRLowFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRAccessibleRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRMediumFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRBathtub : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRShower : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRCrib : num 0 0 0 1 0 0 1 0 0 0 ...
## $ SRKingSizeBed : num 0 1 0 0 0 1 1 1 0 0 ...
## $ SRTwinBed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRNearElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRAwayFromElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRNoAlcoholInMiniBar: num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRQuietRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TotalRevenue : num 476 1339 235 281 158 ...
df_PRT_Mixed_cat <- data.frame(df_PRT1$Age, df_PRT1$DaysSinceCreation, df_PRT1$AverageLeadTime, df_PRT1$BookingsCanceled, df_PRT1$BookingsNoShowed, df_PRT1$BookingsCheckedIn, df_PRT1$PersonsNights, df_PRT1$RoomNights, df_PRT1$DaysSinceLastStay, df_PRT1$DaysSinceFirstStay, df_PRT1$TotalRevenue, df_PRT1$DistributionChannel, df_PRT1$MarketSegment,df_PRT1$SRHighFloor,df_PRT1$SRLowFloor,df_PRT1$SRAccessibleRoom,df_PRT1$SRMediumFloor, df_PRT1$SRBathtub, df_PRT1$SRShower, df_PRT1$SRCrib, df_PRT1$SRKingSizeBed, df_PRT1$SRTwinBed, df_PRT1$SRNearElevator, df_PRT1$SRQuietRoom)
head(df_PRT1, 10)
## ID Nationality Age DaysSinceCreation
## 1 1 PRT 51 150
## 31 31 PRT 39 1095
## 32 32 PRT 71 1095
## 34 34 PRT 43 1095
## 55 55 PRT 38 1095
## 84 84 PRT 28 1094
## 123 123 PRT 41 1094
## 127 127 PRT 44 1094
## 135 135 PRT 45 1094
## 155 155 PRT 60 1094
## NameHash
## 1 0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E
## 31 0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0
## 32 0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB
## 34 0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725
## 55 0x2A35DCD5D94A57E99396B16436898B1AAF7EFF78870E8B1A8E13E199342455A9
## 84 0x613CA8C9AFBBF94118FA89BCB0DE2E4BDAAEB8C64D55C582B63733BF8E346DB2
## 123 0x016F11562D0B2768F09DD51B23584B07BFE1EF495DD75F82E22C6147AE836E24
## 127 0x919DA3A2B8AC15A644DF27E3C805E24559105B240B156E6EE90993D87A872435
## 135 0x988D34CEA1793D704A3B545CDA0BC0BE066C3972E3733058C006098E9C018C34
## 155 0x051DB339C5BD13EBFE6D0557DAFBFE8674CEDEA2ABB5CE3E14212115002C9BF4
## DocIDHash
## 1 0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137
## 31 0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7
## 32 0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5
## 34 0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19
## 55 0x8B41F2FBEB60DA7D6896D59D0750771A46CCDC2E28C75EA9F9E659334AC92D66
## 84 0xBD9CD5C6C82AC95EF28B4F079CF49BDD3F94FC584739F11FD131799BDAFC19BD
## 123 0x428A0B30331DA53CBFFD23E890A0DB0243E959987453DC7E8FEDDE6946F49F84
## 127 0x9B1681DDF3FE35DCD8C667C18C4AB503CB06433DD875D9BAF7368EC6E7C8651F
## 135 0xDEF0030D19EC08947FB032692E0AEC4960B29269E4A6A94916E5DF48A339051B
## 155 0x14E8C31EE4B9734A52DD9E968C9DAAE804B47A4BBB06876AD0E951E6007CF861
## AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled
## 1 45 371.00 105.30 1
## 31 1 1083.50 255.49 0
## 32 85 180.60 54.00 0
## 34 78 180.60 100.50 0
## 55 98 142.80 15.30 0
## 84 103 770.10 957.90 0
## 123 2 404.50 473.30 0
## 127 1 376.00 224.00 0
## 135 0 165.00 24.00 0
## 155 10 249.10 173.50 0
## BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights
## 1 0 3 8 5
## 31 0 9 18 14
## 32 0 1 6 3
## 34 0 1 6 3
## 55 0 1 3 3
## 84 0 1 12 4
## 123 0 1 9 3
## 127 0 1 8 2
## 135 0 1 4 1
## 155 0 1 4 2
## DaysSinceLastStay DaysSinceFirstStay DistributionChannel
## 1 151 1074 Corporate
## 31 591 1117 Direct
## 32 1098 1098 Direct
## 34 1098 1098 Direct
## 55 1098 1098 Travel Agent/Operator
## 84 1098 1098 Direct
## 123 1097 1097 Direct
## 127 1096 1096 Direct
## 135 1095 1095 Direct
## 155 1096 1096 Travel Agent/Operator
## MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor
## 1 Corporate 0 0 0 0
## 31 Corporate 0 0 0 0
## 32 Direct 0 0 0 0
## 34 Direct 0 0 0 0
## 55 Travel Agent/Operator 0 0 0 0
## 84 Direct 0 0 0 0
## 123 Direct 0 0 0 0
## 127 Direct 0 0 0 0
## 135 Direct 0 0 0 0
## 155 Travel Agent/Operator 0 0 0 0
## SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator
## 1 0 0 0 0 0 0
## 31 0 0 0 1 0 0
## 32 0 0 0 0 0 0
## 34 0 0 1 0 0 0
## 55 0 0 0 0 0 0
## 84 0 0 0 1 0 0
## 123 0 0 1 1 0 0
## 127 0 0 0 1 0 0
## 135 0 0 0 0 0 0
## 155 0 0 0 0 0 0
## SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom TotalRevenue
## 1 0 0 0 476.30
## 31 0 0 0 1338.99
## 32 0 0 0 234.60
## 34 0 0 0 281.10
## 55 0 0 0 158.10
## 84 0 0 0 1728.00
## 123 0 0 0 877.80
## 127 0 0 0 600.00
## 135 0 0 0 189.00
## 155 0 0 0 422.60
str(df_PRT_Mixed_cat)
## 'data.frame': 6639 obs. of 24 variables:
## $ df_PRT1.Age : int 51 39 71 43 38 28 41 44 45 60 ...
## $ df_PRT1.DaysSinceCreation : num 150 1095 1095 1095 1095 ...
## $ df_PRT1.AverageLeadTime : num 45 1 85 78 98 103 2 1 0 10 ...
## $ df_PRT1.BookingsCanceled : num 1 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.BookingsNoShowed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.BookingsCheckedIn : num 3 9 1 1 1 1 1 1 1 1 ...
## $ df_PRT1.PersonsNights : num 8 18 6 6 3 12 9 8 4 4 ...
## $ df_PRT1.RoomNights : num 5 14 3 3 3 4 3 2 1 2 ...
## $ df_PRT1.DaysSinceLastStay : num 151 591 1098 1098 1098 ...
## $ df_PRT1.DaysSinceFirstStay : num 1074 1117 1098 1098 1098 ...
## $ df_PRT1.TotalRevenue : num 476 1339 235 281 158 ...
## $ df_PRT1.DistributionChannel: chr "Corporate" "Direct" "Direct" "Direct" ...
## $ df_PRT1.MarketSegment : chr "Corporate" "Corporate" "Direct" "Direct" ...
## $ df_PRT1.SRHighFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRLowFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRAccessibleRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRMediumFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRBathtub : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRShower : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRCrib : num 0 0 0 1 0 0 1 0 0 0 ...
## $ df_PRT1.SRKingSizeBed : num 0 1 0 0 0 1 1 1 0 0 ...
## $ df_PRT1.SRTwinBed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRNearElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRQuietRoom : num 0 0 0 0 0 0 0 0 0 0 ...
# LabelEncoder.fit() and transform() are assumed to come from the CatEncoders package
# Fit a label encoder on the original categorical labels
labs <- LabelEncoder.fit(df_PRT1$DistributionChannel)
# Convert the labels to integer codes
df_PRT1$DistributionChannel <- transform(labs, df_PRT1$DistributionChannel)
# Repeat both steps for MarketSegment
labs <- LabelEncoder.fit(df_PRT1$MarketSegment)
df_PRT1$MarketSegment <- transform(labs, df_PRT1$MarketSegment)
str(df_PRT1)
## 'data.frame': 6639 obs. of 32 variables:
## $ ID : num 1 31 32 34 55 84 123 127 135 155 ...
## $ Nationality : chr "PRT" "PRT" "PRT" "PRT" ...
## $ Age : int 51 39 71 43 38 28 41 44 45 60 ...
## $ DaysSinceCreation : num 150 1095 1095 1095 1095 ...
## $ NameHash : chr "0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E" "0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0" "0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB" "0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725" ...
## $ DocIDHash : chr "0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137" "0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7" "0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5" "0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19" ...
## $ AverageLeadTime : num 45 1 85 78 98 103 2 1 0 10 ...
## $ LodgingRevenue : chr "371.00" "1083.50" "180.60" "180.60" ...
## $ OtherRevenue : chr "105.30" "255.49" "54.00" "100.50" ...
## $ BookingsCanceled : num 1 0 0 0 0 0 0 0 0 0 ...
## $ BookingsNoShowed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BookingsCheckedIn : num 3 9 1 1 1 1 1 1 1 1 ...
## $ PersonsNights : num 8 18 6 6 3 12 9 8 4 4 ...
## $ RoomNights : num 5 14 3 3 3 4 3 2 1 2 ...
## $ DaysSinceLastStay : num 151 591 1098 1098 1098 ...
## $ DaysSinceFirstStay : num 1074 1117 1098 1098 1098 ...
## $ DistributionChannel : int 1 2 2 2 4 2 2 2 2 4 ...
## $ MarketSegment : int 3 3 4 4 7 4 4 4 4 7 ...
## $ SRHighFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRLowFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRAccessibleRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRMediumFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRBathtub : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRShower : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRCrib : num 0 0 0 1 0 0 1 0 0 0 ...
## $ SRKingSizeBed : num 0 1 0 0 0 1 1 1 0 0 ...
## $ SRTwinBed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRNearElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRAwayFromElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRNoAlcoholInMiniBar: num 0 0 0 0 0 0 0 0 0 0 ...
## $ SRQuietRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TotalRevenue : num 476 1339 235 281 158 ...
df_PRT_Cat <- data.frame(df_PRT1$Age, df_PRT1$DaysSinceCreation, df_PRT1$AverageLeadTime, df_PRT1$BookingsCanceled, df_PRT1$BookingsNoShowed, df_PRT1$BookingsCheckedIn, df_PRT1$PersonsNights, df_PRT1$RoomNights, df_PRT1$DaysSinceLastStay, df_PRT1$DaysSinceFirstStay, df_PRT1$TotalRevenue, df_PRT1$DistributionChannel, df_PRT1$MarketSegment,df_PRT1$SRLowFloor,df_PRT1$SRAccessibleRoom,df_PRT1$SRMediumFloor, df_PRT1$SRBathtub, df_PRT1$SRShower, df_PRT1$SRCrib, df_PRT1$SRKingSizeBed, df_PRT1$SRTwinBed, df_PRT1$SRNearElevator, df_PRT1$SRQuietRoom)
head(df_PRT_Cat, 10)
## df_PRT1.Age df_PRT1.DaysSinceCreation df_PRT1.AverageLeadTime
## 1 51 150 45
## 2 39 1095 1
## 3 71 1095 85
## 4 43 1095 78
## 5 38 1095 98
## 6 28 1094 103
## 7 41 1094 2
## 8 44 1094 1
## 9 45 1094 0
## 10 60 1094 10
## df_PRT1.BookingsCanceled df_PRT1.BookingsNoShowed df_PRT1.BookingsCheckedIn
## 1 1 0 3
## 2 0 0 9
## 3 0 0 1
## 4 0 0 1
## 5 0 0 1
## 6 0 0 1
## 7 0 0 1
## 8 0 0 1
## 9 0 0 1
## 10 0 0 1
## df_PRT1.PersonsNights df_PRT1.RoomNights df_PRT1.DaysSinceLastStay
## 1 8 5 151
## 2 18 14 591
## 3 6 3 1098
## 4 6 3 1098
## 5 3 3 1098
## 6 12 4 1098
## 7 9 3 1097
## 8 8 2 1096
## 9 4 1 1095
## 10 4 2 1096
## df_PRT1.DaysSinceFirstStay df_PRT1.TotalRevenue df_PRT1.DistributionChannel
## 1 1074 476.30 1
## 2 1117 1338.99 2
## 3 1098 234.60 2
## 4 1098 281.10 2
## 5 1098 158.10 4
## 6 1098 1728.00 2
## 7 1097 877.80 2
## 8 1096 600.00 2
## 9 1095 189.00 2
## 10 1096 422.60 4
## df_PRT1.MarketSegment df_PRT1.SRLowFloor df_PRT1.SRAccessibleRoom
## 1 3 0 0
## 2 3 0 0
## 3 4 0 0
## 4 4 0 0
## 5 7 0 0
## 6 4 0 0
## 7 4 0 0
## 8 4 0 0
## 9 4 0 0
## 10 7 0 0
## df_PRT1.SRMediumFloor df_PRT1.SRBathtub df_PRT1.SRShower df_PRT1.SRCrib
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 1
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 1
## 8 0 0 0 0
## 9 0 0 0 0
## 10 0 0 0 0
## df_PRT1.SRKingSizeBed df_PRT1.SRTwinBed df_PRT1.SRNearElevator
## 1 0 0 0
## 2 1 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 1 0 0
## 7 1 0 0
## 8 1 0 0
## 9 0 0 0
## 10 0 0 0
## df_PRT1.SRQuietRoom
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
str(df_PRT_Cat)
## 'data.frame': 6639 obs. of 23 variables:
## $ df_PRT1.Age : int 51 39 71 43 38 28 41 44 45 60 ...
## $ df_PRT1.DaysSinceCreation : num 150 1095 1095 1095 1095 ...
## $ df_PRT1.AverageLeadTime : num 45 1 85 78 98 103 2 1 0 10 ...
## $ df_PRT1.BookingsCanceled : num 1 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.BookingsNoShowed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.BookingsCheckedIn : num 3 9 1 1 1 1 1 1 1 1 ...
## $ df_PRT1.PersonsNights : num 8 18 6 6 3 12 9 8 4 4 ...
## $ df_PRT1.RoomNights : num 5 14 3 3 3 4 3 2 1 2 ...
## $ df_PRT1.DaysSinceLastStay : num 151 591 1098 1098 1098 ...
## $ df_PRT1.DaysSinceFirstStay : num 1074 1117 1098 1098 1098 ...
## $ df_PRT1.TotalRevenue : num 476 1339 235 281 158 ...
## $ df_PRT1.DistributionChannel: int 1 2 2 2 4 2 2 2 2 4 ...
## $ df_PRT1.MarketSegment : int 3 3 4 4 7 4 4 4 4 7 ...
## $ df_PRT1.SRLowFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRAccessibleRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRMediumFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRBathtub : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRShower : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRCrib : num 0 0 0 1 0 0 1 0 0 0 ...
## $ df_PRT1.SRKingSizeBed : num 0 1 0 0 0 1 1 1 0 0 ...
## $ df_PRT1.SRTwinBed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRNearElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT1.SRQuietRoom : num 0 0 0 0 0 0 0 0 0 0 ...
dim(df_PRT_Cat)
## [1] 6639 23
df_PRT_Cat_scaled<- scale(df_PRT_Cat)
head(df_PRT_Cat_scaled)
## df_PRT1.Age df_PRT1.DaysSinceCreation df_PRT1.AverageLeadTime
## [1,] 0.4088669 -1.263162 -0.06144993
## [2,] -0.4658037 1.659981 -0.66333119
## [3,] 1.8666513 1.659981 0.48571486
## [4,] -0.1742468 1.659981 0.38996102
## [5,] -0.5386930 1.659981 0.66354341
## [6,] -1.2675852 1.656887 0.73193901
## df_PRT1.BookingsCanceled df_PRT1.BookingsNoShowed
## [1,] 4.66405102 -0.05799844
## [2,] -0.07785323 -0.05799844
## [3,] -0.07785323 -0.05799844
## [4,] -0.07785323 -0.05799844
## [5,] -0.07785323 -0.05799844
## [6,] -0.07785323 -0.05799844
## df_PRT1.BookingsCheckedIn df_PRT1.PersonsNights df_PRT1.RoomNights
## [1,] 1.0314323 0.7894798 0.7450725
## [2,] 4.4907295 2.9915454 3.4657733
## [3,] -0.1216668 0.3490667 0.1404723
## [4,] -0.1216668 0.3490667 0.1404723
## [5,] -0.1216668 -0.3115530 0.1404723
## [6,] -0.1216668 1.6703061 0.4427724
## df_PRT1.DaysSinceLastStay df_PRT1.DaysSinceFirstStay df_PRT1.TotalRevenue
## [1,] -1.2076818 1.586808 0.2919791
## [2,] 0.1441954 1.719782 2.2157474
## [3,] 1.7019267 1.661026 -0.2470033
## [4,] 1.7019267 1.661026 -0.1433100
## [5,] 1.7019267 1.661026 -0.4175956
## [6,] 1.7019267 1.661026 3.0832260
## df_PRT1.DistributionChannel df_PRT1.MarketSegment df_PRT1.SRLowFloor
## [1,] -1.9056150 -1.4831535 -0.04073544
## [2,] -1.0338877 -1.4831535 -0.04073544
## [3,] -1.0338877 -0.7622621 -0.04073544
## [4,] -1.0338877 -0.7622621 -0.04073544
## [5,] 0.7095668 1.4004123 -0.04073544
## [6,] -1.0338877 -0.7622621 -0.04073544
## df_PRT1.SRAccessibleRoom df_PRT1.SRMediumFloor df_PRT1.SRBathtub
## [1,] -0.01735787 -0.03883679 -0.0475831
## [2,] -0.01735787 -0.03883679 -0.0475831
## [3,] -0.01735787 -0.03883679 -0.0475831
## [4,] -0.01735787 -0.03883679 -0.0475831
## [5,] -0.01735787 -0.03883679 -0.0475831
## [6,] -0.01735787 -0.03883679 -0.0475831
## df_PRT1.SRShower df_PRT1.SRCrib df_PRT1.SRKingSizeBed df_PRT1.SRTwinBed
## [1,] -0.03684103 -0.1191848 -0.6558488 -0.3125756
## [2,] -0.03684103 -0.1191848 1.5245119 -0.3125756
## [3,] -0.03684103 -0.1191848 -0.6558488 -0.3125756
## [4,] -0.03684103 8.3890700 -0.6558488 -0.3125756
## [5,] -0.03684103 -0.1191848 -0.6558488 -0.3125756
## [6,] -0.03684103 -0.1191848 1.5245119 -0.3125756
## df_PRT1.SRNearElevator df_PRT1.SRQuietRoom
## [1,] -0.01227294 -0.216006
## [2,] -0.01227294 -0.216006
## [3,] -0.01227294 -0.216006
## [4,] -0.01227294 -0.216006
## [5,] -0.01227294 -0.216006
## [6,] -0.01227294 -0.216006
hopkins(df_PRT_Cat_scaled)
## [1] 1
fviz_nbclust(df_PRT_Cat_scaled, FUNcluster = kmeans, method = "silhouette", k.max = 8, nboot = 100)
fviz_nbclust(df_PRT_Cat_scaled, FUNcluster = kmeans, method = "wss", k.max = 8, nboot = 100)
kmeanscat2 <- eclust(df_PRT_Cat_scaled, "kmeans", hc_metric="euclidean", k=2)
kmeanscat3 <- eclust(df_PRT_Cat_scaled, "kmeans", hc_metric="euclidean", k=3)
kmeanscat4 <- eclust(df_PRT_Cat_scaled, "kmeans", hc_metric="euclidean", k=4)
fviz_silhouette(kmeanscat2)
## cluster size ave.sil.width
## 1 1 4251 0.19
## 2 2 2388 0.13
fviz_silhouette(kmeanscat3)
## cluster size ave.sil.width
## 1 1 3761 0.17
## 2 2 2217 0.20
## 3 3 661 -0.06
fviz_silhouette(kmeanscat4)
## cluster size ave.sil.width
## 1 1 1952 0.13
## 2 2 1592 0.20
## 3 3 647 -0.11
## 4 4 2448 0.24
round(calinhara(df_PRT_Cat_scaled,kmeanscat2$cluster),digits=2)
## [1] 569.19
round(calinhara(df_PRT_Cat_scaled,kmeanscat3$cluster),digits=2)
## [1] 460.48
round(calinhara(df_PRT_Cat_scaled,kmeanscat4$cluster),digits=2)
## [1] 494.16
The resulting average silhouette widths (at most 0.24 for any cluster, and negative for one cluster when k = 3 and k = 4) are the lowest among all the algorithms applied, on both the mixed and the continuous-only data.
Applying k-means to binary data can give biased or misleading results, particularly when the data are imbalanced: k-means minimizes the sum of squared Euclidean distances between data points and cluster centroids, an objective designed for continuous variables and poorly suited to 0/1 attributes.
Therefore, although k-means can be run on categorical attributes that have been converted to numeric or binary codes, the potential drawbacks should be kept in mind and the results evaluated carefully.
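One way to see the problem with imbalanced binary attributes is to look at what standardization does to a rare flag. In the scaled data above, the single guest who requested a crib has an SRCrib value of about 8.39, while everyone else sits at about -0.12. The following base-R sketch reproduces that effect on simulated data; the variable names and the 1% request rate are assumptions chosen for illustration, not values from df_PRT.

```r
# Minimal sketch: a rare 0/1 flag can dominate Euclidean distance after scale().
# All values are simulated; `rare_flag` and `age` are hypothetical variables.
set.seed(42)
n <- 1000
rare_flag <- c(rep(1, 10), rep(0, n - 10))   # request made by 1% of guests
age       <- rnorm(n, mean = 45, sd = 15)    # an ordinary continuous variable

toy_scaled <- scale(data.frame(rare_flag, age))

# z-score taken by the rare "1" level versus the common "0" level
z_one  <- as.numeric(toy_scaled[1, "rare_flag"])
z_zero <- as.numeric(toy_scaled[n, "rare_flag"])

# Squared-distance contribution of one flag mismatch ...
flag_gap_sq <- (z_one - z_zero)^2
# ... versus two guests whose ages differ by one standard deviation
age_gap_sq <- 1^2

print(c(z_one = z_one, z_zero = z_zero, flag_gap_sq = flag_gap_sq))
```

In this sketch a single mismatch on the rare flag contributes roughly a hundred times as much to the squared distance as a one-standard-deviation difference in age, which helps explain why k-means can form small, poorly separated clusters around rare requests.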
K-prototype is an extension of the k-means clustering algorithm for datasets that contain both categorical and continuous variables, so that observations can be clustered on both types of variables at once. For customer segmentation, K-prototype can group customers by demographic variables such as age, gender, and income together with behavioral variables such as purchase history and preferences.
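The idea can be sketched with the dissimilarity usually associated with k-prototypes: the squared Euclidean distance on the numeric variables plus a weight, lambda, times the number of categorical mismatches. The function and the example values below are hypothetical and only illustrate that formula; a package implementation may differ in detail.

```r
# Sketch of a k-prototypes style dissimilarity between one observation and one
# cluster prototype. `kproto_distance` and all values are hypothetical.
kproto_distance <- function(x_num, x_cat, proto_num, proto_cat, lambda) {
  numeric_part     <- sum((x_num - proto_num)^2)   # squared Euclidean distance
  categorical_part <- sum(x_cat != proto_cat)      # simple matching: count mismatches
  numeric_part + lambda * categorical_part
}

guest_num <- c(Age = 0.4, TotalRevenue = 0.3)      # standardized, made-up values
guest_cat <- c(DistributionChannel = "Corporate", SRCrib = "0")
proto_num <- c(Age = -0.1, TotalRevenue = 0.1)
proto_cat <- c(DistributionChannel = "Direct", SRCrib = "0")

# lambda = 0 ignores the categorical variables; lambda = 2 charges 2 per mismatch
d0 <- kproto_distance(guest_num, guest_cat, proto_num, proto_cat, lambda = 0)
d2 <- kproto_distance(guest_num, guest_cat, proto_num, proto_cat, lambda = 2)
print(c(lambda_0 = d0, lambda_2 = d2))
```

With one mismatching category, raising lambda from 0 to 2 adds exactly 2 to the distance, so the categorical part can be made as influential as needed relative to the numeric part.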
df_PRT_Mixed <- data.frame(df_PRT$Age, df_PRT$DaysSinceCreation, df_PRT$AverageLeadTime, df_PRT$BookingsCanceled, df_PRT$BookingsNoShowed, df_PRT$BookingsCheckedIn, df_PRT$PersonsNights, df_PRT$RoomNights, df_PRT$DaysSinceLastStay, df_PRT$DaysSinceFirstStay, df_PRT$TotalRevenue, df_PRT$DistributionChannel, df_PRT$MarketSegment,df_PRT$SRHighFloor,df_PRT$SRLowFloor,df_PRT$SRAccessibleRoom,df_PRT$SRMediumFloor, df_PRT$SRBathtub, df_PRT$SRShower, df_PRT$SRCrib, df_PRT$SRKingSizeBed, df_PRT$SRTwinBed, df_PRT$SRNearElevator, df_PRT$SRQuietRoom)
head(df_PRT, 10)
## ID Nationality Age DaysSinceCreation
## 1 1 PRT 51 150
## 31 31 PRT 39 1095
## 32 32 PRT 71 1095
## 34 34 PRT 43 1095
## 55 55 PRT 38 1095
## 84 84 PRT 28 1094
## 123 123 PRT 41 1094
## 127 127 PRT 44 1094
## 135 135 PRT 45 1094
## 155 155 PRT 60 1094
## NameHash
## 1 0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E
## 31 0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0
## 32 0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB
## 34 0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725
## 55 0x2A35DCD5D94A57E99396B16436898B1AAF7EFF78870E8B1A8E13E199342455A9
## 84 0x613CA8C9AFBBF94118FA89BCB0DE2E4BDAAEB8C64D55C582B63733BF8E346DB2
## 123 0x016F11562D0B2768F09DD51B23584B07BFE1EF495DD75F82E22C6147AE836E24
## 127 0x919DA3A2B8AC15A644DF27E3C805E24559105B240B156E6EE90993D87A872435
## 135 0x988D34CEA1793D704A3B545CDA0BC0BE066C3972E3733058C006098E9C018C34
## 155 0x051DB339C5BD13EBFE6D0557DAFBFE8674CEDEA2ABB5CE3E14212115002C9BF4
## DocIDHash
## 1 0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137
## 31 0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7
## 32 0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5
## 34 0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19
## 55 0x8B41F2FBEB60DA7D6896D59D0750771A46CCDC2E28C75EA9F9E659334AC92D66
## 84 0xBD9CD5C6C82AC95EF28B4F079CF49BDD3F94FC584739F11FD131799BDAFC19BD
## 123 0x428A0B30331DA53CBFFD23E890A0DB0243E959987453DC7E8FEDDE6946F49F84
## 127 0x9B1681DDF3FE35DCD8C667C18C4AB503CB06433DD875D9BAF7368EC6E7C8651F
## 135 0xDEF0030D19EC08947FB032692E0AEC4960B29269E4A6A94916E5DF48A339051B
## 155 0x14E8C31EE4B9734A52DD9E968C9DAAE804B47A4BBB06876AD0E951E6007CF861
## AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled
## 1 45 371.00 105.30 1
## 31 1 1083.50 255.49 0
## 32 85 180.60 54.00 0
## 34 78 180.60 100.50 0
## 55 98 142.80 15.30 0
## 84 103 770.10 957.90 0
## 123 2 404.50 473.30 0
## 127 1 376.00 224.00 0
## 135 0 165.00 24.00 0
## 155 10 249.10 173.50 0
## BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights
## 1 0 3 8 5
## 31 0 9 18 14
## 32 0 1 6 3
## 34 0 1 6 3
## 55 0 1 3 3
## 84 0 1 12 4
## 123 0 1 9 3
## 127 0 1 8 2
## 135 0 1 4 1
## 155 0 1 4 2
## DaysSinceLastStay DaysSinceFirstStay DistributionChannel
## 1 151 1074 Corporate
## 31 591 1117 Direct
## 32 1098 1098 Direct
## 34 1098 1098 Direct
## 55 1098 1098 Travel Agent/Operator
## 84 1098 1098 Direct
## 123 1097 1097 Direct
## 127 1096 1096 Direct
## 135 1095 1095 Direct
## 155 1096 1096 Travel Agent/Operator
## MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor
## 1 Corporate 0 0 0 0
## 31 Corporate 0 0 0 0
## 32 Direct 0 0 0 0
## 34 Direct 0 0 0 0
## 55 Travel Agent/Operator 0 0 0 0
## 84 Direct 0 0 0 0
## 123 Direct 0 0 0 0
## 127 Direct 0 0 0 0
## 135 Direct 0 0 0 0
## 155 Travel Agent/Operator 0 0 0 0
## SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator
## 1 0 0 0 0 0 0
## 31 0 0 0 1 0 0
## 32 0 0 0 0 0 0
## 34 0 0 1 0 0 0
## 55 0 0 0 0 0 0
## 84 0 0 0 1 0 0
## 123 0 0 1 1 0 0
## 127 0 0 0 1 0 0
## 135 0 0 0 0 0 0
## 155 0 0 0 0 0 0
## SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom TotalRevenue
## 1 0 0 0 476.30
## 31 0 0 0 1338.99
## 32 0 0 0 234.60
## 34 0 0 0 281.10
## 55 0 0 0 158.10
## 84 0 0 0 1728.00
## 123 0 0 0 877.80
## 127 0 0 0 600.00
## 135 0 0 0 189.00
## 155 0 0 0 422.60
str(df_PRT_Mixed)
## 'data.frame': 6639 obs. of 24 variables:
## $ df_PRT.Age : int 51 39 71 43 38 28 41 44 45 60 ...
## $ df_PRT.DaysSinceCreation : num 150 1095 1095 1095 1095 ...
## $ df_PRT.AverageLeadTime : num 45 1 85 78 98 103 2 1 0 10 ...
## $ df_PRT.BookingsCanceled : num 1 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.BookingsNoShowed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.BookingsCheckedIn : num 3 9 1 1 1 1 1 1 1 1 ...
## $ df_PRT.PersonsNights : num 8 18 6 6 3 12 9 8 4 4 ...
## $ df_PRT.RoomNights : num 5 14 3 3 3 4 3 2 1 2 ...
## $ df_PRT.DaysSinceLastStay : num 151 591 1098 1098 1098 ...
## $ df_PRT.DaysSinceFirstStay : num 1074 1117 1098 1098 1098 ...
## $ df_PRT.TotalRevenue : num 476 1339 235 281 158 ...
## $ df_PRT.DistributionChannel: chr "Corporate" "Direct" "Direct" "Direct" ...
## $ df_PRT.MarketSegment : chr "Corporate" "Corporate" "Direct" "Direct" ...
## $ df_PRT.SRHighFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.SRLowFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.SRAccessibleRoom : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.SRMediumFloor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.SRBathtub : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.SRShower : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.SRCrib : num 0 0 0 1 0 0 1 0 0 0 ...
## $ df_PRT.SRKingSizeBed : num 0 1 0 0 0 1 1 1 0 0 ...
## $ df_PRT.SRTwinBed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.SRNearElevator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ df_PRT.SRQuietRoom : num 0 0 0 0 0 0 0 0 0 0 ...
# Scale the continuous variables
df_PRT$Age <- scale(df_PRT$Age)
df_PRT$DaysSinceCreation <- scale(df_PRT$DaysSinceCreation)
df_PRT$AverageLeadTime <- scale(df_PRT$AverageLeadTime)
df_PRT$BookingsCanceled <- scale(df_PRT$BookingsCanceled)
df_PRT$BookingsNoShowed <- scale(df_PRT$BookingsNoShowed)
df_PRT$BookingsCheckedIn <- scale(df_PRT$BookingsCheckedIn)
df_PRT$PersonsNights <- scale(df_PRT$PersonsNights)
df_PRT$RoomNights <- scale(df_PRT$RoomNights)
df_PRT$DaysSinceLastStay <- scale(df_PRT$DaysSinceLastStay)
df_PRT$DaysSinceFirstStay <- scale(df_PRT$DaysSinceFirstStay)
df_PRT$TotalRevenue <- scale(df_PRT$TotalRevenue)
# Convert categorical variables to factors using the "as.factor" function
df_PRT$DistributionChannel <- as.factor(df_PRT$DistributionChannel)
df_PRT$MarketSegment <- as.factor(df_PRT$MarketSegment)
df_PRT$SRHighFloor <- factor(df_PRT$SRHighFloor)
df_PRT$SRLowFloor <- as.factor(df_PRT$SRLowFloor)
df_PRT$SRAccessibleRoom <- as.factor(df_PRT$SRAccessibleRoom)
df_PRT$SRMediumFloor <- as.factor(df_PRT$SRMediumFloor)
df_PRT$SRBathtub <- as.factor(df_PRT$SRBathtub)
df_PRT$SRShower <- as.factor(df_PRT$SRShower)
df_PRT$SRCrib <- as.factor(df_PRT$SRCrib)
df_PRT$SRKingSizeBed <- as.factor(df_PRT$SRKingSizeBed)
df_PRT$SRTwinBed <- as.factor(df_PRT$SRTwinBed)
df_PRT$SRNearElevator <- as.factor(df_PRT$SRNearElevator)
df_PRT$SRQuietRoom <- as.factor(df_PRT$SRQuietRoom)
df_PRT_Mixed_K <- data.frame(df_PRT$Age, df_PRT$DaysSinceCreation, df_PRT$AverageLeadTime, df_PRT$BookingsCanceled, df_PRT$BookingsNoShowed, df_PRT$BookingsCheckedIn, df_PRT$PersonsNights, df_PRT$RoomNights, df_PRT$DaysSinceLastStay, df_PRT$DaysSinceFirstStay, df_PRT$TotalRevenue, df_PRT$DistributionChannel, df_PRT$MarketSegment,df_PRT$SRHighFloor,df_PRT$SRLowFloor,df_PRT$SRAccessibleRoom,df_PRT$SRMediumFloor, df_PRT$SRBathtub, df_PRT$SRShower, df_PRT$SRCrib, df_PRT$SRKingSizeBed, df_PRT$SRTwinBed, df_PRT$SRNearElevator, df_PRT$SRQuietRoom)
head(df_PRT, 10)
## ID Nationality Age DaysSinceCreation
## 1 1 PRT 0.40886692 -1.263162
## 31 31 PRT -0.46580373 1.659981
## 32 32 PRT 1.86665135 1.659981
## 34 34 PRT -0.17424685 1.659981
## 55 55 PRT -0.53869296 1.659981
## 84 84 PRT -1.26758517 1.656887
## 123 123 PRT -0.32002529 1.656887
## 127 127 PRT -0.10135763 1.656887
## 135 135 PRT -0.02846841 1.656887
## 155 155 PRT 1.06486991 1.656887
## NameHash
## 1 0x8E0A7AF39B633D5EA25C3B7EF4DFC5464B36DB7AF375716EB065E29697CC071E
## 31 0xC24A218407111A34D6B511DC4352DB901B4F8F488C8C6A6AE7DEC75BEEB323F0
## 32 0xAB7AC547231833569C05758686EA1F31A57136BA9241ED9D47A3FCD7B926E1FB
## 34 0x1BDE2BD375189D4BD0AEADE4AC20F08B7A8671A73D4F4027BA49DDEF76EE5725
## 55 0x2A35DCD5D94A57E99396B16436898B1AAF7EFF78870E8B1A8E13E199342455A9
## 84 0x613CA8C9AFBBF94118FA89BCB0DE2E4BDAAEB8C64D55C582B63733BF8E346DB2
## 123 0x016F11562D0B2768F09DD51B23584B07BFE1EF495DD75F82E22C6147AE836E24
## 127 0x919DA3A2B8AC15A644DF27E3C805E24559105B240B156E6EE90993D87A872435
## 135 0x988D34CEA1793D704A3B545CDA0BC0BE066C3972E3733058C006098E9C018C34
## 155 0x051DB339C5BD13EBFE6D0557DAFBFE8674CEDEA2ABB5CE3E14212115002C9BF4
## DocIDHash
## 1 0x71568459B729F7A7ABBED6C781A84CA4274D571003ACC7A4A791C3350D924137
## 31 0x3333A7C8D1C059E0FA19CC75EE1E445546E3627478389734C8EE878779B276D7
## 32 0x4FC217F50D72BEF965A1AB5EF68F6A20EC8B6E524021D45A7B727F1832CC39B5
## 34 0x90B616FB4FF85B2B8F4992DEEC1490EEF0846A5611EA623199E060E721687E19
## 55 0x8B41F2FBEB60DA7D6896D59D0750771A46CCDC2E28C75EA9F9E659334AC92D66
## 84 0xBD9CD5C6C82AC95EF28B4F079CF49BDD3F94FC584739F11FD131799BDAFC19BD
## 123 0x428A0B30331DA53CBFFD23E890A0DB0243E959987453DC7E8FEDDE6946F49F84
## 127 0x9B1681DDF3FE35DCD8C667C18C4AB503CB06433DD875D9BAF7368EC6E7C8651F
## 135 0xDEF0030D19EC08947FB032692E0AEC4960B29269E4A6A94916E5DF48A339051B
## 155 0x14E8C31EE4B9734A52DD9E968C9DAAE804B47A4BBB06876AD0E951E6007CF861
## AverageLeadTime LodgingRevenue OtherRevenue BookingsCanceled
## 1 -0.06144993 371.00 105.30 4.66405102
## 31 -0.66333119 1083.50 255.49 -0.07785323
## 32 0.48571486 180.60 54.00 -0.07785323
## 34 0.38996102 180.60 100.50 -0.07785323
## 55 0.66354341 142.80 15.30 -0.07785323
## 84 0.73193901 770.10 957.90 -0.07785323
## 123 -0.64965207 404.50 473.30 -0.07785323
## 127 -0.66333119 376.00 224.00 -0.07785323
## 135 -0.67701031 165.00 24.00 -0.07785323
## 155 -0.54021911 249.10 173.50 -0.07785323
## BookingsNoShowed BookingsCheckedIn PersonsNights RoomNights
## 1 -0.05799844 1.0314323 0.78947982 0.7450725
## 31 -0.05799844 4.4907295 2.99154543 3.4657733
## 32 -0.05799844 -0.1216668 0.34906670 0.1404723
## 34 -0.05799844 -0.1216668 0.34906670 0.1404723
## 55 -0.05799844 -0.1216668 -0.31155298 0.1404723
## 84 -0.05799844 -0.1216668 1.67030607 0.4427724
## 123 -0.05799844 -0.1216668 1.00968638 0.1404723
## 127 -0.05799844 -0.1216668 0.78947982 -0.1618278
## 135 -0.05799844 -0.1216668 -0.09134642 -0.4641279
## 155 -0.05799844 -0.1216668 -0.09134642 -0.1618278
## DaysSinceLastStay DaysSinceFirstStay DistributionChannel
## 1 -1.2076818 1.586808 Corporate
## 31 0.1441954 1.719782 Direct
## 32 1.7019267 1.661026 Direct
## 34 1.7019267 1.661026 Direct
## 55 1.7019267 1.661026 Travel Agent/Operator
## 84 1.7019267 1.661026 Direct
## 123 1.6988542 1.657934 Direct
## 127 1.6957818 1.654841 Direct
## 135 1.6927093 1.651749 Direct
## 155 1.6957818 1.654841 Travel Agent/Operator
## MarketSegment SRHighFloor SRLowFloor SRAccessibleRoom SRMediumFloor
## 1 Corporate 0 0 0 0
## 31 Corporate 0 0 0 0
## 32 Direct 0 0 0 0
## 34 Direct 0 0 0 0
## 55 Travel Agent/Operator 0 0 0 0
## 84 Direct 0 0 0 0
## 123 Direct 0 0 0 0
## 127 Direct 0 0 0 0
## 135 Direct 0 0 0 0
## 155 Travel Agent/Operator 0 0 0 0
## SRBathtub SRShower SRCrib SRKingSizeBed SRTwinBed SRNearElevator
## 1 0 0 0 0 0 0
## 31 0 0 0 1 0 0
## 32 0 0 0 0 0 0
## 34 0 0 1 0 0 0
## 55 0 0 0 0 0 0
## 84 0 0 0 1 0 0
## 123 0 0 1 1 0 0
## 127 0 0 0 1 0 0
## 135 0 0 0 0 0 0
## 155 0 0 0 0 0 0
## SRAwayFromElevator SRNoAlcoholInMiniBar SRQuietRoom TotalRevenue
## 1 0 0 0 0.2919791
## 31 0 0 0 2.2157474
## 32 0 0 0 -0.2470033
## 34 0 0 0 -0.1433100
## 55 0 0 0 -0.4175956
## 84 0 0 0 3.0832260
## 123 0 0 0 1.1873100
## 127 0 0 0 0.5678258
## 135 0 0 0 -0.3486897
## 155 0 0 0 0.1722300
str(df_PRT_Mixed_K)
## 'data.frame': 6639 obs. of 24 variables:
## $ df_PRT.Age : num 0.409 -0.466 1.867 -0.174 -0.539 ...
## $ df_PRT.DaysSinceCreation : num -1.26 1.66 1.66 1.66 1.66 ...
## $ df_PRT.AverageLeadTime : num -0.0614 -0.6633 0.4857 0.39 0.6635 ...
## $ df_PRT.BookingsCanceled : num 4.6641 -0.0779 -0.0779 -0.0779 -0.0779 ...
## $ df_PRT.BookingsNoShowed : num -0.058 -0.058 -0.058 -0.058 -0.058 ...
## $ df_PRT.BookingsCheckedIn : num 1.031 4.491 -0.122 -0.122 -0.122 ...
## $ df_PRT.PersonsNights : num 0.789 2.992 0.349 0.349 -0.312 ...
## $ df_PRT.RoomNights : num 0.745 3.466 0.14 0.14 0.14 ...
## $ df_PRT.DaysSinceLastStay : num -1.208 0.144 1.702 1.702 1.702 ...
## $ df_PRT.DaysSinceFirstStay : num 1.59 1.72 1.66 1.66 1.66 ...
## $ df_PRT.TotalRevenue : num 0.292 2.216 -0.247 -0.143 -0.418 ...
## $ df_PRT.DistributionChannel: Factor w/ 4 levels "Corporate","Direct",..: 1 2 2 2 4 2 2 2 2 4 ...
## $ df_PRT.MarketSegment : Factor w/ 7 levels "Aviation","Complementary",..: 3 3 4 4 7 4 4 4 4 7 ...
## $ df_PRT.SRHighFloor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRLowFloor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRAccessibleRoom : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRMediumFloor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRBathtub : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRShower : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRCrib : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
## $ df_PRT.SRKingSizeBed : Factor w/ 2 levels "0","1": 1 2 1 1 1 2 2 2 1 1 ...
## $ df_PRT.SRTwinBed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRNearElevator : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRQuietRoom : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
In k-prototype clustering, lambda is a parameter that controls the trade-off between the numeric and the categorical parts of the dissimilarity. Specifically, lambda is the weight given to the categorical features. When lambda is 0, the categorical features are ignored and the algorithm reduces to k-means on the numeric variables; as lambda grows, categorical mismatches increasingly dominate the clustering. Lambda is a non-negative weight rather than a proportion, so it is not restricted to the interval between 0 and 1 (the value estimated for this dataset is 6.284189). By default, lambda is estimated from the variance of the numeric variables, but the estimate can be based on the standard deviation instead (num.method = 2 in lambdaest). The standard deviation is on the same scale as the data, so a few extreme values inflate it less than they inflate the variance. Because the continuous variables here were standardized, both measures equal 1 and the two settings give the same estimate.
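The printed output of lambdaest is consistent with a simple ratio: the average numeric spread (1 here) divided by the average categorical variation (0.1591295) gives 6.284189. The sketch below reproduces that kind of heuristic in base R on simulated data. The helper estimate_lambda and the toy columns are hypothetical, and the categorical measure (one minus the sum of squared level proportions) is an assumption inferred from the printed values, not a statement about the package internals.

```r
# Sketch of a variance-based lambda heuristic on simulated data.
# `estimate_lambda` is a hypothetical helper, not a package function.
estimate_lambda <- function(df, use_sd = FALSE) {
  is_num <- vapply(df, is.numeric, logical(1))
  # Spread of each numeric column: variance by default, standard deviation on request
  num_spread <- vapply(df[is_num], function(v) if (use_sd) sd(v) else var(v), numeric(1))
  # Variation of each categorical column: 1 - sum of squared level proportions
  cat_spread <- vapply(df[!is_num], function(v) {
    p <- table(v) / length(v)
    1 - sum(p^2)
  }, numeric(1))
  mean(num_spread) / mean(cat_spread)
}

set.seed(1)
toy <- data.frame(
  revenue = as.numeric(scale(rexp(200))),            # standardized numeric column
  channel = factor(sample(c("Direct", "Corporate"), 200, replace = TRUE)),
  crib    = factor(c(rep("1", 10), rep("0", 190)))   # rare special request
)

lambda_var <- estimate_lambda(toy)                   # variance-based estimate
lambda_sd  <- estimate_lambda(toy, use_sd = TRUE)    # standard-deviation-based estimate
print(c(variance = lambda_var, sd = lambda_sd))
```

Because the numeric column is standardized, its variance and standard deviation are both 1 and the two estimates coincide, mirroring the identical lambda reported for both settings in the output of this analysis. Rare binary flags have very low variation, which pulls the average categorical variation down and pushes lambda up.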
Essil <- sapply(2:10, function(i) {
  kpres <- kproto(df_PRT_Mixed_K, k = i)
  validation_kproto(method = "silhouette", object = kpres)
})
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
plot(2:10, Essil, type = "b", ylab = "Silhouette", xlab = "Number of clusters")
k = 2, 4, and 8 give the highest average silhouette widths, indicating better-defined clusters than the other values of k.
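For readers unfamiliar with the index, the average silhouette width can be computed by hand from any distance matrix: for each observation, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b). The base-R sketch below uses a hypothetical helper and toy one-dimensional data; it is not the validation routine used above, which works with the mixed-type dissimilarity.

```r
# Base-R sketch of the average silhouette width on toy data.
# `avg_silhouette` is a hypothetical helper for illustration only.
avg_silhouette <- function(d, cluster) {
  d <- as.matrix(d)
  n <- nrow(d)
  widths <- vapply(seq_len(n), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) return(0)                 # singleton clusters get width 0
    a <- mean(d[i, own])                     # mean distance within own cluster
    others <- setdiff(unique(cluster), cluster[i])
    # smallest mean distance to any other cluster
    b <- min(vapply(others, function(k) mean(d[i, cluster == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

x <- c(1, 2, 3, 11, 12, 13)                  # two obvious groups
d <- dist(x)
good_split <- avg_silhouette(d, c(1, 1, 1, 2, 2, 2))
poor_split <- avg_silhouette(d, c(1, 2, 1, 2, 1, 2))
print(c(good = good_split, poor = poor_split))
```

The natural split scores close to 1, while the interleaved assignment scores below 0, which is how a higher average width signals better-separated clusters when comparing values of k.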
# Perform k-prototype clustering with 2 clusters
kp2Var <- kproto(df_PRT_Mixed_K, k = 2, lambda = lambdaest(df_PRT_Mixed_K, num.method = 1))
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Numeric variances:
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 1
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 1 1
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1
## Average numeric variance: 1
##
## Heuristic for categorical variables: (method = 1)
## df_PRT.DistributionChannel df_PRT.MarketSegment
## 0.5107484357 0.7795660169
## df_PRT.SRHighFloor df_PRT.SRLowFloor
## 0.0682879136 0.0033082616
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor
## 0.0006023189 0.0030079643
## df_PRT.SRBathtub df_PRT.SRShower
## 0.0045085433 0.0027075763
## df_PRT.SRCrib df_PRT.SRKingSizeBed
## 0.0276238119 0.4206373758
## df_PRT.SRTwinBed df_PRT.SRNearElevator
## 0.1621899432 0.0003012048
## df_PRT.SRQuietRoom
## 0.0851944063
## Average categorical variation: 0.1591295
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp2Var$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Perform k-prototype clustering with 4 clusters
kp4Var <- kproto(df_PRT_Mixed_K, k = 4, lambda = lambdaest(df_PRT_Mixed_K, num.method = 1))
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Numeric variances:
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 1
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 1 1
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1
## Average numeric variance: 1
##
## Heuristic for categorical variables: (method = 1)
## df_PRT.DistributionChannel df_PRT.MarketSegment
## 0.5107484357 0.7795660169
## df_PRT.SRHighFloor df_PRT.SRLowFloor
## 0.0682879136 0.0033082616
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor
## 0.0006023189 0.0030079643
## df_PRT.SRBathtub df_PRT.SRShower
## 0.0045085433 0.0027075763
## df_PRT.SRCrib df_PRT.SRKingSizeBed
## 0.0276238119 0.4206373758
## df_PRT.SRTwinBed df_PRT.SRNearElevator
## 0.1621899432 0.0003012048
## df_PRT.SRQuietRoom
## 0.0851944063
## Average categorical variation: 0.1591295
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp4Var$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Perform k-prototype clustering with 8 clusters
kp8Var <- kproto(df_PRT_Mixed_K, k = 8, lambda = lambdaest(df_PRT_Mixed_K, num.method = 1))
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Numeric variances:
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 1
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 1 1
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1
## Average numeric variance: 1
##
## Heuristic for categorical variables: (method = 1)
## df_PRT.DistributionChannel df_PRT.MarketSegment
## 0.5107484357 0.7795660169
## df_PRT.SRHighFloor df_PRT.SRLowFloor
## 0.0682879136 0.0033082616
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor
## 0.0006023189 0.0030079643
## df_PRT.SRBathtub df_PRT.SRShower
## 0.0045085433 0.0027075763
## df_PRT.SRCrib df_PRT.SRKingSizeBed
## 0.0276238119 0.4206373758
## df_PRT.SRTwinBed df_PRT.SRNearElevator
## 0.1621899432 0.0003012048
## df_PRT.SRQuietRoom
## 0.0851944063
## Average categorical variation: 0.1591295
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp8Var$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
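# Total within-cluster distance for k = 1..10 (elbow plot), with lambda based on standard deviations
Es2 <- numeric(10)
for (i in 1:10) {
  kpres <- kproto(df_PRT_Mixed_K, k = i, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
  Es2[i] <- kpres$tot.withinss
}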
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Numeric standard deviations:
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 1
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 1 1
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1
## Average numeric standard deviation: 1
##
## Heuristic for categorical variables: (method = 1)
## df_PRT.DistributionChannel df_PRT.MarketSegment
## 0.5107484357 0.7795660169
## df_PRT.SRHighFloor df_PRT.SRLowFloor
## 0.0682879136 0.0033082616
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor
## 0.0006023189 0.0030079643
## df_PRT.SRBathtub df_PRT.SRShower
## 0.0045085433 0.0027075763
## df_PRT.SRCrib df_PRT.SRKingSizeBed
## 0.0276238119 0.4206373758
## df_PRT.SRTwinBed df_PRT.SRNearElevator
## 0.1621899432 0.0003012048
## df_PRT.SRQuietRoom
## 0.0851944063
## Average categorical variation: 0.1591295
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
plot(1:10, Es2, type = "b", ylab = "Total Within Sum Of Squares", xlab = "Number of clusters")
In the elbow plot above, the curve bends at k = 4, 5, 8 and 9, so these values are also candidates for the number of clusters when running the k-prototypes algorithm.
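Reading an elbow off a plot is subjective, so it can help to tabulate how much each extra cluster reduces the total within-cluster distance. The sketch below does this in base R. The helper `elbow_gain()` and the vector `tot_withinss` are hypothetical names, and the numbers are made up for illustration; they are not results from this analysis. In practice the vector of `tot.withinss` values collected in a loop over k would take their place.

```r
# Hypothetical total within-cluster distances for k = 1..10
# (made-up numbers for illustration, not output from this analysis).
tot_withinss <- c(1000, 700, 520, 400, 350, 330, 315, 290, 275, 270)

# Percentage reduction gained by moving from k - 1 to k clusters.
elbow_gain <- function(wss) {
  stopifnot(is.numeric(wss), length(wss) >= 2)
  drop_pct <- -diff(wss) / head(wss, -1) * 100
  data.frame(k = seq_along(wss)[-1], drop_pct = round(drop_pct, 1))
}

gains <- elbow_gain(tot_withinss)
print(gains)
```

A value of k after which `drop_pct` falls sharply is a plausible elbow. Because k-prototypes starts from random prototypes, the curve can change between runs, so this table is a rough aid rather than a decision rule.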
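# Average silhouette width for k = 2..10
# (the silhouette is undefined for k = 1, so Essil2[1] keeps its initial value of 0)
Essil2 <- numeric(10)
for (i in 2:10) {
  kpres <- kproto(df_PRT_Mixed_K, k = i, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
  Essil2[i] <- validation_kproto(method = "silhouette", object = kpres)
}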
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Numeric standard deviations:
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 1
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 1 1
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1
## Average numeric standard deviation: 1
##
## Heuristic for categorical variables: (method = 1)
## df_PRT.DistributionChannel df_PRT.MarketSegment
## 0.5107484357 0.7795660169
## df_PRT.SRHighFloor df_PRT.SRLowFloor
## 0.0682879136 0.0033082616
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor
## 0.0006023189 0.0030079643
## df_PRT.SRBathtub df_PRT.SRShower
## 0.0045085433 0.0027075763
## df_PRT.SRCrib df_PRT.SRKingSizeBed
## 0.0276238119 0.4206373758
## df_PRT.SRTwinBed df_PRT.SRNearElevator
## 0.1621899432 0.0003012048
## df_PRT.SRQuietRoom
## 0.0851944063
## Average categorical variation: 0.1591295
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
plot(1:10, Essil2, type = "b", ylab = "Silhouette", xlab = "Number of clusters")
k = 9 has the highest average silhouette width, followed by k = 4, which indicates that these two solutions give better-separated clusters than the other values of k.
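The ranking read off the silhouette plot can also be taken from the stored values. The sketch below is a minimal base R illustration. `rank_by_silhouette()` is a hypothetical helper and the `sil` values are made up, not the silhouette widths obtained in this analysis. It assumes the vector is indexed by k, with position 1 as a placeholder because no silhouette exists for a single cluster.

```r
# Hypothetical average silhouette widths indexed by k = 1..10
# (made-up values; position 1 is a placeholder because the
# silhouette is undefined for a single cluster).
sil <- c(0, 0.31, 0.28, 0.36, 0.30, 0.27, 0.29, 0.33, 0.38, 0.32)

rank_by_silhouette <- function(sil_by_k, top = 3) {
  sil_by_k[1] <- NA  # k = 1 is not a valid candidate
  # Order k from highest to lowest silhouette, dropping the NA entry.
  ord <- order(sil_by_k, decreasing = TRUE, na.last = NA)
  ranked <- data.frame(k = ord, silhouette = sil_by_k[ord])
  ranked[seq_len(min(top, nrow(ranked))), ]
}

best <- rank_by_silhouette(sil)
print(best)
```

Masking position 1 keeps the placeholder from being ranked as if it were a real silhouette width.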
# Perform k-prototypes clustering with 4 clusters
kp4Sd <- kproto(df_PRT_Mixed_K, k = 4, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Numeric standard deviations:
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 1
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 1 1
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1
## Average numeric standard deviation: 1
##
## Heuristic for categorical variables: (method = 1)
## df_PRT.DistributionChannel df_PRT.MarketSegment
## 0.5107484357 0.7795660169
## df_PRT.SRHighFloor df_PRT.SRLowFloor
## 0.0682879136 0.0033082616
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor
## 0.0006023189 0.0030079643
## df_PRT.SRBathtub df_PRT.SRShower
## 0.0045085433 0.0027075763
## df_PRT.SRCrib df_PRT.SRKingSizeBed
## 0.0276238119 0.4206373758
## df_PRT.SRTwinBed df_PRT.SRNearElevator
## 0.1621899432 0.0003012048
## df_PRT.SRQuietRoom
## 0.0851944063
## Average categorical variation: 0.1591295
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp4Sd$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Perform k-prototype clustering with 5 clusters
kp5Sd <- kproto(df_PRT_Mixed_K, k = 5, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Numeric standard deviations:
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 1
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 1 1
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1
## Average numeric standard deviation: 1
##
## Heuristic for categorical variables: (method = 1)
## df_PRT.DistributionChannel df_PRT.MarketSegment
## 0.5107484357 0.7795660169
## df_PRT.SRHighFloor df_PRT.SRLowFloor
## 0.0682879136 0.0033082616
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor
## 0.0006023189 0.0030079643
## df_PRT.SRBathtub df_PRT.SRShower
## 0.0045085433 0.0027075763
## df_PRT.SRCrib df_PRT.SRKingSizeBed
## 0.0276238119 0.4206373758
## df_PRT.SRTwinBed df_PRT.SRNearElevator
## 0.1621899432 0.0003012048
## df_PRT.SRQuietRoom
## 0.0851944063
## Average categorical variation: 0.1591295
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp5Sd$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Perform k-prototype clustering with 8 clusters
kp8Sd <- kproto(df_PRT_Mixed_K, k = 8, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Numeric standard deviations:
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 1
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 1 1
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1
## Average numeric standard deviation: 1
##
## Heuristic for categorical variables: (method = 1)
## df_PRT.DistributionChannel df_PRT.MarketSegment
## 0.5107484357 0.7795660169
## df_PRT.SRHighFloor df_PRT.SRLowFloor
## 0.0682879136 0.0033082616
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor
## 0.0006023189 0.0030079643
## df_PRT.SRBathtub df_PRT.SRShower
## 0.0045085433 0.0027075763
## df_PRT.SRCrib df_PRT.SRKingSizeBed
## 0.0276238119 0.4206373758
## df_PRT.SRTwinBed df_PRT.SRNearElevator
## 0.1621899432 0.0003012048
## df_PRT.SRQuietRoom
## 0.0851944063
## Average categorical variation: 0.1591295
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp8Sd$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Perform k-prototype clustering with 9 clusters
kp9Sd <- kproto(df_PRT_Mixed_K, k = 9, lambda = lambdaest(df_PRT_Mixed_K, num.method = 2))
## # NAs in variables:
## df_PRT.Age df_PRT.DaysSinceCreation
## 0 0
## df_PRT.AverageLeadTime df_PRT.BookingsCanceled
## 0 0
## df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 0 0
## df_PRT.PersonsNights df_PRT.RoomNights
## 0 0
## df_PRT.DaysSinceLastStay df_PRT.DaysSinceFirstStay
## 0 0
## df_PRT.TotalRevenue df_PRT.DistributionChannel
## 0 0
## df_PRT.MarketSegment df_PRT.SRHighFloor
## 0 0
## df_PRT.SRLowFloor df_PRT.SRAccessibleRoom
## 0 0
## df_PRT.SRMediumFloor df_PRT.SRBathtub
## 0 0
## df_PRT.SRShower df_PRT.SRCrib
## 0 0
## df_PRT.SRKingSizeBed df_PRT.SRTwinBed
## 0 0
## df_PRT.SRNearElevator df_PRT.SRQuietRoom
## 0 0
## 0 observation(s) with NAs.
##
## Numeric standard deviations:
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 1 1
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 1 1
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 1 1
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue
## 1 1
## Average numeric standard deviation: 1
##
## Heuristic for categorical variables: (method = 1)
## df_PRT.DistributionChannel df_PRT.MarketSegment
## 0.5107484357 0.7795660169
## df_PRT.SRHighFloor df_PRT.SRLowFloor
## 0.0682879136 0.0033082616
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor
## 0.0006023189 0.0030079643
## df_PRT.SRBathtub df_PRT.SRShower
## 0.0045085433 0.0027075763
## df_PRT.SRCrib df_PRT.SRKingSizeBed
## 0.0276238119 0.4206373758
## df_PRT.SRTwinBed df_PRT.SRNearElevator
## 0.1621899432 0.0003012048
## df_PRT.SRQuietRoom
## 0.0851944063
## Average categorical variation: 0.1591295
##
## Estimated lambda: 6.284189
##
## 0 observation(s) with NAs.
clusplot(df_PRT_Mixed_K, kp9Sd$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
kp2Var$centers
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 0.03022443 0.8618161 -0.03120781
## 2 -0.03102709 -0.8847031 0.03203658
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 0.01379821 0.002827812 0.03862871
## 2 -0.01416465 -0.002902910 -0.03965457
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 0.06239875 0.03781794 0.8385431
## 2 -0.06405586 -0.03882226 -0.8608121
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1 0.8619061 -0.03137890 Travel Agent/Operator
## 2 -0.8847955 0.03221222 Travel Agent/Operator
## df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1 Other 0 0
## 2 Other 0 0
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1 0 0 0 0
## 2 0 0 0 0
## df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1 0 0 0 0
## 2 0 0 0 0
## df_PRT.SRQuietRoom
## 1 0
## 2 0
kp4Var$centers
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 -0.1229554 -0.1467219 -0.03964562
## 2 0.2881788 0.5500291 0.52793903
## 3 -0.0852627 -0.3025952 -0.42557610
## 4 0.2144957 0.5737608 -0.34384775
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.05192347 -0.04566126 -0.04481980
## 2 -0.06983874 -0.05121937 -0.09503184
## 3 -0.03777374 -0.01985941 -0.04278337
## 4 7.71994487 5.82473545 8.46251520
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 0.1533192 0.03550415 -0.1348671
## 2 -0.1484140 -0.03409533 0.5729967
## 3 -0.2267508 -0.20414658 -0.3054045
## 4 5.4236046 7.62407903 -0.9848269
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1 -0.1469145 0.1356722 Travel Agent/Operator
## 2 0.5488496 -0.1386580 Travel Agent/Operator
## 3 -0.3033911 -0.1941631 Direct
## 4 0.6662951 4.8153723 Corporate
## df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1 Other 0 0
## 2 Groups 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 1 0 0
## df_PRT.SRQuietRoom
## 1 0
## 2 0
## 3 0
## 4 0
kp8Var$centers
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 -0.061254664 -1.0161557 -0.36456323
## 2 0.036566373 -0.2319464 -0.49615505
## 3 0.414825661 0.4075808 0.92227193
## 4 -0.184595319 0.7893740 -0.19848037
## 5 -0.003912145 -0.2663671 -0.09466020
## 6 -0.156279758 0.8217644 -0.45456177
## 7 -0.123844728 -1.1116870 0.03481985
## 8 -0.057610051 0.3013993 0.01400926
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.05977735 -0.05799844 -0.08210686
## 2 0.16060460 0.20133054 0.23287803
## 3 -0.07393753 -0.05799844 -0.11690586
## 4 -0.07785323 -0.05799844 -0.11098996
## 5 -0.07785323 -0.05799844 -0.07541951
## 6 -0.04464661 -0.05799844 -0.06433485
## 7 -0.07785323 -0.04636124 -0.10828461
## 8 0.21908683 0.12747924 0.31213396
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 -0.19991077 -0.222518252 -0.9776815
## 2 -0.31376770 -0.026748269 -0.3880886
## 3 -0.11080315 -0.003313678 0.4536805
## 4 0.04866146 -0.063238125 0.8275117
## 5 0.02052323 -0.037351256 -0.2479936
## 6 -0.14932798 -0.205436881 0.8086575
## 7 0.01854390 -0.023541552 -1.0566356
## 8 0.52620782 0.421221160 0.2382958
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1 -1.0157199 -0.13876781 Direct
## 2 -0.2304069 -0.14671198 Corporate
## 3 0.4074129 -0.09671927 Travel Agent/Operator
## 4 0.7875424 -0.06736439 Travel Agent/Operator
## 5 -0.2670119 0.10021160 Travel Agent/Operator
## 6 0.8190367 -0.18854062 Direct
## 7 -1.1122927 0.08686339 Travel Agent/Operator
## 8 0.3045435 0.40461341 Travel Agent/Operator
## df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1 Direct 0 0
## 2 Corporate 0 0
## 3 Groups 0 0
## 4 Other 0 0
## 5 Other 0 0
## 6 Direct 0 0
## 7 Other 0 0
## 8 Other 0 0
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 1 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 1 0 0
## df_PRT.SRQuietRoom
## 1 0
## 2 0
## 3 0
## 4 0
## 5 1
## 6 0
## 7 0
## 8 0
kp4Sd$centers
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 0.04301233 0.1072099 0.2245371
## 2 -0.05636145 -1.0641813 -0.4022000
## 3 -0.12799166 0.8798871 -0.4603846
## 4 0.21449566 0.5737608 -0.3438478
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.06585393 -0.04692611 -0.082673114
## 2 -0.01811270 -0.01062496 -0.014074487
## 3 -0.03902494 -0.04568231 -0.003642332
## 4 7.71994487 5.82473545 8.462515197
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 0.06092893 0.00764666 0.1316017
## 2 -0.25849533 -0.16849265 -1.0503773
## 3 -0.18488349 -0.16615960 0.8252034
## 4 5.42360455 7.62407903 -0.9848269
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1 0.1068864 0.03334654 Travel Agent/Operator
## 2 -1.0652561 -0.14345495 Direct
## 3 0.8784614 -0.18368616 Direct
## 4 0.6662951 4.81537235 Corporate
## df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1 Other 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 1 0 0
## df_PRT.SRQuietRoom
## 1 0
## 2 0
## 3 0
## 4 0
kp5Sd$centers
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 0.393312571 0.08496927 1.1975527
## 2 -0.136699029 -0.95366843 -0.1491111
## 3 0.004884595 0.95935902 -0.1361254
## 4 -0.104120577 -0.14525267 -0.4105885
## 5 0.214495665 0.57376085 -0.3438478
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.07785323 -0.03496904 -0.10014964
## 2 -0.04763397 -0.03243739 -0.05399954
## 3 -0.07312079 -0.05199402 -0.06671622
## 4 -0.01793747 -0.03627847 -0.02453451
## 5 7.71994487 5.82473545 8.46251520
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 0.202824260 0.12658677 0.1150763
## 2 -0.090644751 -0.04399013 -0.9178029
## 3 -0.003329724 -0.05382035 0.9643873
## 4 -0.167663493 -0.17128601 -0.1686804
## 5 5.423604552 7.62407903 -0.9848269
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1 0.08088115 0.04049591 Travel Agent/Operator
## 2 -0.95467703 0.04384743 Travel Agent/Operator
## 3 0.95819619 -0.05296965 Travel Agent/Operator
## 4 -0.14264283 -0.14165073 Direct
## 5 0.66629505 4.81537235 Corporate
## df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1 Travel Agent/Operator 0 0
## 2 Other 0 0
## 3 Other 0 0
## 4 Direct 0 0
## 5 Corporate 0 0
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 1 0 0
## df_PRT.SRQuietRoom
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
kp8Sd$centers
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 0.19848212 0.5534327 -0.34342633
## 2 -0.19909924 0.8237105 -0.46773856
## 3 -0.01048388 0.8586176 0.04882682
## 4 0.22977485 -0.4741717 1.01501688
## 5 -0.03251781 -1.0706102 -0.39948478
## 6 0.24175668 1.0227551 0.22150706
## 7 -0.19860636 -0.3978247 -0.26394827
## 8 -0.03669783 -0.8537098 -0.07299062
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 7.789397007 5.95843394 8.52657626
## 2 -0.039874587 -0.03046328 -0.03986801
## 3 -0.049728052 -0.05799844 0.02195764
## 4 -0.072421491 -0.04421509 -0.11110003
## 5 -0.043059379 -0.04538538 -0.06425359
## 6 -0.077853225 -0.04695912 -0.10315374
## 7 -0.061431046 -0.05799844 -0.08722358
## 8 -0.007254155 0.01664712 0.01567502
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 5.48888801 7.71858598 -0.9998726
## 2 -0.19011183 -0.19434059 0.7868890
## 3 0.43814195 0.19282798 0.8152028
## 4 -0.03888025 0.08922329 -0.4247424
## 5 -0.26123324 -0.22805493 -1.0368066
## 6 -0.05417393 -0.04423580 1.0530115
## 7 -0.07971646 -0.11497779 -0.3664215
## 8 -0.01293540 -0.01442834 -0.8630454
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1 0.6480450 4.84865759 Corporate
## 2 0.8209011 -0.19188243 Direct
## 3 0.8632603 0.33662092 Travel Agent/Operator
## 4 -0.4732629 0.13244813 Travel Agent/Operator
## 5 -1.0708035 -0.15932247 Direct
## 6 1.0191316 -0.19298918 Travel Agent/Operator
## 7 -0.3966113 -0.04374251 Travel Agent/Operator
## 8 -0.8582784 -0.03990061 Travel Agent/Operator
## df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1 Corporate 0 0
## 2 Direct 0 0
## 3 Other 0 0
## 4 Groups 0 0
## 5 Direct 0 0
## 6 Travel Agent/Operator 0 0
## 7 Other 0 0
## 8 Other 0 0
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1 0 1 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 1 0 0
## df_PRT.SRQuietRoom
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
kp9Sd$centers
## df_PRT.Age df_PRT.DaysSinceCreation df_PRT.AverageLeadTime
## 1 -0.10912617 0.42212414 -0.46653696
## 2 0.01725412 -1.12752595 -0.32208154
## 3 -0.15643681 -1.36123997 -0.13662319
## 4 0.06387621 1.14397679 -0.11932906
## 5 0.31143337 0.31005134 -0.29120330
## 6 -0.01980585 0.29187986 -0.01972745
## 7 -0.26730490 -0.32929994 -0.24850353
## 8 0.40044176 0.09174735 1.26780366
## 9 0.03956153 0.22965152 -0.39795627
## df_PRT.BookingsCanceled df_PRT.BookingsNoShowed df_PRT.BookingsCheckedIn
## 1 -0.04709018 -0.04684658 -0.06182107
## 2 -0.03051308 -0.03797704 -0.02381647
## 3 -0.06176079 -0.04438661 -0.08775212
## 4 -0.07339655 -0.04668936 -0.10486884
## 5 1.95439145 0.92427623 3.27291568
## 6 -0.07382784 -0.04778379 -0.06782941
## 7 -0.06564755 -0.05799844 -0.09421206
## 8 -0.07785323 -0.03247805 -0.11310721
## 9 14.78011342 13.57924830 13.94614191
## df_PRT.PersonsNights df_PRT.RoomNights df_PRT.DaysSinceLastStay
## 1 -0.2229806 -0.22206365 0.4049520
## 2 -0.1935721 -0.16886970 -1.1285292
## 3 -0.1376795 -0.08796258 -1.3106887
## 4 -0.0845167 -0.09704917 1.1740229
## 5 2.8994182 3.49662028 -0.8107716
## 6 0.1264300 -0.01273069 0.2980056
## 7 -0.1726840 -0.08946249 -0.2955166
## 8 0.1585168 0.10777391 0.1350856
## 9 8.0122550 10.82174222 -1.2382014
## df_PRT.DaysSinceFirstStay df_PRT.TotalRevenue df_PRT.DistributionChannel
## 1 0.41963546 -0.188831485 Direct
## 2 -1.12665119 -0.160706366 Direct
## 3 -1.36268294 -0.001398094 Travel Agent/Operator
## 4 1.14258807 -0.126101412 Travel Agent/Operator
## 5 0.31821808 3.652950764 Corporate
## 6 0.29227489 0.048872250 Travel Agent/Operator
## 7 -0.32661316 -0.063176002 Travel Agent/Operator
## 8 0.08952496 -0.011145755 Travel Agent/Operator
## 9 0.47332754 5.318602039 Corporate
## df_PRT.MarketSegment df_PRT.SRHighFloor df_PRT.SRLowFloor
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Other 0 0
## 4 Other 0 0
## 5 Corporate 0 0
## 6 Other 0 0
## 7 Other 0 0
## 8 Travel Agent/Operator 0 0
## 9 Corporate 0 0
## df_PRT.SRAccessibleRoom df_PRT.SRMediumFloor df_PRT.SRBathtub df_PRT.SRShower
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## df_PRT.SRCrib df_PRT.SRKingSizeBed df_PRT.SRTwinBed df_PRT.SRNearElevator
## 1 0 0 0 0
## 2 0 1 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 1 0 0
## 6 0 1 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 1 0 0
## df_PRT.SRQuietRoom
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
validation_kproto(method = "silhouette", object = kp2Var)
## [1] 0.241107
validation_kproto(method = "silhouette", object = kp4Var)
## [1] 0.2033768
validation_kproto(method = "silhouette", object = kp8Var)
## [1] 0.1317051
validation_kproto(method = "silhouette", object = kp4Sd)
## [1] 0.1807716
validation_kproto(method = "silhouette", object = kp5Sd)
## [1] 0.2272975
validation_kproto(method = "silhouette", object = kp8Sd)
## [1] 0.2254407
validation_kproto(method = "silhouette", object = kp9Sd)
## [1] 0.1840354
The silhouette statistic is low (between about 0.13 and 0.24) for every k-prototypes solution, whether lambda was estimated from the variances or from the standard deviations. This is unexpected, given that k-prototypes is designed for mixed data, and it suggests that none of these clusterings is well separated.
This may be due to poor separation of the clusters: the algorithm may have failed to identify distinct groups in the data, so the silhouette width is low for many data points. This can happen when the categories of the categorical variables overlap heavily across clusters, or when the continuous variables do not separate the clusters sufficiently.
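To make the statistic concrete, here is a minimal base R sketch of the silhouette width, s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the mean distance to the nearest other cluster. The toy data and the helper `sil_width` are hypothetical and only illustrate the definition; the helper assumes every cluster has at least two members.

```r
# Toy one-dimensional data with two clearly separated groups (hypothetical values)
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
d  <- as.matrix(dist(x))

# Silhouette width of observation i, given a distance matrix and cluster labels
sil_width <- function(i, d, cl) {
  own <- cl == cl[i]
  own[i] <- FALSE                       # exclude the point itself
  a <- mean(d[i, own])                  # mean distance within its own cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
}

s <- sapply(seq_along(x), sil_width, d = d, cl = cl)
round(s, 3)
mean(s)  # average silhouette width of the toy clustering
```

On this well-separated toy example the average width is above 0.8, which puts the values of roughly 0.13 to 0.24 reported above into perspective.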
Gower distance can be used for clustering when the data contain a mixture of variable types. Other distance metrics, such as Euclidean or Manhattan distance, may not be appropriate for mixed data because they are defined for numeric variables and implicitly treat all variables as sharing a common scale. Gower distance handles each variable according to its type and rescales each contribution to the same range, which makes it a better choice for mixed data.
For quantitative (interval) data, Gower’s distance calculates the range-normalized Manhattan distance between two observations. For ordinal data, the variable is first ranked and then the Manhattan distance is used with a special adjustment for ties. For nominal data, the variables are first converted into k binary columns and then the Dice coefficient is used to calculate the distance.
One advantage of Gower’s distance is that it does not assume a specific distribution for the data, making it more robust to outliers and skewness. However, it can be computationally expensive for large datasets, because the number of pairwise dissimilarities grows quadratically with the number of observations.
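The calculation described above can be traced on a small example. The sketch below is base R only and uses hypothetical data; `gower_pair` is an illustrative helper, not a function from the cluster package. It assumes equal weights for all variables and treats a nominal variable as contributing 0 when the two categories match and 1 otherwise.

```r
# Toy mixed data: one numeric and one nominal variable (hypothetical values)
toy <- data.frame(
  lead_time = c(10, 30, 50),
  channel   = factor(c("Direct", "Direct", "Corporate"))
)

# Gower dissimilarity between rows i and j: average of per-variable contributions
gower_pair <- function(i, j, df) {
  parts <- sapply(df, function(col) {
    if (is.numeric(col)) {
      abs(col[i] - col[j]) / diff(range(col))  # range-normalized Manhattan distance
    } else {
      as.numeric(col[i] != col[j])             # 0 if same category, 1 otherwise
    }
  })
  mean(parts)
}

gower_pair(1, 2, toy)  # (20/40 + 0) / 2 = 0.25
gower_pair(1, 3, toy)  # (40/40 + 1) / 2 = 1
gower_pair(2, 3, toy)  # (20/40 + 1) / 2 = 0.75
```

Because every contribution lies between 0 and 1, no single variable can dominate the dissimilarity purely through its measurement scale.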
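# logratio = 3 asks daisy() to log-transform column 3 (df_PRT.AverageLeadTime); that column is
# already standardized and contains negative values, which most likely causes the NaN warning below
gower_dist <- daisy(df_PRT_Mixed_K,
                    metric = "gower",
                    type = list(logratio = 3))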
## Warning in daisy(df_PRT_Mixed_K, metric = "gower", type = list(logratio = 3)):
## NaNs produced
# Average silhouette width of PAM for k = 2..10 (entry 1 stays 0, as k = 1 has no silhouette)
Esgower <- numeric(10)
for (i in 2:10) {
  pames <- pam(gower_dist, diss = TRUE, k = i)
  Esgower[i] <- pames$silinfo$avg.width
}
plot(1:10, Esgower, type = "b", ylab = "Silhouette", xlab = "Number of Clusters")
Based on the silhouette plot, k = 2, k = 5 and k = 10 are selected for clustering.
pamgower <- pam(gower_dist, diss = TRUE, k=2)
fviz_silhouette(pamgower)
## cluster size ave.sil.width
## 1 1 1941 0.32
## 2 2 4698 0.28
pamgower <- pam(gower_dist, diss = TRUE, k=5)
fviz_silhouette(pamgower)
## cluster size ave.sil.width
## 1 1 735 0.36
## 2 2 1403 0.32
## 3 3 1456 0.36
## 4 4 1394 0.30
## 5 5 1651 0.23
pamgower <- pam(gower_dist, diss = TRUE, k=10)
fviz_silhouette(pamgower)
## cluster size ave.sil.width
## 1 1 693 0.33
## 2 2 640 0.42
## 3 3 755 0.36
## 4 4 619 0.40
## 5 5 545 0.43
## 6 6 734 0.31
## 7 7 715 0.26
## 8 8 603 0.35
## 9 9 732 0.38
## 10 10 603 0.29
The results are slightly better than those of the k-prototypes algorithm and similar to those obtained in Part 1, where clustering was applied to the continuous variables only. Overall, however, the average silhouette widths remain too low for this solution to be adopted.
Factor Analysis of Mixed Data (FAMD) is a method for analyzing datasets that contain both continuous and categorical variables. It is an extension of Principal Component Analysis (PCA) designed for mixed data: it constructs a set of principal components that capture the most important features of the data while taking the different variable types into account.
FAMD can reveal patterns and relationships that might not be apparent when the continuous and categorical variables are analyzed separately. It can also be used for data reduction, to identify the variables that matter most for a particular analysis.
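The idea can be sketched in base R without FactoMineR. The example below assumes the usual FAMD coding, which is stated here as an assumption rather than taken from the package source: numeric variables are standardized, each categorical level becomes an indicator column divided by the square root of its proportion, and an unscaled PCA is run on the combined matrix. The toy data are hypothetical.

```r
# Toy mixed data (hypothetical values, not taken from the hotel dataset)
toy <- data.frame(
  age     = c(23, 35, 47, 51, 62, 30),
  revenue = c(120, 80, 300, 150, 90, 210),
  channel = factor(c("Direct", "Agent", "Agent", "Corporate", "Direct", "Agent")),
  king    = factor(c("0", "1", "0", "1", "1", "0"))
)
n <- nrow(toy)
is_num <- sapply(toy, is.numeric)

# Numeric block: centre and scale with the population sd (divide by n)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))
Z_num <- sapply(toy[is_num], function(x) (x - mean(x)) / pop_sd(x))

# Categorical block: one indicator per level, divided by sqrt(proportion), then centred
Z_cat <- do.call(cbind, lapply(toy[!is_num], function(f) {
  ind <- sapply(levels(f), function(l) as.numeric(f == l))
  p <- colMeans(ind)
  sweep(sweep(ind, 2, sqrt(p), "/"), 2, sqrt(p), "-")
}))

# PCA on the combined matrix: eigenvalues are squared singular values divided by n
Z <- cbind(Z_num, Z_cat)
eig <- svd(Z)$d^2 / n
round(eig, 3)
sum(eig)  # total inertia: 2 numeric + (3 - 1) + (2 - 1) = 5
```

Under this coding, the total inertia equals the number of numeric variables plus the number of levels minus one for each factor. For the data used here that gives 11 + 3 + 6 + 11 = 31, which agrees with the 31 dimensions listed in the eigenvalue table.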
str(df_PRT_Mixed_K)
## 'data.frame': 6639 obs. of 24 variables:
## $ df_PRT.Age : num 0.409 -0.466 1.867 -0.174 -0.539 ...
## $ df_PRT.DaysSinceCreation : num -1.26 1.66 1.66 1.66 1.66 ...
## $ df_PRT.AverageLeadTime : num -0.0614 -0.6633 0.4857 0.39 0.6635 ...
## $ df_PRT.BookingsCanceled : num 4.6641 -0.0779 -0.0779 -0.0779 -0.0779 ...
## $ df_PRT.BookingsNoShowed : num -0.058 -0.058 -0.058 -0.058 -0.058 ...
## $ df_PRT.BookingsCheckedIn : num 1.031 4.491 -0.122 -0.122 -0.122 ...
## $ df_PRT.PersonsNights : num 0.789 2.992 0.349 0.349 -0.312 ...
## $ df_PRT.RoomNights : num 0.745 3.466 0.14 0.14 0.14 ...
## $ df_PRT.DaysSinceLastStay : num -1.208 0.144 1.702 1.702 1.702 ...
## $ df_PRT.DaysSinceFirstStay : num 1.59 1.72 1.66 1.66 1.66 ...
## $ df_PRT.TotalRevenue : num 0.292 2.216 -0.247 -0.143 -0.418 ...
## $ df_PRT.DistributionChannel: Factor w/ 4 levels "Corporate","Direct",..: 1 2 2 2 4 2 2 2 2 4 ...
## $ df_PRT.MarketSegment : Factor w/ 7 levels "Aviation","Complementary",..: 3 3 4 4 7 4 4 4 4 7 ...
## $ df_PRT.SRHighFloor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRLowFloor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRAccessibleRoom : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRMediumFloor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRBathtub : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRShower : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRCrib : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
## $ df_PRT.SRKingSizeBed : Factor w/ 2 levels "0","1": 1 2 1 1 1 2 2 2 1 1 ...
## $ df_PRT.SRTwinBed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRNearElevator : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ df_PRT.SRQuietRoom : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
famd <- FAMD(df_PRT_Mixed_K, ncp=50, graph=FALSE)
get_eigenvalue(famd)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.584092743 11.561589494 11.56159
## Dim.2 3.067958598 9.896640639 21.45823
## Dim.3 2.174948290 7.015962227 28.47419
## Dim.4 1.896785604 6.118663237 34.59286
## Dim.5 1.720483490 5.549946743 40.14280
## Dim.6 1.658684137 5.350593989 45.49340
## Dim.7 1.128821432 3.641359458 49.13476
## Dim.8 1.117848927 3.605964282 52.74072
## Dim.9 1.065240947 3.436261120 56.17698
## Dim.10 1.026606299 3.311633223 59.48861
## Dim.11 1.015636043 3.276245299 62.76486
## Dim.12 1.011654193 3.263400624 66.02826
## Dim.13 0.991563147 3.198590796 69.22685
## Dim.14 0.987386367 3.185117314 72.41197
## Dim.15 0.978225394 3.155565788 75.56753
## Dim.16 0.969621045 3.127809824 78.69534
## Dim.17 0.939634021 3.031077486 81.72642
## Dim.18 0.930299737 3.000966893 84.72739
## Dim.19 0.896980746 2.893486278 87.62087
## Dim.20 0.771394522 2.488369426 90.10924
## Dim.21 0.697714913 2.250693267 92.35994
## Dim.22 0.620553495 2.001785468 94.36172
## Dim.23 0.510055836 1.645341407 96.00706
## Dim.24 0.360962624 1.164395561 97.17146
## Dim.25 0.292455201 0.943403875 98.11486
## Dim.26 0.253859205 0.818900661 98.93376
## Dim.27 0.132623884 0.427818982 99.36158
## Dim.28 0.126186683 0.407053816 99.76864
## Dim.29 0.037931497 0.122359667 99.89100
## Dim.30 0.031102913 0.100331978 99.99133
## Dim.31 0.002688065 0.008671177 100.00000
fviz_eig(famd, ncp=24, addlabels=TRUE)
famd <- FAMD(df_PRT_Mixed_K, ncp=12, graph=FALSE)
famdvar <- get_famd_var(famd)
fviz_famd_var(famd, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)
## Warning: ggrepel: 13 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
fviz_contrib(famd, choice = "var", axes = 1, top = 24)
fviz_contrib(famd, choice = "var", axes = 2, top = 20)
fviz_contrib(famd, choice = "var", axes = 3, top = 20)
fviz_contrib(famd, choice = "var", axes = 4, top = 20)
fviz_contrib(famd, choice = "var", axes = 5, top = 20)
fviz_contrib(famd, choice = "var", axes = 6, top = 20)
datafamd <- data.frame(famd$ind$coord)
hopkins(datafamd)
## [1] 1
get_clust_tendency(datafamd, 2, graph=F, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.9872454
fviz_nbclust(datafamd, FUNcluster = kmeans, method = "silhouette", k.max = 8, nboot = 100)
fviz_nbclust(datafamd, FUNcluster = kmeans, method = "wss", k.max = 8, nboot = 100)
kmeansfamd4 <- eclust(datafamd, "kmeans", hc_metric="euclidean", k=4, graph = T)
fviz_cluster(kmeansfamd4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeansfamd4)
## cluster size ave.sil.width
## 1 1 2189 0.19
## 2 2 1748 0.10
## 3 3 48 0.11
## 4 4 2654 0.20
kmeansfamd7 <- eclust(datafamd, "kmeans", hc_metric="euclidean", k=7, graph = T)
fviz_cluster(kmeansfamd7, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeansfamd7)
## cluster size ave.sil.width
## 1 1 1212 0.23
## 2 2 1470 0.22
## 3 3 432 0.10
## 4 4 1238 0.24
## 5 5 1517 0.17
## 6 6 734 0.26
## 7 7 36 0.13
The silhouette widths obtained with FAMD differ from those obtained with PCA and remain low (average cluster widths between 0.10 and 0.26), so this clustering is also considered unsatisfactory.
Exploring and comparing various approaches of clustering and dimension reduction can help to identify the best technique for the given dataset, as different techniques have different assumptions and requirements. Additionally, it can help to understand the underlying structure of the dataset and provide insights into the relationships between the variables. By applying different techniques, it is possible to identify the best combination of techniques to extract useful information from the data. This can help to make informed decisions in areas such as marketing in the tourism sector.
7.1 Nuno Antonio, Ana de Almeida, Luís Nunes. A hotel’s customers personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015–2018). Data in Brief, Volume 33, 2020, 106583, ISSN 2352-3409. https://doi.org/10.1016/j.dib.2020.106583 (https://www.sciencedirect.com/science/article/pii/S2352340920314645)
7.2 van Leeuwen, Rik and Koole, Ger. Data-Driven Market Segmentation in Hospitality Using Unsupervised Machine Learning. Available at SSRN: https://ssrn.com/abstract=4091700 or http://dx.doi.org/10.2139/ssrn.4091700
7.3 Charrad, Malika & Ghazzali, Nadia & Boiteau, Véronique & Niknafs, Azam. (2013). An examination of indices for determining the number of clusters: NbClust Package.
7.4 G. Szepannek. clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal Vol. 10/2, 2018, ISSN 2073-4859, https://journal.r-project.org/archive/2018/RJ-2018-048/RJ-2018-048.pdf
7.5 Francois Husson, Julie Josse, Sebastien Le, and Jeremy Mazet. (2022). Package ‘FactoMineR’: Multivariate Exploratory Data Analysis and Data Mining. Version 2.7. Retrieved from https://cran.r-project.org/package=FactoMineR.