In this research we apply Principal Component Analysis (PCA) to reduce the dimensionality of the data while retaining as much of the original information as possible. PCA condenses the information into a smaller set of principal components, which simplifies subsequent analyses and visualizations. This helps in identifying patterns, clusters, and relationships among the assessed variables and gives a more interpretable representation of the data.
Data Set Overview:
The data set contains booking information for a city hotel and a resort hotel, such as when the booking was made, the length of stay, the number of adults and children, and the number of parking spaces, among other things.
Pattern Discovery: The goal is to extract the most important information from the data by reducing its dimensionality, which may also uncover latent patterns among the variables that affect booking behavior.
Dimensionality Reduction: PCA reduces the feature space by retaining the most salient information and discarding noise, which makes later analyses more efficient.
Variable Importance Assessment: We want to understand how much each variable contributes to the overall variance of the data set and to identify the variables that contribute substantially to the patterns we see.
The first step is to install the packages needed for this project and to explore the data, so that we understand the structure of the data set and the variable types.
options(repos = c(CRAN = "https://cloud.r-project.org"))
Installation of packages:
install.packages("mvtnorm")
install.packages("FactoMineR")
install.packages("caret")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(corrplot)
## corrplot 0.92 loaded
library(FactoMineR)
library(scatterplot3d)
library(readr)
hoteldata <- read_csv("hotel_bookings.csv")
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Understanding the variables:
head(hoteldata)
## # A tibble: 6 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, …
The head() function shows the first six rows of the data set. The columns cover the hotel, the booking details, and customer-related features.
summary(hoteldata)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
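The summary statistics for each variable help us understand the distribution and characteristics of the data set. For example, lead_time ranges from 0 to 737 days with a median of 69, and adr ranges from -6.38 to 5400 with a median of 94.58, which suggests the presence of outliers.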
Removing Unnecessary Columns
The next step removes columns that are not useful for the analysis. Some act as identifiers rather than measurements, some show little or no variation across rows, and others have many missing values.
# Removing unnecessary columns
new_hoteldata <- hoteldata %>%
  select(
    -hotel,
    -company,
    -agent,
    -reservation_status,
    -reservation_status_date,
    -arrival_date_year,
    -arrival_date_week_number,
    -arrival_date_month,
    -country,
    -children,
    -babies,
    -meal
  )
The select() call dropped every column listed with a minus sign. The columns that remain are:
remaining_columns <- names(new_hoteldata)
print(remaining_columns)
## [1] "is_canceled" "lead_time"
## [3] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [5] "stays_in_week_nights" "adults"
## [7] "market_segment" "distribution_channel"
## [9] "is_repeated_guest" "previous_cancellations"
## [11] "previous_bookings_not_canceled" "reserved_room_type"
## [13] "assigned_room_type" "booking_changes"
## [15] "deposit_type" "days_in_waiting_list"
## [17] "customer_type" "adr"
## [19] "required_car_parking_spaces" "total_of_special_requests"
Given the objectives, these columns can help determine the best time to book a hotel or resort, using factors such as lead time, day of the month, weekend and week nights stayed, number of adults, and the average daily rate (adr). Patterns such as guest composition, demand, and popular booking times may become visible.
Encoding Data
The code below transforms all categorical variables into numeric ones. It first converts every character column to a factor and then converts each factor to its integer level code. Note that this is integer (label) encoding rather than one-hot encoding: each category becomes a single number, not a set of indicator columns.
# Integer-encode categorical variables
new_hoteldata <- new_hoteldata %>%
  mutate(across(where(is.character), as.factor)) %>% # Convert character columns to factors
  mutate(across(everything(), as.numeric)) # Convert all columns to numeric (factors become level codes)
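For comparison, the following is a minimal sketch of the difference between integer (label) encoding and one-hot encoding. It uses a made-up toy data frame, not the hotel data; the object names `toy`, `label_encoded` and `one_hot` are illustrative assumptions, and `model.matrix()` from base R is only one of several ways to build indicator columns.

```r
# Toy data frame; the column names mirror the hotel data but the values are made up
toy <- data.frame(
  deposit_type = c("No Deposit", "Non Refund", "Refundable", "No Deposit"),
  lead_time    = c(342, 7, 13, 14),
  stringsAsFactors = TRUE
)

# Integer (label) encoding: each level is replaced by its position in levels()
label_encoded <- as.numeric(toy$deposit_type)

# One-hot encoding: "- 1" drops the intercept, so the first factor
# gets one indicator column per level
one_hot <- model.matrix(~ . - 1, data = toy)

print(label_encoded)
print(colnames(one_hot))
```

Integer codes impose an ordering and a spacing on the categories that PCA then treats as numeric distance, so the choice of encoding can affect the components.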
To check that every variable was converted to numeric, I used the code below. All 20 columns are now of class numeric:
# Check data types of columns
column_types <- sapply(new_hoteldata, class)
print(column_types)
## is_canceled lead_time
## "numeric" "numeric"
## arrival_date_day_of_month stays_in_weekend_nights
## "numeric" "numeric"
## stays_in_week_nights adults
## "numeric" "numeric"
## market_segment distribution_channel
## "numeric" "numeric"
## is_repeated_guest previous_cancellations
## "numeric" "numeric"
## previous_bookings_not_canceled reserved_room_type
## "numeric" "numeric"
## assigned_room_type booking_changes
## "numeric" "numeric"
## deposit_type days_in_waiting_list
## "numeric" "numeric"
## customer_type adr
## "numeric" "numeric"
## required_car_parking_spaces total_of_special_requests
## "numeric" "numeric"
The following code checks for missing values in the entire data set. None were found.
any(is.na(new_hoteldata))
## [1] FALSE
# Check for missing values in each column
missing_values <- colSums(is.na(new_hoteldata))
missing_values
## is_canceled lead_time
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
# Display columns with missing values
print(names(new_hoteldata)[missing_values > 0])
## character(0)
The next step normalizes the data set. The code below selects the numeric columns and then centers and scales them, so that each column has mean 0 and standard deviation 1.
# Selecting only the numeric columns for normalization (adjust if needed)
numeric_cols <- sapply(new_hoteldata, is.numeric)
numeric_cols
## is_canceled lead_time
## TRUE TRUE
## arrival_date_day_of_month stays_in_weekend_nights
## TRUE TRUE
## stays_in_week_nights adults
## TRUE TRUE
## market_segment distribution_channel
## TRUE TRUE
## is_repeated_guest previous_cancellations
## TRUE TRUE
## previous_bookings_not_canceled reserved_room_type
## TRUE TRUE
## assigned_room_type booking_changes
## TRUE TRUE
## deposit_type days_in_waiting_list
## TRUE TRUE
## customer_type adr
## TRUE TRUE
## required_car_parking_spaces total_of_special_requests
## TRUE TRUE
new_hoteldata <- new_hoteldata[, numeric_cols]
# Normalizing the numeric columns
preproc1 <- preProcess(new_hoteldata, method = c("center", "scale"))
new_hoteldata <- predict(preproc1, new_hoteldata)
# Re-attach any non-numeric columns. Every column is numeric after the encoding step,
# so !numeric_cols selects nothing and this line leaves the data unchanged.
new_hoteldata <- cbind(new_hoteldata[, !numeric_cols, drop = FALSE], new_hoteldata)
# View summary of the normalized dataset
summary(new_hoteldata)
## is_canceled lead_time arrival_date_day_of_month
## Min. :-0.767 Min. :-0.9733 Min. :-1.68529
## 1st Qu.:-0.767 1st Qu.:-0.8049 1st Qu.:-0.88810
## Median :-0.767 Median :-0.3276 Median : 0.02298
## Mean : 0.000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 1.304 3rd Qu.: 0.5239 3rd Qu.: 0.82017
## Max. : 1.304 Max. : 5.9234 Max. : 1.73124
## stays_in_weekend_nights stays_in_week_nights adults
## Min. :-0.9289 Min. :-1.3102 Min. :-3.2048
## 1st Qu.:-0.9289 1st Qu.:-0.7862 1st Qu.: 0.2479
## Median : 0.0725 Median :-0.2622 Median : 0.2479
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.0739 3rd Qu.: 0.2619 3rd Qu.: 0.2479
## Max. :18.0975 Max. :24.8913 Max. :91.7438
## market_segment distribution_channel is_repeated_guest
## Min. :-3.89043 Min. :-2.8486 Min. :-0.1816
## 1st Qu.:-0.73268 1st Qu.: 0.4569 1st Qu.:-0.1816
## Median : 0.05676 Median : 0.4569 Median :-0.1816
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.84620 3rd Qu.: 0.4569 3rd Qu.:-0.1816
## Max. : 1.63563 Max. : 1.5587 Max. : 5.5078
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. :-0.1032 Min. :-0.09155 Min. :-0.583
## 1st Qu.:-0.1032 1st Qu.:-0.09155 1st Qu.:-0.583
## Median :-0.1032 Median :-0.09155 Median :-0.583
## Mean : 0.0000 Mean : 0.00000 Mean : 0.000
## 3rd Qu.:-0.1032 3rd Qu.:-0.09155 3rd Qu.: 1.185
## Max. :30.6902 Max. :47.99061 Max. : 4.720
## assigned_room_type booking_changes deposit_type days_in_waiting_list
## Min. :-0.7076 Min. :-0.339 Min. :-0.3732 Min. :-0.1319
## 1st Qu.:-0.7076 1st Qu.:-0.339 1st Qu.:-0.3732 1st Qu.:-0.1319
## Median :-0.7076 Median :-0.339 Median :-0.3732 Median :-0.1319
## Mean : 0.0000 Mean : 0.000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8892 3rd Qu.:-0.339 3rd Qu.:-0.3732 3rd Qu.:-0.1319
## Max. : 5.1473 Max. :31.855 Max. : 5.6027 Max. :22.0907
## customer_type adr required_car_parking_spaces
## Min. :-3.704 Min. : -2.1413 Min. :-0.2549
## 1st Qu.:-0.238 1st Qu.: -0.6439 1st Qu.:-0.2549
## Median :-0.238 Median : -0.1436 Median :-0.2549
## Mean : 0.000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.238 3rd Qu.: 0.4783 3rd Qu.:-0.2549
## Max. : 1.495 Max. :104.8399 Max. :32.3594
## total_of_special_requests
## Min. :-0.7207
## 1st Qu.:-0.7207
## Median :-0.7207
## Mean : 0.0000
## 3rd Qu.: 0.5407
## Max. : 5.5861
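As an illustration of what centering and scaling do, here is a minimal sketch on made-up values. It assumes that the "center" and "scale" methods subtract the column mean and divide by the sample standard deviation, which is what base R's `scale()` does by default; the names `toy`, `z_manual` and `z_scale` are hypothetical.

```r
# Made-up values standing in for two numeric columns
toy <- data.frame(lead_time = c(342, 737, 7, 13, 14),
                  adr       = c(0, 75, 75, 98, 107))

# Manual z-score: subtract the column mean, divide by the sample standard deviation
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Base R equivalent
z_scale <- as.data.frame(scale(toy))

# The two approaches should agree, with column means of 0 and standard deviations of 1
print(all.equal(z_manual, z_scale, check.attributes = FALSE))
print(round(colMeans(z_scale), 10))
print(apply(z_scale, 2, sd))
```

Standardizing does not remove outliers. In the summary above, adults and adr still reach maxima of roughly 92 and 105 standard deviations.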
This step checks whether the data set is suitable for PCA. Because the variables were standardized, the covariance matrix computed below is also their correlation matrix (every diagonal entry is 1). Its eigenvalues give the amount of variance captured by each principal component, so a few large eigenvalues would indicate structure in the data that a small number of components can capture.
# Computing Covariance Matrix
hoteldata.cov <- cov(new_hoteldata)
hoteldata.cov
## is_canceled lead_time
## is_canceled 1.000000000 0.2931233558
## lead_time 0.293123356 1.0000000000
## arrival_date_day_of_month -0.006130079 0.0022675527
## stays_in_weekend_nights -0.001791078 0.0856711329
## stays_in_week_nights 0.024764629 0.1657993639
## adults 0.060017213 0.1195186926
## market_segment 0.059338114 0.0137965306
## distribution_channel 0.167600278 0.2204135894
## is_repeated_guest -0.084793418 -0.1244099080
## previous_cancellations 0.110132808 0.0860418019
## previous_bookings_not_canceled -0.057357723 -0.0735481679
## reserved_room_type -0.061281949 -0.1060893825
## assigned_room_type -0.176027704 -0.1722186443
## booking_changes -0.144380991 0.0001488301
## deposit_type 0.468633824 0.3756669301
## days_in_waiting_list 0.054185824 0.1700841843
## customer_type -0.068140105 0.0734027445
## adr 0.047556598 -0.0630768525
## required_car_parking_spaces -0.195497817 -0.1164505701
## total_of_special_requests -0.234657774 -0.0957120489
## arrival_date_day_of_month
## is_canceled -0.0061300789
## lead_time 0.0022675527
## arrival_date_day_of_month 1.0000000000
## stays_in_weekend_nights -0.0163542995
## stays_in_week_nights -0.0281735214
## adults -0.0015659791
## market_segment -0.0040881739
## distribution_channel 0.0015777224
## is_repeated_guest -0.0061450207
## previous_cancellations -0.0270107761
## previous_bookings_not_canceled -0.0002997868
## reserved_room_type 0.0169290557
## assigned_room_type 0.0116464858
## booking_changes 0.0106128560
## deposit_type -0.0013583178
## days_in_waiting_list 0.0227275352
## customer_type 0.0121878981
## adr 0.0302451948
## required_car_parking_spaces 0.0086834665
## total_of_special_requests 0.0030621241
## stays_in_weekend_nights stays_in_week_nights
## is_canceled -0.001791078 0.02476463
## lead_time 0.085671133 0.16579936
## arrival_date_day_of_month -0.016354300 -0.02817352
## stays_in_weekend_nights 1.000000000 0.49896882
## stays_in_week_nights 0.498968818 1.00000000
## adults 0.091871020 0.09297551
## market_segment 0.115349712 0.10856906
## distribution_channel 0.093096604 0.08718507
## is_repeated_guest -0.087239379 -0.09724497
## previous_cancellations -0.012774619 -0.01399243
## previous_bookings_not_canceled -0.042715235 -0.04874255
## reserved_room_type 0.142082770 0.16861584
## assigned_room_type 0.086642777 0.10079464
## booking_changes 0.063281316 0.09620945
## deposit_type -0.111434936 -0.07678763
## days_in_waiting_list -0.054151113 -0.00201981
## customer_type -0.109220019 -0.12722251
## adr 0.049341906 0.06523748
## required_car_parking_spaces -0.018553809 -0.02485942
## total_of_special_requests 0.072670830 0.06819178
## adults market_segment distribution_channel
## is_canceled 0.060017213 0.059338114 0.167600278
## lead_time 0.119518693 0.013796531 0.220413589
## arrival_date_day_of_month -0.001565979 -0.004088174 0.001577722
## stays_in_weekend_nights 0.091871020 0.115349712 0.093096604
## stays_in_week_nights 0.092975513 0.108569055 0.087185072
## adults 1.000000000 0.208409260 0.178977853
## market_segment 0.208409260 1.000000000 0.767751444
## distribution_channel 0.178977853 0.767751444 1.000000000
## is_repeated_guest -0.146426116 -0.250286406 -0.263218702
## previous_cancellations -0.006738096 -0.059645115 -0.022482556
## previous_bookings_not_canceled -0.107983172 -0.179589330 -0.204730629
## reserved_room_type 0.211434290 0.094539569 -0.041719598
## assigned_room_type 0.144779408 0.026377154 -0.104501946
## booking_changes -0.051672774 -0.071818120 -0.113600981
## deposit_type -0.027643874 -0.184846756 0.092580407
## days_in_waiting_list -0.008283347 -0.041502532 0.048642060
## customer_type -0.101755964 -0.165814403 -0.069640412
## adr 0.230641216 0.232762689 0.092396341
## required_car_parking_spaces 0.014784817 -0.062226077 -0.132280183
## total_of_special_requests 0.122883546 0.274372834 0.098815474
## is_repeated_guest previous_cancellations
## is_canceled -0.084793418 0.110132808
## lead_time -0.124409908 0.086041802
## arrival_date_day_of_month -0.006145021 -0.027010776
## stays_in_weekend_nights -0.087239379 -0.012774619
## stays_in_week_nights -0.097244972 -0.013992431
## adults -0.146426116 -0.006738096
## market_segment -0.250286406 -0.059645115
## distribution_channel -0.263218702 -0.022482556
## is_repeated_guest 1.000000000 0.082293234
## previous_cancellations 0.082293234 1.000000000
## previous_bookings_not_canceled 0.418055995 0.152728115
## reserved_room_type -0.029536937 -0.048808602
## assigned_room_type 0.032440896 -0.058457287
## booking_changes 0.012091787 -0.026992663
## deposit_type -0.057502006 0.139401090
## days_in_waiting_list -0.022234965 0.005928941
## customer_type -0.017111320 -0.008188270
## adr -0.134314447 -0.065645638
## required_car_parking_spaces 0.077089573 -0.018492250
## total_of_special_requests 0.013050009 -0.048384118
## previous_bookings_not_canceled
## is_canceled -0.0573577232
## lead_time -0.0735481679
## arrival_date_day_of_month -0.0002997868
## stays_in_weekend_nights -0.0427152350
## stays_in_week_nights -0.0487425495
## adults -0.1079831725
## market_segment -0.1795893298
## distribution_channel -0.2047306294
## is_repeated_guest 0.4180559949
## previous_cancellations 0.1527281149
## previous_bookings_not_canceled 1.0000000000
## reserved_room_type -0.0217713822
## assigned_room_type 0.0031332292
## booking_changes 0.0116075289
## deposit_type -0.0314751591
## days_in_waiting_list -0.0093969779
## customer_type -0.0122594219
## adr -0.0721441957
## required_car_parking_spaces 0.0476530869
## total_of_special_requests 0.0378237757
## reserved_room_type assigned_room_type
## is_canceled -0.06128195 -0.176027704
## lead_time -0.10608938 -0.172218644
## arrival_date_day_of_month 0.01692906 0.011646486
## stays_in_weekend_nights 0.14208277 0.086642777
## stays_in_week_nights 0.16861584 0.100794643
## adults 0.21143429 0.144779408
## market_segment 0.09453957 0.026377154
## distribution_channel -0.04171960 -0.104501946
## is_repeated_guest -0.02953694 0.032440896
## previous_cancellations -0.04880860 -0.058457287
## previous_bookings_not_canceled -0.02177138 0.003133229
## reserved_room_type 1.00000000 0.814004850
## assigned_room_type 0.81400485 1.000000000
## booking_changes 0.04505990 0.096161792
## deposit_type -0.19968852 -0.242384323
## days_in_waiting_list -0.06882141 -0.068675519
## customer_type -0.12097824 -0.084426540
## adr 0.39206017 0.258134371
## required_car_parking_spaces 0.13158299 0.160131191
## total_of_special_requests 0.13746590 0.124682772
## booking_changes deposit_type
## is_canceled -0.1443809911 0.468633824
## lead_time 0.0001488301 0.375666930
## arrival_date_day_of_month 0.0106128560 -0.001358318
## stays_in_weekend_nights 0.0632813159 -0.111434936
## stays_in_week_nights 0.0962094460 -0.076787627
## adults -0.0516727735 -0.027643874
## market_segment -0.0718181200 -0.184846756
## distribution_channel -0.1136009808 0.092580407
## is_repeated_guest 0.0120917873 -0.057502006
## previous_cancellations -0.0269926626 0.139401090
## previous_bookings_not_canceled 0.0116075289 -0.031475159
## reserved_room_type 0.0450599012 -0.199688519
## assigned_room_type 0.0961617923 -0.242384323
## booking_changes 1.0000000000 -0.112153435
## deposit_type -0.1121534348 1.000000000
## days_in_waiting_list -0.0116339446 0.121016520
## customer_type 0.0920289768 -0.076403851
## adr 0.0196176738 -0.089838490
## required_car_parking_spaces 0.0656201914 -0.090929065
## total_of_special_requests 0.0528334357 -0.266672398
## days_in_waiting_list customer_type adr
## is_canceled 0.054185824 -0.06814011 0.04755660
## lead_time 0.170084184 0.07340274 -0.06307685
## arrival_date_day_of_month 0.022727535 0.01218790 0.03024519
## stays_in_weekend_nights -0.054151113 -0.10922002 0.04934191
## stays_in_week_nights -0.002019810 -0.12722251 0.06523748
## adults -0.008283347 -0.10175596 0.23064122
## market_segment -0.041502532 -0.16581440 0.23276269
## distribution_channel 0.048642060 -0.06964041 0.09239634
## is_repeated_guest -0.022234965 -0.01711132 -0.13431445
## previous_cancellations 0.005928941 -0.00818827 -0.06564564
## previous_bookings_not_canceled -0.009396978 -0.01225942 -0.07214420
## reserved_room_type -0.068821409 -0.12097824 0.39206017
## assigned_room_type -0.068675519 -0.08442654 0.25813437
## booking_changes -0.011633945 0.09202898 0.01961767
## deposit_type 0.121016520 -0.07640385 -0.08983849
## days_in_waiting_list 1.000000000 0.09912121 -0.04075641
## customer_type 0.099121207 1.00000000 -0.07715529
## adr -0.040756412 -0.07715529 1.00000000
## required_car_parking_spaces -0.030600046 -0.03006034 0.05662809
## total_of_special_requests -0.082729719 -0.13562449 0.17218526
## required_car_parking_spaces
## is_canceled -0.195497817
## lead_time -0.116450570
## arrival_date_day_of_month 0.008683466
## stays_in_weekend_nights -0.018553809
## stays_in_week_nights -0.024859423
## adults 0.014784817
## market_segment -0.062226077
## distribution_channel -0.132280183
## is_repeated_guest 0.077089573
## previous_cancellations -0.018492250
## previous_bookings_not_canceled 0.047653087
## reserved_room_type 0.131582990
## assigned_room_type 0.160131191
## booking_changes 0.065620191
## deposit_type -0.090929065
## days_in_waiting_list -0.030600046
## customer_type -0.030060336
## adr 0.056628092
## required_car_parking_spaces 1.000000000
## total_of_special_requests 0.082626338
## total_of_special_requests
## is_canceled -0.234657774
## lead_time -0.095712049
## arrival_date_day_of_month 0.003062124
## stays_in_weekend_nights 0.072670830
## stays_in_week_nights 0.068191782
## adults 0.122883546
## market_segment 0.274372834
## distribution_channel 0.098815474
## is_repeated_guest 0.013050009
## previous_cancellations -0.048384118
## previous_bookings_not_canceled 0.037823776
## reserved_room_type 0.137465901
## assigned_room_type 0.124682772
## booking_changes 0.052833436
## deposit_type -0.266672398
## days_in_waiting_list -0.082729719
## customer_type -0.135624489
## adr 0.172185264
## required_car_parking_spaces 0.082626338
## total_of_special_requests 1.000000000
The hoteldata.cov object holds the covariance matrix, which shows how each variable co-varies with every other variable. To find the directions of maximum variance in the data, we compute its eigenvalues and eigenvectors:
# Performing Eigenvalue Decomposition
hoteldata.eigen <- eigen(hoteldata.cov)
# Extract and print the eigenvalues
cat("Eigenvalues:\n", hoteldata.eigen$values, "\n\n")
## Eigenvalues:
## 2.763267 2.506172 1.597247 1.426701 1.338641 1.099364 0.997682 0.9624933 0.9392584 0.8735969 0.8591947 0.8041858 0.7536473 0.7106868 0.5698421 0.5387623 0.4932718 0.4427023 0.1629183 0.1603658
# Print the first 6 entries of each eigenvector
# (head() returns the first 6 rows of the matrix; cat() flattens them column by column)
cat("First 6 Eigenvectors:\n", head(hoteldata.eigen$vectors), "\n")
## First 6 Eigenvectors:
## 0.163022 0.1329142 -0.008132136 -0.2080319 -0.2050434 -0.2330377 -0.3324272 -0.3304974 0.007542649 -0.1120023 -0.133031 -0.1802476 -0.3242017 -0.2921065 0.009815215 -0.2297461 -0.2934714 -0.1135843 -0.1955721 0.1052285 -0.08369444 0.5521852 0.5409331 -0.1580566 -0.1189917 0.07457223 0.08336091 -0.02177799 0.008217913 -0.04322034 0.1226364 -0.3128372 -0.2188933 0.1222228 0.05064075 -0.1304778 0.04419859 -0.04367869 0.9242116 0.08792635 0.06250399 -0.08819262 0.1069988 -0.1368163 0.04974372 0.04718475 0.009377744 -0.2015718 -0.09059974 -0.02748142 -0.2031573 0.07104095 0.1006474 -0.03213227 -0.1916116 -0.09613566 0.1894226 0.158335 0.04679615 0.08243297 0.07892016 -0.1223133 -0.01600518 -0.03963314 -0.002969157 -0.6915568 -0.1363588 -0.1777587 0.015917 -0.1038286 -0.02778924 -0.04383995 -0.184022 0.2765265 0.04977108 -0.1693823 -0.1193284 0.2790982 0.04182551 -0.3830099 0.01001301 0.0586844 -0.02294277 0.4630745 0.3356148 -0.1644262 0.01629506 0.1133005 -0.01223876 0.04859274 0.4926357 -0.3850945 0.003840265 0.2162595 -0.1566094 0.1112059 -0.2986884 0.03660823 -0.02467556 0.6403913 -0.6132342 -0.02148762 0.3393498 0.4374404 -0.009461983 0.1263904 -0.3537707 -0.1109285 -0.08288969 -0.02046703 -0.002973465 -0.01597409 -0.03365896 -0.01069454 -0.0709218 0.06696958 0.006544829 0.02404613 0.003690696 0.008714622
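In the matrix returned by `eigen()`, each column of `$vectors` is one eigenvector, so selecting the first few eigenvectors means selecting columns rather than rows. The sketch below shows this on simulated data; the variables `x1`, `x2`, `x3` and the choice `k <- 2` are assumptions made purely for illustration.

```r
set.seed(1)
# Simulated data: three variables, two of them correlated (illustrative only)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.5)
x3 <- rnorm(200)
toy <- scale(cbind(x1, x2, x3))

toy_eigen <- eigen(cov(toy))

# Each COLUMN of $vectors is one eigenvector; take the first k columns
k <- 2
first_k <- toy_eigen$vectors[, 1:k]
print(dim(first_k))

# For standardized data the eigenvalues sum to the number of variables
print(sum(toy_eigen$values))

# Share of total variance carried by each component
print(round(toy_eigen$values / sum(toy_eigen$values), 3))
```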
corrplot(hoteldata.cov,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45)
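All 20 eigenvalues are positive, as expected for a covariance matrix, so none are negative. Following Kaiser's rule of thumb, only the components with eigenvalues above one are considered for the principal component analysis, which here are the first six (2.76 down to 1.10). The remaining components, whose eigenvalues are below one, will be discarded. The matrix above also shows that only a few pairs of variables are strongly correlated, notably reserved_room_type with assigned_room_type (0.81) and market_segment with distribution_channel (0.77).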
In this section, we look more closely at the numeric variables by plotting their distributions. This step helps us judge how reliable the PCA results are likely to be before we run the analysis.
The histograms show the shape of each variable and highlight features, such as skewness or extreme values, that may influence the analysis.
# Select numeric columns for visualization
numeric_data <- new_hoteldata %>%
select_if(is.numeric)
# Visualize histograms with density overlays
par(mfrow = c(2, 2))
for (col in names(numeric_data)) {
  # freq = FALSE puts the histogram on the density scale so the overlaid curve is visible
  hist(numeric_data[[col]], freq = FALSE, main = col, col = "lightblue", xlab = col)
  lines(density(numeric_data[[col]]), col = "red", lty = 2, lwd = 2)
}
Most of the variables are positively skewed and only a few are approximately normal, but the overall distributional characteristics are consistent with expectations. Hence, we can proceed with the principal component analysis.
Principal Component Analysis (PCA) helps us find patterns in high-dimensional data sets. It reduces the dimension of the data while retaining as much of the original variability as possible. It does so by transforming the correlated variables into a set of linearly uncorrelated variables, the principal components, which represent the data more concisely.
The objective of this section is to walk through the application of PCA to our data set, exploring its numeric variables to reveal underlying structures and relationships.
This initial step involves isolating the numeric variables from our data set. This ensures that the analysis focuses on the quantitative aspects of the data.
# Selecting numeric variables
numeric_data <- new_hoteldata %>%
select_if(is.numeric)
PCA is then applied to the selected numeric variables, generating principal components that capture the essential information present in the original data.
#applying PCA
pca_result <- PCA(numeric_data, graph = FALSE)
pca_result
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 119390 individuals, described by 20 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
These results collectively provide a comprehensive overview of the PCA, helping to understand how variables and individuals contribute to the identified principal components.
The next step is to determine how many components to retain. We use Kaiser's stopping rule, a criterion for deciding which components to keep.
Eigenvalues:
eigenvalues <- pca_result$eig
print(eigenvalues)
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 2.7632675 13.8163374 13.81634
## comp 2 2.5061716 12.5308581 26.34720
## comp 3 1.5972473 7.9862366 34.33343
## comp 4 1.4267008 7.1335039 41.46694
## comp 5 1.3386413 6.6932063 48.16014
## comp 6 1.0993637 5.4968186 53.65696
## comp 7 0.9976820 4.9884102 58.64537
## comp 8 0.9624933 4.8124665 63.45784
## comp 9 0.9392584 4.6962918 68.15413
## comp 10 0.8735969 4.3679846 72.52211
## comp 11 0.8591947 4.2959733 76.81809
## comp 12 0.8041858 4.0209292 80.83902
## comp 13 0.7536473 3.7682365 84.60725
## comp 14 0.7106868 3.5534341 88.16069
## comp 15 0.5698421 2.8492106 91.00990
## comp 16 0.5387623 2.6938117 93.70371
## comp 17 0.4932718 2.4663590 96.17007
## comp 18 0.4427023 2.2135113 98.38358
## comp 19 0.1629183 0.8145914 99.19817
## comp 20 0.1603658 0.8018290 100.00000
Under Kaiser’s rule, only the components with an eigenvalue greater than 1 are retained. Here that gives 6 components.
To better visualize the eigenvalues, we also draw a scree plot, in which the components are plotted in descending order of eigenvalue and the number to retain is read off at the “elbow”.
Based on the above, I consider retaining the first 5 or 6 components. Together these explain approximately 48% (5 components) to 54% (6 components) of the total variance.
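As a quick check of Kaiser’s rule, the count can be computed directly instead of being read off the table. The sketch below is only illustrative: it copies the eigenvalues printed above into a vector (the names eig, n_kaiser and retained are introduced here and are not part of the original analysis).

```r
# Eigenvalues copied from the table above (components 1 to 20).
eig <- c(2.7632675, 2.5061716, 1.5972473, 1.4267008, 1.3386413,
         1.0993637, 0.9976820, 0.9624933, 0.9392584, 0.8735969,
         0.8591947, 0.8041858, 0.7536473, 0.7106868, 0.5698421,
         0.5387623, 0.4932718, 0.4427023, 0.1629183, 0.1603658)

# Kaiser's rule: keep the components whose eigenvalue exceeds 1.
n_kaiser <- sum(eig > 1)

# Share of the total variance retained by those components.
retained <- sum(eig[seq_len(n_kaiser)]) / sum(eig)

print(n_kaiser)
print(round(100 * retained, 2))
```

With the real objects, the same count would come from sum(pca_result$eig[, "eigenvalue"] > 1), assuming the column name shown in the printed table.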
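For comparison, we also run PCA with the prcomp function from base R. With center = TRUE and scale. = TRUE it works on the same standardized variables, so its squared standard deviations match the eigenvalues reported above, and its output gives direct access to the loadings and scores.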
# Performing PCA using prcomp
pca_result_prcomp <- prcomp(numeric_data, center = TRUE, scale. = TRUE)
pca_result_prcomp
## Standard deviations (1, .., p=20):
## [1] 1.6623079 1.5830893 1.2638225 1.1944458 1.1569967 1.0485055 0.9988403
## [8] 0.9810674 0.9691534 0.9346641 0.9269275 0.8967641 0.8681286 0.8430224
## [15] 0.7548789 0.7340043 0.7023331 0.6653587 0.4036314 0.4004570
##
## Rotation (n x k) = (20 x 20):
## PC1 PC2 PC3
## is_canceled 0.163021986 -0.332427158 0.324201724
## lead_time 0.132914223 -0.330497439 0.292106455
## arrival_date_day_of_month -0.008132136 0.007542649 -0.009815215
## stays_in_weekend_nights -0.208031895 -0.112002278 0.229746143
## stays_in_week_nights -0.205043369 -0.133031032 0.293471357
## adults -0.233037697 -0.180247625 0.113584293
## market_segment -0.299445296 -0.358777543 -0.337873387
## distribution_channel -0.148025680 -0.464231643 -0.253688076
## is_repeated_guest 0.122589294 0.319976263 0.089260625
## previous_cancellations 0.117543721 -0.015753450 0.204017022
## previous_bookings_not_canceled 0.102496026 0.257870580 0.120229219
## reserved_room_type -0.432367978 0.121475142 0.350289559
## assigned_room_type -0.399415338 0.208268854 0.295567164
## booking_changes -0.064507375 0.132845923 0.044349424
## deposit_type 0.297673821 -0.257855970 0.331049702
## days_in_waiting_list 0.110667458 -0.092118938 0.090936236
## customer_type 0.147885253 0.074871202 -0.099651394
## adr -0.319806693 -0.062085978 0.105754093
## required_car_parking_spaces -0.109249871 0.205316099 0.012723571
## total_of_special_requests -0.278497631 0.041741734 -0.252843728
## PC4 PC5 PC6
## is_canceled -0.19557209 0.118991743 -0.12263644
## lead_time 0.10522852 -0.074572227 0.31283718
## arrival_date_day_of_month -0.08369444 -0.083360911 0.21889329
## stays_in_weekend_nights 0.55218523 0.021777993 -0.12222281
## stays_in_week_nights 0.54093314 -0.008217913 -0.05064075
## adults -0.15805656 0.043220343 0.13047775
## market_segment -0.02208721 0.206749397 0.13166308
## distribution_channel -0.02239653 0.149595661 0.18289464
## is_repeated_guest 0.04227424 0.413429019 0.18086437
## previous_cancellations -0.00559325 0.341349814 0.22019459
## previous_bookings_not_canceled 0.07339005 0.484211941 0.29677084
## reserved_room_type -0.23739719 -0.060077845 0.01522525
## assigned_room_type -0.21749655 -0.090371938 0.02651198
## booking_changes 0.26996725 -0.276146767 0.28401829
## deposit_type -0.19564815 0.055429422 -0.05220138
## days_in_waiting_list -0.03453661 -0.193591664 0.54057298
## customer_type 0.01207297 -0.435890318 0.36961397
## adr -0.28843486 -0.015880374 0.07629755
## required_car_parking_spaces -0.06508786 -0.015527591 0.11722493
## total_of_special_requests 0.09689391 0.254588253 0.22775593
## PC7 PC8 PC9
## is_canceled -0.044198589 -0.106998772 0.09059974
## lead_time 0.043678692 0.136816296 0.02748142
## arrival_date_day_of_month -0.924211559 -0.049743715 0.20315733
## stays_in_weekend_nights -0.087926351 -0.047184747 -0.07104095
## stays_in_week_nights -0.062503990 -0.009377744 -0.10064736
## adults 0.088192616 0.201571758 0.03213227
## market_segment 0.028434994 -0.099096531 0.01582843
## distribution_channel 0.033460117 -0.090431951 -0.01277267
## is_repeated_guest -0.051508033 -0.123281652 -0.14918532
## previous_cancellations 0.239567222 -0.020443131 0.45877333
## previous_bookings_not_canceled -0.026653803 -0.168477805 -0.04317219
## reserved_room_type 0.037806124 -0.168816378 -0.04687782
## assigned_room_type 0.057118362 -0.165364093 -0.06214205
## booking_changes 0.078539184 0.113963381 0.50892717
## deposit_type -0.047977722 0.225550647 0.04460333
## days_in_waiting_list 0.011949017 0.012016968 -0.61736591
## customer_type 0.215129024 -0.330493593 0.15604943
## adr -0.009006076 -0.036106192 0.14490237
## required_car_parking_spaces 0.013117473 0.772196381 -0.05356783
## total_of_special_requests -0.005771240 0.189069199 0.04619349
## PC10 PC11 PC12
## is_canceled -0.191611605 -0.078920159 0.1363588135
## lead_time -0.096135665 0.122313276 0.1777586937
## arrival_date_day_of_month 0.189422592 0.016005183 -0.0159170037
## stays_in_weekend_nights 0.158334962 0.039633144 0.1038285799
## stays_in_week_nights 0.046796153 0.002969157 0.0277892375
## adults 0.082432966 0.691556793 0.0438399452
## market_segment 0.001030065 -0.255379953 0.1196642819
## distribution_channel 0.026062459 -0.292554465 0.2082896355
## is_repeated_guest -0.193920444 -0.003103573 0.2492025766
## previous_cancellations 0.576336567 -0.119276804 -0.3730003655
## previous_bookings_not_canceled -0.142942490 0.047799738 0.2245788198
## reserved_room_type 0.035279122 -0.152612151 0.0196428585
## assigned_room_type 0.060424990 -0.215949736 -0.0009237532
## booking_changes -0.511071229 -0.239112548 -0.1582899415
## deposit_type -0.174635240 -0.112615804 -0.0021827884
## days_in_waiting_list 0.016867086 -0.115607904 -0.4132197776
## customer_type 0.223719771 0.160355288 0.4573075399
## adr -0.236863969 0.155002110 0.0106679952
## required_car_parking_spaces 0.227275502 -0.263230171 0.3939534817
## total_of_special_requests -0.196327119 0.243350359 -0.2570372923
## PC13 PC14 PC15 PC16
## is_canceled 0.18402198 0.04182551 -0.33561481 0.492635741
## lead_time -0.27652652 -0.38300995 0.16442622 -0.385094523
## arrival_date_day_of_month -0.04977108 0.01001301 -0.01629506 0.003840265
## stays_in_weekend_nights 0.16938229 0.05868440 -0.11330050 0.216259495
## stays_in_week_nights 0.11932843 -0.02294277 0.01223876 -0.156609365
## adults -0.27909824 0.46307451 -0.04859274 0.111205888
## market_segment -0.01707275 0.11967305 -0.01480022 -0.000301591
## distribution_channel -0.13846135 0.08147704 0.02504908 -0.055305229
## is_repeated_guest -0.12806924 0.10234909 -0.61132010 -0.340316263
## previous_cancellations 0.09100219 0.01953027 -0.09136413 -0.089463939
## previous_bookings_not_canceled 0.10399801 0.09922743 0.62292972 0.258533231
## reserved_room_type -0.14304067 -0.12341214 0.01656907 0.054653993
## assigned_room_type -0.29748758 -0.11127244 0.01804668 0.078152468
## booking_changes -0.12420380 0.31425698 -0.03960680 0.081980808
## deposit_type -0.03245664 -0.16135469 0.05206010 0.078432272
## days_in_waiting_list 0.16775920 0.17373975 -0.07271031 0.096064495
## customer_type 0.15606577 -0.18383648 -0.14298258 0.190808374
## adr 0.71060584 0.01548226 0.04144802 -0.389844964
## required_car_parking_spaces 0.16774495 0.03900083 -0.04525367 0.088814900
## total_of_special_requests 0.01914205 -0.61093004 -0.19846847 0.322719976
## PC17 PC18 PC19
## is_canceled -0.298688448 0.339349780 -0.082889685
## lead_time 0.036608231 0.437440402 -0.020467027
## arrival_date_day_of_month -0.024675558 -0.009461983 -0.002973465
## stays_in_weekend_nights 0.640391301 0.126390401 -0.015974088
## stays_in_week_nights -0.613234154 -0.353770720 -0.033658964
## adults -0.021487615 -0.110928468 -0.010694539
## market_segment -0.048756692 0.041787024 0.077791785
## distribution_channel 0.098476957 -0.194047184 -0.069115403
## is_repeated_guest 0.070264082 -0.083546822 0.022946779
## previous_cancellations -0.009575395 0.015609893 -0.005974060
## previous_bookings_not_canceled -0.027800799 0.007186436 -0.009333450
## reserved_room_type 0.002715796 0.010251674 0.714792811
## assigned_room_type 0.038860476 -0.047522048 -0.671490681
## booking_changes 0.029697816 0.004273712 0.033201770
## deposit_type 0.283277500 -0.654337310 0.040260117
## days_in_waiting_list 0.019034254 0.029155690 0.005528546
## customer_type -0.017669343 -0.222753089 0.017739294
## adr 0.120802461 -0.008773265 -0.120726153
## required_car_parking_spaces -0.076714059 0.078067399 -0.001772977
## total_of_special_requests -0.023288575 -0.078162381 -0.015694060
## PC20
## is_canceled 0.070921796
## lead_time -0.066969575
## arrival_date_day_of_month -0.006544829
## stays_in_weekend_nights -0.024046131
## stays_in_week_nights -0.003690696
## adults -0.008714622
## market_segment -0.701041025
## distribution_channel 0.648961547
## is_repeated_guest -0.022901202
## previous_cancellations 0.009339823
## previous_bookings_not_canceled 0.007469654
## reserved_room_type 0.091430939
## assigned_room_type -0.082576837
## booking_changes 0.018834694
## deposit_type -0.216580574
## days_in_waiting_list -0.013888179
## customer_type -0.076359100
## adr 0.051468828
## required_car_parking_spaces 0.021864973
## total_of_special_requests 0.070045017
From the above, a larger standard deviation means that the component captures more of the variance in the data; squaring a standard deviation gives the eigenvalue reported earlier.
Loadings are the weights that define each principal component as a linear combination of the standardized original variables. This step shows how much each variable contributes to each principal component.
# Accessing loadings
loadings_prcomp <- pca_result_prcomp$rotation
print(loadings_prcomp[, 1:3])
## PC1 PC2 PC3
## is_canceled 0.163021986 -0.332427158 0.324201724
## lead_time 0.132914223 -0.330497439 0.292106455
## arrival_date_day_of_month -0.008132136 0.007542649 -0.009815215
## stays_in_weekend_nights -0.208031895 -0.112002278 0.229746143
## stays_in_week_nights -0.205043369 -0.133031032 0.293471357
## adults -0.233037697 -0.180247625 0.113584293
## market_segment -0.299445296 -0.358777543 -0.337873387
## distribution_channel -0.148025680 -0.464231643 -0.253688076
## is_repeated_guest 0.122589294 0.319976263 0.089260625
## previous_cancellations 0.117543721 -0.015753450 0.204017022
## previous_bookings_not_canceled 0.102496026 0.257870580 0.120229219
## reserved_room_type -0.432367978 0.121475142 0.350289559
## assigned_room_type -0.399415338 0.208268854 0.295567164
## booking_changes -0.064507375 0.132845923 0.044349424
## deposit_type 0.297673821 -0.257855970 0.331049702
## days_in_waiting_list 0.110667458 -0.092118938 0.090936236
## customer_type 0.147885253 0.074871202 -0.099651394
## adr -0.319806693 -0.062085978 0.105754093
## required_car_parking_spaces -0.109249871 0.205316099 0.012723571
## total_of_special_requests -0.278497631 0.041741734 -0.252843728
This output shows the loadings of all variables on the first three principal components (PC1, PC2, and PC3). A loading is the weight assigned to an original variable in the construction of a component: the larger its magnitude, the more strongly the variable is tied to that component. A positive loading means the variable increases with the component score, and a negative loading means it decreases. Multiplying a loading by the standard deviation of its component gives the correlation between the variable and the component.
As shown above, different variables drive each principal component. On PC1 the largest positive loadings belong to “deposit_type” (0.30), “is_canceled” (0.16), and “customer_type” (0.15), while the largest negative loadings belong to “reserved_room_type” (-0.43), “assigned_room_type” (-0.40), “adr” (-0.32), and “market_segment” (-0.30). On PC2 the largest positive loadings belong to “is_repeated_guest” (0.32), “previous_bookings_not_canceled” (0.26), “assigned_room_type” (0.21), and “required_car_parking_spaces” (0.21), while the largest negative loadings belong to “distribution_channel” (-0.46), “market_segment” (-0.36), “is_canceled” (-0.33), and “lead_time” (-0.33). The sign of a component is arbitrary, so what matters is which variables share a sign and which oppose each other.
The magnitude of a loading shows how much the variable contributes to a particular PC. In the plot above, color intensity encodes the loading value: darker colors correspond to low values and lighter colors to high (positive) values. Overall, the values start at a peak and decrease gradually.
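The link between prcomp loadings and correlations can be checked numerically. The sketch below is illustrative only: it uses the USArrests data set that ships with R rather than the hotel data, and the names pca_demo, cor_from_loadings and cor_direct are introduced here for the example.

```r
# Illustrative data only: USArrests ships with R (not the hotel data).
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Rescale each eigenvector column by the standard deviation of its component.
cor_from_loadings <- sweep(pca_demo$rotation, 2, pca_demo$sdev, `*`)

# Direct correlations between the original variables and the component scores.
cor_direct <- cor(USArrests, pca_demo$x)

# The two matrices agree up to floating-point error.
print(all.equal(cor_from_loadings, cor_direct))

# Each eigenvector has unit length, so no single loading can exceed 1
# in absolute value.
print(colSums(pca_demo$rotation^2))
```

This is why the raw values in $rotation are best read as weights, and why they need rescaling before being interpreted as correlations.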
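# Scores = standardized data projected onto the loadings (scale() leaves already standardized data unchanged)
principal_component_scores_prcomp <- scale(numeric_data) %*% loadings_prcomp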
print(principal_component_scores_prcomp[1:30 , 1:3])
## PC1 PC2 PC3
## [1,] 1.29668608 1.86266665 1.114822599
## [2,] 1.68908798 0.84469895 2.262529701
## [3,] 1.50640510 2.29393287 -0.302966930
## [4,] 2.33855069 2.84840622 -0.054954755
## [5,] -0.10166373 -0.17861992 -1.878628205
## [6,] -0.10166373 -0.17861992 -1.878628205
## [7,] 0.27595892 2.03852836 0.507501914
## [8,] -0.03881807 2.06825925 0.204806843
## [9,] 0.31802561 -1.13662989 -0.892908036
## [10,] -0.65745941 -0.35671845 0.805698687
## [11,] -1.64406018 -0.38791668 0.950941107
## [12,] -1.63859828 0.05388931 -0.005260471
## [13,] -2.56024202 0.27960796 -0.814960349
## [14,] -3.57375504 0.89804380 0.808887549
## [15,] -1.78519586 0.29191260 0.258669568
## [16,] -2.56024202 0.27960796 -0.814960349
## [17,] -2.07902356 0.31221635 0.876625928
## [18,] -0.43378053 0.30008387 -1.110184664
## [19,] 0.43934556 3.20273957 1.117360268
## [20,] -2.44196521 2.60329118 2.066360100
## [21,] -2.42616261 0.72282924 0.342826155
## [22,] 0.35978019 1.37100371 0.446945893
## [23,] 0.35978019 1.37100371 0.446945893
## [24,] -1.13715378 1.89984955 1.569444298
## [25,] -3.42195956 0.16747471 1.835192255
## [26,] -1.80307624 0.84714135 0.874455462
## [27,] -2.00818628 -0.17614219 1.119241576
## [28,] -2.72344622 -0.67141601 0.994663748
## [29,] -0.38569644 -0.85465063 -0.474370820
## [30,] -2.51461554 0.52249638 0.395798817
The matrix product above transforms the original variables into a new set of uncorrelated variables (the principal components) that capture the most important patterns in the data, which allows for dimensionality reduction and interpretation of the underlying structure.
# Viewing the dimensions of the result
dim(principal_component_scores_prcomp)
## [1] 119390 20
Each of the 119,390 observations in the original data set has been transformed into a set of 20 scores, one per principal component. These scores are the coordinates of each observation in the space defined by the principal components; the dimensionality is reduced only once a subset of the components is kept.
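The projection can be verified on a small example. The sketch below again uses the built-in USArrests data as a stand-in for the hotel data (an assumption made purely for illustration) and shows that the scores stored by prcomp in $x equal the standardized data multiplied by the loadings, whereas multiplying the raw, unstandardized data does not reproduce them.

```r
# Illustrative data only: USArrests ships with R (not the hotel data).
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Manual projection: center and scale the data, then multiply by the loadings.
scores_manual <- scale(USArrests) %*% pca_demo$rotation

# Largest absolute difference from the scores stored by prcomp.
print(max(abs(scores_manual - pca_demo$x)))

# Projecting the raw data instead gives different values.
scores_raw <- as.matrix(USArrests) %*% pca_demo$rotation
print(isTRUE(all.equal(scores_raw, pca_demo$x, check.attributes = FALSE)))

# The resulting components are uncorrelated with one another.
print(round(cor(pca_demo$x), 10))
```

In practice it is simplest to take the scores straight from pca_result_prcomp$x, which avoids any doubt about whether the input was standardized.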
The biplot below overlays the principal component scores and the variable loadings, which helps in interpreting the role of each variable in the principal component space.
fviz_pca_biplot(pca_result_prcomp, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
The points in the biplot represent individual observations from the data set, positioned by their scores on the first two principal components.
Longer arrows indicate variables that are better represented in the plane of the first two components and contribute more to them. Points that lie close together correspond to bookings with similar values on these two components, so the overall shape of the point cloud reflects the variability in the data. Vectors pointing in the same direction indicate positively correlated variables, while vectors pointing in opposite directions indicate negatively correlated ones.
Below, the standard deviation and variance of the principal components are computed, offering statistical measures of variability.
We start by calculating the variance of each principal component, that is, squaring the corresponding standard deviation values.
std_dev <- pca_result_prcomp$sdev
# Computing variance
pr_var <- std_dev^2
print(pr_var[1:10])
## [1] 2.7632675 2.5061716 1.5972473 1.4267008 1.3386413 1.0993637 0.9976820
## [8] 0.9624933 0.9392584 0.8735969
The output above shows the variance captured by each of the first 10 principal components. For example, PC1 has a variance of 2.76 and PC2 of 2.51. Because the 20 variables were standardized, the total variance is 20, so a component with a variance above 1 captures more than a single original variable does.
In this step, the proportion of variance explained by each principal component is examined, providing insights into the significance of each component in capturing the dataset’s variability.
prop_varex <- pr_var/sum(pr_var)
prop_varex
## [1] 0.138163374 0.125308581 0.079862366 0.071335039 0.066932063 0.054968186
## [7] 0.049884102 0.048124665 0.046962918 0.043679846 0.042959733 0.040209292
## [13] 0.037682365 0.035534341 0.028492106 0.026938117 0.024663590 0.022135113
## [19] 0.008145914 0.008018290
This gives the proportion of the total variance explained by each principal component; the proportions sum to 1. In this case, PC1 explains approximately 13.82% of the total variance, PC2 explains 12.53%, and so on.
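The same quantities can be cross-checked against the summary that prcomp provides. The sketch below is a minimal illustration on the built-in USArrests data (used here as an assumption in place of the hotel data); the names pca_demo, pr_var_demo and prop_demo are introduced only for this example.

```r
# Illustrative data only: USArrests ships with R (not the hotel data).
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variance of each component is the squared standard deviation.
pr_var_demo <- pca_demo$sdev^2

# With standardized variables the total variance equals the number of variables.
print(sum(pr_var_demo))
print(ncol(USArrests))

# Proportion of variance explained by each component.
prop_demo <- pr_var_demo / sum(pr_var_demo)
print(prop_demo)

# summary() reports the same proportions (rounded to 5 decimals).
prop_summary <- summary(pca_demo)$importance["Proportion of Variance", ]
print(prop_summary)
```

Applied to pca_result_prcomp, summary() would give the standard deviations, proportions and cumulative proportions in a single table.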
scatterplot3d(principal_component_scores_prcomp[, 1:3], main = "3D Scatter Plot")
The scatter plot above shows the first 3 principal components, one per axis. The individual points represent observations from the data set and, as observed, the cloud has the same overall shape as in the biplot, which shows the first two of these components. The points also follow a clear direction: the scores spread mainly towards positive values along the second principal component, which reflects how much it contributes to the variation in the data.
However, most points are concentrated in a dense region, indicating low variability among those observations and therefore similar booking profiles. We also note some outliers, that is, individual points that lie far from the main cluster. These may represent unique or unusual observations in the data.
plot(prop_varex, type = "b", xlab = "Principal Components", ylab = "Proportion of Variance",
main = "Scree Plot")
The scree plot typically shows a rapid decrease in the proportion of variance explained as we move from the first to the subsequent components. I use the “elbow” of the plot, where the decline slows down, to decide how many components to retain. In this case I will go with 6; to double-check, I also examine the cumulative proportion of variance below.
# cumulative variance
cumulative_variance <- cumsum(pca_result_prcomp$sdev^2) / sum(pca_result_prcomp$sdev^2)
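In this plot, each point corresponds to the cumulative proportion of variance explained by the components up to and including that one; the higher the point on the y-axis, the more variance is captured. The first 6 components retain about 54% of the total variance (see the cumulative percentages in the eigenvalue table), and reaching 90% would require 15 components. We therefore treat 6 components as a compromise between dimensionality reduction and retained information.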
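To read off how much variance a given number of components retains, the cumulative proportions can be recomputed and plotted. The sketch below copies the proportions printed earlier into a vector so that it runs on its own; the 90% threshold and the name n_for_90 are assumptions chosen for illustration.

```r
# Proportions of variance copied from the prop_varex output above.
prop_varex <- c(0.138163374, 0.125308581, 0.079862366, 0.071335039,
                0.066932063, 0.054968186, 0.049884102, 0.048124665,
                0.046962918, 0.043679846, 0.042959733, 0.040209292,
                0.037682365, 0.035534341, 0.028492106, 0.026938117,
                0.024663590, 0.022135113, 0.008145914, 0.008018290)

# Running total of the explained variance.
cumulative_variance <- cumsum(prop_varex)

# Variance retained by the first 6 components.
print(round(cumulative_variance[6], 4))

# Smallest number of components that reaches the (assumed) 90% threshold.
n_for_90 <- which(cumulative_variance >= 0.90)[1]
print(n_for_90)

# Cumulative variance plot with the threshold marked as a dashed line.
plot(cumulative_variance, type = "b",
     xlab = "Number of Principal Components",
     ylab = "Cumulative Proportion of Variance",
     main = "Cumulative Variance Plot")
abline(h = 0.90, lty = 2)
```

Marking the threshold on the plot makes the trade-off visible: each additional component beyond the sixth adds less than 5% of the variance.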
Extracting the Selected Principal Components:
# Keep all observations and the first 6 components, then preview the first 10 rows
principal_components <- pca_result_prcomp$x[, 1:6]
head(principal_components, 10)
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] 1.29668608 1.8626667 1.11482260 0.85725410 -2.12959320 0.7724049
## [2,] 1.68908798 0.8446990 2.26252970 1.66007843 -2.82857545 2.3641574
## [3,] 1.50640510 2.2939329 -0.30296693 -0.30622316 -0.65749591 -1.6710075
## [4,] 2.33855069 2.8484062 -0.05495476 -0.02666939 -0.89352432 -1.9871248
## [5,] -0.10166373 -0.1786199 -1.87862821 -0.16791151 0.63733398 -0.4431630
## [6,] -0.10166373 -0.1786199 -1.87862821 -0.16791151 0.63733398 -0.4431630
## [7,] 0.27595892 2.0385284 0.50750191 -0.76488877 -0.66315333 -1.4265349
## [8,] -0.03881807 2.0682592 0.20480684 -0.61097870 -0.34705075 -1.1189459
## [9,] 0.31802561 -1.1366299 -0.89290804 -0.12819116 0.83491092 -0.5399558
## [10,] -0.65745941 -0.3567184 0.80569869 -1.14385125 0.09966777 -0.8557261
From the PCA results, the first 6 principal components were extracted. These are the new features that represent the most important patterns in the data.
In line with our goal, these 6 PCs summarize the main patterns of booking behavior for the two hotels and can serve as inputs for further analyses such as clustering or regression.