DIMENSION REDUCTION IN HOTEL BOOKINGS

Introduction

In this research we apply principal component analysis (PCA) to reduce the dimensionality of the data while retaining as much of the original information as possible. PCA condenses the information into a smaller set of principal components, which can simplify subsequent analyses and visualizations. This helps in identifying patterns, clusters, and relationships among the variables, and offers a more interpretable representation of the data.

Data set Overview:

The data set contains booking information for a city hotel and a resort hotel, such as when the booking was made, the length of stay, the number of adults, children and babies, and the number of parking spaces required, among other things.

Project Objectives:

Pattern Discovery: The goal is to extract the most important information from the data by minimizing its dimensionality, which may also uncover latent patterns from variables that affect booking behavior.
Dimensionality Reduction: PCA will shrink the feature space by retaining the components that carry most of the variance and discarding those that mostly carry noise, which makes later analyses more efficient.
Variable Importance Assessment: Understanding the importance of different variables in shaping the overall variance of the data set and identifying the elements that make a substantial contribution to the patterns we see.

Data Preprocessing and Cleaning

The first step is to install and load the packages needed for this project, followed by an initial exploration of the data to understand its structure and the variable types.

options(repos = c(CRAN = "https://cloud.r-project.org"))

Installation of packages:

install.packages("mvtnorm")

## 
## The downloaded binary packages are in
##  /var/folders/11/1z2rlbps68d_flh4dt29lm100000gq/T//Rtmp6MBiVl/downloaded_packages

install.packages("FactoMineR")

## 
## The downloaded binary packages are in
##  /var/folders/11/1z2rlbps68d_flh4dt29lm100000gq/T//Rtmp6MBiVl/downloaded_packages

install.packages("caret")

## 
## The downloaded binary packages are in
##  /var/folders/11/1z2rlbps68d_flh4dt29lm100000gq/T//Rtmp6MBiVl/downloaded_packages

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(corrplot)

## corrplot 0.92 loaded

library(FactoMineR)
library(scatterplot3d)
library(readr)
hoteldata <- read_csv("hotel_bookings.csv")

## Rows: 119390 Columns: 32

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Understanding the variables:

head(hoteldata)

## # A tibble: 6 × 32
##   hotel        is_canceled lead_time arrival_date_year arrival_date_month
##   <chr>              <dbl>     <dbl>             <dbl> <chr>             
## 1 Resort Hotel           0       342              2015 July              
## 2 Resort Hotel           0       737              2015 July              
## 3 Resort Hotel           0         7              2015 July              
## 4 Resort Hotel           0        13              2015 July              
## 5 Resort Hotel           0        14              2015 July              
## 6 Resort Hotel           0        14              2015 July              
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, …

head() shows the first six rows of the data set (only five of its 32 columns fit in the printout). The output indicates that the data includes information about the hotels, booking details and customer-related features.

summary(hoteldata)

##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
##

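summary() reports summary statistics for each variable (minimum, quartiles, mean and maximum for numeric columns; length and class for character columns), which helps us understand the distribution and characteristics of the data set. Note the 4 missing values in children and the extreme values in adr (from -6.38 to 5400) and adults (up to 55).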

Removing Unnecessary columns

This step removes columns that are not useful for the analysis: identifier-like columns that carry no meaningful information, columns with little or no variability across rows, and columns with a large share of missing values.

# Removing unnecessary columns

new_hoteldata <- hoteldata %>%
  select(
    -hotel,
    -company,
    -agent,
    -reservation_status,
    -reservation_status_date,
    -arrival_date_year,
    -arrival_date_week_number,
    -arrival_date_month,
    -country,
    -children,
    -babies,
    -meal
  )

The select() call dropped every column listed with a minus sign. The columns that remain are shown below:

remaining_columns <- names(new_hoteldata)
print(remaining_columns)

##  [1] "is_canceled"                    "lead_time"                     
##  [3] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [5] "stays_in_week_nights"           "adults"                        
##  [7] "market_segment"                 "distribution_channel"          
##  [9] "is_repeated_guest"              "previous_cancellations"        
## [11] "previous_bookings_not_canceled" "reserved_room_type"            
## [13] "assigned_room_type"             "booking_changes"               
## [15] "deposit_type"                   "days_in_waiting_list"          
## [17] "customer_type"                  "adr"                           
## [19] "required_car_parking_spaces"    "total_of_special_requests"
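
The criteria for dropping columns (identifier-like values, no variability, many missing entries) can also be checked programmatically rather than by eye. The following is a minimal sketch on a made-up toy data frame; the column names, values and the 50% missing-value threshold are assumptions for illustration and are not taken from the bookings data.

```r
# Toy data frame standing in for a bookings table (all values are made up).
toy <- data.frame(
  id       = c("a1", "a2", "a3", "a4"),  # unique identifier
  constant = c(1, 1, 1, 1),              # same value in every row
  sparse   = c(NA, NA, NA, 5),           # mostly missing
  useful   = c(10, 25, 3, 8),
  stringsAsFactors = FALSE
)

# Per-column diagnostics: number of distinct non-missing values
# and share of missing entries.
n_distinct_vals <- sapply(toy, function(col) length(unique(col[!is.na(col)])))
share_missing   <- sapply(toy, function(col) mean(is.na(col)))
is_text         <- sapply(toy, is.character)

# Hypothetical rules: flag constant columns, text columns that are unique
# in every row (identifier-like), and columns with more than 50% missing.
flagged <- (n_distinct_vals <= 1) |
  (is_text & n_distinct_vals == nrow(toy)) |
  (share_missing > 0.5)

drop_candidates <- names(toy)[flagged]
print(drop_candidates)
```

A check like this only suggests candidates; whether a flagged column is really uninformative still depends on what it means in the data.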

Given the objectives, these columns will help determine the best time to book a hotel or resort for optimal service, considering factors like lead time, weekend versus week-night stays, the average daily rate (adr), and the length of stay. Note that children and babies were dropped above, so guest composition is represented only by adults. Patterns in guest composition, traffic and popular booking times may emerge.

Encoding Data

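The process below transforms all categorical variables into numeric ones using integer (label) encoding, not one-hot encoding: each character column is first converted to a factor, and each factor is then replaced by its integer level code. The codes follow the sorted order of the levels, so they impose an arbitrary ordering on nominal variables such as market_segment. This should be kept in mind when interpreting the principal components.

# Integer-encode categorical variables (factor level codes, not one-hot)
new_hoteldata <- new_hoteldata %>%
  mutate(across(where(is.character), as.factor)) %>%  # Convert character columns to factor
  mutate(across(everything(), as.numeric))  # Replace each factor by its integer level code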
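
Converting a factor with as.numeric() yields one integer code per level rather than one indicator column per level. For comparison, a true one-hot encoding could be sketched in base R as follows. The toy data frame and its values are made up for illustration, and model.matrix() is shown as one possible approach, not the method used in this analysis.

```r
# Toy data frame (made-up values) with one categorical and one numeric column.
toy <- data.frame(
  deposit_type = c("No Deposit", "Refundable", "Non Refund", "No Deposit"),
  lead_time    = c(10, 200, 35, 7),
  stringsAsFactors = TRUE
)

# Integer (label) encoding: a single column of level codes.
label_codes <- as.numeric(toy$deposit_type)

# One-hot encoding: one 0/1 indicator column per level. The "- 1" removes
# the intercept so that every level gets its own column.
one_hot <- model.matrix(~ deposit_type - 1, data = toy)

print(label_codes)
print(colnames(one_hot))
print(one_hot)
```

One-hot encoding avoids the artificial ordering of integer codes, at the cost of adding one column per category level, which matters for PCA because it changes the number of variables and their variances.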

To check that all the variables were converted to numeric, I used the code below; the output confirms it:

# Check data types of columns
column_types <- sapply(new_hoteldata, class)
print(column_types)

##                    is_canceled                      lead_time 
##                      "numeric"                      "numeric" 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                      "numeric"                      "numeric" 
##           stays_in_week_nights                         adults 
##                      "numeric"                      "numeric" 
##                 market_segment           distribution_channel 
##                      "numeric"                      "numeric" 
##              is_repeated_guest         previous_cancellations 
##                      "numeric"                      "numeric" 
## previous_bookings_not_canceled             reserved_room_type 
##                      "numeric"                      "numeric" 
##             assigned_room_type                booking_changes 
##                      "numeric"                      "numeric" 
##                   deposit_type           days_in_waiting_list 
##                      "numeric"                      "numeric" 
##                  customer_type                            adr 
##                      "numeric"                      "numeric" 
##    required_car_parking_spaces      total_of_special_requests 
##                      "numeric"                      "numeric"

Handling Missing Values

The following code checks for missing values in the entire data set. None were found: the only column with NA values in the summary above, children (4 NA's), was dropped earlier.

any(is.na(new_hoteldata))

## [1] FALSE

# Check for missing values in each column
missing_values <- colSums(is.na(new_hoteldata))
missing_values

##                    is_canceled                      lead_time 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0

# Display columns with missing values
print(names(new_hoteldata)[missing_values > 0])

## character(0)
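
is.na() only detects true NA values. If a file stores missing entries as text placeholders, those entries pass the check unnoticed. The sketch below uses a made-up toy data frame in which the string "NULL" is a hypothetical placeholder; it recodes the placeholder to NA and then fills a numeric gap with the median, which is one simple imputation choice among several.

```r
# Toy data frame (made-up values); "NULL" is a hypothetical text placeholder.
toy <- data.frame(
  children = c(0, 2, NA, 1),
  agent    = c("9", "NULL", "240", "NULL"),
  stringsAsFactors = FALSE
)

# Plain NA count: the placeholder strings are not counted as missing.
na_counts <- colSums(is.na(toy))

# Recode the placeholder to NA, then count again.
toy_clean <- toy
toy_clean$agent[toy_clean$agent == "NULL"] <- NA
na_counts_clean <- colSums(is.na(toy_clean))

# Simple median imputation for the numeric column.
children_median <- median(toy_clean$children, na.rm = TRUE)
toy_clean$children[is.na(toy_clean$children)] <- children_median

print(na_counts)
print(na_counts_clean)
```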

Normalising the Data Set

The next step normalizes the data set. The code below keeps only the numeric columns (here, all of them) and then centers and scales each one, so that every variable has mean 0 and standard deviation 1. This matters for PCA because variables measured on larger scales would otherwise dominate the components.

# Selecting only the numeric columns for normalization (adjust if needed)
numeric_cols <- sapply(new_hoteldata, is.numeric)
numeric_cols

##                    is_canceled                      lead_time 
##                           TRUE                           TRUE 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                           TRUE                           TRUE 
##           stays_in_week_nights                         adults 
##                           TRUE                           TRUE 
##                 market_segment           distribution_channel 
##                           TRUE                           TRUE 
##              is_repeated_guest         previous_cancellations 
##                           TRUE                           TRUE 
## previous_bookings_not_canceled             reserved_room_type 
##                           TRUE                           TRUE 
##             assigned_room_type                booking_changes 
##                           TRUE                           TRUE 
##                   deposit_type           days_in_waiting_list 
##                           TRUE                           TRUE 
##                  customer_type                            adr 
##                           TRUE                           TRUE 
##    required_car_parking_spaces      total_of_special_requests 
##                           TRUE                           TRUE

new_hoteldata <- new_hoteldata[, numeric_cols]

# Normalizing the numeric columns
preproc1 <- preProcess(new_hoteldata, method = c("center", "scale"))
new_hoteldata <- predict(preproc1, new_hoteldata)

# Re-attaching non-numeric columns: every column is numeric at this point,
# so the first argument selects zero columns and this step adds nothing
new_hoteldata <- cbind(new_hoteldata[, !numeric_cols, drop = FALSE], new_hoteldata)


# View summary of the normalized dataset
summary(new_hoteldata)

##   is_canceled       lead_time       arrival_date_day_of_month
##  Min.   :-0.767   Min.   :-0.9733   Min.   :-1.68529         
##  1st Qu.:-0.767   1st Qu.:-0.8049   1st Qu.:-0.88810         
##  Median :-0.767   Median :-0.3276   Median : 0.02298         
##  Mean   : 0.000   Mean   : 0.0000   Mean   : 0.00000         
##  3rd Qu.: 1.304   3rd Qu.: 0.5239   3rd Qu.: 0.82017         
##  Max.   : 1.304   Max.   : 5.9234   Max.   : 1.73124         
##  stays_in_weekend_nights stays_in_week_nights     adults       
##  Min.   :-0.9289         Min.   :-1.3102      Min.   :-3.2048  
##  1st Qu.:-0.9289         1st Qu.:-0.7862      1st Qu.: 0.2479  
##  Median : 0.0725         Median :-0.2622      Median : 0.2479  
##  Mean   : 0.0000         Mean   : 0.0000      Mean   : 0.0000  
##  3rd Qu.: 1.0739         3rd Qu.: 0.2619      3rd Qu.: 0.2479  
##  Max.   :18.0975         Max.   :24.8913      Max.   :91.7438  
##  market_segment     distribution_channel is_repeated_guest
##  Min.   :-3.89043   Min.   :-2.8486      Min.   :-0.1816  
##  1st Qu.:-0.73268   1st Qu.: 0.4569      1st Qu.:-0.1816  
##  Median : 0.05676   Median : 0.4569      Median :-0.1816  
##  Mean   : 0.00000   Mean   : 0.0000      Mean   : 0.0000  
##  3rd Qu.: 0.84620   3rd Qu.: 0.4569      3rd Qu.:-0.1816  
##  Max.   : 1.63563   Max.   : 1.5587      Max.   : 5.5078  
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   :-0.1032        Min.   :-0.09155               Min.   :-0.583    
##  1st Qu.:-0.1032        1st Qu.:-0.09155               1st Qu.:-0.583    
##  Median :-0.1032        Median :-0.09155               Median :-0.583    
##  Mean   : 0.0000        Mean   : 0.00000               Mean   : 0.000    
##  3rd Qu.:-0.1032        3rd Qu.:-0.09155               3rd Qu.: 1.185    
##  Max.   :30.6902        Max.   :47.99061               Max.   : 4.720    
##  assigned_room_type booking_changes   deposit_type     days_in_waiting_list
##  Min.   :-0.7076    Min.   :-0.339   Min.   :-0.3732   Min.   :-0.1319     
##  1st Qu.:-0.7076    1st Qu.:-0.339   1st Qu.:-0.3732   1st Qu.:-0.1319     
##  Median :-0.7076    Median :-0.339   Median :-0.3732   Median :-0.1319     
##  Mean   : 0.0000    Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000     
##  3rd Qu.: 0.8892    3rd Qu.:-0.339   3rd Qu.:-0.3732   3rd Qu.:-0.1319     
##  Max.   : 5.1473    Max.   :31.855   Max.   : 5.6027   Max.   :22.0907     
##  customer_type         adr           required_car_parking_spaces
##  Min.   :-3.704   Min.   : -2.1413   Min.   :-0.2549            
##  1st Qu.:-0.238   1st Qu.: -0.6439   1st Qu.:-0.2549            
##  Median :-0.238   Median : -0.1436   Median :-0.2549            
##  Mean   : 0.000   Mean   :  0.0000   Mean   : 0.0000            
##  3rd Qu.:-0.238   3rd Qu.:  0.4783   3rd Qu.:-0.2549            
##  Max.   : 1.495   Max.   :104.8399   Max.   :32.3594            
##  total_of_special_requests
##  Min.   :-0.7207          
##  1st Qu.:-0.7207          
##  Median :-0.7207          
##  Mean   : 0.0000          
##  3rd Qu.: 0.5407          
##  Max.   : 5.5861
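
Centering and scaling amounts to computing a z-score for each value: subtract the column mean and divide by the column standard deviation. The sketch below reproduces this by hand on a made-up toy data frame and compares it with base R's scale(); it is meant as an illustration of the transformation, and the values are not taken from the bookings data.

```r
# Toy data frame (made-up values) with two variables on different scales.
toy <- data.frame(
  lead_time = c(7, 14, 69, 342),
  adr       = c(75, 98, 120, 60)
)

# Manual z-score: (x - mean) / sd, column by column.
z_manual <- as.data.frame(
  lapply(toy, function(col) (col - mean(col)) / sd(col))
)

# Same transformation with base R's scale().
z_scale <- as.data.frame(scale(toy))

# After standardization each column has mean 0 and standard deviation 1.
print(round(colMeans(z_manual), 10))
print(apply(z_manual, 2, sd))
```

Standardization does not remove outliers: a value far from the mean stays far from it in z-score units, as the large maxima in the summary above show.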

Variance Exploration: Eigenvalue Decomposition

This step checks whether the data set is suitable for PCA. Each eigenvalue of the covariance matrix gives the amount of variance captured by the corresponding principal component, and its eigenvector gives the direction of that component. Because the data were standardized, the covariance matrix computed here is identical to the correlation matrix, so the eigenvalues sum to the number of variables (20). A few eigenvalues well above 1, together with sizeable off-diagonal correlations, would suggest that there is meaningful structure in the data for the principal components to capture.
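
As a small illustration of the idea, the sketch below runs an eigenvalue decomposition on a simulated toy data set with three variables, two of which are strongly correlated. The variable names and the simulation settings are assumptions chosen for the example.

```r
set.seed(1)

# Simulated toy data: x2 is strongly correlated with x1, x3 is independent.
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.3),
  x3 = rnorm(200)
)

# Standardize, then compute the covariance matrix. For standardized data
# this equals the correlation matrix of the original variables.
toy_std <- scale(toy)
toy_cov <- cov(toy_std)

# Eigenvalues are the variances of the principal components;
# eigenvectors (eig$vectors) are their directions.
eig <- eigen(toy_cov)
prop_var <- eig$values / sum(eig$values)

print(round(eig$values, 3))
print(round(cumsum(prop_var), 3))
```

Here the first eigenvalue is large because x1 and x2 carry largely the same information, which is the kind of redundancy PCA can compress.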

# Computing Covariance Matrix
hoteldata.cov <- cov(new_hoteldata)
hoteldata.cov

##                                 is_canceled     lead_time
## is_canceled                     1.000000000  0.2931233558
## lead_time                       0.293123356  1.0000000000
## arrival_date_day_of_month      -0.006130079  0.0022675527
## stays_in_weekend_nights        -0.001791078  0.0856711329
## stays_in_week_nights            0.024764629  0.1657993639
## adults                          0.060017213  0.1195186926
## market_segment                  0.059338114  0.0137965306
## distribution_channel            0.167600278  0.2204135894
## is_repeated_guest              -0.084793418 -0.1244099080
## previous_cancellations          0.110132808  0.0860418019
## previous_bookings_not_canceled -0.057357723 -0.0735481679
## reserved_room_type             -0.061281949 -0.1060893825
## assigned_room_type             -0.176027704 -0.1722186443
## booking_changes                -0.144380991  0.0001488301
## deposit_type                    0.468633824  0.3756669301
## days_in_waiting_list            0.054185824  0.1700841843
## customer_type                  -0.068140105  0.0734027445
## adr                             0.047556598 -0.0630768525
## required_car_parking_spaces    -0.195497817 -0.1164505701
## total_of_special_requests      -0.234657774 -0.0957120489
##                                arrival_date_day_of_month
## is_canceled                                -0.0061300789
## lead_time                                   0.0022675527
## arrival_date_day_of_month                   1.0000000000
## stays_in_weekend_nights                    -0.0163542995
## stays_in_week_nights                       -0.0281735214
## adults                                     -0.0015659791
## market_segment                             -0.0040881739
## distribution_channel                        0.0015777224
## is_repeated_guest                          -0.0061450207
## previous_cancellations                     -0.0270107761
## previous_bookings_not_canceled             -0.0002997868
## reserved_room_type                          0.0169290557
## assigned_room_type                          0.0116464858
## booking_changes                             0.0106128560
## deposit_type                               -0.0013583178
## days_in_waiting_list                        0.0227275352
## customer_type                               0.0121878981
## adr                                         0.0302451948
## required_car_parking_spaces                 0.0086834665
## total_of_special_requests                   0.0030621241
##                                stays_in_weekend_nights stays_in_week_nights
## is_canceled                               -0.001791078           0.02476463
## lead_time                                  0.085671133           0.16579936
## arrival_date_day_of_month                 -0.016354300          -0.02817352
## stays_in_weekend_nights                    1.000000000           0.49896882
## stays_in_week_nights                       0.498968818           1.00000000
## adults                                     0.091871020           0.09297551
## market_segment                             0.115349712           0.10856906
## distribution_channel                       0.093096604           0.08718507
## is_repeated_guest                         -0.087239379          -0.09724497
## previous_cancellations                    -0.012774619          -0.01399243
## previous_bookings_not_canceled            -0.042715235          -0.04874255
## reserved_room_type                         0.142082770           0.16861584
## assigned_room_type                         0.086642777           0.10079464
## booking_changes                            0.063281316           0.09620945
## deposit_type                              -0.111434936          -0.07678763
## days_in_waiting_list                      -0.054151113          -0.00201981
## customer_type                             -0.109220019          -0.12722251
## adr                                        0.049341906           0.06523748
## required_car_parking_spaces               -0.018553809          -0.02485942
## total_of_special_requests                  0.072670830           0.06819178
##                                      adults market_segment distribution_channel
## is_canceled                     0.060017213    0.059338114          0.167600278
## lead_time                       0.119518693    0.013796531          0.220413589
## arrival_date_day_of_month      -0.001565979   -0.004088174          0.001577722
## stays_in_weekend_nights         0.091871020    0.115349712          0.093096604
## stays_in_week_nights            0.092975513    0.108569055          0.087185072
## adults                          1.000000000    0.208409260          0.178977853
## market_segment                  0.208409260    1.000000000          0.767751444
## distribution_channel            0.178977853    0.767751444          1.000000000
## is_repeated_guest              -0.146426116   -0.250286406         -0.263218702
## previous_cancellations         -0.006738096   -0.059645115         -0.022482556
## previous_bookings_not_canceled -0.107983172   -0.179589330         -0.204730629
## reserved_room_type              0.211434290    0.094539569         -0.041719598
## assigned_room_type              0.144779408    0.026377154         -0.104501946
## booking_changes                -0.051672774   -0.071818120         -0.113600981
## deposit_type                   -0.027643874   -0.184846756          0.092580407
## days_in_waiting_list           -0.008283347   -0.041502532          0.048642060
## customer_type                  -0.101755964   -0.165814403         -0.069640412
## adr                             0.230641216    0.232762689          0.092396341
## required_car_parking_spaces     0.014784817   -0.062226077         -0.132280183
## total_of_special_requests       0.122883546    0.274372834          0.098815474
##                                is_repeated_guest previous_cancellations
## is_canceled                         -0.084793418            0.110132808
## lead_time                           -0.124409908            0.086041802
## arrival_date_day_of_month           -0.006145021           -0.027010776
## stays_in_weekend_nights             -0.087239379           -0.012774619
## stays_in_week_nights                -0.097244972           -0.013992431
## adults                              -0.146426116           -0.006738096
## market_segment                      -0.250286406           -0.059645115
## distribution_channel                -0.263218702           -0.022482556
## is_repeated_guest                    1.000000000            0.082293234
## previous_cancellations               0.082293234            1.000000000
## previous_bookings_not_canceled       0.418055995            0.152728115
## reserved_room_type                  -0.029536937           -0.048808602
## assigned_room_type                   0.032440896           -0.058457287
## booking_changes                      0.012091787           -0.026992663
## deposit_type                        -0.057502006            0.139401090
## days_in_waiting_list                -0.022234965            0.005928941
## customer_type                       -0.017111320           -0.008188270
## adr                                 -0.134314447           -0.065645638
## required_car_parking_spaces          0.077089573           -0.018492250
## total_of_special_requests            0.013050009           -0.048384118
##                                previous_bookings_not_canceled
## is_canceled                                     -0.0573577232
## lead_time                                       -0.0735481679
## arrival_date_day_of_month                       -0.0002997868
## stays_in_weekend_nights                         -0.0427152350
## stays_in_week_nights                            -0.0487425495
## adults                                          -0.1079831725
## market_segment                                  -0.1795893298
## distribution_channel                            -0.2047306294
## is_repeated_guest                                0.4180559949
## previous_cancellations                           0.1527281149
## previous_bookings_not_canceled                   1.0000000000
## reserved_room_type                              -0.0217713822
## assigned_room_type                               0.0031332292
## booking_changes                                  0.0116075289
## deposit_type                                    -0.0314751591
## days_in_waiting_list                            -0.0093969779
## customer_type                                   -0.0122594219
## adr                                             -0.0721441957
## required_car_parking_spaces                      0.0476530869
## total_of_special_requests                        0.0378237757
##                                reserved_room_type assigned_room_type
## is_canceled                           -0.06128195       -0.176027704
## lead_time                             -0.10608938       -0.172218644
## arrival_date_day_of_month              0.01692906        0.011646486
## stays_in_weekend_nights                0.14208277        0.086642777
## stays_in_week_nights                   0.16861584        0.100794643
## adults                                 0.21143429        0.144779408
## market_segment                         0.09453957        0.026377154
## distribution_channel                  -0.04171960       -0.104501946
## is_repeated_guest                     -0.02953694        0.032440896
## previous_cancellations                -0.04880860       -0.058457287
## previous_bookings_not_canceled        -0.02177138        0.003133229
## reserved_room_type                     1.00000000        0.814004850
## assigned_room_type                     0.81400485        1.000000000
## booking_changes                        0.04505990        0.096161792
## deposit_type                          -0.19968852       -0.242384323
## days_in_waiting_list                  -0.06882141       -0.068675519
## customer_type                         -0.12097824       -0.084426540
## adr                                    0.39206017        0.258134371
## required_car_parking_spaces            0.13158299        0.160131191
## total_of_special_requests              0.13746590        0.124682772
##                                booking_changes deposit_type
## is_canceled                      -0.1443809911  0.468633824
## lead_time                         0.0001488301  0.375666930
## arrival_date_day_of_month         0.0106128560 -0.001358318
## stays_in_weekend_nights           0.0632813159 -0.111434936
## stays_in_week_nights              0.0962094460 -0.076787627
## adults                           -0.0516727735 -0.027643874
## market_segment                   -0.0718181200 -0.184846756
## distribution_channel             -0.1136009808  0.092580407
## is_repeated_guest                 0.0120917873 -0.057502006
## previous_cancellations           -0.0269926626  0.139401090
## previous_bookings_not_canceled    0.0116075289 -0.031475159
## reserved_room_type                0.0450599012 -0.199688519
## assigned_room_type                0.0961617923 -0.242384323
## booking_changes                   1.0000000000 -0.112153435
## deposit_type                     -0.1121534348  1.000000000
## days_in_waiting_list             -0.0116339446  0.121016520
## customer_type                     0.0920289768 -0.076403851
## adr                               0.0196176738 -0.089838490
## required_car_parking_spaces       0.0656201914 -0.090929065
## total_of_special_requests         0.0528334357 -0.266672398
##                                days_in_waiting_list customer_type         adr
## is_canceled                             0.054185824   -0.06814011  0.04755660
## lead_time                               0.170084184    0.07340274 -0.06307685
## arrival_date_day_of_month               0.022727535    0.01218790  0.03024519
## stays_in_weekend_nights                -0.054151113   -0.10922002  0.04934191
## stays_in_week_nights                   -0.002019810   -0.12722251  0.06523748
## adults                                 -0.008283347   -0.10175596  0.23064122
## market_segment                         -0.041502532   -0.16581440  0.23276269
## distribution_channel                    0.048642060   -0.06964041  0.09239634
## is_repeated_guest                      -0.022234965   -0.01711132 -0.13431445
## previous_cancellations                  0.005928941   -0.00818827 -0.06564564
## previous_bookings_not_canceled         -0.009396978   -0.01225942 -0.07214420
## reserved_room_type                     -0.068821409   -0.12097824  0.39206017
## assigned_room_type                     -0.068675519   -0.08442654  0.25813437
## booking_changes                        -0.011633945    0.09202898  0.01961767
## deposit_type                            0.121016520   -0.07640385 -0.08983849
## days_in_waiting_list                    1.000000000    0.09912121 -0.04075641
## customer_type                           0.099121207    1.00000000 -0.07715529
## adr                                    -0.040756412   -0.07715529  1.00000000
## required_car_parking_spaces            -0.030600046   -0.03006034  0.05662809
## total_of_special_requests              -0.082729719   -0.13562449  0.17218526
##                                required_car_parking_spaces
## is_canceled                                   -0.195497817
## lead_time                                     -0.116450570
## arrival_date_day_of_month                      0.008683466
## stays_in_weekend_nights                       -0.018553809
## stays_in_week_nights                          -0.024859423
## adults                                         0.014784817
## market_segment                                -0.062226077
## distribution_channel                          -0.132280183
## is_repeated_guest                              0.077089573
## previous_cancellations                        -0.018492250
## previous_bookings_not_canceled                 0.047653087
## reserved_room_type                             0.131582990
## assigned_room_type                             0.160131191
## booking_changes                                0.065620191
## deposit_type                                  -0.090929065
## days_in_waiting_list                          -0.030600046
## customer_type                                 -0.030060336
## adr                                            0.056628092
## required_car_parking_spaces                    1.000000000
## total_of_special_requests                      0.082626338
##                                total_of_special_requests
## is_canceled                                 -0.234657774
## lead_time                                   -0.095712049
## arrival_date_day_of_month                    0.003062124
## stays_in_weekend_nights                      0.072670830
## stays_in_week_nights                         0.068191782
## adults                                       0.122883546
## market_segment                               0.274372834
## distribution_channel                         0.098815474
## is_repeated_guest                            0.013050009
## previous_cancellations                      -0.048384118
## previous_bookings_not_canceled               0.037823776
## reserved_room_type                           0.137465901
## assigned_room_type                           0.124682772
## booking_changes                              0.052833436
## deposit_type                                -0.266672398
## days_in_waiting_list                        -0.082729719
## customer_type                               -0.135624489
## adr                                          0.172185264
## required_car_parking_spaces                  0.082626338
## total_of_special_requests                    1.000000000

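The ‘hoteldata.cov’ object holds the covariance matrix of the variables, showing how each variable co-varies with every other variable. Because every diagonal entry equals 1, the variables are on a standardized scale and the matrix can be read as a correlation matrix. The goal is to find the directions of maximum variance in the data, which is what the eigenvalue decomposition below provides: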

# Performing Eigenvalue Decomposition
hoteldata.eigen <- eigen(hoteldata.cov)

# Extract and print the eigenvalues
cat("Eigenvalues:\n", hoteldata.eigen$values, "\n\n")

## Eigenvalues:
##  2.763267 2.506172 1.597247 1.426701 1.338641 1.099364 0.997682 0.9624933 0.9392584 0.8735969 0.8591947 0.8041858 0.7536473 0.7106868 0.5698421 0.5387623 0.4932718 0.4427023 0.1629183 0.1603658

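# Print the first 6 rows of the eigenvector matrix.
# head() returns rows, while the eigenvectors themselves are the columns, so this
# shows the entries for the first 6 variables across all 20 eigenvectors
# (cat() prints them column by column).
cat("First 6 rows of the eigenvector matrix:\n", head(hoteldata.eigen$vectors), "\n")

## First 6 rows of the eigenvector matrix: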
##  0.163022 0.1329142 -0.008132136 -0.2080319 -0.2050434 -0.2330377 -0.3324272 -0.3304974 0.007542649 -0.1120023 -0.133031 -0.1802476 -0.3242017 -0.2921065 0.009815215 -0.2297461 -0.2934714 -0.1135843 -0.1955721 0.1052285 -0.08369444 0.5521852 0.5409331 -0.1580566 -0.1189917 0.07457223 0.08336091 -0.02177799 0.008217913 -0.04322034 0.1226364 -0.3128372 -0.2188933 0.1222228 0.05064075 -0.1304778 0.04419859 -0.04367869 0.9242116 0.08792635 0.06250399 -0.08819262 0.1069988 -0.1368163 0.04974372 0.04718475 0.009377744 -0.2015718 -0.09059974 -0.02748142 -0.2031573 0.07104095 0.1006474 -0.03213227 -0.1916116 -0.09613566 0.1894226 0.158335 0.04679615 0.08243297 0.07892016 -0.1223133 -0.01600518 -0.03963314 -0.002969157 -0.6915568 -0.1363588 -0.1777587 0.015917 -0.1038286 -0.02778924 -0.04383995 -0.184022 0.2765265 0.04977108 -0.1693823 -0.1193284 0.2790982 0.04182551 -0.3830099 0.01001301 0.0586844 -0.02294277 0.4630745 0.3356148 -0.1644262 0.01629506 0.1133005 -0.01223876 0.04859274 0.4926357 -0.3850945 0.003840265 0.2162595 -0.1566094 0.1112059 -0.2986884 0.03660823 -0.02467556 0.6403913 -0.6132342 -0.02148762 0.3393498 0.4374404 -0.009461983 0.1263904 -0.3537707 -0.1109285 -0.08288969 -0.02046703 -0.002973465 -0.01597409 -0.03365896 -0.01069454 -0.0709218 0.06696958 0.006544829 0.02404613 0.003690696 0.008714622

corrplot(hoteldata.cov, 
         method = "color", 
         type = "upper", 
         tl.col = "black", 
         tl.srt = 45)

As shown above, all 20 eigenvalues are positive, as expected for a correlation matrix, so no component has a negative eigenvalue. Following Kaiser’s rule of thumb, only the components with eigenvalues above one are considered for the principal component analysis; here these are the first six. The components with eigenvalues below one each explain less variance than a single standardized variable and are discarded.
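The properties of the decomposition can be checked on a small example. The sketch below is illustrative only: it uses R's built-in USArrests data as a stand-in (an assumption, not the hotel data) to show that the eigenvalues of a correlation matrix are non-negative, that they sum to the number of variables, and that eigen() stores the eigenvectors as columns.

```r
# Stand-in data: the built-in USArrests data set (4 numeric variables),
# used here only for illustration; it is not the hotel data.
R <- cor(USArrests)

# eigen() returns the eigenvalues in decreasing order and the eigenvectors as columns
dec <- eigen(R)

# For a correlation matrix the eigenvalues are non-negative and sum to
# the number of variables (the trace of R)
print(dec$values)
print(sum(dec$values))

# The first two eigenvectors are the first two COLUMNS, not the first two rows
first_two <- dec$vectors[, 1:2]
rownames(first_two) <- colnames(R)
print(first_two)

# Each eigenvector has unit length
print(colSums(dec$vectors^2))
```

For the hotel data, the same indexing would presumably be hoteldata.eigen$vectors[, 1:6] to obtain the first six eigenvectors.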

Assessing Data Structure

In this section, we aim to gain a deeper understanding of the numeric variables in our data set by visually exploring their distributions. This step helps confirm that the data are suitable for Principal Component Analysis (PCA) and that the results obtained will be reliable.

With these histograms, we can uncover valuable insights into the data’s characteristics and identify patterns that may influence the analysis.

# Select numeric columns for visualization
numeric_data <- new_hoteldata %>%
  select_if(is.numeric)

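# Visualize histograms with density curves, four panels per page
par(mfrow = c(2, 2))
for (col in names(numeric_data)) {
  # freq = FALSE puts the histogram on the density scale so the curve overlays correctly
  hist(numeric_data[[col]], freq = FALSE, main = col, col = "lightblue", xlab = col)
  # na.rm = TRUE guards against missing values, which density() otherwise rejects
  lines(density(numeric_data[[col]], na.rm = TRUE), col = "red", lty = 2, lwd = 2)
}
par(mfrow = c(1, 1))  # restore the default single-panel layout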

Most of the variables are positively skewed and a few are approximately normal, but the overall trend in the data is consistent, so the distributional characteristics align with expectations. PCA does not require normally distributed variables, although strong skewness and outliers can influence the components. With that in mind, we can proceed with the principal component analysis.
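The visual impression of skewness can be backed by a number. The sketch below is a minimal illustration on simulated data: the helper name sample_skewness and the two toy columns are assumptions made for this example and are not part of the original analysis. It computes the moment-based skewness, which is positive for a long right tail and close to zero for a symmetric distribution.

```r
# Hypothetical helper (not part of the original analysis):
# moment-based sample skewness, g1 = m3 / m2^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  m2 <- mean((x - m)^2)
  m3 <- mean((x - m)^3)
  m3 / m2^(3 / 2)
}

# Simulated stand-in columns, used only to illustrate the two cases
set.seed(1)
toy <- data.frame(
  right_skewed = rexp(1000),   # exponential: long right tail
  symmetric    = rnorm(1000)   # normal: roughly symmetric
)

skew <- sapply(toy, sample_skewness)
print(round(skew, 2))
```

Applied to the hotel data, the call would presumably be sapply(numeric_data, sample_skewness).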

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is the method that will help us find patterns within this high-dimensional data set. It reduces the dimension of the data while retaining as much of the original variability as possible. The process transforms the correlated variables into a set of linearly uncorrelated variables, the ‘principal components’, which represent the data more concisely.

Objective

The objective of this section is to walk through the application of PCA to our data set, exploring its numeric variables to reveal underlying structures and relationships.

Selecting Numeric Variables

This initial step involves isolating the numeric variables from our data set. This ensures that the analysis focuses on the quantitative aspects of the data.

# Select numeric variables
numeric_data <- new_hoteldata %>% 
  select_if(is.numeric)

Applying PCA

PCA is then applied to the selected numeric variables, generating principal components that capture the essential information present in the original data.

# Apply PCA (FactoMineR standardizes the variables by default, scale.unit = TRUE)
pca_result <- PCA(numeric_data, graph = FALSE)
pca_result

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 119390 individuals, described by 20 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

Together, these objects give a comprehensive view of the PCA: the eigenvalues, the coordinates and contributions of the variables, and the coordinates and contributions of the individuals (bookings) on the identified principal components.

OPTIMAL NUMBER OF COMPONENTS

The next step is to determine the optimal number of components to retain. We use Kaiser’s stopping rule, which keeps only the components whose eigenvalue is greater than 1.

eigenvalues:

eigenvalues <- pca_result$eig

print(eigenvalues)

##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   2.7632675             13.8163374                          13.81634
## comp 2   2.5061716             12.5308581                          26.34720
## comp 3   1.5972473              7.9862366                          34.33343
## comp 4   1.4267008              7.1335039                          41.46694
## comp 5   1.3386413              6.6932063                          48.16014
## comp 6   1.0993637              5.4968186                          53.65696
## comp 7   0.9976820              4.9884102                          58.64537
## comp 8   0.9624933              4.8124665                          63.45784
## comp 9   0.9392584              4.6962918                          68.15413
## comp 10  0.8735969              4.3679846                          72.52211
## comp 11  0.8591947              4.2959733                          76.81809
## comp 12  0.8041858              4.0209292                          80.83902
## comp 13  0.7536473              3.7682365                          84.60725
## comp 14  0.7106868              3.5534341                          88.16069
## comp 15  0.5698421              2.8492106                          91.00990
## comp 16  0.5387623              2.6938117                          93.70371
## comp 17  0.4932718              2.4663590                          96.17007
## comp 18  0.4427023              2.2135113                          98.38358
## comp 19  0.1629183              0.8145914                          99.19817
## comp 20  0.1603658              0.8018290                         100.00000

Under Kaiser’s rule, the components with eigenvalues over 1 are the ones to keep.

To better visualize the eigenvalues, we use a scree plot. There, the eigenvalues are plotted in descending order and, following the elbow rule, we choose the number of components at the point where the curve levels off.

Based on Kaiser’s rule, 6 components should be chosen.

Based on the above, we should consider retaining the first 5 or 6 components. Together, the first 5 components explain about 48% of the total variance and the first 6 about 54%.
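A scree plot with the Kaiser threshold can be sketched in base R as follows. The eigenvalues are copied from the pca_result$eig output above; the plotting choices (base graphics, a dashed red line at 1) are assumptions and need not match the figure produced in the original analysis.

```r
# Eigenvalues copied from the pca_result$eig output above
ev <- c(2.7632675, 2.5061716, 1.5972473, 1.4267008, 1.3386413,
        1.0993637, 0.9976820, 0.9624933, 0.9392584, 0.8735969,
        0.8591947, 0.8041858, 0.7536473, 0.7106868, 0.5698421,
        0.5387623, 0.4932718, 0.4427023, 0.1629183, 0.1603658)

# Kaiser's rule: count the components whose eigenvalue exceeds 1
n_kaiser <- sum(ev > 1)
print(n_kaiser)  # 6

# Scree plot with the Kaiser threshold drawn as a dashed line
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot")
abline(h = 1, lty = 2, col = "red")
```

Component 7 has an eigenvalue of 0.9977, just under the threshold, which is why the choice between 6 and 7 components is sensitive to the cut-off and worth checking against the elbow in the plot.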

Prcomp for PCA

For comparison, we also run the PCA with the prcomp function from base R. It complements FactoMineR’s PCA function and, with centering and scaling enabled, should reproduce the same eigenvalues.

# Performing PCA using prcomp
pca_result_prcomp <- prcomp(numeric_data, center = TRUE, scale. = TRUE)
pca_result_prcomp

## Standard deviations (1, .., p=20):
##  [1] 1.6623079 1.5830893 1.2638225 1.1944458 1.1569967 1.0485055 0.9988403
##  [8] 0.9810674 0.9691534 0.9346641 0.9269275 0.8967641 0.8681286 0.8430224
## [15] 0.7548789 0.7340043 0.7023331 0.6653587 0.4036314 0.4004570
## 
## Rotation (n x k) = (20 x 20):
##                                         PC1          PC2          PC3
## is_canceled                     0.163021986 -0.332427158  0.324201724
## lead_time                       0.132914223 -0.330497439  0.292106455
## arrival_date_day_of_month      -0.008132136  0.007542649 -0.009815215
## stays_in_weekend_nights        -0.208031895 -0.112002278  0.229746143
## stays_in_week_nights           -0.205043369 -0.133031032  0.293471357
## adults                         -0.233037697 -0.180247625  0.113584293
## market_segment                 -0.299445296 -0.358777543 -0.337873387
## distribution_channel           -0.148025680 -0.464231643 -0.253688076
## is_repeated_guest               0.122589294  0.319976263  0.089260625
## previous_cancellations          0.117543721 -0.015753450  0.204017022
## previous_bookings_not_canceled  0.102496026  0.257870580  0.120229219
## reserved_room_type             -0.432367978  0.121475142  0.350289559
## assigned_room_type             -0.399415338  0.208268854  0.295567164
## booking_changes                -0.064507375  0.132845923  0.044349424
## deposit_type                    0.297673821 -0.257855970  0.331049702
## days_in_waiting_list            0.110667458 -0.092118938  0.090936236
## customer_type                   0.147885253  0.074871202 -0.099651394
## adr                            -0.319806693 -0.062085978  0.105754093
## required_car_parking_spaces    -0.109249871  0.205316099  0.012723571
## total_of_special_requests      -0.278497631  0.041741734 -0.252843728
##                                        PC4          PC5         PC6
## is_canceled                    -0.19557209  0.118991743 -0.12263644
## lead_time                       0.10522852 -0.074572227  0.31283718
## arrival_date_day_of_month      -0.08369444 -0.083360911  0.21889329
## stays_in_weekend_nights         0.55218523  0.021777993 -0.12222281
## stays_in_week_nights            0.54093314 -0.008217913 -0.05064075
## adults                         -0.15805656  0.043220343  0.13047775
## market_segment                 -0.02208721  0.206749397  0.13166308
## distribution_channel           -0.02239653  0.149595661  0.18289464
## is_repeated_guest               0.04227424  0.413429019  0.18086437
## previous_cancellations         -0.00559325  0.341349814  0.22019459
## previous_bookings_not_canceled  0.07339005  0.484211941  0.29677084
## reserved_room_type             -0.23739719 -0.060077845  0.01522525
## assigned_room_type             -0.21749655 -0.090371938  0.02651198
## booking_changes                 0.26996725 -0.276146767  0.28401829
## deposit_type                   -0.19564815  0.055429422 -0.05220138
## days_in_waiting_list           -0.03453661 -0.193591664  0.54057298
## customer_type                   0.01207297 -0.435890318  0.36961397
## adr                            -0.28843486 -0.015880374  0.07629755
## required_car_parking_spaces    -0.06508786 -0.015527591  0.11722493
## total_of_special_requests       0.09689391  0.254588253  0.22775593
##                                         PC7          PC8         PC9
## is_canceled                    -0.044198589 -0.106998772  0.09059974
## lead_time                       0.043678692  0.136816296  0.02748142
## arrival_date_day_of_month      -0.924211559 -0.049743715  0.20315733
## stays_in_weekend_nights        -0.087926351 -0.047184747 -0.07104095
## stays_in_week_nights           -0.062503990 -0.009377744 -0.10064736
## adults                          0.088192616  0.201571758  0.03213227
## market_segment                  0.028434994 -0.099096531  0.01582843
## distribution_channel            0.033460117 -0.090431951 -0.01277267
## is_repeated_guest              -0.051508033 -0.123281652 -0.14918532
## previous_cancellations          0.239567222 -0.020443131  0.45877333
## previous_bookings_not_canceled -0.026653803 -0.168477805 -0.04317219
## reserved_room_type              0.037806124 -0.168816378 -0.04687782
## assigned_room_type              0.057118362 -0.165364093 -0.06214205
## booking_changes                 0.078539184  0.113963381  0.50892717
## deposit_type                   -0.047977722  0.225550647  0.04460333
## days_in_waiting_list            0.011949017  0.012016968 -0.61736591
## customer_type                   0.215129024 -0.330493593  0.15604943
## adr                            -0.009006076 -0.036106192  0.14490237
## required_car_parking_spaces     0.013117473  0.772196381 -0.05356783
## total_of_special_requests      -0.005771240  0.189069199  0.04619349
##                                        PC10         PC11          PC12
## is_canceled                    -0.191611605 -0.078920159  0.1363588135
## lead_time                      -0.096135665  0.122313276  0.1777586937
## arrival_date_day_of_month       0.189422592  0.016005183 -0.0159170037
## stays_in_weekend_nights         0.158334962  0.039633144  0.1038285799
## stays_in_week_nights            0.046796153  0.002969157  0.0277892375
## adults                          0.082432966  0.691556793  0.0438399452
## market_segment                  0.001030065 -0.255379953  0.1196642819
## distribution_channel            0.026062459 -0.292554465  0.2082896355
## is_repeated_guest              -0.193920444 -0.003103573  0.2492025766
## previous_cancellations          0.576336567 -0.119276804 -0.3730003655
## previous_bookings_not_canceled -0.142942490  0.047799738  0.2245788198
## reserved_room_type              0.035279122 -0.152612151  0.0196428585
## assigned_room_type              0.060424990 -0.215949736 -0.0009237532
## booking_changes                -0.511071229 -0.239112548 -0.1582899415
## deposit_type                   -0.174635240 -0.112615804 -0.0021827884
## days_in_waiting_list            0.016867086 -0.115607904 -0.4132197776
## customer_type                   0.223719771  0.160355288  0.4573075399
## adr                            -0.236863969  0.155002110  0.0106679952
## required_car_parking_spaces     0.227275502 -0.263230171  0.3939534817
## total_of_special_requests      -0.196327119  0.243350359 -0.2570372923
##                                       PC13        PC14        PC15         PC16
## is_canceled                     0.18402198  0.04182551 -0.33561481  0.492635741
## lead_time                      -0.27652652 -0.38300995  0.16442622 -0.385094523
## arrival_date_day_of_month      -0.04977108  0.01001301 -0.01629506  0.003840265
## stays_in_weekend_nights         0.16938229  0.05868440 -0.11330050  0.216259495
## stays_in_week_nights            0.11932843 -0.02294277  0.01223876 -0.156609365
## adults                         -0.27909824  0.46307451 -0.04859274  0.111205888
## market_segment                 -0.01707275  0.11967305 -0.01480022 -0.000301591
## distribution_channel           -0.13846135  0.08147704  0.02504908 -0.055305229
## is_repeated_guest              -0.12806924  0.10234909 -0.61132010 -0.340316263
## previous_cancellations          0.09100219  0.01953027 -0.09136413 -0.089463939
## previous_bookings_not_canceled  0.10399801  0.09922743  0.62292972  0.258533231
## reserved_room_type             -0.14304067 -0.12341214  0.01656907  0.054653993
## assigned_room_type             -0.29748758 -0.11127244  0.01804668  0.078152468
## booking_changes                -0.12420380  0.31425698 -0.03960680  0.081980808
## deposit_type                   -0.03245664 -0.16135469  0.05206010  0.078432272
## days_in_waiting_list            0.16775920  0.17373975 -0.07271031  0.096064495
## customer_type                   0.15606577 -0.18383648 -0.14298258  0.190808374
## adr                             0.71060584  0.01548226  0.04144802 -0.389844964
## required_car_parking_spaces     0.16774495  0.03900083 -0.04525367  0.088814900
## total_of_special_requests       0.01914205 -0.61093004 -0.19846847  0.322719976
##                                        PC17         PC18         PC19
## is_canceled                    -0.298688448  0.339349780 -0.082889685
## lead_time                       0.036608231  0.437440402 -0.020467027
## arrival_date_day_of_month      -0.024675558 -0.009461983 -0.002973465
## stays_in_weekend_nights         0.640391301  0.126390401 -0.015974088
## stays_in_week_nights           -0.613234154 -0.353770720 -0.033658964
## adults                         -0.021487615 -0.110928468 -0.010694539
## market_segment                 -0.048756692  0.041787024  0.077791785
## distribution_channel            0.098476957 -0.194047184 -0.069115403
## is_repeated_guest               0.070264082 -0.083546822  0.022946779
## previous_cancellations         -0.009575395  0.015609893 -0.005974060
## previous_bookings_not_canceled -0.027800799  0.007186436 -0.009333450
## reserved_room_type              0.002715796  0.010251674  0.714792811
## assigned_room_type              0.038860476 -0.047522048 -0.671490681
## booking_changes                 0.029697816  0.004273712  0.033201770
## deposit_type                    0.283277500 -0.654337310  0.040260117
## days_in_waiting_list            0.019034254  0.029155690  0.005528546
## customer_type                  -0.017669343 -0.222753089  0.017739294
## adr                             0.120802461 -0.008773265 -0.120726153
## required_car_parking_spaces    -0.076714059  0.078067399 -0.001772977
## total_of_special_requests      -0.023288575 -0.078162381 -0.015694060
##                                        PC20
## is_canceled                     0.070921796
## lead_time                      -0.066969575
## arrival_date_day_of_month      -0.006544829
## stays_in_weekend_nights        -0.024046131
## stays_in_week_nights           -0.003690696
## adults                         -0.008714622
## market_segment                 -0.701041025
## distribution_channel            0.648961547
## is_repeated_guest              -0.022901202
## previous_cancellations          0.009339823
## previous_bookings_not_canceled  0.007469654
## reserved_room_type              0.091430939
## assigned_room_type             -0.082576837
## booking_changes                 0.018834694
## deposit_type                   -0.216580574
## days_in_waiting_list           -0.013888179
## customer_type                  -0.076359100
## adr                             0.051468828
## required_car_parking_spaces     0.021864973
## total_of_special_requests       0.070045017

In the output above, the standard deviations are the square roots of the eigenvalues, so a larger standard deviation means that the component captures more of the variance in the data.

Inspecting Loadings

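Loadings describe how strongly each original variable is tied to each principal component. In prcomp, the rotation matrix holds the unit-length weights (eigenvectors) used to build the components; multiplying a column by that component’s standard deviation gives the correlations between the variables and the component. This step provides insight into the contribution of each variable to the principal components.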

# Accessing loadings
loadings_prcomp <- pca_result_prcomp$rotation

print(loadings_prcomp[, 1:3])

##                                         PC1          PC2          PC3
## is_canceled                     0.163021986 -0.332427158  0.324201724
## lead_time                       0.132914223 -0.330497439  0.292106455
## arrival_date_day_of_month      -0.008132136  0.007542649 -0.009815215
## stays_in_weekend_nights        -0.208031895 -0.112002278  0.229746143
## stays_in_week_nights           -0.205043369 -0.133031032  0.293471357
## adults                         -0.233037697 -0.180247625  0.113584293
## market_segment                 -0.299445296 -0.358777543 -0.337873387
## distribution_channel           -0.148025680 -0.464231643 -0.253688076
## is_repeated_guest               0.122589294  0.319976263  0.089260625
## previous_cancellations          0.117543721 -0.015753450  0.204017022
## previous_bookings_not_canceled  0.102496026  0.257870580  0.120229219
## reserved_room_type             -0.432367978  0.121475142  0.350289559
## assigned_room_type             -0.399415338  0.208268854  0.295567164
## booking_changes                -0.064507375  0.132845923  0.044349424
## deposit_type                    0.297673821 -0.257855970  0.331049702
## days_in_waiting_list            0.110667458 -0.092118938  0.090936236
## customer_type                   0.147885253  0.074871202 -0.099651394
## adr                            -0.319806693 -0.062085978  0.105754093
## required_car_parking_spaces    -0.109249871  0.205316099  0.012723571
## total_of_special_requests      -0.278497631  0.041741734 -0.252843728

This output shows the loadings of all the variables on the first three principal components (PC1, PC2, and PC3). The loadings are the weights assigned to each original variable in the creation of each principal component. Loadings with a larger magnitude (closer to 1 or -1) indicate a stronger relationship. Positive loadings indicate a positive association with the principal component, while negative loadings indicate a negative one. The overall sign of a component is arbitrary, so only the relative signs within a component matter.

As shown above, different variables drive the variability in each principal component. PC1 is positively influenced by variables like “deposit_type”, “is_canceled”, and “customer_type”, and negatively influenced by variables like “reserved_room_type”, “assigned_room_type”, “adr”, and “market_segment”. PC2 is positively influenced by variables like “is_repeated_guest”, “previous_bookings_not_canceled”, and “required_car_parking_spaces”, but negatively influenced by variables like “distribution_channel”, “market_segment”, “is_canceled”, and “lead_time”.
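The relationship between the prcomp weights and variable-component correlations can be verified numerically. The sketch below again uses the built-in USArrests data as a stand-in (an assumption, not the hotel data): it rescales the rotation matrix by the component standard deviations and checks the result against correlations computed directly from the scores.

```r
# Stand-in data (built-in USArrests), not the hotel data
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# rotation holds the unit-length eigenvectors (weights)
weights <- p$rotation

# Variable-component correlations: scale each column by that PC's standard deviation
cor_loadings <- sweep(weights, 2, p$sdev, "*")

# Cross-check against correlations computed directly from the scores
direct <- cor(scale(USArrests), p$x)
print(all.equal(cor_loadings, direct, check.attributes = FALSE))

# Variables ranked by absolute weight on PC1
print(sort(abs(weights[, "PC1"]), decreasing = TRUE))
```

Ranking by absolute weight, as in the last line, is one way to read off which variables drive a component without being misled by the sign.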

Showing the magnitude

The magnitude of these loadings is essential for understanding the contribution of each variable to a particular PC. In the plot above, the darker, more intense colors mark loadings with low values, while the lighter colors mark loadings with positive values. Overall, a gradual descent is observed, where the values start at a peak and gradually decrease.

Principal component scores

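# Scores are the data projected onto the loadings. This assumes numeric_data is
# already centered and scaled; otherwise pca_result_prcomp$x holds the scores directly.
principal_component_scores_prcomp <- as.matrix(numeric_data) %*% loadings_prcomp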

print(principal_component_scores_prcomp[1:30 , 1:3])

##               PC1         PC2          PC3
##  [1,]  1.29668608  1.86266665  1.114822599
##  [2,]  1.68908798  0.84469895  2.262529701
##  [3,]  1.50640510  2.29393287 -0.302966930
##  [4,]  2.33855069  2.84840622 -0.054954755
##  [5,] -0.10166373 -0.17861992 -1.878628205
##  [6,] -0.10166373 -0.17861992 -1.878628205
##  [7,]  0.27595892  2.03852836  0.507501914
##  [8,] -0.03881807  2.06825925  0.204806843
##  [9,]  0.31802561 -1.13662989 -0.892908036
## [10,] -0.65745941 -0.35671845  0.805698687
## [11,] -1.64406018 -0.38791668  0.950941107
## [12,] -1.63859828  0.05388931 -0.005260471
## [13,] -2.56024202  0.27960796 -0.814960349
## [14,] -3.57375504  0.89804380  0.808887549
## [15,] -1.78519586  0.29191260  0.258669568
## [16,] -2.56024202  0.27960796 -0.814960349
## [17,] -2.07902356  0.31221635  0.876625928
## [18,] -0.43378053  0.30008387 -1.110184664
## [19,]  0.43934556  3.20273957  1.117360268
## [20,] -2.44196521  2.60329118  2.066360100
## [21,] -2.42616261  0.72282924  0.342826155
## [22,]  0.35978019  1.37100371  0.446945893
## [23,]  0.35978019  1.37100371  0.446945893
## [24,] -1.13715378  1.89984955  1.569444298
## [25,] -3.42195956  0.16747471  1.835192255
## [26,] -1.80307624  0.84714135  0.874455462
## [27,] -2.00818628 -0.17614219  1.119241576
## [28,] -2.72344622 -0.67141601  0.994663748
## [29,] -0.38569644 -0.85465063 -0.474370820
## [30,] -2.51461554  0.52249638  0.395798817

This step transforms the original variables into a new set of uncorrelated variables, the principal components, that capture the most important patterns in the data. The output above lists the scores of the first 30 observations on the first three components, which allows for dimension reduction and interpretation of the underlying structure.

# Viewing the dimensions of the result
dim(principal_component_scores_prcomp)

## [1] 119390     20

Each of the 119,390 observations in the original data set has been transformed into a set of 20 scores, one for each of the 20 principal components. These scores are the coordinates of each observation in the space defined by the principal components; the dimension is reduced only once we keep the first few components and drop the rest.
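The score computation can be checked on a small example. The sketch below uses the built-in USArrests data as a stand-in (an assumption, not the hotel data) and shows that multiplying the centered and scaled data by the rotation matrix reproduces the scores that prcomp stores in its x element, and that the resulting scores are uncorrelated.

```r
# Stand-in data (built-in USArrests), not the hotel data
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Center and scale the data the same way prcomp does (sample standard deviation)
Z <- scale(USArrests)

# Scores: standardized data projected onto the eigenvectors
scores_manual <- Z %*% p$rotation

# The manual scores match the ones stored by prcomp
print(all.equal(scores_manual, p$x, check.attributes = FALSE))

# Scores on different components are uncorrelated (off-diagonals near 0),
# and their variances equal the squared standard deviations
print(round(cov(scores_manual), 6))
```

Skipping the scale() step on unstandardized data would give different numbers, which is why the projection only matches when the input is already centered and scaled.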

Creating a Bi-Plot

The bi-plot below visualizes the relationship between the principal component scores and variable loadings. This graphical representation aids in interpreting the significance of each variable in the principal component space.

fviz_pca_biplot(pca_result_prcomp, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

The points in the biplot represent individual observations from the data set while their positions are determined by the scores of the principal components.

The length of an arrow shows how well that variable is represented on the two plotted components: longer arrows indicate variables that contribute more to them. Vectors pointing in the same direction are positively correlated, while vectors pointing in opposite directions are negatively correlated. Points that lie close to each other have similar values on the underlying variables, so the overall shape of the point cloud reflects the variability in the data.

Computing Variance

Below, the standard deviation and variance of the principal components are computed, offering statistical measures of variability.

We start by calculating the variance of each principal component by squaring the corresponding standard deviation.

std_dev <- pca_result_prcomp$sdev

# Computing variance
pr_var <- std_dev^2

print(pr_var[1:10])

##  [1] 2.7632675 2.5061716 1.5972473 1.4267008 1.3386413 1.0993637 0.9976820
##  [8] 0.9624933 0.9392584 0.8735969

The output above shows the variance captured by each of the first 10 principal components. For example, PC1 has a variance of about 2.76, PC2 about 2.51, and so on.

Analyzing Proportion of Variance

In this step, the proportion of variance explained by each principal component is examined, providing insights into the significance of each component in capturing the dataset’s variability.

prop_varex <- pr_var/sum(pr_var)
prop_varex

##  [1] 0.138163374 0.125308581 0.079862366 0.071335039 0.066932063 0.054968186
##  [7] 0.049884102 0.048124665 0.046962918 0.043679846 0.042959733 0.040209292
## [13] 0.037682365 0.035534341 0.028492106 0.026938117 0.024663590 0.022135113
## [19] 0.008145914 0.008018290

This gives the proportion of total variance explained by each principal component; the proportions sum to 1. Here PC1 explains approximately 13.82% of the total variance, PC2 explains 12.53%, and so on.
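The link between the variances and the proportions can be checked with simple arithmetic. The sketch below copies the first 10 values from the two outputs above and recovers the total variance they imply. A total of 20 is what one would expect if the 20 variables were standardized before the PCA; that is an inference from the numbers, not something stated in this section. The names pr_var_first10, prop_first10 and implied_total are hypothetical.

```r
# First 10 component variances, copied from the output above
pr_var_first10 <- c(2.7632675, 2.5061716, 1.5972473, 1.4267008, 1.3386413,
                    1.0993637, 0.9976820, 0.9624933, 0.9392584, 0.8735969)

# Matching proportions of variance, copied from the output above
prop_first10 <- c(0.138163374, 0.125308581, 0.079862366, 0.071335039,
                  0.066932063, 0.054968186, 0.049884102, 0.048124665,
                  0.046962918, 0.043679846)

# Implied total variance: each variance divided by its proportion
implied_total <- pr_var_first10 / prop_first10
round(implied_total, 3)

# Dividing each variance by 20 reproduces the reported proportions
max(abs(pr_var_first10 / 20 - prop_first10))
```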

Visualizing Results

UNDERSTANDING THE RELATIONSHIPS AND PATTERNS

scatterplot3d(principal_component_scores_prcomp[, 1:3], main = "3D Scatter Plot")

The scatter plot above uses the first 3 principal components as its axes. The individual points represent observations from the data set, and the cloud has a similar overall shape to the one in the biplot. The points follow a clear direction: the scores lean towards the positive side along the second principal component, indicating how much it contributes to the variation in the data.

However, many points are concentrated in a dense region, indicating low variability among those observations and hence similarity within the cluster. We also note some outliers, which are individual points that lie far from the main cluster. These may represent unique or unusual observations in the data.

PROPORTION OF VARIANCE

plot(prop_varex, type = "b", xlab = "Principal Components", ylab = "Proportion of Variance",
     main = "Scree Plot")

The scree plot typically shows a rapid decrease in the proportion of variance explained as we move from the first to the subsequent components. I use the “elbow” of the plot, where the decline slows down, to decide how many components to retain. In this case I will go with 6, and to check this choice I also examined the cumulative proportion of variance below.
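One common complement to reading the elbow by eye is the Kaiser rule, which keeps components whose variance exceeds 1. It is only meaningful when the variables were standardized, which is assumed here. The sketch below applies it to the first 10 variances reported earlier; the remaining components have even smaller variances, so they cannot pass the threshold. The names pr_var_first10 and kaiser_keep are hypothetical.

```r
# First 10 component variances, copied from the earlier output
pr_var_first10 <- c(2.7632675, 2.5061716, 1.5972473, 1.4267008, 1.3386413,
                    1.0993637, 0.9976820, 0.9624933, 0.9392584, 0.8735969)

# Kaiser rule (assumes standardized variables): keep components with variance > 1
kaiser_keep <- which(pr_var_first10 > 1)
kaiser_keep

# Number of components the rule would retain
length(kaiser_keep)
```

On these values the rule selects the first 6 components, which agrees with the choice made from the scree plot.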

CUMULATIVE PROPORTION OF VARIANCE

# cumulative variance
cumulative_variance <- cumsum(pca_result_prcomp$sdev^2) / sum(pca_result_prcomp$sdev^2)

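In the cumulative plot, each point corresponds to the cumulative proportion of variance explained by the components up to that point. The higher the point on the y-axis, the more variance is captured by the included components. Summing the proportions reported earlier, the first 6 components retain about 54% of the total variance (0.138 + 0.125 + 0.080 + 0.071 + 0.067 + 0.055 ≈ 0.537), and reaching 90% would require 15 components. The 6 components are therefore a compact summary that keeps the dominant patterns rather than nearly all of the information.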
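The cumulative curve can be reproduced from the proportions printed earlier. The sketch below copies those 20 values into a vector, draws the cumulative plot with a reference line at 90%, and reads off the share retained by 6 components. The names prop_varex_reported and cumulative_reported are hypothetical, and the plot styling is an assumption.

```r
# Proportions of variance for all 20 components, copied from the earlier output
prop_varex_reported <- c(0.138163374, 0.125308581, 0.079862366, 0.071335039,
                         0.066932063, 0.054968186, 0.049884102, 0.048124665,
                         0.046962918, 0.043679846, 0.042959733, 0.040209292,
                         0.037682365, 0.035534341, 0.028492106, 0.026938117,
                         0.024663590, 0.022135113, 0.008145914, 0.008018290)

# Running total of explained variance
cumulative_reported <- cumsum(prop_varex_reported)

plot(cumulative_reported, type = "b",
     xlab = "Number of Principal Components",
     ylab = "Cumulative Proportion of Variance",
     main = "Cumulative Variance Plot")
abline(h = 0.9, lty = 2)  # reference line at 90%

# Share of variance retained by the first 6 components
round(cumulative_reported[6], 3)

# Smallest number of components whose cumulative share reaches 90%
which(cumulative_reported >= 0.9)[1]
```

With these values the first 6 components account for roughly 0.537 of the variance, and the 90% line is first crossed at 15 components.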

RETAINED COMPONENTS FOR DIMENSION REDUCTION

Extracting the Selected Principal Components:

# Keeping the first 6 components for all observations
principal_components <- pca_result_prcomp$x[, 1:6]

# Previewing the first 10 rows
head(principal_components, 10)

##               PC1        PC2         PC3         PC4         PC5        PC6
##  [1,]  1.29668608  1.8626667  1.11482260  0.85725410 -2.12959320  0.7724049
##  [2,]  1.68908798  0.8446990  2.26252970  1.66007843 -2.82857545  2.3641574
##  [3,]  1.50640510  2.2939329 -0.30296693 -0.30622316 -0.65749591 -1.6710075
##  [4,]  2.33855069  2.8484062 -0.05495476 -0.02666939 -0.89352432 -1.9871248
##  [5,] -0.10166373 -0.1786199 -1.87862821 -0.16791151  0.63733398 -0.4431630
##  [6,] -0.10166373 -0.1786199 -1.87862821 -0.16791151  0.63733398 -0.4431630
##  [7,]  0.27595892  2.0385284  0.50750191 -0.76488877 -0.66315333 -1.4265349
##  [8,] -0.03881807  2.0682592  0.20480684 -0.61097870 -0.34705075 -1.1189459
##  [9,]  0.31802561 -1.1366299 -0.89290804 -0.12819116  0.83491092 -0.5399558
## [10,] -0.65745941 -0.3567184  0.80569869 -1.14385125  0.09966777 -0.8557261

From the PCA results, the first 6 principal components were extracted. These are the new features that represent the most important patterns in the data.
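As a sketch of how retained scores can serve as features for a later step such as clustering, the example below runs k-means on the first two components of the built-in iris measurements. The data set, the number of retained components and the number of clusters are all assumptions chosen for illustration and do not come from the hotel analysis.

```r
# Illustrative data only: the four numeric columns of iris
pca_demo <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Keep the first 2 components for every observation (hypothetical choice)
retained <- pca_demo$x[, 1:2]

# Use the retained scores as input features for k-means
set.seed(123)  # k-means starts from random centres
km_demo <- kmeans(retained, centers = 3, nstart = 10)

# Cluster sizes cross-tabulated against the known species labels
table(km_demo$cluster, iris$Species)
```

The same pattern would apply to the 6 retained hotel components: the score matrix replaces the original variables as the input to the downstream method.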

CONCLUSION

In alignment with our goal, these 6 PCs capture the most important information about booking behavior for the hotels, making them suitable inputs for further analysis such as clustering or regression.

Dimension Reduction Project

cynthia T.M Nyahoda

2024-01-23