1 Introduction

The objective of this paper is to apply dimensionality reduction methods to analyze data related to the satisfaction of US airline passengers. The data set that will be the focus of this article is sourced from Kaggle (https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction). The original data set comprises 25 columns.

# Importing necessary libraries for whole project
library(knitr)
library(corrplot)
library(factoextra)
library(ggplot2)
library(gridExtra)
library(pdp)
library(dplyr)
library(tidyr)
library(purrr)
library(ggrepel)
library(psych)
survey_raw <- read.csv("airline_survey.csv")

head(survey_raw[98:103, ])
##       X     id Gender     Customer.Type Age  Type.of.Travel    Class
## 98   97 114534   Male    Loyal Customer  19 Personal Travel Business
## 99   98  93076 Female    Loyal Customer  15 Personal Travel Eco Plus
## 100  99  96963 Female disloyal Customer  42 Business travel Business
## 101 100  85494 Female    Loyal Customer  16 Personal Travel      Eco
## 102 101  72989   Male    Loyal Customer  19 Business travel Business
## 103 102  42357 Female    Loyal Customer  29 Personal Travel      Eco
##     Flight.Distance Inflight.wifi.service Departure.Arrival.time.convenient
## 98              342                     3                                 1
## 99              297                     1                                 5
## 100             883                     3                                 3
## 101             332                     2                                 5
## 102            2388                     5                                 5
## 103             748                     2                                 5
##     Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 98                       3             2              3               3
## 99                       1             3              5               1
## 100                      3             3              3               3
## 101                      2             1              3               2
## 102                      5             5              4               5
## 103                      2             2              5               2
##     Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 98             3                      3                2                1
## 99             5                      5                5                1
## 100            4                      3                5                2
## 101            4                      3                5                4
## 102            4                      4                5                5
## 103            1                      5                4                2
##     Baggage.handling Checkin.service Inflight.service Cleanliness
## 98                 3               1                3           3
## 99                 1               3                4           5
## 100                4               3                4           3
## 101                5               4                4           3
## 102                1               5                4           4
## 103                4               3                5           5
##     Departure.Delay.in.Minutes Arrival.Delay.in.Minutes            satisfaction
## 98                           0                        0 neutral or dissatisfied
## 99                          67                       62 neutral or dissatisfied
## 100                         22                       27               satisfied
## 101                         40                       52 neutral or dissatisfied
## 102                          0                        0               satisfied
## 103                          0                        0 neutral or dissatisfied

To apply a dimension reduction algorithm, the columns with non-numerical values were excluded, together with the numerical columns that are not ratings (such as age, flight distance and delays). Only the columns holding ratings on a scale from 0 to 5 were retained. The resulting dataset comprises 14 columns of this nature.

survey_raw$X <- NULL
survey_raw$id <- NULL
survey_raw$Gender <- NULL
survey_raw$Customer.Type <- NULL
survey_raw$Age <- NULL
survey_raw$Type.of.Travel <- NULL
survey_raw$Class <- NULL
survey_raw$Flight.Distance <- NULL
survey_raw$Departure.Delay.in.Minutes <- NULL
survey_raw$Arrival.Delay.in.Minutes <- NULL
survey_raw$satisfaction <- NULL

It was anticipated that reducing the dimensions of the dataset would be a challenging undertaking, given the theoretical independence of the factors. Nevertheless, I was intrigued as to whether there might be any hitherto unidentified correlations between the factors.

head(survey_raw[98:103,])
##     Inflight.wifi.service Departure.Arrival.time.convenient
## 98                      3                                 1
## 99                      1                                 5
## 100                     3                                 3
## 101                     2                                 5
## 102                     5                                 5
## 103                     2                                 5
##     Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 98                       3             2              3               3
## 99                       1             3              5               1
## 100                      3             3              3               3
## 101                      2             1              3               2
## 102                      5             5              4               5
## 103                      2             2              5               2
##     Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 98             3                      3                2                1
## 99             5                      5                5                1
## 100            4                      3                5                2
## 101            4                      3                5                4
## 102            4                      4                5                5
## 103            1                      5                4                2
##     Baggage.handling Checkin.service Inflight.service Cleanliness
## 98                 3               1                3           3
## 99                 1               3                4           5
## 100                4               3                4           3
## 101                5               4                4           3
## 102                1               5                4           4
## 103                4               3                5           5

In order to verify the behavior of this dataset, the principal component analysis (PCA) method of dimension reduction was employed.

2 Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical method applied to high-dimensional continuous data with the objective of reducing the number of dimensions (that is, the number of variables) in a given dataset. It also enables the data to be visualized on a two-dimensional plot, thereby facilitating the discovery of additional dependencies between sets of variables.

2.1 Pearson correlation

In the initial stage of the principal component analysis (PCA), I sought to ascertain whether any correlations existed between the variables under consideration. The Pearson method was employed for this purpose. As anticipated, no negative correlations were observed. However, the graph below depicts notable positive correlations between certain parameters that may not initially appear to be directly connected. The strongest correlations are evident between the following variables:

  • “Ease of online booking” and “Inflight Wi-Fi service”
  • “Inflight entertainment” and “Food and drink”
  • “Cleanliness” and “Onboard service”
  • “Cleanliness” and “Seat Comfort”

# Pearson correlation matrix (named so that it does not mask the cor() function)
cor_matrix <- cor(survey_raw, method = "pearson")

corrplot(cor_matrix)
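
Reading the strongest pairs off a correlation plot by eye can be imprecise, so a ranked list may help. The following is a minimal sketch in base R, not part of the original analysis: `rank_correlations` is a hypothetical helper, and `toy_ratings` is a small synthetic stand-in rather than the survey data.

```r
# Synthetic stand-in for the rating columns (NOT the survey data):
# "wifi" is built from "booking", "food" is independent of both.
set.seed(1)
base <- sample(0:5, 200, replace = TRUE)
toy_ratings <- data.frame(
  booking = base,
  wifi    = pmin(5, pmax(0, base + sample(-1:1, 200, replace = TRUE))),
  food    = sample(0:5, 200, replace = TRUE)
)

# Hypothetical helper: list each variable pair once, ordered by the
# absolute value of its Pearson correlation.
rank_correlations <- function(df) {
  cm  <- cor(df, method = "pearson")
  idx <- which(upper.tri(cm), arr.ind = TRUE)  # upper triangle, no diagonal
  pairs <- data.frame(
    var1 = unname(rownames(cm)[idx[, 1]]),
    var2 = unname(colnames(cm)[idx[, 2]]),
    r    = unname(cm[idx]),
    stringsAsFactors = FALSE
  )
  pairs[order(-abs(pairs$r)), ]
}

top_pairs <- rank_correlations(toy_ratings)
print(top_pairs)
```

Applied to the 14 rating columns, the same helper would return 91 pairs, of which the first few rows correspond to the darkest cells of the plot.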

2.2 Distribution of ratings

The plots below illustrate the distribution of responses for all the features examined in the survey. The majority of distributions are skewed to the left, with most of the mass at the higher grades, indicating that respondents are, on the whole, at least moderately satisfied with the quality of the airline services. The majority of measures appear to reach a peak at grade 4. Upon closer examination, it can be seen that the features rated below average are “Inflight Wi-Fi service”, “Ease of online booking” and “Gate location”. The plots for these features display a distribution shape that is almost normal.

survey_raw %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  ggplot(aes(x = factor(value), fill = factor(value))) +
  geom_bar() +
  facet_wrap(~ key, scales = "free_y", ncol = 3) +
  labs(title = "Distribution of Ratings",
       x = "Rating",
       y = "Frequency")

2.3 Eigenvalues

In order to prepare for the dimension reduction process, it was decided that the first step would be to calculate the eigenvalues.

##  [1] 3.8001168 2.3619860 2.1658922 1.0632740 0.9509312 0.7003355 0.5399564
##  [8] 0.5146550 0.4694747 0.3686600 0.3284079 0.2950956 0.2531709 0.1880437

The plot above illustrates the calculated eigenvalues for each component. In accordance with the Kaiser Criterion, which asserts that every component with an eigenvalue below 1 should be excluded, the optimal number of dimensions is 4. However, a plot of the percentages of explained variances is also presented below for reference.
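
The code that produced these eigenvalues is not shown. As a minimal sketch, assuming the variables are standardised and base R's `prcomp` is used, eigenvalues of this kind can be obtained as the squared standard deviations of the components. `toy_ratings` is a synthetic stand-in, not the survey data.

```r
# Synthetic stand-in for the rating columns (NOT the survey data).
set.seed(42)
n <- 300
latent <- rnorm(n)
toy_ratings <- data.frame(
  a = latent + rnorm(n, sd = 0.5),
  b = latent + rnorm(n, sd = 0.5),
  c = rnorm(n),
  d = rnorm(n)
)

# PCA on centred and scaled variables, i.e. on the correlation matrix.
pca <- prcomp(toy_ratings, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eigenvalues <- pca$sdev^2
print(eigenvalues)

# Cross-check against a direct eigendecomposition of the correlation matrix.
eig_direct <- eigen(cor(toy_ratings))$values
print(all.equal(eigenvalues, eig_direct))
```

Because the variables are scaled to unit variance, the eigenvalues sum to the number of variables, which is why a threshold of 1 corresponds to "explains more than one original variable".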

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9494 1.5369 1.4717 1.03115 0.97516 0.83686 0.73482
## Proportion of Variance 0.2714 0.1687 0.1547 0.07595 0.06792 0.05002 0.03857
## Cumulative Proportion  0.2714 0.4401 0.5949 0.67080 0.73873 0.78875 0.82732
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.71739 0.68518 0.60717 0.57307 0.54323 0.50316 0.43364
## Proportion of Variance 0.03676 0.03353 0.02633 0.02346 0.02108 0.01808 0.01343
## Cumulative Proportion  0.86408 0.89762 0.92395 0.94741 0.96848 0.98657 1.00000

From the presented plot and table, which provide information regarding the relative importance of the components, it can be observed that the four dimensions selected in accordance with the Kaiser Criterion explain 67% of the variance. Nevertheless, it would be erroneous to treat this percentage alone as a definitive basis for choosing the number of dimensions. It is notable that the step from two to three dimensions is considerable (about 15.5 percentage points), whereas the subsequent steps to four, five and six dimensions are relatively minor, each adding only about 5 to 7.6 percentage points of explained variance.
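
The figures in this section can be re-derived from the eigenvalues reported above. The sketch below is an illustration in base R that only uses those reported numbers; the variable names are chosen here for illustration.

```r
# Eigenvalues copied from the output reported in Section 2.3.
eigenvalues <- c(3.8001168, 2.3619860, 2.1658922, 1.0632740, 0.9509312,
                 0.7003355, 0.5399564, 0.5146550, 0.4694747, 0.3686600,
                 0.3284079, 0.2950956, 0.2531709, 0.1880437)

# Kaiser criterion: keep the components whose eigenvalue exceeds 1.
n_kaiser <- sum(eigenvalues > 1)

# Share of the total variance per component and its running total.
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

print(n_kaiser)
print(round(100 * prop_var[1:6], 1))  # gain from adding each component, in %
print(round(100 * cum_var[1:6], 1))   # cumulative explained variance, in %
```

The per-component shares are exactly the marginal gains discussed in the text, so the same vector answers both "how much is explained by k dimensions" and "how much does the k-th dimension add".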

2.4 Correlation between variables

The subsequent step involved the creation of a graphical representation of the relationships between variables, as depicted in the accompanying graph. This type of plot enables the visualization of both positive and negative correlations between variables, as well as the grouping of those variables.

The graph demonstrates that no negative correlations exist between the parameters. Furthermore, the variables do not form discernible clusters. However, the “Online booking” variable is distinctly separated from the others, whereas “Inflight service”, “Baggage handling” and “Onboard service” remain in close proximity to one another.
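
The plotting code is not shown. The sketch below illustrates, in base R and on synthetic data, what a variable plot of this kind displays: each variable is drawn at its correlation with the first two components. `toy_ratings` and all other names are assumptions made for illustration, and base graphics are used here in place of a dedicated plotting package.

```r
# Synthetic stand-in for the rating columns (NOT the survey data).
set.seed(7)
n <- 300
latent <- rnorm(n)
toy_ratings <- data.frame(
  a = latent + rnorm(n, sd = 0.5),
  b = latent + rnorm(n, sd = 0.5),
  c = rnorm(n)
)

pca <- prcomp(toy_ratings, center = TRUE, scale. = TRUE)

# Variable coordinates: eigenvectors scaled by the component standard
# deviations. For standardised data these equal the correlations between
# each variable and each component.
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Unit circle plus one arrow per variable on the first two components.
theta <- seq(0, 2 * pi, length.out = 200)
plot(cos(theta), sin(theta), type = "l", asp = 1,
     xlab = "Dim 1", ylab = "Dim 2",
     main = "Variables on the correlation circle")
arrows(0, 0, var_coord[, 1], var_coord[, 2], length = 0.08)
text(var_coord[, 1], var_coord[, 2], labels = rownames(var_coord), pos = 3)
```

Arrows pointing in the same direction indicate positively correlated variables, opposite directions indicate negative correlation, and arrows close to the circle are well represented by the two plotted dimensions.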

2.5 Rotation

Because the optimal number of dimensions remained uncertain, a rotated PCA was conducted for each of the most plausible numbers of dimensions, namely 3, 4, 5 and 6.

## 
## Loadings:
##                                   RC1   RC3   RC2  
## Food.and.drink                     0.83            
## Seat.comfort                       0.85            
## Inflight.entertainment             0.76  0.45      
## Cleanliness                        0.88            
## On.board.service                         0.78      
## Leg.room.service                         0.61      
## Baggage.handling                         0.82      
## Inflight.service                         0.84      
## Inflight.wifi.service                          0.80
## Departure.Arrival.time.convenient              0.67
## Ease.of.Online.booking                         0.89
## Gate.location                                  0.67
## Online.boarding                    0.49        0.41
## Checkin.service                                    
## 
##                 RC1  RC3  RC2
## SS loadings    3.07 2.76 2.50
## Proportion Var 0.22 0.20 0.18
## Cumulative Var 0.22 0.42 0.59

The rotation made for division into three dimensions serves to illustrate some of the existing correlations. Nevertheless, “Online boarding” loads almost equally on components RC1 and RC2 (0.49 and 0.41). This parameter was also located between the two correlation points on the plot. Moreover, the parameter “Check-in service” has no loading large enough to be displayed on any of the components. The parameter groups observed in this division appear to lack a typical common factor that would allow for the consolidation of their features. Nevertheless, an attempt was made to identify some common ground between the parameters associated with the same components. With regard to RC1, it can be stated that the majority of contributing features pertain to the aircraft rather than to the service. Conversely, for RC3, the majority of features can be considered to relate to aircraft services. Finally, for RC2, the majority of features can be classified as pertaining to the airport and the online experience.

## 
## Loadings:
##                                   RC1   RC3   RC2   RC4  
## Food.and.drink                     0.85                  
## Seat.comfort                       0.82                  
## Inflight.entertainment             0.79  0.44            
## Cleanliness                        0.88                  
## On.board.service                         0.78            
## Leg.room.service                         0.60            
## Baggage.handling                         0.82            
## Inflight.service                         0.84            
## Departure.Arrival.time.convenient              0.76      
## Ease.of.Online.booking                         0.66  0.61
## Gate.location                                  0.83      
## Inflight.wifi.service                          0.54  0.66
## Online.boarding                                      0.84
## Checkin.service                                          
## 
##                 RC1  RC3  RC2  RC4
## SS loadings    2.93 2.71 2.04 1.71
## Proportion Var 0.21 0.19 0.15 0.12
## Cumulative Var 0.21 0.40 0.55 0.67

The division into four dimensions does not appear to resolve the majority of the issues observed in the previous three-dimensional example. The variables “Ease of online booking” and “Inflight Wi-Fi service” are distributed between RC2 and RC4, while “Check-in service” again has no displayed loading on any of the components. RC1 and RC3 remain essentially unchanged, whereas the component that previously combined airport and online features is now split into an airport-related part (RC2) and an online-related part (RC4). However, this division is not clear, because the two parameters mentioned above load on both of these components.

## 
## Loadings:
##                                   RC1   RC3   RC4   RC2   RC5  
## Food.and.drink                     0.85                        
## Seat.comfort                       0.82                        
## Inflight.entertainment             0.79  0.47                  
## Cleanliness                        0.89                        
## On.board.service                         0.77                  
## Leg.room.service                         0.63                  
## Baggage.handling                         0.82                  
## Inflight.service                         0.83                  
## Inflight.wifi.service                          0.80            
## Ease.of.Online.booking                         0.76  0.49      
## Online.boarding                                0.78            
## Departure.Arrival.time.convenient                    0.81      
## Gate.location                                        0.82      
## Checkin.service                                            0.92
## 
##                 RC1  RC3  RC4  RC2  RC5
## SS loadings    2.95 2.64 1.95 1.74 1.07
## Proportion Var 0.21 0.19 0.14 0.12 0.08
## Cumulative Var 0.21 0.40 0.54 0.66 0.74

A division into five dimensions was implemented on this occasion, and it appears to be the best of the three divisions considered so far. Although the components have not yet been clearly delineated, some common features can be discerned. RC1 and RC3 are comparable to those in the three- and four-dimensional divisions: they encompass the onboard parameters unrelated and related to the services, respectively. However, this distinction between service and non-service components is somewhat artificial, particularly when one considers the inclusion of leg room as a service. This may be seen as an unwarranted extension of the concept of service. With regard to the RC4 component, it can be observed that it gathers features related to online experiences, both during the reservation process and on board. This may indicate a correlation between the general online technology development of airlines and the aforementioned features. The RC2 component appears to connect features related to organisation and logistic convenience, namely gate location and the convenience of departure and arrival times. Finally, “Check-in service” is the only variable loading on RC5.

## 
## Loadings:
##                                   RC1   RC3   RC4   RC2   RC5   RC6  
## Food.and.drink                     0.85                              
## Seat.comfort                       0.83                              
## Inflight.entertainment             0.79  0.45                        
## Cleanliness                        0.89                              
## On.board.service                         0.79                        
## Baggage.handling                         0.83                        
## Inflight.service                         0.86                        
## Inflight.wifi.service                          0.80                  
## Ease.of.Online.booking                         0.76  0.49            
## Online.boarding                                0.80                  
## Departure.Arrival.time.convenient                    0.80            
## Gate.location                                        0.84            
## Checkin.service                                            0.94      
## Leg.room.service                                                 0.93
## 
##                 RC1  RC3  RC4  RC2  RC5  RC6
## SS loadings    2.95 2.41 1.94 1.73 1.06 0.95
## Proportion Var 0.21 0.17 0.14 0.12 0.08 0.07
## Cumulative Var 0.21 0.38 0.52 0.65 0.72 0.79

A comparison between the two divisions reveals that the six-dimensional division is quite similar to the five-dimensional one. The only significant difference is observed in the leg room service, which is included in the newly created RC6 dimension for this division.
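
The calls that produced the rotated loadings are not shown; the output format resembles that of `principal()` from the psych package loaded in the introduction. As a minimal sketch of the underlying idea, the following applies a varimax rotation from base R's stats package to the loadings of the first `k` components. The data are synthetic, and the numbers are not expected to match the tables above.

```r
# Synthetic stand-in with two underlying factors (NOT the survey data).
set.seed(11)
n <- 400
f1 <- rnorm(n)
f2 <- rnorm(n)
toy_ratings <- data.frame(
  a = f1 + rnorm(n, sd = 0.6),
  b = f1 + rnorm(n, sd = 0.6),
  c = f2 + rnorm(n, sd = 0.6),
  d = f2 + rnorm(n, sd = 0.6)
)

k <- 2  # number of retained components (3 to 6 in the text)
pca <- prcomp(toy_ratings, center = TRUE, scale. = TRUE)

# Unrotated loadings of the first k components.
raw_loadings <- sweep(pca$rotation[, 1:k], 2, pca$sdev[1:k], `*`)

# Orthogonal varimax rotation of those loadings.
rotated      <- varimax(raw_loadings)
rot_loadings <- unclass(rotated$loadings)

# Variance carried by each rotated component ("SS loadings") and its share.
ss_loadings <- colSums(rot_loadings^2)
print(round(rot_loadings, 2))
print(round(ss_loadings / ncol(toy_ratings), 2))
```

An orthogonal rotation redistributes the explained variance among the retained components without changing its total, which is consistent with the cumulative variance of the rotated solutions above matching the unrotated table (for example 0.67 for four components).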

2.6 Contribution of variables

Following the four trials described above, a decision was taken to produce a series of variable contribution plots for the division into five dimensions.

In the first dimension, we observe the previously referenced group of non-service onboard parameters, of which “in-flight entertainment” exhibits the most substantial contribution. However, the contributions of the top four variables are approximately 10-15% each, and the values for other parameters, such as “online boarding,” are not significantly lower.

The results for the second dimension are noteworthy, as they indicate that approximately 20% of the contribution can be attributed to four variables:

  • Ease of online booking
  • Inflight Wi-Fi service
  • Gate location
  • Departure and arrival time convenient

This suggests a combination of online-related and logistic features. The results differ from those achieved in the rotation, where the two groups were separated. However, upon examination of the plot of correlation between variables, it becomes evident that the two pairs were not situated at considerable distances from each other.

In order to gain insight into the third dimension, it is necessary to consider the contribution of all the onboard-related services and variables. Of particular interest is the significant discrepancy between the “leg room service” and “online boarding” variables. There is a notable resemblance between these variables and those observed in the bottom left quarter of the correlation plot.

In the fourth dimension, we observe a similar combination of logistic and online-related features. However, in this case, the highest contribution is made by “Online boarding,” while the other variables with a high significance level are “Gate location” and “Departure-Arrival time convenient.”

With regard to the fifth dimension, the view is strikingly similar to that observed during the rotation process. It is evident that the “Checkin service” plays a pivotal role, accounting for over 60% of the total contribution.
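
The contribution percentages discussed in this section can be understood as each variable's share of a component's squared coordinates. The sketch below computes them in base R on synthetic data; it is an illustration under that assumption, not the code behind the plots, and all names are hypothetical.

```r
# Synthetic stand-in for the rating columns (NOT the survey data).
set.seed(3)
n <- 300
latent <- rnorm(n)
toy_ratings <- data.frame(
  a = latent + rnorm(n, sd = 0.5),
  b = latent + rnorm(n, sd = 0.5),
  c = rnorm(n),
  d = rnorm(n)
)

pca <- prcomp(toy_ratings, center = TRUE, scale. = TRUE)

# Variable coordinates (correlations with the components).
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Contribution (%) of each variable to each component: its squared
# coordinate divided by the column total of squared coordinates.
contrib <- 100 * sweep(var_coord^2, 2, colSums(var_coord^2), `/`)

# Contribution each variable would have if all contributed equally.
expected_avg <- 100 / ncol(toy_ratings)

print(round(contrib, 1))
print(expected_avg)
```

With 14 variables the equal-contribution benchmark is about 7.1%, so a variable contributing over 60% to a single dimension, as reported for “Checkin service”, dominates that dimension almost entirely.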

3 Conclusion

In conclusion, dimension reduction using the PCA method proved to be a significant challenge on this dataset. It was anticipated that the survey questions had been selected following comprehensive research, with the intention that they would not overlap. Nevertheless, it was hypothesised that similarities could be identified between the results for certain variables. As evidenced by the data, this is indeed the case, although the results may be inconsistent and uneven.

One of the conclusions that can be drawn from this analysis is that the ratings of “Online booking” and “Inflight Wi-Fi service” are similar. This suggests that airlines that excel in one digital, internet-related process may also succeed in other processes within this category. Furthermore, a positive correlation is discernible between gate location and the convenience of departure and arrival times. These two features are related to the logistics of airline operations, and the relationship between airlines and airports may influence both parameters. Additionally, the parameters related to the general onboard service experience exhibited a similar pattern of behaviour. However, depending on the type of analysis conducted, slightly different correlations may be identified.

Following this analysis, it can be concluded that the survey was effectively designed and contains minimal content overlap. It can also be stated that dividing the airline passenger experience into distinct groups using a relatively simple method such as PCA is a challenging endeavour. Nevertheless, insights can be derived from this research.