PCA

Dataset link: https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction

Unit of observation: an individual passenger.

Sample size: 200.

ID

Gender

Age

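Customer satisfaction scores (from 1 to 5) for the following variables: Check-in Service, On-board Service, Seat Comfort, Cleanliness, Food and Drink, In-flight Service.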

Research question: What are the dimensions of airline passenger satisfaction?

library(readxl)
mydata <- read_xlsx("./HW2_PCA.xlsx")

mydata <- as.data.frame(mydata)

head(mydata)
##   ID Gender Age Check-in Service On-board Service Seat Comfort Cleanliness Food and Drink In-flight Service
## 1  1   Male  48                4                3            5           5              5                 5
## 2  2 Female  35                3                5            4           5              3                 5
## 3  3   Male  41                4                3            5           5              5                 3
## 4  4   Male  50                3                5            5           4              4                 5
## 5  5 Female  49                3                3            4           5              4                 3
## 6  6   Male  43                3                4            4           3              3                 4
library(psych) 
describe(mydata[ , c(-1, -2, -3)]) 
##                   vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## Check-in Service     1 200 3.42 1.21      3    3.53 1.48   1   5     4 -0.46    -0.54 0.09
## On-board Service     2 200 3.54 1.30      4    3.67 1.48   1   5     4 -0.64    -0.65 0.09
## Seat Comfort         3 200 3.79 1.16      4    3.91 1.48   1   5     4 -0.78    -0.37 0.08
## Cleanliness          4 200 3.48 1.27      4    3.59 1.48   1   5     4 -0.44    -0.88 0.09
## Food and Drink       5 200 3.28 1.33      3    3.33 1.48   1   5     4 -0.09    -1.31 0.09
## In-flight Service    6 200 3.81 1.22      4    3.97 1.48   1   5     4 -0.84    -0.25 0.09

The Seat Comfort variable has the lowest variability (sd = 1.16) and therefore carries the least information.

At least half of the responses for the Cleanliness variable were 4 or lower, and at least half were 4 or higher (median = 4).

Conducting initial checks to see if PCA can be done.

mydata_PCA <- mydata[ , c(4:9)] #1. Correlation matrix
R <- cor(mydata_PCA)
round (R, 3)
##                   Check-in Service On-board Service Seat Comfort Cleanliness Food and Drink In-flight Service
## Check-in Service             1.000            0.328        0.246       0.235          0.113             0.315
## On-board Service             0.328            1.000        0.110       0.073         -0.033             0.754
## Seat Comfort                 0.246            0.110        1.000       0.677          0.581             0.116
## Cleanliness                  0.235            0.073        0.677       1.000          0.556             0.126
## Food and Drink               0.113           -0.033        0.581       0.556          1.000             0.031
## In-flight Service            0.315            0.754        0.116       0.126          0.031             1.000
library(psych)
corPlot(R)

There is a strong positive linear correlation between the variables On-board Service and In-flight Service (r = 0.754).

library(psych)
cortest.bartlett(R, n = nrow(mydata)) #2. Bartlett's test.
## $chisq
## [1] 425.3804
## 
## $p.value
## [1] 3.172885e-81
## 
## $df
## [1] 15
det(R)
## [1] 0.1143531
library(psych) #3. KMO and MSA.
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA =  0.66
## MSA for each item = 
##  Check-in Service  On-board Service      Seat Comfort       Cleanliness    Food and Drink In-flight Service 
##              0.87              0.55              0.69              0.71              0.76              0.56
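  • Bartlett’s test:

H0: P = I (the population correlation matrix is an identity matrix)

H1: P ≠ I

We reject H0 at p < 0.001 => the variables are correlated in the population.

  • KMO = 0.66 > 0.5 and all MSA values are > 0.5 (the lowest is 0.55, for On-board Service).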
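As a cross-check on the two statistics above, Bartlett’s chi-square can be reproduced from the determinant alone, and the KMO/MSA values from the correlation matrix. The following is a minimal base-R sketch under stated assumptions: it uses the determinant and the correlation matrix exactly as printed above (the matrix is rounded to three decimals, so the KMO and MSA values agree with the psych output only approximately). The names det_R, R_approx, Q, R0, kmo_overall and msa are introduced here for illustration only.

```r
# Values taken from the output above
n <- 200            # sample size
p <- 6              # number of variables
det_R <- 0.1143531  # determinant of the correlation matrix

# Bartlett's test of sphericity: chi-square computed from det(R)
chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det_R)
df <- p * (p - 1) / 2
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

# Correlation matrix as printed above (rounded to 3 decimals: an approximation)
vars <- c("Check-in Service", "On-board Service", "Seat Comfort",
          "Cleanliness", "Food and Drink", "In-flight Service")
R_approx <- matrix(c(
  1.000,  0.328, 0.246, 0.235,  0.113, 0.315,
  0.328,  1.000, 0.110, 0.073, -0.033, 0.754,
  0.246,  0.110, 1.000, 0.677,  0.581, 0.116,
  0.235,  0.073, 0.677, 1.000,  0.556, 0.126,
  0.113, -0.033, 0.581, 0.556,  1.000, 0.031,
  0.315,  0.754, 0.116, 0.126,  0.031, 1.000),
  nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Partial correlations (up to sign) from the rescaled inverse of R
Q <- cov2cor(solve(R_approx))
diag(Q) <- 0
R0 <- R_approx
diag(R0) <- 0

# Overall KMO and per-variable MSA: squared correlations relative to
# squared correlations plus squared partial correlations
kmo_overall <- sum(R0^2) / (sum(R0^2) + sum(Q^2))
msa <- colSums(R0^2) / (colSums(R0^2) + colSums(Q^2))

round(c(chisq = chisq, df = df), 2)
round(kmo_overall, 2)
round(msa, 2)
```

The chi-square depends on the data only through det(R), n and p, which is why the determinant is reported next to the test.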

How many components should be there?

library(FactoMineR)
components <- PCA(mydata_PCA, 
                  scale.unit = TRUE, 
                  graph = FALSE)

library(factoextra) #Analyzing eigenvalues.
get_eigenvalue(components)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.4469605        40.782676                    40.78268
## Dim.2  1.7875899        29.793166                    70.57584
## Dim.3  0.7518615        12.531025                    83.10687
## Dim.4  0.4500027         7.500045                    90.60691
## Dim.5  0.3280424         5.467373                    96.07428
## Dim.6  0.2355430         3.925716                   100.00000

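By Kaiser’s rule (eigenvalue > 1): 2 components.

By 70% rule (cumulative explained variance of at least 70%): 2 components (70.6%).

By 5% rule (each retained component explains more than 5% of the variance): 5 components.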
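The three rules above can be applied directly to the printed eigenvalues. This is a small sketch in base R, assuming the eigenvalues are copied from the get_eigenvalue() output; the variable names (eig, pct, kaiser, rule70, rule5) are illustrative and not part of any package.

```r
# Eigenvalues copied from the get_eigenvalue() output above
eig <- c(2.4469605, 1.7875899, 0.7518615, 0.4500027, 0.3280424, 0.2355430)

# Percentage of total variance explained by each component
pct <- 100 * eig / sum(eig)

# Kaiser's rule: keep components with eigenvalue above 1
kaiser <- sum(eig > 1)

# 70% rule: smallest number of components whose cumulative share reaches 70%
rule70 <- which(cumsum(pct) >= 70)[1]

# 5% rule: keep components that each explain more than 5%
rule5 <- sum(pct > 5)

c(kaiser = kaiser, rule70 = rule70, rule5 = rule5)
```

Because the variables are standardized, the eigenvalues sum to the number of variables (6), so each percentage is simply the eigenvalue divided by 6.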

fviz_eig(components, #Screeplot.
         choice = "eigenvalue", 
         main = "Screeplot",
         ylab = "Eigenvalue",
         xlab = "Principal component",
         addlabels = TRUE)

By the scree plot (elbow method): the curve bends most clearly at component 3 => 2 components.

library(psych)
fa.parallel(mydata_PCA,
            sim = FALSE,
            fa = "pc")

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

By parallel analysis: 2 empirical eigenvalues are higher than the corresponding theoretical eigenvalues => 2 components.

Based on the results of these checks, we retain 2 components.
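The logic of parallel analysis can be sketched in base R. Note the swap: fa.parallel() with sim = FALSE compares against eigenvalues from resampled data, whereas this sketch uses simulated uncorrelated normal data (Horn’s classic variant), so the reference values will not match the psych plot exactly. The number of replications (200), the seed, and all variable names are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility of the sketch only

n <- 200  # sample size, as in the analysis above
p <- 6    # number of variables

# Empirical eigenvalues copied from the get_eigenvalue() output above
eig_obs <- c(2.4469605, 1.7875899, 0.7518615, 0.4500027, 0.3280424, 0.2355430)

# Eigenvalues of correlation matrices of random, uncorrelated data
# (one column of p eigenvalues per replication)
sim_eig <- replicate(200, {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(X), only.values = TRUE)$values
})

# Reference eigenvalues: mean over replications, by rank
eig_ref <- rowMeans(sim_eig)

# Retain components up to the first one that fails to exceed its reference
n_keep <- which(eig_obs <= eig_ref)[1] - 1

round(rbind(observed = eig_obs, reference = eig_ref), 3)
n_keep
```

A component is kept only if it explains more variance than a component of the same rank would in data with no correlation structure at all.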

library(FactoMineR)
components <- PCA(mydata_PCA, 
                  ncp = 2, 
                  scale.unit = TRUE, 
                  graph = FALSE)

components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 200 individuals, described by 6 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
components$var$cor
##                       Dim.1      Dim.2
## Check-in Service  0.5310956  0.3242828
## On-board Service  0.4692212  0.7807410
## Seat Comfort      0.7987448 -0.3675588
## Cleanliness       0.7823330 -0.3761201
## Food and Drink    0.6625400 -0.4932433
## In-flight Service 0.5056996  0.7436527
components$var$contrib
##                       Dim.1     Dim.2
## Check-in Service  11.527058  5.882744
## On-board Service   8.997632 34.099346
## Seat Comfort      26.072885  7.557632
## Cleanliness       25.012458  7.913801
## Food and Drink    17.938957 13.609888
## In-flight Service 10.451010 30.936588

Dimension 1: General satisfaction.

Dimension 2: Contrast between service quality factors (Check-in Service, On-board Service, In-flight Service; positive loadings) and comfort factors (Seat Comfort, Cleanliness, Food and Drink; negative loadings).
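The contributions printed above follow from the loadings: each contribution is the squared loading divided by the component’s eigenvalue, times 100. A brief sketch, assuming the loadings are copied from components$var$cor; the object names (loadings, eig, contrib, communality) are illustrative.

```r
vars <- c("Check-in Service", "On-board Service", "Seat Comfort",
          "Cleanliness", "Food and Drink", "In-flight Service")

# Loadings (variable-component correlations) copied from components$var$cor
loadings <- matrix(c(
  0.5310956,  0.3242828,
  0.4692212,  0.7807410,
  0.7987448, -0.3675588,
  0.7823330, -0.3761201,
  0.6625400, -0.4932433,
  0.5056996,  0.7436527),
  ncol = 2, byrow = TRUE, dimnames = list(vars, c("Dim.1", "Dim.2")))

# The sum of squared loadings per component recovers its eigenvalue
eig <- colSums(loadings^2)

# Contribution of each variable to each component, in percent
contrib <- 100 * sweep(loadings^2, 2, eig, "/")

# Share of each variable's variance captured by the two components
communality <- rowSums(loadings^2)

round(eig, 3)
round(contrib, 3)
round(communality, 3)
```

The row sums of the squared loadings show how well each variable is represented by the two retained components; Check-in Service is the weakest, at about 0.39.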

fviz_pca_var(components, repel = TRUE)

fviz_pca_biplot(components, repel = TRUE)

ID number 70 rated the airline below average in general, but rated comfort higher than service quality.

ID number 105 rated the airline above average in general, and rated service quality higher than comfort.

mydata$PC1 <- components$ind$coord[ , 1]
mydata$PC2 <- components$ind$coord[ , 2]
head(mydata)
##   ID Gender Age Check-in Service On-board Service Seat Comfort Cleanliness Food and Drink In-flight Service        PC1
## 1  1   Male  48                4                3            5           5              5                 5  2.0368037
## 2  2 Female  35                3                5            4           5              3                 5  1.1400408
## 3  3   Male  41                4                3            5           5              5                 3  1.5047904
## 4  4   Male  50                3                5            5           4              4                 5  1.5046197
## 5  5 Female  49                3                3            4           5              4                 3  0.4654444
## 6  6   Male  43                3                4            4           3              3                 4 -0.1452188
##          PC2
## 1 -0.6832429
## 2  0.8077537
## 3 -1.5985770
## 4  0.5148204
## 5 -1.2843395
## 6  0.3441537
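To illustrate what the saved PC1 and PC2 values are, the sketch below computes component scores by hand on hypothetical toy data (not the airline data): standardize the variables, then project them onto the eigenvectors of the correlation matrix. It uses base R only; the toy matrix X and all object names are assumptions. Coordinates reported by FactoMineR may differ from such hand-computed scores in sign or by a small scaling factor, depending on how the package standardizes the variables.

```r
set.seed(42)  # arbitrary seed for the toy example

# Hypothetical toy data: 50 respondents, 3 ratings on a 1-5 scale
X <- matrix(sample(1:5, 150, replace = TRUE), ncol = 3,
            dimnames = list(NULL, c("A", "B", "C")))

# Standardize each variable (mean 0, sd 1 with divisor n - 1)
Z <- scale(X)

# Eigen decomposition of the correlation matrix
e <- eigen(cor(X))

# Component scores: standardized data projected onto the eigenvectors
scores_manual <- Z %*% e$vectors

# The same scores from base R's prcomp (columns may differ in sign)
pc <- prcomp(X, scale. = TRUE)

# The variance of each score column equals the corresponding eigenvalue
round(rbind(var_scores = apply(scores_manual, 2, var),
            eigenvalue = e$values), 4)
```

The scores are uncorrelated with each other, which is what makes PC1 and PC2 convenient as new variables in further analysis.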

Principal component analysis was performed on six standardized variables (n = 200). The KMO measure confirms the appropriateness of the variables (KMO = 0.66), although the value falls into the “Mediocre” category. The MSA statistics are above 0.50 for all individual variables. Based on the parallel analysis (supported by Kaiser’s rule, the 70% rule, and the scree plot), it makes most sense to retain the first two principal components, which together summarize 70.6% of the information in the original 6 variables. Based on the component loadings, we conclude that PC1 (𝜆1 = 2.4) represents general satisfaction, while PC2 (𝜆2 = 1.8) represents the contrast between service quality factors and comfort factors of airline passenger satisfaction.