Dataset link: https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction
Unit of observation: an individual passenger.
Sample size: 200.
ID
Gender
Age
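Customer satisfaction scores (from 1 to 5) for the following variables: Check-in Service, On-board Service, Seat Comfort, Cleanliness, Food and Drink, In-flight Service.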
Research question: What are the dimensions of airline passenger satisfaction?
library(readxl)
mydata <- read_xlsx("./HW2_PCA.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
## ID Gender Age Check-in Service On-board Service Seat Comfort Cleanliness Food and Drink In-flight Service
## 1 1 Male 48 4 3 5 5 5 5
## 2 2 Female 35 3 5 4 5 3 5
## 3 3 Male 41 4 3 5 5 5 3
## 4 4 Male 50 3 5 5 4 4 5
## 5 5 Female 49 3 3 4 5 4 3
## 6 6 Male 43 3 4 4 3 3 4
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(mydata[ , c(-1, -2, -3)])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Check-in Service 1 200 3.42 1.21 3 3.53 1.48 1 5 4 -0.46 -0.54 0.09
## On-board Service 2 200 3.54 1.30 4 3.67 1.48 1 5 4 -0.64 -0.65 0.09
## Seat Comfort 3 200 3.79 1.16 4 3.91 1.48 1 5 4 -0.78 -0.37 0.08
## Cleanliness 4 200 3.48 1.27 4 3.59 1.48 1 5 4 -0.44 -0.88 0.09
## Food and Drink 5 200 3.28 1.33 3 3.33 1.48 1 5 4 -0.09 -1.31 0.09
## In-flight Service 6 200 3.81 1.22 4 3.97 1.48 1 5 4 -0.84 -0.25 0.09
Seat Comfort has the lowest variability (sd = 1.16) and therefore carries the least information.
At least half of the respondents rated Cleanliness 4 or higher (median = 4).
mydata_PCA <- mydata[ , 4:9] # Keep only the six satisfaction items.
R <- cor(mydata_PCA) #1. Correlation matrix.
round(R, 3)
## Check-in Service On-board Service Seat Comfort Cleanliness Food and Drink In-flight Service
## Check-in Service 1.000 0.328 0.246 0.235 0.113 0.315
## On-board Service 0.328 1.000 0.110 0.073 -0.033 0.754
## Seat Comfort 0.246 0.110 1.000 0.677 0.581 0.116
## Cleanliness 0.235 0.073 0.677 1.000 0.556 0.126
## Food and Drink 0.113 -0.033 0.581 0.556 1.000 0.031
## In-flight Service 0.315 0.754 0.116 0.126 0.031 1.000
library(psych)
corPlot(R)
There is a strong positive linear correlation between On-board Service and In-flight Service (r = 0.754).
library(psych)
cortest.bartlett(R, n = nrow(mydata)) #2. Bartlett's test.
## $chisq
## [1] 425.3804
##
## $p.value
## [1] 3.172885e-81
##
## $df
## [1] 15
det(R)
## [1] 0.1143531
library(psych) #3. KMO and MSA.
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA = 0.66
## MSA for each item =
## Check-in Service On-board Service Seat Comfort Cleanliness Food and Drink In-flight Service
## 0.87 0.55 0.69 0.71 0.76 0.56
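As a rough cross-check of the KMO output above, the overall and per-item MSA can be computed from the correlation matrix with base R. This is a minimal sketch, assuming the usual formulation that compares squared correlations with squared partial correlations. The matrix is retyped from the rounded output of round(R, 3), so the results may differ slightly from those of KMO(). The names R_rounded, Q and R0 are illustrative only.

```r
# Correlation matrix retyped from the rounded output above (3 decimals).
vars <- c("Check-in Service", "On-board Service", "Seat Comfort",
          "Cleanliness", "Food and Drink", "In-flight Service")
R_rounded <- matrix(c(
  1.000,  0.328, 0.246, 0.235,  0.113, 0.315,
  0.328,  1.000, 0.110, 0.073, -0.033, 0.754,
  0.246,  0.110, 1.000, 0.677,  0.581, 0.116,
  0.235,  0.073, 0.677, 1.000,  0.556, 0.126,
  0.113, -0.033, 0.581, 0.556,  1.000, 0.031,
  0.315,  0.754, 0.116, 0.126,  0.031, 1.000),
  nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Scaled inverse: its off-diagonal entries are the partial correlations
# (up to sign, which does not matter once they are squared).
Q <- cov2cor(solve(R_rounded))
diag(Q) <- 0
R0 <- R_rounded
diag(R0) <- 0

# MSA = squared correlations / (squared correlations + squared partials).
msa_overall <- sum(R0^2) / (sum(R0^2) + sum(Q^2))
msa_items <- colSums(R0^2) / (colSums(R0^2) + colSums(Q^2))

round(msa_overall, 2)
round(msa_items, 2)
```

Values close to 1 mean that the partial correlations are small relative to the raw correlations, which is the situation in which PCA summarizes the data well.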
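Bartlett's test of sphericity:
H0: P = I (the population correlation matrix is an identity matrix)
H1: P ≠ I
We reject H0 at p < 0.001 => the variables are correlated, so PCA is appropriate.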
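The chi-square statistic reported by cortest.bartlett() above can be reproduced by hand from det(R). This is a minimal sketch, assuming the standard chi-square approximation for Bartlett's test of sphericity; n, p and det_R are copied from the output above.

```r
n <- 200            # sample size
p <- 6              # number of variables
det_R <- 0.1143531  # determinant of the correlation matrix, from det(R)

# Bartlett's statistic: a determinant near 0 (strong correlations) gives a
# large chi-square, a determinant near 1 (no correlations) gives a small one.
chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det_R)
df <- p * (p - 1) / 2
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

round(chisq, 2)
df
p_value
```

The result should agree with the reported chi-square of 425.38 on 15 degrees of freedom.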
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.4.2
components <- PCA(mydata_PCA,
scale.unit = TRUE,
graph = FALSE)
library(factoextra) #Analyzing eigenvalues.
get_eigenvalue(components)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.4469605 40.782676 40.78268
## Dim.2 1.7875899 29.793166 70.57584
## Dim.3 0.7518615 12.531025 83.10687
## Dim.4 0.4500027 7.500045 90.60691
## Dim.5 0.3280424 5.467373 96.07428
## Dim.6 0.2355430 3.925716 100.00000
By Kaiser’s rule: 2 components.
By 70% rule: 2 components.
By 5% rule: 5 components.
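The three rules above can be checked directly against the eigenvalues. A minimal sketch in base R, with the eigenvalues copied from the get_eigenvalue() output; the variable names are illustrative.

```r
# Eigenvalues copied from the get_eigenvalue() output above.
eig <- c(2.4469605, 1.7875899, 0.7518615, 0.4500027, 0.3280424, 0.2355430)
pct <- 100 * eig / sum(eig)  # percentage of variance per component

n_kaiser <- sum(eig > 1)                 # Kaiser's rule: eigenvalue above 1
n_70 <- which(cumsum(pct) >= 70)[1]      # 70% rule: cumulative variance reaches 70%
n_5 <- sum(pct > 5)                      # 5% rule: each component explains over 5%

c(kaiser = n_kaiser, rule_70 = n_70, rule_5 = n_5)
```

Because the variables are standardized, the eigenvalues sum to the number of variables (6), so Kaiser's threshold of 1 corresponds to a component that explains more than a single original variable.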
fviz_eig(components, #Screeplot.
choice = "eigenvalue",
main = "Screeplot",
ylab = "Eigenvalue",
xlab = "Principal component",
addlabels = TRUE)
By the scree plot (elbow method): the elbow occurs at the third component => 2 components.
library(psych)
fa.parallel(mydata_PCA,
sim = FALSE,
fa = "pc")
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
By parallel analysis: the first 2 empirical eigenvalues are higher than the corresponding theoretical eigenvalues => 2 components.
Based on these results, we will retain 2 components.
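To illustrate the idea behind parallel analysis, the sketch below compares observed eigenvalues with reference eigenvalues obtained by shuffling each column independently. It is not the implementation used by fa.parallel(). The function parallel_pca(), the 95th-percentile reference and the simulated two-factor data are assumptions made for illustration only.

```r
# Parallel analysis for PCA by column-wise resampling (base R only).
parallel_pca <- function(X, n_iter = 200) {
  X <- as.matrix(X)
  obs <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  # Shuffling every column independently keeps each variable's distribution
  # but destroys the correlations between variables.
  sims <- replicate(n_iter, {
    Xs <- apply(X, 2, sample)
    eigen(cor(Xs), symmetric = TRUE, only.values = TRUE)$values
  })
  ref <- apply(sims, 1, quantile, probs = 0.95)  # reference eigenvalues
  list(observed = obs, reference = ref, n_components = sum(obs > ref))
}

# Hypothetical data: six items driven by two independent latent factors.
set.seed(1)
n <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(f1 + rnorm(n, sd = 0.7), f1 + rnorm(n, sd = 0.7),
           f1 + rnorm(n, sd = 0.7), f2 + rnorm(n, sd = 0.7),
           f2 + rnorm(n, sd = 0.7), f2 + rnorm(n, sd = 0.7))

res <- parallel_pca(X, n_iter = 100)
round(res$observed, 2)
round(res$reference, 2)
res$n_components
```

A component is kept only when its eigenvalue exceeds what uncorrelated data of the same size would be expected to produce by chance.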
library(FactoMineR)
components <- PCA(mydata_PCA,
ncp = 2,
scale.unit = TRUE,
graph = FALSE)
components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 200 individuals, described by 6 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
components$var$cor
## Dim.1 Dim.2
## Check-in Service 0.5310956 0.3242828
## On-board Service 0.4692212 0.7807410
## Seat Comfort 0.7987448 -0.3675588
## Cleanliness 0.7823330 -0.3761201
## Food and Drink 0.6625400 -0.4932433
## In-flight Service 0.5056996 0.7436527
components$var$contrib
## Dim.1 Dim.2
## Check-in Service 11.527058 5.882744
## On-board Service 8.997632 34.099346
## Seat Comfort 26.072885 7.557632
## Cleanliness 25.012458 7.913801
## Food and Drink 17.938957 13.609888
## In-flight Service 10.451010 30.936588
Dimension 1: General satisfaction (all six variables load positively).
Dimension 2: Contrast between service quality factors (Check-in Service, On-board Service, In-flight Service; positive loadings) and comfort factors (Seat Comfort, Cleanliness, Food and Drink; negative loadings).
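The loadings and contributions reported above are linked: a variable's contribution to a component is its squared loading as a percentage of the component's eigenvalue. A minimal sketch for Dimension 1, with the loadings copied from components$var$cor; the variable names are illustrative.

```r
# Dimension 1 loadings copied from components$var$cor above.
loadings_dim1 <- c("Check-in Service"  = 0.5310956,
                   "On-board Service"  = 0.4692212,
                   "Seat Comfort"      = 0.7987448,
                   "Cleanliness"       = 0.7823330,
                   "Food and Drink"    = 0.6625400,
                   "In-flight Service" = 0.5056996)

# The squared loadings of a component sum to its eigenvalue.
eig_dim1 <- sum(loadings_dim1^2)

# Contribution (%) = squared loading as a share of the eigenvalue.
contrib_dim1 <- 100 * loadings_dim1^2 / eig_dim1

round(eig_dim1, 3)
round(contrib_dim1, 2)
```

The result should match the Dim.1 column of components$var$contrib and the first eigenvalue (about 2.447) up to rounding.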
fviz_pca_var(components, repel = TRUE)
fviz_pca_biplot(components, repel = TRUE)
The passenger with ID 70 rated the airline below average overall, but rated comfort higher than service quality.
The passenger with ID 105 rated the airline above average overall, and rated service quality higher than comfort.
mydata$PC1 <- components$ind$coord[ , 1]
mydata$PC2 <- components$ind$coord[ , 2]
head(mydata)
## ID Gender Age Check-in Service On-board Service Seat Comfort Cleanliness Food and Drink In-flight Service PC1
## 1 1 Male 48 4 3 5 5 5 5 2.0368037
## 2 2 Female 35 3 5 4 5 3 5 1.1400408
## 3 3 Male 41 4 3 5 5 5 3 1.5047904
## 4 4 Male 50 3 5 5 4 4 5 1.5046197
## 5 5 Female 49 3 3 4 5 4 3 0.4654444
## 6 6 Male 43 3 4 4 3 3 4 -0.1452188
## PC2
## 1 -0.6832429
## 2 0.8077537
## 3 -1.5985770
## 4 0.5148204
## 5 -1.2843395
## 6 0.3441537
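The PC1 and PC2 columns are component scores: each passenger's standardized answers projected onto the component directions. The sketch below shows the computation in base R on simulated stand-in data, not on the homework data. It assumes standardization with the population standard deviation (dividing by n), which, as far as we know, is what PCA() does with scale.unit = TRUE. The sign of each component is arbitrary, so scores may come out mirrored relative to another implementation.

```r
set.seed(42)
# Hypothetical stand-in data: 200 respondents, 6 numeric items.
X <- matrix(rnorm(200 * 6), ncol = 6)
n <- nrow(X)

# Standardize with the population standard deviation (divide by n).
centered <- sweep(X, 2, colMeans(X))
sd_pop <- sqrt(colSums(centered^2) / n)
Z <- sweep(centered, 2, sd_pop, "/")

# Eigenvectors of the correlation matrix define the component directions.
e <- eigen(cor(X), symmetric = TRUE)
scores <- Z %*% e$vectors[, 1:2]
colnames(scores) <- c("PC1", "PC2")

# With this scaling, the variance of each score column (dividing by n)
# equals the corresponding eigenvalue.
colSums(scores^2) / n
e$values[1:2]
head(scores)
```

Scores have mean 0, so a positive PC1 score indicates above-average overall satisfaction and a negative one indicates below-average satisfaction.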
Principal component analysis was performed on 6 standardized variables (n = 200). The KMO measure confirms that the variables are suitable for PCA: KMO = 0.66, which falls into the “Mediocre” category, and the MSA statistics are above 0.50 for all individual variables. Based on the parallel analysis (supported by Kaiser's rule, the 70% rule and the scree plot), it makes most sense to retain the first two principal components, which together summarize 70.6% of the information in the original 6 variables. Based on the component loadings, we conclude that PC1 (λ1 = 2.4) represents general satisfaction, while PC2 (λ2 = 1.8) represents the contrast between service quality factors and comfort factors in airline passenger satisfaction.