Airline Passenger Satisfaction Across Genders

Applied data project 2 (Assignment 2)

Name - Munish Sabherwal Student ID - s3915526

Last updated: 29 May, 2022

Introduction

The dataset contains customer satisfaction information from more than 120,000 airline passengers. It consists of ratings on different factors such as convenience of departure/arrival times, ease of online booking, check-in service, gate location, etc. Ratings are given from 0 to 5, with 5 being the best and 0 the lowest. Based on the ratings, overall customer satisfaction is recorded as “Satisfied” or “Neutral or Dissatisfied”.

library(dplyr)  # mutate(), filter(), group_by(), summarise() and the %>% pipe
library(car)    # qqPlot() and leveneTest()

setwd("/Users/home/Desktop/Data Repository :: Applied Analytics")

Air <- read.csv("airline_passenger_satisfaction.csv")

# Total Satisfaction Score: sum of the 14 individual service ratings
Air <- Air %>% mutate(
  Satisfaction_Score = Departure.and.Arrival.Time.Convenience + Ease.of.Online.Booking +
    Check.in.Service + Online.Boarding + Gate.Location + On.board.Service +
    Seat.Comfort + Leg.Room.Service + Cleanliness + Food.and.Drink +
    In.flight.Service + In.flight.Wifi.Service + In.flight.Entertainment + Baggage.Handling
)

head(Air$Satisfaction)
## [1] "Neutral or Dissatisfied" "Satisfied"              
## [3] "Satisfied"               "Satisfied"              
## [5] "Satisfied"               "Satisfied"
Air$Satisfaction <-  ifelse(Air$Satisfaction=="Satisfied",1,0)
Air$Satisfaction <- Air$Satisfaction %>% factor(levels = c(0,1),
                                                labels = c("Neutral or Dissatisfied","Satisfied"))

Air$Gender <- ifelse(Air$Gender == "Male",1,0)
Air$Gender <- Air$Gender %>% factor(levels = c(0,1),
                                                labels = c("Female","Male"))

tab <- table(Air$Satisfaction,Air$Gender)
tab %>% addmargins()
##                          
##                           Female   Male    Sum
##   Neutral or Dissatisfied  37630  35822  73452
##   Satisfied                28269  28159  56428
##   Sum                      65899  63981 129880
tab2 <- tab %>%  prop.table(margin=2) 
tab2
##                          
##                              Female      Male
##   Neutral or Dissatisfied 0.5710254 0.5598850
##   Satisfied               0.4289746 0.4401150
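
As a check on the table above, the column proportions can be recomputed directly from the printed counts. The following is a minimal sketch that rebuilds the contingency table by hand; the object names `counts` and `props` are illustrative and not part of the original analysis.

```r
# Rebuild the Satisfaction x Gender counts printed above (no data file needed).
# matrix() fills column by column, so each pair below is one gender.
counts <- matrix(
  c(37630, 28269,   # Female: Neutral or Dissatisfied, Satisfied
    35822, 28159),  # Male:   Neutral or Dissatisfied, Satisfied
  nrow = 2,
  dimnames = list(
    Satisfaction = c("Neutral or Dissatisfied", "Satisfied"),
    Gender       = c("Female", "Male")
  )
)

# margin = 2 divides each cell by its column (gender) total,
# so every column sums to 1
props <- prop.table(counts, margin = 2)
round(props, 4)
```

Using `margin = 2` conditions on gender, which is the comparison of interest here: the share of satisfied passengers within each gender.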

Problem Statement

The aim of the analysis is to determine whether there is an association between gender and the satisfaction level of the passengers.

We also test whether the two genders differ significantly in their mean total Satisfaction Score (using the 0.05 level of significance).

We address both questions using hypothesis tests.

Data

The data is an open dataset collected from : Kaggle.com. 2022. Airline Passenger Satisfaction. [online] Available at: https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction?select=airline_passenger_satisfaction.csv [Accessed 29 May 2022].

Data Cont.

- The two key variables in the data are Gender and overall Satisfaction, alongside the individual rating factors that feed into the overall satisfaction level.

Descriptive Statistics and Visualisation

boxplot(Satisfaction_Score~Gender, data = Air, ylab = "Satisfaction Score", xlab= "Gender")

Descriptive Statistics Cont.

Air %>% group_by(Gender) %>% summarise(Min = min(Satisfaction_Score,na.rm = TRUE),
                                         Q1 = quantile(Satisfaction_Score,probs = .25,na.rm = TRUE),
                                         Median = median(Satisfaction_Score, na.rm = TRUE),
                                         Q3 = quantile(Satisfaction_Score,probs = .75,na.rm = TRUE),
                                         Max = max(Satisfaction_Score,na.rm = TRUE),
                                         Mean = mean(Satisfaction_Score, na.rm = TRUE),
                                         SD = sd(Satisfaction_Score, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Satisfaction_Score))) -> table1
knitr::kable(table1)
Gender  Min  Q1  Median  Q3  Max      Mean        SD      n  Missing
Female   15  39      45  52   70  45.29239  9.282887  65899        0
Male     15  39      46  52   70  45.46565  9.244143  63981        0

Hypothesis Testing Cont.

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 \ne \mu_2\]

\[S = \sum^n_{i = 1}d^2_i\]

Hypothesis Testing

Air_Male <- Air %>% filter(Gender == "Male") 
Air_Male$Satisfaction_Score %>% qqPlot(dist="norm")

## [1] 8180 6309
Air_Female <- Air %>% filter(Gender == "Female") 
Air_Female$Satisfaction_Score %>% qqPlot(dist="norm")

## [1] 10857  7090

Homogeneity of variances

leveneTest(Satisfaction_Score ~ Gender, data = Air)

As p > 0.05 in Levene's test, we fail to reject the assumption of equal population variances, so the two-sample t-test with pooled variance can be applied.

result<- t.test( 
  Satisfaction_Score ~ Gender, 
  data = Air, 
  var.equal = TRUE, 
  alternative = "two.sided" 
) 
result
## 
##  Two Sample t-test
## 
## data:  Satisfaction_Score by Gender
## t = -3.3699, df = 129878, p-value = 0.0007521
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -0.27404104 -0.07249303
## sample estimates:
## mean in group Female   mean in group Male 
##             45.29239             45.46565
result$p.value
## [1] 0.0007521196
result$conf.int
## [1] -0.27404104 -0.07249303
## attr(,"conf.level")
## [1] 0.95

We reject H0: μ1 = μ2, as p < .05 and the 95% CI of the estimated difference in population means [-0.27404104, -0.07249303] does not capture H0: μ1 - μ2 = 0. The results of the two-sample t-test were therefore statistically significant: the mean total Satisfaction Score differs significantly between female and male passengers.
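
The t statistic and confidence interval reported above can be cross-checked from the group summaries alone, using the pooled-variance formula t = (mean1 - mean2) / (sp * sqrt(1/n1 + 1/n2)). The sketch below assumes the rounded means, standard deviations, and group sizes from the descriptive table, so small rounding differences from the `t.test()` output are expected; all object names are illustrative.

```r
# Summary statistics copied from the descriptive table (Female, Male)
m <- c(Female = 45.29239, Male = 45.46565)   # group means
s <- c(Female = 9.282887, Male = 9.244143)   # group standard deviations
n <- c(Female = 65899,    Male = 63981)      # group sizes

df <- sum(n) - 2                         # degrees of freedom (129878)
sp <- sqrt(sum((n - 1) * s^2) / df)      # pooled standard deviation
se <- sp * sqrt(sum(1 / n))              # standard error of the mean difference
d  <- m[["Female"]] - m[["Male"]]        # estimated difference in means

t_stat <- d / se                              # two-sample t statistic
p_val  <- 2 * pt(-abs(t_stat), df)            # two-sided p-value
ci     <- d + c(-1, 1) * qt(0.975, df) * se   # 95% confidence interval

round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2]), 5)
```

The estimated difference `d` is about -0.17 points on a score that ranges from 15 to 70 in this sample, which helps in judging the size of the difference alongside its statistical significance.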

Second Hypothesis Test

H0: There is no association between gender and Satisfaction rating

HA: There is an association between gender and Satisfaction rating

Assumption: No more than 25% of expected cell counts are below 5

chi2 <- chisq.test(tab) 
chi2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 16.352, df = 1, p-value = 5.26e-05
chi2$expected
##                          
##                             Female     Male
##   Neutral or Dissatisfied 37268.35 36183.65
##   Satisfied               28630.65 27797.35

There are no cells with expected counts below 5.

Since the p-value is less than the 0.05 significance level, we reject the null hypothesis and conclude that gender and Satisfaction rating are associated.
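
To make the chi-squared result less of a black box, the expected counts and the test statistic can be recomputed by hand from the observed counts. This is a minimal sketch, assuming the 2 x 2 table printed earlier and the Yates continuity correction that `chisq.test()` applies by default to 2 x 2 tables; the object names are illustrative.

```r
# Observed counts from the Satisfaction x Gender table
obs <- matrix(
  c(37630, 28269,   # Female: Neutral or Dissatisfied, Satisfied
    35822, 28159),  # Male:   Neutral or Dissatisfied, Satisfied
  nrow = 2,
  dimnames = list(
    Satisfaction = c("Neutral or Dissatisfied", "Satisfied"),
    Gender       = c("Female", "Male")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Yates-corrected statistic: shrink each |O - E| by 0.5 before squaring
x2 <- sum((abs(obs - expected) - 0.5)^2 / expected)

# df = (rows - 1) * (cols - 1) = 1 for a 2 x 2 table
p_val <- pchisq(x2, df = 1, lower.tail = FALSE)

round(expected, 2)
c(X.squared = x2, p.value = p_val)
```

Every expected count is far above 5, which is why the assumption stated before the test is comfortably met.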

Discussion

References

Kaggle.com. 2022. Airline Passenger Satisfaction. [online] Available at: https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction?select=airline_passenger_satisfaction.csv [Accessed 29 May 2022].