Introduction

The dataset contains customer satisfaction information from more than 120,000 passengers. It consists of ratings on different factors such as convenience of departure/arrival times, ease of online booking, check-in service, gate location, etc. The ratings range from 0 to 5, with 5 being the best and 0 the worst. Based on the ratings, the overall customer satisfaction is recorded as “Satisfied” or “Neutral or Dissatisfied”.

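# Packages used below: dplyr (%>%, mutate, filter, group_by, summarise)
# and car (qqPlot, leveneTest)
library(dplyr)
library(car)

setwd("/Users/home/Desktop/Data Repository :: Applied Analytics")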

Air <- read.csv("airline_passenger_satisfaction.csv")

Air <- Air %>% mutate(
  Satisfaction_Score = Departure.and.Arrival.Time.Convenience +
    Ease.of.Online.Booking + Check.in.Service + Online.Boarding +
    Gate.Location + On.board.Service + Seat.Comfort + Leg.Room.Service +
    Cleanliness + Food.and.Drink + In.flight.Service +
    In.flight.Wifi.Service + In.flight.Entertainment + Baggage.Handling
)

head(Air$Satisfaction)

## [1] "Neutral or Dissatisfied" "Satisfied"              
## [3] "Satisfied"               "Satisfied"              
## [5] "Satisfied"               "Satisfied"

Air$Satisfaction <- ifelse(Air$Satisfaction == "Satisfied", 1, 0)
Air$Satisfaction <- Air$Satisfaction %>% factor(levels = c(0, 1),
                                                labels = c("Neutral or Dissatisfied", "Satisfied"))

Air$Gender <- ifelse(Air$Gender == "Male", 1, 0)
Air$Gender <- Air$Gender %>% factor(levels = c(0, 1),
                                    labels = c("Female", "Male"))

tab <- table(Air$Satisfaction,Air$Gender)
tab %>% addmargins()

##                          
##                           Female   Male    Sum
##   Neutral or Dissatisfied  37630  35822  73452
##   Satisfied                28269  28159  56428
##   Sum                      65899  63981 129880

tab2 <- tab %>%  prop.table(margin=2) 
tab2

##                          
##                              Female      Male
##   Neutral or Dissatisfied 0.5710254 0.5598850
##   Satisfied               0.4289746 0.4401150

Problem Statement

The aim of the analysis is to determine whether there is an association between gender and the satisfaction level of the passengers.

We also test whether the total Satisfaction Score differs between the two genders (using the 0.05 level of significance).

To address these questions we use hypothesis tests.

Data

The data is an open dataset collected from: Kaggle.com. 2022. Airline Passenger Satisfaction. [online] Available at: https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction?select=airline_passenger_satisfaction.csv [Accessed 29 May 2022].

Data Cont.

The two important variables in the data are Gender and the overall Satisfaction, alongside the individual rating factors that contribute to the overall satisfaction level.

We have converted both Gender and Satisfaction into factors, coding Gender as 0 for “Female” and 1 for “Male”, and Satisfaction as 0 for “Neutral or Dissatisfied” and 1 for “Satisfied”.
The ratings are unitless scores on a 0 to 5 scale.
Before continuing with the analysis, we also create a new column, Satisfaction Score, using mutate() from the dplyr package. It is the sum of the 14 individual ratings and gives each passenger a numerical overall score.
We then form a table of the satisfied and neutral/dissatisfied reviews across the genders, along with the row and column totals.
The table is then converted to the proportion of each satisfaction level within each gender and is represented as a barplot for easier comparison.
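
The proportion table and barplot described above can be sketched as follows. This is only a minimal sketch: it rebuilds the contingency table from the counts printed in the Introduction instead of reading the CSV, and the object names `counts` and `props` are illustrative, not taken from the original analysis.

```r
# Rebuild the Satisfaction x Gender table from the counts shown earlier
# (matrix() fills column by column, so each pair is one gender)
counts <- matrix(
  c(37630, 28269,   # Female: Neutral or Dissatisfied, Satisfied
    35822, 28159),  # Male:   Neutral or Dissatisfied, Satisfied
  nrow = 2,
  dimnames = list(
    Satisfaction = c("Neutral or Dissatisfied", "Satisfied"),
    Gender = c("Female", "Male")
  )
)

# Column proportions: share of each satisfaction level within each gender
props <- prop.table(counts, margin = 2)

# Side-by-side bars, one group of bars per gender
barplot(props, beside = TRUE, legend.text = rownames(props),
        ylim = c(0, 1), xlab = "Gender", ylab = "Proportion")
```

Using `margin = 2` makes each gender's column sum to 1, so the bars compare satisfaction rates within each gender rather than raw counts.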

Descriptive Statistics and Visualisation

The mean and median for each gender group are close to each other, which suggests that the scores are roughly symmetric and may be approximately normally distributed.
Although the boxplot shows some outliers, they are not extreme enough to affect our analysis, so we have not omitted them or altered the data to remove them.

boxplot(Satisfaction_Score~Gender, data = Air, ylab = "Satisfaction Score", xlab= "Gender")

Descriptive Statistics Cont.

We use the knitr::kable function to print nice HTML tables.

Air %>% group_by(Gender) %>% summarise(Min = min(Satisfaction_Score,na.rm = TRUE),
                                         Q1 = quantile(Satisfaction_Score,probs = .25,na.rm = TRUE),
                                         Median = median(Satisfaction_Score, na.rm = TRUE),
                                         Q3 = quantile(Satisfaction_Score,probs = .75,na.rm = TRUE),
                                         Max = max(Satisfaction_Score,na.rm = TRUE),
                                         Mean = mean(Satisfaction_Score, na.rm = TRUE),
                                         SD = sd(Satisfaction_Score, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Satisfaction_Score))) -> table1
knitr::kable(table1)

Gender	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
Female	15	39	45	52	70	45.29239	9.282887	65899	0
Male	15	39	46	52	70	45.46565	9.244143	63981	0

Hypothesis Testing Cont.

The null hypothesis is that the mean total Satisfaction Score is equal for both genders.
The alternative hypothesis is that the mean total Satisfaction Score differs between the genders. Here μ1 and μ2 denote the population mean scores of the two gender groups.

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 \ne \mu_2\]

Hypothesis Testing

We select a two-sample t-test because we want to compare the mean total Satisfaction Score of two independent groups. Before conducting the two-sample t-test, we need to check the normality and homogeneity of variance assumptions.
The Q-Q plots show data with “fat tails”: compared with the normal distribution, more data lie at the extremes of the distribution and less in the center. In terms of quantiles, the lowest sample quantiles are well below the corresponding theoretical quantiles and the highest sample quantiles are above them.
We should therefore be cautious about assuming normality for either gender. Fortunately, because both samples are very large (n > 30 in each group), the central limit theorem means the t-test can still be applied.

Air_Male <- Air %>% filter(Gender == "Male") 
Air_Male$Satisfaction_Score %>% qqPlot(dist="norm")

## [1] 8180 6309

Air_Female <- Air %>% filter(Gender == "Female") 
Air_Female$Satisfaction_Score %>% qqPlot(dist="norm")

## [1] 10857  7090

Homogeneity of variances

leveneTest(Satisfaction_Score ~ Gender, data = Air)

As p > 0.05 for Levene's test, we fail to reject the hypothesis of equal variances and treat the population variances as homogeneous. Now we can apply the two-sample t-test.

result<- t.test( 
  Satisfaction_Score ~ Gender, 
  data = Air, 
  var.equal = TRUE, 
  alternative = "two.sided" 
) 
result

## 
##  Two Sample t-test
## 
## data:  Satisfaction_Score by Gender
## t = -3.3699, df = 129878, p-value = 0.0007521
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -0.27404104 -0.07249303
## sample estimates:
## mean in group Female   mean in group Male 
##             45.29239             45.46565

result$p.value

## [1] 0.0007521196

result$conf.int

## [1] -0.27404104 -0.07249303
## attr(,"conf.level")
## [1] 0.95

Our decision is to reject H0: μ1 = μ2, as p < .05 and the 95% CI of the estimated population difference in means, [-0.27404104, -0.07249303], does not capture H0: μ1 - μ2 = 0. The results of the two-sample t-test were therefore statistically significant: the mean total Satisfaction Score of female and male passengers differed significantly.
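
As a cross-check, the reported t statistic, p-value and confidence interval can be reproduced by hand from the group summary statistics in the descriptive table. This is a hedged sketch of the standard pooled-variance formulas, not the original analysis code; the variable names (`n1`, `m1`, `s1`, and so on) are illustrative.

```r
# Summary statistics copied from the descriptive table
n1 <- 65899; m1 <- 45.29239; s1 <- 9.282887   # Female
n2 <- 63981; m2 <- 45.46565; s2 <- 9.244143   # Male

# Pooled standard deviation under the equal-variance assumption
df <- n1 + n2 - 2
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)

# Standard error of the difference in means (Female minus Male)
se <- sp * sqrt(1 / n1 + 1 / n2)

t_stat  <- (m1 - m2) / se
p_value <- 2 * pt(-abs(t_stat), df)                  # two-sided p-value
ci      <- (m1 - m2) + c(-1, 1) * qt(0.975, df) * se # 95% CI

print(c(t = t_stat, df = df, p = p_value))
print(ci)
```

Up to rounding of the inputs, these values should agree with the `t.test()` output above, since `var.equal = TRUE` uses the same pooled-variance formula.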

Second Hypothesis Test

H0: There is no association between gender and Satisfaction rating.
HA: There is an association between gender and Satisfaction rating.

Assumption: No more than 25% of expected cell counts are below 5

chi2 <- chisq.test(tab) 
chi2

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 16.352, df = 1, p-value = 5.26e-05

chi2$expected

##                          
##                             Female     Male
##   Neutral or Dissatisfied 37268.35 36183.65
##   Satisfied               28630.65 27797.35

There are no cells with expected counts below 5.

Since the p-value is less than the significance level of 0.05, we reject the null hypothesis and conclude that gender and satisfaction rating are associated.
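
To make the calculation behind the test explicit, the expected counts and the Yates-corrected statistic can be recomputed from the observed table. This is a minimal sketch that re-enters the counts printed in the Introduction; the names `counts`, `expected`, `x2` and `p_value` are illustrative.

```r
# Observed counts from the contingency table shown in the Introduction
counts <- matrix(
  c(37630, 28269, 35822, 28159),
  nrow = 2,
  dimnames = list(
    Satisfaction = c("Neutral or Dissatisfied", "Satisfied"),
    Gender = c("Female", "Male")
  )
)

# Expected count for each cell = row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic with Yates' continuity correction (2 x 2 table, df = 1)
x2 <- sum((abs(counts - expected) - 0.5)^2 / expected)
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

print(round(expected, 2))
print(c(X_squared = x2, p = p_value))
```

The result should match `chisq.test(tab)`, which applies the same continuity correction by default for 2 x 2 tables.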

Discussion

The main findings of the analysis are that the mean total satisfaction score differs between the genders and that there is an association between gender and the satisfaction rating.
A two-sample t-test was used to test for a significant difference between the mean satisfaction scores of males and females. While the scores for both males and females exhibited evidence of non-normality upon inspection of the normal Q-Q plots, the central limit theorem ensured that the t-test could be applied due to the large sample size in each group. Levene's test of homogeneity of variance indicated that equal variance could be assumed. The results of the two-sample t-test assuming equal variance found a statistically significant difference between the mean satisfaction scores of males and females, t = -3.3699, df = 129878, p-value = 0.0007521, 95% CI for the difference in means [-0.27404104, -0.07249303]. The results of the investigation suggest that males have significantly higher average satisfaction scores than females, although the estimated difference is small (about 0.17 points on a score with a maximum of 70).
It is possible that the passengers' responses were biased. For example, a pre-flight incident could have influenced their ratings, so we would need to follow up with a proper random sample and repeat the Chi-square test of association before being confident in our conclusions.
A Chi-square test of association was used to test for a statistically significant association between gender and the satisfaction rating. The results of the test found a statistically significant association, χ2 = 16.352, p < .001. The results of this study suggest that there is a relationship between passengers' gender and the satisfaction rating they give.

Airline Passenger Satisfaction Across Genders

Applied data project 2 (Assignment 2)
