Name - Munish Sabherwal Student ID - s3915526
Last updated: 29 May, 2022
The dataset contains customer satisfaction information from more than 120,000 passengers. It consists of ratings on different factors such as convenience of departure/arrival times, ease of online booking, check-in service, gate location, etc. The ratings range from 0 to 5, with 5 being the best and 0 the worst. Based on these ratings, each passenger's overall satisfaction is recorded as “Satisfied” or “Neutral or Dissatisfied”.
library(dplyr)  # %>%, mutate(), group_by(), summarise(), filter()
library(car)    # qqPlot()

setwd("/Users/home/Desktop/Data Repository :: Applied Analytics")
Air <- read.csv("airline_passenger_satisfaction.csv")

# Total Satisfaction Score: sum of the 14 individual ratings
Air <- Air %>% mutate(
  Satisfaction_Score = Departure.and.Arrival.Time.Convenience +
    Ease.of.Online.Booking + Check.in.Service + Online.Boarding +
    Gate.Location + On.board.Service + Seat.Comfort + Leg.Room.Service +
    Cleanliness + Food.and.Drink + In.flight.Service +
    In.flight.Wifi.Service + In.flight.Entertainment + Baggage.Handling
)
head(Air$Satisfaction)
## [1] "Neutral or Dissatisfied" "Satisfied"
## [3] "Satisfied"               "Satisfied"
## [5] "Satisfied"               "Satisfied"
Air$Satisfaction <- ifelse(Air$Satisfaction=="Satisfied",1,0)
Air$Satisfaction <- Air$Satisfaction %>% factor(levels = c(0,1),
labels = c("Neutral or Dissatisfied","Satisfied"))
Air$Gender <- ifelse(Air$Gender == "Male",1,0)
Air$Gender <- Air$Gender %>% factor(levels = c(0,1),
labels = c("Female","Male"))
tab <- table(Air$Satisfaction,Air$Gender)
tab %>% addmargins()
##
##                           Female   Male    Sum
##   Neutral or Dissatisfied  37630  35822  73452
##   Satisfied                28269  28159  56428
##   Sum                      65899  63981 129880
##
##                              Female      Male
##   Neutral or Dissatisfied 0.5710254 0.5598850
##   Satisfied               0.4289746 0.4401150
The aim of the analysis is to determine whether there is an association between gender and the satisfaction level of the passengers.
We also test whether the two genders differ in their mean total Satisfaction Score (using the 0.05 level of significance).
Both questions are addressed with hypothesis tests.
The data is an open dataset obtained from Kaggle: Kaggle.com. 2022. Airline Passenger Satisfaction. [online] Available at: https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction?select=airline_passenger_satisfaction.csv [Accessed 29 May 2022].
The two key variables in the data are Gender and overall Satisfaction, alongside the individual rating factors that feed into the overall satisfaction level.
We have converted both Gender and Satisfaction into factors: Gender has levels 0 for “Female” and 1 for “Male”, and Satisfaction has levels 0 for “Neutral or Dissatisfied” and 1 for “Satisfied”.
The ratings are integer scores from 0 to 5 (an ordinal scale) and as such have no unit of measurement.
Before continuing with the analysis, besides converting Gender and Satisfaction to factors, we also create a new column, Satisfaction_Score, using mutate() from the dplyr package. It sums the 14 individual ratings, giving each passenger a single numerical total score.
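As an alternative to chaining fourteen `+` terms, the same kind of total can be sketched with `rowSums()` over a vector of column names. The toy data frame below and its three rating columns are hypothetical stand-ins for the real dataset, and this is only an illustration of the idea, not the code used in the analysis.

```r
# Hypothetical toy data: three rating columns standing in for the 14 in the dataset
ratings <- data.frame(
  Seat.Comfort   = c(5, 3, 0),
  Cleanliness    = c(4, 3, 1),
  Food.and.Drink = c(5, 2, 2)
)

# Names of the columns to add up (the real data would list all 14 ratings here)
rating_cols <- c("Seat.Comfort", "Cleanliness", "Food.and.Drink")

# Row-wise total, equivalent to Seat.Comfort + Cleanliness + Food.and.Drink
ratings$Satisfaction_Score <- rowSums(ratings[, rating_cols])

print(ratings)
```

Keeping the column names in one vector makes it easier to see, and to change, which ratings enter the score.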
We then form a table of satisfied and neutral/dissatisfied passengers across the genders, along with the row and column totals.
The table is then converted to the proportion of each satisfaction level within each gender and shown as a barplot for easier comparison.
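The proportion table and barplot can be reproduced directly from the counts reported above. The following is a minimal sketch in base R; it rebuilds the table from the printed counts rather than from the raw data, and the object name `tab_prop` and the plot labels are assumptions made for illustration.

```r
# Counts copied from the contingency table reported above
tab <- matrix(
  c(37630, 28269,   # Female: Neutral or Dissatisfied, Satisfied
    35822, 28159),  # Male:   Neutral or Dissatisfied, Satisfied
  nrow = 2,
  dimnames = list(
    Satisfaction = c("Neutral or Dissatisfied", "Satisfied"),
    Gender       = c("Female", "Male")
  )
)

# Row and column totals
print(addmargins(tab))

# Proportion of each satisfaction level within each gender (columns sum to 1)
tab_prop <- prop.table(tab, margin = 2)
print(tab_prop)

# Side-by-side bars for comparing the two genders
barplot(tab_prop, beside = TRUE, legend.text = rownames(tab_prop),
        ylab = "Proportion", ylim = c(0, 1))
```

Using `margin = 2` normalises within each gender, which is what makes the two columns directly comparable despite the slightly different group sizes.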
The mean and median of the Satisfaction Score are close to each other within each gender group, which suggests that the distributions are roughly symmetric and may be approximately normal.
Although the boxplot shows some outliers, they are not extreme enough to affect our analysis, so we have neither omitted them nor altered the data to remove them.
# knitr::kable() is used to print nice HTML tables.
Air %>% group_by(Gender) %>% summarise(Min = min(Satisfaction_Score, na.rm = TRUE),
  Q1 = quantile(Satisfaction_Score, probs = .25, na.rm = TRUE),
  Median = median(Satisfaction_Score, na.rm = TRUE),
  Q3 = quantile(Satisfaction_Score, probs = .75, na.rm = TRUE),
  Max = max(Satisfaction_Score, na.rm = TRUE),
  Mean = mean(Satisfaction_Score, na.rm = TRUE),
  SD = sd(Satisfaction_Score, na.rm = TRUE),
  n = n(),
  Missing = sum(is.na(Satisfaction_Score))) -> table1
knitr::kable(table1)

| Gender | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Female | 15 | 39 | 45 | 52 | 70 | 45.29239 | 9.282887 | 65899 | 0 |
| Male | 15 | 39 | 46 | 52 | 70 | 45.46565 | 9.244143 | 63981 | 0 |
\[H_0: \mu_1 = \mu_2 \]
\[H_A: \mu_1 \ne \mu_2\]
\[S = \sum^n_{i = 1}d^2_i\]
## [1] 8180 6309
Air_Female <- Air %>% filter(Gender == "Female")
Air_Female$Satisfaction_Score %>% qqPlot(dist="norm")
## [1] 10857 7090
Homogeneity of variances
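As p > 0.05 for Levene's test, we do not reject the hypothesis of equal variances and treat the population variances as homogeneous. We can therefore apply the two-sample t-test with equal variances assumed.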
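For readers who want to see what a homogeneity-of-variance check of this kind computes, here is a minimal base R sketch. It uses the median-centred form of Levene's test (also known as the Brown-Forsythe test), which is a one-way ANOVA on the absolute deviations from each group's median. The data are simulated and the names `score`, `group` and `abs_dev` are hypothetical; this is not the call used in the report.

```r
set.seed(1)

# Simulated stand-in data: two groups with the same spread (not the airline data)
score <- c(rnorm(200, mean = 45, sd = 9), rnorm(200, mean = 46, sd = 9))
group <- factor(rep(c("Female", "Male"), each = 200))

# Absolute deviation of each observation from its own group median
abs_dev <- abs(score - ave(score, group, FUN = median))

# Median-centred Levene test = one-way ANOVA on the absolute deviations
levene <- anova(lm(abs_dev ~ group))
print(levene)

# p-value for the group effect; p > 0.05 means equal variances are not rejected
p_value <- levene[["Pr(>F)"]][1]
print(p_value)
```

On the real data, the same steps would be applied with `Air$Satisfaction_Score` and `Air$Gender` in place of the simulated vectors.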
result<- t.test(
Satisfaction_Score ~ Gender,
data = Air,
var.equal = TRUE,
alternative = "two.sided"
)
result
##
##  Two Sample t-test
##
## data:  Satisfaction_Score by Gender
## t = -3.3699, df = 129878, p-value = 0.0007521
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -0.27404104 -0.07249303
## sample estimates:
## mean in group Female   mean in group Male
##             45.29239             45.46565
## [1] 0.0007521196
## [1] -0.27404104 -0.07249303
## attr(,"conf.level")
## [1] 0.95
Our decision is to reject H0: μ1 = μ2, as p < .05 and the 95% CI of the estimated population mean difference [-0.27404104, -0.07249303] does not capture H0: μ1 - μ2 = 0. The results of the two-sample t-test were therefore statistically significant. This means that the mean total Satisfaction Score differed significantly between female and male passengers.
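As a cross-check, the test statistic and confidence interval can be recomputed by hand from the descriptive statistics table using the pooled-variance formulas. The sketch below uses only the rounded summary values printed in that table, so the results should come out close to, but not digit-for-digit identical with, the `t.test()` output; the variable names are chosen for illustration.

```r
# Summary statistics copied from the table of descriptive statistics
n1 <- 65899; m1 <- 45.29239; s1 <- 9.282887   # Female
n2 <- 63981; m2 <- 45.46565; s2 <- 9.244143   # Male

# Pooled standard deviation (equal variances assumed)
dof <- n1 + n2 - 2
sp  <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / dof)

# Standard error of the difference in means and the t statistic
se     <- sp * sqrt(1 / n1 + 1 / n2)
t_stat <- (m1 - m2) / se

# Two-sided p-value and 95% confidence interval for mu1 - mu2
p_value <- 2 * pt(-abs(t_stat), dof)
ci      <- (m1 - m2) + c(-1, 1) * qt(0.975, dof) * se

print(c(t = t_stat, df = dof, p = p_value, lower = ci[1], upper = ci[2]))
```

Because the interval lies entirely below zero, it agrees with the decision to reject H0 at the 0.05 level.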
H0: There is no association between gender and Satisfaction rating
HA: There is an association between gender and Satisfaction rating
Assumption: No more than 25% of expected cell counts are below 5
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 16.352, df = 1, p-value = 5.26e-05
##
##                             Female     Male
##   Neutral or Dissatisfied 37268.35 36183.65
##   Satisfied               28630.65 27797.35
There are no cells with expected counts below 5.
Since the p-value is less than the significance level of 0.05, we reject the null hypothesis and conclude that gender and satisfaction rating are not independent.
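The test and its expected counts can be reproduced from the observed counts in the contingency table. The following is a minimal sketch in base R; it rebuilds the table from the printed counts rather than from the raw data, and the object name `res` is an assumption made for illustration.

```r
# Observed counts copied from the contingency table reported earlier
tab <- matrix(
  c(37630, 28269,   # Female: Neutral or Dissatisfied, Satisfied
    35822, 28159),  # Male:   Neutral or Dissatisfied, Satisfied
  nrow = 2,
  dimnames = list(
    Satisfaction = c("Neutral or Dissatisfied", "Satisfied"),
    Gender       = c("Female", "Male")
  )
)

# Pearson chi-squared test; for a 2x2 table chisq.test() applies
# Yates' continuity correction by default
res <- chisq.test(tab)
print(res)

# Expected counts under independence: (row total * column total) / grand total
print(res$expected)

# Assumption check: share of cells with an expected count below 5
print(mean(res$expected < 5))
```

With every expected count in the tens of thousands, the assumption on expected cell counts is comfortably met.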
The main finding of the analysis is that the mean total satisfaction score differs between the genders; we also find an association between gender and the overall satisfaction rating.
A two-sample t-test was used to test for a significant difference between the mean satisfaction scores of males and females. While the scores for both males and females exhibited evidence of non-normality upon inspection of the normal Q-Q plot, the central limit theorem ensured that the t-test could be applied due to the large sample size in each group. Levene’s test of homogeneity of variance indicated that equal variance could be assumed. The results of the two-sample t-test assuming equal variance found a statistically significant difference between the mean satisfaction scores of males and females, t = -3.3699, df = 129878, p-value = 0.0007521, 95% CI for the difference in means [-0.27404104, -0.07249303]. The results of the investigation suggest that males have significantly higher average satisfaction scores than females, although the estimated difference is small (about 0.17 points, with scores ranging from 15 to 70).
It is possible that the passengers' responses were biased; for example, a pre-flight incident could have influenced their ratings. We would therefore need to follow up with a proper random sample and repeat the Chi-square test of association before being confident in our conclusion.
A Chi-square test of association was used to test for a statistically significant association between gender and passenger satisfaction rating. The results of the test found a statistically significant association, χ2 = 16.352, p < .001. The results of this study suggest that there is a relationship between passengers' gender and the satisfaction rating they give.
Kaggle.com. 2022. Airline Passenger Satisfaction. [online] Available at: https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction?select=airline_passenger_satisfaction.csv [Accessed 29 May 2022].