Final Project BAIS 462
Introduction
I have grown up traveling, catching flights with my family across the country and across the world. We love to find the best deals because we are a family of 6, and travel can get expensive. However, we also agree that comfort and safety are important when traveling. We try to find the airlines that best accommodate these needs while keeping prices low. For my final project I explored a dataset containing the results of a survey an airline gave its customers to measure their overall satisfaction and which aspects of the flight led to that result. I also used web-scraped data from the airline review website Skytrax to compare airlines and find more ratings based on several flight aspects. I wanted to see which aspects of a flight lead to overall satisfaction, and how this compares across airlines.
Data
The dataset I found on Kaggle has many variables:
Gender - male, female
Customer Type - Loyal Customer, disloyal customer
Age - ranging from 7 to 85
Class - Business, Eco Plus (Economy Plus), Eco (Economy)
Flight Distance
Many variables rated from 1-5 by the customer (5 being the best, 1 being the worst), including:
Inflight entertainment
Leg room service
Inflight service
Cleanliness
Departure and Arrival Delay (in minutes)
Overall Satisfaction - satisfied, neutral or dissatisfied
Analysis and Exploration with Flight Dataset
Questions:
Are there more male or female Loyal customers?
First, before diving too deep, I wanted to see whether this data is reliable. Is there something about loyal or disloyal customers that could cause different results in the survey?
I was surprised by these results because the female and male Loyal Customer counts are almost exactly the same (they differ by fewer than 10 customers). The overall totals for female and male flyers are large and not equal, so I did not expect the loyal counts to be this close.
What is the distribution of ages of flyers?
The distribution looks roughly bell shaped. The largest group of flyers is in the 40-49 age range. My assumption is that most middle-aged flyers are on business or work trips. There are also many younger flyers, who I would assume are on vacations or family trips. Also, 17 flyers have no age recorded.
Are all the survey answers bell shaped? Do flyers feel more strongly, good or bad, about some aspects than others?
We can see a lot from this. Some distributions are bell shaped, and the rest are mostly left skewed. This is good because a 5 on the scale was the best. Most graphs have very few values at 0, so I might take them out for a better visual. Online boarding had a lot of 0s, so maybe this question did not apply to some passengers.
Logistic Regression:
spc_tbl_ [103,904 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ...1 : num [1:103904] 0 1 2 3 4 5 6 7 8 9 ...
$ id : num [1:103904] 70172 5047 110028 24026 119299 ...
$ Gender : chr [1:103904] "Male" "Male" "Female" "Female" ...
$ Customer Type : chr [1:103904] "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
$ Age : num [1:103904] 13 25 26 25 61 26 47 52 41 20 ...
$ Type of Travel : chr [1:103904] "Personal Travel" "Business travel" "Business travel" "Business travel" ...
$ Class : chr [1:103904] "Eco Plus" "Business" "Business" "Business" ...
$ Flight Distance : num [1:103904] 460 235 1142 562 214 ...
$ Inflight wifi service : num [1:103904] 3 3 2 2 3 3 2 4 1 3 ...
$ Departure/Arrival time convenient: num [1:103904] 4 2 2 5 3 4 4 3 2 3 ...
$ Ease of Online booking : num [1:103904] 3 3 2 5 3 2 2 4 2 3 ...
$ Gate location : num [1:103904] 1 3 2 5 3 1 3 4 2 4 ...
$ Food and drink : num [1:103904] 5 1 5 2 4 1 2 5 4 2 ...
$ Online boarding : num [1:103904] 3 3 5 2 5 2 2 5 3 3 ...
$ Seat comfort : num [1:103904] 5 1 5 2 5 1 2 5 3 3 ...
$ Inflight entertainment : num [1:103904] 5 1 5 2 3 1 2 5 1 2 ...
$ On-board service : num [1:103904] 4 1 4 2 3 3 3 5 1 2 ...
$ Leg room service : num [1:103904] 3 5 3 5 4 4 3 5 2 3 ...
$ Baggage handling : num [1:103904] 4 3 4 3 4 4 4 5 1 4 ...
$ Checkin service : num [1:103904] 4 1 4 1 3 4 3 4 4 4 ...
$ Inflight service : num [1:103904] 5 4 4 4 3 4 5 5 1 3 ...
$ Cleanliness : num [1:103904] 5 1 5 2 3 1 2 4 2 2 ...
$ Departure Delay in Minutes : num [1:103904] 25 1 0 11 0 0 9 4 0 0 ...
$ Arrival Delay in Minutes : num [1:103904] 18 6 0 9 0 0 23 0 0 0 ...
$ satisfaction : chr [1:103904] "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...
$ AgeGroup : Factor w/ 9 levels "0-9","10-19",..: 2 3 3 3 7 3 6 6 5 3 ...
Call:
glm(formula = satisfaction ~ `Flight Distance` + `Inflight wifi service` +
`Ease of Online booking` + `Gate location` + `Food and drink` +
`Online boarding` + `Seat comfort` + `Inflight entertainment` +
`Inflight service` + `Leg room service` + `Baggage handling` +
`Checkin service` + `Inflight service` + Cleanliness + `Departure Delay in Minutes` +
`Arrival Delay in Minutes` + Gender + `Customer Type` + `Type of Travel` +
Class, family = binomial, data = flight_data)
Coefficients:
                                  Estimate Std. Error  z value Pr(>|z|)
(Intercept)                     -7.771e+00  7.422e-02 -104.702  < 2e-16 ***
`Flight Distance`               -1.030e-05  1.117e-05   -0.922  0.35642
`Inflight wifi service`          3.927e-01  1.128e-02   34.821  < 2e-16 ***
`Ease of Online booking`        -1.916e-01  1.077e-02  -17.794  < 2e-16 ***
`Gate location`                 -2.201e-02  8.648e-03   -2.545  0.01094 *
`Food and drink`                -5.301e-02  1.046e-02   -5.068 4.01e-07 ***
`Online boarding`                6.189e-01  1.009e-02   61.315  < 2e-16 ***
`Seat comfort`                   3.611e-02  1.104e-02    3.270  0.00107 **
`Inflight entertainment`         1.764e-01  1.357e-02   12.998  < 2e-16 ***
`Inflight service`               1.928e-01  1.171e-02   16.464  < 2e-16 ***
`Leg room service`               2.739e-01  8.432e-03   32.484  < 2e-16 ***
`Baggage handling`               1.882e-01  1.121e-02   16.786  < 2e-16 ***
`Checkin service`                3.396e-01  8.445e-03   40.217  < 2e-16 ***
Cleanliness                      1.890e-01  1.183e-02   15.969  < 2e-16 ***
`Departure Delay in Minutes`     4.496e-03  9.795e-04    4.590 4.43e-06 ***
`Arrival Delay in Minutes`      -9.136e-03  9.659e-04   -9.459  < 2e-16 ***
GenderMale                       3.239e-02  1.931e-02    1.678  0.09340 .
`Customer Type`Loyal Customer    1.852e+00  2.814e-02   65.821  < 2e-16 ***
`Type of Travel`Personal Travel -2.713e+00  3.023e-02  -89.758  < 2e-16 ***
ClassEco                        -7.892e-01  2.534e-02  -31.138  < 2e-16 ***
ClassEco Plus                   -9.185e-01  4.117e-02  -22.311  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 141768 on 103593 degrees of freedom
Residual deviance: 70370 on 103573 degrees of freedom
AIC: 70412
Number of Fisher Scoring iterations: 5
         Actual
Predicted     0     1
        0 15777  2311
        1  1677 11314
[1] "Accuracy: 87.17 %"
This logistic regression model has an accuracy of 87.17%.
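The accuracy figure can be checked by hand from the confusion matrix. The sketch below is written in Python rather than the R used for the analysis, and it assumes that class 1 means "satisfied" (the output does not say which label is coded as 1). The precision and recall lines are an extra illustration, not something the original analysis reported.

```python
# Counts copied from the confusion matrix above.
# Keys are (predicted class, actual class).
confusion = {
    (0, 0): 15777,  # predicted 0, actually 0
    (0, 1): 2311,   # predicted 0, actually 1
    (1, 0): 1677,   # predicted 1, actually 0
    (1, 1): 11314,  # predicted 1, actually 1
}

total = sum(confusion.values())
correct = confusion[(0, 0)] + confusion[(1, 1)]
accuracy = correct / total

# Assumption: class 1 is "satisfied". Precision and recall for that class.
precision = confusion[(1, 1)] / (confusion[(1, 1)] + confusion[(1, 0)])
recall = confusion[(1, 1)] / (confusion[(1, 1)] + confusion[(0, 1)])

print(f"Observations scored: {total}")
print(f"Accuracy: {accuracy * 100:.2f} %")
print(f"Precision (class 1): {precision:.3f}")
print(f"Recall (class 1): {recall:.3f}")
```

Accuracy alone treats both kinds of error the same, so looking at precision and recall as well shows whether the model misses satisfied passengers more often than it mislabels dissatisfied ones.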
Nearly every predictor is statistically significant. Among the survey ratings, the strongest (largest absolute z values) are: Inflight wifi service, Ease of Online booking, Online boarding, Leg room service, and Checkin service.
Higher Ease of Online booking ratings and longer arrival delays are associated with lower odds of satisfaction.
Business Class passengers and Loyal Customers are more likely to be satisfied. This makes sense: loyal customers keep flying with this airline because they enjoy it, and business class passengers receive better accommodations, so they are more likely to be satisfied than other passengers.
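Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The following is a minimal Python sketch (the analysis itself was done in R) using a few estimates copied from the coefficient table. The dictionary labels are my own wording, and the interpretation assumes the model predicts the "satisfied" outcome.

```python
import math

# Estimates copied from the glm output above (log-odds scale).
coefficients = {
    "Customer Type: Loyal Customer": 1.852,
    "Type of Travel: Personal Travel": -2.713,
    "Class: Eco (vs Business)": -0.7892,
    "Class: Eco Plus (vs Business)": -0.9185,
    "Online boarding (per rating point)": 0.6189,
}

# exp(coefficient) is the multiplicative change in the odds of the outcome,
# holding the other predictors fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "higher" if ratio > 1 else "lower"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} odds)")
```

An odds ratio above 1 means higher odds of the outcome and a ratio below 1 means lower odds, which matches the signs discussed here: positive for Loyal Customers, negative for Eco and Eco Plus relative to Business.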
I would recommend improving inflight services and minimizing delays where possible. Delays are hard to deal with because they are often caused by weather, which is entirely out of human control. When that is the case, hopefully the airline can increase the quality of the inflight service so passengers do not get too crabby or upset about the delay.
Compare to Skytrax Webscraped Data:
# A tibble: 1,717 × 3
# Groups: airline [1]
airline word n
<chr> <chr> <int>
1 Aer Lingus flight 127
2 Aer Lingus aer 94
3 Aer Lingus lingus 93
4 Aer Lingus verified 70
5 Aer Lingus dublin 60
6 Aer Lingus service 55
7 Aer Lingus trip 49
8 Aer Lingus airline 48
9 Aer Lingus airport 42
10 Aer Lingus told 40
# ℹ 1,707 more rows
# A tibble: 1,682 × 3
# Groups: airline [1]
airline word n
<chr> <chr> <int>
1 United flight 220
2 United united 126
3 United verified 70
4 United told 53
5 United time 52
6 United hours 51
7 United trip 51
8 United airline 47
9 United service 41
10 United airport 39
# ℹ 1,672 more rows
There is not much to compare with this visual. The positive and negative sentiment scores for these two airlines differ very little. I think we should dive deeper by seeing which words are used most frequently for each airline.
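As an illustration of what counting word frequencies per airline involves, here is a minimal Python sketch (the analysis itself was done in R). The three reviews and the stop-word list are made up for the example and are not taken from the scraped Skytrax data.

```python
import re
from collections import Counter

# Hypothetical reviews, for illustration only.
reviews = [
    ("United", "The flight was delayed and the staff told us nothing."),
    ("United", "Delayed again, but the crew was friendly."),
    ("Aer Lingus", "Great service on the flight to Dublin."),
]

# Hypothetical stop-word list; a real analysis would use a fuller one.
stop_words = {"the", "was", "and", "us", "but", "on", "to", "a", "again"}


def word_counts(reviews, stop_words):
    """Return a Counter of non-stop words for each airline."""
    counts = {}
    for airline, text in reviews:
        # Lowercase the text and split it into words.
        words = re.findall(r"[a-z']+", text.lower())
        kept = [w for w in words if w not in stop_words]
        counts.setdefault(airline, Counter()).update(kept)
    return counts


counts = word_counts(reviews, stop_words)
for airline, counter in counts.items():
    print(airline, counter.most_common(3))
```

Removing common words first matters because otherwise words like "the" dominate every airline's list and hide the words that actually differ.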
# A tibble: 4 × 3
# Groups: airline [2]
airline sentiment n
<chr> <chr> <int>
1 Aer Lingus negative 194
2 United negative 178
3 Aer Lingus positive 106
4 United positive 87
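Because the two airlines have different totals, raw counts are hard to compare directly. A rough way to put them on the same scale is to compute the share of sentiment-bearing words that are negative. The sketch below does this in Python (not the R used for the analysis) with the counts from the table above; the variable names are my own.

```python
# Counts copied from the sentiment table above.
sentiment_counts = {
    "Aer Lingus": {"negative": 194, "positive": 106},
    "United": {"negative": 178, "positive": 87},
}

# Share of sentiment words that are negative, per airline.
negative_share = {
    airline: c["negative"] / (c["negative"] + c["positive"])
    for airline, c in sentiment_counts.items()
}

for airline, share in negative_share.items():
    print(f"{airline}: {share * 100:.1f}% of sentiment words are negative")
```

On these counts, roughly two thirds of the sentiment words are negative for both airlines, with United's share slightly higher.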
I chose to include words used 5 or more times because I scraped 70 reviews for each airline. The fact that “delayed” was included in 37 of these 70 reviews for United Airlines is interesting. Based on the logistic regression model for the flight survey data from Kaggle, United Airlines should look into why so many of their flight reviews include “delay” and also increase the quality of their inflight service. With better service, their reviews might become more positive even with these delays.
United Airlines reviews use more varied and more negative words than Aer Lingus reviews. I found it interesting that the word “delayed” was used 31 more times in United reviews than in Aer Lingus reviews. I would say that United reviews express more negative emotion than Aer Lingus reviews.
These visuals are inverses of each other, but as we can see, significantly more negative words and emotions are used in these airline reviews.
United Airlines does not have a single month of the year in which the positive score is higher than the negative score.
We can see that in the summer months, July, August, and September, there are a lot more reviews for United, and these are all more negative than positive. I think this is due to the high travel period. I think both airlines see more negative comments and emotions because most people who write about airlines are complaining. Not many people applaud or compliment airlines. As humans, we just want to get to our destination safely and comfortably so our vacation or trip can begin. Also, many people are already crabby when they board their flight home, simply because their vacation or trip is over. They had a great time and do not want it to end, so anything going a little bit wrong during the flight can leave a bad taste in their mouth and lead them to write a negative review of the airline. Also, if a flight is delayed, people are forced to sit in an airport with nothing to do, so many have extra time to sit down and complain about the airline. If a flight goes well, people just go on with their life happily and do not express that happiness in a review.
Conclusion:
In conclusion, these datasets suggest that many aspects of travel and flying affect a passenger’s overall satisfaction with a flight. Personal aspects like gender, age, type of customer, and experience can shape a person’s emotions towards flights. Also, inflight services and other things related to the specific airline, like baggage handling, cleanliness, food and drink, and wifi service, can affect a passenger’s emotional response to a survey or review about the flight and airline.
I think airlines should focus on how they can provide better service when a flight is delayed. The delay itself is mostly out of their hands (when it comes to weather), but the way they respond and treat their passengers is up to them.
Finally, airlines should pay attention to certain types of flyers, like their loyal customers and flyers in different classes. Treatment of these groups is different, so airlines should be aware of this and try to attend to the needs of different flyers throughout the entire flight and travel process.