Final Project BAIS 462

Author

Colleen Malloy

Introduction

I grew up traveling, catching flights with my family across the country and around the world. We are a family of six, so travel gets expensive and we love to find the best deals. However, we also agree that comfort and safety matter when traveling. We try to find the airlines that best meet these needs while keeping prices low. For my final project, I explored a dataset containing the results of a survey that an airline gave its customers to measure their overall satisfaction and the aspects of the flight that led to it. I also used data web-scraped from the airline review website Skytrax to compare airline ratings across several aspects of a flight. I wanted to see which aspects of a flight lead to overall satisfaction and how this compares across airlines.

Data

The dataset I found on Kaggle has many variables:

Gender - male, female

Customer Type - Loyal Customer, disloyal customer

Age - ranging from 7 to 85

Class - Business, Eco Plus (Economy Plus), Eco (Economy)

Flight Distance

Many variables rated by the customer from 1-5 (5 being the best, 1 being the worst), including:

  • Inflight entertainment

  • Leg room service

  • Inflight service

  • Cleanliness

Departure and Arrival Delay - in minutes

Overall Satisfaction - "satisfied" or "neutral or dissatisfied"

Analysis and Exploration with Flight Dataset

Questions:

Are there more male or female Loyal customers?

First, before diving too deep, I wanted to check whether the data is reliable. Is there something about loyal or disloyal customers that could cause different results in the survey?

I was surprised by these results because the counts of female and male Loyal Customers are almost exactly the same (they differ by fewer than 10 customers). The overall totals for female and male customers are large and not equal, so I did not expect the Loyal Customer counts to be this close.

What is the distribution of ages of flyers?

The distribution looks bell-shaped. The largest group of flyers is in the 40-49 age range. My assumption is that most middle-aged flyers are on business or work trips. There are also many younger flyers, who I would assume are on vacations or family trips. Also, there are 17 flyers without ages recorded.

Are all the survey answers bell-shaped? Do customers feel more strongly, good or bad, about some aspects than others?

We can see a lot from this. Some distributions are bell-shaped, and most of the rest are left-skewed. This is good because a 5 on the scale is the best rating. Most graphs have very few values of 0, so I might take those out for a cleaner visual. Online boarding had a lot of 0s, so maybe this question did not apply to some passengers.

Logistic Regression:

spc_tbl_ [103,904 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ...1                             : num [1:103904] 0 1 2 3 4 5 6 7 8 9 ...
 $ id                               : num [1:103904] 70172 5047 110028 24026 119299 ...
 $ Gender                           : chr [1:103904] "Male" "Male" "Female" "Female" ...
 $ Customer Type                    : chr [1:103904] "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
 $ Age                              : num [1:103904] 13 25 26 25 61 26 47 52 41 20 ...
 $ Type of Travel                   : chr [1:103904] "Personal Travel" "Business travel" "Business travel" "Business travel" ...
 $ Class                            : chr [1:103904] "Eco Plus" "Business" "Business" "Business" ...
 $ Flight Distance                  : num [1:103904] 460 235 1142 562 214 ...
 $ Inflight wifi service            : num [1:103904] 3 3 2 2 3 3 2 4 1 3 ...
 $ Departure/Arrival time convenient: num [1:103904] 4 2 2 5 3 4 4 3 2 3 ...
 $ Ease of Online booking           : num [1:103904] 3 3 2 5 3 2 2 4 2 3 ...
 $ Gate location                    : num [1:103904] 1 3 2 5 3 1 3 4 2 4 ...
 $ Food and drink                   : num [1:103904] 5 1 5 2 4 1 2 5 4 2 ...
 $ Online boarding                  : num [1:103904] 3 3 5 2 5 2 2 5 3 3 ...
 $ Seat comfort                     : num [1:103904] 5 1 5 2 5 1 2 5 3 3 ...
 $ Inflight entertainment           : num [1:103904] 5 1 5 2 3 1 2 5 1 2 ...
 $ On-board service                 : num [1:103904] 4 1 4 2 3 3 3 5 1 2 ...
 $ Leg room service                 : num [1:103904] 3 5 3 5 4 4 3 5 2 3 ...
 $ Baggage handling                 : num [1:103904] 4 3 4 3 4 4 4 5 1 4 ...
 $ Checkin service                  : num [1:103904] 4 1 4 1 3 4 3 4 4 4 ...
 $ Inflight service                 : num [1:103904] 5 4 4 4 3 4 5 5 1 3 ...
 $ Cleanliness                      : num [1:103904] 5 1 5 2 3 1 2 4 2 2 ...
 $ Departure Delay in Minutes       : num [1:103904] 25 1 0 11 0 0 9 4 0 0 ...
 $ Arrival Delay in Minutes         : num [1:103904] 18 6 0 9 0 0 23 0 0 0 ...
 $ satisfaction                     : chr [1:103904] "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...
 $ AgeGroup                         : Factor w/ 9 levels "0-9","10-19",..: 2 3 3 3 7 3 6 6 5 3 ...

Call:
glm(formula = satisfaction ~ `Flight Distance` + `Inflight wifi service` + 
    `Ease of Online booking` + `Gate location` + `Food and drink` + 
    `Online boarding` + `Seat comfort` + `Inflight entertainment` + 
    `Inflight service` + `Leg room service` + `Baggage handling` + 
    `Checkin service` + `Inflight service` + Cleanliness + `Departure Delay in Minutes` + 
    `Arrival Delay in Minutes` + Gender + `Customer Type` + `Type of Travel` + 
    Class, family = binomial, data = flight_data)

Coefficients:
                                  Estimate Std. Error  z value Pr(>|z|)    
(Intercept)                     -7.771e+00  7.422e-02 -104.702  < 2e-16 ***
`Flight Distance`               -1.030e-05  1.117e-05   -0.922  0.35642    
`Inflight wifi service`          3.927e-01  1.128e-02   34.821  < 2e-16 ***
`Ease of Online booking`        -1.916e-01  1.077e-02  -17.794  < 2e-16 ***
`Gate location`                 -2.201e-02  8.648e-03   -2.545  0.01094 *  
`Food and drink`                -5.301e-02  1.046e-02   -5.068 4.01e-07 ***
`Online boarding`                6.189e-01  1.009e-02   61.315  < 2e-16 ***
`Seat comfort`                   3.611e-02  1.104e-02    3.270  0.00107 ** 
`Inflight entertainment`         1.764e-01  1.357e-02   12.998  < 2e-16 ***
`Inflight service`               1.928e-01  1.171e-02   16.464  < 2e-16 ***
`Leg room service`               2.739e-01  8.432e-03   32.484  < 2e-16 ***
`Baggage handling`               1.882e-01  1.121e-02   16.786  < 2e-16 ***
`Checkin service`                3.396e-01  8.445e-03   40.217  < 2e-16 ***
Cleanliness                      1.890e-01  1.183e-02   15.969  < 2e-16 ***
`Departure Delay in Minutes`     4.496e-03  9.795e-04    4.590 4.43e-06 ***
`Arrival Delay in Minutes`      -9.136e-03  9.659e-04   -9.459  < 2e-16 ***
GenderMale                       3.239e-02  1.931e-02    1.678  0.09340 .  
`Customer Type`Loyal Customer    1.852e+00  2.814e-02   65.821  < 2e-16 ***
`Type of Travel`Personal Travel -2.713e+00  3.023e-02  -89.758  < 2e-16 ***
ClassEco                        -7.892e-01  2.534e-02  -31.138  < 2e-16 ***
ClassEco Plus                   -9.185e-01  4.117e-02  -22.311  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 141768  on 103593  degrees of freedom
Residual deviance:  70370  on 103573  degrees of freedom
AIC: 70412

Number of Fisher Scoring iterations: 5
         Actual
Predicted     0     1
        0 15777  2311
        1  1677 11314
[1] "Accuracy: 87.17 %"

This logistic regression model has an accuracy of 87.17% on the 31,079 observations in the confusion matrix above.
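The accuracy figure can be reproduced directly from the four cells of the confusion matrix. The following is a minimal sketch in plain Python, not the code used for the analysis. It assumes that class 1 means "satisfied" (the report does not state the coding), and the helper name `classification_metrics` is made up for this example. It also reports precision and recall, which the accuracy alone does not show.

```python
# Counts copied from the confusion matrix above (rows = predicted, columns = actual).
# Assumption: class 1 means "satisfied"; the report does not state the coding.
true_neg = 15777   # predicted 0, actual 0
false_neg = 2311   # predicted 0, actual 1
false_pos = 1677   # predicted 1, actual 0
true_pos = 11314   # predicted 1, actual 1


def classification_metrics(tn, fn, fp, tp):
    """Return accuracy, precision, and recall for the positive class (1)."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted 1s that are actually 1
        "recall": tp / (tp + fn),        # share of actual 1s that were predicted 1
    }


metrics = classification_metrics(true_neg, false_neg, false_pos, true_pos)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under that assumption, precision and recall show how the errors split between passengers wrongly predicted as satisfied and satisfied passengers the model missed.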

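Nearly every predictor is statistically significant; only Flight Distance (p = 0.36) and Gender (p = 0.09) are not significant at the 0.05 level. Among the survey ratings, the most strongly significant predictors (largest absolute z values) are Online boarding, Checkin service, Inflight wifi service, Leg room service, and Ease of Online booking.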

Holding the other predictors constant, higher ratings for Ease of Online booking and longer Arrival Delay in Minutes are associated with lower odds of satisfaction, since both have negative coefficients.

Business Class passengers and Loyal Customers are more likely to be satisfied. This makes sense: loyal customers keep flying with this airline because they enjoy it, and business class passengers receive better accommodations, so they are more likely to be satisfied than other passengers.
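Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below is plain Python, not the original analysis code. It uses a handful of estimates copied from the coefficient table above, and it assumes the modeled outcome is "satisfied". The function name `odds_ratio` is chosen only for this example.

```python
import math

# Estimates (log-odds) copied from the coefficient table above.
# Assumption: the modeled outcome is "satisfied".
coefficients = {
    "Online boarding": 0.6189,
    "Inflight wifi service": 0.3927,
    "Ease of Online booking": -0.1916,
    "Arrival Delay in Minutes": -0.009136,
    "Customer Type: Loyal Customer": 1.852,
    "Class: Eco": -0.7892,
}


def odds_ratio(log_odds):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(log_odds)


odds_ratios = {name: odds_ratio(b) for name, b in coefficients.items()}
for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of satisfaction)")
```

An odds ratio above 1 means a one-unit increase in the predictor (or membership in the category) multiplies the odds of satisfaction by that amount, holding the other predictors constant. A ratio below 1 shrinks the odds.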

I would recommend improving inflight services and minimizing delays where possible. Delays are hard to address because they are often caused by weather, which is entirely out of the airline's control. In that case, the airline can increase the quality of its inflight service so that passengers do not get too crabby or upset when a flight is delayed.

Compare to Skytrax Webscraped Data:

# A tibble: 1,717 × 3
# Groups:   airline [1]
   airline    word         n
   <chr>      <chr>    <int>
 1 Aer Lingus flight     127
 2 Aer Lingus aer         94
 3 Aer Lingus lingus      93
 4 Aer Lingus verified    70
 5 Aer Lingus dublin      60
 6 Aer Lingus service     55
 7 Aer Lingus trip        49
 8 Aer Lingus airline     48
 9 Aer Lingus airport     42
10 Aer Lingus told        40
# ℹ 1,707 more rows
# A tibble: 1,682 × 3
# Groups:   airline [1]
   airline word         n
   <chr>   <chr>    <int>
 1 United  flight     220
 2 United  united     126
 3 United  verified    70
 4 United  told        53
 5 United  time        52
 6 United  hours       51
 7 United  trip        51
 8 United  airline     47
 9 United  service     41
10 United  airport     39
# ℹ 1,672 more rows

There is not much to compare in this visual: the positive and negative sentiment scores for the two airlines are very similar. I think we should dive deeper by seeing which words are used most frequently for each airline.
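Counting the most frequent words per airline comes down to splitting each review into words, dropping common stop words, and tallying what is left. The sketch below shows that idea in plain Python. It is not the code used for the analysis: the review snippets are made up for illustration, and the short `STOP_WORDS` set is a hypothetical stand-in for a full stop-word list.

```python
import re
from collections import Counter

# Hypothetical stop-word list and made-up review snippets, for illustration only.
STOP_WORDS = {"the", "was", "and", "a", "to", "my", "we", "were", "for", "on"}

reviews = {
    "Aer Lingus": ["The flight to Dublin was on time and the service was friendly."],
    "United": [
        "My flight was delayed for hours.",
        "We were delayed and the service was poor.",
    ],
}


def word_counts(texts):
    """Lowercase each text, split it into words, and count the non-stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


for airline, texts in reviews.items():
    print(airline, word_counts(texts).most_common(3))
```

Without the stop-word step, filler words like "the" and "was" would crowd out the words that actually describe the flight.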

# A tibble: 4 × 3
# Groups:   airline [2]
  airline    sentiment     n
  <chr>      <chr>     <int>
1 Aer Lingus negative    194
2 United     negative    178
3 Aer Lingus positive    106
4 United     positive     87

I chose to include words used 5 or more times because I scraped 70 reviews for each airline. The fact that “delayed” was included in 37 of these 70 reviews for United Airlines is interesting. Based on the logistic regression model for the flight survey data from Kaggle, United Airlines should look into why so many of its reviews mention delays and should also increase the quality of its inflight service. Then, even with these delays, its reviews might become more positive.

There is more variation in the words used, and more negative words, in United Airlines reviews than in Aer Lingus reviews. I found it interesting that the word “delayed” was used 31 more times in United reviews than in Aer Lingus reviews. I would say that United reviews express more negative emotion than Aer Lingus reviews.
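Because the two airlines have different numbers of sentiment words, it is fairer to compare the share of negative words than the raw counts. The following sketch, in plain Python and not part of the original analysis, computes that share from the positive and negative counts in the table above. The helper name `negative_share` is made up for this example.

```python
# Counts copied from the sentiment table above.
sentiment_counts = {
    "Aer Lingus": {"negative": 194, "positive": 106},
    "United": {"negative": 178, "positive": 87},
}


def negative_share(counts):
    """Return the fraction of sentiment-bearing words that are negative."""
    total = counts["negative"] + counts["positive"]
    return counts["negative"] / total


shares = {airline: negative_share(c) for airline, c in sentiment_counts.items()}
for airline, share in shares.items():
    print(f"{airline}: {share:.1%} of sentiment words are negative")
```

On these counts, Aer Lingus has more negative words in total, but United has the slightly higher negative share.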

These visuals are inverses of each other, but as we can see, there are noticeably more negative words and emotions than positive ones in these airline reviews.

United Airlines does not have a single month of the year in which its positive score is higher than its negative score.

We can see that in the summer months (July, August, and September) there are many more reviews for United, and these are more negative than positive. I think this is due to the high travel period. I think both airlines see more negative comments because people are more likely to complain about airlines than to compliment them. As travelers, we just want to get to our destination safely and comfortably so our vacation or trip can begin. Also, many people are already crabby when they board their flight home, simply because the trip is over. They had a great time and do not want it to end, so anything going a little bit wrong during the flight can leave a bad taste in their mouth and lead them to write a negative review. And if a flight is delayed, people are stuck in an airport with nothing to do, so many have extra time to sit down and complain about the airline. If a flight goes well, people just go on with their lives happily without expressing that happiness in a review.

Conclusion:

In conclusion, these datasets suggest that many different aspects of travel and flying affect a passenger’s overall satisfaction with a flight. Passenger characteristics like type of customer, type of travel, and class are associated with how a person feels about a flight. Inflight services and other things specific to the airline, like baggage handling, cleanliness, food and drink, and wifi service, are also associated with a passenger’s response to a survey or review of the flight and airline.

I think airlines should focus on how they can provide better service when a flight is delayed. Delays are mostly out of their hands (when it comes to weather), but the way they respond and treat their passengers is up to them.

Finally, airlines should pay attention to certain types of flyers, like their loyal customers and passengers in different classes. These groups are treated differently, so airlines should be aware of this and try to attend to the needs of different flyers throughout the entire flight and travel process.