TidyVerse Recipe Expanded by M. Kollontai

FiveThirtyEight - Data : Airline-Safety
TidyRecipe
Expansion by M. Kollontai

FiveThirtyEight - Data : Airline-Safety

TidyRecipe

Create an Example Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.

Airline Data Selection & Import

When I was in High School, the Television show that everyone could not stop talking about was “Lost”. This famous show, created by J.J. Abrams, depicted a fictional plane crash and what happened subsequently to the passengers and crew. In real life, and it seems a lot more in recent history than ever, planes have been crashing or vanishing at sea at an alarming rate. Personally, my father travels about 25 weeks out of the year. Coming from a computer science background and not an engineering one. How can I give my father and the rest of the traveling community, some sort of “advise” for flying safe within the means of my area of expertise. Here is where this project can come into play. I found this dataset off of Kaggle’s GitHub page that was the data behind a very powerful article, Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?. For now, let’s create a “Tidy Recipe” for this dataset.

Kaggle’s GitHub Repo for Airline Saftey Data

Airline-Saftey_GitHub

Airline Saftey CSV RAW

CSV

as_link <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv"

as_data <- read_csv(as_link)

## Parsed with column specification:
## cols(
##   airline = col_character(),
##   avail_seat_km_per_week = col_double(),
##   incidents_85_99 = col_double(),
##   fatal_accidents_85_99 = col_double(),
##   fatalities_85_99 = col_double(),
##   incidents_00_14 = col_double(),
##   fatal_accidents_00_14 = col_double(),
##   fatalities_00_14 = col_double()
## )

head(as_data, 25) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	incidents_85_99	fatal_accidents_85_99	fatalities_85_99	incidents_00_14	fatal_accidents_00_14	fatalities_00_14
Aer Lingus	320906734	2	0	0	0	0	0
Aeroflot*	1197672318	76	14	128	6	1	88
Aerolineas Argentinas	385803648	6	0	0	1	0	0
Aeromexico*	596871813	3	1	64	5	0	0
Air Canada	1865253802	2	0	0	2	0	0
Air France	3004002661	14	4	79	6	2	337
Air India*	869253552	2	1	329	4	1	158
Air New Zealand*	710174817	3	0	0	5	1	7
Alaska Airlines*	965346773	5	0	0	5	1	88
Alitalia	698012498	7	2	50	4	0	0
All Nippon Airways	1841234177	3	1	1	7	0	0
American*	5228357340	21	5	101	17	3	416
Austrian Airlines	358239823	1	0	0	1	0	0
Avianca	396922563	5	3	323	0	0	0
British Airways*	3179760952	4	0	0	6	0	0
Cathay Pacific*	2582459303	0	0	0	2	0	0
China Airlines	813216487	12	6	535	2	1	225
Condor	417982610	2	1	16	0	0	0
COPA	550491507	3	1	47	0	0	0
Delta / Northwest*	6525658894	24	12	407	24	2	51
Egyptair	557699891	8	3	282	4	1	14
El Al	335448023	1	1	4	1	0	0
Ethiopian Airlines	488560643	25	5	167	5	2	92
Finnair	506464950	1	0	0	0	0	0
Garuda Indonesia	613356665	10	3	260	4	2	22

First Look

We can obserse above, that this dataset provides both the date-ranges : 85-99 and 00-14 in a row.

First we will want to convert this dataset into a format in which, for each airline there will be 2 rows: one for 85-99 and another row for 00-14.

From here, this will allow for each date range within an airline, where 1 row store the “count” which are accrued from - “incidents”, “fatal_incidents” and “fatalities”.

By doing so, this will create a new working data-frame that will be clean and easier to perform analysis on.

as_df <- as_data %>% select(1:3,6,4,7,5,8)
head(as_df, 25) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	incidents_85_99	incidents_00_14	fatal_accidents_85_99	fatal_accidents_00_14	fatalities_85_99	fatalities_00_14
Aer Lingus	320906734	2	0	0	0	0	0
Aeroflot*	1197672318	76	6	14	1	128	88
Aerolineas Argentinas	385803648	6	1	0	0	0	0
Aeromexico*	596871813	3	5	1	0	64	0
Air Canada	1865253802	2	2	0	0	0	0
Air France	3004002661	14	6	4	2	79	337
Air India*	869253552	2	4	1	1	329	158
Air New Zealand*	710174817	3	5	0	1	0	7
Alaska Airlines*	965346773	5	5	0	1	0	88
Alitalia	698012498	7	4	2	0	50	0
All Nippon Airways	1841234177	3	7	1	0	1	0
American*	5228357340	21	17	5	3	101	416
Austrian Airlines	358239823	1	1	0	0	0	0
Avianca	396922563	5	0	3	0	323	0
British Airways*	3179760952	4	6	0	0	0	0
Cathay Pacific*	2582459303	0	2	0	0	0	0
China Airlines	813216487	12	2	6	1	535	225
Condor	417982610	2	0	1	0	16	0
COPA	550491507	3	0	1	0	47	0
Delta / Northwest*	6525658894	24	24	12	2	407	51
Egyptair	557699891	8	4	3	1	282	14
El Al	335448023	1	1	1	0	4	0
Ethiopian Airlines	488560643	25	5	5	2	167	92
Finnair	506464950	1	0	0	0	0	0
Garuda Indonesia	613356665	10	4	3	2	260	22

Tidying -r

Using the \(tidyr\) package, we will convert this data into a long dataset using the \(gather\) function.

This will count variables “type” and “date_range” in a column.

In addition, the value of this count will be placed into another column.

as_df <- as_df %>% gather("date_range_type", "count", 3:8)
head(as_df, 25) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	date_range_type	count
Aer Lingus	320906734	incidents_85_99	2
Aeroflot*	1197672318	incidents_85_99	76
Aerolineas Argentinas	385803648	incidents_85_99	6
Aeromexico*	596871813	incidents_85_99	3
Air Canada	1865253802	incidents_85_99	2
Air France	3004002661	incidents_85_99	14
Air India*	869253552	incidents_85_99	2
Air New Zealand*	710174817	incidents_85_99	3
Alaska Airlines*	965346773	incidents_85_99	5
Alitalia	698012498	incidents_85_99	7
All Nippon Airways	1841234177	incidents_85_99	3
American*	5228357340	incidents_85_99	21
Austrian Airlines	358239823	incidents_85_99	1
Avianca	396922563	incidents_85_99	5
British Airways*	3179760952	incidents_85_99	4
Cathay Pacific*	2582459303	incidents_85_99	0
China Airlines	813216487	incidents_85_99	12
Condor	417982610	incidents_85_99	2
COPA	550491507	incidents_85_99	3
Delta / Northwest*	6525658894	incidents_85_99	24
Egyptair	557699891	incidents_85_99	8
El Al	335448023	incidents_85_99	1
Ethiopian Airlines	488560643	incidents_85_99	25
Finnair	506464950	incidents_85_99	1
Garuda Indonesia	613356665	incidents_85_99	10

From here, we split the date range and type into two different columns

Furthermore, eliminating columns which has them both together:

Splitting the Range

as_df$date_range <- str_extract(as_df$date_range_type, "\\d+_\\d+$")
as_df$type <- str_extract(as_df$date_range_type, "^[:alpha:]+(_[:alpha:]+)*")
as_df <- as_df %>% select(1:2, 5, 6, 4)
head(as_df, 25) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	date_range	type	count
Aer Lingus	320906734	85_99	incidents	2
Aeroflot*	1197672318	85_99	incidents	76
Aerolineas Argentinas	385803648	85_99	incidents	6
Aeromexico*	596871813	85_99	incidents	3
Air Canada	1865253802	85_99	incidents	2
Air France	3004002661	85_99	incidents	14
Air India*	869253552	85_99	incidents	2
Air New Zealand*	710174817	85_99	incidents	3
Alaska Airlines*	965346773	85_99	incidents	5
Alitalia	698012498	85_99	incidents	7
All Nippon Airways	1841234177	85_99	incidents	3
American*	5228357340	85_99	incidents	21
Austrian Airlines	358239823	85_99	incidents	1
Avianca	396922563	85_99	incidents	5
British Airways*	3179760952	85_99	incidents	4
Cathay Pacific*	2582459303	85_99	incidents	0
China Airlines	813216487	85_99	incidents	12
Condor	417982610	85_99	incidents	2
COPA	550491507	85_99	incidents	3
Delta / Northwest*	6525658894	85_99	incidents	24
Egyptair	557699891	85_99	incidents	8
El Al	335448023	85_99	incidents	1
Ethiopian Airlines	488560643	85_99	incidents	25
Finnair	506464950	85_99	incidents	1
Garuda Indonesia	613356665	85_99	incidents	10

Using the \(spread\) function and using the variables, “incidents”, “fatal_incidents” and “fatalities”, while counting the occurences to create a value.

This is to convert back into a dataset, however, creating a date range wise split in the process.

Spread

as_df <- as_df %>% spread(type, count)
as_df <- as_df %>% select(1:3, 6, 4:5)
head(as_df, 20) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	date_range	incidents	fatal_accidents	fatalities
Aer Lingus	320906734	00_14	0	0	0
Aer Lingus	320906734	85_99	2	0	0
Aeroflot*	1197672318	00_14	6	1	88
Aeroflot*	1197672318	85_99	76	14	128
Aerolineas Argentinas	385803648	00_14	1	0	0
Aerolineas Argentinas	385803648	85_99	6	0	0
Aeromexico*	596871813	00_14	5	0	0
Aeromexico*	596871813	85_99	3	1	64
Air Canada	1865253802	00_14	2	0	0
Air Canada	1865253802	85_99	2	0	0
Air France	3004002661	00_14	6	2	337
Air France	3004002661	85_99	14	4	79
Air India*	869253552	00_14	4	1	158
Air India*	869253552	85_99	2	1	329
Air New Zealand*	710174817	00_14	5	1	7
Air New Zealand*	710174817	85_99	3	0	0
Alaska Airlines*	965346773	00_14	5	1	88
Alaska Airlines*	965346773	85_99	5	0	0
Alitalia	698012498	00_14	4	0	0
Alitalia	698012498	85_99	7	2	50

Now the data is fully ready with each airline and a date range having 1 row each

Sorting Quick

df_sorted_desc <- as_df[with(as_df, order(-incidents)), ]
head(df_sorted_desc, 10) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	date_range	incidents	fatal_accidents	fatalities
Aeroflot*	1197672318	85_99	76	14	128
Ethiopian Airlines	488560643	85_99	25	5	167
Delta / Northwest*	6525658894	00_14	24	2	51
Delta / Northwest*	6525658894	85_99	24	12	407
American*	5228357340	85_99	21	5	101
United / Continental*	7139291291	85_99	19	8	319
American*	5228357340	00_14	17	3	416
US Airways / America West*	2455687887	85_99	16	7	224
Air France	3004002661	85_99	14	4	79
United / Continental*	7139291291	00_14	14	2	109

Expansion by M. Kollontai

In order to determine whether or not some of the airlines with poorer safety records improved over the time periods within our dataset, we can look at the differences in the values by plotting them alongside one another.

To begin, let’s create a new dataframe that lists the change in the incidents, fatal accidents and fatalities for each airline. To do so, use the summarise function after grouping by airline.

Summarise allows us to combine data based on a certain grouping (in this case by \(airline\)) and create new calculated values. In this case what we want is the difference between the values within the three columns \(incidents\), \(fatal accidents\) and \(fatalities\).

changes <- as_df %>% group_by(airline) %>% summarise(inc_change <- diff(incidents), fat_acc_change <- diff(fatal_accidents), fatalities_change <- diff(fatalities))
names(changes) <- c("Airline", "Incidents Change", "Fatal Accident Change", "Fatalities Change")

Now let’s create ne dataframes showing the most improved airline (we should note that this doesn’t take into account the starting point, so an airline that improved from 200 to 100 will appear better than one that stayed at 0). We will create 3 frames - one for each of the variables.

By_Inc <- changes[with(changes, order(-changes$`Incidents Change`)), 1:2]
By_Fat_Acc <- changes[with(changes, order(-changes$`Fatal Accident Change`)), c(1,3)]
By_Fatalities <- changes[with(changes, order(-changes$`Fatalities Change`)), c(1,4)]

Finally, let’s plot and see which airlines have become much more dangerous. For this example, we will take a look at the change in fatalities numbers. We will use barplot to create two plots: one of the 10 airlines with the highest increase of fatalities and one of the 10 most improved arlines in terms of number of fatalities.

n_air <- nrow(By_Fatalities)
par(las = 2)
par(mar = c(5,10,4,2))
barplot(By_Fatalities$`Fatalities Change`[(n_air-10):n_air], main = "Airlines with Greatest Increases in Number of Fatalities", horiz = TRUE, names.arg = By_Fatalities$Airline[(n_air-10):n_air], cex.names = 0.5)

par(las = 2)
par(mar = c(5,10,4,2))
barplot(By_Fatalities$`Fatalities Change`[1:10], main = "Airlines with Highest Reduction in Number of Fatalities", horiz = TRUE, names.arg = By_Fatalities$Airline[1:10], cex.names = 0.5)