Create an Example Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.
When I was in High School, the Television show that everyone could not stop talking about was “Lost”. This famous show, created by J.J. Abrams, depicted a fictional plane crash and what happened subsequently to the passengers and crew. In real life, and it seems a lot more in recent history than ever, planes have been crashing or vanishing at sea at an alarming rate. Personally, my father travels about 25 weeks out of the year. Coming from a computer science background and not an engineering one. How can I give my father and the rest of the traveling community, some sort of “advise” for flying safe within the means of my area of expertise. Here is where this project can come into play. I found this dataset off of Kaggle’s GitHub page that was the data behind a very powerful article, Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?. For now, let’s create a “Tidy Recipe” for this dataset.
## Parsed with column specification:
## cols(
## airline = col_character(),
## avail_seat_km_per_week = col_double(),
## incidents_85_99 = col_double(),
## fatal_accidents_85_99 = col_double(),
## fatalities_85_99 = col_double(),
## incidents_00_14 = col_double(),
## fatal_accidents_00_14 = col_double(),
## fatalities_00_14 = col_double()
## )
airline | avail_seat_km_per_week | incidents_85_99 | fatal_accidents_85_99 | fatalities_85_99 | incidents_00_14 | fatal_accidents_00_14 | fatalities_00_14 |
---|---|---|---|---|---|---|---|
Aer Lingus | 320906734 | 2 | 0 | 0 | 0 | 0 | 0 |
Aeroflot* | 1197672318 | 76 | 14 | 128 | 6 | 1 | 88 |
Aerolineas Argentinas | 385803648 | 6 | 0 | 0 | 1 | 0 | 0 |
Aeromexico* | 596871813 | 3 | 1 | 64 | 5 | 0 | 0 |
Air Canada | 1865253802 | 2 | 0 | 0 | 2 | 0 | 0 |
Air France | 3004002661 | 14 | 4 | 79 | 6 | 2 | 337 |
Air India* | 869253552 | 2 | 1 | 329 | 4 | 1 | 158 |
Air New Zealand* | 710174817 | 3 | 0 | 0 | 5 | 1 | 7 |
Alaska Airlines* | 965346773 | 5 | 0 | 0 | 5 | 1 | 88 |
Alitalia | 698012498 | 7 | 2 | 50 | 4 | 0 | 0 |
All Nippon Airways | 1841234177 | 3 | 1 | 1 | 7 | 0 | 0 |
American* | 5228357340 | 21 | 5 | 101 | 17 | 3 | 416 |
Austrian Airlines | 358239823 | 1 | 0 | 0 | 1 | 0 | 0 |
Avianca | 396922563 | 5 | 3 | 323 | 0 | 0 | 0 |
British Airways* | 3179760952 | 4 | 0 | 0 | 6 | 0 | 0 |
Cathay Pacific* | 2582459303 | 0 | 0 | 0 | 2 | 0 | 0 |
China Airlines | 813216487 | 12 | 6 | 535 | 2 | 1 | 225 |
Condor | 417982610 | 2 | 1 | 16 | 0 | 0 | 0 |
COPA | 550491507 | 3 | 1 | 47 | 0 | 0 | 0 |
Delta / Northwest* | 6525658894 | 24 | 12 | 407 | 24 | 2 | 51 |
Egyptair | 557699891 | 8 | 3 | 282 | 4 | 1 | 14 |
El Al | 335448023 | 1 | 1 | 4 | 1 | 0 | 0 |
Ethiopian Airlines | 488560643 | 25 | 5 | 167 | 5 | 2 | 92 |
Finnair | 506464950 | 1 | 0 | 0 | 0 | 0 | 0 |
Garuda Indonesia | 613356665 | 10 | 3 | 260 | 4 | 2 | 22 |
We can obserse above, that this dataset provides both the date-ranges : 85-99 and 00-14 in a row.
First we will want to convert this dataset into a format in which, for each airline there will be 2 rows: one for 85-99 and another row for 00-14.
From here, this will allow for each date range within an airline, where 1 row store the “count” which are accrued from - “incidents”, “fatal_incidents” and “fatalities”.
By doing so, this will create a new working data-frame that will be clean and easier to perform analysis on.
airline | avail_seat_km_per_week | incidents_85_99 | incidents_00_14 | fatal_accidents_85_99 | fatal_accidents_00_14 | fatalities_85_99 | fatalities_00_14 |
---|---|---|---|---|---|---|---|
Aer Lingus | 320906734 | 2 | 0 | 0 | 0 | 0 | 0 |
Aeroflot* | 1197672318 | 76 | 6 | 14 | 1 | 128 | 88 |
Aerolineas Argentinas | 385803648 | 6 | 1 | 0 | 0 | 0 | 0 |
Aeromexico* | 596871813 | 3 | 5 | 1 | 0 | 64 | 0 |
Air Canada | 1865253802 | 2 | 2 | 0 | 0 | 0 | 0 |
Air France | 3004002661 | 14 | 6 | 4 | 2 | 79 | 337 |
Air India* | 869253552 | 2 | 4 | 1 | 1 | 329 | 158 |
Air New Zealand* | 710174817 | 3 | 5 | 0 | 1 | 0 | 7 |
Alaska Airlines* | 965346773 | 5 | 5 | 0 | 1 | 0 | 88 |
Alitalia | 698012498 | 7 | 4 | 2 | 0 | 50 | 0 |
All Nippon Airways | 1841234177 | 3 | 7 | 1 | 0 | 1 | 0 |
American* | 5228357340 | 21 | 17 | 5 | 3 | 101 | 416 |
Austrian Airlines | 358239823 | 1 | 1 | 0 | 0 | 0 | 0 |
Avianca | 396922563 | 5 | 0 | 3 | 0 | 323 | 0 |
British Airways* | 3179760952 | 4 | 6 | 0 | 0 | 0 | 0 |
Cathay Pacific* | 2582459303 | 0 | 2 | 0 | 0 | 0 | 0 |
China Airlines | 813216487 | 12 | 2 | 6 | 1 | 535 | 225 |
Condor | 417982610 | 2 | 0 | 1 | 0 | 16 | 0 |
COPA | 550491507 | 3 | 0 | 1 | 0 | 47 | 0 |
Delta / Northwest* | 6525658894 | 24 | 24 | 12 | 2 | 407 | 51 |
Egyptair | 557699891 | 8 | 4 | 3 | 1 | 282 | 14 |
El Al | 335448023 | 1 | 1 | 1 | 0 | 4 | 0 |
Ethiopian Airlines | 488560643 | 25 | 5 | 5 | 2 | 167 | 92 |
Finnair | 506464950 | 1 | 0 | 0 | 0 | 0 | 0 |
Garuda Indonesia | 613356665 | 10 | 4 | 3 | 2 | 260 | 22 |
Using the \(tidyr\) package, we will convert this data into a long dataset using the \(gather\) function.
This will count variables “type” and “date_range” in a column.
In addition, the value of this count will be placed into another column.
as_df <- as_df %>% gather("date_range_type", "count", 3:8)
head(as_df, 25) %>% kable() %>% kable_styling()
airline | avail_seat_km_per_week | date_range_type | count |
---|---|---|---|
Aer Lingus | 320906734 | incidents_85_99 | 2 |
Aeroflot* | 1197672318 | incidents_85_99 | 76 |
Aerolineas Argentinas | 385803648 | incidents_85_99 | 6 |
Aeromexico* | 596871813 | incidents_85_99 | 3 |
Air Canada | 1865253802 | incidents_85_99 | 2 |
Air France | 3004002661 | incidents_85_99 | 14 |
Air India* | 869253552 | incidents_85_99 | 2 |
Air New Zealand* | 710174817 | incidents_85_99 | 3 |
Alaska Airlines* | 965346773 | incidents_85_99 | 5 |
Alitalia | 698012498 | incidents_85_99 | 7 |
All Nippon Airways | 1841234177 | incidents_85_99 | 3 |
American* | 5228357340 | incidents_85_99 | 21 |
Austrian Airlines | 358239823 | incidents_85_99 | 1 |
Avianca | 396922563 | incidents_85_99 | 5 |
British Airways* | 3179760952 | incidents_85_99 | 4 |
Cathay Pacific* | 2582459303 | incidents_85_99 | 0 |
China Airlines | 813216487 | incidents_85_99 | 12 |
Condor | 417982610 | incidents_85_99 | 2 |
COPA | 550491507 | incidents_85_99 | 3 |
Delta / Northwest* | 6525658894 | incidents_85_99 | 24 |
Egyptair | 557699891 | incidents_85_99 | 8 |
El Al | 335448023 | incidents_85_99 | 1 |
Ethiopian Airlines | 488560643 | incidents_85_99 | 25 |
Finnair | 506464950 | incidents_85_99 | 1 |
Garuda Indonesia | 613356665 | incidents_85_99 | 10 |
From here, we split the date range and type into two different columns
Furthermore, eliminating columns which has them both together:
as_df$date_range <- str_extract(as_df$date_range_type, "\\d+_\\d+$")
as_df$type <- str_extract(as_df$date_range_type, "^[:alpha:]+(_[:alpha:]+)*")
as_df <- as_df %>% select(1:2, 5, 6, 4)
head(as_df, 25) %>% kable() %>% kable_styling()
airline | avail_seat_km_per_week | date_range | type | count |
---|---|---|---|---|
Aer Lingus | 320906734 | 85_99 | incidents | 2 |
Aeroflot* | 1197672318 | 85_99 | incidents | 76 |
Aerolineas Argentinas | 385803648 | 85_99 | incidents | 6 |
Aeromexico* | 596871813 | 85_99 | incidents | 3 |
Air Canada | 1865253802 | 85_99 | incidents | 2 |
Air France | 3004002661 | 85_99 | incidents | 14 |
Air India* | 869253552 | 85_99 | incidents | 2 |
Air New Zealand* | 710174817 | 85_99 | incidents | 3 |
Alaska Airlines* | 965346773 | 85_99 | incidents | 5 |
Alitalia | 698012498 | 85_99 | incidents | 7 |
All Nippon Airways | 1841234177 | 85_99 | incidents | 3 |
American* | 5228357340 | 85_99 | incidents | 21 |
Austrian Airlines | 358239823 | 85_99 | incidents | 1 |
Avianca | 396922563 | 85_99 | incidents | 5 |
British Airways* | 3179760952 | 85_99 | incidents | 4 |
Cathay Pacific* | 2582459303 | 85_99 | incidents | 0 |
China Airlines | 813216487 | 85_99 | incidents | 12 |
Condor | 417982610 | 85_99 | incidents | 2 |
COPA | 550491507 | 85_99 | incidents | 3 |
Delta / Northwest* | 6525658894 | 85_99 | incidents | 24 |
Egyptair | 557699891 | 85_99 | incidents | 8 |
El Al | 335448023 | 85_99 | incidents | 1 |
Ethiopian Airlines | 488560643 | 85_99 | incidents | 25 |
Finnair | 506464950 | 85_99 | incidents | 1 |
Garuda Indonesia | 613356665 | 85_99 | incidents | 10 |
Using the \(spread\) function and using the variables, “incidents”, “fatal_incidents” and “fatalities”, while counting the occurences to create a value.
This is to convert back into a dataset, however, creating a date range wise split in the process.
as_df <- as_df %>% spread(type, count)
as_df <- as_df %>% select(1:3, 6, 4:5)
head(as_df, 20) %>% kable() %>% kable_styling()
airline | avail_seat_km_per_week | date_range | incidents | fatal_accidents | fatalities |
---|---|---|---|---|---|
Aer Lingus | 320906734 | 00_14 | 0 | 0 | 0 |
Aer Lingus | 320906734 | 85_99 | 2 | 0 | 0 |
Aeroflot* | 1197672318 | 00_14 | 6 | 1 | 88 |
Aeroflot* | 1197672318 | 85_99 | 76 | 14 | 128 |
Aerolineas Argentinas | 385803648 | 00_14 | 1 | 0 | 0 |
Aerolineas Argentinas | 385803648 | 85_99 | 6 | 0 | 0 |
Aeromexico* | 596871813 | 00_14 | 5 | 0 | 0 |
Aeromexico* | 596871813 | 85_99 | 3 | 1 | 64 |
Air Canada | 1865253802 | 00_14 | 2 | 0 | 0 |
Air Canada | 1865253802 | 85_99 | 2 | 0 | 0 |
Air France | 3004002661 | 00_14 | 6 | 2 | 337 |
Air France | 3004002661 | 85_99 | 14 | 4 | 79 |
Air India* | 869253552 | 00_14 | 4 | 1 | 158 |
Air India* | 869253552 | 85_99 | 2 | 1 | 329 |
Air New Zealand* | 710174817 | 00_14 | 5 | 1 | 7 |
Air New Zealand* | 710174817 | 85_99 | 3 | 0 | 0 |
Alaska Airlines* | 965346773 | 00_14 | 5 | 1 | 88 |
Alaska Airlines* | 965346773 | 85_99 | 5 | 0 | 0 |
Alitalia | 698012498 | 00_14 | 4 | 0 | 0 |
Alitalia | 698012498 | 85_99 | 7 | 2 | 50 |
Now the data is fully ready with each airline and a date range having 1 row each
df_sorted_desc <- as_df[with(as_df, order(-incidents)), ]
head(df_sorted_desc, 10) %>% kable() %>% kable_styling()
airline | avail_seat_km_per_week | date_range | incidents | fatal_accidents | fatalities |
---|---|---|---|---|---|
Aeroflot* | 1197672318 | 85_99 | 76 | 14 | 128 |
Ethiopian Airlines | 488560643 | 85_99 | 25 | 5 | 167 |
Delta / Northwest* | 6525658894 | 00_14 | 24 | 2 | 51 |
Delta / Northwest* | 6525658894 | 85_99 | 24 | 12 | 407 |
American* | 5228357340 | 85_99 | 21 | 5 | 101 |
United / Continental* | 7139291291 | 85_99 | 19 | 8 | 319 |
American* | 5228357340 | 00_14 | 17 | 3 | 416 |
US Airways / America West* | 2455687887 | 85_99 | 16 | 7 | 224 |
Air France | 3004002661 | 85_99 | 14 | 4 | 79 |
United / Continental* | 7139291291 | 00_14 | 14 | 2 | 109 |
In order to determine whether or not some of the airlines with poorer safety records improved over the time periods within our dataset, we can look at the differences in the values by plotting them alongside one another.
To begin, let’s create a new dataframe that lists the change in the incidents
, fatal accidents
and fatalities
for each airline. To do so, use the summarise
function after grouping by airline.
Summarise
allows us to combine data based on a certain grouping (in this case by \(airline\)) and create new calculated values. In this case what we want is the difference between the values within the three columns \(incidents\), \(fatal accidents\) and \(fatalities\).
changes <- as_df %>% group_by(airline) %>% summarise(inc_change <- diff(incidents), fat_acc_change <- diff(fatal_accidents), fatalities_change <- diff(fatalities))
names(changes) <- c("Airline", "Incidents Change", "Fatal Accident Change", "Fatalities Change")
Now let’s create ne dataframes showing the most improved airline (we should note that this doesn’t take into account the starting point, so an airline that improved from 200 to 100 will appear better than one that stayed at 0). We will create 3 frames - one for each of the variables.
By_Inc <- changes[with(changes, order(-changes$`Incidents Change`)), 1:2]
By_Fat_Acc <- changes[with(changes, order(-changes$`Fatal Accident Change`)), c(1,3)]
By_Fatalities <- changes[with(changes, order(-changes$`Fatalities Change`)), c(1,4)]
Finally, let’s plot and see which airlines have become much more dangerous. For this example, we will take a look at the change in fatalities numbers. We will use barplot
to create two plots: one of the 10 airlines with the highest increase of fatalities and one of the 10 most improved arlines in terms of number of fatalities.
n_air <- nrow(By_Fatalities)
par(las = 2)
par(mar = c(5,10,4,2))
barplot(By_Fatalities$`Fatalities Change`[(n_air-10):n_air], main = "Airlines with Greatest Increases in Number of Fatalities", horiz = TRUE, names.arg = By_Fatalities$Airline[(n_air-10):n_air], cex.names = 0.5)
par(las = 2)
par(mar = c(5,10,4,2))
barplot(By_Fatalities$`Fatalities Change`[1:10], main = "Airlines with Highest Reduction in Number of Fatalities", horiz = TRUE, names.arg = By_Fatalities$Airline[1:10], cex.names = 0.5)