Create an Example Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.
When I was in High School, the Television show that everyone could not stop talking about was “Lost”. This famous show, created by J.J. Abrams, depicted a fictional plane crash and what happened subsequently to the passengers and crew. In real life, and it seems a lot more in recent history than ever, planes have been crashing or vanishing at sea at an alarming rate. Personally, my father travels about 25 weeks out of the year. Coming from a computer science background and not an engineering one. How can I give my father and the rest of the traveling community, some sort of “advise” for flying safe within the means of my area of expertise. Here is where this project can come into play. I found this dataset off of Kaggle’s GitHub page that was the data behind a very powerful article, Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?. For now, let’s create a “Tidy Recipe” for this dataset.
## Parsed with column specification:
## cols(
## airline = col_character(),
## avail_seat_km_per_week = col_double(),
## incidents_85_99 = col_integer(),
## fatal_accidents_85_99 = col_integer(),
## fatalities_85_99 = col_integer(),
## incidents_00_14 = col_integer(),
## fatal_accidents_00_14 = col_integer(),
## fatalities_00_14 = col_integer()
## )
| airline | avail_seat_km_per_week | incidents_85_99 | fatal_accidents_85_99 | fatalities_85_99 | incidents_00_14 | fatal_accidents_00_14 | fatalities_00_14 |
|---|---|---|---|---|---|---|---|
| Aer Lingus | 320906734 | 2 | 0 | 0 | 0 | 0 | 0 |
| Aeroflot* | 1197672318 | 76 | 14 | 128 | 6 | 1 | 88 |
| Aerolineas Argentinas | 385803648 | 6 | 0 | 0 | 1 | 0 | 0 |
| Aeromexico* | 596871813 | 3 | 1 | 64 | 5 | 0 | 0 |
| Air Canada | 1865253802 | 2 | 0 | 0 | 2 | 0 | 0 |
| Air France | 3004002661 | 14 | 4 | 79 | 6 | 2 | 337 |
| Air India* | 869253552 | 2 | 1 | 329 | 4 | 1 | 158 |
| Air New Zealand* | 710174817 | 3 | 0 | 0 | 5 | 1 | 7 |
| Alaska Airlines* | 965346773 | 5 | 0 | 0 | 5 | 1 | 88 |
| Alitalia | 698012498 | 7 | 2 | 50 | 4 | 0 | 0 |
| All Nippon Airways | 1841234177 | 3 | 1 | 1 | 7 | 0 | 0 |
| American* | 5228357340 | 21 | 5 | 101 | 17 | 3 | 416 |
| Austrian Airlines | 358239823 | 1 | 0 | 0 | 1 | 0 | 0 |
| Avianca | 396922563 | 5 | 3 | 323 | 0 | 0 | 0 |
| British Airways* | 3179760952 | 4 | 0 | 0 | 6 | 0 | 0 |
| Cathay Pacific* | 2582459303 | 0 | 0 | 0 | 2 | 0 | 0 |
| China Airlines | 813216487 | 12 | 6 | 535 | 2 | 1 | 225 |
| Condor | 417982610 | 2 | 1 | 16 | 0 | 0 | 0 |
| COPA | 550491507 | 3 | 1 | 47 | 0 | 0 | 0 |
| Delta / Northwest* | 6525658894 | 24 | 12 | 407 | 24 | 2 | 51 |
| Egyptair | 557699891 | 8 | 3 | 282 | 4 | 1 | 14 |
| El Al | 335448023 | 1 | 1 | 4 | 1 | 0 | 0 |
| Ethiopian Airlines | 488560643 | 25 | 5 | 167 | 5 | 2 | 92 |
| Finnair | 506464950 | 1 | 0 | 0 | 0 | 0 | 0 |
| Garuda Indonesia | 613356665 | 10 | 3 | 260 | 4 | 2 | 22 |
We can obserse above, that thi dataset provides both the date-ranges : 85-99 and 00-14 in a row.
First we will want to convert this dataset into a format in which, for each airline there will be 2 rows: one for 85-99 and another row for 00-14.
From here, this will allow for each date range within an airline, where 1 row store the “count” which are accrued from - “incidents”, “fatal_incidents” and “fatalities”.
By doing so, this will create a new working data-frame that will be clean and easier to perform analysis on.
| airline | avail_seat_km_per_week | incidents_85_99 | incidents_00_14 | fatal_accidents_85_99 | fatal_accidents_00_14 | fatalities_85_99 | fatalities_00_14 |
|---|---|---|---|---|---|---|---|
| Aer Lingus | 320906734 | 2 | 0 | 0 | 0 | 0 | 0 |
| Aeroflot* | 1197672318 | 76 | 6 | 14 | 1 | 128 | 88 |
| Aerolineas Argentinas | 385803648 | 6 | 1 | 0 | 0 | 0 | 0 |
| Aeromexico* | 596871813 | 3 | 5 | 1 | 0 | 64 | 0 |
| Air Canada | 1865253802 | 2 | 2 | 0 | 0 | 0 | 0 |
| Air France | 3004002661 | 14 | 6 | 4 | 2 | 79 | 337 |
| Air India* | 869253552 | 2 | 4 | 1 | 1 | 329 | 158 |
| Air New Zealand* | 710174817 | 3 | 5 | 0 | 1 | 0 | 7 |
| Alaska Airlines* | 965346773 | 5 | 5 | 0 | 1 | 0 | 88 |
| Alitalia | 698012498 | 7 | 4 | 2 | 0 | 50 | 0 |
| All Nippon Airways | 1841234177 | 3 | 7 | 1 | 0 | 1 | 0 |
| American* | 5228357340 | 21 | 17 | 5 | 3 | 101 | 416 |
| Austrian Airlines | 358239823 | 1 | 1 | 0 | 0 | 0 | 0 |
| Avianca | 396922563 | 5 | 0 | 3 | 0 | 323 | 0 |
| British Airways* | 3179760952 | 4 | 6 | 0 | 0 | 0 | 0 |
| Cathay Pacific* | 2582459303 | 0 | 2 | 0 | 0 | 0 | 0 |
| China Airlines | 813216487 | 12 | 2 | 6 | 1 | 535 | 225 |
| Condor | 417982610 | 2 | 0 | 1 | 0 | 16 | 0 |
| COPA | 550491507 | 3 | 0 | 1 | 0 | 47 | 0 |
| Delta / Northwest* | 6525658894 | 24 | 24 | 12 | 2 | 407 | 51 |
| Egyptair | 557699891 | 8 | 4 | 3 | 1 | 282 | 14 |
| El Al | 335448023 | 1 | 1 | 1 | 0 | 4 | 0 |
| Ethiopian Airlines | 488560643 | 25 | 5 | 5 | 2 | 167 | 92 |
| Finnair | 506464950 | 1 | 0 | 0 | 0 | 0 | 0 |
| Garuda Indonesia | 613356665 | 10 | 4 | 3 | 2 | 260 | 22 |
Using the \(tidyr\) package, we will convert this data into a long dataset using the \(gather\) function.
This will count variables “type” and “date_range” in a column.
In addition, the value of this count will be placed into another column.
as_df <- as_df %>% gather("date_range_type", "count", 3:8)
head(as_df, 25) %>% kable() %>% kable_styling()| airline | avail_seat_km_per_week | date_range_type | count |
|---|---|---|---|
| Aer Lingus | 320906734 | incidents_85_99 | 2 |
| Aeroflot* | 1197672318 | incidents_85_99 | 76 |
| Aerolineas Argentinas | 385803648 | incidents_85_99 | 6 |
| Aeromexico* | 596871813 | incidents_85_99 | 3 |
| Air Canada | 1865253802 | incidents_85_99 | 2 |
| Air France | 3004002661 | incidents_85_99 | 14 |
| Air India* | 869253552 | incidents_85_99 | 2 |
| Air New Zealand* | 710174817 | incidents_85_99 | 3 |
| Alaska Airlines* | 965346773 | incidents_85_99 | 5 |
| Alitalia | 698012498 | incidents_85_99 | 7 |
| All Nippon Airways | 1841234177 | incidents_85_99 | 3 |
| American* | 5228357340 | incidents_85_99 | 21 |
| Austrian Airlines | 358239823 | incidents_85_99 | 1 |
| Avianca | 396922563 | incidents_85_99 | 5 |
| British Airways* | 3179760952 | incidents_85_99 | 4 |
| Cathay Pacific* | 2582459303 | incidents_85_99 | 0 |
| China Airlines | 813216487 | incidents_85_99 | 12 |
| Condor | 417982610 | incidents_85_99 | 2 |
| COPA | 550491507 | incidents_85_99 | 3 |
| Delta / Northwest* | 6525658894 | incidents_85_99 | 24 |
| Egyptair | 557699891 | incidents_85_99 | 8 |
| El Al | 335448023 | incidents_85_99 | 1 |
| Ethiopian Airlines | 488560643 | incidents_85_99 | 25 |
| Finnair | 506464950 | incidents_85_99 | 1 |
| Garuda Indonesia | 613356665 | incidents_85_99 | 10 |
From here, we split the date range and type into two different columns
Furthermore, eliminating columns which hasthem both together:
as_df$date_range <- str_extract(as_df$date_range_type, "\\d+_\\d+$")
as_df$type <- str_extract(as_df$date_range_type, "^[:alpha:]+(_[:alpha:]+)*")
as_df <- as_df %>% select(1:2, 5, 6, 4)
head(as_df, 25) %>% kable() %>% kable_styling()| airline | avail_seat_km_per_week | date_range | type | count |
|---|---|---|---|---|
| Aer Lingus | 320906734 | 85_99 | incidents | 2 |
| Aeroflot* | 1197672318 | 85_99 | incidents | 76 |
| Aerolineas Argentinas | 385803648 | 85_99 | incidents | 6 |
| Aeromexico* | 596871813 | 85_99 | incidents | 3 |
| Air Canada | 1865253802 | 85_99 | incidents | 2 |
| Air France | 3004002661 | 85_99 | incidents | 14 |
| Air India* | 869253552 | 85_99 | incidents | 2 |
| Air New Zealand* | 710174817 | 85_99 | incidents | 3 |
| Alaska Airlines* | 965346773 | 85_99 | incidents | 5 |
| Alitalia | 698012498 | 85_99 | incidents | 7 |
| All Nippon Airways | 1841234177 | 85_99 | incidents | 3 |
| American* | 5228357340 | 85_99 | incidents | 21 |
| Austrian Airlines | 358239823 | 85_99 | incidents | 1 |
| Avianca | 396922563 | 85_99 | incidents | 5 |
| British Airways* | 3179760952 | 85_99 | incidents | 4 |
| Cathay Pacific* | 2582459303 | 85_99 | incidents | 0 |
| China Airlines | 813216487 | 85_99 | incidents | 12 |
| Condor | 417982610 | 85_99 | incidents | 2 |
| COPA | 550491507 | 85_99 | incidents | 3 |
| Delta / Northwest* | 6525658894 | 85_99 | incidents | 24 |
| Egyptair | 557699891 | 85_99 | incidents | 8 |
| El Al | 335448023 | 85_99 | incidents | 1 |
| Ethiopian Airlines | 488560643 | 85_99 | incidents | 25 |
| Finnair | 506464950 | 85_99 | incidents | 1 |
| Garuda Indonesia | 613356665 | 85_99 | incidents | 10 |
Using the \(spread\) function and using the variables, “incidents”, “fatal_incidents” and “fatalities”, while counting the occurences to create a value.
This is to convert back into a dataset, however, creating a date range wise split in the process.
as_df <- as_df %>% spread(type, count)
as_df <- as_df %>% select(1:3, 6, 4:5)
head(as_df, 20) %>% kable() %>% kable_styling()| airline | avail_seat_km_per_week | date_range | incidents | fatal_accidents | fatalities |
|---|---|---|---|---|---|
| Aer Lingus | 320906734 | 00_14 | 0 | 0 | 0 |
| Aer Lingus | 320906734 | 85_99 | 2 | 0 | 0 |
| Aeroflot* | 1197672318 | 00_14 | 6 | 1 | 88 |
| Aeroflot* | 1197672318 | 85_99 | 76 | 14 | 128 |
| Aerolineas Argentinas | 385803648 | 00_14 | 1 | 0 | 0 |
| Aerolineas Argentinas | 385803648 | 85_99 | 6 | 0 | 0 |
| Aeromexico* | 596871813 | 00_14 | 5 | 0 | 0 |
| Aeromexico* | 596871813 | 85_99 | 3 | 1 | 64 |
| Air Canada | 1865253802 | 00_14 | 2 | 0 | 0 |
| Air Canada | 1865253802 | 85_99 | 2 | 0 | 0 |
| Air France | 3004002661 | 00_14 | 6 | 2 | 337 |
| Air France | 3004002661 | 85_99 | 14 | 4 | 79 |
| Air India* | 869253552 | 00_14 | 4 | 1 | 158 |
| Air India* | 869253552 | 85_99 | 2 | 1 | 329 |
| Air New Zealand* | 710174817 | 00_14 | 5 | 1 | 7 |
| Air New Zealand* | 710174817 | 85_99 | 3 | 0 | 0 |
| Alaska Airlines* | 965346773 | 00_14 | 5 | 1 | 88 |
| Alaska Airlines* | 965346773 | 85_99 | 5 | 0 | 0 |
| Alitalia | 698012498 | 00_14 | 4 | 0 | 0 |
| Alitalia | 698012498 | 85_99 | 7 | 2 | 50 |
Now the data is fully ready with each airline and a date range having 1 row each
df_sorted_desc <- as_df[with(as_df, order(-incidents)), ]
head(df_sorted_desc) %>% kable() %>% kable_styling()| airline | avail_seat_km_per_week | date_range | incidents | fatal_accidents | fatalities |
|---|---|---|---|---|---|
| Aeroflot* | 1197672318 | 85_99 | 76 | 14 | 128 |
| Ethiopian Airlines | 488560643 | 85_99 | 25 | 5 | 167 |
| Delta / Northwest* | 6525658894 | 00_14 | 24 | 2 | 51 |
| Delta / Northwest* | 6525658894 | 85_99 | 24 | 12 | 407 |
| American* | 5228357340 | 85_99 | 21 | 5 | 101 |
| United / Continental* | 7139291291 | 85_99 | 19 | 8 | 319 |
You can see from a “quick” sort using the \(order\) plus a \(-\) before the variable to create a descending DataFrame based on incidents