This exercise is a part of DATA-607 Fall 2018 semester. As a part of this exercise, we need to perform the below 2 tasks: 1) Create one example of code which uses one or more packages from the Tidyverse, and put that RMD file into my Github 2) Collaborate into someone else’s code using fork code and pull request on someone else’s Github
Below is he code for the first part of this assignment:
I am taking the airline safety data set from fivethirtyeight github (https://github.com/fivethirtyeight/data/tree/master/airline-safety), and making it ready for the data analysis.
Step-1: Load the required packages into R:
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(tidyr)
library(stringr)
library(kableExtra)Step-2: Load the dataset into R
airline_safety_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv"
airline_safety_dataset <- read_csv(airline_safety_url)## Parsed with column specification:
## cols(
## airline = col_character(),
## avail_seat_km_per_week = col_double(),
## incidents_85_99 = col_integer(),
## fatal_accidents_85_99 = col_integer(),
## fatalities_85_99 = col_integer(),
## incidents_00_14 = col_integer(),
## fatal_accidents_00_14 = col_integer(),
## fatalities_00_14 = col_integer()
## )
dim(airline_safety_dataset)## [1] 56 8
head(airline_safety_dataset, 20) %>% kable() %>% kable_styling()| airline | avail_seat_km_per_week | incidents_85_99 | fatal_accidents_85_99 | fatalities_85_99 | incidents_00_14 | fatal_accidents_00_14 | fatalities_00_14 |
|---|---|---|---|---|---|---|---|
| Aer Lingus | 320906734 | 2 | 0 | 0 | 0 | 0 | 0 |
| Aeroflot* | 1197672318 | 76 | 14 | 128 | 6 | 1 | 88 |
| Aerolineas Argentinas | 385803648 | 6 | 0 | 0 | 1 | 0 | 0 |
| Aeromexico* | 596871813 | 3 | 1 | 64 | 5 | 0 | 0 |
| Air Canada | 1865253802 | 2 | 0 | 0 | 2 | 0 | 0 |
| Air France | 3004002661 | 14 | 4 | 79 | 6 | 2 | 337 |
| Air India* | 869253552 | 2 | 1 | 329 | 4 | 1 | 158 |
| Air New Zealand* | 710174817 | 3 | 0 | 0 | 5 | 1 | 7 |
| Alaska Airlines* | 965346773 | 5 | 0 | 0 | 5 | 1 | 88 |
| Alitalia | 698012498 | 7 | 2 | 50 | 4 | 0 | 0 |
| All Nippon Airways | 1841234177 | 3 | 1 | 1 | 7 | 0 | 0 |
| American* | 5228357340 | 21 | 5 | 101 | 17 | 3 | 416 |
| Austrian Airlines | 358239823 | 1 | 0 | 0 | 1 | 0 | 0 |
| Avianca | 396922563 | 5 | 3 | 323 | 0 | 0 | 0 |
| British Airways* | 3179760952 | 4 | 0 | 0 | 6 | 0 | 0 |
| Cathay Pacific* | 2582459303 | 0 | 0 | 0 | 2 | 0 | 0 |
| China Airlines | 813216487 | 12 | 6 | 535 | 2 | 1 | 225 |
| Condor | 417982610 | 2 | 1 | 16 | 0 | 0 | 0 |
| COPA | 550491507 | 3 | 1 | 47 | 0 | 0 | 0 |
| Delta / Northwest* | 6525658894 | 24 | 12 | 407 | 24 | 2 | 51 |
As we see above, this dataset gives the data for both the date-ranges : 85-99 and 00-14 in the same row. Now we want to convert this dataset into a format where for each airline, there will be 2 rows, one for 85-99 and another row for 00-14. This will ensure that for each date range for an airline, we have 1 row with all the variable counts - incidents, fatal_incidents and fatalities.
Creating a new working data frame which will be used further. Creating this with all the same type counts togethr, e.g. incidents_85_99 and incidents_00_14 next to each other, and so on.
airline_safety_df <- airline_safety_dataset %>% select(1:3,6,4,7,5,8)
head(airline_safety_df, 20) %>% kable() %>% kable_styling()| airline | avail_seat_km_per_week | incidents_85_99 | incidents_00_14 | fatal_accidents_85_99 | fatal_accidents_00_14 | fatalities_85_99 | fatalities_00_14 |
|---|---|---|---|---|---|---|---|
| Aer Lingus | 320906734 | 2 | 0 | 0 | 0 | 0 | 0 |
| Aeroflot* | 1197672318 | 76 | 6 | 14 | 1 | 128 | 88 |
| Aerolineas Argentinas | 385803648 | 6 | 1 | 0 | 0 | 0 | 0 |
| Aeromexico* | 596871813 | 3 | 5 | 1 | 0 | 64 | 0 |
| Air Canada | 1865253802 | 2 | 2 | 0 | 0 | 0 | 0 |
| Air France | 3004002661 | 14 | 6 | 4 | 2 | 79 | 337 |
| Air India* | 869253552 | 2 | 4 | 1 | 1 | 329 | 158 |
| Air New Zealand* | 710174817 | 3 | 5 | 0 | 1 | 0 | 7 |
| Alaska Airlines* | 965346773 | 5 | 5 | 0 | 1 | 0 | 88 |
| Alitalia | 698012498 | 7 | 4 | 2 | 0 | 50 | 0 |
| All Nippon Airways | 1841234177 | 3 | 7 | 1 | 0 | 1 | 0 |
| American* | 5228357340 | 21 | 17 | 5 | 3 | 101 | 416 |
| Austrian Airlines | 358239823 | 1 | 1 | 0 | 0 | 0 | 0 |
| Avianca | 396922563 | 5 | 0 | 3 | 0 | 323 | 0 |
| British Airways* | 3179760952 | 4 | 6 | 0 | 0 | 0 | 0 |
| Cathay Pacific* | 2582459303 | 0 | 2 | 0 | 0 | 0 | 0 |
| China Airlines | 813216487 | 12 | 2 | 6 | 1 | 535 | 225 |
| Condor | 417982610 | 2 | 0 | 1 | 0 | 16 | 0 |
| COPA | 550491507 | 3 | 0 | 1 | 0 | 47 | 0 |
| Delta / Northwest* | 6525658894 | 24 | 24 | 12 | 2 | 407 | 51 |
Now converting into a long dataset using the gather function from tidyr package, to have count type and date_range in a column and the value of the count in another column
airline_safety_df <- airline_safety_df %>% gather("date_range_type", "count", 3:8)
head(airline_safety_df, 20) %>% kable() %>% kable_styling()| airline | avail_seat_km_per_week | date_range_type | count |
|---|---|---|---|
| Aer Lingus | 320906734 | incidents_85_99 | 2 |
| Aeroflot* | 1197672318 | incidents_85_99 | 76 |
| Aerolineas Argentinas | 385803648 | incidents_85_99 | 6 |
| Aeromexico* | 596871813 | incidents_85_99 | 3 |
| Air Canada | 1865253802 | incidents_85_99 | 2 |
| Air France | 3004002661 | incidents_85_99 | 14 |
| Air India* | 869253552 | incidents_85_99 | 2 |
| Air New Zealand* | 710174817 | incidents_85_99 | 3 |
| Alaska Airlines* | 965346773 | incidents_85_99 | 5 |
| Alitalia | 698012498 | incidents_85_99 | 7 |
| All Nippon Airways | 1841234177 | incidents_85_99 | 3 |
| American* | 5228357340 | incidents_85_99 | 21 |
| Austrian Airlines | 358239823 | incidents_85_99 | 1 |
| Avianca | 396922563 | incidents_85_99 | 5 |
| British Airways* | 3179760952 | incidents_85_99 | 4 |
| Cathay Pacific* | 2582459303 | incidents_85_99 | 0 |
| China Airlines | 813216487 | incidents_85_99 | 12 |
| Condor | 417982610 | incidents_85_99 | 2 |
| COPA | 550491507 | incidents_85_99 | 3 |
| Delta / Northwest* | 6525658894 | incidents_85_99 | 24 |
Now splitting the date range and type into 2 separate columns, and getting rid of the column which has both these together:
airline_safety_df$date_range <- str_extract(airline_safety_df$date_range_type, "\\d+_\\d+$")
airline_safety_df$type <- str_extract(airline_safety_df$date_range_type, "^[:alpha:]+(_[:alpha:]+)*")
airline_safety_df <- airline_safety_df %>% select(1:2, 5, 6, 4)
head(airline_safety_df, 20) %>% kable() %>% kable_styling()| airline | avail_seat_km_per_week | date_range | type | count |
|---|---|---|---|---|
| Aer Lingus | 320906734 | 85_99 | incidents | 2 |
| Aeroflot* | 1197672318 | 85_99 | incidents | 76 |
| Aerolineas Argentinas | 385803648 | 85_99 | incidents | 6 |
| Aeromexico* | 596871813 | 85_99 | incidents | 3 |
| Air Canada | 1865253802 | 85_99 | incidents | 2 |
| Air France | 3004002661 | 85_99 | incidents | 14 |
| Air India* | 869253552 | 85_99 | incidents | 2 |
| Air New Zealand* | 710174817 | 85_99 | incidents | 3 |
| Alaska Airlines* | 965346773 | 85_99 | incidents | 5 |
| Alitalia | 698012498 | 85_99 | incidents | 7 |
| All Nippon Airways | 1841234177 | 85_99 | incidents | 3 |
| American* | 5228357340 | 85_99 | incidents | 21 |
| Austrian Airlines | 358239823 | 85_99 | incidents | 1 |
| Avianca | 396922563 | 85_99 | incidents | 5 |
| British Airways* | 3179760952 | 85_99 | incidents | 4 |
| Cathay Pacific* | 2582459303 | 85_99 | incidents | 0 |
| China Airlines | 813216487 | 85_99 | incidents | 12 |
| Condor | 417982610 | 85_99 | incidents | 2 |
| COPA | 550491507 | 85_99 | incidents | 3 |
| Delta / Northwest* | 6525658894 | 85_99 | incidents | 24 |
Now using the spread function, using type (“incidents”, “fatal_incidents” and “fatalities”) as the key and count as the value: to convert it back into a rectangular data set, but this time it will be date range wise split
airline_safety_df <- airline_safety_df %>% spread(type, count)
airline_safety_df <- airline_safety_df %>% select(1:3, 6, 4:5)
head(airline_safety_df, 20) %>% kable() %>% kable_styling()| airline | avail_seat_km_per_week | date_range | incidents | fatal_accidents | fatalities |
|---|---|---|---|---|---|
| Aer Lingus | 320906734 | 00_14 | 0 | 0 | 0 |
| Aer Lingus | 320906734 | 85_99 | 2 | 0 | 0 |
| Aeroflot* | 1197672318 | 00_14 | 6 | 1 | 88 |
| Aeroflot* | 1197672318 | 85_99 | 76 | 14 | 128 |
| Aerolineas Argentinas | 385803648 | 00_14 | 1 | 0 | 0 |
| Aerolineas Argentinas | 385803648 | 85_99 | 6 | 0 | 0 |
| Aeromexico* | 596871813 | 00_14 | 5 | 0 | 0 |
| Aeromexico* | 596871813 | 85_99 | 3 | 1 | 64 |
| Air Canada | 1865253802 | 00_14 | 2 | 0 | 0 |
| Air Canada | 1865253802 | 85_99 | 2 | 0 | 0 |
| Air France | 3004002661 | 00_14 | 6 | 2 | 337 |
| Air France | 3004002661 | 85_99 | 14 | 4 | 79 |
| Air India* | 869253552 | 00_14 | 4 | 1 | 158 |
| Air India* | 869253552 | 85_99 | 2 | 1 | 329 |
| Air New Zealand* | 710174817 | 00_14 | 5 | 1 | 7 |
| Air New Zealand* | 710174817 | 85_99 | 3 | 0 | 0 |
| Alaska Airlines* | 965346773 | 00_14 | 5 | 1 | 88 |
| Alaska Airlines* | 965346773 | 85_99 | 5 | 0 | 0 |
| Alitalia | 698012498 | 00_14 | 4 | 0 | 0 |
| Alitalia | 698012498 | 85_99 | 7 | 2 | 50 |
Now the data is fully ready with each airline and a date range having 1 row each. Now this data can be used further for analysis. This is the end point for the first part of this assignment.