Deepak Mongia Tidyverse Recipes Assignment

Deepak Mongia

December 1, 2018

This exercise is a part of DATA-607 Fall 2018 semester. As a part of this exercise, we need to perform the below 2 tasks: 1) Create one example of code which uses one or more packages from the Tidyverse, and put that RMD file into my Github 2) Collaborate into someone else’s code using fork code and pull request on someone else’s Github

Below is he code for the first part of this assignment:

I am taking the airline safety data set from fivethirtyeight github (https://github.com/fivethirtyeight/data/tree/master/airline-safety), and making it ready for the data analysis.

Step-1: Load the required packages into R:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(tidyr)
library(stringr)
library(kableExtra)

Step-2: Load the dataset into R

airline_safety_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv"

airline_safety_dataset <- read_csv(airline_safety_url)
## Parsed with column specification:
## cols(
##   airline = col_character(),
##   avail_seat_km_per_week = col_double(),
##   incidents_85_99 = col_integer(),
##   fatal_accidents_85_99 = col_integer(),
##   fatalities_85_99 = col_integer(),
##   incidents_00_14 = col_integer(),
##   fatal_accidents_00_14 = col_integer(),
##   fatalities_00_14 = col_integer()
## )
dim(airline_safety_dataset)
## [1] 56  8
head(airline_safety_dataset, 20) %>% kable() %>% kable_styling()
airline avail_seat_km_per_week incidents_85_99 fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14 fatalities_00_14
Aer Lingus 320906734 2 0 0 0 0 0
Aeroflot* 1197672318 76 14 128 6 1 88
Aerolineas Argentinas 385803648 6 0 0 1 0 0
Aeromexico* 596871813 3 1 64 5 0 0
Air Canada 1865253802 2 0 0 2 0 0
Air France 3004002661 14 4 79 6 2 337
Air India* 869253552 2 1 329 4 1 158
Air New Zealand* 710174817 3 0 0 5 1 7
Alaska Airlines* 965346773 5 0 0 5 1 88
Alitalia 698012498 7 2 50 4 0 0
All Nippon Airways 1841234177 3 1 1 7 0 0
American* 5228357340 21 5 101 17 3 416
Austrian Airlines 358239823 1 0 0 1 0 0
Avianca 396922563 5 3 323 0 0 0
British Airways* 3179760952 4 0 0 6 0 0
Cathay Pacific* 2582459303 0 0 0 2 0 0
China Airlines 813216487 12 6 535 2 1 225
Condor 417982610 2 1 16 0 0 0
COPA 550491507 3 1 47 0 0 0
Delta / Northwest* 6525658894 24 12 407 24 2 51

As we see above, this dataset gives the data for both the date-ranges : 85-99 and 00-14 in the same row. Now we want to convert this dataset into a format where for each airline, there will be 2 rows, one for 85-99 and another row for 00-14. This will ensure that for each date range for an airline, we have 1 row with all the variable counts - incidents, fatal_incidents and fatalities.

Creating a new working data frame which will be used further. Creating this with all the same type counts togethr, e.g. incidents_85_99 and incidents_00_14 next to each other, and so on.

airline_safety_df <- airline_safety_dataset %>% select(1:3,6,4,7,5,8)

head(airline_safety_df, 20) %>% kable() %>% kable_styling()
airline avail_seat_km_per_week incidents_85_99 incidents_00_14 fatal_accidents_85_99 fatal_accidents_00_14 fatalities_85_99 fatalities_00_14
Aer Lingus 320906734 2 0 0 0 0 0
Aeroflot* 1197672318 76 6 14 1 128 88
Aerolineas Argentinas 385803648 6 1 0 0 0 0
Aeromexico* 596871813 3 5 1 0 64 0
Air Canada 1865253802 2 2 0 0 0 0
Air France 3004002661 14 6 4 2 79 337
Air India* 869253552 2 4 1 1 329 158
Air New Zealand* 710174817 3 5 0 1 0 7
Alaska Airlines* 965346773 5 5 0 1 0 88
Alitalia 698012498 7 4 2 0 50 0
All Nippon Airways 1841234177 3 7 1 0 1 0
American* 5228357340 21 17 5 3 101 416
Austrian Airlines 358239823 1 1 0 0 0 0
Avianca 396922563 5 0 3 0 323 0
British Airways* 3179760952 4 6 0 0 0 0
Cathay Pacific* 2582459303 0 2 0 0 0 0
China Airlines 813216487 12 2 6 1 535 225
Condor 417982610 2 0 1 0 16 0
COPA 550491507 3 0 1 0 47 0
Delta / Northwest* 6525658894 24 24 12 2 407 51

Now converting into a long dataset using the gather function from tidyr package, to have count type and date_range in a column and the value of the count in another column

airline_safety_df <- airline_safety_df %>% gather("date_range_type", "count", 3:8)

head(airline_safety_df, 20) %>% kable() %>% kable_styling()
airline avail_seat_km_per_week date_range_type count
Aer Lingus 320906734 incidents_85_99 2
Aeroflot* 1197672318 incidents_85_99 76
Aerolineas Argentinas 385803648 incidents_85_99 6
Aeromexico* 596871813 incidents_85_99 3
Air Canada 1865253802 incidents_85_99 2
Air France 3004002661 incidents_85_99 14
Air India* 869253552 incidents_85_99 2
Air New Zealand* 710174817 incidents_85_99 3
Alaska Airlines* 965346773 incidents_85_99 5
Alitalia 698012498 incidents_85_99 7
All Nippon Airways 1841234177 incidents_85_99 3
American* 5228357340 incidents_85_99 21
Austrian Airlines 358239823 incidents_85_99 1
Avianca 396922563 incidents_85_99 5
British Airways* 3179760952 incidents_85_99 4
Cathay Pacific* 2582459303 incidents_85_99 0
China Airlines 813216487 incidents_85_99 12
Condor 417982610 incidents_85_99 2
COPA 550491507 incidents_85_99 3
Delta / Northwest* 6525658894 incidents_85_99 24

Now splitting the date range and type into 2 separate columns, and getting rid of the column which has both these together:

airline_safety_df$date_range <- str_extract(airline_safety_df$date_range_type, "\\d+_\\d+$")
airline_safety_df$type <- str_extract(airline_safety_df$date_range_type, "^[:alpha:]+(_[:alpha:]+)*")

airline_safety_df <- airline_safety_df %>% select(1:2, 5, 6, 4)

head(airline_safety_df, 20) %>% kable() %>% kable_styling()
airline avail_seat_km_per_week date_range type count
Aer Lingus 320906734 85_99 incidents 2
Aeroflot* 1197672318 85_99 incidents 76
Aerolineas Argentinas 385803648 85_99 incidents 6
Aeromexico* 596871813 85_99 incidents 3
Air Canada 1865253802 85_99 incidents 2
Air France 3004002661 85_99 incidents 14
Air India* 869253552 85_99 incidents 2
Air New Zealand* 710174817 85_99 incidents 3
Alaska Airlines* 965346773 85_99 incidents 5
Alitalia 698012498 85_99 incidents 7
All Nippon Airways 1841234177 85_99 incidents 3
American* 5228357340 85_99 incidents 21
Austrian Airlines 358239823 85_99 incidents 1
Avianca 396922563 85_99 incidents 5
British Airways* 3179760952 85_99 incidents 4
Cathay Pacific* 2582459303 85_99 incidents 0
China Airlines 813216487 85_99 incidents 12
Condor 417982610 85_99 incidents 2
COPA 550491507 85_99 incidents 3
Delta / Northwest* 6525658894 85_99 incidents 24

Now using the spread function, using type (“incidents”, “fatal_incidents” and “fatalities”) as the key and count as the value: to convert it back into a rectangular data set, but this time it will be date range wise split

airline_safety_df <- airline_safety_df %>% spread(type, count)

airline_safety_df <- airline_safety_df %>% select(1:3, 6, 4:5)

head(airline_safety_df, 20) %>% kable() %>% kable_styling()
airline avail_seat_km_per_week date_range incidents fatal_accidents fatalities
Aer Lingus 320906734 00_14 0 0 0
Aer Lingus 320906734 85_99 2 0 0
Aeroflot* 1197672318 00_14 6 1 88
Aeroflot* 1197672318 85_99 76 14 128
Aerolineas Argentinas 385803648 00_14 1 0 0
Aerolineas Argentinas 385803648 85_99 6 0 0
Aeromexico* 596871813 00_14 5 0 0
Aeromexico* 596871813 85_99 3 1 64
Air Canada 1865253802 00_14 2 0 0
Air Canada 1865253802 85_99 2 0 0
Air France 3004002661 00_14 6 2 337
Air France 3004002661 85_99 14 4 79
Air India* 869253552 00_14 4 1 158
Air India* 869253552 85_99 2 1 329
Air New Zealand* 710174817 00_14 5 1 7
Air New Zealand* 710174817 85_99 3 0 0
Alaska Airlines* 965346773 00_14 5 1 88
Alaska Airlines* 965346773 85_99 5 0 0
Alitalia 698012498 00_14 4 0 0
Alitalia 698012498 85_99 7 2 50

Now the data is fully ready with each airline and a date range having 1 row each. Now this data can be used further for analysis. This is the end point for the first part of this assignment.