Deepak Mongia Tidyverse Recipes Assignment

This exercise is a part of DATA-607 Fall 2018 semester. As a part of this exercise, we need to perform the below 2 tasks: 1) Create one example of code which uses one or more packages from the Tidyverse, and put that RMD file into my Github 2) Collaborate into someone else’s code using fork code and pull request on someone else’s Github

Below is he code for the first part of this assignment:

I am taking the airline safety data set from fivethirtyeight github (https://github.com/fivethirtyeight/data/tree/master/airline-safety), and making it ready for the data analysis.

Step-1: Load the required packages into R:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(tidyr)
library(stringr)
library(kableExtra)

Step-2: Load the dataset into R

airline_safety_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv"

airline_safety_dataset <- read_csv(airline_safety_url)

## Parsed with column specification:
## cols(
##   airline = col_character(),
##   avail_seat_km_per_week = col_double(),
##   incidents_85_99 = col_integer(),
##   fatal_accidents_85_99 = col_integer(),
##   fatalities_85_99 = col_integer(),
##   incidents_00_14 = col_integer(),
##   fatal_accidents_00_14 = col_integer(),
##   fatalities_00_14 = col_integer()
## )

dim(airline_safety_dataset)

## [1] 56  8

head(airline_safety_dataset, 20) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	incidents_85_99	fatal_accidents_85_99	fatalities_85_99	incidents_00_14	fatal_accidents_00_14	fatalities_00_14
Aer Lingus	320906734	2	0	0	0	0	0
Aeroflot*	1197672318	76	14	128	6	1	88
Aerolineas Argentinas	385803648	6	0	0	1	0	0
Aeromexico*	596871813	3	1	64	5	0	0
Air Canada	1865253802	2	0	0	2	0	0
Air France	3004002661	14	4	79	6	2	337
Air India*	869253552	2	1	329	4	1	158
Air New Zealand*	710174817	3	0	0	5	1	7
Alaska Airlines*	965346773	5	0	0	5	1	88
Alitalia	698012498	7	2	50	4	0	0
All Nippon Airways	1841234177	3	1	1	7	0	0
American*	5228357340	21	5	101	17	3	416
Austrian Airlines	358239823	1	0	0	1	0	0
Avianca	396922563	5	3	323	0	0	0
British Airways*	3179760952	4	0	0	6	0	0
Cathay Pacific*	2582459303	0	0	0	2	0	0
China Airlines	813216487	12	6	535	2	1	225
Condor	417982610	2	1	16	0	0	0
COPA	550491507	3	1	47	0	0	0
Delta / Northwest*	6525658894	24	12	407	24	2	51

As we see above, this dataset gives the data for both the date-ranges : 85-99 and 00-14 in the same row. Now we want to convert this dataset into a format where for each airline, there will be 2 rows, one for 85-99 and another row for 00-14. This will ensure that for each date range for an airline, we have 1 row with all the variable counts - incidents, fatal_incidents and fatalities.

Creating a new working data frame which will be used further. Creating this with all the same type counts togethr, e.g. incidents_85_99 and incidents_00_14 next to each other, and so on.

airline_safety_df <- airline_safety_dataset %>% select(1:3,6,4,7,5,8)

head(airline_safety_df, 20) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	incidents_85_99	incidents_00_14	fatal_accidents_85_99	fatal_accidents_00_14	fatalities_85_99	fatalities_00_14
Aer Lingus	320906734	2	0	0	0	0	0
Aeroflot*	1197672318	76	6	14	1	128	88
Aerolineas Argentinas	385803648	6	1	0	0	0	0
Aeromexico*	596871813	3	5	1	0	64	0
Air Canada	1865253802	2	2	0	0	0	0
Air France	3004002661	14	6	4	2	79	337
Air India*	869253552	2	4	1	1	329	158
Air New Zealand*	710174817	3	5	0	1	0	7
Alaska Airlines*	965346773	5	5	0	1	0	88
Alitalia	698012498	7	4	2	0	50	0
All Nippon Airways	1841234177	3	7	1	0	1	0
American*	5228357340	21	17	5	3	101	416
Austrian Airlines	358239823	1	1	0	0	0	0
Avianca	396922563	5	0	3	0	323	0
British Airways*	3179760952	4	6	0	0	0	0
Cathay Pacific*	2582459303	0	2	0	0	0	0
China Airlines	813216487	12	2	6	1	535	225
Condor	417982610	2	0	1	0	16	0
COPA	550491507	3	0	1	0	47	0
Delta / Northwest*	6525658894	24	24	12	2	407	51

Now converting into a long dataset using the gather function from tidyr package, to have count type and date_range in a column and the value of the count in another column

airline_safety_df <- airline_safety_df %>% gather("date_range_type", "count", 3:8)

head(airline_safety_df, 20) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	date_range_type	count
Aer Lingus	320906734	incidents_85_99	2
Aeroflot*	1197672318	incidents_85_99	76
Aerolineas Argentinas	385803648	incidents_85_99	6
Aeromexico*	596871813	incidents_85_99	3
Air Canada	1865253802	incidents_85_99	2
Air France	3004002661	incidents_85_99	14
Air India*	869253552	incidents_85_99	2
Air New Zealand*	710174817	incidents_85_99	3
Alaska Airlines*	965346773	incidents_85_99	5
Alitalia	698012498	incidents_85_99	7
All Nippon Airways	1841234177	incidents_85_99	3
American*	5228357340	incidents_85_99	21
Austrian Airlines	358239823	incidents_85_99	1
Avianca	396922563	incidents_85_99	5
British Airways*	3179760952	incidents_85_99	4
Cathay Pacific*	2582459303	incidents_85_99	0
China Airlines	813216487	incidents_85_99	12
Condor	417982610	incidents_85_99	2
COPA	550491507	incidents_85_99	3
Delta / Northwest*	6525658894	incidents_85_99	24

Now splitting the date range and type into 2 separate columns, and getting rid of the column which has both these together:

airline_safety_df$date_range <- str_extract(airline_safety_df$date_range_type, "\\d+_\\d+$")
airline_safety_df$type <- str_extract(airline_safety_df$date_range_type, "^[:alpha:]+(_[:alpha:]+)*")

airline_safety_df <- airline_safety_df %>% select(1:2, 5, 6, 4)

head(airline_safety_df, 20) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	date_range	type	count
Aer Lingus	320906734	85_99	incidents	2
Aeroflot*	1197672318	85_99	incidents	76
Aerolineas Argentinas	385803648	85_99	incidents	6
Aeromexico*	596871813	85_99	incidents	3
Air Canada	1865253802	85_99	incidents	2
Air France	3004002661	85_99	incidents	14
Air India*	869253552	85_99	incidents	2
Air New Zealand*	710174817	85_99	incidents	3
Alaska Airlines*	965346773	85_99	incidents	5
Alitalia	698012498	85_99	incidents	7
All Nippon Airways	1841234177	85_99	incidents	3
American*	5228357340	85_99	incidents	21
Austrian Airlines	358239823	85_99	incidents	1
Avianca	396922563	85_99	incidents	5
British Airways*	3179760952	85_99	incidents	4
Cathay Pacific*	2582459303	85_99	incidents	0
China Airlines	813216487	85_99	incidents	12
Condor	417982610	85_99	incidents	2
COPA	550491507	85_99	incidents	3
Delta / Northwest*	6525658894	85_99	incidents	24

Now using the spread function, using type (“incidents”, “fatal_incidents” and “fatalities”) as the key and count as the value: to convert it back into a rectangular data set, but this time it will be date range wise split

airline_safety_df <- airline_safety_df %>% spread(type, count)

airline_safety_df <- airline_safety_df %>% select(1:3, 6, 4:5)

head(airline_safety_df, 20) %>% kable() %>% kable_styling()

airline	avail_seat_km_per_week	date_range	incidents	fatal_accidents	fatalities
Aer Lingus	320906734	00_14	0	0	0
Aer Lingus	320906734	85_99	2	0	0
Aeroflot*	1197672318	00_14	6	1	88
Aeroflot*	1197672318	85_99	76	14	128
Aerolineas Argentinas	385803648	00_14	1	0	0
Aerolineas Argentinas	385803648	85_99	6	0	0
Aeromexico*	596871813	00_14	5	0	0
Aeromexico*	596871813	85_99	3	1	64
Air Canada	1865253802	00_14	2	0	0
Air Canada	1865253802	85_99	2	0	0
Air France	3004002661	00_14	6	2	337
Air France	3004002661	85_99	14	4	79
Air India*	869253552	00_14	4	1	158
Air India*	869253552	85_99	2	1	329
Air New Zealand*	710174817	00_14	5	1	7
Air New Zealand*	710174817	85_99	3	0	0
Alaska Airlines*	965346773	00_14	5	1	88
Alaska Airlines*	965346773	85_99	5	0	0
Alitalia	698012498	00_14	4	0	0
Alitalia	698012498	85_99	7	2	50

Now the data is fully ready with each airline and a date range having 1 row each. Now this data can be used further for analysis. This is the end point for the first part of this assignment.