Introduction

The main purpose of this tutorial is to practice 4 key functions from dplyr. The functions and their purposes are listed as follows:

filter() Selects Observations Based on Values
arrange() Sorts Observations Based on Criteria
select() or rename() Selects, Deselects, Renames, and Reorders Variables
mutate() or transmute() Creates New Variables Which Were Originally Nonexistant

We will practice our skills using the dataset flights by loading the R package nycflights13.

Part 1: The Uniqueness of Flight Numbers

Step 1 Subset Observations

Using filter(), start by creating a new tibble called f1a that only contains records from flight number 807.

f1a<-filter(flights, flight==807)
f1a

## # A tibble: 355 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     2      744            730        14     1017           1007
##  2  2013     1     3      729            730        -1     1000           1004
##  3  2013     1     4      728            730        -2      946           1004
##  4  2013     1     5      727            730        -3      941           1004
##  5  2013     1     7      901            730        91     1118           1004
##  6  2013     1     8      729            730        -1     1004           1004
##  7  2013     1     9      727            730        -3     1015           1004
##  8  2013     1    10      727            730        -3      954           1004
##  9  2013     1    11      726            730        -4      954           1004
## 10  2013     1    12      728            730        -2      933           1004
## # ... with 345 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Step 2 Subset Variables

Now, apply select() to create a new dataset f1b based on f1a only containing variables “flight”, “carrier”, “origin”, and “dest”.

f1b<-select(f1a, flight, carrier, origin, dest)
f1b

## # A tibble: 355 x 4
##    flight carrier origin dest 
##     <int> <chr>   <chr>  <chr>
##  1    807 DL      EWR    ATL  
##  2    807 DL      EWR    ATL  
##  3    807 DL      EWR    ATL  
##  4    807 DL      EWR    ATL  
##  5    807 DL      EWR    ATL  
##  6    807 DL      EWR    ATL  
##  7    807 DL      EWR    ATL  
##  8    807 DL      EWR    ATL  
##  9    807 DL      EWR    ATL  
## 10    807 DL      EWR    ATL  
## # ... with 345 more rows

Step 3 Rename Variable

Rename the variable “dest” to “destination” in f1c.

f1c=rename(f1b,destination=dest)
f1c

## # A tibble: 355 x 4
##    flight carrier origin destination
##     <int> <chr>   <chr>  <chr>      
##  1    807 DL      EWR    ATL        
##  2    807 DL      EWR    ATL        
##  3    807 DL      EWR    ATL        
##  4    807 DL      EWR    ATL        
##  5    807 DL      EWR    ATL        
##  6    807 DL      EWR    ATL        
##  7    807 DL      EWR    ATL        
##  8    807 DL      EWR    ATL        
##  9    807 DL      EWR    ATL        
## 10    807 DL      EWR    ATL        
## # ... with 345 more rows

Step 4 Sort Results

Because f1c only contains records for flight number 807, sorting by flight number is irrelevant. Create a new dataset f1d that is identical in content to f1b but is sorted by “carrier” first, “origin” second, and “destination”" last. Use the function head()to display the first 5 rows of the data in f1d.

f1d<-arrange(f1c,carrier, origin, destination)
head(f1d,5)

## # A tibble: 5 x 4
##   flight carrier origin destination
##    <int> <chr>   <chr>  <chr>      
## 1    807 DL      EWR    ATL        
## 2    807 DL      EWR    ATL        
## 3    807 DL      EWR    ATL        
## 4    807 DL      EWR    ATL        
## 5    807 DL      EWR    ATL

At first glance, it seems that 807 uniquely represents the flight from EWR_ to ATL by carrier Delta. To confirm this, create a new dataset f1e that is f1d sorted by the three variables previously mentioned, all in descending order. The function desc() is required here. Follow this by displaying the first 18 rows of the data in f1d.

f1e<-arrange(f1d,desc(carrier),desc(origin),desc(destination))
head(f1e,18)

## # A tibble: 18 x 4
##    flight carrier origin destination
##     <int> <chr>   <chr>  <chr>      
##  1    807 WN      EWR    MDW        
##  2    807 WN      EWR    MDW        
##  3    807 WN      EWR    MDW        
##  4    807 WN      EWR    MDW        
##  5    807 UA      LGA    IAH        
##  6    807 UA      LGA    IAH        
##  7    807 UA      LGA    IAH        
##  8    807 UA      LGA    IAH        
##  9    807 UA      EWR    MCO        
## 10    807 UA      EWR    MCO        
## 11    807 UA      EWR    LAS        
## 12    807 UA      EWR    IAH        
## 13    807 UA      EWR    IAH        
## 14    807 UA      EWR    IAH        
## 15    807 UA      EWR    BOS        
## 16    807 DL      JFK    PIT        
## 17    807 DL      EWR    ATL        
## 18    807 DL      EWR    ATL

May we conclude that flight numbers can be mapped to a unique carrier? No__ (Yes/No)

Why or why not?

If we know the flight number, carrier, and origin, can we know for sure what the destination will be? No_ (Yes/No)

Why or why not?

Part 2: Time to Get on the Pipe

Step 1: Simple Example

The pipe %>% is used in a similar fashion to + in the ggplot2() package.

One simple example: Compute the logarithm of \(x\), return suitably lagged and iterated differences, compute the exponential function and round the result to 1 decimal place.

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

Method 1:

round(exp(diff(log(x))), 1)

## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

Method 2:

x %>% log() %>%
    diff() %>%
    exp() %>%
    round(1)

## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

Step 2: On data frame

It’s time to take a magical journey through the tidyverse on the pipe. Start by using transmute() to create a new variable “dep_hr” based on “dep_time” which represents departure time in hours since midnight. Recall the code for this is dep_hr=dep_time%/%100+(dep_time%%100)/60. In the same step, do this also for “sched_dep_time”,“arr_time”, and “sched_arr_time” naming the new variables “sched_dep_hr”, “arr_hr”, and “sched_arr_hr”, respectively. Save all these new variables to a new tibble called f2a. Use the function names() to ensure that f2a only contains the new variables and the function head() to view the top 5 rows.

f2a=transmute(flights,
          dep_hr=dep_time%/%100+(dep_time%%100)/60,
          sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
          arr_hr=arr_time%/%100+(arr_time%%100)/60,
          sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60)
names(f2a)

## [1] "dep_hr"       "sched_dep_hr" "arr_hr"       "sched_arr_hr"

head(f2a,5)

## # A tibble: 5 x 4
##   dep_hr sched_dep_hr arr_hr sched_arr_hr
##    <dbl>        <dbl>  <dbl>        <dbl>
## 1   5.28         5.25   8.5          8.32
## 2   5.55         5.48   8.83         8.5 
## 3   5.7          5.67   9.38         8.83
## 4   5.73         5.75  10.1         10.4 
## 5   5.9          6      8.2          8.62

Now we can create true delay variables, measured in hours, for both departure and arrival. Using mutate(), create a new variable “dep_delay_hr” which equals the difference between “dep_hr” and “sched_dep_hr”. Analogously, perform the same operation for arrival. Call the new dataset f2b.

f2b=mutate(f2a,
           dep_delay_hr=dep_hr-sched_dep_hr,
           arr_delay_hr=arr_hr-sched_arr_hr)
f2b

## # A tibble: 336,776 x 6
##    dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_delay_hr
##     <dbl>        <dbl>  <dbl>        <dbl>        <dbl>        <dbl>
##  1   5.28         5.25   8.5          8.32       0.0333        0.183
##  2   5.55         5.48   8.83         8.5        0.0667        0.333
##  3   5.7          5.67   9.38         8.83       0.0333        0.550
##  4   5.73         5.75  10.1         10.4       -0.0167       -0.300
##  5   5.9          6      8.2          8.62      -0.100        -0.417
##  6   5.9          5.97   7.67         7.47      -0.0667        0.200
##  7   5.92         6      9.22         8.9       -0.0833        0.317
##  8   5.95         6      7.15         7.38      -0.0500       -0.233
##  9   5.95         6      8.63         8.77      -0.0500       -0.133
## 10   5.97         6      7.88         7.75      -0.0333        0.133
## # ... with 336,766 more rows

Use the function percent_rank() to create a new variable “percent_dep_delay_hr” which represents the percentiles of the variable you previously created. Notice that you can develop variables based on recently transformed variables in the same iteration of mutate(), but be careful little buddy because order matters. Name the new dataset f2c.

f2c=mutate(f2b,
           percent_dep_delay_hr=percent_rank(dep_delay_hr))
f2c

## # A tibble: 336,776 x 7
##    dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_delay_hr
##     <dbl>        <dbl>  <dbl>        <dbl>        <dbl>        <dbl>
##  1   5.28         5.25   8.5          8.32       0.0333        0.183
##  2   5.55         5.48   8.83         8.5        0.0667        0.333
##  3   5.7          5.67   9.38         8.83       0.0333        0.550
##  4   5.73         5.75  10.1         10.4       -0.0167       -0.300
##  5   5.9          6      8.2          8.62      -0.100        -0.417
##  6   5.9          5.97   7.67         7.47      -0.0667        0.200
##  7   5.92         6      9.22         8.9       -0.0833        0.317
##  8   5.95         6      7.15         7.38      -0.0500       -0.233
##  9   5.95         6      8.63         8.77      -0.0500       -0.133
## 10   5.97         6      7.88         7.75      -0.0333        0.133
## # ... with 336,766 more rows, and 1 more variable: percent_dep_delay_hr <dbl>

Use filter() to select the observations where percent_dep_delay_hr<0.1 or percent_dep_delay_hr>0.9 in a new dataset f2d. The tibble f2d will contain the bottom 10% and top 10% of flights based off “dep_delay_hr”.

f2d<-filter(f2c,percent_dep_delay_hr<0.1|percent_dep_delay_hr>0.9)
f2d

## # A tibble: 65,737 x 7
##    dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_delay_hr
##     <dbl>        <dbl>  <dbl>        <dbl>        <dbl>        <dbl>
##  1   6.03         6.17   8.2          8.33       -0.133      -0.133 
##  2   6.37         6.5   10.3         10.2        -0.133       0.0500
##  3   6.62         6.75   9.5          9.58       -0.133      -0.0833
##  4   6.93         7.08  10.1          9.67       -0.150       0.450 
##  5   8            8.17   9.82         9.92       -0.167      -0.100 
##  6   8.08         8.25  10.1         10.2        -0.167      -0.0667
##  7   8.18         6.5   10.8          8.5         1.68        2.28  
##  8   8.33         8.5    9.67         9.9        -0.167      -0.233 
##  9   8.37         8.5   10.2         10.3        -0.133      -0.0333
## 10   8.43         7.25  11.6         10.8         1.18        0.85  
## # ... with 65,727 more rows, and 1 more variable: percent_dep_delay_hr <dbl>

Finally, sort the data using arrange() from largest to smallest based on the variable “percent_dep_delay_hr”. Name the sorted tibble f2e. Use head() on f2e to show the top 5 flights based on the constructed variable “dep_delay_hr”.

f2e<-arrange(f2d,desc(percent_dep_delay_hr))
head(f2e,5)

## # A tibble: 5 x 7
##   dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_delay_hr
##    <dbl>        <dbl>  <dbl>        <dbl>        <dbl>        <dbl>
## 1   23.4         8.17   1.58         10.3         15.2        -8.75
## 2   23.0         7.98   1.35         10.4         15.0        -9.08
## 3   22.7         8.5    1            11.1         14.2       -10.1 
## 4   23.4        10.3    1.23         12.4         13.1       -11.2 
## 5   19.4         6.25  21.6           8.7         13.1        12.9 
## # ... with 1 more variable: percent_dep_delay_hr <dbl>

Getting the original data from flights to f2e required multiple steps. If we know what we want to do with the raw data from flights, we can use the pipe %>% to obtain the same result without intermittently introducing new tibbles into our global environment. The R code below turns your pipe dreams into reality.

f2e.pipedream = 
  flights %>%
  transmute(dep_hr=dep_time%/%100+(dep_time%%100)/60,
            sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
            arr_hr=arr_time%/%100+(arr_time%%100)/60,
            sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60) %>%
  
  mutate(dep_delay_hr=dep_hr-sched_dep_hr,
        arr_delay_hr=arr_hr-sched_arr_hr) %>%
  
  mutate(percent_dep_delay_hr=percent_rank(dep_delay_hr)) %>%
  
  filter(percent_dep_delay_hr<0.1|percent_dep_delay_hr>0.9) %>%
  
  arrange(desc(percent_dep_delay_hr))

head(f2e.pipedream,5)

## # A tibble: 5 x 7
##   dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_delay_hr
##    <dbl>        <dbl>  <dbl>        <dbl>        <dbl>        <dbl>
## 1   23.4         8.17   1.58         10.3         15.2        -8.75
## 2   23.0         7.98   1.35         10.4         15.0        -9.08
## 3   22.7         8.5    1            11.1         14.2       -10.1 
## 4   23.4        10.3    1.23         12.4         13.1       -11.2 
## 5   19.4         6.25  21.6           8.7         13.1        12.9 
## # ... with 1 more variable: percent_dep_delay_hr <dbl>

The tibble f2e.pipedream is identical to f2e in the number of observations (65,737), the number of variables (7), and the order of observations. We can check to see if the tibble f2e is identical to f2e.pipedream using identical(). Can you feel the sensation? It’s piping hot up in here.

identical(f2e,f2e.pipedream)

## [1] TRUE

Part 3: Measuring Accuracy

How would you measure the accuracy of individual flights? We can say that a flight is accurate if it leaves on time and arrives on time. Suppose we want to create an accuracy measure that captures this information where larger values indicate more inaccurate flights. Try to think of a creative way to measure accuracy by using mutate() to construct a new variable named “accuracy”. Call the new tibble f.accuracy. As a data scientist, the metric you want is not always in the raw data. This is an example of problem that requires a level of innovation for which a job will grant you some cash money.

f.accuracy<-mutate(flights,
                dep_hr=dep_time%/%100+(dep_time%%100)/60,
                sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
                arr_hr=arr_time%/%100+(arr_time%%100)/60,
                sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60,
                dep_delay_hr=dep_hr-sched_dep_hr,
                arr_delay_hr=arr_hr-sched_arr_hr,
                accuracy=abs(dep_delay)+abs(arr_delay))
head(f.accuracy,5)

## # A tibble: 5 x 26
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## # ... with 18 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, dep_hr <dbl>,
## #   sched_dep_hr <dbl>, arr_hr <dbl>, sched_arr_hr <dbl>, dep_delay_hr <dbl>,
## #   arr_delay_hr <dbl>, accuracy <dbl>

Preparing for comparisons of airline carriers on accuracy, we use the select() function to create a new tibble named f.accuracy2 which only contains the variables “carrier” and “accuracy”.

f.accuracy2=select(f.accuracy,carrier,accuracy)

Next, we can evaluate carriers based on their average accuracy across all flights based on our new metric. Furthermore, the standard deviation of this accuracy metric can help us measure the consistency of these airline carriers in performance. The summarize() function combined with group_by() allows for quick aggregation on the carrier level.

carrier.summary<- f.accuracy2 %>%
                  group_by(carrier) %>%
                  summarize(
                    mean.accuracy=mean(accuracy,na.rm=T),
                    sd.accuracy=sd(accuracy,na.rm=T)
                  ) %>%
                  arrange(mean.accuracy)
carrier.summary

## # A tibble: 16 x 3
##    carrier mean.accuracy sd.accuracy
##    <chr>           <dbl>       <dbl>
##  1 US               31.4        50.8
##  2 DL               38.9        73.5
##  3 HA               40.0       142. 
##  4 AA               40.2        67.5
##  5 UA               40.6        64.7
##  6 AS               42.4        50.4
##  7 MQ               42.6        71.9
##  8 B6               43.9        70.3
##  9 VX               45.1        82.0
## 10 WN               46.3        80.7
## 11 FL               52.5        99.5
## 12 9E               53.1        81.3
## 13 OO               53.3        77.0
## 14 EV               55.7        85.2
## 15 F9               57.8       112. 
## 16 YV               59.1        88.8

Conclusion

Following this tutorial, we will explore the depth of summarize(), the most important function in dplyr. For right now, just know that not all pipes are created free and equal; however, this pipe %>% is straight-up magical. In the early stages, use the aforementioned pipe with caution, but once you develop a tolerance, you are prescribed to use it liberally.

STOR 320 Tutorial on Data Transformation 1

Wonkyung Jang

May 25, 2021