Introduction

The main purpose of this tutorial is to practice four key dplyr functions and their close variants. The functions and their purposes are as follows:

filter() Selects Observations Based on Values
arrange() Sorts Observations Based on Criteria
select() or rename() Selects, Deselects, Renames, and Reorders Variables
mutate() or transmute() Creates New Variables That Did Not Previously Exist

We will practice these skills on the flights dataset, which becomes available after loading the R package nycflights13.
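Before starting, it can help to look at the shape of the data. The following is a minimal sketch, assuming both nycflights13 and dplyr are installed; the name flights_preview is only used here for illustration.

```r
# Peek at the flights data without attaching any packages.
# Assumes nycflights13 and dplyr are installed; flights_preview is a
# hypothetical name used only for this illustration.
flights_preview <- nycflights13::flights

dim(flights_preview)             # number of rows and number of columns
dplyr::glimpse(flights_preview)  # one line per variable, with its type
```

glimpse() lists every variable, which is useful because the default tibble printout only shows the columns that fit on screen.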

Part 1: The Uniqueness of Flight Numbers

Step 1 Subset Observations

Using filter(), start by creating a new tibble called f1a that only contains records from flight number 807.

library(dplyr)
library(nycflights13)

f1a <- filter(flights, flight == 807)
f1a

## # A tibble: 355 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     2      744            730        14     1017           1007
##  2  2013     1     3      729            730        -1     1000           1004
##  3  2013     1     4      728            730        -2      946           1004
##  4  2013     1     5      727            730        -3      941           1004
##  5  2013     1     7      901            730        91     1118           1004
##  6  2013     1     8      729            730        -1     1004           1004
##  7  2013     1     9      727            730        -3     1015           1004
##  8  2013     1    10      727            730        -3      954           1004
##  9  2013     1    11      726            730        -4      954           1004
## 10  2013     1    12      728            730        -2      933           1004
## # ... with 345 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
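filter() also accepts several conditions at once. As a small sketch that goes slightly beyond the exercise, the extra condition on carrier below is our own illustration, and the names dl_807 and dl_807_and are hypothetical.

```r
library(dplyr)
library(nycflights13)

# Conditions separated by commas are combined: a row is kept only
# when every condition is TRUE.
dl_807 <- filter(flights, flight == 807, carrier == "DL")

# The same subset written with an explicit logical AND.
dl_807_and <- filter(flights, flight == 807 & carrier == "DL")

# Both versions should keep the same number of rows.
nrow(dl_807)
nrow(dl_807_and)
```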

Step 2 Subset Variables

Now, apply select() to create a new dataset f1b, based on f1a, that contains only the variables “flight”, “carrier”, “origin”, and “dest”.

f1b <- select(f1a, flight, carrier, origin, dest)
f1b

## # A tibble: 355 x 4
##    flight carrier origin dest 
##     <int> <chr>   <chr>  <chr>
##  1    807 DL      EWR    ATL  
##  2    807 DL      EWR    ATL  
##  3    807 DL      EWR    ATL  
##  4    807 DL      EWR    ATL  
##  5    807 DL      EWR    ATL  
##  6    807 DL      EWR    ATL  
##  7    807 DL      EWR    ATL  
##  8    807 DL      EWR    ATL  
##  9    807 DL      EWR    ATL  
## 10    807 DL      EWR    ATL  
## # ... with 345 more rows
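The introduction notes that select() can also deselect and reorder variables. Here is a brief sketch of those uses, assuming a reasonably recent dplyr (all_of() requires version 1.0.0 or later); the names keep, f1b_names, no_date, and dest_first are hypothetical and not part of the exercise.

```r
library(dplyr)
library(nycflights13)

f1a <- filter(flights, flight == 807)

# Select using a character vector of names; all_of() raises an error
# if any of the names is missing from the data.
keep <- c("flight", "carrier", "origin", "dest")
f1b_names <- select(f1a, all_of(keep))

# Deselect with a minus sign: drop the date columns, keep the rest.
no_date <- select(f1a, -year, -month, -day)

# Reorder: put dest first, then every remaining column in its
# original order.
dest_first <- select(f1a, dest, everything())

names(f1b_names)
names(dest_first)
```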

Step 3 Rename Variable

Create a new dataset f1c from f1b in which the variable “dest” is renamed to “destination”.

f1c <- rename(f1b, destination = dest)
f1c

## # A tibble: 355 x 4
##    flight carrier origin destination
##     <int> <chr>   <chr>  <chr>      
##  1    807 DL      EWR    ATL        
##  2    807 DL      EWR    ATL        
##  3    807 DL      EWR    ATL        
##  4    807 DL      EWR    ATL        
##  5    807 DL      EWR    ATL        
##  6    807 DL      EWR    ATL        
##  7    807 DL      EWR    ATL        
##  8    807 DL      EWR    ATL        
##  9    807 DL      EWR    ATL        
## 10    807 DL      EWR    ATL        
## # ... with 345 more rows
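Steps 2 and 3 can also be combined, because select() accepts the same new_name = old_name syntax as rename(). This is a sketch of that alternative, not a replacement for the exercise; f1c_rename and f1c_select are hypothetical names.

```r
library(dplyr)
library(nycflights13)

f1a <- filter(flights, flight == 807)

# Two steps: select the variables, then rename one of them.
f1c_rename <- rename(select(f1a, flight, carrier, origin, dest),
                     destination = dest)

# One step: rename while selecting.
f1c_select <- select(f1a, flight, carrier, origin, destination = dest)

# Compare the two results.
all.equal(f1c_rename, f1c_select)
```

The difference is that rename() keeps every variable, while select() keeps only the ones named in the call.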

Step 4 Sort Results

Because f1c only contains records for flight number 807, sorting by flight number is irrelevant. Create a new dataset f1d that is identical in content to f1c but is sorted by “carrier” first, “origin” second, and “destination” last. Use the function head() to display the first 5 rows of the data in f1d.

f1d <- arrange(f1c, carrier, origin, destination)
head(f1d, 5)

## # A tibble: 5 x 4
##   flight carrier origin destination
##    <int> <chr>   <chr>  <chr>      
## 1    807 DL      EWR    ATL        
## 2    807 DL      EWR    ATL        
## 3    807 DL      EWR    ATL        
## 4    807 DL      EWR    ATL        
## 5    807 DL      EWR    ATL

At first glance, it seems that 807 uniquely represents the flight from EWR to ATL by carrier Delta. To confirm this, create a new dataset f1e that is f1d sorted by the three variables previously mentioned, all in descending order. The function desc() is required here. Follow this by displaying the first 18 rows of the data in f1e.

f1e <- arrange(f1d, desc(carrier), desc(origin), desc(destination))
head(f1e, 18)

## # A tibble: 18 x 4
##    flight carrier origin destination
##     <int> <chr>   <chr>  <chr>      
##  1    807 WN      EWR    MDW        
##  2    807 WN      EWR    MDW        
##  3    807 WN      EWR    MDW        
##  4    807 WN      EWR    MDW        
##  5    807 UA      LGA    IAH        
##  6    807 UA      LGA    IAH        
##  7    807 UA      LGA    IAH        
##  8    807 UA      LGA    IAH        
##  9    807 UA      EWR    MCO        
## 10    807 UA      EWR    MCO        
## 11    807 UA      EWR    LAS        
## 12    807 UA      EWR    IAH        
## 13    807 UA      EWR    IAH        
## 14    807 UA      EWR    IAH        
## 15    807 UA      EWR    BOS        
## 16    807 DL      JFK    PIT        
## 17    807 DL      EWR    ATL        
## 18    807 DL      EWR    ATL
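The intermediate datasets f1a through f1e make each step visible, but the same result can be built in one chain. The sketch below uses the %>% pipe that dplyr re-exports; the name f1e_piped is hypothetical.

```r
library(dplyr)
library(nycflights13)

# Each verb receives the result of the previous line as its first argument.
f1e_piped <- flights %>%
  filter(flight == 807) %>%                                 # Step 1
  select(flight, carrier, origin, dest) %>%                 # Step 2
  rename(destination = dest) %>%                            # Step 3
  arrange(desc(carrier), desc(origin), desc(destination))   # Step 4

head(f1e_piped, 18)
```

The ascending sort that produced f1d is left out here because the later descending sort reorders every row anyway.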

May we conclude that flight numbers can be mapped to a unique carrier? No (Yes/No)

Why or why not?

If we know the flight number, carrier, and origin, can we know for sure what the destination will be? No (Yes/No)
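Scanning sorted rows works for a small subset, but the answers above can also be checked by counting. The following sketch uses count() and n_distinct(), which are dplyr functions outside the four practiced in this tutorial; routes_807 and dest_per_origin are hypothetical names.

```r
library(dplyr)
library(nycflights13)

f1c <- flights %>%
  filter(flight == 807) %>%
  select(flight, carrier, origin, destination = dest)

# One row per carrier/origin/destination combination; n is the number
# of flights recorded for that combination.
routes_807 <- count(f1c, carrier, origin, destination)
routes_807

# How many different carriers used flight number 807?
n_distinct(f1c$carrier)

# How many different destinations does each carrier/origin pair serve?
# A value above 1 means the destination is not determined by flight
# number, carrier, and origin alone.
dest_per_origin <- count(routes_807, carrier, origin, name = "n_dest")
dest_per_origin
```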