The main purpose of this tutorial is to practice four key groups of functions from dplyr. The functions and their purposes are as follows:
filter() Selects Observations Based on Values
arrange() Sorts Observations Based on Criteria
select() or rename() Selects, Deselects, Renames, and Reorders Variables
mutate() or transmute() Creates New Variables That Did Not Previously Exist
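As a quick orientation before working with real data, the sketch below applies each of the listed functions once. The data frame trips and its columns (id, distance_km, hours) are hypothetical and exist only for this illustration; they are not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
trips <- data.frame(
  id = c(1, 2, 3),
  distance_km = c(120, 45, 300),
  hours = c(2, 1, 4)
)

# filter(): keep only the rows that satisfy a condition.
long_trips <- filter(trips, distance_km > 100)

# arrange(): sort rows; desc() reverses the order.
by_distance <- arrange(trips, desc(distance_km))

# select(): keep only the named columns, in the order given.
subset_cols <- select(trips, id, distance_km)

# rename(): change a column name using new_name = old_name.
renamed <- rename(trips, trip_id = id)

# mutate(): add a new column and keep all existing ones.
with_speed <- mutate(trips, speed_kmh = distance_km / hours)

# transmute(): keep only the columns named in the call.
speed_only <- transmute(trips, id, speed_kmh = distance_km / hours)
```

Note the difference between the last two calls: mutate() returns the original columns plus speed_kmh, while transmute() returns only id and speed_kmh.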
We will practice these skills on the flights dataset, which becomes available after loading the R package nycflights13.
Using filter(), start by creating a new tibble called f1a that only contains records from flight number 807.
library(dplyr)
library(nycflights13)
f1a <- filter(flights, flight == 807)
f1a
## # A tibble: 355 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 2 744 730 14 1017 1007
## 2 2013 1 3 729 730 -1 1000 1004
## 3 2013 1 4 728 730 -2 946 1004
## 4 2013 1 5 727 730 -3 941 1004
## 5 2013 1 7 901 730 91 1118 1004
## 6 2013 1 8 729 730 -1 1004 1004
## 7 2013 1 9 727 730 -3 1015 1004
## 8 2013 1 10 727 730 -3 954 1004
## 9 2013 1 11 726 730 -4 954 1004
## 10 2013 1 12 728 730 -2 933 1004
## # ... with 345 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Now, apply select() to create a new dataset f1b from f1a that contains only the variables “flight”, “carrier”, “origin”, and “dest”.
f1b <- select(f1a, flight, carrier, origin, dest)
f1b
## # A tibble: 355 x 4
## flight carrier origin dest
## <int> <chr> <chr> <chr>
## 1 807 DL EWR ATL
## 2 807 DL EWR ATL
## 3 807 DL EWR ATL
## 4 807 DL EWR ATL
## 5 807 DL EWR ATL
## 6 807 DL EWR ATL
## 7 807 DL EWR ATL
## 8 807 DL EWR ATL
## 9 807 DL EWR ATL
## 10 807 DL EWR ATL
## # ... with 345 more rows
Using rename(), create a new dataset f1c from f1b in which the variable “dest” is renamed to “destination”.
f1c <- rename(f1b, destination = dest)
f1c
## # A tibble: 355 x 4
## flight carrier origin destination
## <int> <chr> <chr> <chr>
## 1 807 DL EWR ATL
## 2 807 DL EWR ATL
## 3 807 DL EWR ATL
## 4 807 DL EWR ATL
## 5 807 DL EWR ATL
## 6 807 DL EWR ATL
## 7 807 DL EWR ATL
## 8 807 DL EWR ATL
## 9 807 DL EWR ATL
## 10 807 DL EWR ATL
## # ... with 345 more rows
Because f1c only contains records for flight number 807, sorting by flight number is irrelevant. Create a new dataset f1d that is identical in content to f1c but is sorted by “carrier” first, “origin” second, and “destination” last. Use the function head() to display the first 5 rows of the data in f1d.
f1d <- arrange(f1c, carrier, origin, destination)
head(f1d, 5)
## # A tibble: 5 x 4
## flight carrier origin destination
## <int> <chr> <chr> <chr>
## 1 807 DL EWR ATL
## 2 807 DL EWR ATL
## 3 807 DL EWR ATL
## 4 807 DL EWR ATL
## 5 807 DL EWR ATL
At first glance, it seems that 807 uniquely represents the flight from EWR to ATL by carrier Delta. To check this, create a new dataset f1e that is f1d sorted by the same three variables, all in descending order. The function desc() is required here. Then display the first 18 rows of the data in f1e.
f1e <- arrange(f1d, desc(carrier), desc(origin), desc(destination))
head(f1e, 18)
## # A tibble: 18 x 4
## flight carrier origin destination
## <int> <chr> <chr> <chr>
## 1 807 WN EWR MDW
## 2 807 WN EWR MDW
## 3 807 WN EWR MDW
## 4 807 WN EWR MDW
## 5 807 UA LGA IAH
## 6 807 UA LGA IAH
## 7 807 UA LGA IAH
## 8 807 UA LGA IAH
## 9 807 UA EWR MCO
## 10 807 UA EWR MCO
## 11 807 UA EWR LAS
## 12 807 UA EWR IAH
## 13 807 UA EWR IAH
## 14 807 UA EWR IAH
## 15 807 UA EWR BOS
## 16 807 DL JFK PIT
## 17 807 DL EWR ATL
## 18 807 DL EWR ATL
May we conclude that flight numbers can be mapped to a unique carrier? No (Yes/No)
Why or why not?
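One way to check this answer directly, rather than by scanning sorted rows, is to count the records for flight number 807 by carrier. This is only a sketch: the object name carriers_807 and the column name n_flights are chosen here for illustration.

```r
library(dplyr)
library(nycflights13)

# Count how many records each carrier has under flight number 807.
carriers_807 <- flights %>%
  filter(flight == 807) %>%
  count(carrier, name = "n_flights")

carriers_807
```

If the result has more than one row, the flight number is shared by several carriers and cannot be mapped to a single one.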
If we know the flight number, carrier, and origin, can we know for sure what the destination will be? No (Yes/No)
Why or why not?
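The same kind of check works for this question. The sketch below reduces the flight 807 records to the distinct carrier, origin, and destination combinations, then counts the destinations for each carrier and origin pair. The names dest_per_route and n_destinations are illustrative assumptions.

```r
library(dplyr)
library(nycflights13)

# Keep one row per carrier/origin/destination combination for flight 807,
# then count the distinct destinations for each carrier and origin.
dest_per_route <- flights %>%
  filter(flight == 807) %>%
  distinct(carrier, origin, dest) %>%
  count(carrier, origin, name = "n_destinations") %>%
  arrange(desc(n_destinations))

dest_per_route
```

Any row with n_destinations greater than 1 is a carrier and origin pair for which flight number 807 served more than one destination, so the destination is not determined by those three variables alone.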