The main purpose of this tutorial is to put together 4 key functions from dplyr. The functions and their purposes are listed as follows:
filter() Selects Observations Based on
Values
arrange() Sorts Observations Based on
Criteria
select() or rename() Selects,
Deselects, Renames, and Reorders Variables
mutate() or transmute() Creates New
Variables Which Were Originally Nonexistant
You will need to modify the code chunks so that the code works within
each of chunk (usually this means modifying anything in ALL CAPS). You
will also need to modify the code outside the code chunk. When you get
the desired result for each step, change Eval=F to
Eval=T and knit the document to HTML to make sure it works.
Do not work out of order or skip around. After you complete the lab or
by the end of class, you should submit your HTML file to Sakai.
Using filter(), start by creating a new tibble called
f1a that only contains records from flight number
807.
f1a<-filter(flights, flight==807)
Now, apply select() to create a new dataset
f1b based on f1a only containing
variables “flight”, “carrier”, “origin”, and “dest”.
f1b<-dplyr::select(f1a, flight, carrier, origin, dest)
Rename the variable “dest” to “destination” in f1c.
f1c=dplyr::rename(f1b, destination = dest)
Because f1c only contains records for flight number
807, sorting by flight number is irrelevant. Create a new dataset
f1d that is identical in content to
f1c but is sorted by “carrier” in descending order. Use
the function head()to display the first 5 rows of the data
in f1d.
f1d<-arrange(f1c, desc(carrier))
head(f1d,5)
## # A tibble: 5 × 4
## flight carrier origin destination
## <int> <chr> <chr> <chr>
## 1 807 WN EWR MDW
## 2 807 WN EWR MDW
## 3 807 WN EWR MDW
## 4 807 WN EWR MDW
## 5 807 UA EWR IAH
Answer the following questions using the output from the previous question:
May we conclude that flight numbers can be mapped to a unique carrier? (Yes/No) No
Why or why not? From the data, we can see that there are two types of carriers present while the flight number remains the same. We see that WN (Southwest Airlines) and UA (United Airlines) both had flights in which the flight number was 807. As a result, we can conclude that flight numbers are not unique to a single carrier.
In the previous section, we created a new tibble every time we wanted
to perform a modification to the data. Behold the pipe
%>%, your path to RAM-saving, sleek code. The pipe
%>% is used in a similar fashion to + in
the ggplot2() package.
Start by using transmute() to create a new variable
“dep_hr” based on “dep_time” which represents departure time in hours
since midnight. Recall the code for this is
dep_hr=dep_time%/%100+(dep_time%%100)/60. In the same step,
do this also for “sched_dep_time”,“arr_time”, and “sched_arr_time”
naming the new variables “sched_dep_hr”, “arr_hr”, and “sched_arr_hr”,
respectively. Save all these new variables to a new tibble called
f2a. Use the function names() to ensure
that f2a only contains the new variables and the
function head() to view the top 5 rows.
f2a=transmute(flights,
dep_hr=dep_time%/%100+(dep_time%%100)/60,
sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
arr_hr=arr_time%/%100+(arr_time%%100)/60,
sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60)
names(f2a)
## [1] "dep_hr" "sched_dep_hr" "arr_hr" "sched_arr_hr"
head(f2a,5)
## # A tibble: 5 × 4
## dep_hr sched_dep_hr arr_hr sched_arr_hr
## <dbl> <dbl> <dbl> <dbl>
## 1 5.28 5.25 8.5 8.32
## 2 5.55 5.48 8.83 8.5
## 3 5.7 5.67 9.38 8.83
## 4 5.73 5.75 10.1 10.4
## 5 5.9 6 8.2 8.62
Now we can create true delay variables, measured in hours, for both
departure and arrival. Using mutate(), create a new
variable “dep_delay_hr” which equals the difference between “dep_hr” and
“sched_dep_hr”. Analogously, perform the same operation for arrival.
Call the new dataset f2b.
f2b=mutate(f2a,
dep_delay_hr = dep_hr - sched_dep_hr,
arr_delay_hr = arr_hr - sched_arr_hr)
Next, use mutate() again to create a new variable called
“gain_hr” which is the difference between “arr_delay_hr” and
“dep_delay_hr”, measured in hours. Within the same step, use the
function percent_rank() to create a new variable
“percent_gain_hr” which represents the percentiles of the previously
variable you created. Notice that you can develop variables based on
recently transformed variables in the same iteration of
mutate(), but be careful because order matters. Name the
new dataset f2c.
f2c=mutate(f2b,
gain_hr = arr_delay_hr - dep_delay_hr,
percent_gain_hr=percent_rank(gain_hr))
Finally, sort the data using arrange() from largest to
smallest based on the variable “percent_gain_hr”. Name the sorted tibble
f2d. To conclude, display the first five rows of this
tibble.
f2d<-arrange(f2c, desc(percent_gain_hr))
head(f2d, 5)
## # A tibble: 5 × 8
## dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_del…¹ gain_hr perce…²
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0833 24.0 7.97 4.63 -23.9 3.33 27.2 1
## 2 0.333 24.0 7.38 3.83 -23.7 3.55 27.2 1.00
## 3 0.183 24.0 6.78 3.83 -23.8 2.95 26.8 1.00
## 4 0.4 24.0 6.85 3.73 -23.6 3.12 26.7 1.00
## 5 0.217 24.0 6.32 3.83 -23.8 2.48 26.3 1.00
## # … with abbreviated variable names ¹arr_delay_hr, ²percent_gain_hr
Getting the original data from flights to
f2d required multiple steps. If we know what we want to
do with the raw data from flights, we can use the pipe
%>% to obtain the same result without intermittently
introducing new tibbles into our global environment. Modify the code
below so that the tibble named f2d.pipedream is
identical to f2d. Display the first five rows.
f2d.pipedream = flights %>%
transmute(flights,
dep_hr=dep_time%/%100+(dep_time%%100)/60,
sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
arr_hr=arr_time%/%100+(arr_time%%100)/60,
sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60) %>%
mutate(flights,
dep_delay_hr = dep_hr - sched_dep_hr,
arr_delay_hr = arr_hr - sched_arr_hr) %>%
mutate(flights,
gain_hr = arr_delay_hr - dep_delay_hr,
percent_gain_hr=percent_rank(gain_hr))
f2d.pipedream<-dplyr::select(f2d.pipedream,dep_hr,sched_dep_hr,arr_hr,sched_arr_hr,dep_delay_hr,arr_delay_hr,gain_hr,percent_gain_hr)
f2d.pipedream<-arrange(f2d.pipedream, desc(percent_gain_hr))
head(f2d.pipedream, 5)
## # A tibble: 5 × 8
## dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_del…¹ gain_hr perce…²
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0833 24.0 7.97 4.63 -23.9 3.33 27.2 1
## 2 0.333 24.0 7.38 3.83 -23.7 3.55 27.2 1.00
## 3 0.183 24.0 6.78 3.83 -23.8 2.95 26.8 1.00
## 4 0.4 24.0 6.85 3.73 -23.6 3.12 26.7 1.00
## 5 0.217 24.0 6.32 3.83 -23.8 2.48 26.3 1.00
## # … with abbreviated variable names ¹arr_delay_hr, ²percent_gain_hr