1 Lab 1a: Basic Data Transformation

1.1 Introduction

The main purpose of this tutorial is to put together 4 key functions from dplyr. The functions and their purposes are listed as follows:

  • filter() Selects Observations Based on Values

  • arrange() Sorts Observations Based on Criteria

  • select() or rename() Selects, Deselects, Renames, and Reorders Variables

  • mutate() or transmute() Creates New Variables Which Were Originally Nonexistant

You will need to modify the code chunks so that the code works within each of chunk (usually this means modifying anything in ALL CAPS). You will also need to modify the code outside the code chunk. When you get the desired result for each step, change Eval=F to Eval=T and knit the document to HTML to make sure it works. Do not work out of order or skip around. After you complete the lab or by the end of class, you should submit your HTML file to Sakai.

1.2 Part 1: The Uniqueness of Flight Numbers

1.2.1 (1 Points)

Using filter(), start by creating a new tibble called f1a that only contains records from flight number 807.

f1a<-filter(flights, flight==807)

1.2.2 (1 Points)

Now, apply select() to create a new dataset f1b based on f1a only containing variables “flight”, “carrier”, “origin”, and “dest”.

f1b<-dplyr::select(f1a, flight, carrier, origin, dest)

1.2.3 (1 Points)

Rename the variable “dest” to “destination” in f1c.

f1c=dplyr::rename(f1b, destination = dest)

1.2.4 (1 Points)

Because f1c only contains records for flight number 807, sorting by flight number is irrelevant. Create a new dataset f1d that is identical in content to f1c but is sorted by “carrier” in descending order. Use the function head()to display the first 5 rows of the data in f1d.

f1d<-arrange(f1c, desc(carrier))
head(f1d,5)
## # A tibble: 5 × 4
##   flight carrier origin destination
##    <int> <chr>   <chr>  <chr>      
## 1    807 WN      EWR    MDW        
## 2    807 WN      EWR    MDW        
## 3    807 WN      EWR    MDW        
## 4    807 WN      EWR    MDW        
## 5    807 UA      EWR    IAH

1.2.5 (1 Points)

Answer the following questions using the output from the previous question:

May we conclude that flight numbers can be mapped to a unique carrier? (Yes/No) No

Why or why not? From the data, we can see that there are two types of carriers present while the flight number remains the same. We see that WN (Southwest Airlines) and UA (United Airlines) both had flights in which the flight number was 807. As a result, we can conclude that flight numbers are not unique to a single carrier.

1.3 Piping

In the previous section, we created a new tibble every time we wanted to perform a modification to the data. Behold the pipe %>%, your path to RAM-saving, sleek code. The pipe %>% is used in a similar fashion to + in the ggplot2() package.

1.3.1 (1 Points)

Start by using transmute() to create a new variable “dep_hr” based on “dep_time” which represents departure time in hours since midnight. Recall the code for this is dep_hr=dep_time%/%100+(dep_time%%100)/60. In the same step, do this also for “sched_dep_time”,“arr_time”, and “sched_arr_time” naming the new variables “sched_dep_hr”, “arr_hr”, and “sched_arr_hr”, respectively. Save all these new variables to a new tibble called f2a. Use the function names() to ensure that f2a only contains the new variables and the function head() to view the top 5 rows.

f2a=transmute(flights,
          dep_hr=dep_time%/%100+(dep_time%%100)/60,
          sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
          arr_hr=arr_time%/%100+(arr_time%%100)/60,
          sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60)
names(f2a)
## [1] "dep_hr"       "sched_dep_hr" "arr_hr"       "sched_arr_hr"
head(f2a,5)
## # A tibble: 5 × 4
##   dep_hr sched_dep_hr arr_hr sched_arr_hr
##    <dbl>        <dbl>  <dbl>        <dbl>
## 1   5.28         5.25   8.5          8.32
## 2   5.55         5.48   8.83         8.5 
## 3   5.7          5.67   9.38         8.83
## 4   5.73         5.75  10.1         10.4 
## 5   5.9          6      8.2          8.62

1.3.2 (1 Points)

Now we can create true delay variables, measured in hours, for both departure and arrival. Using mutate(), create a new variable “dep_delay_hr” which equals the difference between “dep_hr” and “sched_dep_hr”. Analogously, perform the same operation for arrival. Call the new dataset f2b.

f2b=mutate(f2a,
           dep_delay_hr = dep_hr - sched_dep_hr,
           arr_delay_hr = arr_hr - sched_arr_hr)

1.3.3 (1 Points)

Next, use mutate() again to create a new variable called “gain_hr” which is the difference between “arr_delay_hr” and “dep_delay_hr”, measured in hours. Within the same step, use the function percent_rank() to create a new variable “percent_gain_hr” which represents the percentiles of the previously variable you created. Notice that you can develop variables based on recently transformed variables in the same iteration of mutate(), but be careful because order matters. Name the new dataset f2c.

f2c=mutate(f2b,
           gain_hr = arr_delay_hr - dep_delay_hr,
           percent_gain_hr=percent_rank(gain_hr))

1.3.4 (1 Points)

Finally, sort the data using arrange() from largest to smallest based on the variable “percent_gain_hr”. Name the sorted tibble f2d. To conclude, display the first five rows of this tibble.

f2d<-arrange(f2c, desc(percent_gain_hr))
head(f2d, 5)
## # A tibble: 5 × 8
##   dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_del…¹ gain_hr perce…²
##    <dbl>        <dbl>  <dbl>        <dbl>        <dbl>     <dbl>   <dbl>   <dbl>
## 1 0.0833         24.0   7.97         4.63        -23.9      3.33    27.2    1   
## 2 0.333          24.0   7.38         3.83        -23.7      3.55    27.2    1.00
## 3 0.183          24.0   6.78         3.83        -23.8      2.95    26.8    1.00
## 4 0.4            24.0   6.85         3.73        -23.6      3.12    26.7    1.00
## 5 0.217          24.0   6.32         3.83        -23.8      2.48    26.3    1.00
## # … with abbreviated variable names ¹​arr_delay_hr, ²​percent_gain_hr

1.3.5 (1 Points)

Getting the original data from flights to f2d required multiple steps. If we know what we want to do with the raw data from flights, we can use the pipe %>% to obtain the same result without intermittently introducing new tibbles into our global environment. Modify the code below so that the tibble named f2d.pipedream is identical to f2d. Display the first five rows.

f2d.pipedream = flights %>%
                transmute(flights,
          dep_hr=dep_time%/%100+(dep_time%%100)/60,
          sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
          arr_hr=arr_time%/%100+(arr_time%%100)/60,
          sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60) %>%
                mutate(flights,
           dep_delay_hr = dep_hr - sched_dep_hr,
           arr_delay_hr = arr_hr - sched_arr_hr) %>%
                mutate(flights,
           gain_hr = arr_delay_hr - dep_delay_hr,
           percent_gain_hr=percent_rank(gain_hr)) 
      

f2d.pipedream<-dplyr::select(f2d.pipedream,dep_hr,sched_dep_hr,arr_hr,sched_arr_hr,dep_delay_hr,arr_delay_hr,gain_hr,percent_gain_hr)

f2d.pipedream<-arrange(f2d.pipedream, desc(percent_gain_hr))


head(f2d.pipedream, 5)
## # A tibble: 5 × 8
##   dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_del…¹ gain_hr perce…²
##    <dbl>        <dbl>  <dbl>        <dbl>        <dbl>     <dbl>   <dbl>   <dbl>
## 1 0.0833         24.0   7.97         4.63        -23.9      3.33    27.2    1   
## 2 0.333          24.0   7.38         3.83        -23.7      3.55    27.2    1.00
## 3 0.183          24.0   6.78         3.83        -23.8      2.95    26.8    1.00
## 4 0.4            24.0   6.85         3.73        -23.6      3.12    26.7    1.00
## 5 0.217          24.0   6.32         3.83        -23.8      2.48    26.3    1.00
## # … with abbreviated variable names ¹​arr_delay_hr, ²​percent_gain_hr