Introduction to Data Transformation

Now we will move from data visualization to data transformation. In many situations, we will find that we don’t have the data in exactly the right form for what we need in the original data set. In that case we have to refer to data transformation.

Example

As an example, let’s go to the mpg data set. All the graphs we have created so far handles the whole data set. But what if we want to do a task that is as simple as:

For all the SUVs, plot the hwy against manufacturer.

We find that we know how to do the plotting part, but we would have to only plot for the SUV data alone - a part of the original data set.

To do that, we need to ``filter’’ the data by only picking up SUV samples. This operation belongs to data transformation.

Introduction to `dplyr` package

In this module, we will be using the dplyr package to perform data transformation in R. We will use a new data set on flights departing New York City in 2013. You must install the package nycflights13 to make it available.

install.packages("nycflights13")

library(tidyverse)
library(nycflights13)

`nycflights13` data set

To explore the basic data manipulation verbs of dplyr, we’ll use nycflights13::flights. This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.

glimpse(flights)

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Please check the meaning of each variable from the help document.

Variable types

To summarize the data types in R so far:

int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time).
lgl stands for logical, vectors that contain only TRUE or FALSE.
fct stands for factors, which R uses to represent categorical variables with fixed possible values.
date stands for dates.

dplyr basics

Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
These can all be used in conjunction with group_by()

Filter rows with `filter()`

filter() allows you to subset observations based on their values. For example, to pick up the flights on January 1st, we may do the following:

filter(flights, month == 1, day == 1)

## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 832 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Use the assignment operator, `<-`:

To use that filtered data, we need to assign it to a new name with the assignment <-. (Later we will learn to use a pipe to simplify this process.)

jan1 <- filter(flights, month == 1, day == 1)

jan1

## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 832 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Template for using `fliter()`

fliter(<DATA_NAME>, ...<LOGICAL CONDITIONS>...)

Everything after data_name should be an expression returning some logical value. When multiple conditions are given, then rows that satisfy all conditions (every condition is TRUE) will be retained.

Comparisons

fliter() function supports all logical comparisons to fliter our data using the following operators:

>, >=, <, <=, != (not equal to), and == (equal to).

Error message when use `=`

filter(flights, month = 1)

## Error in `filter()`:
## ! We detected a named input.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `month == 1`?

Caution with `==` when dealing with float:

When handling float numbers, it is usually not a good idea to use == since there is machine error for floats and even mathmatically true equations may turn out to be FALSE.

sqrt(2) ^ 2 == 2

## [1] FALSE

1/49 * 49 == 1

## [1] FALSE

Instead, one should use near() instead to check equations about float numbers.

near(sqrt(2) ^ 2,  2)

## [1] TRUE

near(1 / 49 * 49, 1)

## [1] TRUE

Lab Exercise

Find all flights that travel a distance of less than 200 miles. How many of such flights are there in the data set?

Logical operators

Use Boolean operators with filter() : & is “and”, | is “or”, and ! is “not”.

Complete set of boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded region show which parts each operator selects.

Finds all flights that departed in November or December:

filter(flights, month == 11 | month == 12)

## # A tibble: 55,403 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11     1        5           2359         6      352            345
##  2  2013    11     1       35           2250       105      123           2356
##  3  2013    11     1      455            500        -5      641            651
##  4  2013    11     1      539            545        -6      856            827
##  5  2013    11     1      542            545        -3      831            855
##  6  2013    11     1      549            600       -11      912            923
##  7  2013    11     1      550            600       -10      705            659
##  8  2013    11     1      554            600        -6      659            701
##  9  2013    11     1      554            600        -6      826            827
## 10  2013    11     1      554            600        -6      749            751
## # ℹ 55,393 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Note that the following doesn’t work as you expected:

filter(flights, month == (11 | 12))

This is because (11 | 12) is equivalent to evaluate TRUE|TRUE (any non zero value is considered TRUE in logical expressions) which returns TRUE. Then TRUE is interpreted as 1 to match the numeric variable month. So this finds all flights in January instead of November and December!

Use `x %in% y`

%in% is equivalent to the in keyword in Python. It can conveniently replace | when there are many options. The following code finds flights that depart in May, July or November.

nov_dec <- filter(flights, month %in% c(5, 7, 11))

R also has `&&` and `||`, but don’t use them here in `filter()`!

& and | are vectorized operators:

c(T, T, F) & c(T, F, F)

## [1]  TRUE FALSE FALSE

So the result is a vector with values of T&T, T&F, F&F.

On the other hand, && and || are not vectorized:

c(T, T, F) && c(T, F, F)

## [1] TRUE

Here c(T, T, F) and c(T, F, F) are both treated as a single logical value TRUE since their first values are not zero.

Use `between()` helper function for (X >= a) & (X <= b)

For filtering a value between two values, for example, depature time between 0:00am and 6:00am, we can use the between function:

    filter(flights, between(dep_time, 0000, 0600))

## # A tibble: 9,344 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 9,334 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Lab Exercise

a) Find all flights that depart from `JFK` and land on `ORD` or `CVG` (Chicago or Dallas).

b) Find all flights that flied a distance between 200 and 500 miles.

c) Plot the histogram of departure time for filtered data either in a) or b).

Missing values `NA`

Missing values are represented as NAs (“not available”) in R. NA by itself is a constant in R, and can be of any type depending on the type of other available data.

NA are “contagious” in mathematical operations:

NA > 5

## [1] NA

10 == NA

## [1] NA

NA + 10

## [1] NA

NA / 2

## [1] NA

Why is the following expression `FALSE`?

NA == NA

## [1] NA

To understand with example

# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y

## [1] NA

# We don't know!

Use `is.na()` to filter out `NA` in your data set

If you want to determine if a value is missing, use is.na():

is.na(x)

## [1] TRUE

is.na() is also vectorized:

is.na(c(1,2,3,NA))

## [1] FALSE FALSE FALSE  TRUE

Note: `filter()` excludes both `FALSE` and `NA` values.

filter(flights, is.na(dep_time)) # Only keep flights with `NA` departure time

## # A tibble: 8,255 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1       NA           1630        NA       NA           1815
##  2  2013     1     1       NA           1935        NA       NA           2240
##  3  2013     1     1       NA           1500        NA       NA           1825
##  4  2013     1     1       NA            600        NA       NA            901
##  5  2013     1     2       NA           1540        NA       NA           1747
##  6  2013     1     2       NA           1620        NA       NA           1746
##  7  2013     1     2       NA           1355        NA       NA           1459
##  8  2013     1     2       NA           1420        NA       NA           1644
##  9  2013     1     2       NA           1321        NA       NA           1536
## 10  2013     1     2       NA           1545        NA       NA           1910
## # ℹ 8,245 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

filter(flights, !is.na(dep_time)) # filter out all flights with `NA` departure time

## # A tibble: 328,521 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 328,511 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

filter(flights, dep_time > 1900) # filter out departure time before 1900 or of `NA` value

## # A tibble: 54,702 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     1904           1905        -1     2139           2227
##  2  2013     1     1     1904           1905        -1     2157           2208
##  3  2013     1     1     1905           1900         5     2311           2301
##  4  2013     1     1     1906           1915        -9     2211           2244
##  5  2013     1     1     1909           1912        -3     2239           2237
##  6  2013     1     1     1909           1910        -1     2212           2224
##  7  2013     1     1     1910           1909         1     2126           2046
##  8  2013     1     1     1910           1910         0     2126           2107
##  9  2013     1     1     1910           1855        15     2118           2103
## 10  2013     1     1     1911           1910         1     2050           2055
## # ℹ 54,692 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The first result shows that there are missing values in actual departure time, and the second result shows that when we do filtering, NA values are dropped.

Lab Exercise

How many flights have a missing plane tail number? What is the percentage of flights with a missing plane tail number?

Data analysis example

Let’s exercise what we just learned combining data transformation and data visualization. One routine practice is to remove samples with NA values before making the graph.

For example, if we want to create a histogram of dep_time, we can directly make the graph and ggplot will smartly throw away the NA values. A warning message would pop out to hint us with this fact.

ggplot(flights) + geom_histogram(aes(dep_time))

But sometimes we hope to know information about those flights with NA values. For example, NA in dep_time means that the flight has been canceled (therefore no departure time). If we hope to know which airline company canceled the most flights in our data set, we can do the following:

canceled_flight <- filter(flights, is.na(dep_time))

ggplot(canceled_flight) + 
  geom_bar(aes(carrier)) +
  labs(title = "Canceled Flights by Carrier", 
       x = "Carrier (in abbreviation)", 
       y = "Counts") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

So we see that airline `EV` canceled the most flights in our data set. After checking `airlines` commands we see that it is ExpressJet Airlines.

Lab Homework (Required)

Use data transformation and data visualization, answer that For the airline EV in the data set flights, during which month were most flights canceled? Submit your code, graph and answer.

Select columns with `select()`

The second function to study here is the select() function, which select variables from the original data set and form a new one.

There are a few ways to use select(), as shown below.

# Select columns by name
select(flights, year, month, day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ℹ 336,766 more rows

# Select all columns between year and day (inclusive)
select(flights, year:day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ℹ 336,766 more rows

# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))

## # A tibble: 336,776 × 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # ℹ 336,766 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# Select by col number
select(flights, 1:3)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ℹ 336,766 more rows

Similarly to filter(), if you want to save the data after selecting, we must either name it or use the pipe (discussed later).

new_data <- select(flights, year, month, day)

Helper functions `starts_with(), ends_with(), contains()`

There are a number of helper functions you can use within select():

* `starts_with("abc")`: matches names that begin with "abc".
* `ends_with("xyz")`: matches names that end with "xyz".
* `contains("ijk")`: matches names that contain "ijk".
* `matches("(.)\\1")`: selects variables that match a regular expression.
*  `num_range("x", 1:3)` matches `x1`, `x2` and `x3`.
   
* See `?select` for more details.

# Select columns of year, month, day and those starting with "dep" or "arr"
dep_data = select(flights, year, month, day, starts_with(c("dep", "arr")))
glimpse(dep_data)

## Rows: 336,776
## Columns: 7
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, …
## $ arr_time  <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31…

This may not look useful for a data set with only dozens of columns. But when you have hundreds of columns or more, this can be very helpful.

`rename()` to rename a column

Sometimes we hope to rename a column for convenience. We can do the following:

rename(flights, sch_dep = sched_dep_time) # new_name = old_name

## # A tibble: 336,776 × 19
##     year month   day dep_time sch_dep dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>   <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517     515         2      830            819
##  2  2013     1     1      533     529         4      850            830
##  3  2013     1     1      542     540         2      923            850
##  4  2013     1     1      544     545        -1     1004           1022
##  5  2013     1     1      554     600        -6      812            837
##  6  2013     1     1      554     558        -4      740            728
##  7  2013     1     1      555     600        -5      913            854
##  8  2013     1     1      557     600        -3      709            723
##  9  2013     1     1      557     600        -3      838            846
## 10  2013     1     1      558     600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Use `everything()` to reorder the columns

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

For example, if we want to move time_hour and air_time to the first two columns and keep all others. We can do

select(flights, time_hour, air_time, everything())

## # A tibble: 336,776 × 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # ℹ 336,766 more rows
## # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>

3. Arrange rows with `arrange()`

Next, we will study the arrange() function which changes row order by sorting the values in one or more given columns.

For example, the original data set does not completely follow the order of time since month 10, 11, 12 goes right after month 1.

Lab Exercise: Find a way to verify that flights data do not strictly follow the order of time and month 10, 11, 12 goes after month 1.

To resolve this, we can rearrange the rows by month and day.

flights_md <- arrange(flights, month, day)

Similar to all previous functions, the first argument is the name of data set, followed by column names.

Arrange from the largest to the smallest values - `desc()`

By default, the arrange() function sorting out values from the smallest to the largest. To reverse it, we can simply put an desc (representing “descending”) function on the column name.

for example, we hope to see flights with the worst delay in departure, we can do the following:

arrange(flights, desc(dep_delay))

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We see that the worst one delayed by 1301 minutes, which is over 21 hours!

arrange() function can be very useful when we hope to quickly check the minimum and maximum values of any column.

Lab Exercise

1. What was the longest travel distance for any flight in our data set? What was the origin and the destination?

2. What was the shortest air time for any flight (that actually finished the trip) in our data set? What was the origin and the destination?

Note: missing values are always sorted at the end:

(df <- tibble(x = c(5, 2, NA)))

## # A tibble: 3 × 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA

arrange(df, x)

## # A tibble: 3 × 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3    NA

arrange(df, desc(x))

## # A tibble: 3 × 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA

show sorted columns with `select()` and `arrange()`

When there are too many columns, we may hope to make the arranged column the first one. For example, if we arrange the data set by air_time, then we can do the following:

select(arrange(flights, air_time), flight, month, day, year, air_time)

## # A tibble: 336,776 × 5
##    flight month   day  year air_time
##     <int> <int> <int> <int>    <dbl>
##  1   4368     1    16  2013       20
##  2   4631     4    13  2013       20
##  3   4276    12     6  2013       21
##  4   4619     2     3  2013       21
##  5   4368     2     5  2013       21
##  6   4619     2    12  2013       21
##  7   2132     3     2  2013       21
##  8   3650     3     8  2013       21
##  9   4118     3    18  2013       21
## 10   4276     3    19  2013       21
## # ℹ 336,766 more rows

Data analysis Example

Let’s do another analysis example. Let’s try to analyze what is the prominent factor that causes very long delays. First, let’s separate our data set into two parts - those with long delay (more than 2 hours) and all other flights.

long_delay <- filter(flights, dep_delay >= 120 | arr_delay >= 120)
others <- filter(flights, dep_delay < 120 & arr_delay < 120)

So we have 11606 observations in long_delay. That’s about 3.4% of the whole data set.

nrow(long_delay)/nrow(flights)

## [1] 0.03446208

Let’s see whether carrier is a big factor here:

ggplot(long_delay) + 
  geom_bar(aes(carrier, y = after_stat(count/sum(count)))) + 
  labs(title = "Long Delay Flights", x = "Carrier", y = "Relative Frequency") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))  

ggplot(others) + 
  geom_bar(aes(carrier, y = after_stat(count/sum(count)))) +
  labs(title = "Long Delay Flights", x = "Carrier", y = "Relative Frequency") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

We can see that the distributions of carriers are quite different between the two graphs. There are some carriers with significantly higher proportion of flights with long delays - AA, EV, UA, US etc.

Lab Homework (Required)

Use your knowledge of data transformation and data visualization, answer the following questions with a graph or result. Submit your code and graph/result along with your answer to the question.

In the mpg data set, which manufacturer produced the most fuel economic SUVs?
In the mpg data set, which SUV manufacturer improved fuel economy most between 1999 and 2008?
In the flights data set, pick up another variable other than carrier and analyze whether that variable correlates with long-delay flights or not. .

Data Transformation Part One

Miao Yu

2024-01-31

Introduction to Data Transformation

Example

Introduction to dplyr package

nycflights13 data set

Variable types

dplyr basics

Filter rows with filter()

Use the assignment operator, <-:

Template for using fliter()

Comparisons

Error message when use =

Caution with == when dealing with float:

Lab Exercise

Find all flights that travel a distance of less than 200 miles. How many of such flights are there in the data set?

Logical operators

Finds all flights that departed in November or December:

Note that the following doesn’t work as you expected:

Use x %in% y

R also has && and ||, but don’t use them here in filter()!

Use between() helper function for (X >= a) & (X <= b)

Lab Exercise

a) Find all flights that depart from JFK and land on ORD or CVG (Chicago or Dallas).

b) Find all flights that flied a distance between 200 and 500 miles.

c) Plot the histogram of departure time for filtered data either in a) or b).

Missing values NA

Why is the following expression FALSE?

To understand with example

Use is.na() to filter out NA in your data set

Note: filter() excludes both FALSE and NA values.

Lab Exercise

How many flights have a missing plane tail number? What is the percentage of flights with a missing plane tail number?

Data analysis example

So we see that airline EV canceled the most flights in our data set. After checking airlines commands we see that it is ExpressJet Airlines.

Lab Homework (Required)

Select columns with select()

Helper functions starts_with(), ends_with(), contains()

rename() to rename a column

Use everything() to reorder the columns

3. Arrange rows with arrange()

Arrange from the largest to the smallest values - desc()

Lab Exercise

1. What was the longest travel distance for any flight in our data set? What was the origin and the destination?

2. What was the shortest air time for any flight (that actually finished the trip) in our data set? What was the origin and the destination?

Note: missing values are always sorted at the end:

show sorted columns with select() and arrange()

Data analysis Example

Lab Homework (Required)

Introduction to `dplyr` package

`nycflights13` data set

Filter rows with `filter()`

Use the assignment operator, `<-`:

Template for using `fliter()`

Error message when use `=`

Caution with `==` when dealing with float:

Use `x %in% y`

R also has `&&` and `||`, but don’t use them here in `filter()`!

Use `between()` helper function for (X >= a) & (X <= b)

a) Find all flights that depart from `JFK` and land on `ORD` or `CVG` (Chicago or Dallas).

Missing values `NA`

Why is the following expression `FALSE`?

Use `is.na()` to filter out `NA` in your data set

Note: `filter()` excludes both `FALSE` and `NA` values.

So we see that airline `EV` canceled the most flights in our data set. After checking `airlines` commands we see that it is ExpressJet Airlines.

Select columns with `select()`

Helper functions `starts_with(), ends_with(), contains()`

`rename()` to rename a column

Use `everything()` to reorder the columns

3. Arrange rows with `arrange()`

Arrange from the largest to the smallest values - `desc()`

show sorted columns with `select()` and `arrange()`