3장 데이터 변형

3.1 들어가기

(43p) 데이터를 좀 더 쉽게 사용할 수 있도록 새로운 변수나 요약값을 만들어야 할 수도 있고, 아니면 변수 이름을 변경하거나 관측값들을 재정렬해야 되는 경우가 종종 있습니다. 이 장에서는 dplyr 패키지와 2013년 뉴욕시 출발 항공편에 대한 새로운 데이터셋을 이용하여 데이터 변형 방법을 배울 것입니다.

3.1.1 준비하기

3장에서는 아래와 같은 패키지를 사용하기 때문에 사전 준비가 필요합니다. 앞서 1장에서 배웠듯이 패키지는 한 번만 설치하면 되지만, 새로운 세션을 시작할 때마다 다시 로드해야 합니다.

install.packages("nycflights13")
install.packages("tidyverse")
library(nycflights13)
library(tidyverse)

3.1.2 nycflights13

dplyr의 기본 데이터 작업(manipulation-머니펼레이션) 동사를 탐색하기 위해서 nycflights13::flights를 사용할 것입니다. 이 데이터프레임에는 뉴욕시에서 2013년에 출발한 모든 항공편이 포함되어 있습니다.

flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

열 이름(year, month 등) 아래의 줄임말 행은 각 변수의 유형을 설명합니다.

int 정수
dbl 더블형, 또는 실수
chr 문자형 벡터, 혹은 문자열
dttm 데이트-타임형(날짜 + 시간)

3.1.3 dplyr 기초

이 장에서 대부분의 데이터 작업 문제를 풀 수 있는 다섯 개의 핵심 dplyr 함수들을 배울 것입니다.

fliter() 값을 기준으로 선택하라
arrange() 행을 재정렬하라
select() 이름으로 변수를 선택하라
mutate() 기존 변수들의 함수로 새로운 변수를 생성하라
summarize() 많은 값을 하나의 요약값으로 합쳐라

이것들은 모두 group_by()와 함께 사용할 수 있는데, 이는 전체 데이터셋에 동작하지 않고 그룹마다 동작하도록 각 함수의 범위를 변경합니다. 이 여섯 함수가 데이터 작업 언어에서 동사가 됩니다. 이 속성들을 함께 이용하면 여러 단순한 단계를 쉽게 연결하여 복잡한 결과를 얻을 수 있습니다.

첫 인수는 데이터프레임입니다.
그 이후의 인수들은 (따옴표가 없는) 변수 이름을 사용하여 데이터프레임에 무엇을 할지를 설명합니다.
결과는 새로운 데이터프레임입니다.

3.2 filter()로 행 필터링하기

(45p)

filter()를 이용하면 값을 기준으로 데이터를 서브셋 할 수 있습니다. 첫 번째 인수는 데이터프레임 이름이고, 두 번째 이후의 인수들은 데이터프레임을 필터링하는 표현식들입니다. 예를 들어 1월 1일 항공편 모두를 다음과 같이 선택할 수 있습니다.

filter(flights, month == 1, day == 1)

## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

dplyr 함수들은 입력을 절대 수정하지 않기 때문에, 결과를 저장하려면 할당 연산자 <-를 사용해야 합니다.

jan1 <- filter(flights, month == 1, day == 1)

R은 결과를 출력하거나 변수에 저장합니다. 둘 다 수행되게 하려면 할당문을 괄호로 묶으면 됩니다.

(dec25 <- filter(flights, month == 12, day == 25))

## # A tibble: 719 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    25      456            500        -4      649            651
##  2  2013    12    25      524            515         9      805            814
##  3  2013    12    25      542            540         2      832            850
##  4  2013    12    25      546            550        -4     1022           1027
##  5  2013    12    25      556            600        -4      730            745
##  6  2013    12    25      557            600        -3      743            752
##  7  2013    12    25      557            600        -3      818            831
##  8  2013    12    25      559            600        -1      855            856
##  9  2013    12    25      559            600        -1      849            855
## 10  2013    12    25      600            600         0      850            846
## # ... with 709 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

3.2.1 비교 연산

필터링을 효과적으로 사용하려면 비교 연산자를 사용하여 원하는 관측값을 선택하는 방법을 알아야 합니다. R에는 다음과 같은 표준연산자군이 있습니다.

> 크다
>= 크거나 같다
< 작다
<= 작거나 같다
!= 같지 않다
== 같다

R을 배우기 시작할 때 가장 범하기 쉬운 실수는, 같음을 테스트할 때 == 대신 =를 사용하는 것입니다. 이런 실수를 하면 오류가 발생하면서 해당 내용을 알려줍니다.

컴퓨터는 무한대 수를 저장할 수 없어서 눈앞에 보이는 숫자는 근사값이라는 것을 기억해야 합니다. 따라서 숫자를 비교할 때 == 대신, near()를 사용해야 합니다.

near(sqrt(2) ^ 2, 2)

## [1] TRUE

near(1 / 49 * 49, 1)

## [1] TRUE

3.2.2 shsfl dustkswk

filter()의 인수들은 ‘and’로 결합됩니다. 즉, 모든 표현식이 참이어야 행이 출력에 포함됩니다. 다른 유형의 조합을 만들려면 직접 불(Boolean) 연산자를 사용해야 합니다. &는 ’and’, |는 ‘or’, !는 ’not’입니다. 다음의 그림은 불 연산자 전체 집합을 보여줍니다. transform-logical 다음 코드는 11월이나 12월에 출발한 항공편 모두를 찾습니다.

filter(flights, month == 11 | month == 12)

연산 순서는 영어에서의 순서와 다릅니다. filter(flights, month == (11 | 12))로 쓰면 직역으로 ’finds all flights that departed in November or December’로 번역되겠지만 이렇게 쓰면 안 됩니다. 앞서 설명했듯이 R은 11월이나 12월에 출발한 항공편 대신 11 | 12(이 표현식은 TRUE가 됨와 같은 달을 모두 찾습니다. 수치형 문맥에서 TRUE는 1이 되므로 이는 1월의 모든 항공편을 찾습니다. 이 문제에 유용한 팁은 x %in% y입니다. 이는 x가 y에 있는 값 중 하나인 행을 모두 선택합니다.

nov_dec <- filter(flights, month %in% c(11, 12))

드 모르간 법칙을 이용하여 복잡한 서브셋 동작을 단순화할 수도 있습니다. 앞서 설명했듯이 둘 다 수행되게 하려면 할당문을 괄호로 묶으면 됩니다. 예를 들어 (출발 혹은 도착에서) 두 시간 이상 지연되지 않은 항공편을 모두 찾고 싶다면 다음의 두 필터 중 하나를 사용해도 됩니다.

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

filter() 안의 표현식이 복잡하고 여러 개 나열되기 시작하면, 항상 이들을 명시적 변수들로 만드는 것을 고려해야 합니다. 이렇게 하면 작업을 확인하기 훨씬 쉬워집니다.

3.2.3 결측값

R에서 비교를 까다롭게 만드는 중요한 특징은 결측값, 즉 NA(not availiable, 이용불가)입니다. NA는 모르는 값을 나타내므로, 모르는 값이 연관된 연산의 결과도 대부분 모르는 값이 됩니다.

# x를 메리의 나이라고 하자, 우리는 그녀가 몇 살인지 모른다.
x <- NA
# y를 존의 나이라고 하자. 우리는 그가 몇 살인지 모른다.
y <- NA
# 존과 메리는 같은 나이인가?
x == y
[1] NA
# 우린 모른다!

값이 결측인지를 확인하고 싶으면 is.na()를 사용합니다.

x <- NA
is.na(x)

## [1] TRUE

filter()는 조건이 TURE인 열만 포함됩니다. FALSE와 NA 값들은 제외합니다. 결측값들은 남기려면 명시적으로 요청해야 합니다.

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)

## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3

filter(df, is.na(x) | x > 1)

## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3

3.3 arrange()로 행 정렬하기

arrange()는 행을 선택하는 대신, 순서를 바꾼다는 것만 제외하고는 filter()와 유사하게 작동합니다. 데이터프레임과 정렬기준으로 지정할 열 이름 집합(혹은 복잡한 표현식)을 입력으로 합니다. 하나 이상의 열 이름을 제공하면 각 열은 이전 열의 동점값(tie) 상황을 해결하는 데 사용됩니다.

arrange(flights, year, month, day)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

내림차순으로 열을 재정렬하려면 desc()를 사용합니다.

arrange(flights, desc(dep_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

결측값은 항상 마지막에 정렬됩니다.

df <- tibble(x = c(5, 2, NA))
arrange(df, x)

## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3    NA

arrange(df, desc(x))

## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA

3.4 select()로 열 선택하기

변수가 매우 많은 데이터셋을 만날 결우 첫 과제는 실제로 관심 있는 변수들로 좁히는 것입니다. select()와 변수 이름에 기반한 연산들을 이용하면 유용한 서브셋으로 신속하게 줌인(zoom in)해 볼 수 있습니다.

# 이름으로 열 선택
select(flights, year, month, day)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

# year과 day 사이의(경계 포함) 열 모두 선택
select(flights, year:day)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

# year에서 day까지의(경계 포함) 열들을 제외한 열 모두 선택
select(flights, -(year:day))

## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # ... with 336,766 more rows, and 9 more variables: flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

select() 안에서 사용할 수 있는 도우미 함수들

starts_with("abc") ’abc로 시작하는 이름에 매칭
ends_with("xyz") ’xyz’로 끝나는 이름에 매칭
contains("ijk") ’ijk’를 포함하는 이름에 매칭
matches("(.)\\1") 정규표현식에 매칭되는 변수들을 선택. 이 표현식은 반복되는 문자를 포함하는 변수에 매칭됩니다.
num_range("x", 1:3) x1, x2, x3에 매칭

변수명 변경에 select()를 이용할 수 있지만, 명시적으로 언급하지 않은 모든 변수를 누락하기 때문에 유용하지 않습니다. 대신 select()의 변형인 rename()을 사용하면 명시적으로 언급하지 않은 모든 변수를 유지합니다.

rename(flights, tail_num = tailnum)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

다른 방법은 select()를 도우미 함수인 everything()과 함께 사용하는 것입니다. 몇 개의 변수를 데이터프레임의 시작 부분으로 옮기고 싶을 때 유용합니다.

select(flights, time_hour, air_time, everything())

## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

3.5 mutate()로 새로운 변수 추가하기

mutate()는 기존 열들의 함수인 새로운 열을 추가하는 일을 합니다.

# `mutate()`는 새로운 열을 항상 데이터셋 마지막에 추가하기 때문에, 새로운 변수를 보기 편하게 우선 더 좁은 데이터셋을 생성하겠습니다.
flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)

## # A tibble: 336,776 x 9
##     year month   day dep_delay arr_delay distance air_time  gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227    -9  370.
##  2  2013     1     1         4        20     1416      227   -16  374.
##  3  2013     1     1         2        33     1089      160   -31  408.
##  4  2013     1     1        -1       -18     1576      183    17  517.
##  5  2013     1     1        -6       -25      762      116    19  394.
##  6  2013     1     1        -4        12      719      150   -16  288.
##  7  2013     1     1        -5        19     1065      158   -24  404.
##  8  2013     1     1        -3       -14      229       53    11  259.
##  9  2013     1     1        -3        -8      944      140     5  405.
## 10  2013     1     1        -2         8      733      138   -10  319.
## # ... with 336,766 more rows

# 방금 생성한 열을 참조할 수 있습니다.
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

## # A tibble: 336,776 x 10
##     year month   day dep_delay arr_delay distance air_time  gain hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227    -9 3.78 
##  2  2013     1     1         4        20     1416      227   -16 3.78 
##  3  2013     1     1         2        33     1089      160   -31 2.67 
##  4  2013     1     1        -1       -18     1576      183    17 3.05 
##  5  2013     1     1        -6       -25      762      116    19 1.93 
##  6  2013     1     1        -4        12      719      150   -16 2.5  
##  7  2013     1     1        -5        19     1065      158   -24 2.63 
##  8  2013     1     1        -3       -14      229       53    11 0.883
##  9  2013     1     1        -3        -8      944      140     5 2.33 
## 10  2013     1     1        -2         8      733      138   -10 2.3  
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

# 새 변수만을 남기고 싶다면 transmute()를 사용합니다.
transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

## # A tibble: 336,776 x 3
##     gain hours gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1    -9 3.78          -2.38
##  2   -16 3.78          -4.23
##  3   -31 2.67         -11.6 
##  4    17 3.05           5.57
##  5    19 1.93           9.83
##  6   -16 2.5           -6.4 
##  7   -24 2.63          -9.11
##  8    11 0.883         12.5 
##  9     5 2.33           2.14
## 10   -10 2.3           -4.35
## # ... with 336,766 more rows

3.5.1 유용한 생성 함수

mutate()와 사용할 수 잇는 변수 생성 함수가 많습니다. 이 함수들이 벡터화되어야 한다는 것이 핵심입니다. 즉, 벡터를 입력으로 하여 같은 개수의 값을 가진 벡터를 출력해야 합니다.

산술 연산자 +, -, *, /, ^

’재활용 규칙’을 이용하여 이들은 모두 벡터화됩니다. 인수 하나가 단일 숫자인 경우에 가장 유용합니다. air_time / 60, hours * 60 + minute 등.

모듈러 연산 %/%, %%

%/%(정수 나누기), %%(나머지). 모듈러 연산은 정수를 조각으로 분해할 수 있기 때문에 편리한 도구입니다.

# 예를 들어 항공편 데이터셋의 dep_time으로부터 hour와 minute을 다음과 같이 계산할 수 있습니다.
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)

## # A tibble: 336,776 x 3
##    dep_time  hour minute
##       <int> <dbl>  <dbl>
##  1      517     5     17
##  2      533     5     33
##  3      542     5     42
##  4      544     5     44
##  5      554     5     54
##  6      554     5     54
##  7      555     5     55
##  8      557     5     57
##  9      557     5     57
## 10      558     5     58
## # ... with 336,766 more rows

로그 log(), log2(), log10()

로그는 여러 차수를 넘나드는 데이터를 처리하는 데 매우 유용한 변환입니다.

오프셋

lead()와 lag()를 사용하면 벡터를 앞으로 당기거나(leading), 뒤로 미는(lag-ging) 것을 참조할 수 있습니다. 또 연속된 차이값(differences)을 계산하거나 값들이 변경된 곳을 찾는 데 사용할 수 있습니다. group_by()와 함께 사용할 때 가장 유용합니다.

(x <- 1:10)

##  [1]  1  2  3  4  5  6  7  8  9 10

lag(x)

##  [1] NA  1  2  3  4  5  6  7  8  9

lead(x)

##  [1]  2  3  4  5  6  7  8  9 10 NA

누적 및 롤링 집계

R에는 연속하는 합계 cumsum(), 곱셈 cumprod(), 최솟값 cummin(), 최댓값 cummax(), 함수가 있습니다. dplyr에는 누적평균을 구하는 cummean()이 있습니다.

(x <- 1:10)

##  [1]  1  2  3  4  5  6  7  8  9 10

cumsum(x)

##  [1]  1  3  6 10 15 21 28 36 45 55

cummean(x)

##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

논리형 비교 연산자 <, <=, >, >=, !=

복잡한 일련의 논리형 연산을 수행한다면 새 변수에 중간 값들을 저장하여 각 단계가 예상대로 작동하는지 확인하는 것이 좋습니다.

랭킹

min_rank()는 가장 평범한 유형의 랭킹을 수행합니다.(예: 첫 번째, 두 번재, 세 번째).

y <- c(1, 2, 2, NA, 3, 4)
# 기본값에서 가장 작은 값이 가장 낮은 순서가 됩니다.
min_rank(y)

## [1]  1  2  2 NA  4  5

# 가장 큰 값을 가장 낮은 순서로 만들고 싶다면 desc()를 사용합니다.
min_rank(desc(y))

## [1]  5  3  3 NA  2  1

3.6 summarize()로 그룹화 요약하기

summarize()는 데이터프레임을 하나의 행으로 축약합니다.

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

group_by()는 분석의 단위를 전체 데이터셋에서 개별 그룹으로 변경시킵니다. 이후 dplyr 동사를 그룹화된 데이터프레임에 사용하면 이 동사가 ‘그룹마다(by group)’ 적용됩니다.

# 날짜로 그룹화된 데이터프레임에 정확히 같은 코드를 적용하면 날짜별 평균 지연시간이 나옵니다.
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

group_by()와 summarize()를 조합하면 dplyr로 작업할 때 가장 빈번히 사용할 도구들 중 하나인 그룹화 요약이 됩니다.

3.6.1 파이프로 여러 작업 결합하기

# 각 위치에 대해 거리와 평균 지연 사이에 관계를 탐색하고 싶다면 다음과 같이 코드를 작성할 것입니다.
# 세 단계로 이 데이터를 전처리합니다.
# 1. 목적지별로 항공편을 그룹화.
# 2. 거리, 평균 지연시간, 항공편 수를 계산하여 요약.
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)

## `summarise()` ungrouping output (override with `.groups` argument)

# 3. 잡음이 많은 점과 호놀룰루 공항(다음으로 먼 공항보다 거의 두 배가 먼 공항)을 제거하는 필터링.
delay <- filter(delay, count > 20, dest != "HNL")

# 지연시간은 거리에 따라 750마일까지는 증가하다가 감소하는 것 같습니다.
# 항로가 길수록 비행 중에 지연시간을 만회할 여력이 더 있는 것인가? 
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

위 코드는 크게 상관없는 중간 데이터프레임들에 이름을 모두 지어 주어야 하기 때문에, 이름 짓는 것이 쉽지 않아서 분석 속도가 늦어집니다. 이 문제를 파이프, %>%로 해결하는 방법이 있습니다. 이 방법은 변형 자체에 초점을 맞춰서, 코드를 더 읽기 쉽게 만듭니다. 여기에서 제안된 것처럼 코드를 읽을 때 %>%를 ’그다음’으로 읽는 것이 좋습니다.

# 그룹화하고, 그다음 요약하고, 그다음 필터링하라.
delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

## `summarise()` ungrouping output (override with `.groups` argument)

파이프를 사용하여 다중 작업을 왼쪽에서 오른쪽으로, 위에서 아래로 읽을 수 있게 다시 쓸 수 있습니다. 파이프를 사용하면 코드 가독성이 훨씬 좋아집니다. 파이프로 작업하는 것은 tidyverse에 속하기 위한 핵심 기준 중 하나입니다.

3.6.2 결측값

# na.rm 인수를 설정하지 않으면 결측값이 많이 생깁니다.
flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # ... with 355 more rows

집계 함수는 결측값에 관한 일반적인 법칙(즉, 입력에 결측값이 있으면 출력도 결측값이 된다)을 따르기 때문입니다. 모든 집계 함수에는 na.rm 인수가 있어서 계산 전에 결측값들을 제거할 수 있습니다.

flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

이 경우에서 결측값은 취소된 항공편을 나타내므로, 취소된 항공편을 제거하여 문제를 해결할 수 있습니다.

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.4 
##  2  2013     1     2 13.7 
##  3  2013     1     3 10.9 
##  4  2013     1     4  8.97
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.56
##  9  2013     1     9  2.30
## 10  2013     1    10  2.84
## # ... with 355 more rows

3.6.3 카운트

집계를 수행할 때마다 카운트 n() 혹은, 결측이 아닌 값의 카운트 sum(!is.na(x))를 포함하는 것이 좋습니다. 이렇게 하면 매우 적은 양의 데이터를 기반으로 결론을 도출하지 않는지 확인할 수 있습니다.

# 예를 들어 평균 지연시간이 가장 긴 항공기(꼬리 번호(tail number)로 식별)로 봅시다. 어떤 항공기들은 평균 5시간(300분)이 지연된 걸 알 수 있습니다.
delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay)
  )

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)

# 비행 횟수 대 평균 지연시간의 산점도를 그리면 더 많은 통찰력을 얻을 수 있습니다.
delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10)

당연히 비행이 적을 때 평균지연시간에 변동이 훨씬 더 큽니다. 평균(혹은 다른 요약값) 대 그룹 크기의 플롯을 그리면 표본 크기가 커짐에 따라 변동이 줄어드는 것을 볼 수 있습니다. 이런 종류의 플롯을 살펴볼 때는, 관측값 개수가 가장 적은 그룹을 필터링하는 것이 좋은 경우가 많습니다. 심한 변동이 아닌 패턴이 더 잘 보이기 때문입니다.

# 이를 수행하는 다음 코드는 ggplot2를 dplyr 플로에 통합하는 편리한 패턴도 보여줍니다.
delays %>% 
  filter(n > 25) %>% 
  ggplot(mapping = aes(x = n, y = delay)) + 
    geom_point(alpha = 1/10)

야구에서 타자의 평균 능력치가 타석 수와 어떻게 관련되었는지 살펴봅시다. 여기에서 Lahman 패키지 데이터를 사용하여 메이저리그의 모든 야구 선수의 타율(안타수/유호타석수)을 계산합니다. 타자의 기술(타율, ba로 측정)을 안타 기회 횟수에 대해 플롯을 그리면 두 가지 패턴이 보입니다. - 앞에서와 같이 집계값의 변동량을 데이터 포인트가 많아짐에 따라 감소합니다. - 기술 수준(ba)과 볼을 칠 기회(ab) 사이에 양의 상관관계가 있습니다. 팀이 누구를 타석에 내보낼지 선택할 때 당연히 최고의 선수를 선택할 것이기 때문입니다.

# 보기 좋게 화면 출력되도록 티블로 변형
batting <- as_tibble(Lahman::Batting)

batters <- batting %>% 
  group_by(playerID) %>% 
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    ab = sum(AB, na.rm = TRUE)
  )

## `summarise()` ungrouping output (override with `.groups` argument)

batters %>% 
  filter(ab > 100) %>% 
  ggplot(mapping = aes(x = ab, y = ba)) +
    geom_point() + 
    geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

이 사실은 순위에 중요한 영향을 줍니다. 단순히 desc(ba)로 정렬하면 평균 타율이 가장 높은 선수는 능력치가 좋은 것이 아니라 단순히 운이 좋은 선수들입니다.

batters %>% 
  arrange(desc(ba))

## # A tibble: 19,689 x 3
##    playerID     ba    ab
##    <chr>     <dbl> <int>
##  1 abramge01     1     1
##  2 alanirj01     1     1
##  3 alberan01     1     1
##  4 banisje01     1     1
##  5 bartocl01     1     1
##  6 bassdo01      1     1
##  7 birasst01     1     2
##  8 bruneju01     1     1
##  9 burnscb01     1     1
## 10 cammaer01     1     1
## # ... with 19,679 more rows

3.6.4 유용한 요약 함수

위치 측정값 median(x)

평균 mean(x)은 총합 나누기 길이이고, 중앙값 median(x)은 x의 50%가 위에 위치하고 50%는 아래에 위치하게 되는 값입니다.

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    avg_delay1 = mean(arr_delay),
    # 지연시간의 평균:
    avg_delay2 = mean(arr_delay[arr_delay > 0])
  )

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # ... with 355 more rows

산포 측정값 sd(x), IQR(x), mad(x)

sd(x) 평균제곱편차, 혹은 표준편차, 산포의 표준 측정값
IQR() 사분위범위
mad(x) 중위절대편차, 이상값이 있을 때 더 유용할 수 있는 로버스트한 대체값

# 왜 어떤 목적지는 그곳까지의 거리가 다른 곳보다 더 변동성이 있는가?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(distance_sd = sd(distance)) %>% 
  arrange(desc(distance_sd))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 104 x 2
##    dest  distance_sd
##    <chr>       <dbl>
##  1 EGE         10.5 
##  2 SAN         10.4 
##  3 SFO         10.2 
##  4 HNL         10.0 
##  5 SEA          9.98
##  6 LAS          9.91
##  7 PDX          9.87
##  8 PHX          9.86
##  9 LAX          9.66
## 10 IND          9.46
## # ... with 94 more rows

순위 측정값 min(x), quntile(x, 0.25), max(x)

분위수는 중앙값의 일반화입니다. 예를 들어 quantile(x, 0.25)는 25%보다는 크고 나머지 75%보다는 작은 값을 찾습니다.

# 각 날짜의 처음과 마지막 항공편은 언제 출발하는가?
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # ... with 355 more rows

자리(position) 측정값 first(x), nth(x, 2), last(x)

x[1], x[2], x[leagth(x)]와 유사하게 동작하지만 자리가 존재하지 않을 때(예를 들어 두 개의 요소만 있는 그룹에서 세 번째 요소를 접근하려고 할 때) 기본값으로 설정할 수 있습니다. 예를 들어 각 날짜에 처음과 마지막 출발을 찾을 수 있습니다.

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first_dep = first(dep_time), 
    last_dep = last(dep_time)
  )

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # ... with 355 more rows

이 함수들은 순위로 필터링하는 데 사용할 수 있습니다. 필터링하면 모든 변수를 얻을 수 있는데, 각 관측값을 별도의 행으로 얻을 수 있습니다.

not_cancelled %>% 
  group_by(year, month, day) %>% 
  mutate(r = min_rank(desc(dep_time))) %>% 
  filter(r %in% range(r))

## # A tibble: 770 x 20
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1     2356           2359        -3      425            437
##  3  2013     1     2       42           2359        43      518            442
##  4  2013     1     2     2354           2359        -5      413            437
##  5  2013     1     3       32           2359        33      504            442
##  6  2013     1     3     2349           2359       -10      434            445
##  7  2013     1     4       25           2359        26      505            442
##  8  2013     1     4     2358           2359        -1      429            437
##  9  2013     1     4     2358           2359        -1      436            445
## 10  2013     1     5       14           2359        15      503            445
## # ... with 760 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   r <int>

카운트

n() 인수가 없고 현재 그룹의 크기를 반환
sum(!is.na(x)) 결측이 아닌 값의 수를 카운트
n_distinct(x) 유일값 개수를 카운트

# 어느 목적지에 항공사가 가장 많은가?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 104 x 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ... with 94 more rows

카운트는 유용하기 때문에 dplyr에는 단순히 카운트만 원할 경우 사용할 수 있는 단순한 도우미 함수도 있습니다.

not_cancelled %>% 
  count(dest)

## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows

가중치 변수를 선택적으로 지정할 수도 있습니다. 예를 들어 이를 사용하여 항공기가 비행한 마일 수를 ’카운트(합)’할 수 있습니다.

not_cancelled %>% 
  count(tailnum, wt = distance)

## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows

논리형 값의 카운트와 비율 sum(x > 10), mean(y == 0)

수치형 함수를 사용할 경우 TRUE는 1로 FALSE는 0으로 바뀝니다. sum(x)는 TRUE의 개수를, mean(x)은 비율을 제공합니다.

# 아침 5시 이전 항공편은 몇 편인가?
# (보통 전날 지연된 경우)
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(n_early = sum(dep_time < 500))

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # ... with 355 more rows

# 한 시간 이상 지연된 항공편의 비율은?
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(hour_prop = mean(arr_delay > 60))

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day hour_prop
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    0.0722
##  2  2013     1     2    0.0851
##  3  2013     1     3    0.0567
##  4  2013     1     4    0.0396
##  5  2013     1     5    0.0349
##  6  2013     1     6    0.0470
##  7  2013     1     7    0.0333
##  8  2013     1     8    0.0213
##  9  2013     1     9    0.0202
## 10  2013     1    10    0.0183
## # ... with 355 more rows

3.6.5 여러 변수로 그룹화

여러 변수로 그룹화하면 각 요약값은 그룹화의 한 수준씩 벗겨냅니다. 이를 이용하면 데이터셋을 점진적으로 쉽게 요약할 수 있습니다.

daily <- group_by(flights, year, month, day)
(per_day   <- summarise(daily, flights = n()))

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day flights
##    <int> <int> <int>   <int>
##  1  2013     1     1     842
##  2  2013     1     2     943
##  3  2013     1     3     914
##  4  2013     1     4     915
##  5  2013     1     5     720
##  6  2013     1     6     832
##  7  2013     1     7     933
##  8  2013     1     8     899
##  9  2013     1     9     902
## 10  2013     1    10     932
## # ... with 355 more rows

3.6.6 그룹화 해제

그룹화를 제거하고 그룹화되지 않은 데이터 작업으로 돌아가려면 ungroup()을 사용합니다.

daily %>% 
  ungroup() %>%             # date 기반 그룹화 해제
  summarise(flights = n())  # 모든 항공편

## # A tibble: 1 x 1
##   flights
##     <int>
## 1  336776

3.7 그룹화 뮤테이트(와 필터링)

그룹화는 summarize()와 조합하여 사용하면 가장 유용하지만 mutate()와 filter()로 편리한 작업을 할 수도 있습니다.

# 각 그룹에서 최악의 멤버들을 찾아봅시다.
flights_sml %>% 
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)

## # A tibble: 3,306 x 7
## # Groups:   year, month, day [365]
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1       853       851      184       41
##  2  2013     1     1       290       338     1134      213
##  3  2013     1     1       260       263      266       46
##  4  2013     1     1       157       174      213       60
##  5  2013     1     1       216       222      708      121
##  6  2013     1     1       255       250      589      115
##  7  2013     1     1       285       246     1085      146
##  8  2013     1     1       192       191      199       44
##  9  2013     1     1       379       456     1092      222
## 10  2013     1     2       224       207      550       94
## # ... with 3,296 more rows

# 기준값보다 큰 그룹을 모두 찾아봅시다.
popular_dests <- flights %>% 
  group_by(dest) %>% 
  filter(n() > 365)
popular_dests

## # A tibble: 332,577 x 19
## # Groups:   dest [77]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 332,567 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# 그룹별 척도를 위해 표준화해봅시다.
popular_dests %>% 
  filter(arr_delay > 0) %>% 
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
  select(year:day, dest, arr_delay, prop_delay)

## # A tibble: 131,106 x 6
## # Groups:   dest [77]
##     year month   day dest  arr_delay prop_delay
##    <int> <int> <int> <chr>     <dbl>      <dbl>
##  1  2013     1     1 IAH          11  0.000111 
##  2  2013     1     1 IAH          20  0.000201 
##  3  2013     1     1 MIA          33  0.000235 
##  4  2013     1     1 ORD          12  0.0000424
##  5  2013     1     1 FLL          19  0.0000938
##  6  2013     1     1 ORD           8  0.0000283
##  7  2013     1     1 LAX           7  0.0000344
##  8  2013     1     1 DFW          31  0.000282 
##  9  2013     1     1 ATL          12  0.0000400
## 10  2013     1     1 DTW          16  0.000116 
## # ... with 131,096 more rows

그룹화 필터링은 그룹화 뮤테이트 이후 그룹화하지 않은 필터링이다. 저자는 데이터 응급 작업의 경우가 아니면 일반적으로 이를 사용하지 않는다고 합니다. 작업을 올바르게 했는지 확인하기 어렵기 때문입니다.